The Websites Used to Train AI Identified by The Washington Post

June 29, 2023

IBL News | New York

Chatbots mimic human speech because the AI that powers them has ingested a huge amount of text, mostly scraped from the Internet. If they ace the bar exam, it is because their training data included thousands of practice sites.

The Washington Post analyzed the websites used to train AI, even though companies like OpenAI did not disclose which data sets they used.

The newspaper worked with researchers at the Allen Institute for AI and, using data from the analytics firm Similarweb, categorized the websites into a tree map of 11 categories.

It began by looking inside Google’s C4 data set, which includes 15 million websites from journalism, entertainment, software development, medicine, and content creation, among other industries. Facebook’s LLaMA also used it.

The three biggest sites were patents.google.com (which contains text from patents issued around the world), wikipedia.org, and scribd.com (a subscription-only digital library). Also on the list was b-ok.org, a notorious market for pirated e-books, along with 27 other sites identified by the U.S. government as markets for piracy and counterfeits.

Top Business & Industrial sites: fool.com, kickstarter.com, sec.gov, marketwired.com, city-data.com, patreon.com, myemail.constantcontact.com, finance.yahoo.com, prweb.com, entrepreneur.com, and globalresearch.ca, among others.

Top News sites: nytimes.com, latimes.com, theguardian.com, forbes.com, huffpost.com, washingtonpost.com, businessinsider.com, chicagotribune.com, theatlantic.com, aljazeera.com, RT.com (the Russian state-backed propaganda site), breitbart.com, and vdare.com (anti-immigration), among others.

Top Religious sites: patheos.com, gty.org, jewishworldreview.com, thekingdomcollective.com, biblehub.com, liveprayer.com, lds.org, wacriswell.com, wdtprs.com, bibleforums.org, etc.

Top Technology sites: instructables.com, ipfs.io, docs.microsoft.com, forums.macrumors.com, medium.com, makeuseof.com, sites.google.com, slideshare.net, s3.amazonaws.com, pcworld.com, WordPress, Tumblr, Blogspot, LiveJournal, etc.

Data sets used to train AI couldn’t access social networks like Facebook and Twitter, which prohibit scraping.

@kevinschaul and @dataviz_szuyu did all the hard work and built this great search tool for sites. A bunch of us already found their old personal blogs. Hope you’ll find the rankings as fascinating as I did https://t.co/xckLl15ZaS pic.twitter.com/7Q7zmzDC6w

— Nitasha Tiku @nitashatiku@mastodon.social (@nitashatiku) April 19, 2023

• Search Engine Land: Search the 15.7 million websites in Google’s C4 dataset
