<https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/>


AI chatbots have exploded in popularity over the past four months, stunning the 
public with their awesome abilities, from writing sophisticated term papers to 
holding unnervingly lucid conversations.

Chatbots cannot think like humans: They do not actually understand what they 
say. They can mimic human speech because the artificial intelligence that 
powers them has ingested a gargantuan amount of text, mostly scraped from the 
internet.

[Big Tech was moving cautiously on AI. Then came ChatGPT.]

This text is the AI’s main source of information about the world as it is being 
built, and it influences how it responds to users. If it aces the bar exam, for 
example, it’s probably because its training data included thousands of bar exam 
practice sites.

Tech companies have grown secretive about what they feed the AI. So The 
Washington Post set out to analyze one of these data sets to fully reveal the 
types of proprietary, personal, and often offensive websites that go into an 
AI’s training data.

To look inside this black box, we analyzed Google’s C4 data set, a massive 
snapshot of the contents of 15 million websites that have been used to instruct 
some high-profile English-language AIs, called large language models, including 
Google’s T5 and Facebook’s LLaMA.

The Post worked with researchers at the Allen Institute for AI on this 
investigation and categorized the websites using data from SimilarWeb, a web 
analytics company. About a third of the websites could not be categorized, 
mostly because they no longer appear on the internet. Those are not shown.


We then ranked the remaining 10 million websites by how many “tokens” from each 
appeared in the data set. Tokens are small bits of text used to process 
disorganized information, typically a word or phrase.
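
As a rough, hedged illustration of that ranking step: the sketch below uses a 
simple word-and-punctuation tokenizer and invented sample pages. Real models 
such as T5 use subword tokenizers (e.g., SentencePiece), so these counts would 
not match the figures in this analysis.

    import re
    from collections import Counter
    from urllib.parse import urlparse

    # Illustrative only: real models tokenize into subwords, so counts differ.
    def count_tokens(text: str) -> int:
        # Treat each word or standalone punctuation mark as one token.
        return len(re.findall(r"\w+|[^\w\s]", text))

    def rank_domains(pages):
        """pages: iterable of (url, text) pairs from a web crawl."""
        tokens_per_domain = Counter()
        for url, text in pages:
            tokens_per_domain[urlparse(url).netloc] += count_tokens(text)
        return tokens_per_domain.most_common()

    sample = [
        ("https://en.wikipedia.org/wiki/Patent", "A patent is a type of ..."),
        ("https://patents.google.com/patent/US1", "Claims: 1. A method ..."),
    ]
    for domain, n in rank_domains(sample):
        print(domain, n)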

Wikipedia to Wowhead

The data set was dominated by websites from industries including journalism, 
entertainment, software development, medicine and content creation, helping to 
explain why these fields may be threatened by the new wave of artificial 
intelligence. The three biggest sites were patents.google.com No. 1, which 
contains text from patents issued around the world; wikipedia.org No. 2, the 
free online encyclopedia; and scribd.com No. 3, a subscription-only digital 
library. Also high on the list: b-ok.org No. 190, a notorious market for 
pirated e-books that has since been seized by the U.S. Justice Department. At 
least 27 other sites identified by the U.S. government as markets for piracy 
and counterfeits were present in the data set.

Some top sites seemed arbitrary, like wowhead.com No. 181, a World of Warcraft 
player forum; thriveglobal.com No. 175, a wellness company founded by Arianna 
Huffington that sells products for beating burnout; and at least 10 sites that 
sell dumpsters, including dumpsteroid.com No. 183, which no longer appear 
accessible.


Others raised significant privacy concerns. Two sites in the top 100, 
coloradovoters.info No. 40 and flvoters.com No. 73, had privately hosted copies 
of state voter registration databases. Though voter data is public, the models 
could use this personal information in unknown ways.


Content without consent

Top Business & Industrial sites:

fool.com

kickstarter.com

sec.gov

marketwired.com

city-data.com

myemail.constantcontact.com

finance.yahoo.com

prweb.com

entrepreneur.com

globalresearch.ca

Business and industrial websites made up the biggest category (16 percent of 
categorized tokens), led by fool.com No. 13, which provides investment advice. 
Not far behind were kickstarter.com No. 25, which lets users crowdfund for 
creative projects, and further down the list, patreon.com No. 2,398, which 
helps creators collect monthly fees from subscribers for exclusive content.

Kickstarter and Patreon may give the AI access to artists’ ideas and marketing 
copy, raising concerns the technology may copy this work in suggestions to 
users. Currently, artists receive no compensation or credit when their work is 
included in AI training data, and they have lodged copyright infringement 
claims against text-to-image generators Stable Diffusion, Midjourney and 
DeviantArt.

The Post’s analysis suggests more legal challenges may be on the way: The 
copyright symbol, which signals a claim to a work as intellectual property, 
appears more than 200 million times in the C4 data set.
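
Tallying a single symbol across a corpus is mechanically simple. A minimal 
sketch, where "corpus" stands in for an iterable over C4’s page texts rather 
than the real 15-million-site snapshot:

    # Count one character across an iterable of page texts.
    def count_symbol(corpus, symbol="\u00a9"):  # U+00A9 is the copyright sign
        return sum(text.count(symbol) for text in corpus)

    corpus = [
        "\u00a9 2019 Example Corp. All rights reserved.",
        "No copyright notice on this page.",
    ]
    print(count_symbol(corpus))  # -> 1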

All the news

Top News sites:

nytimes.com

latimes.com

theguardian.com

forbes.com

huffpost.com

washingtonpost.com

businessinsider.com

chicagotribune.com

theatlantic.com

aljazeera.com


The News and Media category ranks third overall. But half of the top 
10 sites overall were news outlets: nytimes.com No. 4, latimes.com No. 6, 
theguardian.com No. 7, forbes.com No. 8, and huffpost.com No. 9. 
(Washingtonpost.com No. 11 was close behind.) Like artists and creators, some 
news organizations have criticized tech companies for using their content 
without authorization or compensation.

Meanwhile, we found several media outlets that rank low on NewsGuard’s 
independent scale for trustworthiness: RT.com No. 65, the Russian state-backed 
propaganda site; breitbart.com No. 159, a well-known source for far-right news 
and opinion; and vdare.com No. 993, an anti-immigration site that has been 
associated with white supremacy.

Chatbots have been shown to confidently share incorrect information, but they 
don’t always offer citations. Untrustworthy training data could lead them to 
spread bias, propaganda and misinformation, without the user being able to 
trace it to the original source.


Religious sites reflect a Western perspective

Top Religious sites:

patheos.com

gty.org

jewishworldreview.com

thekingdomcollective.com

biblehub.com

liveprayer.com

lds.org

wacriswell.com

wdtprs.com

bibleforums.org


Sites devoted to community made up about 5 percent of categorized content, with 
religion dominating that category. Among the top 20 religious sites, 14 were 
Christian, two were Jewish, one was Muslim, one was Mormon, one was Jehovah’s 
Witness, and one celebrated all religions.

The top Christian site, Grace to You (gty.org No. 164), belongs to Grace 
Community Church, an evangelical megachurch in California. Christianity Today 
recently reported that the church counseled women to “continue to submit” to 
abusive fathers and husbands and to avoid reporting them to authorities.

The highest ranked Jewish site was jewishworldreview.com No. 366, an online 
magazine for Orthodox Jews. In December, it published an article about Hanukkah 
that blamed the rise of antisemitism in the United States on “the far-right, 
fundamentalist Islam,” as well as “an African-American community influenced by 
the Black Lives Matter movement.”

Anti-Muslim bias has emerged as a problem in some language models. For example, 
a study published in the journal Nature found that OpenAI’s GPT-3 completed the 
phrase “Two Muslims walked into a …” with violent actions 66 percent of the 
time.

A trove of personal blogs

Top Technology sites:

instructables.com

ipfs.io

docs.microsoft.com

forums.macrumors.com

medium.com

makeuseof.com

sites.google.com

slideshare.net

s3.amazonaws.com

pcworld.com

Technology is the second largest category, making up 15 percent of categorized 
tokens. This includes many platforms for building websites, like 
sites.google.com No. 85, which hosts pages for everything from a judo club in 
Reading, England, to a Catholic preschool in New Jersey.

The data set contained more than half a million personal blogs, representing 
3.8 percent of categorized tokens. Publishing platform medium.com No. 46 was 
the fifth largest technology site and hosts tens of thousands of blogs under 
its domain. Our tally includes blogs written on platforms like WordPress, 
Tumblr, Blogspot and LiveJournal.

These online diaries ranged from professional to personal, like a blog called 
“Grumpy Rumblings,” co-written by two anonymous academics, one of whom recently 
wrote about how their partner’s unemployment affected the couple’s taxes. One 
of the top blogs offered advice for live-action role-playing games. Another top 
site, Uprooted Palestinians, often writes about “Zionist terrorism” and “the 
Zionist ideology.”

Social networks like Facebook and Twitter — the heart of the modern web — 
prohibit scraping, which means most data sets used to train AI cannot access 
them. Tech giants like Facebook and Google that are sitting on mammoth troves 
of conversational data have not been clear about how personal user information 
may be used to train AI models that are used internally or sold as products.

What the filters missed

Like most companies, Google heavily filtered the data before feeding it to the 
AI. (C4 stands for Colossal Clean Crawled Corpus.) In addition to removing 
gibberish and duplicate text, the company used the open source “List of Dirty, 
Naughty, Obscene, and Otherwise Bad Words,” which includes 402 terms in English 
and one emoji (a hand making a common but obscene gesture). Companies typically 
use high-quality datasets to fine-tune models, shielding users from some 
unwanted content.

While this kind of blocklist is intended to limit a model’s exposure to racial 
slurs and obscenities as it’s being trained, it also has been shown to 
eliminate some nonsexual LGBTQ content. As prior research has shown, a lot gets 
past the filters. We found hundreds of examples of pornographic websites and 
more than 72,000 instances of “swastika,” one of the banned terms from the list.
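
The C4 cleaning step drops an entire page if any blocklisted word appears in 
it. The sketch below approximates that rule with a toy three-word blocklist; 
the real pipeline’s matching details differ, and matching only whole lowercase 
words is one reason variants, compounds and text in other scripts can slip 
through.

    import re

    # Toy stand-in for the "List of Dirty, Naughty, Obscene, and Otherwise
    # Bad Words" (the real list has 402 English terms plus one emoji).
    BLOCKLIST = {"swastika", "slur1", "slur2"}

    def passes_filter(text: str) -> bool:
        # Approximated C4 rule: reject the whole page if any blocklisted
        # term appears as a standalone lowercase word.
        words = set(re.findall(r"[a-z]+", text.lower()))
        return words.isdisjoint(BLOCKLIST)

    pages = [
        "A museum page on the history of the swastika symbol.",
        "A page about vegetable gardening.",
    ]
    print([p for p in pages if passes_filter(p)])  # only gardening survives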


Meanwhile, The Post found that the filters failed to remove some troubling 
content, including the white supremacist site stormfront.org No. 27,505, the 
anti-trans site kiwifarms.net No. 378,986, and 4chan.org No. 4,339,889, the 
anonymous message board known for organizing targeted harassment campaigns 
against individuals.

We also found threepercentpatriots.com No. 8,788,836, a downed site espousing 
an anti-government ideology shared by people charged in connection with the 
Jan. 6, 2021, attack on the U.S. Capitol. And sites promoting conspiracy 
theories, including the far-right QAnon phenomenon and “pizzagate,” the false 
claim that a D.C. pizza joint was a front for pedophiles, were also present.

Is your website training AI?

A web crawl may sound like a copy of the entire internet, but it’s just a 
snapshot, capturing content from a sampling of webpages at a particular moment 
in time. C4 began as a scrape performed in April 2019 by the nonprofit Common 
Crawl, a popular resource for AI models. Common Crawl told The Post that it 
tries to prioritize the most important and reputable sites, but does not try 
to avoid licensed or copyrighted content.
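
The Allen Institute’s re-creation of C4 is public, so anyone can peek inside 
the snapshot. A minimal sketch, assuming the Hugging Face "datasets" library 
and its "allenai/c4" mirror; streaming avoids downloading the 
multi-hundred-gigabyte corpus up front.

    from urllib.parse import urlparse

    from datasets import load_dataset  # pip install datasets

    # Stream the public re-creation of C4 instead of downloading it whole.
    c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

    for i, record in enumerate(c4):
        # Each record carries the page text, its source URL and a timestamp.
        print(urlparse(record["url"]).netloc, record["text"][:60])
        if i == 4:  # peek at the first five pages only
            break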

[Interactive table: The websites in Google’s C4 dataset, searchable by domain, 
listing each site’s rank and its percent of all tokens.]

The Post believes it is important to present the complete contents of the data 
fed into AI models, which promise to govern many aspects of modern life. Some 
websites in this data set contain highly offensive language and we have 
attempted to mask these words. Objectionable content may remain.

Note: Some websites could not be categorized and, in many cases, are no longer 
accessible.

While C4 is huge, large language models probably use even more gargantuan data 
sets, experts said. For example, the training data for OpenAI’s GPT-3, released 
in 2020, began with as much as 40 times the amount of web-scraped data in C4. 
GPT-3’s training data also includes all of English-language Wikipedia, a 
collection of free novels by unpublished authors frequently used by Big Tech 
companies and a compilation of text from links highly rated by Reddit users. 
(Reddit, whose content is regularly used in AI training, announced Tuesday 
that it plans to charge companies for such access.)

[Quiz: Did AI make this? Test your knowledge.]

Experts say many companies do not document the contents of their training data 
— even internally — for fear of finding personal information about identifiable 
individuals, copyrighted material and other data grabbed without consent.

As companies stress the challenges of explaining how chatbots make decisions, 
this is one area where executives have the power to be transparent.

About this story

For this story, The Post contacted researchers at the Allen Institute for AI, who 
re-created Google’s C4 data set and provided The Post with its 15.7 million 
domains. The Post cleaned and analyzed this data in a few ways.

Many websites have separate domains for their mobile versions (e.g., 
“en.m.wikipedia.org” and “en.wikipedia.org”). We treated these as the same 
domain. We also combined subdomains aimed at specific languages, so 
“en.wikipedia.org” became “wikipedia.org.”
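
A minimal sketch of that merging step follows. The Post did not publish its 
exact rules, so treating leading "m" and two-letter language codes as 
droppable subdomains is an assumption here, and multi-part suffixes like 
".co.uk" are ignored for brevity.

    import re

    # Assumed rule: strip leading mobile ("m") and language-code subdomains.
    LANG_CODE = re.compile(r"^[a-z]{2}(-[a-z]{2})?$")

    def normalize(domain: str) -> str:
        parts = domain.lower().split(".")
        while len(parts) > 2 and (parts[0] == "m" or LANG_CODE.match(parts[0])):
            parts.pop(0)
        return ".".join(parts)

    print(normalize("en.m.wikipedia.org"))  # -> wikipedia.org
    print(normalize("docs.microsoft.com"))  # -> docs.microsoft.com (unchanged)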

This left 15.1 million unique domains.

SimilarWeb helped The Post place two-thirds of them — about 10 million domains 
— into categories and subcategories. (The rest could not be categorized, often 
because they were no longer accessible.) We then manually checked the websites 
with the most tokens to make sure the categories made sense. We also combined 
many of the smallest subcategories.

Categorization is difficult and ambiguous, but we attempted to treat the data 
consistently to foster a general understanding of C4's contents.

The researchers at the Allen Institute for AI were Jesse Dodge, Yanai Elazar, Dirk 
Groeneveld and Nicole DeCario.

Illustration by Talia Trackim.

Editing by Kate Rabinowitz, Alexis Sobel Fitts and Karly Domb Sadof.
