Bookmarked Inside the secret list of websites that make AI like ChatGPT sound smart (by Kevin Schaul, Szu Yu Chen and Nitasha Tiku in the Washington Post)

The Washington Post takes a closer look at Google’s C4 dataset, which comprises the content of 15 million websites and has been used to train various LLMs. Perhaps also the one OpenAI used for e.g. ChatGPT, although it’s not known what OpenAI has been using as source material.

They include a search engine that lets you submit a domain name and find out how many tokens it contributed to the dataset (a token is usually a word, or part of a word).

Obviously I looked at some of the domains I use. This blog is the 102,860th contributor to the dataset, with 200,000 tokens (1/10000% of the total).
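Out of curiosity, those two figures are enough to back out the rough size of the whole dataset. A minimal sketch, using only the numbers reported by the search tool (200,000 tokens, 1/10000 of a percent):

```python
# Back out the dataset's approximate total token count
# from this blog's reported contribution and share.
contributed_tokens = 200_000
share_percent = 1 / 10_000            # reported as 1/10000%
share_fraction = share_percent / 100  # convert percent to a fraction

total_tokens = contributed_tokens / share_fraction
print(f"{total_tokens:.0e}")  # on the order of 2e11 tokens
```

So the reported percentage implies a dataset on the order of a couple of hundred billion tokens, which matches the scale the Washington Post describes for C4.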


Screenshot of the Washington Post’s search tool, showing the result for this domain, zylstra.org.

12 reactions on “How Many Tokens From Your Blog Are In Google’s LLM?”

  1. Rank | Domain | Tokens | Percent of all tokens
     145 | github.com | 16M | 0.01%
     213,288 | github.community | 110k | 0.00007%
     ^^ this might be a real problem.

  2. @jackyan maybe a difference between mobile browsers and desktop? When I use the search it mentions the tokens before the percentage. If you follow the link to my blog post there’s a screenshot of what I saw

Comments are closed.