Bookmarked Inside the secret list of websites that make AI like ChatGPT sound smart (by Kevin Schaul, Szu Yu Chen and Nitasha Tiku in the Washington Post)

The Washington Post takes a closer look at Google’s C4 dataset, which comprises the content of 15 million websites and has been used to train various LLMs. Perhaps that includes the one OpenAI uses for e.g. ChatGPT, although it’s not known what OpenAI has been using as source material.

They include a search engine, which lets you submit a domain name and find out how many tokens it contributed to the dataset (a token is usually a word, or part of a word).

Obviously I looked at some of the domains I use. This blog is the 102,860th contributor to the dataset, with 200,000 tokens (roughly 0.0001% of the total).
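
As a rough sanity check on that percentage, here is a minimal sketch in Python. It assumes the rounded github.com figures quoted in a reader comment on this post (16M tokens, 0.01% of all tokens) are a fair basis for estimating the size of the whole dataset. The function names are my own, and all results are approximate because the inputs are rounded.

```python
# Figures quoted in this post and in a reader comment on it.
# These are rounded values, so every result here is approximate.
GITHUB_TOKENS = 16_000_000   # "16M" tokens for github.com
GITHUB_PERCENT = 0.01        # github.com's share, in percent
BLOG_TOKENS = 200_000        # tokens contributed by this blog


def implied_total(tokens: float, percent: float) -> float:
    """Estimate the dataset size from one domain's token count and share."""
    return tokens / (percent / 100)


def share_percent(tokens: float, total: float) -> float:
    """Express a token count as a percentage of the total."""
    return tokens / total * 100


total = implied_total(GITHUB_TOKENS, GITHUB_PERCENT)
blog_share = share_percent(BLOG_TOKENS, total)

print(f"implied total: about {total:.1e} tokens")
print(f"blog share: about {blog_share:.5f}%")
```

The estimate lands on the order of 160 billion tokens for the dataset, which puts this blog's 200,000 tokens at about one ten-thousandth of a percent, in line with what the search tool reports.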


Screenshot of the Washington Post’s search tool, showing the result for this domain, zylstra.org.

12 reactions on “How Many Tokens From Your Blog Are In Google’s LLM?”

  1. Rank     Domain            Tokens  Percent of all tokens
     145      github.com        16M     0.01%
     213,288  github.community  110k    0.00007%

     ^^ this might be a real problem.
