Bookmarked Inside the secret list of websites that make AI like ChatGPT sound smart (by Kevin Schaul, Szu Yu Chen and Nitasha Tiku in the Washington Post)
The Washington Post takes a closer look at Google’s C4 dataset, which comprises the content of 15 million websites and has been used to train various LLMs. It may also be similar to what OpenAI used for e.g. ChatGPT, although it’s not publicly known what source material OpenAI has been using.
They include a search tool that lets you submit a domain name and find out how many tokens it contributed to the dataset (a token is usually a word, or part of a word).
Obviously I looked up some of the domains I use. This blog is the 102,860th-largest contributor to the dataset, with 200,000 tokens (0.0001% of the total).
Screenshot of the Washington Post’s search tool, showing the result for this domain, zylstra.org.
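As a quick sanity check of that fraction, here’s a minimal Python sketch. The total token count is an assumption on my part (the English C4 corpus is often cited as roughly 156 billion tokens), not a figure taken from the article:

```python
# Back-of-the-envelope check of this blog's share of the C4 dataset.
# ASSUMPTION: the English C4 corpus is roughly 156 billion tokens;
# the exact total used in the Post's analysis may differ.

blog_tokens = 200_000             # tokens contributed by zylstra.org
total_tokens = 156_000_000_000    # assumed total tokens in C4 (approximate)

share = blog_tokens / total_tokens
print(f"Share of C4: {share * 100:.6f}% of all tokens")
# -> about 0.0001%, matching the fraction shown by the search tool
```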