David Speier is a freelance journalist who researches the German far right. In this thread on Mastodon he describes the work they’ve done to check statements from interviews with a former far-right member, and to connect them to other source material (photos from events, other people, reports etc.). Of interest to me here is that they used Obsidian to map out people, groups, places, events and occurrences, to verify, to see overlaps and to spot blind spots. It is a nice example of taking something that is inherently text and image based and using Obsidian to ferret out the connections and patterns. There are some topics that currently pop up in very different projects in my work, and more purposefully teasing out the connections between them, as in this example, seems a useful notion.
In an #Obsidian database we brought together contact persons, groups, places and events. More than 70 extensive supporting documents back up the individual statements made by “Michael”
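To make the idea of spotting overlaps and blind spots concrete for myself, here is a minimal sketch in Python. It is not how Speier and his colleagues worked, which the thread does not detail at this level. It only assumes notes that link to each other with Obsidian-style `[[wikilinks]]`. The note titles and contents are made-up placeholders, and `backlinks` is a helper name of my own.

```python
import re
from collections import defaultdict

# Hypothetical notes keyed by title; all content is invented for illustration.
notes = {
    "Statement 01": "Says he met [[Person A]] at [[Event X]] in [[Town Y]].",
    "Statement 02": "Mentions [[Group B]] organising [[Event X]].",
    "Photo 17": "Shows [[Person A]] and [[Person C]] at [[Event X]].",
    "Person A": "Member of [[Group B]].",
    "Event X": "Took place in [[Town Y]].",
}

# Captures the target of [[Target]], [[Target|alias]] or [[Target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def backlinks(notes):
    """Map each link target to the set of notes that link to it."""
    incoming = defaultdict(set)
    for title, text in notes.items():
        for target in WIKILINK.findall(text):
            incoming[target.strip()].add(title)
    return incoming


incoming = backlinks(notes)

# Overlaps: targets mentioned in more than one note, so they can be cross-checked.
overlaps = {t: sorted(srcs) for t, srcs in incoming.items() if len(srcs) > 1}

# Blind spots: targets that are mentioned but have no note of their own yet.
blind_spots = sorted(t for t in incoming if t not in notes)

print("Overlaps:", overlaps)
print("Blind spots:", blind_spots)
```

In Obsidian itself the backlinks pane and the graph view surface the same information, so a script like this is only needed if you want to process the notes outside the app.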
Ted Chiang realises that corporates are best positioned to leverage the affordances of algorithmic applications, and that that is where the risk of the ‘runaway AIs’ resides. I agree that they are best positioned, because corporations are AI’s non-digital twin, and have been recognised as such for a decade.
Brewster Kahle said (in 2014) that corporations should be seen as the 1st generation AIs, and Charlie Stross reinforced it (in 2017) by dubbing corporations ‘Slow AI’, as corporations are context blind, single purpose algorithms. That single purpose being shareholder value. Jeremy Lent (in 2017) made the same point when he dubbed corporations ‘socio-paths with global reach’ and said that the fear of runaway AI was focusing on the wrong thing because “humans have already created a force that is well on its way to devouring both humanity and the earth in just the way they fear. It’s called the Corporation”. Basically our AI overlords are already here: they likely employ you. Of course existing Slow AI is best positioned to adopt its faster offspring, digital algorithms. That adoption can be seen as the first step on the feared iterative path to runaway AI.
The doomsday scenario is … A.I.-supercharged corporations destroying the environment and the working class in their pursuit of shareholder value.
I’ll repeat the image I used in my 2019 blogpost linked above:
Your Slow AI overlords looking down on you, photo Simone Brunozzi, CC-BY-SA
LLMs usually require loads of training data, the bigger the better. This biases such training, as Maggie Appleton also pointed out, to western and English dominated resources. This paper describes creating a model for a group of 11 African languages that are underresourced online, and as a result don’t figure significantly in the large models going around (4 of the 11 have never been included in an LLM before). All the material is available on GitHub. They conclude that training an LLM on such lower resourced languages together with the larger ones is less effective than training on a grouping of underresourced languages by themselves. Less than 1GB of text can provide a competitive model! That sounds highly interesting for the stated reason: it allows models to be created for underresourced languages with relatively little effort. I think that is a fantastic purpose because it may assist in keeping a wide variety of languages more relevant and in bucking the trend towards cultural centralisation (look at me writing here in English for a case in point). It also makes me wonder about a different group of use cases: where you have texts in a language that is well enough represented in the mainstream LLMs, but where the corpus you are specifically or only interested in is much smaller, below that 1GB threshold. For instance all your own written output over the course of your life, or certain specific civic tech applications.
We show that it is possible to train competitive multilingual language models on less than 1 GB of text. … our model … is very competitive overall. … Results suggest that our “small data” approach based on similar languages may sometimes work better than joint training on large datasets with high-resource languages.
I think it is a bit of a ‘well-duh’ thing, but still worth underlining in general conversation. The name Large Language Model is a misnomer, as it does not contain a model of how (a) language (theoretically) works. It e.g. doesn’t generate texts by following grammar rules. By extension the same holds for how LLMs can generate code from natural language prompts: they have been trained on software code, without the theoretical underpinnings of programming languages. Veres suggests using the term Large Corpus Models. I think getting people to write LCMs and not LLMs will be impossible. I can however for myself highlight the difference by reading ‘Large Language usage Model’ every time I see LLM, as the corpus is one of language(s) in actual use.
We argue that the term language model is misleading because deep learning models are not theoretical models of language and propose the adoption of corpus model instead, which better reflects the genesis and contents of the model.
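To illustrate the difference between modelling language and modelling a corpus, here is a deliberately tiny sketch in Python. It is a toy bigram table, not how an actual LLM works (those are neural networks trained on vastly more text). The corpus, function names and seed are my own assumptions for the example. What it shows is that plausible looking word sequences can come purely from counting what followed what in usage, without a single grammar rule anywhere in the code.

```python
import random
from collections import defaultdict

# A tiny made-up corpus; real models use vastly more text and richer statistics.
corpus = "the cat sat on the mat . the dog sat on the rug . the cat saw the dog ."


def train_bigrams(text):
    """Record, for each word, the words that followed it in the corpus."""
    words = text.split()
    following = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        following[current].append(nxt)
    return following


def generate(following, start, length, seed=0):
    """Produce text by repeatedly sampling a word that followed the previous one."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        candidates = following.get(out[-1])
        if not candidates:  # dead end: this word was never followed by anything
            break
        out.append(rng.choice(candidates))
    return out


model = train_bigrams(corpus)
sample = generate(model, "the", 8)
print(" ".join(sample))
```

Every pair of adjacent words in the output occurred somewhere in the corpus, and that is all the ‘knowledge’ the table has. That is the sense in which I read ‘corpus model’.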
I very much enjoyed this talk that Maggie Appleton gave at Causal Islands in Toronto, Canada, 25-27 April 2023. It reminds me of the fun and insightful keynotes at Reboot conferences a long time ago, some of which shifted my perspectives long term.
This talk is about the impact on how we will experience and use the web when generative algorithms create most of its content. Appleton explores the potential effects of that and the futures that might result. She puts human agency at the center when it comes to choosing our path forward in experimenting with and using ‘algogens’ on the web, and to navigating an internet where nobody believes you’re human.
Appleton is a product designer with Ought, on products that use language models to augment and extend human (cognitive) capabilities. Ought makes Elicit, a tool that surfaces (and summarises) potentially useful papers for your research questions. I use Elicit every now and then, and really should use it more often.
An exploration of the problems and possible futures of flooding the web with generative AI content
The Washington Post takes a closer look at Google’s C4 dataset, which comprises the content of 15 million websites, and has been used to train various LLMs. Perhaps also the one used by OpenAI for e.g. ChatGPT, although it’s not known what OpenAI has been using as source material.
They include a search engine, which lets you submit a domain name and find out how many tokens it contributed to the dataset (a token is usually a word, or part of a word).
Obviously I looked at some of the domains I use. This blog is the 102,860th contributor to the dataset, with 200,000 tokens (1/10000% of the total).
Screenshot of the Washington Post’s search tool, showing the result for this domain, zylstra.org.
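As a sanity check on those two figures, a few lines of Python show what total they imply. This uses only the numbers mentioned above. The variable names are mine, and because the percentage is rounded the result is an order of magnitude, not an exact size for C4.

```python
tokens_from_blog = 200_000   # tokens contributed by this blog, per the search tool
share_percent = 1 / 10_000   # "1/10000%" of the total, i.e. 0.0001 percent

# Convert the percentage to a plain fraction, then scale up to the whole dataset.
share_fraction = share_percent / 100
implied_total = tokens_from_blog / share_fraction

print(f"share as a fraction: {share_fraction:.0e}")
print(f"implied dataset size: {implied_total:,.0f} tokens")
```

So the two figures together put the dataset at roughly 200 billion tokens, which makes this blog’s share one token in a million.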
The Washington Post lets you search a domain's contribution to Google's C4 dataset, used to train various LLMs for various #generativeAI #AI
My blog contributed 200,000 tokens. Yours?