Bing Chat is connected to the internet and runs searches, including on Twitter, when you ask it something. It then weaves what it finds online into the text it generates from your prompt. Henk van Ess shows how quickly the content of a tweet gets incorporated (and changed if additional messages become available): with just three tweets he influenced Bing Chat’s output. This opens a pathway for influence and the dissemination of misinformation, especially since the recent quality changes over at Twitter. The feedback loop this creates (internet texts get generated based on existing internet texts, and so on) will easily result in a vicious circle. In her recent talk Maggie Appleton listed this as one of her possible futures, using a metaphor I can’t unsee but which does describe it effectively: Human Centipede Epistemology.
Bing/ChatGPT’s rapid response to tweets has a double-edged sword. Bing quickly corrects itself based on tweets … But those with specific agendas or biases may attempt to abuse the system … We’ve seen it all before. This is similar to Google Bombing…
David Speier is a freelance journalist who researches the German far right. In this thread on Mastodon he describes the work they’ve done to check statements from interviews with a former far right member, and to connect them to other source material (photos from events, other people, reports etc.). Of interest to me here is that they used Obsidian to map out people, groups, places, events and occurrences, to verify, to see overlaps and to spot blind spots. It is a nice example of taking something that is inherently text and image based and using Obsidian to ferret out the connections and patterns. There are some topics that currently pop up in my work in very different projects, and more purposefully teasing out the connections, like in this example, seems a useful notion.
In an #Obsidian database we brought together contact persons, groups, places and events. More than 70 extensive supporting documents back up the individual statements by “Michael”
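To make the idea of spotting overlaps and blind spots a bit more concrete, here is a minimal sketch of my own, not the journalists’ actual workflow. It assumes notes that link to people, places and events with Obsidian-style `[[wikilinks]]`. The note names and contents are made-up placeholders, and `backlinks` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Matches the target of an Obsidian-style link such as [[Person A]],
# [[Person A|alias]] or [[Person A#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

# Placeholder notes, purely illustrative (assumption, not real data).
NOTES = {
    "Interview 1": "Met [[Person A]] at [[Event X]] in [[Town Y]].",
    "Photo 12": "[[Person A]] and [[Person B]] at [[Event X]].",
    "Report 3": "[[Group Z]] organised [[Event X]].",
}

def backlinks(notes):
    """Map each linked entity to the set of notes that mention it."""
    index = defaultdict(set)
    for note, text in notes.items():
        for target in WIKILINK.findall(text):
            index[target.strip()].add(note)
    return index

index = backlinks(NOTES)

# Entities backed by more than one source: the overlaps.
corroborated = sorted(t for t, srcs in index.items() if len(srcs) > 1)

# Entities that rest on a single source: candidates for blind spots.
single_source = sorted(t for t, srcs in index.items() if len(srcs) == 1)

print("Corroborated:", corroborated)
print("Single source:", single_source)
```

Obsidian’s graph view and backlinks pane give you much of this visually. The point of the sketch is only that the overlaps fall out of the links themselves once the notes are connected.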
Author Steven Johnson has been working with Google and developed a prototype for Tailwind. Tailwind, an ‘AI first notebook’, is intended to bring an LLM to your own source material, so that you can ask questions of the sources you give it. You point it to a set of resources in your Google Drive, and what Tailwind generates will be based on just those resources. It also shows you the specific source of the things it generates. Johnson explicitly places it in the Tools for Thought category. You can join a waiting list if you’re in the USA, and a beta should be available in the summer. Is the USA limit intended to reduce the number of applicants, I wonder, or a sign that they’re still figuring out things like GDPR for this tool? Tailwind is prototyped on the PaLM API though, which is now generally available.
This, from its description, gets to where it becomes much more interesting to use LLM and GPT tools: a localised (not local though, it lives in your Google footprint) tool, where the user defines the corpus of sources used and the results are traceable. As the quote below suggests, a personal research assistant. Not just for my entire corpus of notes, as I describe in that linked blogpost, but also for a subset of notes on a single topic or project. I think more tools like these will arrive in the coming months, some of which will likely be truly local and personal.
On the Tailwind team we’ve been referring to our general approach as source-grounded AI. Tailwind allows you to define a set of documents as trusted sources …, shaping all of the model’s interactions with you. … other types of sources as well, such as your research materials for a book or blog post. The idea here is to craft a role for the LLM that is … something closer to an efficient research assistant, helping you explore the information that matters most to you.
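A minimal sketch of the two properties I care about here, a user-defined corpus and traceable results. This is emphatically not how Tailwind works: there is no LLM involved, and plain word overlap stands in for whatever retrieval a real tool would use. The file names, texts and the `grounded_search` function are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical user-defined corpus: source name -> text (assumption).
SOURCES = {
    "notes/slow-ai.md": "Corporations can be seen as slow AI, single purpose algorithms.",
    "notes/obsidian.md": "Obsidian helps map people, groups, places and events.",
    "notes/small-models.md": "Small language models can be trained on less than 1 GB of text.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def grounded_search(question, sources, top_k=2):
    """Return (source, passage) pairs from the given corpus only.

    Passages are ranked by how many words they share with the question.
    Sources with no shared words are left out, so an empty list means
    'nothing in your sources covers this' rather than a made-up answer.
    """
    question_words = Counter(tokenize(question))
    scored = []
    for name, text in sources.items():
        overlap = sum((question_words & Counter(tokenize(text))).values())
        if overlap > 0:
            scored.append((overlap, name, text))
    # Highest overlap first; ties are broken by source name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, text) for _, name, text in scored[:top_k]]

for source, passage in grounded_search("How do corporations relate to slow AI?", SOURCES):
    print(f"[{source}] {passage}")
```

Every result carries the name of the source it came from, and a question outside the corpus yields nothing at all. A real tool would hand the retrieved passages to a language model to phrase an answer, but the grounding and the traceability come from this step.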
Ted Chiang realises that corporates are best positioned to leverage the affordances of algorithmic applications, and that that is where the risk of the ‘runaway AIs’ resides. I agree that they are best positioned, because corporations are AI’s non-digital twin, and have been recognised as such for a decade.
Brewster Kahle said (in 2014) that corporations should be seen as the first generation of AIs, and Charlie Stross reinforced it (in 2017) by dubbing corporations ‘Slow AI’, as corporations are context blind, single purpose algorithms. That single purpose being shareholder value. Jeremy Lent (in 2017) made the same point when he dubbed corporations ‘socio-paths with global reach’ and said that the fear of runaway AI was focusing on the wrong thing because “humans have already created a force that is well on its way to devouring both humanity and the earth in just the way they fear. It’s called the Corporation”. Basically our AI overlords are already here: they likely employ you. Of course existing Slow AI is best positioned to adopt its faster, younger digital counterparts. That adoption can be seen as the first step on the feared iterative path to runaway AI.
The doomsday scenario is … A.I.-supercharged corporations destroying the environment and the working class in their pursuit of shareholder value.
I’ll repeat the image I used in my 2019 blogpost linked above:
Your Slow AI overlords looking down on you, photo Simone Brunozzi, CC-BY-SA
LLMs usually require loads of training data, the bigger the better. This biases such training, as Maggie Appleton also pointed out, towards western and English dominated resources. This paper describes creating a model for a group of 11 African languages that are underresourced online, and as a result don’t figure significantly in the large models going around (4 of the 11 have never been included in an LLM before). All the material is available on GitHub. They conclude that training an LLM on such lower resourced languages together with the larger ones is less effective than training on a grouping of underresourced languages. Less than 1GB of text can provide a competitive model! That sounds highly interesting for the stated reason: it allows models to be created for underresourced languages with relatively little effort. I think that is a fantastic purpose, because it may help keep a wide variety of languages relevant and buck the trend towards cultural centralisation (look at me writing here in English for a case in point). It also makes me wonder about a different group of use cases: where you have texts in a language that is well enough represented in the mainstream LLMs, but where the corpus you are specifically or only interested in is much smaller, below that 1GB threshold. For instance all your own written output over the course of your life, or certain specific civic tech applications.
We show that it is possible to train competitive multilingual language models on less than 1 GB of text. … our model … is very competitive overall. … Results suggest that our “small data” approach based on similar languages may sometimes work better than joint training on large datasets with high-resource languages.
I think it is a bit of a ‘well-duh’ thing, but still worth underlining in general conversation. The name Large Language Model is something of a misnomer, as it does not contain a model of how (a) language (theoretically) works. It doesn’t, for example, generate texts by following grammar rules. By extension the same holds for how LLMs generate code from natural language prompts: they have been trained on software code, without the theoretical underpinnings of programming languages. Veres suggests using the term Large Corpus Models. I think getting people to write LCMs instead of LLMs will be impossible. I can however highlight the difference for myself by reading ‘Large Language usage Model’ every time I see LLM, as the corpus is one of language(s) in actual use.
We argue that the term language model is misleading because deep learning models are not theoretical models of language and propose the adoption of corpus model instead, which better reflects the genesis and contents of the model.