Bookmarked: "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages" by Kelechi Ogueji, Yuxin Zhu and Jimmy Lin (2021)

LLMs usually require loads of training data, the bigger the better. As Maggie Appleton also pointed out, this biases such training towards western and English-dominated resources. This paper describes creating a model for a group of 11 African languages that are under-resourced online and as a result don't figure significantly in the large models going around (4 of the 11 had never been included in an LLM before). All the material is available on GitHub. The authors conclude that jointly training an LLM on such lower-resourced languages together with high-resource ones is less effective than training on a grouping of similar under-resourced languages by themselves. Less than 1GB of text can provide a competitive model!

That sounds highly interesting for the stated reason: it allows models to be created for under-resourced languages at relatively little effort. I think that is a fantastic purpose, because it may assist in keeping a wide variety of languages relevant and bucking the trend towards cultural centralisation (look at me writing here in English for a case in point). It also makes me wonder about a different group of use cases: where you have texts in a language that is well enough represented in the mainstream LLMs, but where the corpus you are specifically or only interested in is much smaller, below that 1GB threshold. For instance all your own written output over the course of your life, or certain specific civic tech applications.

We show that it is possible to train competitive multilingual language models on less than 1 GB of text. … Our model … is very competitive overall. … Results suggest that our "small data" approach based on similar languages may sometimes work better than joint training on large datasets with high-resource languages.

Ogueji et al, 2021

John Caswell writes about the role of conversation, saying "conversation is an art form we're mostly pretty rubbish at". New tools that employ LLMs, such as GPT-3, can only be used well by those learning to prompt them effectively. Essentially we're learning to have a conversation with LLMs so that their output is usable for the prompter. (As I'm writing this my feed reader updates to show a follow-up post about prompting by John.)

Last August I wrote about articles by Henrik Olaf Karlsson and Matt Webb that discuss prompting as a skill with newly increasing importance.

Prompting to get a certain type of output instrumentalises a conversation partner, which is fine when using LLMs, but not in conversations with people. In human conversation the prompting is less about ensuring output that is useful to the prompter and more about assisting the other to express themselves as best they can (meaning usefulness will be a guaranteed side effect if you are interested in your conversational counterparts). In human conversation the other is another conscious actor in the same social system (the conversation) as you are.

John takes the need for us to learn to better prompt LLMs and asks whether we'll also learn how to better prompt conversations with other people. That would be great. Many conversations take the form of listening less to the content of what others say and more for the right moment to jump in with what we ourselves want to say. Broadcast-driven versus curiosity-driven. You and I, we all do this. Getting consciously better at avoiding that common pattern would be a win for all.

In parallel Donald Clark wrote that the race to innovate services on top of LLMs is on, spurred by OpenAI's public release of ChatGPT in November. The race is indeed on, although I wonder whether those entering it all have an actual sense of what they're racing and what they're racing towards. The generic use of LLMs currently at the centre of public discussion might, I think, be less promising than gearing them towards specific contexts. Back in August I mentioned Elicit, for instance, which helps you kick off a literature search based on a research question. Other niche applications are sure to be interesting too.

The generic models are definitely capable of hallucinating in ways that reinforce our tendency towards anthropomorphism (which needs little reinforcement as it is). Very, very ELIZA. Even if on occasion it creeps you out, such as when Bing's implementation of GPT declares its love for you and starts suggesting you don't really love your life partner.

I associated what Karlsson wrote with the way one can interact with one's personal knowledge management system, much as Luhmann described his note cards: as a communication partner. Luhmann talks about the value of being surprised by whatever person or system you're communicating with. (The anthropomorphism kicks in if, based on that surprise, we then ascribe intention to the system we're communicating with.)

Being good at prompting is relevant in my work, where change in complex environments is often the focus. Getting better at prompting machines may lift all boats.

I wonder if, as part of the race that Donald Clark mentions, we will see LLMs applied as personal tools. Where I feed a more open LLM like BLOOM my blog archive and my notes, running it as a personal instance (for which the full BLOOM model is too big, I know), and then use it to have conversations with myself. Prompting that system to have exchanges about the things I previously wrote down in my own words. With results that phrase things in my own idiom and style. Now that would be very interesting to experiment with. What valuable results and insight progression would it yield? Can I have a salon with myself and my system, and/or with perhaps a few others and their systems? What pathways into the uncanny valley will it open up? For instance, is there a way to radicalise yourself (like social media can) through the feedback loops of association between your various notes, notions and follow-up questions/prompts?
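The retrieval half of such a personal setup is the easy part to sketch, even before any model is involved: given a prompt, find the note from your own archive most similar to it, which could then be handed to an LLM as context. A toy illustration, assuming notes as plain strings and using simple term-frequency cosine similarity (the note texts here are made up; a real setup would use proper embeddings):

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase word tokens; good enough for a toy example.
    return re.findall(r"[a-z']+", text.lower())

def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_matching_note(prompt, notes):
    # Return the archived note most similar to the prompt.
    pv = Counter(tokenize(prompt))
    return max(notes, key=lambda n: cosine(pv, Counter(tokenize(n))))

notes = [
    "Notes on prompting LLMs as a conversation skill.",
    "Travel notes from a coffee break in Helsinki.",
    "Luhmann treated his note cards as a communication partner.",
]
print(best_matching_note("Luhmann note cards as communication partner", notes))
# → Luhmann treated his note cards as a communication partner.
```

The interesting experiments would start where the retrieved notes are fed back into a model that answers in your own idiom, but that part is exactly what remains to be tried.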

An image generated with Stable Diffusion, with the prompt "A group of fashionable people having a conversation over coffee in a salon, in the style of an oil on canvas painting". Public domain.

In the noisy chaotic phase that Twitter Inc. is going through, I downloaded my data from them 2 weeks ago. Meanwhile in the Fediverse newcomers mention they appreciate how nice, pleasant and conversational things are.

It’s good to note that that is how Twitter started out too. In my network I felt I was late joining Twitter, because I was using Jaiku (a similar, and I might add better, service based in Europe). Sixteen years on, that counts as an early user. My user ID is number 59923, registered on Tuesday December 12th, 2006. Judging by the time, 10:36am, I registered during my regular 10:30 coffee break.

One minute later I posted my first message. It had ID 994313, so my Tweet was just within the first million messages on Twitter (the current rate seems to be over 800 million Tweets per day!). That first message mentioned the tool I was going to benchmark Twitter against: Jaiku.

What followed that first message was much like my past four years on Mastodon: a bunch of gentle conversations.

Back then everyone was nice, as you tend to be in public when, say, walking through a small village. Over time Twitter conversations tended towards “I need to win this exchange, even if I agree with my counterpart”. Argumentative. Performance above conversation. Performing in front of your own followers by enacting a conversation with someone else. The general tone of voice on Twitter (apart from the actual toxicity) reflects the difference in posture you take in a metropolis versus a village. In a village you greet passersby, project an aura of approachability, etc. In an urban environment you tend to pretend not to see others, are pro-active in claiming your physical space, alert that others don’t push you aside or further down the queue. Urban behaviour easily looks aggressive, and at the very least unnecessarily rude, in a village.

The past few weeks saw a massive influx of people from Twitter. Which is good. I also noticed that it felt a bit like city folk descending on some backwater. The general tone of voice was more direct and terse in phrasing, reflecting Twitter’s character limit, in contrast with the wider limits in Mastodon-village, which allow both for more nuance and for, yes, politeness.
The contrast was felt both ways, as newcomers commented on how nice the conversations were, a breath of fresh air, etc.

Quantitative changes, like a rising number of people using a specific communication channel, lead to qualitative changes. They did on Twitter. They will on Mastodon, despite the differences. In the fediverse some of that effect will be buffered by the tools individual users have on hand (blocking accounts, blocking instances, moving instance or running your own, participating from your own website, e.g.). Meaning one can choose to ‘live’ in the middle of the metropolis, or on its outskirts which few frequent. But the effect will be there, also because there will be more tools built from other starting principles than the current tree of fediverse applications on top of the underlying ActivityPub protocol. Some will run counter to the principles that underpin e.g. Mastodon, others will be aligned. But change it will.

It’s nice out here, but do regularly check the back of the package for the best-by date.

In reply to Open Web Search project kicked off by Djoerd Hiemstra

I’m looking forward to following this project, Djoerd! It sounds sort of IndieWeb-like. Where e.g. Microsub decouples feed fetching from feed reading, and Micropub decouples writing from posting, this project decouples index building from search. Within IndieWeb that decoupling allows the creation of a variety of personal tools to read and write the web. I’ve long been musing about personal search engines and personal agents and crawlers without putting anything into action. I’m curious to see whether this project will actually deliver some of the things I dreamt of over time, by enabling personal tools for search.
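To make that decoupling concrete: the expensive step produces a pre-computed index shard as a shareable artifact, and a lightweight local tool only needs to load the shard and query it. A toy sketch, not the project's actual architecture (the filenames, documents and functions here are all made up for illustration):

```python
import json
from collections import defaultdict

def build_index_shard(docs, path):
    # Expensive step, done once (e.g. on a large cluster): map each
    # term to the ids of documents containing it, then serialise the
    # shard so anyone can download and reuse it.
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    with open(path, "w") as f:
        json.dump(index, f)

def search(path, query):
    # Cheap step, run locally by a personal search tool: load a
    # pre-computed shard and intersect the posting lists of the
    # query terms.
    with open(path) as f:
        index = json.load(f)
    postings = [set(index.get(t, [])) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    "a": "open web search project",
    "b": "personal search engines",
    "c": "open source feed readers",
}
build_index_shard(docs, "shard.json")
print(search("shard.json", "open search"))  # → ['a']
```

The point of the separation is that many different local search tools, each with its own ranking and interface, can sit on top of the same shared shards.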

A new EU project … [in which] … the key idea is to separate index construction from the search engines themselves, where the most expensive step to create index shards can be carried out on large clusters while the search engine itself can be operated locally. …[including] an Open-Web-Search Engine Hub, [where anyone can] share their specifications of search engines and pre-computed, regularly updated search indices. … that would enable a new future of human-centric search without privacy concerns.

Djoerd Hiemstra