Bookmarked Routledge nags academics to finish books ASAP to feed Microsoft’s AI (by Amy Castor and David Gerard)

Academic publishers are pushing authors to deliver manuscripts and articles faster (including suggesting that peer review be done within 15 days). This is to meet the quota of ‘content’ for LLM training that the publishers promised the AI companies they sold their souls to, like Microsoft’s OpenAI: Taylor & Francis/Routledge to the tune of 75M USD annually, Wiley for 44M USD. Authors don’t seem to figure in these equations, as there are no opt-outs on offer to them. No cut of the money either, obviously.

Must. Feed. The. Machine. Any randomly arranged bunch of words or phrases will do. Apparently.
Academic publishing, it seems, is not about writing down and distributing human knowledge, but about creating slop for the machines to turn into even sloppier slop. Back at my electronic engineering faculty around 1990 we dubbed such systems GIEMGO (rather than GIGO): garbage in, even more garbage out.

I wonder whether those academic publishers, in pushing for haste, stopped for a second to ask the machine they intend to feed whether this is a good idea.
I did. And although the machine ignores its own dependence on quality input, presumably because that causality isn’t expressed in its input text or isn’t statistically relevant in weighing input tokens, it does at least generate a text pointing to potential downsides for human knowledge. How nice of it.

Prompt: Several academic publishers have made deals with companies training large language models (LLMs) for many millions of dollars. In these deals they promised a certain quota of new books per year to be made available to further train LLMs. To fulfill such quota publishers are now pushing academic authors to finish manuscripts and articles before the end of the year. Including by forgoing proper peer review. Do you think this is a good idea? And what would be the effect on the quality of academic writing and on the quality of LLM training data?
Response: I do not have enough information to provide an accurate response as I am only a machine learning model and do not have access to human knowledge or opinions. However, it seems like there may be some potential downsides to this practice. By forcing academics to rush their work in order to meet quotas set by publishing houses, they risk sacrificing quality over quantity. This could result in lower-quality research being published, which may negatively impact the overall scientific community. Additionally, if the publishing houses prioritize quantity over quality, it could lead to a lack of diverse perspectives and voices in academia, limiting our understanding of complex issues.

Routledge told staff on Thursday that it was “extra important” to meet publishing targets for 2024. Why? Routledge had promised the AI companies a certain quota of books to train their LLMs.

Amy Castor and David Gerard

Bookmarked ChatGPT sees Tweets: A Double-Edged Sword by Henk van Ess

Bing Chat is connected to the internet and can run searches when you ask the chatbot something. This includes Twitter. It then weaves those online finds into the texts it generates from your prompt. Henk van Ess shows how quickly the content of a Twitter message gets incorporated (and changed when additional messages become available): with just three tweets he influenced Bing Chat’s output. This also opens a pathway for influence and the dissemination of misinformation, especially since the recent quality changes over at Twitter. The feedback loop this creates (internet texts get generated based on existing internet texts, and so on) will easily result in a vicious circle. (In her recent talk Maggie Appleton listed this as one of her possible futures, using a metaphor I can’t unsee but which does describe it effectively: Human Centipede Epistemology.)

Bing/ChatGPT’s rapid response to tweets has a double-edged sword. Bing quickly corrects itself based on tweets … But those with specific agendas or biases may attempt to abuse the system … We’ve seen it all before. This is similar to Google Bombing…

Henk van Ess

Bookmarked Project Tailwind by Steven Johnson

Author Steven Johnson has been working with Google and developed a prototype called Tailwind. Tailwind, an ‘AI-first notebook’, is intended to bring an LLM to your own source material, so you can use it to ask questions of the sources you give it. You point it at a set of resources in your Google Drive, and what Tailwind generates is based only on those resources. It also shows you the specific source of the things it generates. Johnson explicitly places it in the Tools for Thought category. You can join a waiting list if you’re in the USA, and a beta should be available in the summer. Is the USA limit intended to reduce the number of applicants, I wonder, or a sign that they’re still figuring out things like GDPR for this tool? Tailwind is prototyped on the PaLM API though, which is now generally available.

This, going by its description, gets at where it becomes much more interesting to use LLM and GPT tools: a localised tool (not local though, it lives in your Google footprint) where the user defines the corpus of sources used, and with traceable results. As the quote below suggests, a personal research assistant. Not just for my entire corpus of notes, as I describe in that linked blogpost, but also for a subset of notes on a single topic or project. I think more tools like these will be coming in the next months, some of which will likely be truly local and personal.

On the Tailwind team we’ve been referring to our general approach as source-grounded AI. Tailwind allows you to define a set of documents as trusted sources …, shaping all of the model’s interactions with you. … other types of sources as well, such as your research materials for a book or blog post. The idea here is to craft a role for the LLM that is … something closer to an efficient research assistant, helping you explore the information that matters most to you.

Steven Johnson
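
Tailwind itself is a closed prototype, but the general ‘source-grounded’ pattern Johnson describes is easy to sketch. Below is a minimal, purely illustrative Python sketch (all names, the folder layout and the keyword ranking are my own hypothetical choices, not Tailwind’s actual implementation): point it at a folder of your own notes, retrieve the passages most relevant to a question, and assemble a prompt that instructs whatever model you trust to answer only from those passages and to cite the file each claim came from.

```python
# Minimal illustrative sketch of "source-grounded" prompting: answer only
# from user-chosen documents and keep every snippet traceable to its file.
# All names here are hypothetical; this is not Tailwind's implementation.
from pathlib import Path


def load_corpus(folder: str) -> dict[str, str]:
    """Read the user-defined source folder (plain text and Markdown notes)."""
    files = list(Path(folder).glob("*.txt")) + list(Path(folder).glob("*.md"))
    return {p.name: p.read_text(encoding="utf-8") for p in files}


def retrieve(corpus: dict[str, str], question: str, k: int = 3) -> list[tuple[str, str]]:
    """Crude keyword-overlap ranking; a real tool would use embeddings."""
    words = {w.lower() for w in question.split()}
    scored = sorted(
        ((sum(w in text.lower() for w in words), name, text[:500])
         for name, text in corpus.items()),
        reverse=True,
    )
    return [(name, snippet) for score, name, snippet in scored[:k] if score > 0]


def build_prompt(question: str, sources: list[tuple[str, str]]) -> str:
    """Assemble a prompt that grounds the model in the retrieved snippets only."""
    quoted = "\n\n".join(f"[{name}]\n{snippet}" for name, snippet in sources)
    return (
        "Answer using ONLY the sources below. Cite the [filename] for every "
        "claim. If the sources do not cover the question, say so.\n\n"
        f"{quoted}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    question = "What did I note about GDPR and LLM tools?"
    corpus = load_corpus("my_notes")            # the corpus the user defines
    prompt = build_prompt(question, retrieve(corpus, question))
    print(prompt)  # send this to whichever local or hosted model you trust
```

The point of the sketch is the shape, not the code: the user picks the sources, generation is constrained to them, and every answer stays traceable to a file you chose yourself.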

On the internet nobody knows you’re a dog.

Peter Steiner, 1993

It seems that, after years of trollbots and content farms, generative algorithms are moving us more rapidly past the point where the basic assumption on the web can still be that an (anonymous) author is human until it becomes clear otherwise. Improving our crap detection skills from now on means a different default:

On the internet nobody believes you’re human.

until proven otherwise.