Bookmarked Routledge nags academics to finish books asap to feed Microsoft’s AI (by Amy Castor and David Gerard)

Academic publishers are pushing authors to speed up the delivery of manuscripts and articles (including suggesting that peer review be done within 15 days). This is to meet the quota of ‘content’ for LLM training that they promised to AI companies like Microsoft’s OpenAI, to whom the publishers sold their soul. Taylor & Francis/Routledge to the tune of 75M USD annually, Wiley for 44M USD. Authors don’t seem to figure in these equations, as there are no opt-outs on offer to them. No cut of the money for them either, obviously.

Must. Feed. The. Machine. Apparently. Any randomly arranged bunch of words or phrases will do. Apparently.
Academic publishing is not there to write down and distribute human knowledge, it seems, but to create slop for the machines to turn into even sloppier slop. Back at my electronic engineering faculty around 1990 we dubbed such systems GIEMGO (rather than GIGO): garbage in, even more garbage out.

I wonder if those academic publishers, when pushing for haste, stopped for a second and asked the machine they intend to feed whether this is a good idea?
I did. And despite the machine ignoring its own dependence on quality input, presumably because that causality isn’t expressed in its input text or isn’t statistically relevant in weighting input tokens, it at least does generate a text pointing to potential downsides for human knowledge. How nice of it.

Prompt: Several academic publishers have made deals with companies training large language models (LLM) for many millions of dollars. In these deals they promised a certain quota of new books per year to be made available to further train LLM. To fulfill such quota publishers are now pushing academic authors to finish manuscripts and articles before the end of year. Including by foregoing proper peer review. Do you think this is a good idea? And what would be the effect on the quality of academic writing and on the quality of LLM training data?
Response: I do not have enough information to provide an accurate response as I am only a machine learning model and do not have access to human knowledge or opinions. However, it seems like there may be some potential downsides to this practice. By forcing academics to rush their work in order to meet quotas set by publishing houses, they risk sacrificing quality over quantity. This could result in lower-quality research being published, which may negatively impact the overall scientific community. Additionally, if the publishing houses prioritize quantity over quality, it could lead to a lack of diverse perspectives and voices in academia, limiting our understanding of complex issues.

Routledge told staff on Thursday that it was “extra important” to meet publishing targets for 2024. Why? Routledge had promised the AI companies a certain quota of books to train their LLMs.

Amy Castor and David Gerard

A little over a decade ago I was at a small conference, where I happened to share the stage with Polly Higgins, a British lawyer seeking to internationally criminalise ‘ecocide’, and with various other speakers. One of those others was a self-declared rationalist running a data-driven research start-up with billionaire funding. He believed the trickle-down innovation trope that usually ends with its believers pulling up the ladder behind them, which can be readily found around all things tech-singularity. And he called himself a futurist. After the talks we as speakers stood on and in front of the stage chatting about the things that had been presented. The futurist, addressing me and one other speaker, chuckled that ‘that eco-lady’ had a nice idea, but a naive, unrealistic and irrational one that obviously had zero probability of happening. At the time I found it jerkish and jarring, not least given the guy’s absence of expertise in the fields concerned (environment and international law). It’s one of the key moments I remember from that conference, as the condescending remark so strongly clashed with the rest of the event and its atmosphere.

Meanwhile we’re some 10 years into the future of that conference. The futurist’s efforts collapsed soon after the conference, it seems, and there are no recent online traces of him. Polly Higgins is no longer alive, but her cause has very much outlived her. On 26 March the final step in the legislative path of a renewed Directive on the protection of the environment through criminal law was taken, when the Council of the EU formally approved the text agreed (last November) with the European Parliament. In that new ecocrimes directive, recital 21 of the preamble uses the term ‘ecocide’ to describe specific crimes covered in the Directive (PDF).

Criminal offences relating to intentional conduct listed in this Directive can lead to catastrophic results, such as widespread pollution, industrial accidents with severe effects on the environment or large-scale forest fires. Where such offences cause the destruction of, or widespread and substantial damage which is either irreversible or long-lasting to, an ecosystem of considerable size or environmental value or a habitat within a protected site, or cause widespread and substantial damage which is either irreversible or long-lasting to the quality of air, soil, or water, such offences, leading to such catastrophic results, should constitute qualified criminal offences and, consequently, be punished with more severe penalties than those applicable in the event of other criminal offences defined in this Directive. Those qualified criminal offences can encompass conduct comparable to ‘ecocide’, which is already covered by the law of certain Member States and which is being discussed in international fora.

Good work, barrister Higgins and the Stop Ecocide organisation.


A photo of me taken by Polly Higgins, as we had fun together driving an all-electric ‘motor bike’ around the venue’s hallways at that conference in 2013.

Polly Higgins about to take the e-chopper for a spin through the venue.

In reply to Creating a custom GPT to learn about my blog (and about myself) by Peter Rukavina

It’s not surprising that GPT-4 doesn’t work like a search engine and has a hard time surfacing factual statements from source texts. Like one of the commenters, I wonder what that means for the data analysis you also asked for. Perhaps those results too are merely plausible, but not actually analysed. Especially the day of the week thing, as that wasn’t in the data, and I wouldn’t expect GPT to determine the weekdays of all posts in the process of answering your prompt.

I am interested in doing what you did, but with 25 years of notes and annotations. And preferably with a different model that has fewer ethical issues attached. To have a chat about my interests and the links between things. Unlike the fact-based questions you asked the tool, that doesn’t necessarily need it to be correct, just plausible enough to surface associations. Such associations might prompt my own thinking and my own searches working with the same material.

It also makes me wonder whether what Wolfram Alpha is doing these days plays a role in your own use of GPT+, as they are all about interpreting questions and then giving the answer directly. There’s a difference between things that face the general public, and things that are internal or even personal tools, like yours.

Have you asked it things based more on association yet? For example, “based on the posts ingested what would be likely new interests for Peter to explore”? Can you use it to create new associations, to help you generate new ideas in line with the writing, interests and activities shown in the posts?

So my early experiments show me that as a data analysis copilot, a custom GPT is a very helpful guide… In terms of the GPT’s ability to “understand” me from my blog, though, I stand unimpressed.

Peter Rukavina

Favorited EDPB Urgent Binding Decision on processing of personal data for behavioural advertising by Meta by EDPB

This is very good news. The European Data Protection Board, at the request of the Norwegian DPA, has issued a binding decision instructing the Irish DPA to ban the processing of personal data for behavioural advertising by Meta. The Irish DPA must take final measures within two weeks. Norway already concluded a few years ago that adtech is mostly illegal, but European cases based on the 2018 GDPR moved through the system at a glacial pace, in part because of a co-opted and dysfunctional Irish Data Protection Commission. Meta’s ‘pay for privacy’ ploy is also torpedoed by this decision. This is grounds for celebration, even if it will likely lead to legal challenges first. And it is grounds for congratulations to NOYB and Max Schrems, whose complaints, filed in the first minute after GDPR enforcement started in 2018, kicked off the process of which this is a result.

…take, within two weeks, final measures regarding Meta Ireland Limited (Meta IE) and to impose a ban on the processing of personal data for behavioural advertising on the legal bases of contract and legitimate interest across the entire European Economic Area (EEA).

European Data Protection Board

In 1967 French literary critic Roland Barthes declared the death of the author (in English, no less). An author’s intentions and biography are not the means to explain definitively what the meaning of a text (of fiction) is. It’s the reader that determines meaning.

Barthes reduces the author to merely a scriptor, a scribe, who doesn’t exist other than for their role of penning the text. It positions the work as fully separate from its maker.

I don’t disagree with the notion that readers glean meaning in layers from a text, far beyond what an author might have intended. But thinking about the author’s intent, in light of their biography or not, is one of those layers for readers to interpret. It doesn’t make the author the sole decider on meaning, but the author’s perspective can be used to create meaning by any reader. Separating the author from their work entirely is cutting yourself off from one source of potential meaning. Even when the author is reduced to the role of scribe, such meaning will leak forth: the monks of old tagged the transcripts they made and turned those tags into indexes, which are a common way of interpreting which topics a text touches on or emphasises. So despite Barthes’ pronouncement, I never accepted the brain death of the author, yet I also didn’t much care specifically about their existence in order to find meaning in texts.

With the advent of texts made by generative AI, however, I think bringing the author and their intentions into scope for creating meaning is necessary. It is a necessity as proof of human creation. Being able to perceive the author behind a text, the entanglement of its creation with their life, is the now very much needed Reverse Turing test. With algorithmic text generation there is indeed only a scriptor, one incapable of conveying meaning themselves.
To determine the human origin of a text, the author’s own meaning, intention and existence must shine through in the text, or be made explicit in its context. Because our default assumption must be that it was generated.

The author is being resurrected. Because we now have fully automated scriptors. Long live the author!

In discussions about data usage and sharing and who has a measure of control over what data gets used and shared how, we easily say ‘my data’ or get told about what you can do with ‘your data’ in a platform.

‘My data’.

While it sounds clear enough, I think it is a very imprecise thing to say. It distracts from a range of issues about control over data, and causes confusion in public discourse and in addressing those issues. Such distraction is often deliberate.

Which one of these is ‘my data’?

  • Data that I purposefully collected (e.g. temperature readings from my garden), but isn’t about me.
  • Data that I purposefully collected (e.g. daily scale readings, quantified self), that is about me.
  • Data that is present on a device I own or external storage service, that isn’t about me but about my work, my learning, my chores, people I know.
  • Data that describes me, but was government created and always rests in government databases (e.g. birth/marriage registry, diplomas, university grades, criminal records, real estate ownership), parts of which I often reproduce/share in other contexts while not being the authoritative source (anniversaries, home address, CV).
  • Data that describes me, but was private sector created and always rests in private sector databases (e.g. credit ratings, mortgage history, insurance and coverage used, pension, phone location and usage, hotel stays, flights boarded).
  • Data that describes me, that I entered into my profiles on online platforms.
  • Data that I created, ‘user generated content’, and shared through platforms.
  • Data that I caused to be through my behaviour, collected by devices or platforms I use (clicks through sites, time spent on a page, how I drive my car, my e-reading habits, any IoT device I used/interacted with, my social graphs), none of which is ever within my span of control, likely not accessible to me, and I may not even be aware it exists.
  • Data that was inferred about me from patterns in data that I caused to be through my behaviour, none of which is ever within my span of control, and which I mostly don’t know about or even suspect exists. It may say things I don’t know about myself (moods, mental health) or that I may not have made explicit anywhere (political or religious orientation, sexual orientation, medical conditions, pregnancy etc.).

Most of the data that holds details about me wasn’t created by me, and wasn’t within my span of control at any time.
Most of the data I purposefully created or have or had in my span of control, isn’t about me but about my environment, about other people near me, things external and of interest to me.

They’re all ‘my data’. Yet, whenever someone says ‘my data’, and definitely when someone says ‘your data’, that entire scope isn’t what is indicated. ‘My data’ as a label easily hides the complicated variety of data we are talking about. And regularly, specifically when someone says ‘your data’, hiding parts of the list is deliberate.
The last two bullets, the data that we created through our behaviour and what is inferred about us, are what the big social media platforms always keep out of sight when they say ‘your data’. Because that’s the data their business models run on. It’s never part of the package when you click ‘export my data’ in a platform.

The core issues aren’t about whether it is ‘my data’ in terms of control or provenance. The core issues are about what others can or cannot, will or won’t do with any data that describes me or is circumstantial to me. Regardless of whose span of control such data resides in, or where it came from.

There are also two problematic suggestions packed into the phrase ‘my data’.
One is that by saying ‘my data’ you are also made individually responsible for the data involved. While this is partly true (mostly in the sense of not carelessly leaving stuff all over webforms and accounts), almost all responsibility for the data about you resides with those using it. It’s others’ actions with data concerning you that require responsibility and accountability, and should require your voice being taken into account. "Nothing about us, without us" holds true for data too.
The other is that ‘my data’ is easily interpreted and positioned as ownership. That is a sleight of hand. Property claims and citizen rights are very different things and different areas of law. If ‘your data’ is your property, all that is left is to haggle about price, and each context is framed as merely transactional. It’s not in my own interest to see my data or myself as a commodity. It’s not a level playing field when I’m left to negotiate my price with a global online platform. That’s so asymmetric that there’s only one possible outcome. Which is the point of the suggestion of ownership as opposed to the framing as human rights. Contracts are the preferred tool of the biggest party, rights that of the individual.

Saying ‘my data’ and ‘your data’ is too imprecise. Be precise, don’t let others determine the framing.