Bookmarked Large Language Models are not Models of Natural Language: they are Corpus Models. (PDF) by Csaba Veres (2022)

I think it is a bit of a ‘well-duh’ thing, but still worth underlining in general conversation. The name Large Language Model is something of a misnomer, as an LLM does not contain a model of how (a) language (theoretically) works. It doesn’t, for example, generate texts by following grammar rules. By extension the same holds for code: LLMs can generate code from natural language prompts because they have been trained on software code, not because they have the theoretical underpinnings of programming languages. Veres suggests using the term Large Corpus Model instead. I think getting people to write LCM and not LLM will be impossible. I can, however, highlight the difference for myself by reading ‘Large Language usage Model’ every time I see LLM, as the corpus is one of language(s) in actual use.

We argue that the term language model is misleading because deep learning models are not theoretical models of language and propose the adoption of corpus model instead, which better reflects the genesis and contents of the model.

Csaba Veres, 2022
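
To make the distinction concrete for myself, here is a toy sketch of a model built purely from a corpus. It is a tiny bigram counter, which is an assumption of mine for illustration and nothing like the architecture of an actual LLM or anything taken from Veres' paper. The names `train_bigram_model`, `generate` and `toy_corpus` are made up for this example. The point is only that the output looks sentence-like while the code contains no grammar rules at all, just counts of usage.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count which word follows which in the corpus. No grammar rules involved."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def generate(follows, start, max_words=8):
    """Extend the text by repeatedly picking a most frequent continuation."""
    words = [start]
    while len(words) < max_words:
        options = follows.get(words[-1])
        if not options:
            break  # this word was never followed by anything in the corpus
        words.append(options.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical toy corpus: language in actual use, at a very small scale.
toy_corpus = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "the dog sat on the rug",
]

model = train_bigram_model(toy_corpus)
print(generate(model, "the"))
```

Everything the sketch ‘knows’ comes from the three sentences it was given. Swap the corpus and the output changes, while the code stays the same, which is roughly what I take the term corpus model to express.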