In practice

Memorising or summarising?

10 January 2024

The New York Times’ lawsuit against OpenAI may disrupt the disruptors.

Since the inception of large language models (LLMs) such as the ChatGPT models, Google’s Bard and Meta’s LLaMA, there has been controversy around the training of these models.

In short, the owners of copyrighted content that can be found on the internet regularly raise concerns that these models have been trained on copyright-protected content.

While a number of lawsuits are in train, and many more have no doubt been settled, The New York Times’ suit against OpenAI is different. That is partly because The New York Times is a well-resourced litigant not to be lightly dismissed, but mostly because of the case it intends to run.

The New York Times suit¹, among other things, alleges that OpenAI’s GPT-4 model was trained on copyright-protected material, and subsequently produced answers which were either exactly the same as the copyrighted material, or so close that it makes no difference. In other words, the LLM memorised the material on which it was trained and spat it out verbatim, rather than summarising it.

In support of this, The New York Times filed an exhibit which put forward examples of responses to prompts which were basically identical to material they owned.
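The difference between memorising and summarising can be pictured as a simple text comparison: how much of a model's answer repeats the source word for word? The sketch below is illustrative only. It is not the method used in the suit and not a legal test for infringement; the function names, the five-word window and the sample sentences are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def tokenize(text: str) -> list[str]:
    # Lowercase and split on whitespace: a deliberately crude tokenizer.
    return text.lower().split()


def longest_shared_run(source: str, output: str) -> int:
    """Length, in words, of the longest contiguous run found in both texts."""
    a, b = tokenize(source), tokenize(output)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def ngram_overlap(source: str, output: str, n: int = 5) -> float:
    """Fraction of the output's n-word sequences that also occur in the source."""
    a, b = tokenize(source), tokenize(output)
    source_ngrams = {tuple(a[i:i + n]) for i in range(len(a) - n + 1)}
    output_ngrams = [tuple(b[i:i + n]) for i in range(len(b) - n + 1)]
    if not output_ngrams:
        return 0.0
    hits = sum(1 for gram in output_ngrams if gram in source_ngrams)
    return hits / len(output_ngrams)


# Hypothetical sample texts, invented for illustration.
source = "the quick brown fox jumps over the lazy dog near the river bank"
verbatim = "the quick brown fox jumps over the lazy dog"
summary = "a fox leaps over a dog by the river"

print(longest_shared_run(source, verbatim), ngram_overlap(source, verbatim))
print(longest_shared_run(source, summary), ngram_overlap(source, summary))
```

On these sample texts, the verbatim answer shares a nine-word run with the source and every one of its five-word sequences appears there, while the summary shares at most two consecutive words and none of its five-word sequences. Real comparisons would need better tokenisation and far longer passages, but the contrast is the one the exhibit is said to draw.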

This matters because previous suits have largely conceded that the LLMs have summarised, rather than memorised, copyrighted material. The New York Times alleges the software didn’t transform or alter the material on which it was trained; rather, it just coughed up that same material.

OpenAI counters by arguing that the problem lies in how the questions were asked – that is, when the AI is questioned in a certain way, its answers will indeed replicate the material on which it was trained because that is basically what it was asked to do. Thus, the more specific the question, the more likely the answer is to resemble training material, since the ways in which the AI can answer have been limited.

Why does this matter? Because if software like ChatGPT simply summarises the material on which it was trained, and produces an answer which is the product of its internal algorithms, it isn’t just copying content. However, if all it does is memorise the material on which it was trained and pop out a copy in response to a question, that might well be violating copyright.

The chances are good that this suit, like many others, will result in a settlement whereby the content owners are compensated for the use of their material. If it doesn’t settle, however, a court might finally grapple with the question of just how far the tech companies can go when training their software. That will have significant implications for AI in general, as well as legal tools based on it; prudent practitioners will keep an eye on these proceedings.

Footnotes
¹ As reported by Isha Marathe in ALM Law.com, 8 January 2024.