Is the biggest threat to ChatGPT - ChatGPT?

Author: Leif Pettersson, ArkivIT.

In the past six months, the use of generative AI tools like ChatGPT has grown dramatically. What these tools have in common is that they are built on large language models (LLMs). In practice, this means they are trained on vast amounts of text, gathered primarily from the internet along with other textual sources, which they then use to answer our questions. GPT-3, for example, uses a model with 175 billion parameters. There is now a considerable number of similar tools, all of which retrieve data from the same sources. It is highly likely that, if it has not already begun, we will start publishing texts generated by generative AI.

At the end of May this year, a group of researchers published a paper titled “The Curse of Recursion: Training on Generated Data Makes Models Forget,” which raises some concerns. It turns out that AI-generated texts can quickly corrupt the whole premise of generative AI: the quality of the answers these tools produce deteriorates rapidly when they are trained on earlier answers from the same kinds of models. The researchers call this phenomenon “model collapse.” One of the paper’s authors, Ilia Shumailov, explains the phenomenon in a way that a layperson (like myself) can understand.

The crux of the problem is that generative AI apparently struggles to handle unusual or rare data. It prioritizes “common” data and misunderstands or misrepresents less common, less popular data. Shumailov offers a highly simplified, hypothetical example:

Let’s say we have a dataset of 100 pictures of cats: 10 have blue fur, while the remaining 90 have yellow fur. The model learns that yellow cats are far more common than blue ones and represents the blue cats as more yellow than they actually are, so the generated blue cats come out somewhat greenish, and those images are published. In the next iteration, that published output is collected as new training data, and so on: over time, the blue cats are represented as more and more yellow. Eventually the minority data is lost entirely, and this is what the researchers call a collapse of the model. Another way to understand the phenomenon is to liken it to a JPEG image: if it is repeatedly re-saved, compression artifacts accumulate and the image grows increasingly blurred and pixelated.
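The feedback loop in the cat example can be sketched in a few lines of Python. This is a deliberately crude toy model, not the paper's method: it simply assumes that each retraining generation under-represents the minority class by a fixed factor (the 0.8 bias below is an invented illustrative number) and shows how the minority's share of the data shrinks toward zero.

```python
# Toy simulation of "model collapse": each generation, a model is
# retrained on the previous generation's output, and the rare
# ("blue cat") data is slightly under-represented every time.
# The bias factor 0.8 is an illustrative assumption, not a value
# taken from the paper.

def retrain(p_minority, bias=0.8):
    """One generation: probability mass shifts toward the majority class."""
    p = p_minority * bias                  # minority is under-learned
    return p / (p + (1 - p_minority))      # renormalize so the shares sum to 1

p_blue = 0.10                              # 10 of 100 cats are blue
history = [p_blue]
for generation in range(10):
    p_blue = retrain(p_blue)
    history.append(p_blue)

print([round(p, 4) for p in history])      # minority share shrinks each generation
```

Each pass through `retrain` plays the role of one publish-and-recollect cycle: after ten generations the blue cats make up only about 1% of the data, and the trend only continues downward.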

Preventing this from happening over time appears very challenging. The tools that have already collected their training data will have a clear advantage over future ones. Even today’s tools will be affected, however, because they need to refresh their data to capture information that has emerged since the initial collection. It is important to understand that the models behind generative AI are not “live” views of their sources but snapshots stored in isolation from them.

There are already signs that more data is being retrieved from the Internet Archive. Somehow, we need to start protecting information created by humans from being contaminated by information created by machines. One way is to archive valuable information so that it remains protected and useful for future AI tools. Hopefully, research will find a solution; until then, we can only hope the problem is taken seriously and that we refrain from publishing AI-generated information on the web.

More reading: