(Rita Matulionite, Macquarie University)
Sydney, Nov 9 (The Conversation) As artificial intelligence (AI) reaches the peak of its popularity, researchers warn that the industry may be running out of training data – the fuel that runs powerful AI systems. This may slow down the development of AI models, especially large language models, and may even change the trajectory of the AI revolution.
But given the amount of data on the web, why is a potential data shortage an issue? And is there any way to deal with the risk?
Why is high quality data important for AI?
To train powerful, accurate, and high-quality AI algorithms, we need a lot of data. For example, ChatGPT was trained on 570 gigabytes of text data, or approximately 300 billion words.
Similarly, the Stable Diffusion model (which is behind many AI image-generating apps such as DALL-E, Lensa, and Midjourney) was trained on the LAION-5B dataset containing 5.8 billion image-text pairs. If an algorithm is trained on an insufficient amount of data, it will produce inaccurate or low-quality outputs.
The quality of training data also matters. Low-quality data such as social media posts or blurry photos is easy to obtain, but insufficient for training high-performance AI models.
Text taken from social media platforms may be biased, misleading, or contain disinformation or illegal content that may be replicated by the model. For example, when Microsoft tried to train its AI bot using Twitter content, it learned to generate racist and misogynistic output.
That’s why AI developers look for high-quality content like text from books, online articles, scientific papers, Wikipedia, and some filtered web content. Google Assistant was trained on 11,000 romance novels taken from self-publishing site Smashwords to make it more conversational.
Do we have enough data?
The AI industry has been training AI systems on ever-larger datasets, which is why we now have high-performing models such as ChatGPT and DALL-E 3. However, research shows that online data stocks are growing much more slowly than the datasets used to train AI.
In a paper published last year, a group of researchers predicted that we would run out of high-quality text data before 2026 if current AI training trends continued. They also estimated that low-quality language data would be exhausted between 2030 and 2050, and low-quality image data would be exhausted between 2030 and 2060.
According to accounting and consulting group PwC, AI could contribute up to US$15.7 trillion to the world economy by 2030. But running out of usable data could slow down its growth.
Should we be worried?
Although the above points may worry some AI fans, the situation may not be as bad as it seems. Much is still unknown about how AI models will develop in the future, and there are several ways to address the risk of data shortages.
One opportunity is for AI developers to improve their algorithms so they can use the data they already have more efficiently.
It is likely that in the coming years they will be able to train high-performance AI systems using less data, and possibly less computational power. This would also help reduce AI's carbon footprint.
Another option is to use AI to create synthetic data to train systems. In other words, developers can generate the curated data they need, tailored to their particular AI model.
Many projects are already using synthetic content, often derived from data-generating services like Mostly AI. This will become more common in the future.
Developers are also exploring content outside the free online space, such as content held by large publishers and offline repositories. Think of the millions of texts published before the Internet. When made available digitally, they can provide a new source of data for AI projects.
News Corp, one of the world's largest news content owners (most of whose content is behind paywalls), recently said it was negotiating content deals with AI developers. Such deals would force AI companies to pay for training data, whereas until now they have mostly taken it from the internet for free.
Content creators have sued companies such as Microsoft, OpenAI, and Stability AI over the unauthorized use of their content to train AI models. Being compensated for their work could help restore some of the power imbalance that exists between creatives and AI companies.