Is AI about to hit a wall? Data for training large models may be exhausted by 2026

Source: "Tencent Technology", Author: Jinlu

Key takeaways:

  1. The latest boom in generative artificial intelligence rests on very large models, and those models must be trained on massive amounts of data, so data is becoming increasingly precious.
  2. Researchers believe demand for data will grow so dramatically that the high-quality text available for training large models may be exhausted by 2026. A data scramble is kicking off.
  3. Numerous copyright infringement cases have been brought against model builders in the United States; OpenAI, Stability AI, Midjourney, and Meta have all been named as defendants.
  4. Artificial intelligence companies are exploring new data sources, including signing data licensing agreements with other companies, collecting data through user interactions with their tools, and tapping the internal data of corporate customers.

Not so long ago, analysts were openly speculating whether artificial intelligence (AI) would lead to the downfall of Adobe, a developer of software for creatives. New tools like Dall-E 2 and Midjourney, which generate images from text prompts, seemed to make Adobe's image-editing capabilities redundant. As recently as April of this year, the financial news website Seeking Alpha published an article entitled "Will Artificial Intelligence Be an Adobe Killer?"

But reality has turned out very differently from what the analysts assumed. Adobe used its database of hundreds of millions of stock photos to build its own suite of artificial intelligence tools, called Firefly. Firefly has been used to create more than 1 billion images since its launch in March, said company executive Dana Rao. By avoiding mining the internet for images like its competitors, Adobe has sidestepped the deepening copyright disputes currently plaguing the industry. Adobe's stock has risen 36 percent since Firefly launched.

A data scramble is kicking off

Adobe's victory over the doomsayers underscores the broader stakes of the race for dominance in the fast-growing market for artificial intelligence tools. The very large models powering the latest wave of so-called "generative artificial intelligence" rely on vast amounts of data. Previously, model builders mostly scraped data (often without permission) from the internet. Now they are finding new sources of data to sustain this frenzied training regime. At the same time, companies sitting on vast amounts of fresh data are weighing how best to profit from it. A data scramble is kicking off.

The two basic inputs to an artificial intelligence model are data sets and processing power. The system is trained on data sets, and processing power is used to detect patterns within and across those data sets. To some extent, the two are interchangeable: a model can be improved either by ingesting more data or by adding more processing power. The latter, however, is becoming increasingly difficult amid a shortage of specialized AI chips, leading model builders to double down on finding data.

Research firm Epoch AI believes that the demand for data will increase so dramatically that the high-quality text available for training may be exhausted by 2026. The latest artificial intelligence models from the two technology giants Google and Meta are reported to have been trained on more than 1 trillion words. By comparison, the total number of English words on the online encyclopedia Wikipedia is about 4 billion.
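
To make the scale gap concrete, here is a rough back-of-envelope calculation. It assumes the widely cited "Chinchilla" heuristic of roughly 20 training tokens per model parameter and reuses the word counts quoted above; the 70-billion-parameter model size is an illustrative assumption, not a claim about any particular system.

```python
# Back-of-envelope sketch: how much text large models consume, and how that
# compares with English Wikipedia. All figures are illustrative assumptions.

TOKENS_PER_PARAM = 20          # rough "Chinchilla" compute-optimal heuristic
model_params = 70e9            # hypothetical 70-billion-parameter model

tokens_needed = model_params * TOKENS_PER_PARAM
print(f"Training tokens suggested by the heuristic: {tokens_needed:.2e}")  # ~1.4e12

reported_training_words = 1e12   # "more than 1 trillion words" cited for recent giant models
wikipedia_english_words = 4e9    # ~4 billion English words on Wikipedia

ratio = reported_training_words / wikipedia_english_words
print(f"Reported training corpus vs. English Wikipedia: about {ratio:.0f}x larger")  # ~250x
```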

It's not just the size of the dataset that matters. The better the data, the better the models trained on it will perform. Russell Kaplan of the data startup Scale AI points out that text-based models are ideally trained on long, well-written, factually accurate works. Models fed this information are more likely to produce similarly high-quality outputs.

Likewise, AI chatbots give better answers when asked to explain their work step by step, increasing the need for resources such as textbooks. Specialized data sets are also becoming more valuable, because they allow models to be "fine-tuned" for more niche applications. Microsoft, which acquired the software code repository GitHub in 2018 for $7.5 billion, has used it to develop an artificial intelligence tool for writing code.
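
As a sketch of what "fine-tuning on a specialized data set" can look like in practice, the snippet below converts domain question-and-answer pairs into the JSON-lines chat format commonly used by fine-tuning APIs (OpenAI's, for example, accepts records of this shape). The example records, system prompt, and file name are hypothetical.

```python
import json

# Hypothetical domain-specific Q&A pairs, e.g. distilled from product manuals.
examples = [
    {"question": "How do I reset my router?",
     "answer": "Hold the reset button for ten seconds, then wait for the lights to stabilise."},
    {"question": "What does error E42 mean?",
     "answer": "E42 indicates a failed firmware update; reinstall the latest firmware."},
]

# Write one JSON object per line in the chat format used by common fine-tuning APIs.
with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": "You are a support assistant for Acme routers."},
                {"role": "user", "content": ex["question"]},
                {"role": "assistant", "content": ex["answer"]},
            ]
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```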

Data copyright lawsuits surge as AI companies rush to sign licensing agreements

As demand for data grows, access to it is becoming increasingly contentious, and content creators are now demanding compensation for material absorbed by AI models. Numerous copyright infringement cases have already been brought against model builders in the United States. A group of writers, including comedian Sarah Silverman, is suing OpenAI, developer of the artificial intelligence chatbot ChatGPT, and Facebook's parent company Meta. A group of artists has similarly sued Stability AI and Midjourney, two companies working on text-to-image tools.

The upshot of all this is a flurry of deals as AI companies race to acquire data sources. In July, OpenAI signed a deal with The Associated Press to gain access to the agency's news archives. More recently, the company also expanded its deal with image library provider Shutterstock, with which Meta also has a deal.

Earlier in August, reports surfaced that Google was in talks with record label Universal Music to license artists' voices to help develop artificial intelligence tools for songwriting. Asset manager Fidelity said it had been approached by a number of technology companies requesting access to its financial data. AI labs are rumored to be approaching the BBC for its image and film archives. Another target of interest is JSTOR, a digital library of scholarly journals.

These information holders are leveraging their greater bargaining power. Reddit, a forum, and Stack Overflow, a question-and-answer site popular with programmers, have both raised the cost of accessing their data. Both sites are particularly valuable because users “like” answers, helping the model know which ones are the most relevant. Social media site X (formerly Twitter) has taken steps to limit the ability of bots to scrape information on the site, and now anyone who wants to access its data will have to pay. X boss Elon Musk is planning to use the data to build his own artificial intelligence business.

Faced with such obstacles, model builders are working to improve the quality of the data they already have. Many AI labs employ armies of data annotators to perform tasks such as labeling images and rating answers. Some of these tasks are complex enough to require someone with a master's degree or a PhD in the life sciences. But most of the work is mundane and is being outsourced to cheap labor in countries such as Kenya.

AI companies also collect data through user interactions with their tools. Many of these tools have some form of feedback mechanism, whereby the user indicates which outputs were useful. Firefly's text-to-image generator lets users choose from four options; Google's chatbot, Bard, likewise offers three draft answers.

Users can give ChatGPT a thumbs up when it replies to a query. This information can be fed back as input into the underlying models, forming what Douwe Kiela, co-founder of startup Contextual AI, calls a “data flywheel.” A stronger signal of the quality of a chatbot's answers is whether users copy text and paste it elsewhere, he added. Analyzing this information helps Google rapidly improve its translation tools.
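
A minimal sketch of such a "data flywheel", assuming a hypothetical chat application: user signals (a thumbs up, or copying the answer elsewhere) are logged alongside each model response, and only endorsed exchanges are kept as candidate training data. All class names, fields, and thresholds here are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Interaction:
    prompt: str
    response: str
    thumbs_up: Optional[bool] = None   # explicit feedback, if the user gave any
    copied: bool = False               # implicit signal: user copied the answer elsewhere

@dataclass
class FeedbackLog:
    interactions: List[Interaction] = field(default_factory=list)

    def record(self, interaction: Interaction) -> None:
        self.interactions.append(interaction)

    def training_candidates(self) -> List[dict]:
        """Keep exchanges the user endorsed, explicitly or implicitly."""
        return [
            {"prompt": i.prompt, "completion": i.response}
            for i in self.interactions
            if i.thumbs_up or i.copied
        ]

# Usage: log two interactions, keep only the endorsed one for later fine-tuning.
log = FeedbackLog()
log.record(Interaction("Summarise this contract.", "Here is a summary...", thumbs_up=True))
log.record(Interaction("Translate this to French.", "Voici la traduction...", thumbs_up=False))
print(log.training_candidates())
```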

Exploring new frontiers: enterprise customers' internal data becomes a hot commodity

One source of data, however, remains largely untapped: the information that sits inside tech companies' enterprise customers. Many businesses possess, often unknowingly, a wealth of useful data, from call center transcripts to customer spending records. This information is especially valuable because it can help fine-tune models for specific business purposes, such as helping call center workers answer customer questions or helping business analysts find ways to boost sales.

Taking advantage of this abundant resource is not easy, though. Roy Singh, an analyst at consultancy Bain & Company, notes that, historically, most companies have paid little attention to the huge but unstructured data sets that will prove most useful for training AI tools. This data is often spread across multiple systems and buried on company servers rather than in the cloud.

Unlocking this information will help businesses tailor AI tools to better meet their specific needs. Tech giants Amazon and Microsoft now offer tools to help other businesses manage unstructured data sets, as does Google. Christian Kleinerman of database company Snowflake says the field is booming as clients look to "break down data silos."
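
To illustrate what managing unstructured data can mean in practice, the sketch below embeds a handful of scattered internal documents as vectors and answers a query by similarity search, which is the basic operation that specialized vector databases perform at much larger scale. It assumes the open-source sentence-transformers package and a small public embedding model; the documents and query are invented.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Hypothetical unstructured records scattered across internal systems.
documents = [
    "Call-centre note: customer reported double billing on invoice #1042.",
    "Sales memo: Q3 demand for the enterprise tier grew fastest in healthcare.",
    "Support wiki: resetting two-factor authentication requires admin approval.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # small, widely used embedding model
doc_vectors = model.encode(documents)

def search(query: str, top_k: int = 1):
    """Return the top_k documents most similar to the query (cosine similarity)."""
    q = model.encode([query])[0]
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    best = np.argsort(-sims)[:top_k]
    return [(documents[i], float(sims[i])) for i in best]

print(search("Why was I charged twice?"))
```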

Startups are also flocking to the field. In April this year, Weaviate, a database company focused on artificial intelligence, raised $50 million at a valuation of $200 million. Just a week later, rival Pinecone raised $100 million at a $750 million valuation. Earlier this month, another database startup, Neon, also raised $46 million. Clearly, the scramble for data has only just begun.
