"Stealing" data, the dark side of AI big models
A start-up called "One Stroke Two Strokes" has publicly accused Xueersi, the former leader in education and training, of "stealing" its painstakingly accumulated data by "scraping the library".
The story began in mid-April this year, when Penshen Composition (a product of One Stroke Two Strokes; the name is also rendered "Bishen Composition" or "Pen God Composition") found a large number of regular, abnormal requests to its server interface, which rapidly increased the load on the server.
The number of requests far exceeded the daily average. Penshen Composition told Deep AI that daily visits normally number from a few hundred to a few thousand, but in those few days they rose to more than 500,000 per day. Within a week, its data was crawled 2.58 million times.
By consulting the server logs, Penshen Composition found that a single IP had been crawling its database at high density using "crawler" technology. Every request from this IP used a composition-related search term, and the system returned 30 compositions per page. For each search term, the crawler started from the first page and paged through the results one by one, capturing essentially every composition on that topic in the library.
According to industry insiders, ordinary users would not behave this way under normal circumstances. **This kind of search-driven traversal of a database is also known as "scraping the library".**
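The access pattern described above (one IP, many search terms, each paged through from page 1 onward) can be spotted in request logs. The following is a minimal sketch of one way to flag it, not the method Penshen Composition actually used. The log format, the IP addresses, and the thresholds `min_pages` and `min_keywords` are all assumptions made for illustration; only the page size of 30 comes from the article.

```python
from collections import defaultdict

PAGE_SIZE = 30  # compositions returned per page, as reported in the article


def find_sequential_crawlers(log_entries, min_pages=20, min_keywords=3):
    """Return IPs that sweep search results page by page for many keywords.

    log_entries: iterable of (ip, keyword, page) tuples in arrival order.
    min_pages and min_keywords are illustrative thresholds, not values
    taken from the article.
    """
    # ip -> keyword -> page numbers requested, in arrival order
    pages_seen = defaultdict(lambda: defaultdict(list))
    for ip, keyword, page in log_entries:
        pages_seen[ip][keyword].append(page)

    suspects = {}
    for ip, by_keyword in pages_seen.items():
        # A sweep starts at page 1 and visits consecutive pages without gaps.
        swept = [
            kw for kw, pages in by_keyword.items()
            if len(pages) >= min_pages
            and pages == list(range(1, len(pages) + 1))
        ]
        if len(swept) >= min_keywords:
            total_pages = sum(len(by_keyword[kw]) for kw in swept)
            suspects[ip] = {
                "keywords": swept,
                # Upper bound: every page returns at most PAGE_SIZE records.
                "max_records_returned": total_pages * PAGE_SIZE,
            }
    return suspects


# Hypothetical sample data: one crawler-like IP and one ordinary user.
crawler_log = [
    ("10.0.0.9", kw, page)
    for kw in ("spring", "family", "courage")
    for page in range(1, 26)
]
normal_log = [("10.0.0.5", "spring", 1), ("10.0.0.5", "spring", 2)]

print(find_sequential_crawlers(crawler_log + normal_log))
```

An ordinary user who looks at the first page or two of a single search never meets either threshold, while an IP that walks 25 consecutive pages for each of three search terms is flagged.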
Penshen Composition believes that the party behind the "library scraping" is its partner Xueersi.
Not long after the "library scraping" incident, Penshen Composition learned that Xueersi was developing a large math model, MathGPT, and had said it would soon launch an "AI assistant", one of whose functions is composition.
Whether there is any connection between the scraping of Penshen Composition and Xueersi's development of the "composition AI assistant" has not been established.
But Penshen Composition believes that its rights have been violated. It sent a lawyer's letter to Xueersi and made the matter public in an attempt to get an explanation. Xueersi responded publicly, saying that its use of Penshen Composition's material complied with the contract, and that its self-developed MathGPT model and "composition AI assistant" did not use any data from Penshen Composition.
In this incident, it is not only the composition material that is worth discussing. What does data mean for large models?
Both sides insist on their own opinions
Let's first briefly introduce Penshen Composition.
The company was established in 2017. Its product "Penshen" is AI-assisted writing software, which can be regarded as an AI+education product. At first, "Penshen" targeted content creation platforms and related tool makers; later it moved into a vertical field, using AI to teach students to write essays, which is how "Penshen Composition" came about.
Put simply: it is in the education industry, it targets students, it uses artificial intelligence, and it addresses the essay-writing use case.
AI writing has a lot in common with ChatGPT, which is so popular today. Both involve technologies such as natural language processing, semantic analysis and prediction, and machine learning. Song Jiawei, the founder of Penshen Composition, previously served as a senior system architect at Sony and as CTO of Singulato.
As early as five years ago, Song Jiawei said, he was considering how to apply pre-trained language models such as BERT or GPT-2 in products. At that time, GPT had not yet broken into the mainstream and was not as well known as it is today.
Once it started working on AI composition, Penshen Composition formally entered the education track, wading into the same river as Xueersi, the leader in education and training.
According to Penshen Composition, it reached a cooperation agreement with Xueersi in December 2020. **Penshen Composition provides Xueersi with a "Penshen Composition Model Essay Material Service Interface" for use in Xueersi's related services, with fees settled according to the number of calls. To that end, Penshen Composition opened a service interface to Xueersi.**
In other words, Xueersi can use the composition materials in the Penshen Composition database and pay for them.
Composition materials are the core asset in this transaction and the cornerstone of Penshen Composition's business model. In fact, Penshen Composition got its start with materials: early on, it featured a "one-click material search" function. Users searched by keyword, and the system automatically matched materials, with resources ranging from classical poetry and official documents to modern web articles. During the writing process, the system could also push materials in real time.
These materials do not come from the Internet but from Penshen Composition's own database. Through AI-based identification, translation, and matching, Penshen Composition returns suitable materials in response to users' searches.
When the amount of these composition materials is large enough, the quality is high enough, and the matching is accurate enough, it will have a certain commercial value and can even be sold externally. This is the reason for the cooperation with Xueersi.
The problem is that these materials risk being "stolen", especially if some interfaces are opened.
Penshen Composition told Deep AI that it had limited the scope of its cooperation with Xueersi: "We open the interface to let them call our data and display it in their own app, but the contract does not include permission to store the data or to use it for AI algorithms. The data should only be shown to their users, not stored on their machines."
In other words, **when a user initiates a search in a Xueersi product, the composition content retrieved comes from Penshen Composition, and Xueersi may not store it itself.**
The abnormal calls in mid-April led Penshen Composition to conclude that they went beyond the scope of normal business cooperation. "Their actions triggered our defense mechanisms, which led us to discover this."
Penshen Composition said that it checked the back-end access logs and found that the illegal access was initiated by a single IP using "crawler" technology. "We already have this IP address."
Liu Ran, the CEO of a domestic artificial intelligence start-up, told Deep AI that exhaustively enumerating keywords in this way can only be meant to obtain the data in the library. The behavior is very obvious.
Penshen Composition told Deep AI that after the incident it checked with Xueersi's operations staff, who directly admitted that Xueersi's algorithm team was crawling the data for its own use. However, Deep AI has not yet been able to confirm this statement with Xueersi.
Its former partner had suddenly turned into a barbarian at the gate, which made Penshen Composition very angry, and it sent lawyer's letters several times.
In its public response on June 13, Xueersi said that its calls to the Penshen Composition interface did not exceed the scope of the contract between the two parties, that its use of Penshen Composition's material complied with the contract, and that the material was not used for any purpose outside the contract. Xueersi specifically emphasized that its self-developed MathGPT large model and "composition AI assistant" did not use any data from Penshen Composition.
Each side sticks to its own account, and there is no conclusion yet. According to Penshen Composition's article, this case may become "the first case of AI large model data being stolen".
A question worth exploring: what does data mean for large models?
Computing power, algorithms, and data are the three core elements of machine-learning-based artificial intelligence.
To gain computing power, many technology companies are spending heavily to snap up Nvidia GPUs. On the algorithm side, some major companies at home and abroad have open-sourced their algorithms, which greatly lowers the threshold for model development.
On the data side, barriers have always existed. Where to find high-quality data is a key issue.
Large generative AI models need to use a large amount of diverse data for training to improve the generalization and generation capabilities of the model. Different models may use different data sources. General large models such as ChatGPT use a lot of public data, such as various news websites, books, scientific papers, web pages, etc. For large models in some vertical fields, it is necessary to find targeted corpora and data sets.
The person in charge of large models at a leading Chinese technology company told Deep AI that ChatGPT actually uses a lot of non-public data. Much of the public data on the Internet is of very poor quality, and high-quality data is hard to come by. Data acquisition and cleaning face great challenges.
TAL CTO Tian Mi said publicly on May 4, "Many fields have data barriers and industry know-how. Large models still need to be deeply integrated with domain knowledge, plus enough domain data, to train a domain expert model."
As Tian Mi said, a domain large model should be deeply integrated with domain knowledge. In the field of AI composition, composition materials are important training data.
As early as 2019, Penshen Composition began collecting data purposefully to build its own composition corpus, covering famous quotes, poems, official documents, and online language. It trains models to imitate manual labeling and uses them to tag each corpus entry.
In a vertical corpus, only tagged data makes it possible to push accurate content based on vector matching and on semantic analysis and prediction of what the user is currently writing.
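To make the role of tags in vector matching concrete, here is a minimal sketch that first filters a corpus by tag and then ranks the remaining entries by cosine similarity to the user's draft. It is an illustration only, not Penshen Composition's system: the bag-of-words vectors stand in for whatever embedding a real product would use, and the sample corpus, tag names, and function names are all hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Toy text vector: lowercase word counts (a stand-in for a real embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(draft, wanted_tag, corpus, top_k=2):
    """Rank corpus entries that carry wanted_tag by similarity to the draft."""
    query = bag_of_words(draft)
    # The tag narrows the candidate set before any similarity is computed.
    candidates = [entry for entry in corpus if wanted_tag in entry["tags"]]
    ranked = sorted(
        candidates,
        key=lambda entry: cosine(query, bag_of_words(entry["text"])),
        reverse=True,
    )
    return [entry["text"] for entry in ranked[:top_k]]


# Hypothetical tagged corpus entries.
corpus = [
    {"text": "the moon rises over the quiet river", "tags": {"poem", "nature"}},
    {"text": "perseverance turns failure into success", "tags": {"quote", "perseverance"}},
    {"text": "success belongs to those who keep trying after failure",
     "tags": {"quote", "perseverance"}},
]

print(recommend("my essay is about failure and success", "perseverance", corpus, top_k=1))
```

Without the tags, every entry would have to compete on word overlap alone; with them, the poem is excluded before ranking, which is the sense in which labeled data makes the push more accurate.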
Liu Ran told Deep AI that building a model requires a lot of verified data, and that data which has already been sorted saves a great deal of human work. The compositions organized by Penshen Composition could be used as labeled data.
This process is continuous and lengthy. Penshen Composition said that in the six years since its founding it has accumulated more than 5 million composition materials in total, with a monthly correction volume of more than 30,000. These materials were accumulated through manual review, screening and submission, labeling, grading, and data correction.
The data can be presented as materials on the app page and can also be used to train algorithms in the background. Therefore, when it opens interfaces to other companies, Penshen Composition adds a special clause to the agreement prohibiting "caching, storage, calculation and training as corpus".
Penshen Composition believes that Xueersi has "stolen" the data, and speculates that Xueersi is using it to train and develop the MathGPT large math model and the "Composition AI Assistant" on the Xueersi learning machine. But that seems hard to prove.
Liu Ran believes that composition data should normally have some restrictions set in advance, such as refusing high-concurrency access and encrypting the data, and that it should be possible to track where the data goes and how it is used. However, he also believes that composition data is not as critical as key user behavior data.
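One of the restrictions mentioned above, refusing high-volume access, is commonly implemented as a per-client rate limit. The following is a minimal sliding-window sketch under assumed parameters; the class name, the client identifier, and the limit of 3 calls per 60 seconds are hypothetical and do not describe any safeguard either company actually deployed.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_calls per client within any window_seconds span."""

    def __init__(self, max_calls, window_seconds):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        # client_id -> timestamps of accepted calls, oldest first
        self.calls = defaultdict(deque)

    def allow(self, client_id, now):
        """Return True and record the call if the client is under its limit."""
        recent = self.calls[client_id]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_calls:
            return False
        recent.append(now)
        return True


# Timestamps are passed in explicitly (in seconds) to keep the example deterministic.
limiter = SlidingWindowLimiter(max_calls=3, window_seconds=60)
results = [limiter.allow("partner-a", t) for t in (0, 1, 2, 3, 61)]
print(results)
```

In this run the fourth call is rejected because three calls already fall inside the window, and the call at second 61 is accepted again once the earliest timestamps have expired. A limit like this slows bulk paging but does not by itself reveal where the data ends up, which is why tracking is listed as a separate measure.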
"You can let AI learn what is a good composition, and then let it generate according to these standards. But I don't think that much data is actually needed. Tens of thousands of high-quality compositions should be enough." He said.
**Can the claim hold up?**
Penshen Composition has taken a tough stance, issuing two announcements in succession that demand an apology from Xueersi and claim compensation of 1 yuan. It even wants to label the incident "the first case of AI large model data theft".
Liu Honglin, a lawyer and director of Shanghai Mankiw Law Firm, told Deep AI that Penshen Composition's self-built corpus or material library is itself protected by intellectual property rights. However, whether it counts as a work under the Copyright Law depends on whether its originality meets the relevant criteria.
"If Penshen Composition has enough evidence to prove that Xueersi maliciously scraped its data, it can bring an intellectual property infringement or unfair competition lawsuit," he said.
In addition, Penshen Composition has a cooperation agreement with Xueersi. If the agreement covers respect for and licensing of intellectual property rights, Penshen Composition can also protect its rights and interests by claiming breach of contract.
It is worth noting that many of the compositions in the Penshen Composition material library are submitted by users. Penshen Composition claims that it receives 300,000 essay submissions every month. Therefore, before determining whether there has been an infringement, it is necessary to clarify who holds the intellectual property rights to these materials.
According to Liu Honglin's analysis, that depends on what the essay's creator (contributor) and Penshen Composition have agreed regarding intellectual property rights. If the user licenses the intellectual property rights to Penshen Composition at the time of submission, then Penshen Composition enjoys the corresponding rights and interests.
Deep AI consulted Penshen Composition's user service agreement and found this clause: for content that users publish on Penshen Composition (including but not limited to comments and notes), the user grants Penshen Composition a free, irrevocable, non-exclusive license.
What Liu Ran could not understand was why Penshen Composition cooperated with Xueersi in the first place. "If it were me, I would definitely not cooperate with Xueersi, because we are in a strong competitive relationship." He believes that "in the era of large models, a company that only provides a composition database has no chance."
According to industry insiders, Xueersi has traffic, use cases, and popularity, and it has a greater advantage than Penshen Composition in user-facing front-end products in particular. By contrast, the back-end work of collecting data and building a material library is time-consuming and laborious, and it is hard to see results in the short term. For Xueersi, connecting directly to a ready-made material library is the most convenient option. Penshen Composition, for its part, monetized its work by selling access to the material library.
But for a start-up like Penshen Composition, such cooperation is a rose with thorns, because Chinese giants may enter your territory at any time and even compete with you directly at the business level.
AI composition correction is a very important feature of Penshen Composition. As early as three years ago, TAL (the parent company of Xueersi) also launched a "Chinese and English Composition Correction Solution", which uses AI to correct Chinese and English compositions.
Today, AI composition correction is just the tip of the iceberg of TAL's huge AI product matrix. In its latest product introduction, Chinese composition correction is one module within Chinese and English dictation correction. TAL has greater ambitions, and its tentacles already extend to every aspect of AI+education.
For a company like Penshen Composition, where its competitive barriers lie and how to confront the giants are very real problems. The accelerating "involution" of the artificial intelligence industry and intensifying homogeneous competition will escalate the confrontation between start-ups and giants.
Grabbing data may just be the tip of the iceberg in a new round of competition.