OpenAI also struggles with data! The company admits to using a web crawler, and its self-imposed limits do little to dispel public suspicion
Source: "Science and Technology Innovation Board Daily"
Editor: Song Ziqiao
Data, computing power, and algorithms are regarded as the three core elements of generative AI, and it is hard to say which matters most.
However, for star companies like OpenAI, computing power is basically an economic issue: big companies can hoard large amounts of expensive hardware by virtue of their spending power. Data scarcity is the bigger headache, and the way they obtain data keeps putting them in a moral crisis.
Take OpenAI as an example: its practice of scraping public data to train AI models has long been controversial. **According to the latest report from foreign technology media outlet Insider, OpenAI recently acknowledged that it has launched a web crawler named GPTBot, which crawls and collects data for training large models.**
OpenAI is suspected of being a "data thief"
A web crawler is a computer program that simulates the behavior of a human network user, automatically browsing the web and collecting information. A crawler can save the data it visits; whoever operates it can then analyze and reuse that data, for example to infer the preferences of Internet users and push content to matching user groups.
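The browse-and-collect loop described above can be sketched in a few lines of Python. This is only an illustrative toy, not how GPTBot or any production crawler is built: the `FAKE_WEB` dictionary, the `LinkExtractor` class, and the `crawl` function are hypothetical names, and pages are read from memory instead of being fetched over HTTP.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" (URL -> HTML); a real crawler would fetch over HTTP.
FAKE_WEB = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> some text',
    "https://example.com/b": '<a href="/">home</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, pages=FAKE_WEB):
    """Visit pages breadth-first from start_url and save each page's content."""
    seen = {start_url}
    queue = deque([start_url])
    saved = {}
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue  # page does not exist in the toy web
        saved[url] = html  # "save the data it visits"
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return saved


print(list(crawl("https://example.com/")))
# ['https://example.com/', 'https://example.com/a', 'https://example.com/b']
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.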
**It's unclear how long OpenAI's crawler bots have been lurking online, and some suspect OpenAI has been secretly collecting everyone's online data for months or years.**
Faced with such "accusations", OpenAI actively defended itself. The company stated that GPTBot will strictly respect paywalls, will not capture information that requires payment, and will not collect personally identifiable information.
In addition, OpenAI has published ways to block GPTBot: website operators can modify their robots.txt file, or block the crawler's IP addresses, to deny it access. The company also recently announced a deal with The Associated Press under which OpenAI will pay for the AP content it needs as AI training data.
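As a rough illustration of the robots.txt route mentioned above, the sketch below uses Python's standard `urllib.robotparser` to check what a well-behaved crawler would conclude from a site's rules. The rule shown assumes the site wants to exclude GPTBot from every path; the `example.com` URLs and the `is_allowed` helper are hypothetical. It does not cover IP blocking, which is configured at the server or firewall level.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content that asks the GPTBot user agent to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article"))        # False
print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/article"))  # True
```

Note that robots.txt is a voluntary convention: it only works if the crawler chooses to honor it, which is part of why the public's trust matters here.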
The Lost Trust
As a means of data collection, crawler technology is in itself neither legal nor illegal. **However, OpenAI's move to set limits on its own crawler does not seem able to restore the public's trust in this large-model company.**
Neil Clarke, Hugo Award winner and editor-in-chief of the veteran sci-fi magazine "Clarkesworld", said: "OpenAI and other large-model companies have repeatedly demonstrated that they do not respect the rights of authors, artists and other creative people. Their products are based largely on the copyrighted work of others."
He also gave an example: CCBot, another crawler, is operated by the Common Crawl organization, which is currently the main supplier of training data for artificial intelligence models. "As far as I know, no one has successfully asked Common Crawl to delete data," Clarke said. "I tried and got no response."
On the other hand, in a tug-of-war with big corporations, ordinary people are mostly at a disadvantage. As Clarke put it, since OpenAI is willing to pay for the data of big organizations like the Associated Press, why doesn't it pay for other people's information? "I asked OpenAI about this, but got no response."
That said, Clarke himself stands on the opposite side from OpenAI. "Clarkesworld", the magazine he founded, is facing a flood of AI-generated content. Clarke has pointed out that after ChatGPT launched late last year, AI-generated spam submissions surged, detecting them was costly, and the magazine temporarily suspended submissions.
Conclusion
OpenAI has already been sued by multiple parties over copyright issues, including a class action brought by Clarkson Law Firm and a lawsuit filed under their own names by best-selling authors such as Paul Tremblay and Mona Awad.
With the further iteration of generative AI technology, similar disputes will only increase.
Large companies are more likely to become targets of public criticism. Even if they are willing to take responsibility, fully compliant data acquisition is not easy to achieve. Because of their huge parameter counts, large models must be trained and deployed with the help of technologies such as distributed computing and cloud services, which increases the risk of data theft, tampering, misuse or leakage.
How to balance protecting personal privacy with encouraging technological innovation, and how to find the best path between corporate survival and compliant operation, are questions that no company devoted to generative AI can avoid.