OpenAI also struggles with data! The company admits to using a web crawler, and its self-imposed limits are unlikely to dispel public suspicion

2023-08-10 01:57:17

Source: "Science and Technology Innovation Board Daily"

Editor: Song Ziqiao

Image source: Generated by Unbounded AI

Data, computing power, and algorithms are regarded as the three core elements of generative AI, and it is difficult to say which is the most important.

However, for star companies like OpenAI, computing power is basically an economic issue: big companies can hoard large amounts of expensive hardware by virtue of their spending power. Data scarcity is the bigger headache, and the way they obtain data keeps landing them in a moral crisis.

Take OpenAI as an example: its practice of scraping public data to train AI models has long been controversial. **According to a recent report from the foreign technology media outlet Insider, OpenAI recently acknowledged that it has launched a web crawler named GPTBot, which crawls and collects data for training large models.**

OpenAI is suspected of being a "data thief"

A web crawler is a computer program that simulates the behavior of a human network user, automatically browsing web pages and collecting the information on them. The crawler can save the data it visits, and whoever operates it can then analyze and reuse that data, for example to infer the preferences of Internet users and push content to matching user groups.

**It is unclear how long OpenAI's crawler has been lurking online, and some suspect OpenAI has been secretly collecting everyone's online data for months or years.**

Faced with such "accusations", OpenAI actively defended itself. The company stated that GPTBot will strictly respect paywalls, will not capture information that requires payment, and will not collect personally identifiable information.

In addition, OpenAI has published a way to block GPTBot: website operators can modify their robots.txt file, or block the crawler's IP addresses, to deny it access. The company also recently announced a deal with The Associated Press under which OpenAI will pay for the AP content it uses as AI training data.
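As a minimal sketch of the robots.txt approach described above, the Python snippet below checks how a rule set treats GPTBot using the standard library's `urllib.robotparser`. The robots.txt contents, the `example.com` URL, the `SomeOtherBot` name, and the `is_allowed` helper are illustrative assumptions, not taken from the report; only the `GPTBot` user-agent token comes from the article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that opts out of GPTBot only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", url))
    print("Other crawler allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", url))
```

Note that robots.txt is only a request: it works to the extent that a crawler chooses to honor it, which is why the article also mentions blocking IP addresses as an alternative.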

The Lost Trust

As a means of data collection, crawler technology is in itself neither legal nor illegal. **However, OpenAI's move to set limits on its own crawler does not seem able to restore the public's trust in this large-model company.**

Neil Clarke, editor-in-chief of the veteran sci-fi magazine "Clarkesworld" and a Hugo Award winner, said: "OpenAI and other large-scale model companies have repeatedly demonstrated that they do not respect the rights of authors, artists and other creative people. Their products are based largely on the copyrighted work of others."

He also gave an example: CCBot, another crawler, is operated by the Common Crawl organization, which is currently the main supplier of training data for artificial intelligence models. "As far as I know, no one has successfully asked Common Crawl to delete data," Clarke said. "I tried and got no response."

On the other hand, in a tug-of-war with big corporations, ordinary people are mostly at a disadvantage. As Clarke put it, since OpenAI is willing to pay for the data of big organizations like the Associated Press, why doesn't it pay for other people's information? "I asked OpenAI about this, but got no response."

Clarke himself, however, already stands on the opposite side from OpenAI. "Clarkesworld", which he founded, is facing a flood of AI-generated content. Clarke has pointed out that after ChatGPT was opened to the public late last year, AI-generated spam submissions surged; because detecting such works is costly, the magazine temporarily suspended submissions.

Conclusion

OpenAI has already been sued by multiple parties over copyright issues, including a class action brought by Clarkson Law Firm and lawsuits filed under their own names by best-selling authors such as Paul Tremblay and Mona Awad.

With the further iteration of generative AI technology, similar disputes will only increase.

Large companies are more likely to become targets of public criticism. Even when they are willing to take responsibility, fully compliant data acquisition is not easy. Because of their huge parameter counts, large models have to be trained and deployed with the help of technologies such as distributed computing and cloud services, which increases the risk of data theft, tampering, misuse or leakage.

How to balance protecting personal privacy against encouraging technological innovation, and how to find the best path between corporate survival and compliant operation, are questions that no company working on generative AI can avoid.
