Inequality in AI models: training in Chinese costs twice as much as in English!

Source: Ifanr

Author: Mo Chongyu

Recently, X (formerly Twitter) user @Dylan Patel shared a study from the University of Oxford: by examining the languages used with GPT-4 and most other common LLMs, the researchers found that the cost of LLM (large language model) inference varies enormously from language to language.

English input and output are far cheaper than those of other languages: Simplified Chinese costs about twice as much as English, Spanish about 1.5 times as much, and Shan (a language spoken in Myanmar) about 15 times as much.

The explanation traces back to a paper that Oxford researchers posted on arXiv in May of this year.

Tokenization is the process of converting natural language text into a sequence of tokens, and it is the first step a language model takes when processing text. When calculating LLM compute costs, more tokens means more compute, and therefore a higher bill.
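As a concrete illustration, here is a minimal Python sketch of this first step using OpenAI's open-source tiktoken library (an assumption on tooling; the paper itself compares 17 different tokenizers). It encodes a string into token IDs and counts them, and it is this count that token-based billing scales with.

```python
# Minimal sketch: turning text into tokens and counting them.
# Assumes the open-source `tiktoken` package is installed (pip install tiktoken).
import tiktoken

# cl100k_base is the encoding used by GPT-3.5/GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization is the first step in processing text."
token_ids = enc.encode(text)        # text -> list of integer token IDs

print(len(token_ids), "tokens:", token_ids)
print(enc.decode(token_ids))        # decoding round-trips back to the text
```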

Unsurprisingly, as generative AI is commercialized, that compute cost is passed on to users: many current AI services bill according to the number of tokens that need to be processed.

The paper shows that, after analyzing 17 tokenization methods, the researchers found that the token sequences produced for the same text differ dramatically in length from language to language; the lengths are anything but equal.

For example, with OpenAI's GPT-3 tokenizer, tokenizing "your love" takes only two tokens in English but eight tokens in Simplified Chinese, even though the Simplified Chinese text is only 4 characters long while the English text runs to 14 characters.
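You can reproduce this kind of comparison yourself with a sketch like the one below. The r50k_base encoding (the GPT-3-era tokenizer in tiktoken) and the Chinese rendering of the phrase are assumptions for illustration; exact counts depend on the tokenizer version and the precise wording.

```python
# Sketch: comparing token counts for the same meaning in two languages.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")   # GPT-3-era encoding (assumed)

samples = {
    "English": "your love",
    "Simplified Chinese": "你的爱意",        # illustrative rendering of the phrase
}

counts = {lang: len(enc.encode(text)) for lang, text in samples.items()}
for lang, n in counts.items():
    print(f"{lang}: {n} tokens")

# With per-token pricing, the cost ratio is just the ratio of token counts.
print("Chinese/English ratio:", counts["Simplified Chinese"] / counts["English"])
```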

The screenshots shared by X user @Dylan Patel make the point vividly: an LLM needs 17 tokens to process a sentence in English, but 198 tokens to process a sentence with the same meaning in Burmese. That makes Burmese roughly 11 times more expensive to process than English.

There are many similar cases. Aleksandar Petrov's website provides a range of related charts and data; those interested can visit it to see the differences between languages for themselves.

OpenAI's official website has a similar page explaining how the API tokenizes a piece of text and displaying the total number of tokens it contains. The page also notes that one token usually corresponds to roughly 4 characters of English text, and that 100 tokens equal about 75 English words.
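Those English-only rules of thumb translate into a quick back-of-the-envelope cost estimate, sketched below. The per-1,000-token price is a placeholder, not an actual quote, and the ratios do not hold for most non-English languages, which is exactly the inequality at issue.

```python
# Back-of-the-envelope estimate using OpenAI's English-text rules of thumb:
# roughly 1 token per 4 characters, roughly 100 tokens per 75 words.
def tokens_from_chars(num_chars: int) -> float:
    return num_chars / 4.0

def tokens_from_words(num_words: int) -> float:
    return num_words * 100.0 / 75.0

def cost_usd(num_tokens: float, price_per_1k_tokens: float) -> float:
    # price_per_1k_tokens is a hypothetical placeholder, not a real price
    return num_tokens / 1000.0 * price_per_1k_tokens

words = 750  # e.g. a one-page English document
tokens = tokens_from_words(words)            # about 1,000 tokens
print(f"~{tokens:.0f} tokens, ~${cost_usd(tokens, 0.002):.4f} at $0.002 per 1K tokens")
```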

Because English token sequences are so short, English is the biggest winner in the cost-effectiveness of generative AI, leaving users of other languages far behind and indirectly creating an unfair situation.

Beyond cost, this difference in token sequence length also leads to unfair processing latency (some languages take longer to process the same content) and unfair modeling of long-range dependencies (in some languages, only shorter texts fit within the same context window).

Put simply, users of certain languages pay more, wait longer, and get worse performance, reducing their fair access to language technology and indirectly opening an AI divide between English speakers and speakers of the rest of the world's languages.

Looking at output cost alone, Simplified Chinese costs twice as much as English. As the AI field develops further, always being "one step behind" clearly puts Simplified Chinese at a disadvantage. Weighing cost against other factors, non-English-speaking countries are also trying to develop large models in their own native languages.

Take China as an example: Baidu, one of the first domestic giants to explore AI, officially launched its generative AI Wenxin Yiyan (ERNIE Bot) on March 20, 2023.

A succession of strong large models followed, including Alibaba's Tongyi Qianwen and Huawei's Pangu.

Among them, the NLP model in Huawei's Pangu family is the industry's first 100-billion-parameter-scale Chinese large model, with 110 billion dense parameters trained on 40 TB of data.

As UN Deputy Secretary-General Amina Mohammed once warned at the UN General Assembly, if the international community does not act decisively, the digital divide will become "the new face of inequality."

In the same way, as generative AI develops rapidly, the AI divide is likely to become the next "new face of inequality" and deserves attention.

Fortunately, the domestic tech giants that so often get a bad rap have already taken action.
