Chinese large models cost more to run than English ones. Is this actually determined by the underlying principles of AI?

Source: Future Technology Power

Author: Li Xinshuai

AI tools such as ChatGPT are becoming more and more common. When interacting with AI, we know that differences in the input prompt affect the output. So, if prompts with the same meaning are expressed in different languages, will the results differ significantly? Moreover, the input and output of prompts are directly tied to the amount of computation behind the model. Are there, then, natural differences, or "unfairness", between languages in terms of AI output and cost? And how does this "unfairness" arise?

What the model actually works with behind a prompt is not text but tokens. After receiving the prompt entered by the user, the model converts the input into a list of tokens for processing and prediction, then converts the predicted tokens back into the words we see in the output. In other words, the token is the basic unit a language model uses to process and generate text or code. This is why vendors advertise how many tokens of context their models support, rather than a number of words or Chinese characters.

Factors affecting token counts

First of all, a token does not correspond to a single English word or a single Chinese character, and there is no fixed conversion ratio between tokens and words. For example, according to the token-counting tool released by OpenAI, the word "hamburger" is decomposed into "ham", "bur" and "ger", a total of 3 tokens. In addition, if the same word appears in a different position or form in two sentences, it may be counted as a different number of tokens.
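As a minimal sketch, this split can be inspected with OpenAI's open-source tiktoken library. The encoding name below is an assumption (the ham/bur/ger split comes from OpenAI's older GPT-3-era tokenizer), and newer encodings may break the word up differently.

```python
import tiktoken  # OpenAI's open-source tokenizer library

# Encode a single word and decode each token id back to its text piece.
# "r50k_base" is the GPT-3-era encoding; the exact split depends on which
# encoding a given model uses.
enc = tiktoken.get_encoding("r50k_base")
token_ids = enc.encode("hamburger")
pieces = [enc.decode([t]) for t in token_ids]
print(len(token_ids), pieces)  # token count and the sub-word pieces it decodes to
```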

How tokens are counted depends mainly on the tokenization method the vendor uses. Tokenization is the process of splitting input and output text into units that a language model can process; it helps the model handle different languages, vocabularies, and formats. Behind ChatGPT is a tokenization method called Byte-Pair Encoding (BPE).
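To make the idea concrete, here is a toy sketch of a single BPE merge step on a hand-picked three-word corpus. This is an illustration only: real tokenizers such as the one behind ChatGPT start from raw bytes and repeat this merge tens of thousands of times to build their vocabulary.

```python
from collections import Counter

# Toy corpus, already split into characters for simplicity.
corpus = [list("hamburger"), list("hammer"), list("burger")]

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

pair = most_frequent_pair(corpus)   # ('e', 'r') is the most frequent pair here
corpus = merge_pair(corpus, pair)   # 'e' and 'r' now become the single symbol 'er'
print(pair, corpus)
```

Each merge adds one entry to the vocabulary, so character sequences that appear often in the training data end up as single tokens, while rare sequences stay split into many pieces. This is exactly why some words, and some languages, cost more tokens than others.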

At present, the number of tokens a word is decomposed into depends on its spelling and how it appears in the sentence, and the differences between languages appear to be large.

Take 汉堡包, the Chinese word for "hamburger", as an example. These three Chinese characters are counted as 8 tokens, that is, they are broken down into 8 parts.

Source: Screenshot of OpenAI official website

Take another passage to compare the "unfairness" of token counts between Chinese and English.

The following is a sentence from the OpenAI official website: "You can use the tool below to understand how a piece of text would be tokenized by the API, and the total count of tokens in that piece of text." This sentence comes to a total of 33 tokens.

Source: screenshot of OpenAI official website

The corresponding Chinese sentence, with the same meaning, comes to a total of 76 tokens.

Source: screenshot of OpenAI official website
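The gap can be reproduced approximately in code. The sketch below assumes the tiktoken library and the "cl100k_base" encoding used by GPT-3.5/GPT-4 models; the Chinese sentence is one possible rendering of the English one (an assumption), so the counts will not match the article's 33 and 76 exactly, but the ratio shows the same pattern.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4 models

english = ("You can use the tool below to understand how a piece of text "
           "would be tokenized by the API, and the total count of tokens "
           "in that piece of text.")
# One possible Chinese rendering of the same sentence (an assumption; the
# article's screenshot used OpenAI's web tool and may word it differently).
chinese = "您可以使用下面的工具来了解 API 如何对一段文本进行分词，以及该段文本中的 token 总数。"

for label, text in [("English", english), ("Chinese", chinese)]:
    print(label, "tokens:", len(enc.encode(text)))
```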

Chinese and English are naturally "unfair" in AI

As can be seen, a Chinese passage with the same meaning requires more than twice as many tokens as its English counterpart. This "unfairness" between Chinese and English in training and inference may stem from the fact that a single Chinese word can express multiple meanings and the language is composed relatively flexibly. Chinese also carries deep cultural connotations and rich contextual meanings, which greatly increases its ambiguity and the difficulty of processing it. English, by contrast, has a relatively simple grammatical structure, which makes it easier to process and understand than Chinese in some natural language tasks.

Because Chinese requires processing more tokens, the model consumes more memory and computing resources, and the cost is naturally higher.

At the same time, although ChatGPT can recognize multiple languages including Chinese, the datasets it was trained on are mostly English text. When processing non-English languages, it may face challenges in language structure, grammar, and so on, which affects the quality of its output. A recent paper titled "Do Multilingual Language Models Think Better in English?" notes that when non-English prompts are first translated into English, the outputs are better than when the non-English text is used directly as the prompt.

For Chinese users, translating Chinese into English before interacting with the AI therefore seems both more effective and more economical. After all, OpenAI's GPT-4 API costs at least $0.03 per 1,000 input tokens.
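A back-of-the-envelope calculation with the figures above shows how the token gap translates into cost, using the article's $0.03 per 1,000 input tokens and the 33 vs. 76 token counts; actual prices and counts change over time, so treat this as an illustration.

```python
# Rough per-prompt input cost, using the figures cited in the article.
PRICE_PER_1K_INPUT_TOKENS = 0.03  # USD, GPT-4 input price quoted above

for language, tokens in [("English", 33), ("Chinese", 76)]:
    cost = tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
    print(f"{language}: {tokens} tokens -> ${cost:.5f} per prompt")
```

For a single short sentence the difference is a fraction of a cent, but over millions of requests the Chinese version of the same content costs more than twice as much.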

Because of the complexity of the Chinese language, AI models may face challenges in training and reasoning accurately on Chinese data, which also makes Chinese models harder to apply and maintain. At the same time, companies developing large models may have to bear higher costs, since building Chinese large models requires additional resources.
