The world's most powerful long-text model, able to read 350,000 Chinese characters at a time: Baichuan2-192K is online

Reading books with large models has never been so fast.

Image source: Generated by Unbounded AI

China's large-model startups are setting new records at the technological frontier.

On October 30, Baichuan Intelligence officially released Baichuan2-192K, a long-window large model that raises the context window of its large language model (LLM) to 192K tokens.

This lets the model process about 350,000 Chinese characters at a time, 14 times as many as GPT-4 (32K tokens, about 25,000 characters; 350,000 ÷ 25,000 = 14) and 4.4 times as many as Claude 2.0 (100K tokens, about 80,000 characters; 350,000 ÷ 80,000 ≈ 4.4).

In other words, Baichuan2-192K can read Three-Body Problem 2 in one sitting, giving it the longest processing context window of any model in the world. It also significantly outperforms competitors on dimensions such as text generation quality, contextual understanding, and Q&A ability.

What can a large model do once it can understand very long texts in one pass? Baichuan Intelligence gave a simple demonstration.

Upload a PDF of the entire "Three-Body Problem 2: The Dark Forest," a novel of roughly 300,000 characters, then ask any question about it; the model gives a concise and precise answer.

Sometimes we turn to AI not for its imagination but to extract accurate information. With Baichuan2-192K, dozens or even hundreds of pages of contract documents can be worked through quickly, with the AI producing a concise summary, practically quantum speed reading:

So what if I suddenly get a new assignment and have a bunch of files to read?

Just package and upload them together, and the Baichuan model can easily weave five news articles into one.

The longer the content a large model can understand, the more directions it can be applied in. As is widely recognized, the ability to model long text is a prerequisite for many application scenarios, and this time Baichuan has taken the industry lead.

From tens of thousands of words to hundreds of thousands of words, leading startups are rushing to seize the "long window"

If you follow large-model applications in text understanding, you may have noticed a pattern. Early on, the texts used to evaluate models tended to be financial or technical reports, usually a dozen to several dozen pages long and running to tens of thousands of words. Later, the test material evolved into hours-long meeting minutes or novels of hundreds of thousands of words, and the competition grew ever more intense and difficult.

At the same time, large-model companies claiming to understand longer contexts have been gaining traction. For example, Anthropic, the company behind Claude, which claims a 100K-token context window, recently received billions of dollars in financing from Google and Amazon, pushing the large-model arms race to a new level.

Why are these companies taking on long texts?

First, from an application perspective, many of the workers who use large models to boost productivity, such as lawyers, analysts, and consultants, inevitably deal with long texts, and the larger the context window, the wider the range of things they can do with a large model. Second, from a technical perspective, the more information the window can hold, the more material the model can draw on when generating the next word, the less likely "hallucinations" become, and the more accurate the output, a necessary condition for putting large-model technology into practice. So while racing to improve raw model performance, companies are also competing over who can make the context window larger and thereby reach more application scenarios.

As the examples above suggest, Baichuan2-192K excels in both text generation quality and contextual understanding. Beyond these qualitative results, quantitative evaluations point the same way.

Baichuan2-192K: the longer the document, the clearer the advantage

A key metric for text generation quality is perplexity: when high-quality documents that follow natural human language are used as the test set, the higher the probability the model assigns to the test set's text, the lower the model's perplexity, and the better the model.

The test set used to measure the Baichuan model's perplexity is PG-19. Built by DeepMind researchers from Project Gutenberg books, PG-19 reflects book-level text quality.
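For reference, here is a minimal sketch of how perplexity is typically computed with a Hugging Face-style causal language model. The model name is purely illustrative (any autoregressive LM works the same way), and this is our own example, not Baichuan's evaluation code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model name; any causal LM can be scored this way.
name = "baichuan-inc/Baichuan2-13B-Base"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids the model returns the mean next-token cross-entropy.
        loss = model(ids, labels=ids).loss
    # Perplexity is the exponential of the mean negative log-likelihood.
    return torch.exp(loss).item()
```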

The results are shown in the figure below. In the initial phase (at the left of the horizontal axis, where the context is shorter), Baichuan2-192K's perplexity is already low. As the context grows longer, its advantage becomes more apparent, with perplexity continuing to fall. This suggests that Baichuan2-192K maintains book-level text generation quality even over long contexts.

In terms of contextual comprehension, Baichuan2-192K's performance is also very impressive.

This ability was assessed with LongEval, an authoritative long-window text comprehension benchmark released by UC Berkeley and other universities. LongEval mainly measures how well a model remembers and understands the content of a long window; the higher the score, the better.

As the evaluation results in the graph below show, Baichuan2-192K maintains consistently high performance as the context length increases, even once the window exceeds 100K. By contrast, Claude 2's overall performance drops sharply once the window length exceeds 80K.

The model was also tested on DuReader, NarrativeQA, TriviaQA, LSHT, and other Chinese and English long-text Q&A and summarization benchmarks. The results show that Baichuan2-192K performs well here too, outperforming other models on most long-text evaluation tasks.

In short, the longer the content processed, the better the relative performance of Baichuan's large model.

192K ultra-long context: how did Baichuan do it?

It is an industry consensus that expanding the context window effectively improves large-model performance, but an ultra-long window means higher computing requirements and greater memory pressure.

To relieve this pressure, the industry has produced various compromises: making the model smaller; using a sliding window so the model actively discards earlier text and keeps attention only over the most recent input; or downsampling the context or using RAG (Retrieval-Augmented Generation) so that attention covers only part of the input.

Although these methods lengthen the context window, they all degrade model performance to varying degrees. In other words, they trade away other capabilities for window length: the model may be unable to answer complex questions that depend on full-text information, or struggle to weigh answers across multiple documents.
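To make the sliding-window compromise concrete, here is a toy PyTorch sketch (names are our own, not any vendor's API) of the attention mask it implies. Any token outside the window becomes invisible to the model, which is exactly why full-text questions break down.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # True means the query position may attend to the key position.
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (column)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (row)
    causal = j <= i            # never attend to future tokens
    recent = (i - j) < window  # never attend beyond the window
    return causal & recent

# Token 7 can see only tokens 5-7; tokens 0-4 are dropped from attention.
print(sliding_window_mask(seq_len=8, window=3)[7])
```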

The Baichuan2-192K released this time achieves a balance between window length and model performance through extreme optimization of both algorithms and engineering, improving window length and model performance at the same time.

On the algorithm side, Baichuan Intelligence proposes an extrapolation scheme based on dynamic position encoding for RoPE and ALiBi, applying different degrees of dynamic attention-mask interpolation to ALiBi masks at different resolutions. This strengthens the model's ability to capture long-range dependencies while preserving resolution.
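Baichuan has not published the exact formulation, so purely as a point of reference, here is a minimal sketch of linear position interpolation for RoPE, a widely used related technique that compresses positions so a longer sequence maps back into the range seen during training. The function name and sizes below are our own.

```python
import torch

def rope_angles(head_dim: int, max_pos: int, base: float = 10000.0,
                scale: float = 1.0) -> torch.Tensor:
    # Standard RoPE angle table; scale > 1 applies linear position
    # interpolation, squeezing max_pos positions into the trained range.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_pos).float() / scale
    return torch.outer(positions, inv_freq)  # shape: (max_pos, head_dim // 2)

# Trained on 4K positions but run at 16K: scale = 16384 / 4096 = 4.
angles = rope_angles(head_dim=128, max_pos=16384, scale=4.0)
```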

On the engineering side, building on its self-developed distributed training framework, Baichuan Intelligence integrates the leading optimization techniques on the market, including tensor parallelism, pipeline parallelism, sequence parallelism, recomputation, and offloading, into a comprehensive 4D parallel distributed solution. The solution automatically selects the most suitable distributed strategy for a given workload, greatly reducing memory usage during long-window inference.
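As a rough illustration of what a 4D parallel layout means in practice (all sizes below are hypothetical, not Baichuan's actual configuration), the four dimensions must multiply together to tile the whole cluster:

```python
# Toy illustration of a 4D parallel device layout; all sizes are hypothetical.
world_size = 64          # total GPUs in the cluster

tensor_parallel = 4      # each weight matrix is split across 4 GPUs
pipeline_parallel = 4    # the layer stack is split into 4 stages
sequence_parallel = 2    # activations are split along the sequence axis
data_parallel = world_size // (tensor_parallel * pipeline_parallel * sequence_parallel)

# The four dimensions must tile the cluster exactly.
assert tensor_parallel * pipeline_parallel * sequence_parallel * data_parallel == world_size
print(f"data-parallel replicas: {data_parallel}")  # prints 2
```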

To win the battle of large models, be fast

Founded in April this year, Baichuan Intelligence is arguably the fastest-iterating large-model startup in the industry. In just half a year, the company has released four open-source models free for commercial use, Baichuan-7B/13B and Baichuan2-7B/13B, as well as two closed-source models, Baichuan-53B and Baichuan2-53B.

On average, a new large model is released every month.

The Baichuan model series integrates intent understanding, information retrieval, and reinforcement learning, combined with supervised fine-tuning and alignment with human intent, and performs well in knowledge Q&A and text creation. These capabilities have made the models industry favorites: cumulative downloads of the open-source Baichuan models across major open-source communities have exceeded 6 million, and Baichuan 2 leads Llama 2 across all dimensions, driving the development of China's open-source ecosystem.

On August 31, Baichuan Intelligence was among the first to pass the registration required under the "Interim Measures for the Management of Generative Artificial Intelligence Services," and it was the only large-model company founded this year among the first batch of eight approved. On September 25, Baichuan Intelligence opened its API, formally entering the To B market and beginning commercialization.

It is fair to say that from technology R&D to deployment, Baichuan has moved fast.

The newly released Baichuan2-192K has officially entered closed beta and is being opened to core partners via API. Baichuan says it has already reached cooperation agreements with financial media outlets and law firms, applying Baichuan2-192K's leading long-context capabilities to specific scenarios in media, finance, and law, and will soon offer the model to enterprise users through API calls and private deployment.

Once fully opened via API, Baichuan2-192K can integrate deeply with a wide range of vertical scenarios, playing a role in people's work, life, and study and helping industry users improve efficiency substantially. Able to process and analyze hundreds of pages of material at a time, it is a major help in real-world scenarios such as long-document summarization, long-document review, long-form article or report writing, and complex programming assistance.

Earlier, Wang Xiaochuan, founder and CEO of Baichuan Intelligence, revealed that Baichuan would launch a hundred-billion-parameter model in the second half of this year, and that a consumer-facing super application is expected to be deployed next year.

On the gap with OpenAI, Wang Xiaochuan acknowledged that a real gap exists in ambition: OpenAI's goal is to explore the ceiling of intelligence, and it even hopes to design technology that connects 10 million GPUs together. In applications, however, he argued that China is moving faster than the United States, and the application and ecosystem experience accumulated in the internet era can carry Baichuan further. Hence the company's guiding idea for building large models: "one step slower on ideals, three steps faster on landing."

Seen in this light, Baichuan2-192K is an extension of that idea, and the world's longest context window should accelerate the deployment of Baichuan Intelligence's large-model technology.
