A homegrown large speech dialogue model is here: Kai-Fu Lee's 01.AI (Zero One Wanwu) participates, Chinese-English bilingual and multimodal, open source and commercially usable

Source: QbitAI

The first open-source Chinese-English bilingual speech dialogue model is here!

A few days ago, a paper on speech-text multimodal large models appeared on arXiv, with 01.AI, the large-model company founded by Kai-Fu Lee, listed among the author affiliations.

The paper proposes LLaSM, a commercially usable Chinese-English bilingual dialogue model that accepts both voice recordings and text input, and even handles mixed-language queries without trouble:

The paper argues that "voice chat" is a more convenient and natural way for humans to interact with AI than text input alone.

Some netizens are already picturing themselves using such a model to lie back and write code just by talking.

The research comes from LinkSoul.AI, Peking University, and 01.AI (Zero One Wanwu). The model has been open-sourced and can be tried directly on Hugging Face.

Let's see how it works.

Supports text and voice input, and runs on mobile too

According to the researchers, LLaSM is the first open-source, commercially usable dialogue model to support bilingual Chinese-English speech-text multimodal dialogue.

So let's test its speech-plus-text input and its bilingual Chinese-English capabilities.

First, a little Chinese-English cultural collision: have it comment on the Tang poet Li Bai, in English:

Not bad: it correctly identifies Li Bai's dynasty. And if you don't read English, having it translate the answer directly into Chinese is no problem either:

Next, a mixed Chinese-English question, with the Chinese term for "fried food" dropped into the prompt; the model's output is again decent:

Let's push the model a bit further and ask it to weigh in on who is the greater poet, Li Bai or Du Fu.

After thinking for a moment, the model gives a very even-handed assessment, showing it has mastered the large-model art of never taking sides (tongue firmly in cheek).

Of course, it isn't limited to computers: you can also use it on a phone.

Let's try saying "Suggest me a recipe" by voice:

The model duly outputs an "eggplant with cheese" recipe, though whether it tastes good is another matter.

That said, in our testing the model also showed some bugs.

For example, it sometimes doesn't quite "get" what you're saying.

When asked to produce mixed Chinese-English output, it plays dumb and answers only in English:

And when asked, in mixed Chinese and English, whether it wanted to listen to Taylor Swift's "Red", the model ran into a serious bug, repeating the same sentence over and over, unable to stop...

Overall, when facing questions or requests that mix Chinese and English, the model's output still leaves much to be desired.

Handled separately, though, its Chinese and its English abilities are both quite good.

So, how is such a model built?

**What kind of new model is this?**

Judging from the demo, LLaSM has two core features: support for both Chinese and English, and support for both voice and text input.

Achieving these two things requires adjustments to the architecture and to the training data, respectively.

Architecturally, LLaSM combines an existing speech recognition model with a large language model.

LLaSM consists of three parts: the automatic speech recognition model Whisper, a modality adapter, and the large language model LLaMA.

Whisper takes the raw speech input and outputs vector representations of the speech features; the modality adapter aligns those speech embeddings with text embeddings; and LLaMA understands the speech and text instructions and generates the response.
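Conceptually, the forward pass can be sketched in a few lines of PyTorch. This is a minimal illustration of the wiring described above, assuming Hugging Face-style Whisper and LLaMA modules and a simple linear adapter; the paper's actual adapter design and dimensions may differ:

```python
import torch
import torch.nn as nn

class LLaSMStylePipeline(nn.Module):
    """Minimal sketch: Whisper encoder -> modality adapter -> LLaMA.
    The linear adapter and the dimensions are illustrative assumptions."""

    def __init__(self, whisper_encoder, llama, speech_dim=1024, llm_dim=4096):
        super().__init__()
        self.speech_encoder = whisper_encoder          # e.g. WhisperModel(...).encoder
        self.adapter = nn.Linear(speech_dim, llm_dim)  # maps speech features into the LLM embedding space
        self.llm = llama                               # e.g. LlamaForCausalLM

    def forward(self, input_features, text_embeds):
        # 1. Whisper turns raw audio features into speech feature vectors.
        speech = self.speech_encoder(input_features).last_hidden_state  # (B, T_s, speech_dim)
        # 2. The modality adapter aligns them with the text embedding space.
        speech = self.adapter(speech)                                   # (B, T_s, llm_dim)
        # 3. Speech and text embeddings are concatenated and fed to the LLM,
        #    which generates the response.
        joint = torch.cat([speech, text_embeds], dim=1)
        return self.llm(inputs_embeds=joint)
```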

Training proceeds in two stages. Stage one freezes the speech encoder and the large model and trains only the modality adapter, so the model learns speech-text alignment; stage two keeps the encoder frozen and trains both the modality adapter and the large model, to learn multimodal dialogue.
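In code terms, the stage switch amounts to toggling which parameters receive gradients. A toy sketch, assuming the submodule names from the pipeline sketch above:

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: int) -> None:
    """Freeze/unfreeze submodules per the two-stage schedule described above.
    Assumes `model` exposes speech_encoder / adapter / llm submodules."""
    # The speech encoder stays frozen in both stages.
    for p in model.speech_encoder.parameters():
        p.requires_grad = False
    # The modality adapter is trained in both stages.
    for p in model.adapter.parameters():
        p.requires_grad = True
    # The LLM is frozen in stage 1 (alignment) and trained in stage 2 (dialogue).
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)
```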

For training data, the researchers compiled LLaSM-Audio-Instructions, a dataset containing 199,000 dialogues and 508,000 speech-text samples.

Of the 508,000 speech-text samples, 80,000 are Chinese speech samples and 428,000 are English.

The researchers built it mainly by applying text-to-speech to existing text datasets such as WizardLM, ShareGPT, and GPT-4-LLM to generate audio for the utterances, while filtering out invalid conversations.
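As a rough illustration of this construction step, the loop below converts the user turns of text dialogues into audio files with an off-the-shelf TTS library. gTTS here is purely a stand-in (the paper does not name the TTS engine actually used), and the dialogue schema is hypothetical:

```python
import os
from gtts import gTTS  # stand-in TTS engine, not necessarily what the paper used

def synthesize_user_turns(dialogues, out_dir="audio"):
    """Turn the user side of text dialogues into speech-text samples.
    `dialogues` follows a hypothetical schema:
    [{"id": str, "turns": [{"role": str, "text": str, "lang": str}]}]."""
    os.makedirs(out_dir, exist_ok=True)
    samples = []
    for dlg in dialogues:
        for i, turn in enumerate(dlg["turns"]):
            # Keep only non-empty user utterances (a crude "invalid conversation" filter).
            if turn["role"] != "user" or not turn["text"].strip():
                continue
            path = os.path.join(out_dir, f"{dlg['id']}_{i}.mp3")
            gTTS(text=turn["text"], lang=turn.get("lang", "en")).save(path)
            samples.append({"audio": path, "text": turn["text"]})
    return samples
```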

This is also currently the largest Chinese-English speech-text instruction-following dataset. It is still being cleaned up and, according to the researchers, will be open-sourced once that is done.

However, the paper does not yet compare the model's output quality against other speech or text models.

About the authors

The paper comes from LinkSoul.AI, Peking University, and 01.AI (Zero One Wanwu).

Co-first authors Yu Shu and Siwei Dong are both at LinkSoul.AI and previously worked at the Beijing Academy of Artificial Intelligence (BAAI).

LinkSoul.AI is an AI startup that previously released the first open-source Chinese Llama 2 model.

Kai-Fu Lee's large-model company 01.AI (Zero One Wanwu) also contributed to the research. The Hugging Face homepage of author Wenhao Huang shows that he graduated from Fudan University.

Paper address:

Demo site:
