Deconstructing How Popular Large Language Models Work

Compiled by: OneFlow

Authors: Tim Lee, Sean Trott


How exactly do large language models work internally? This article explains with a minimum of math and jargon.

The author of this article, Tim Lee, used to work at the technology media outlet Ars Technica. He recently launched a newsletter, "Understanding AI," which mainly covers how artificial intelligence works. Sean Trott is an assistant professor at the University of California, San Diego, where he studies human language understanding and language models. (The following content is compiled and published by OneFlow with authorization; please contact OneFlow for permission to reprint. Original text:

When ChatGPT launched last fall, it caused a stir in the tech industry and around the world. At the time, machine learning researchers had been experimenting with large language models (LLMs) for years, but the general public hadn't paid much attention or realized how powerful they had become.

Today, almost everyone has heard of LLMs and tens of millions of people have used them, but not many understand how they work. You may have heard that LLMs are trained to "predict the next word", and they require a lot of text to do this. However, explanations usually stop there. The details of how they predict the next word are often treated as an esoteric puzzle.

One reason for this is that these systems were developed in a different way. Typical software is written by human engineers who give the computer explicit, step-by-step instructions. In contrast, ChatGPT is built on a neural network that was trained on billions of words of ordinary language.

As a result, no one on earth fully understands the inner workings of LLMs. Researchers are working hard to understand these models, but it is a slow process that will take years, if not decades, to complete.

However, experts do know quite a bit about how these systems work. The goal of this article is to open up this knowledge to a broad audience. We will endeavor to explain what is known about the inner workings of these models without getting into technical jargon or advanced mathematics.

We'll start by explaining word vectors, the surprising way language models represent and reason about language. Then we'll dive into the Transformer, the basic building block of systems like ChatGPT. Finally, we'll explain how these models are trained and explore why achieving good performance requires such phenomenally large quantities of data.

Word vectors

To understand how language models work, you first need to understand how they represent words. Humans represent English words with sequences of letters, like C-A-T for "cat." Language models instead use a long list of numbers called a word vector. For example, here is one way to represent cat as a vector:

[0.0074, 0.0030, -0.0105, 0.0742, 0.0765, -0.0011, 0.0265, 0.0106, 0.0191, 0.0038, -0.0468, -0.0212, 0.0091, 0.0030, -0.0563, -0.0396, -0.0998, -0.0796, …, 0.0002]

(note: the full vector length is actually 300 numbers)

Why use such a complicated notation? Here's an analogy. Washington, DC is located at 38.9 degrees north latitude and 77 degrees west longitude, which we can represent in vector notation:

• The coordinates of Washington DC are [38.9, 77]

• The coordinates of New York are [40.7, 74]

• The coordinates of London are [51.5, 0.1]

• The coordinates of Paris are [48.9, -2.4]

This is useful for reasoning about spatial relationships. You can tell New York is close to Washington, DC because 38.9 is close to 40.7 and 77 is close to 74. Likewise, Paris is close to London. But Paris is far from Washington, DC.

Language models take a similar approach: each word vector represents a point in an imaginary "word space," and words with more similar meanings are placed closer together. For example, the words closest to cat in vector space include dog, kitten, and pet. A key advantage of representing words with vectors of real numbers (as opposed to strings of letters like "C-A-T") is that numbers allow operations that letters don't.

Words are too complex to be represented in only two dimensions, so language models use vector spaces with hundreds or even thousands of dimensions. Humans cannot imagine spaces with such high dimensions, but computers can reason about them and produce useful results.
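To make the idea of "closeness" in word space concrete, here is a minimal sketch in Python with tiny, made-up four-dimensional vectors. The numbers are invented for illustration only; real embeddings have hundreds or thousands of dimensions whose values are learned from data rather than chosen by hand.

```python
import numpy as np

# Toy word vectors with made-up values, for illustration only.
vectors = {
    "cat":    np.array([0.8, 0.1, 0.7, 0.2]),
    "kitten": np.array([0.7, 0.2, 0.8, 0.1]),
    "dog":    np.array([0.6, 0.1, 0.5, 0.3]),
    "car":    np.array([0.1, 0.9, 0.0, 0.8]),
}

def cosine_similarity(a, b):
    # Near 1.0: the vectors point in similar directions (similar meanings).
    # Near 0: the words are unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for word in ("kitten", "dog", "car"):
    print(word, round(cosine_similarity(vectors["cat"], vectors[word]), 3))
# "kitten" and "dog" score much closer to "cat" than "car" does.
```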

Researchers have been studying word vectors for decades, but the concept really gained traction in 2013, when Google announced the word2vec project. Google analyzed millions of documents collected from Google News to find out which words tend to appear in similar sentences. Over time, a trained neural network learns to place words of similar categories (such as dog and cat) adjacent in the vector space.

Google's word vectors have another interesting property: you can "reason" about words using vector arithmetic. For example, Google researchers took the vector for biggest, subtracted the vector for big, and added the vector for small. The word whose vector was closest to the result was smallest.

You can use vector operations for an analogy! In this example, the relationship between big and biggest is similar to the relationship between small and smallest. Google's word vectors capture many other relationships:

• Swiss is to Switzerland as Cambodian is to Cambodia. (nationality)

• Paris is to France as Berlin is to Germany. (capital)

• Immoral is to moral as impossible is to possible. (antonym)

• Mouse is to mice as dollar is to dollars. (plural)

• Man is to woman as king is to queen. (gender role)
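As a rough sketch of the arithmetic behind these analogies, here is a Python example with hand-built two-dimensional vectors, where one dimension loosely stands for "gender" and the other for "royalty." Real word2vec vectors learn directions like these from data across hundreds of dimensions; everything below is a simplified illustration.

```python
import numpy as np

# Hypothetical 2-D vectors: dimension 0 ~ "gender", dimension 1 ~ "royalty".
words = {
    "man":      np.array([ 1.0, 0.0]),
    "woman":    np.array([-1.0, 0.0]),
    "king":     np.array([ 1.0, 1.0]),
    "queen":    np.array([-1.0, 1.0]),
    "prince":   np.array([ 1.0, 0.8]),
    "princess": np.array([-1.0, 0.8]),
}

# king - man + woman should land near queen.
result = words["king"] - words["man"] + words["woman"]

def closest(vec, exclude):
    # Return the word whose vector is nearest (Euclidean distance) to vec.
    candidates = {w: v for w, v in words.items() if w not in exclude}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - vec))

print(closest(result, exclude={"king", "man", "woman"}))  # -> queen
```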

Because these vectors are built from the way people use language, they reflect many of the biases present in human language. For example, in some word embedding models, "doctor" minus "man" plus "woman" yields "nurse." Reducing such biases is an active area of research.

Nonetheless, word embeddings are a useful foundation for language models because they encode subtle but important relational information between words. If a language model learns something about a cat (for example, that it sometimes goes to the vet), the same thing is likely to apply to a kitten or a dog. If the model learns something about the relationship between Paris and France (for example, that they share a language), the same is likely to be true of Berlin and Germany, and of Rome and Italy.

Word meanings depend on context

Simple word embedding schemes like this fail to capture an important fact of natural language: words often have multiple meanings.

For example, the word "bank" could refer to a financial institution or a river bank. Or consider the following sentences:

• John picks up a magazine.

• Susan works for a magazine.

In these sentences, the meaning of "magazine" is related but different. John picked up a physical magazine, and Susan worked for an agency that published physical magazines.

When a word has two unrelated meanings, linguists call them homonyms. When a word has two closely related meanings, like "magazine," linguists call it polysemy.

Language models like ChatGPT are able to represent the same word with different vectors depending on the context in which the word appears. There is a vector for "bank" (the financial institution) and a different vector for "bank" (the side of a river). There is a vector for "magazine" (the physical publication) and another for "magazine" (the organization that publishes it). As you might expect, language models use vectors that are more similar for the different senses of polysemous words and less similar for the senses of homonyms.

So far we haven't said anything about how language models do this; we'll get to that shortly. But we're spelling out these vector representations because they are important for understanding how language models work.

Traditional software is designed to operate on unambiguous data. If you ask a computer to compute "2 + 3," there is no ambiguity about what 2, +, or 3 mean. But ambiguity in natural language goes far beyond homonyms and polysemy:

• In "the customer asked the mechanic to fix his car", does "his" refer to the customer or the mechanic?

• In “the professor urged the student to do her homework”, does “her” refer to the professor or the student?

• In "fruit flies like a banana," is "flies" a verb (referring to a fruit that flies through the sky like a banana) or a noun (referring to fruit flies that like bananas)?

People resolve this type of ambiguity depending on the context, but there are no simple or clear rules. Rather, it requires an understanding of what is actually going on in the world. You need to know that mechanics usually fix customers' cars, students usually do their own homework, and fruit usually doesn't fly.

Word vectors provide a flexible way for language models to represent the exact meaning of each word in the context of a particular paragraph. Now let's see how they do this.

Converting word vectors into word predictions

The GPT-3 model behind the original version of ChatGPT consists of dozens of neural network layers. Each layer takes as input a sequence of vectors—one for each word in the input text—and adds information to help clarify the meaning of that word and better predict words that might come next.

Let's start with a simple example.

Each layer of an LLM is a Transformer, a neural network architecture that Google first introduced in a landmark 2017 paper.

At the bottom of the diagram, the model's input text is "John wants his bank to cash the." These words, represented as word2vec-style vectors, are fed into the first Transformer. That Transformer figures out that both wants and cash are verbs (both words can also be nouns). We represent this added context as red text in parentheses, but in reality the model stores it by modifying the word vectors in ways that are difficult for humans to interpret. These new vectors, known as hidden states, are passed to the next Transformer.

The second Transformer adds two more pieces of context: it clarifies that bank refers to a financial institution rather than a river bank, and that his is a pronoun referring to John. The second Transformer produces another set of hidden state vectors that reflect everything the model has learned so far.

The diagram above depicts a purely hypothetical LLM, so don't take the details too seriously. Real LLMs tend to have many more layers. The most powerful version of GPT-3, for example, has 96 layers.

Research suggests that the first few layers focus on understanding the sentence's syntax and resolving ambiguities like the ones shown above. Later layers (not shown, to keep the diagram a manageable size) work toward a high-level understanding of the passage as a whole.

For example, when LLM "reads" a short story, it seems to remember all sorts of information about the story's characters: gender and age, relationships with other characters, past and current locations, personalities and goals, and more.

Researchers don't understand exactly how LLMs keep track of this information, but logically the model must be doing it by modifying the hidden state vectors as they pass from layer to layer. The vectors in modern LLMs have extremely high dimensions, which helps them express richer semantic information.

For example, the most powerful version of GPT-3 uses word vectors with 12288 dimensions, that is, each word is represented by a list of 12288 numbers. This is 20 times larger than the word2vec scheme proposed by Google in 2013. You can think of all these extra dimensions as a kind of "scratch space" that GPT-3 can use to record the context of each word. Informative notes made by earlier layers can be read and modified by later layers, allowing the model to gradually deepen its understanding of the entire text.

So suppose we changed the diagram above to depict a 96-layer language model interpreting a 1000-word story. Layer 60 might include a vector for John annotated "(protagonist, male, married to Cheryl, Donald's cousin, from Minnesota, currently in Boise, trying to find his lost wallet)." Again, all of these facts (and probably more) would be encoded in the list of 12288 numbers corresponding to the word John. Or some of this information might be encoded in the 12288-dimensional vectors for Cheryl, Donald, Boise, wallet, or other words in the story.

The goal of this is to have the 96th and last layer of the network output a hidden state that contains all the necessary information to predict the next word.

Attention Mechanism

Now let's talk about what happens inside each Transformer. A Transformer performs two steps when updating the hidden state of each word in the input passage:

  1. In the attention step, each word "looks around" for other words that have relevant context and shares information with them.

  2. In the feed-forward step, each word "thinks" about the information gathered in the previous attention step and tries to predict the next word.

Of course, it is the network that performs these steps, not the individual words. But we phrase it this way to emphasize that the Transformer treats words, not entire sentences or paragraphs, as the basic unit of analysis. This approach lets LLMs take full advantage of the massively parallel processing power of modern GPU chips. It also helps LLMs scale to passages with thousands of words. Both were challenges for earlier language models.

You can think of the attention mechanism as a matchmaking service for words. Each word makes a checklist (called a query vector) describing the characteristics of the words it is looking for. Each word also makes a checklist (called a key vector) describing its own characteristics. The network finds the best-matching words by comparing each key vector with each query vector (by computing a dot product). Once it finds a match, it passes information from the word that produced the key vector to the word that produced the query vector.

For example, in the previous section we showed a hypothetical Transformer model that found that "his" refers to "John" in part of the sentence "John wants his bank to cash the". Internally, the process might go something like this: a query vector for "his" might be effectively represented as "I'm looking for: nouns that describe men". A key vector for "John" might be effectively expressed as "I am a noun that describes a male". The network will detect that these two vectors match, and transfer information about the "John" vector to the "his" vector.
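Here is a minimal sketch of the query/key matching just described: a single attention head implemented with NumPy, using tiny made-up dimensions. Real Transformers also use learned projection matrices to produce the queries, keys, and values, plus multiple heads and masking, all of which are omitted here.

```python
import numpy as np

def attention(queries, keys, values):
    # queries, keys, values: arrays of shape (num_words, dim).
    # Compare each word's query with every word's key via dot products.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    # Softmax turns scores into weights that sum to 1 for each word.
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # Each word's updated vector is a weighted mix of the other words' values.
    return weights @ values

# Three words with 4-dimensional hidden states (random stand-ins).
hidden = np.random.rand(3, 4)
# In a real model, queries/keys/values are learned projections of the hidden states.
updated = attention(hidden, hidden, hidden)
print(updated.shape)  # (3, 4): one updated vector per word
```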

Each attention layer has several "attention heads," which means this information-swapping process happens several times (in parallel) at each layer. Each attention head focuses on a different task:

• An attention head may match pronouns to nouns, as we discussed earlier.

• Another attention head might resolve the meaning of polysemous words like "bank."

• A third attention head might link two-word phrases like "Joe Biden."

Attention heads such as these often operate sequentially, with the result of an attention operation in one attention layer becoming the input for an attention head in the next layer. In fact, each of the tasks we just enumerated may require multiple attention heads, not just one.

The largest version of GPT-3 has 96 layers, and each layer has 96 attention heads, so every time a new word is predicted, GPT-3 will perform 9216 attention operations.

A real-world example

In the above two sections, we showed idealized versions of how attention heads work. Now let's look at the research on the inner workings of real language models.

Last year, researchers at Redwood Research studied how GPT-2, a predecessor of ChatGPT, predicted the next word for the passage "When Mary and John went to the store, John gave a drink to".

GPT-2 predicts that the next word is Mary. The researchers found that three types of attention heads contributed to this prediction:

• Three attention heads, which they call the Name Mover Head, copy information from the Mary vector to the final input vector (the vector for the word to). GPT-2 uses the information in this rightmost vector to predict the next word.

• How did the network decide Mary was the right word to copy? Working backward through GPT-2's computations, the researchers discovered a set of four attention heads they call Subject Inhibition Heads, which mark the second John vector and block the Name Mover Heads from copying the name John.

• How did the Subject Inhibition Heads know John shouldn't be copied? Working backward further, the team found two attention heads they call Duplicate Token Heads. They mark the second John vector as a duplicate of the first John vector, which helps the Subject Inhibition Heads decide that John should not be copied.

In short, these nine attention heads allow GPT-2 to figure out that "John gave a drink to John" doesn't make sense and to choose "John gave a drink to Mary" instead.

This example shows how difficult it can be to fully understand LLM. A Redwood team of five researchers published a 25-page paper explaining how they identified and validated these attention heads. Even with all this work, however, we're still a long way from a full explanation of why GPT-2 decided to predict "Mary" as the next word.

For example, how does the model know the next word should be someone's name rather than some other kind of word? It is easy to imagine similar sentences where Mary would not be a good prediction. For example, in the sentence "when Mary and John went to the restaurant, John gave his keys to," the logical next words would be "the valet."

Presumably, with enough research, computer scientists could uncover and explain the other steps in GPT-2's reasoning process. Eventually, they might be able to develop a full understanding of how GPT-2 decided that "Mary" was the most likely next word. But it could take months or even years of additional effort just to understand how a single word gets predicted.

The language models behind ChatGPT—GPT-3 and GPT-4—are larger and more complex than GPT-2, and they are capable of more complex reasoning tasks than the simple sentences the Redwood team studied. Therefore, the work of fully explaining these systems will be a huge project, and it is unlikely that humans will complete it in a short time.

The feed-forward step

After the attention heads transfer information between word vectors, the feed-forward network "thinks about" each word vector and tries to predict the next word. At this stage no information is exchanged between words; the feed-forward layer analyzes each word in isolation. However, the feed-forward layer has access to any information previously copied over by the attention heads. Here is the structure of a feed-forward layer in the largest version of GPT-3.

The green and purple circles represent neurons: they are mathematical functions that compute a weighted sum of their inputs.

The feed-forward layer is powerful because of its enormous number of connections. We've drawn this network with three neurons in the output layer and six in the hidden layer, but a GPT-3 feed-forward layer is much larger: 12288 neurons in the output layer (matching the model's 12288-dimensional word vectors) and 49152 neurons in the hidden layer.

So in the largest version of GPT-3, the hidden layer has 49152 neurons, each with 12288 input values (and thus 12288 weight parameters), and the output layer has 12288 neurons, each with 49152 input values (and thus 49152 weight parameters). This means each feed-forward layer has 49152 * 12288 + 12288 * 49152 = 1.2 billion weight parameters. With 96 feed-forward layers, that comes to 1.2 billion * 96 = 116 billion parameters, nearly two-thirds of GPT-3's 175 billion total.
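A quick way to sanity-check this arithmetic, along with a toy version of the feed-forward computation itself, is sketched below in Python. The big numbers are GPT-3's reported sizes; the working example uses tiny dimensions, ignores bias terms, and simplifies the activation function.

```python
import numpy as np

# --- Parameter-count check using GPT-3's reported sizes ---
d_model, d_hidden = 12288, 49152
params_per_layer = d_model * d_hidden + d_hidden * d_model  # two weight matrices, biases ignored
print(f"{params_per_layer:,}")       # 1,207,959,552  (~1.2 billion per feed-forward layer)
print(f"{params_per_layer * 96:,}")  # 115,964,116,992 (~116 billion across 96 layers)

# --- A scaled-down feed-forward block (toy sizes so it runs instantly) ---
d_model, d_hidden = 8, 32
W_in = np.random.randn(d_model, d_hidden)   # expand to the hidden layer
W_out = np.random.randn(d_hidden, d_model)  # project back down

def feed_forward(x):
    # x is one word's vector; each word is processed independently at this step.
    h = np.maximum(0, x @ W_in)  # ReLU-style activation (GPT-3 actually uses GELU)
    return h @ W_out

print(feed_forward(np.random.randn(d_model)).shape)  # (8,)
```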

In a 2020 paper, researchers from Tel Aviv University found that feed-forward layers work by pattern matching: each neuron in the hidden layer matches a specific pattern in the input text. Here are some of the patterns matched by neurons in a 16-layer version of GPT-2:

• Neurons in layer 1 match word sequences ending in "substitutes".

• Neurons in layer 6 match word sequences that are military-related and end in "base" or "bases".

• Neurons in layer 13 match sequences that end with a time range, such as "between 3pm and 7pm" or "from 7pm on Friday until".

• Neurons in layer 16 match sequences related to television shows, such as "original NBC daytime version, archived" or "time delay increased viewership for this episode by 57 percent."

As you can see, the patterns become more abstract in later layers. Early layers tend to match specific words, while later layers match phrases that fall into broader semantic categories, such as television shows or time intervals.

This is interesting because, as mentioned earlier, the feed-forward layer examines only one word at a time. So when it classifies the sequence "original NBC daytime version, archived" as TV-related, it only has access to the vector for the word "archived," not to words like NBC or daytime. Presumably, the feed-forward layer can tell that "archived" is part of a TV-related sequence because attention heads previously moved contextual information into the "archived" vector.

When a neuron matches one of the patterns, it adds information to the word vector. While this information is not always easy to interpret, in many cases you can think of it as a tentative prediction of the next word.

Feed-forward networks reason with vector operations

Recent research from Brown University shows an elegant example of how feed-forward layers help predict the next word. Earlier we discussed Google's word2vec research showing that vector arithmetic can be used for analogical reasoning. For example, Berlin - Germany + France = Paris.

The Brown University researchers found that feed-forward layers sometimes use exactly this mechanism to predict the next word. For example, they studied how GPT-2 responded to the following prompt: "Question: What is the capital of France? Answer: Paris. Question: What is the capital of Poland? Answer:"

The team studied a version of GPT-2 with 24 layers. After each layer, the Brown scientists probed the model to look at its best guess for the next token. In the first 15 layers, the top guess was a seemingly random word. Between layers 16 and 19, the model began predicting that the next word would be Polish: incorrect, but getting closer. Then at layer 20, the top guess changed to Warsaw, the correct answer, and it stayed that way for the final four layers.
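This kind of layer-by-layer probing is often called a "logit lens." Below is a rough sketch of the idea using the open-source GPT-2 (the 24-layer "medium" version) via the Hugging Face transformers library. This is our own illustrative reconstruction, not the Brown team's exact code, and the guesses printed at each layer will not exactly match the numbers reported above.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
model.eval()

prompt = ("Question: What is the capital of France? Answer: Paris. "
          "Question: What is the capital of Poland? Answer:")
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states holds one tensor per layer (plus the initial embeddings).
for layer_idx, hidden in enumerate(outputs.hidden_states):
    # Project the final position's hidden state through the model's final layer norm
    # and output embedding to see its "best guess so far" for the next token.
    last = model.transformer.ln_f(hidden[:, -1, :])
    logits = model.lm_head(last)
    print(layer_idx, repr(tokenizer.decode(logits.argmax(dim=-1))))
```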

The Brown researchers found that the 20th feed-forward layer converted Poland to Warsaw by adding a vector that maps country vectors to their corresponding capitals. Adding the same vector to China produced Beijing.

A feed-forward layer in the same model uses vector operations to convert lowercase words to uppercase words, and words in the present tense to their past tense equivalents.

Attention layers and feed-forward layers have different functions

So far, we have seen two practical examples of GPT-2 word prediction: the attention head helps predict that John will give Mary a drink; the feed-forward layer helps predict that Warsaw is the capital of Poland.

In the first case, Mary comes from a user-provided prompt. But in the second case, Warsaw did not appear in the prompt. Instead, GPT-2 had to "remember" that Warsaw was the capital of Poland, and this information was learned from the training data.

When the Brown researchers disabled the feed-forward layer that converts Poland to Warsaw, the model no longer predicted Warsaw as the next word. But interestingly, if they then added the sentence "The capital of Poland is Warsaw" to the beginning of the prompt, GPT-2 could answer the question again. This is probably because GPT-2 used its attention mechanism to copy the name Warsaw from the prompt.

This division of labor holds more broadly: attention heads retrieve information from earlier words in the prompt, while feed-forward layers enable language models to "remember" information that is not in the prompt.

In fact, the feed-forward layers can be thought of as a database of information the model has learned from its training data. Earlier feed-forward layers are more likely to encode simple facts related to specific words, such as "Trump often comes after Donald." Later layers encode more complex relationships, such as "add this vector to convert a country to its capital."

How language models are trained

Many early machine learning algorithms required training examples to be labeled by hand. For example, the training data might have been photos of dogs or cats with human-supplied labels ("dog" or "cat"). The need for labeled data made it difficult and expensive to create datasets large enough to train powerful models.

A key innovation of LLMs is that they do not require explicitly labeled data. Instead, they learn by trying to predict the next word in a text passage. Almost any written material is suitable for training these models -- from Wikipedia pages to news articles to computer code.

For example, an LLM might be given the input "I like my coffee with cream and" and try to predict "sugar" as the next word. A freshly initialized language model is terrible at this, because each of its weight parameters (175 billion of them in the most powerful version of GPT-3) starts off as an essentially random number.

But as the model sees more examples -- hundreds of billions of words -- those weights gradually adjust to make better predictions.

Let's use an analogy to illustrate how this process works. Imagine you're taking a shower and you want the water to be just the right temperature: not too hot, not too cold. You have never used this faucet before, so you point the handle in a random direction and feel the temperature of the water. If it's too hot or too cold, you turn the handle the other way; the closer you get to the right temperature, the smaller the adjustments you make.

Now, let's make a few changes to this analogy. First, imagine that there are 50,257 taps, each of which corresponds to a different word, such as "the", "cat", or "bank". Your goal is to only let water flow from the tap that corresponds to the next word in the sequence.

Second, there's a bunch of interconnected pipes behind the faucet, and a bunch of valves on those pipes. So if water is coming out of the wrong faucet, you can't just adjust the knob on the faucet. You send an army of clever squirrels to track down every pipe, adjusting every valve they find along the way.

This gets complicated: because the same pipe often feeds multiple faucets, careful thought is required to figure out which valves to tighten, which to loosen, and by how much.

Obviously, this example becomes ridiculous when taken literally. Building a pipeline network with 175 billion valves is neither realistic nor useful. But thanks to Moore's Law, computers can and do operate at this scale.

So far, every part of the LLM we've discussed in this article (the neurons in the feed-forward layers and the attention heads that pass contextual information between words) is implemented as a chain of simple mathematical functions (mostly matrix multiplications) whose behavior is determined by adjustable weight parameters. Just as the squirrels in our story control the flow of water by loosening and tightening valves, the training algorithm controls the flow of information through the neural network by increasing or decreasing the language model's weight parameters.

The training process happens in two steps. First comes a "forward pass": the water is turned on and you check whether it comes out of the right faucet. Then the water is shut off for a "backward pass," in which the squirrels race along each pipe, tightening or loosening valves. In a digital neural network, the role of the squirrels is played by an algorithm called backpropagation, which "walks backwards" through the network, using calculus to estimate how much each weight parameter needs to change.

Completing this process (a forward pass with one example, then a backward pass to improve the network's performance on that example) requires tens of billions of mathematical operations. And training a model as large as GPT-3 means repeating the process billions of times, once for every word of training data. OpenAI estimates that training GPT-3 required more than 300 billion teraflops of computation, which would take dozens of high-end computer chips running for months.
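Below is a heavily simplified sketch of that forward/backward loop for next-word prediction, written in PyTorch. A real LLM would use stacked Transformer layers, batches of text, and vastly more data; the model here is a toy, and the token ids are made up purely for illustration.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50257, 64  # GPT-style vocabulary size; tiny embedding size for the toy
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# A toy "training example": made-up token ids standing in for
# "I like my coffee with cream and" -> "sugar"
context_ids = torch.tensor([40, 588, 616, 6891, 351, 8566, 290])
target_id = torch.tensor([5026])

for step in range(100):
    logits = model(context_ids[-1:])   # forward pass: predict the next token from the last token
    loss = loss_fn(logits, target_id)  # how wrong was the prediction?
    optimizer.zero_grad()
    loss.backward()                    # backward pass: backpropagation estimates how to adjust each weight
    optimizer.step()                   # nudge the weights (the "squirrels adjusting valves")
```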

The amazing performance of GPT-3

You might be surprised at how well the training process works. ChatGPT can perform a variety of complex tasks — writing articles, making analogies, and even writing computer code. So, how does such a simple learning mechanism produce such a powerful model?

One reason is scale. It's hard to overemphasize the sheer number of examples a model like GPT-3 sees. GPT-3 is trained on a corpus of about 500 billion words. By comparison, the average human child encounters about 100 million words before the age of 10.

Over the past five years, OpenAI has steadily increased the size of its language models. In a widely circulated 2020 paper, OpenAI reported that the accuracy of its language models scales as a power law with model size, dataset size, and the amount of compute used for training, with "some trends spanning more than seven orders of magnitude."

The bigger the models got, the better they performed on language tasks, but only if the amount of training data increased by a similar factor. And training bigger models on more data requires much more computing power.

In 2018, OpenAI released its first large model, GPT-1. It used 768-dimensional word vectors and 12 layers, for a total of 117 million parameters. A few months later, OpenAI released GPT-2; its largest version had 1600-dimensional word vectors, 48 layers, and 1.5 billion parameters in total. In 2020, OpenAI released GPT-3, which has 12288-dimensional word vectors, 96 layers, and 175 billion parameters.

This year, OpenAI released GPT-4. The company hasn't released any architectural details, but it's widely believed in the industry that GPT-4 is much bigger than GPT-3.

Not only did each model learn more facts than its smaller predecessor, but it also showed better performance on tasks that required some form of abstract reasoning.

For example, consider the following story: Here is a bag filled with popcorn. There is no chocolate in the bag. Yet the label on the bag says "chocolate," not "popcorn." Sam finds the bag. She has never seen the bag before. She cannot see what is inside the bag. She reads the label.

As you can probably guess, Sam believes the bag contains chocolate and is surprised to find that it contains popcorn.

Psychologists call the capacity to reason about the mental states of other people "theory of mind." Most people acquire this ability around the time they start elementary school. Experts disagree about whether any non-human animals, such as chimpanzees, have theory of mind, but the general consensus is that it is central to human social cognition.

Earlier this year, Stanford University psychologist Michal Kosinski published a study examining the ability of LLMs to solve theory-of-mind tasks. He gave various language models passages like the one quoted above and then asked them to complete a sentence such as "she believes that the bag is full of." The correct answer is "chocolate," but an unsophisticated language model might say "popcorn" or something else.

GPT-1 and GPT-2 failed this test. But the first version of GPT-3, released in 2020, got it right almost 40 percent of the time, a level of performance Kosinski compared to a three-year-old. The latest version of GPT-3, released last November, improved this to about 90 percent, on par with a seven-year-old. GPT-4 answered about 95 percent of the theory-of-mind questions correctly.

"Given that there is neither evidence in these models that ToM (mentalizing) was intentionally engineered nor studies demonstrating that scientists knew how to achieve it, it is likely that this ability arose spontaneously and autonomously. This is the linguistic ability of the models A by-product of constant enhancement," Kosinski wrote.

It's worth noting that not all researchers agree that these results demonstrate theory of mind. For example, small changes to the false-belief task caused GPT-3's performance to drop sharply, and GPT-3's performance on other tasks measuring theory of mind has been more erratic. As Sean has written, the successful performance could be due to a confounding factor in the task, a kind of "Clever Hans" effect (named after a horse that appeared to perform simple intellectual tasks but was actually relying on unconscious cues from the people around it), only appearing in a language model rather than a horse.

Nonetheless, GPT-3's near-human performance on several tasks designed to measure theory of mind would have been unthinkable just a few years ago, and it is consistent with the view that larger models are generally better at tasks requiring advanced reasoning.

This is just one of many examples of language models appearing to spontaneously develop advanced reasoning abilities. In April, researchers at Microsoft published a paper arguing that GPT-4 showed early, tantalizing hints of artificial general intelligence: the ability to think in a sophisticated, human-like way.

For example, one researcher asked GPT-4 to draw a unicorn using an obscure graphics programming language called TikZ. GPT-4 responded with a few lines of code, which the researcher then fed into the TikZ software. The resulting images were crude, but they clearly showed that GPT-4 had some understanding of what a unicorn looks like.

The researchers thought GPT-4 might simply have memorized unicorn-drawing code from its training data, so they gave it a follow-up challenge: they modified the unicorn code to remove the horn and move some of the other body parts around. Then they asked GPT-4 to put the horn back. GPT-4 responded by placing the horn in the correct position.

GPT-4 managed to do this even though the version the authors tested was trained entirely on text; its training set contained no images. Apparently, GPT-4 learned to reason about a unicorn's body shape purely from its training on vast amounts of written text.

Currently, we have no real understanding of how LLMs accomplish feats like this. Some people argue that examples like this show the models are starting to truly understand the meanings of the words in their training set. Others insist that language models are just "stochastic parrots," merely repeating increasingly complex sequences of words without actually understanding them.

This debate points to a deep philosophical dispute that may never be resolved. Nonetheless, we think it is important to focus on the empirical performance of models like GPT-3. If a language model consistently gets the right answers on a particular type of question, and researchers are confident that confounding factors have been ruled out (for example, by ensuring the model was not exposed to those questions during training), then whether or not it understands language in exactly the same way people do, this is an interesting and important result.

Another possible reason that training on next-word prediction works so well is that language itself is predictable. The regularities of language are often (though not always) connected to regularities in the physical world. So when a language model learns the relationships between words, it is often implicitly learning relationships that exist in the world as well.

Furthermore, prediction may be fundamental to biological intelligence as well as artificial intelligence. According to philosophers such as Andy Clark, the human brain can be thought of as a "prediction machine" whose primary job is to make predictions about our environment and then use those predictions to navigate it successfully. Intuitively, good predictions go hand in hand with good representations: an accurate map is more likely to help you navigate than an inaccurate one. The world is vast and complex, and making predictions helps organisms orient themselves and adapt to that complexity.

A major challenge in building language models has traditionally been figuring out the most useful ways to represent different words, especially since the meaning of many words depends heavily on context. The next-word prediction method allowed the researchers to sidestep this thorny theoretical conundrum by turning it into an empirical problem.

It turns out that with enough data and computing power, language models learn a great deal about how human language works simply by figuring out how best to predict the next word. The downside is that we end up with systems whose inner workings humans do not yet fully understand.

Notes:

  1. Technically speaking, LLMs operate on word fragments called tokens rather than whole words, but we ignore this implementation detail to keep the article to a manageable length (see the article "Revealing the Working Principle of GPT Tokenizer"). A quick illustration appears after these notes.

  2. Feedforward networks are also known as multi-layer perceptrons. Computer scientists have been studying this type of neural network since the 1960s.

  3. Technically, after a neuron computes the weighted sum of its inputs, it passes the result through an activation function. This article ignores this implementation detail; for a complete explanation of how neurons work, check out:

  4. If you want to learn more about backpropagation, check out Tim's 2018 explanation of how neural networks work.

  5. In practice, training is usually done in batches for computational efficiency. So the software might do a forward pass on 32000 tokens before backpropagating.
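As mentioned in note 1, models actually operate on tokens rather than whole words. For the curious, here is a quick way to see tokenization in action, using the open-source GPT-2 tokenizer from the Hugging Face transformers library (other models split text differently):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Common words are usually single tokens; rarer words get split into fragments.
print(tokenizer.tokenize("I like my coffee with cream and sugar"))
print(tokenizer.tokenize("antidisestablishmentarianism"))
```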
