What is language modeling all about?
Discover the secrets of classical language modeling and learn how GPT-2 predicts text, handles tokenization, and adjusts creativity with temperature.
In recent years, large language models have gained enormous popularity, and now they can be found almost everywhere. Tools like ChatGPT, Google Gemini, or Claude are very useful, so more and more people are starting to use them. That's why it's important to have a basic understanding of how they work under the hood.
In my previous article, I walked through the process of fine-tuning a large language model for extractive question answering. That task, however, did not involve classical language modeling, which is the foundation of the whole natural language processing field. This time I would like to explain how all those wonderful tools work, using the GPT-2 model as an example.
What is classical language modeling?
Classical language modeling is a core concept of natural language processing, where the primary goal is to predict the next part of a given sequence by generating a probability distribution over each possible character or word. Historically, there have been many approaches to this problem. One of the earliest was the n-gram model, which builds a statistical model by counting occurrences of all word combinations in a large corpus of text. Later, as deep learning gained popularity, neural networks were applied to the task, especially recurrent ones like GRU or LSTM. Then, in 2017, the transformer architecture appeared and completely dominated the field of natural language processing, quickly becoming the state-of-the-art approach for language-related tasks.
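As a toy illustration of the n-gram idea (not taken from any real model), here is a bigram counter in a few lines of Elixir:

# Count bigram occurrences in a tiny corpus, then turn the counts for a
# given previous word into a probability distribution over the next word.
corpus = ~w(my name is john and my name is anna)

bigram_counts =
  corpus
  |> Enum.chunk_every(2, 1, :discard)
  |> Enum.frequencies_by(&List.to_tuple/1)

next_after_name = for {{"name", next}, count} <- bigram_counts, do: {next, count}
total = next_after_name |> Enum.map(&elem(&1, 1)) |> Enum.sum()

# Every observed continuation of "name" is "is", so it gets probability 1.0.
Enum.map(next_after_name, fn {next, count} -> {next, count / total} end)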
GPT overview
GPT-2 is a transformer model that leverages the self-attention mechanism for better natural language understanding. It belongs to the family of generative pre-trained transformer (GPT) models developed by OpenAI. It was introduced in 2019 and might seem a bit outdated in the rapidly evolving world of artificial intelligence, especially when compared to recent, much larger models like GPT-4 or GPT-4o. However, all models from the GPT family use only the decoder part of the transformer architecture, which allows them to predict the next word in a sequence, so the general mechanism of text generation remains the same.
As you might already know, the newest models are proprietary and not available on Hugging Face, nor through the Bumblebee library. That's why I'll use the GPT-2 model.
Installation
If you want to follow along with this article, I highly recommend using Livebook, since it's a very convenient way to work with code, but a regular Mix project will work just as well.
The only required libraries are Axon, Bumblebee, and Nx. Axon provides an abstraction for implementing neural networks, Bumblebee makes it possible to interact with models available in the Hugging Face repository, and Nx is responsible for all the math operations that take place under the hood. If you want to accelerate inference, you might also find EXLA helpful.
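If you are following along in Livebook, a minimal setup cell might look like the sketch below; the version requirements are only indicative, so adjust them to the current releases:

# Minimal Livebook setup; version requirements are illustrative and may
# need adjusting to the releases current at the time of reading.
Mix.install([
  {:bumblebee, "~> 0.5"},
  {:axon, "~> 0.6"},
  {:nx, "~> 0.7"},
  {:exla, "~> 0.7"}
])

# Optional: run Nx computations on the EXLA backend for faster inference.
Nx.global_default_backend(EXLA.Backend)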
Model download
First things first, the model and the tokenizer need to be downloaded to your local drive. To do that, you have to specify which model and tokenizer you want to use.
checkpoint = {:hf, "gpt2"}
{:ok, %{model: model, params: params}} = Bumblebee.load_model(checkpoint)
{:ok, tokenizer} = Bumblebee.load_tokenizer(checkpoint)
The model size depends on the chosen variant, but the download usually takes a while to finish.
If you want to find out how big the model is, you can simply inspect how many parameters it has.
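For example, here is a rough sketch of counting them, assuming params is a plain nested map of tensors (newer Axon versions wrap the parameters in an Axon.ModelState struct instead):

# Count all parameter values across layers; this assumes `params` is a
# map of layer names to maps of parameter tensors.
param_count =
  params
  |> Map.values()
  |> Enum.flat_map(&Map.values/1)
  |> Enum.map(&Nx.size/1)
  |> Enum.sum()

IO.puts("The model has #{param_count} parameters")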
Text tokenization
Ok, now it's time to prepare the input text that you want to complete. I'll start with a simple unfinished sentence.
text = "My name is John and my main"
Since computers and neural networks don't understand plain text, we need to convert it to numbers. Each model has an associated vocabulary containing the most common words from the text corpus it was trained on. Because the collection of possible words is practically unlimited, the vocabulary is usually restricted to tens of thousands of the most popular entries; in the case of GPT-2 it contains 50,257 tokens. If a word from outside the vocabulary occurs in the provided text, it is either split into smaller subword tokens or replaced with a special token for unknown words.
input = Bumblebee.apply_tokenizer(tokenizer, text)
As a result, we got a map with several keys, but the most important one is called "input_ids". Let's see what it contains.
%{
  "attention_mask" => #Nx.Tensor<
    u32[1][7]
    EXLA.Backend<host:0, 0.502782931.1833566222.97129>
    [
      [1, 1, 1, 1, 1, 1, 1]
    ]
  >,
  "input_ids" => #Nx.Tensor<
    u32[1][7]
    EXLA.Backend<host:0, 0.502782931.1833566222.97128>
    [
      [3666, 1438, 318, 1757, 290, 616, 1388]
    ]
  >,
  "token_type_ids" => #Nx.Tensor<
    u32[1][7]
    EXLA.Backend<host:0, 0.502782931.1833566222.97130>
    [
      [0, 0, 0, 0, 0, 0, 0]
    ]
  >
}
To verify that this is exactly the text we provided above, we can check what each token means.
Bumblebee.Tokenizer.decode(tokenizer, [3666, 1438, 318, 1757, 290, 616, 1388])
"My name is John and my main"
Sometimes, while playing with tokenizers, you may come across weird characters like Ġ. Don't worry - this happens because tokenizers operate at the byte level and may cut words in unexpected places, but as long as the decoded text looks correct, you should be good to go.
Model inference
With the text tokenized, we are ready to try GPT-2. To do so, simply pass the model, the downloaded weights, and the prepared input.
output = Axon.predict(model, params, input)
After a second you should get a response from the model, but it will look different from what you know from tools like ChatGPT. Let's take a closer look at it.
%{
cache: #Axon.None<...>,
logits: #Nx.Tensor<
f32[1][7][50257]
EXLA.Backend<host:0, 0.502782931.1833566222.99355>
[
[
[-33.07360076904297, -32.33494186401367, -35.238014221191406, -34.77516555786133, -33.86668014526367, -34.452186584472656, -33.024147033691406, -33.58883285522461, -32.04574203491211, -34.41610336303711, -34.191070556640625, -30.15697479248047, -30.67748260498047, -30.273983001708984, -31.89931297302246, -34.27805709838867, -33.321781158447266, -33.531585693359375, -34.08525466918945, -34.172367095947266, -34.3142204284668, -34.81998062133789, -34.67438888549805, -34.668212890625, -34.99346160888672, -31.478322982788086, -33.271759033203125, -34.910186767578125, -33.903289794921875, -34.02199172973633, -32.871337890625, -34.703975677490234, -32.76276779174805, -33.283931732177734, -33.50212097167969, -33.37453079223633, -33.678306579589844, -33.35165023803711, -33.23114013671875, -33.39592742919922, -32.423004150390625, -33.56220626831055, -33.300479888916016, -33.525787353515625, -33.25764465332031, -33.98344802856445, -33.430992126464844, -33.69162368774414, ...],
...
]
]
>,
cross_attentions: #Axon.None<...>,
hidden_states: #Axon.None<...>,
attentions: #Axon.None<...>
}
Similarly to the tokenizer, the model also returned a map with a few keys. The one we are interested in is called "logits". If we look at its shape, we will see some familiar values.
%{logits: logits} = output
Nx.shape(logits)
{1, 7, 50257}
We passed 1 input with 7 tokens, and the vocabulary size was 50,257. So what does each of these values mean? Is it a probability? Well, a probability distribution must sum to one and all probabilities must lie in the range [0, 1]. If you look closely, you will find values less than 0 or much greater than 1, so no, it is not a valid probability distribution yet, but it will be in a second.
Logits are the raw, unnormalized output of the last layer, but they can be converted into a probability distribution by applying the softmax function. It is implemented in the Axon.Activations module, and to use it we can simply invoke:
probabilities = Axon.Activations.softmax(logits)
#Nx.Tensor<
f32[1][7][50257]
EXLA.Backend<host:0, 0.502782931.1833566222.99361>
[
[
[6.76328840199858e-4, 0.0014156418619677424, 7.765422924421728e-5, 1.233609509654343e-4, 3.060045710299164e-4, 1.7039061640389264e-4, 7.10616703145206e-4, 4.040131170768291e-4, 0.0018903894815593958, 1.7665114137344062e-4, 2.2123150120023638e-4, 0.012497768737375736, 0.007426408119499683, 0.011117738671600819, 0.0021884902380406857, 2.0280058379285038e-4, 5.27684751432389e-4, 4.2781655793078244e-4, 2.4592471891082823e-4, 2.2540823556482792e-4, 1.9559766224119812e-4, 1.1795457976404577e-4, 1.3644086720887572e-4, 1.3728612975683063e-4, 9.916831913869828e-5, 0.003334097098559141, 5.547520122490823e-4, 1.077801498468034e-4, 2.950044290628284e-4, 2.6198531850241125e-4, 8.27941345050931e-4, 1.3246315938886255e-4, 9.228921262547374e-4, 5.480401450768113e-4, 4.406095831654966e-4, 5.005709826946259e-4, 3.694345650728792e-4, 5.121563444845378e-4, 5.777493352070451e-4, 4.899742198176682e-4, 0.0012963087065145373, 4.149150918237865e-4, 5.390456644818187e-4, 4.3030441156588495e-4, 5.626375204883516e-4, 2.722803328651935e-4, 4.730911459773779e-4, 3.645473625510931e-4, 1.2901266745757312e-4, 3.226939879823476e-4, ...],
...
]
]
>
Let's check what happens when softmax is applied.
If we take a look at its formula, we will notice that Euler's number is raised to the power of each logit, which solves the problem of negative values. Then, each exponentiated value is divided by the sum of them all, which ensures that every result is smaller than 1 and that all results sum to exactly 1. This transformation allows us to interpret the values as probabilities, representing the model's confidence about every token.
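To see the arithmetic in isolation, here is the same idea applied by hand to a tiny, made-up logit vector:

# Softmax by hand: exponentiate each logit, then normalize by the sum.
logits_example = [2.0, 1.0, -1.0]
exponentiated = Enum.map(logits_example, &:math.exp/1)
sum = Enum.sum(exponentiated)
Enum.map(exponentiated, &(&1 / sum))
# => approximately [0.705, 0.259, 0.035], which sums to 1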
That means that we now have access to seven probability distributions, one for each provided token. We can inspect them and see the five tokens with the highest probability for each position.
{top_probabilities, top_tokens} = Nx.top_k(probabilities, k: 5)
input["input_ids"]
|> Nx.to_flat_list()
|> Enum.with_index()
|> Enum.each(fn {token, idx} ->
top_probabilities = Nx.to_list(top_probabilities[0][idx])
top_tokens = Nx.to_list(top_tokens[0][idx])
decoded_token = Bumblebee.Tokenizer.decode(tokenizer, token)
IO.puts("#{token}\t#{decoded_token}")
[top_probabilities, top_tokens]
|> Enum.zip()
|> Enum.each(fn {probability, token} ->
rounded_probability = Float.round(probability, 5)
decoded_token = Bumblebee.Tokenizer.decode(tokenizer, token)
IO.puts("#{rounded_probability}\t#{token}\t#{decoded_token}")
end)
IO.puts("")
end)
3666 My
0.01317 198 \n
0.0125 11 ,
0.01112 13 .
0.00743 12 -
0.00675 262 the
1438 name
0.84368 318 is
0.05219 338 's
0.01769 11 ,
0.01551 373 was
0.00672 290 and
318 is
0.0106 1757 John
0.00843 3700 James
0.00795 3271 David
0.0072 3899 Michael
0.00584 509 K
1757 John
0.04787 13 .
0.04535 11 ,
0.02295 290 and
0.02268 31780 Doe
0.01133 327 C
290 and
0.75795 314 I
0.03328 616 my
0.02476 428 this
0.01809 356 we
0.01027 340 it
616 my
0.1461 1438 name
0.12624 1641 family
0.0542 2802 mother
0.05075 2988 father
0.04649 3656 wife
1388 main
0.12841 3061 goal
0.11022 1693 job
0.04686 2962 focus
0.04646 4007 purpose
0.04367 2328 concern
This way you can see the model's predictions for all provided tokens, each based on the preceding sequence. Of course, the most interesting prediction is the last one, since this part was unknown and will be appended to the existing text. That means that for the sequence "My name is John and my main" the model predicted the following words: "goal", "job", "focus", "purpose", "concern", and all of them seem reasonable. Great! All we have to do is pick one token, append it to the text, and repeat the process. But how do we decide which token to choose? Usually they are drawn according to the returned probability distribution, so the chance that the word "goal" will be chosen is about 12.8%, for "job" it's about 11%, and so on.
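As an illustration, here is a minimal sketch of one such generation step, reusing the probabilities, tokenizer, and text from above; the helper variables (last_position, distribution, draw, next_token_id) are just for this example:

# Take the probability distribution of the last position and sample a
# token id from it; a toy implementation of one decoding step.
last_position = Nx.axis_size(probabilities, 1) - 1
distribution = Nx.to_flat_list(probabilities[0][last_position])

draw = :rand.uniform()

{next_token_id, _cumulative} =
  distribution
  |> Enum.with_index()
  |> Enum.reduce_while({length(distribution) - 1, 0.0}, fn {prob, id}, {fallback, acc} ->
    acc = acc + prob
    if acc >= draw, do: {:halt, {id, acc}}, else: {:cont, {fallback, acc}}
  end)

# Append the chosen token to the text and repeat the whole process.
text <> Bumblebee.Tokenizer.decode(tokenizer, [next_token_id])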
Temperature
You might have heard about temperature in the context of large language models. But what is it and how does it work? By manipulating the temperature, we can reshape the probability distribution, affecting the behavior of the model.
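Here is a small sketch (reusing the logits and tokenizer from earlier; the loop and variable names are just for illustration) that divides the logits by a few temperatures before applying softmax and prints the top predictions for the last position:

# Dividing logits by the temperature before softmax reshapes the
# distribution: values below 1 sharpen it, values above 1 flatten it.
for temperature <- [0.5, 1.0, 2.0] do
  probs = Axon.Activations.softmax(Nx.divide(logits, temperature))

  last_position = Nx.axis_size(probs, 1) - 1
  {top_probabilities, top_tokens} = Nx.top_k(probs[0][last_position], k: 5)

  IO.puts("temperature: #{temperature}")

  Enum.zip(Nx.to_list(top_probabilities), Nx.to_list(top_tokens))
  |> Enum.each(fn {probability, token} ->
    decoded_token = Bumblebee.Tokenizer.decode(tokenizer, token)
    IO.puts("#{Float.round(probability, 5)}\t#{decoded_token}")
  end)

  IO.puts("")
end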
Can you spot the difference? The lower the temperature, the less random the model's responses become, and the higher it is, the more unpredictable and creative they get. To keep the distribution unchanged, the temperature must be set to 1.
Summary
And that's it! It's important to have basic knowledge of how the increasingly popular large language models work. I hope this brief introduction helped you understand the subject and got you interested in the topic.
FAQ
What is language modeling?
Language modeling is a fundamental concept in natural language processing (NLP) that involves predicting the next word or character in a sequence by generating a probability distribution over possible continuations.
How does classical language modeling work?
Classical language modeling uses statistical methods like n-grams to predict the next word based on the previous words. Modern approaches use deep learning models, such as recurrent neural networks (RNNs) and transformers.
What is a transformer model?
A transformer model is a deep learning architecture that uses self-attention mechanisms to process input data. It's highly effective for NLP tasks, enabling models like GPT to understand and generate human-like text.
What is GPT-2?
GPT-2 is a generative pre-trained transformer model developed by OpenAI. It uses the transformer architecture to predict the next word in a sequence, allowing it to generate coherent and contextually relevant text.
How does tokenization work in GPT-2?
Tokenization converts text into numerical data that models can process. GPT-2 uses a vocabulary of 50,257 tokens to represent words and subwords, transforming input text into a sequence of these tokens.
What is the role of logits in GPT-2?
Logits are the raw outputs from the model's final layer before applying a softmax function to generate probabilities. They indicate the model's confidence in predicting each token in the vocabulary.
How does temperature affect text generation in GPT-2?
Temperature controls the randomness of predictions. Lower temperatures result in more deterministic outputs, while higher temperatures increase creativity by giving less probable tokens a better chance of being selected.
How do you use GPT-2 for text generation?
To generate text with GPT-2, input text is tokenized, passed through the model to obtain logits, and then probabilities are computed. The model selects the next token based on these probabilities, and this process repeats to generate extended text.
What libraries are needed for working with GPT-2 in Elixir?
The essential libraries for using GPT-2 in Elixir are Axon, Bumblebee, and Nx. Axon handles neural networks, Bumblebee interacts with models from Hugging Face, and Nx manages mathematical operations.