Five general steps

The process from a user’s input, or prompt, as it is generally called, to an AI-generated output follows a seemingly simple five-step process. Using the prompt and the context of the conversation, the model tries to predict the next word that would follow the given input. Once it has decided on what word to add, it repeats the process many times over, continuously looking at the original input plus the new words it has added, until it eventually decides to add a stop command. This process is aptly known as next-word prediction.

Next-word prediction can be broken down into five general steps: tokenization, vectorization, embedding, transforming through the neural network, and output generation and selection. Using the tabs below, you can go through each step, as well as some important notes on the selection of the next word. This first animation will show you the general cycle of next-word prediction. The animations in each step will show in more detail what happens during that step, explaining the different elements of this first animation.

1. Tokenization

The first step of next-word prediction is to cut the input up into smaller chunks that the model understands. These smaller chunks are called tokens. Whilst the process is called next-word prediction, in actuality tokens can be more than just words: they can be parts of words, punctuation, or even collections of words. In the case of ChatGPT’s first version, the model had a dictionary of over 50,000 tokens it could use to cut up the prompt.
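
To make this concrete, here is a minimal sketch of tokenization using the openly available tiktoken library; the prompt and the choice of encoding are our own illustration, not necessarily what ChatGPT uses internally:

```python
# A minimal sketch of tokenization, assuming the open-source tiktoken
# package is installed (pip install tiktoken). The "gpt2" encoding has a
# vocabulary of roughly 50,000 tokens, similar to the dictionary above.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

prompt = "Large language models predict the next token."
token_ids = enc.encode(prompt)                # one integer id per chunk

print(token_ids)                              # the ids the model works with
print([enc.decode([t]) for t in token_ids])   # the text chunk behind each id
```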

2. Vectorization

As computers work in numerical space, each token gets turned into a list of values known as a vector. This vector is basically a coordinate for where the token exists in the multidimensional table that encodes human language, created during the pre-training phase of machine learning. Each value in this list represents one of the dimensions of that table. To highlight the complexity of human language: ChatGPT’s first version used vectors with a length of over 12,000 values.
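
As an illustration, the sketch below looks up a vector for each token id in a made-up embedding matrix; the vector length is kept tiny here so the example runs quickly, whereas the real model used over 12,000 values per token:

```python
# A toy illustration of vectorization: each token id selects one row of an
# embedding matrix. The sizes here are deliberately small; ChatGPT's first
# version used roughly 50,000 rows of over 12,000 values each.
import numpy as np

vocab_size, n_dims = 50_000, 8                            # toy vector length
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, n_dims))  # learned in pre-training

token_ids = [21968, 3303, 4981]                           # made-up ids for three tokens
vectors = embedding_matrix[token_ids]                     # one vector per token
print(vectors.shape)                                      # (3, 8)
```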

3. Embedding

The vectors of all tokens in the input/prompt, together with a few additional vectors representing certain features of the text, such as the position of each token in the input or the similarity between two words, are combined into a table known as an embedding. This embedding is the numerical representation of the input and captures certain semantic and syntactic information.
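
A rough sketch of that combination, reusing the toy sizes from the previous snippet and adding one position vector per spot in the prompt:

```python
# A sketch of the embedding step: token vectors are combined with position
# vectors to form the table that gets fed into the neural network.
import numpy as np

n_tokens, n_dims = 3, 8
rng = np.random.default_rng(1)

token_vectors = rng.normal(size=(n_tokens, n_dims))     # from the vectorization step
position_vectors = rng.normal(size=(n_tokens, n_dims))  # encode where each token sits

embedding = token_vectors + position_vectors            # the table fed to the network
print(embedding.shape)                                  # (3, 8): one row per token
```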

4. Neural network

The embedding can now be fed forward through the transformer neural network. For each vectorized token (including the ones earlier in the input that already have tokens following them), the transformer uses its attention mechanism to try and predict the next token, taking the context of the conversation into consideration by assigning different weights to different tokens in the input. The next token predicted after the final vectorized token in the prompt gets saved, and the transformer feeds the embedding through to a new layer, where the process is repeated, although slightly altered based on the output of the previous layer. In ChatGPT’s first version, this process was repeated 96 times.
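
The sketch below shows the core of the attention mechanism in heavily simplified form, with random matrices standing in for the weights a real model learns during training, and without the many attention “heads” and layers a real transformer stacks on top:

```python
# A heavily simplified version of the attention mechanism inside one
# transformer layer: each token's new vector becomes a weighted mix of
# all token vectors, with the weights computed from the tokens themselves.
import numpy as np

def attention(embedding, w_q, w_k, w_v):
    queries, keys, values = embedding @ w_q, embedding @ w_k, embedding @ w_v
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])  # how much to attend to each token
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ values

rng = np.random.default_rng(2)
n_tokens, n_dims = 3, 8
embedding = rng.normal(size=(n_tokens, n_dims))
w_q, w_k, w_v = (rng.normal(size=(n_dims, n_dims)) for _ in range(3))
print(attention(embedding, w_q, w_k, w_v).shape)          # (3, 8)
```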

5. Output generation and selection

In the final layer of the transformer neural network, a list of vectors is created, alongside a list of probabilities that represent how likely each vector (which is a numerical representation of a token) would be to follow the input in natural language. The probabilities in this list are moderated by fine-tuning, and do not necessarily reflect the true probabilities you would get based only on the raw data the model was trained on. This list is translated back into readable words, and eventually one of the tokens is chosen to be added (the higher its probability, the likelier it is to be picked).
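
In simplified form, that final step might look like this, with a made-up four-token vocabulary and made-up scores:

```python
# A sketch of output generation and selection: raw scores are turned into
# probabilities, then one token is drawn at random, weighted by those
# probabilities, so more probable tokens win more often.
import numpy as np

vocab = ["cat", "dog", "car", "<stop>"]          # toy vocabulary
logits = np.array([2.1, 1.8, 0.3, -1.0])         # made-up scores from the last layer

probs = np.exp(logits) / np.exp(logits).sum()    # softmax: scores -> probabilities
rng = np.random.default_rng(3)
next_token = rng.choice(vocab, p=probs)

print(dict(zip(vocab, probs.round(3))))
print("chosen:", next_token)
```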

The process then repeats from step 1, until the neural network chooses to add the stop command.
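
Putting the five steps together, the whole cycle can be summarised in pseudocode-like Python; tokenize(), run_transformer(), sample() and detokenize() are hypothetical stand-ins for the steps sketched above, not functions from any real library:

```python
# The whole cycle in pseudocode-like Python. tokenize(), run_transformer(),
# sample() and detokenize() are hypothetical stand-ins for the steps above.
STOP_TOKEN = "<stop>"

def generate(prompt, max_tokens=200):
    tokens = tokenize(prompt)               # step 1: cut the input into tokens
    for _ in range(max_tokens):
        probs = run_transformer(tokens)     # steps 2-4: vectors, embedding, layers
        next_token = sample(probs)          # step 5: pick one token from the list
        if next_token == STOP_TOKEN:
            break                           # the model decided the answer is done
        tokens.append(next_token)           # the new token becomes part of the input
    return detokenize(tokens)
```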

Notes on output selection

As can be seen in the animations, LLMs do not simply focus on one token, but rather create a list of possibilities, and are not guaranteed to pick the most probable token to add. This probabilistic approach to language generation makes LLMs, and specifically chatbots such as ChatGPT, feel more human by providing more surprising or unexpected answers.

Which token gets chosen depends on a number of factors. For instance, there are so-called model ‘hyperparameters’, settings for the model as a whole, that dictate how likely the model is to forgo the most probable token. These include settings such as word differentiation, which makes words that have already been used often less likely to be chosen again. Of particular importance is the hyperparameter known as ‘temperature’. The higher the temperature of a model, the more likely it is that words with a lower probability of occurring naturally still get selected as output. This setting usually ranges from 0 to 1, with most generative AI text models having it set around 0.8 to produce more varied responses. However, for most generative AI models this setting is not available to users, so it is unclear exactly how random the responses can be.
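
The effect of temperature can be illustrated with a short sketch; the three scores are made up, and a real model applies this to tens of thousands of tokens at once:

```python
# A sketch of how temperature reshapes the probabilities before a token is
# chosen; the scores below are made up for illustration.
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = np.array(logits) / temperature     # low temperature sharpens the scores
    scaled -= scaled.max()                      # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.1, 1.0, 0.3]
for t in (0.2, 0.8, 1.0):
    print(t, softmax_with_temperature(logits, t).round(3))
# At 0.2 nearly all probability sits on the top token; at 1.0 the less
# likely tokens get a realistic chance of being picked.
```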

Additionally, developers can put in guardrails to protect against harmful or offensive language and information that slips through the fine-tuning. These guardrails are not directly embedded into the AI model, but rather act as a final check on language that wasn’t trained out through fine-tuning. These guardrails are updated frequently, and later on used as a basis for further refinement of the AI model when it is updated.
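
In heavily simplified form, such a final check could look like the sketch below; real guardrails are far more sophisticated, often involving separate classifier models rather than a word list, but the idea of checking the output after generation is the same:

```python
# A heavily simplified guardrail: a final check applied to the generated text
# before it is shown to the user. The blocked terms are placeholders.
BLOCKED_TERMS = {"blocked term one", "blocked term two"}

def apply_guardrail(generated_text):
    if any(term in generated_text.lower() for term in BLOCKED_TERMS):
        return "Sorry, I can't help with that."   # replace the response entirely
    return generated_text
```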