A Simple Overview of the LLM Training Steps:
Unsupervised Pretraining:
High quantity, low quality data. The model is trained to predict the next token over trillions of tokens. This produces what is called the foundation or base model.
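The pretraining objective above can be sketched as a cross-entropy loss on the next token: the model assigns a probability to every token in the vocabulary, and training pushes up the probability of the token that actually came next. A minimal sketch (toy 4-token vocabulary, no real model):

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy loss for a single next-token prediction.

    logits: unnormalized scores over the vocabulary (list of floats)
    target_id: index of the token that actually came next in the text
    """
    # softmax -> probability the model assigns to the true next token
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    p_target = exps[target_id] / sum(exps)
    return -math.log(p_target)

# toy vocabulary of 4 tokens; the model's logits slightly prefer token 2
loss = next_token_loss([0.1, 0.2, 1.5, -0.3], target_id=2)
```

Pretraining is just this loss averaged over every position in trillions of tokens of text; the loss is low when the model already expected the true next token.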
Supervised Finetuning:
Low quantity, high quality {prompt, response} pairs. Enables the model to be finetuned for dialogue, turning the base model into a chatbot. Often referred to as instruction tuning.
Reinforcement Learning from Human Feedback (RLHF): lots of innovation going on here (we will cover DPO, IPO, and KTO soon)
This is a two-step process:
a. Train a reward model to act as a scoring function:
This model takes in a prompt + response and returns a score indicating how good the response is. Human labelers are asked to pick the better of two responses, and this preference data is used to train the model.
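The reward model is commonly trained with a pairwise (Bradley-Terry style) loss: given the scores for the chosen and rejected responses to the same prompt, the loss pushes the preferred response's score above the other. A minimal sketch of that loss:

```python
import math

def pairwise_loss(score_chosen, score_rejected):
    """Pairwise preference loss: -log(sigmoid(chosen - rejected)).

    Small when the reward model already scores the human-preferred
    response higher; large when it gets the ranking backwards.
    """
    diff = score_chosen - score_rejected
    # numerically stable form of -log(sigmoid(diff))
    return math.log(1.0 + math.exp(-diff))

# labelers preferred response A (scored 1.2) over response B (scored -0.4)
loss = pairwise_loss(1.2, -0.4)
```

Note the loss only cares about score *differences*, which is why reward-model scores are relative, not absolute quality measures.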
b. Optimize LLM to generate responses for which the reward model will give high scores.
Use an iterative procedure to update part of the model so that it:
- Produces outputs with higher reward scores
- Stays close to the SFT model from Step 2
- Doesn't get worse at plain text completion
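The constraints above are usually enforced by shaping the reward: the score from the reward model is reduced by a KL-style penalty that grows as the policy's token probabilities drift away from the SFT model's. A minimal sketch (the beta value is illustrative, not a recommended setting):

```python
def shaped_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Reward optimized during RLHF in one common formulation.

    rm_score:    reward model's score for the generated response
    logp_policy: log-prob of the response under the current policy
    logp_sft:    log-prob of the same response under the frozen SFT model
    beta:        penalty strength (illustrative value, a tuning knob)
    """
    # log-prob ratio; positive when the policy has drifted from SFT
    kl_penalty = logp_policy - logp_sft
    return rm_score - beta * kl_penalty
```

When the policy matches the SFT model exactly, the penalty vanishes and the shaped reward equals the reward model's score; the more the policy drifts, the more reward it must give up.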
Specifically, for this phase it is better to think of the model as learning an optimal strategy (policy) for predicting a probability distribution over tokens; we want to tweak this distribution to produce higher quality text. Here:
- The policy is a language model that takes in a prompt and returns a probability distribution over text.
- The action space of this policy is all the tokens in the vocabulary of the language model (~50k tokens).
- The observation space is the distribution of possible input token sequences.
- The reward model is a combination of the preference model (score higher quality text higher) and a constraint on policy shift (don't change too much, don't get worse at text completion).
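One "action" of this policy can be sketched concretely: the language model emits logits over the vocabulary, which are converted to a probability distribution, and a token id is sampled from it. A toy sketch (3-token vocabulary instead of ~50k; temperature is an illustrative knob):

```python
import math
import random

def policy_step(logits, temperature=1.0):
    """One action of the RLHF 'policy': turn LM logits into a
    probability distribution over the vocabulary and sample a token id.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    token_id = random.choices(range(len(probs)), weights=probs, k=1)[0]
    return token_id, probs

token, probs = policy_step([2.0, 0.5, -1.0])
```

RLHF nudges exactly these per-token probabilities: the update raises the probability of token choices that lead to high shaped rewards.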