InstructLab Training: Help Your AI Models Grow Up Quickly

80% of AI models never see the real world. InstructLab helps us train models fast with less data, time, and resources.

80% of AI models never make it to the real world.

Training AI models is slow and difficult. It requires a ton of data, specialized systems, time and expertise.

And if our models never make it to production, our customers (and business) will see zero value. Like a kid that never leaves the house.

Fortunately, InstructLab is an easy way of helping our models grow up. And fast.

Why do we use InstructLab?

InstructLab helps us train AI models quickly. You need less data and less time, and you can do it on your local machine. You don’t have to be a data scientist.

With a few lines of text, we can add skills and knowledge to AI models.

And since InstructLab makes contributing to models easy, models can improve rapidly. What open source did for software development, InstructLab aims to do for model tuning.

How do we use InstructLab?

InstructLab is a school that helps our AI models learn and grow. This takes place in three steps.

  1. Contribute Knowledge. We provide knowledge and skills in Q&A format. These serve as examples of what the model will learn.
  2. Generate Data. InstructLab generates many examples similar to the ones we provided. We don’t have to find huge datasets ourselves.
  3. Train the Model. InstructLab takes the generated data and uses it to train the model. We’ll also check our model to make sure it has learned correctly.

What You’ll Need

  • M1/M2/M3 Mac or Linux system
  • ~25GB of free disk space
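
Before starting, it can help to confirm the disk requirement. The sketch below is one way to do that in a POSIX shell. The helper name has_free_gb is made up for this example, and it assumes df -Pk reports available space (in 1024-byte blocks) in the fourth column, which holds on typical Linux and macOS setups.

```shell
# Hypothetical helper: succeed if directory $1 has at least $2 GB free.
# Assumes `df -Pk` prints a header line, then one line whose 4th column
# is the available space in 1024-byte blocks.
has_free_gb() {
  dir=$1
  need_gb=$2
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Check the current directory against the ~25GB mentioned above.
if has_free_gb . 25; then
  echo "Enough free disk space"
else
  echo "Less than 25GB free; consider freeing some space first"
fi
```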

Step 1: Install the ilab CLI

First, we’ll install the necessary packages to run InstructLab.

For Linux:

sudo dnf install gcc gcc-c++ make git python3.11 python3.11-devel

For macOS:

xcode-select --install

Then, create a new directory to store the files the ilab CLI needs when running.

mkdir instructlab && cd instructlab

Last, we’ll create and activate a virtual environment and install the InstructLab CLI. These commands should work for most setups, but the InstructLab documentation also lists alternative installations.

python3 -m venv --upgrade-deps venv
source venv/bin/activate
pip install instructlab

Verify that InstructLab was installed correctly. From your virtual environment (venv), run the InstructLab CLI.

ilab

You should see something like this.

Usage: ilab [OPTIONS] COMMAND [ARGS]...

CLI for interacting with InstructLab.

Finally, we’ll initialize our InstructLab repository so we can create new knowledge and skills.

ilab config init --non-interactive

Step 2: Download the Model

First, we’ll download a model to use for our training.

ilab model download

This downloads a model called Merlinite from Hugging Face. The Merlinite model is a small, pre-trained model by IBM. It’s perfect for local development.

Once we have our model, we'll chat with it to see what it knows.

ilab model chat

Let’s ask a question about InstructLab.

What is InstructLab?

The response is well formatted – but wrong. And like a child, sometimes our AI models make things up when they don’t know. We call these hallucinations.

Step 3: Add new knowledge to the model

We’ll add knowledge to our model. This is done via a taxonomy. A taxonomy is a tree structure that organizes things – in our case, knowledge and skills.

Quit the chat and open the taxonomy folder.

exit
ls taxonomy

Here you’ll find a few things. But there are three important directories.

  • Knowledge - factual knowledge
  • Foundational skills - basic skills our model has: math, coding, language skills, reasoning
  • Compositional skills - skills that combine knowledge and foundational skills to answer complex questions, for example writing an email or an earnings report
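
To get a feel for how the tree is organized, you can list the question and answer files it already contains. This is a small sketch, assuming the taxonomy folder sits in the current directory (as in the ls step above) and that each contribution lives in a file named qna.yaml, like the one we add next. The helper name list_qna_files is made up for this example.

```shell
# Hypothetical helper: list every qna.yaml under a taxonomy directory,
# one path per line, sorted so the tree structure is easy to read.
list_qna_files() {
  find "$1" -type f -name 'qna.yaml' | sort
}

# Only run the listing if the taxonomy folder is where we expect it.
if [ -d taxonomy ]; then
  list_qna_files taxonomy
fi
```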

To teach our model new knowledge, add a new directory and download our knowledge file.

mkdir -p taxonomy/knowledge/technical_manual/instructlab
wget https://code-like-the-wind.s3.us-east-2.amazonaws.com/qna.yaml -P taxonomy/knowledge/technical_manual/instructlab

Here's what the file looks like.

seed_examples:
  - question: 'What is InstructLab?'
    answer: |
      InstructLab is ....
task_description: "To teach a language model about InstructLab."
document:
  repo: https://github.com/tolarewaju3/knowledge
  commit: 69a24c7612d5bce06de2f575017664b1953b8921
  patterns:
    - instructlab-knowledge.md

There’s a lot going on here. But when we break it down, there are three main components of these question and answer files.

  1. seed_examples - example questions and answers to help our model learn
  2. document - the source of our knowledge contribution, including the repo, commit, and files to look for 
  3. task_description - explains what the purpose of the file is

This is all we need to teach our model what InstructLab is. Check to make sure our format is correct.

ilab taxonomy diff
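
As a quick complement to ilab taxonomy diff, a plain-text check can confirm that the three sections described above are present. This is only a sketch: it uses grep rather than a real YAML parser, the helper name missing_qna_keys is made up for this example, and the file path assumes the directory we created earlier.

```shell
# Hypothetical helper: print each expected top-level key that is missing
# from a qna.yaml file. A plain-text check, not a YAML validator.
missing_qna_keys() {
  for key in seed_examples task_description document; do
    grep -q "^${key}:" "$1" || echo "$key"
  done
}

# Path assumed from the mkdir/wget step above.
qna=taxonomy/knowledge/technical_manual/instructlab/qna.yaml
if [ -f "$qna" ]; then
  missing=$(missing_qna_keys "$qna")
  if [ -z "$missing" ]; then
    echo "All three sections are present"
  else
    echo "Missing sections: $missing"
  fi
fi
```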

Step 4: Train the Model

We’re now ready to start training our model. First, we’ll generate data based on our questions and answers.

ilab data generate

This could take a while. I’m using an M2 Mac and it took about 25 minutes.

You’ll see the teacher model generating different permutations of your examples. Don’t worry if they’re not exactly as you would have written them. 

After that finishes, we’ll train our model.

ilab model train

Training takes the data the teacher model generated and uses it to train our “student” model. This will also take some time.

If you’re on a Mac, you’ll need to convert the model into a format that can be run.

ilab model convert

Step 5: Test our new model

To test our new model, we need to chat with it again.

There should be a folder named <your_model_name>-trained. Your trained model is inside this folder with a .gguf extension. Start this model.

ilab model serve --model-path instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab-Q4_K_M.gguf
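
If the folder or file name on your machine differs from the one above, a search like the following can locate it. This is a sketch that assumes the trained folder name ends in -trained and the model file ends in .gguf, as described above. The helper name find_trained_gguf is made up for this example.

```shell
# Hypothetical helper: list .gguf files that sit inside any folder whose
# name ends in "-trained", searching under directory $1.
find_trained_gguf() {
  find "$1" -type f -path '*-trained/*' -name '*.gguf' | sort
}

# Search from the current directory (the instructlab folder).
find_trained_gguf .
```

Pass the path it prints to --model-path.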

Wait until the model has started.

INFO 2024-08-15 13:55:26,335 server.py:219: server After application startup complete see http://127.0.0.1:8000/docs for API.
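
Instead of watching the log, you can also ask the server directly whether it is up. The sketch below assumes curl is installed and that the /docs URL from the log line answers with HTTP 200 once startup is complete. The helper name server_ready is made up for this example.

```shell
# Hypothetical helper: succeed if URL $1 answers with HTTP 200
# within two seconds. Fails if curl is missing or the connection is refused.
server_ready() {
  code=$(curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$1") || return 1
  [ "$code" = "200" ]
}

# URL taken from the startup log line above.
if server_ready http://127.0.0.1:8000/docs; then
  echo "Model server is up"
else
  echo "Model server is not responding yet"
fi
```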

In a different window, go to your instructlab folder, enter your virtual environment and start the CLI.

cd instructlab
source venv/bin/activate
ilab

Chat with your model.

ilab model chat -m instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab-Q4_K_M.gguf

Ask the same question again.

What is InstructLab?

You should get a much more accurate answer. Congratulations! With no specialized hardware and in under a few hours, your AI model has grown up.

Best Practices

For best results, provide 5-8 examples of skills and knowledge. The perk of InstructLab is that we don’t need a ton of data, but more data is still better for our models.
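
To see how many examples a file currently has, you can count them from the shell. This is a rough sketch that assumes each seed example starts with a "- question:" line, as in the sample file shown earlier. The helper name count_seed_examples is made up for this example, and the path assumes the directory we created in Step 3.

```shell
# Hypothetical helper: count the seed examples in a qna.yaml file by
# counting lines that begin a "- question:" list entry.
# `|| true` keeps the exit status at 0 even when the count is 0.
count_seed_examples() {
  grep -c '^[[:space:]]*-[[:space:]]*question:' "$1" || true
}

# Path assumed from the mkdir/wget step in Step 3.
qna=taxonomy/knowledge/technical_manual/instructlab/qna.yaml
if [ -f "$qna" ]; then
  n=$(count_seed_examples "$qna")
  echo "$n seed examples found (5-8 is the suggested range)"
fi
```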

Recap

InstructLab makes it easy to teach our AI models new skills and knowledge. We add a few examples to a taxonomy, generate synthetic data, and use that to train our model.

This simple process allows anyone to contribute to a model, which rapidly improves the quality of our models.
