Using Large Language Models for Data Labeling | by Zachariah Zhang | Oct, 2022

Let AI annotate the data for you

Photo by Matt Briney on Unsplash

tl;dr: We can leverage the text generation power of large language models like GPT3 to generate labeled data for supervised learning. We do this with prompting, in which we give the LM a description of the task, some examples, and a new instance to label. The data generated this way is normally noisy and of lower quality than human labels. However, the speed of data collection and the ability to keep humans in the loop make it an efficient way to collect significant amounts of labeled data for difficult-to-label tasks.

It is well understood at this point that the quality and quantity of labeled data are the best predictors of the success of a machine learning project. In industry, having or not having this data often decides whether a project is green-lit or left on the back burner.

However, annotation can be extremely costly and time-consuming in many contexts. Labeling in specific domains such as health care or recruitment may require an expert to assist with annotation. This makes annotation more expensive and reduces the pool of potential annotators.

In addition, some tasks are not well suited to human annotation. For example, summarization is notoriously difficult to annotate because it is both low throughput (each example requires reading and understanding a potentially long document and then writing a summary) and suffers from high annotator disagreement (what is important to include in a summary varies significantly from person to person).

Language Model Enhanced Data Labeling

A growing trend in the ML literature is using large language models as a replacement for, or complement to, traditional annotation pipelines. This allows the model to do the heavy lifting for the most labor-intensive parts of the annotation pipeline.

LM-enhanced data annotation works as follows:

  1. Generate a noisy labeled dataset from a language model using prompting
  2. Human annotators either accept or reject a sample of the generated examples
  3. A critic model is trained on the annotators' decisions to predict whether a generated example would be accepted
  4. The critic model is applied to the generated dataset to filter out noisy examples

Fast Noisy Labels > Slow Clean Labels

When working as an ML engineer in industry, quickly turning out an effective PoC is often more important than finding the optimal approach. Engineering time is often one of a company’s most important resources, and allowing yourself to be blocked can be unacceptable, even by something as important as the data. Here we will see that, while the generated data is not always better, it is significantly faster and cheaper to produce, which can help answer the important question that every PM has before an ML project: will it work?

Man 1: "I am good at maths." Man 2: "So, what is 750 * 1920?" Man 1: "230." Man 2: "Not even close." Man 1: "But it was fast."
Speed to Market > Accuracy

In this article, I:

  • Briefly explain how to use a language model to generate data
  • Review a case study that uses an end-to-end labeling pipeline
  • Show a working example of bootstrapping a book genre classification dataset from titles

Large Language Models

Over the past several years, language models have grown rapidly in size and performance. Contemporary language models, such as GPT-3, are orders of magnitude larger than their predecessors in terms of model size, amount of training data, and amount of training time. These models have become extremely powerful and have shown a remarkable ability to generate text for new tasks without task-specific training data.

Prompt-Based Generation

We can generate labels for unlabeled data by building a text template, called a prompt, that is likely to produce a label when we let the language model continue it. Prompts generally have three components:

  • A description of the task we want to perform
  • Examples of the task being performed (also known as in-context demonstrations)
  • A new instance that we want to label
Prompt example for GPT3 food classification
Red: description of the task. Yellow: examples of the task being performed (also called in-context demonstrations). Blue: a new instance for which we want the model to generate a label.

The example above is a working example from the OpenAI Playground showing how we can label a food as vegetarian or not using a language model.
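The three prompt components can be assembled with a few lines of string handling. The sketch below is only an illustration: the helper name `build_prompt`, the `=>` separator, and the food examples are assumptions made for this example, not the exact text used in the Playground screenshot.

```python
def build_prompt(description, demonstrations, new_instance, sep="=>"):
    """Assemble a few-shot prompt from its three components."""
    # 1. Description of the task
    lines = [description]
    # 2. In-context demonstrations, one "input=> label" pair per line
    for text, label in demonstrations:
        lines.append(f"{text}{sep} {label}")
    # 3. The new instance, with the label left blank for the model to fill in
    lines.append(f"{new_instance}{sep}")
    return "\n".join(lines)


# Hypothetical task description and demonstrations, for illustration only
description = "Classify the given food as vegetarian or not vegetarian:"
demonstrations = [
    ("Lentil soup", "vegetarian"),
    ("Beef burger", "not vegetarian"),
]
prompt = build_prompt(description, demonstrations, "Mushroom risotto")
print(prompt)
```

Ending the prompt right after the separator nudges the model to complete the line with a label rather than with free-form text.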

The paper Symbolic Knowledge Distillation: from General Language Models to Commonsense Models applies this idea to generate data for a commonsense reasoning task. Yannic Kilcher gives a very detailed walkthrough of the technical details of the paper, which I highly recommend taking a look at, but I’ll summarize the important findings here.

Commonsense Reasoning Task

The task is to generate commonsense knowledge triples. These take the form of a subject, a predicate, and a result, as shown below:

Generated Dataset

The authors generate the dataset with GPT3 using the approach described in the previous section. They can generate a dataset 10 times larger at 15% of the cost of human annotation.
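To put these two figures on a per-example basis, here is a back-of-the-envelope calculation. It assumes the 15% refers to the total cost of the generated dataset relative to the total cost C of a human-annotated dataset with N examples; the symbols C and N are introduced here only for illustration.

```latex
\frac{c_{\text{GPT3}}}{c_{\text{human}}}
  = \frac{0.15\,C \,/\, (10\,N)}{C \,/\, N}
  = \frac{0.15}{10}
  = 0.015
```

Under that assumption, each generated example costs about 1.5% of a human-annotated one, or roughly 67 times less.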

Quality of Generated Data

It doesn’t matter how much data we have if it’s all garbage. The authors evaluate the quality of both the human-annotated data and the data generated by GPT3. Unsurprisingly, the raw generated data is of lower quality, as shown in the Acceptance Rate column. However, once we apply the critic filtering model, the quality of the dataset increases and exceeds that of the human dataset.

Impact on Downstream Performance

The process has produced more data, at higher quality and lower cost than human annotation, but what is its impact on downstream models? The authors show that, simply by increasing the size of the dataset, the same model can achieve a significant jump in performance.

Let’s see how we can use this approach to predict the genre of a book given its title. We will use data from the book genre prediction Kaggle dataset, which contains 4,657 books with titles, summaries, and genres. OpenAI’s API charges by token, so I’ll limit the exploration to book titles, but the same approach would apply to the longer summaries.

Loading Data

First, we load the data downloaded from Kaggle and set our OpenAI API key.

import os

import openai
import pandas as pd
from tqdm import tqdm

# Read the API key from the environment rather than hard-coding it
openai.api_key = os.getenv("OPENAI_API_KEY")
books = pd.read_csv("./data.csv")

In-Context Demonstrations

We sample one example of each genre (11 in total) to serve as demonstrations for the model.

# Sample 1 labeled example of each class to serve as the seed for our generator model
few_shot_example_idx = []
for genre in books["genre"].value_counts().index:
    few_shot_example_idx.append(books[books["genre"] == genre].sample(1).index[0])

Prompt Design

I write a prompt that gives a short description of the problem followed by a random demonstration for each genre. I then append the title of the book we want to label, as shown below:

# Create a template for the prompt with the examples
prompt_template = "Classify the given book title as thriller, fantasy, science, history, horror, crime, romance, psychology, sports or travel:\n"
for idx in few_shot_example_idx:
    # Add one "title=> genre" demonstration per line
    prompt_template += f'{books["title"].loc[idx]}=> {books["genre"].loc[idx]}\n'
Classify the given book title as thriller, fantasy, science, history, horror, crime, romance, psychology, sports or travel:
Deception Point=> thriller
Hounded=> fantasy
The Star Fraction=> science
Laura Blundy=> history
The Vampire Lestat=> horror
At Bertram's Hotel=> crime
City of Lost Souls=> romance
The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life=> psychology
Long Shot=> sports
The Old Ways: A Journey on Foot=> travel
Drowned Wednesday=>

Generate Data

We call OpenAI to generate a label for each example. I recommend testing on a smaller scale first to verify the prompt before processing the entire dataset. It took approximately 10 minutes to generate the remaining annotations from this seed of 11 examples.

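gpt3_annotations = []
for i in tqdm(range(books.shape[0])):
    prompt = prompt_template + books.iloc[i]["title"] + "=>"
    # NOTE: the arguments of this call are missing from the post as published.
    # The values below are assumptions for the legacy (pre-1.0) openai client.
    response = openai.Completion.create(
        model="text-davinci-002",  # assumed engine
        prompt=prompt,
        max_tokens=5,  # assumed: a genre is only a few tokens
        temperature=0,  # assumed: deterministic labels
    )
    gpt3_annotations.append(response.to_dict()["choices"][0]["text"].strip())
books["gpt3_annotations"] = gpt3_annotations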
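Following the advice above to test on a smaller scale first, one way to check a prompt is to label a small random sample and measure agreement with the known genres. This is a minimal sketch: `pilot_accuracy` and `label_fn` are hypothetical names, and the stand-in labeler and toy titles below exist only so the example runs without calling the API.

```python
import random


def pilot_accuracy(label_fn, titles, true_labels, n=20, seed=0):
    """Label a random sample of up to n titles and return agreement with the true labels."""
    if not titles:
        raise ValueError("titles must not be empty")
    rng = random.Random(seed)
    n = min(n, len(titles))
    sample_idx = rng.sample(range(len(titles)), n)
    hits = sum(label_fn(titles[i]) == true_labels[i] for i in sample_idx)
    return hits / n


# Stand-in labeler so the sketch runs offline.
# In practice this would wrap the prompt and the API call.
def fake_label_fn(title):
    return "horror" if "Vampire" in title else "fantasy"


# Toy data for illustration only
titles = ["The Vampire Lestat", "Hounded", "Drowned Wednesday", "Deception Point"]
genres = ["horror", "fantasy", "fantasy", "thriller"]
print(pilot_accuracy(fake_label_fn, titles, genres, n=4))
```

If agreement on the sample is close to chance, it is cheaper to rework the prompt at this point than after paying for the full dataset.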

Feature Representation

I represent titles using the popular sentence-transformers library, which creates an embedding for each title to use in a downstream logistic regression model (to keep things simple).

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sentence_transformers import SentenceTransformer

le = preprocessing.LabelEncoder()
true_labels = le.fit_transform(books["genre"])
noisy_labels = le.transform(books["gpt3_annotations"])

model = SentenceTransformer('all-MiniLM-L6-v2')
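# Encode titles as a plain list of strings
embeddings = model.encode(books["title"].tolist())

# Split features, true labels and noisy labels with the same shuffle
(
    embeddings_train, embeddings_test,
    y_true_train, y_true_test,
    y_noisy_train, y_noisy_test,
) = train_test_split(embeddings, true_labels, noisy_labels, test_size=0.2, random_state=42)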
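One practical caveat with the label encoding above: `LabelEncoder.transform` raises a `ValueError` on any value it did not see in `fit`, so a completion such as "Thriller." or an unexpected genre would stop the script. A small normalization step can guard against this. The helper below is a hypothetical sketch, not part of the pipeline as originally run.

```python
def normalize_annotation(raw, known_labels, fallback=None):
    """Map a raw model completion onto one of the known labels, or return fallback."""
    stripped = raw.strip()
    # Keep only the first line in case the model kept generating text
    first_line = stripped.splitlines()[0] if stripped else ""
    # Ignore case, surrounding whitespace and trailing punctuation
    cleaned = first_line.strip().strip(".,;:!").lower()
    lookup = {label.lower(): label for label in known_labels}
    return lookup.get(cleaned, fallback)


known_genres = ["thriller", "fantasy", "science", "history", "horror",
                "crime", "romance", "psychology", "sports", "travel"]
print(normalize_annotation(" Thriller.\nHounded=> fantasy", known_genres))
print(normalize_annotation("science fiction", known_genres))
```

Rows that come back as the fallback value can be dropped, or re-queried, before the labels are encoded.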

Critic Model

The critic model learns to accept or reject generated examples based on human judgments. Here we train it on 100 examples. Note that this can be learned very efficiently:

  1. It makes the annotation task easier for humans by turning it into a binary labeling task
  2. It allows for more sample-efficient learning, as we do not need coverage of the entire label space
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulate a human critic accepting or rejecting the label
critic_examples = np.random.randint(embeddings_train.shape[0], size=100)

# Train model to learn from these critic examples
critic_features = embeddings_train[critic_examples]
critic_labels = (y_noisy_train == y_true_train)[critic_examples]
critic_model = LogisticRegression().fit(critic_features, critic_labels)

# Score examples in the dataset and remove those that score in the lowest 30% for acceptance
critic_scores = critic_model.predict_proba(embeddings_train)[:, 1]
keep = critic_scores > np.percentile(critic_scores, 30)
filtered_training_input = embeddings_train[keep]
filtered_training_label = y_noisy_train[keep]

Model Comparison

I train a simple logistic regression model on each dataset (the small manually labeled seed dataset, the noisy labels without the critic, the critic-filtered data, and the fully labeled dataset) and compare the resulting accuracy. I also report the number of human annotations required for each dataset.

While we certainly perform worse than full supervision, the model is quite respectable considering that it only has access to 11 human-labeled examples. I think the gap could be narrowed further by putting more effort into tweaking the prompt and using a more expensive engine.

| Training data (human annotations required) | Accuracy |
|---|---|
| Only Labeled Examples (11) | 0.117 |
| Noisy Examples (11) | 0.177 |
| Critic Filtered (111) | 0.185 |
| True Labels (3725) | 0.251 |

Traditional human annotation can be costly in terms of money and engineering time. A new human-in-the-loop, language-model-augmented data annotation paradigm aims to take much of the load off humans by harnessing GPT3’s massive generation capabilities. This technique can be very useful in tasks where:

  • Expert annotators are needed (e.g. health care)
  • Annotations are labor intensive (e.g. abstractive summarization)
  • Time to market is critical (faster > more accurate for an MVP)

By involving humans as reviewers in the annotation process rather than as the annotators themselves, we can still control the quality of the data produced and obtain high-quality data in domains where inter-annotator disagreement may be high.
