Unlocking the treasure trove of unstructured data with deep learning
Search is a crucial functionality in many applications and companies globally. Whether in manufacturing, finance, healthcare, or almost any other industry, organizations have vast internal information and document repositories.
Unfortunately, the scale of many companies’ data means that the organization and accessibility of information can become incredibly inefficient. The problem is exacerbated for language-based information.
Language is a tool for people to communicate often abstract ideas and concepts. Naturally, ideas and concepts are harder for a computer to comprehend and store in a meaningful way.
Most organizations rely on a cluster of keyword-based search interfaces hosted on various ‘internal portals’ to deal with language data. This can satisfy business requirements for some of that data if done well.
A keyword-based search is ideal if a person knows what they’re looking for and the keywords and terminology of the information they need. When the keywords and terminology of the answer are unknown, keyword search is inadequate. People searching for unknown answers in large repositories of documents is a drain on productivity.
How do we minimize this problem? The answer lies with semantic search, specifically with the question-answering (QA) flavor of semantic search.
Semantic search allows us to search based on concepts and ideas rather than keywords. Given a phrase, a semantic search tool returns the most semantically similar phrases from a repository.
Question-answering takes this idea further by searching using a natural language question and returning relevant documents and specific answers. QA aims to mimic natural language as much as possible.
If we asked a shop assistant, “where are those tasty, freshly baked things that are not cookies but look like cookies?” we would expect directions that take us to those things. This natural form of conversation is what QA aims to reproduce.
This article will introduce the different forms of QA, the components of these ‘QA stacks’, and where we might use them.
Before we dive into the details, let us paint a high-level picture of QA.
First, our focus is on open-domain QA (ODQA). ODQA systems deal with questions across broad topics and cannot rely on specific rules in your code.
The alternative to open-domain is closed-domain, which focuses on a limited domain/scope and can often rely on explicit logic. We will not cover closed-domain QA.
I will use ODQA and QA interchangeably for the remainder of the article. ODQA models can be split into a few subcategories.
The most common form of QA is open-book extractive QA (top-left above). Here we combine an information retrieval (IR) step and a reading comprehension (RC) step.
Any open-book QA requires an IR step to retrieve relevant information from the ‘open-book’.
Just as with open-book exams, where students can refer to their books for information during an exam, the model can refer to an external source of information. That source of information may be internal company documents, Wikipedia, Reddit, or any other information source that is not the model itself.
The IR step retrieves relevant documents and passes them to the RC (reader) step. RC consists of extracting a succinct answer from a sentence or paragraph, typically referred to as the document or context.
The other two types of QA rely on generating answers rather than extracting them. OpenAI’s GPT models are well-known generative transformer models.
In open-book abstractive QA, the first IR step is the same as extractive QA; relevant contexts are retrieved from an external source. These contexts are passed to the text generation model (such as GPT) and used to generate (not extract) an answer.
Alternatively, we can use closed-book abstractive QA. Here there is only a text generation model and no IR step. The generator model will generate an answer based on its own internal learned representation of the world. It cannot refer to any external source of information, hence the name closed-book.
Let’s dive into each of these approaches and learn where we might apply each.
Extractive QA is arguably the most widely applicable form of question-answering. It allows us to ask a question and then extract an answer from a short text. For example, we have the text (or context):
Super Bowl 50 was an American football game to determine the
champion of the National Football League (NFL) for the 2015 season.
The American Football Conference (AFC) champion Denver Broncos
defeated the National Football Conference (NFC) champion Carolina
Panthers 24–10 to earn their third Super Bowl title. The game was
played on February 7, 2016, at Levi's Stadium in the San Francisco
Bay Area at Santa Clara, California.
To which we could ask the question "which team represented the AFC at Super Bowl 50?" and we should expect the answer 'Denver Broncos' to be returned.
The example where we present a single context and extract an answer is reading comprehension (RC).
Alone, RC is not particularly useful, but we can couple it with an external data source and search through many contexts, not just one. We call this ‘open-book extractive QA’, more commonly referred to as just extractive QA. It is not a single model but consists of three components:
- Indexed data (document store/vector database)
- Retriever model
- Reader model
Before asking questions, open-book QA requires indexing data that our retriever model can later access. Typically this will be chunks of sentence-to-paragraph-sized text.
Let’s work through an example. First, we need data. A popular QA dataset is the Stanford Question Answering Dataset (SQuAD). We can download this dataset using Hugging Face’s datasets library.
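A minimal sketch of this step, assuming the `squad` dataset identifier on the Hugging Face Hub and the `train` split. The helper names `load_squad` and `unique_contexts` are illustrative and not from the original article. SQuAD pairs several questions with the same paragraph, so we keep each distinct context only once before indexing.

```python
def load_squad(split="train"):
    """Download SQuAD via the datasets library (needs network access)."""
    # Imported lazily so the helper below also works without the library.
    from datasets import load_dataset
    return load_dataset("squad", split=split)


def unique_contexts(records):
    """Return each distinct context once, in first-seen order.

    `records` is any iterable of dict-like rows with a "context" key,
    which is how SQuAD rows look when iterated.
    """
    seen = set()
    contexts = []
    for record in records:
        context = record["context"]
        if context not in seen:
            seen.add(context)
            contexts.append(context)
    return contexts


# Typical usage (commented out because it downloads the dataset):
# qa = load_squad()
# contexts = unique_contexts(qa)
```

Deduplicating up front keeps the index smaller and avoids returning the same paragraph several times for one query.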
Each record in the dataset has a context feature. It is these contexts that should be indexed in our database.
Options for the type of database vary based on the retriever model. A traditional retriever uses sparse vector retrieval with TF-IDF or BM25.
These models return contexts based on the frequency of matching words between a context and the question. More word matches equate to higher relevance. Elasticsearch is the most popular database solution for this, thanks to its scalable and strong keyword search capabilities.
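To make the idea of word-match scoring concrete, here is a small, self-contained BM25 sketch. It is an illustration only: the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions, and a production system such as Elasticsearch adds far more text analysis.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with BM25."""
    if not docs:
        return []
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = (sum(len(doc) for doc in tokenized) / n_docs) or 1.0

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Smoothed IDF that never goes negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency, dampened and normalized by document length.
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the broncos won the super bowl",
    "the panthers lost the game",
    "fresh bread is baked daily",
]
scores = bm25_scores("broncos super bowl", docs)
```

Only the first document shares words with the query, so it is the only one with a non-zero score. A query phrased with different vocabulary would score zero everywhere, which is exactly the limitation dense retrieval addresses.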
The other option is to use dense vector retrieval with sentence vectors built by transformer models like BERT. Dense vectors have the advantage of enabling search via semantics, that is, searching with the meaning of a question as described in the ‘tasty, freshly baked things’ example. For this, a vector database like Pinecone or a standalone vector index like Faiss is needed.
We will try the dense vector approach. First, we encode our contexts with a QA model like multi-qa-MiniLM-L6-cos-v1 from sentence-transformers. We start by initializing this model.
Using the model, we encode the contexts inside our dataset object qa to create the sentence vector representations to be indexed in our vector database.
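The initialization and encoding steps might look roughly like the sketch below. The helper names `load_encoder` and `encode_contexts` and the batch size are assumptions for illustration. The encoder is passed in as an argument, so anything with an `encode` method that accepts a list of strings will work.

```python
def load_encoder(model_name="multi-qa-MiniLM-L6-cos-v1"):
    """Load the sentence-transformers model (downloads weights on first use)."""
    # Imported lazily so encode_contexts can be used without the library.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)


def encode_contexts(contexts, encoder, batch_size=64):
    """Encode contexts in batches and return one vector per context."""
    vectors = []
    for start in range(0, len(contexts), batch_size):
        batch = contexts[start:start + batch_size]
        # encoder.encode returns one vector per input string.
        vectors.extend(encoder.encode(batch))
    return vectors


# Typical usage (commented out because it downloads the model):
# model = load_encoder()
# vectors = encode_contexts(contexts, model)
```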
Now we can go ahead and store these inside a vector database. We will use Pinecone in this example (which does require a free API key). First, we initialize a connection to Pinecone, create a new index, and connect to it.
From there, all we need to do is upsert (upload and insert) our vectors to the Pinecone index. We do this in batches, where each sample is a tuple.
Once the contexts have been indexed inside the database, we can move on to the QA process.
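The batched indexing step described above can be sketched without tying it to a particular client version. Here the `(id, vector)` tuple layout is an assumption, and `upsert` is any callable that accepts one batch. With a Pinecone index object this could be a thin wrapper such as `lambda batch: index.upsert(vectors=batch)`, assuming the installed client accepts tuples in that form.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upsert_in_batches(ids, vectors, upsert, batch_size=100):
    """Send (id, vector) tuples to `upsert` one batch at a time.

    Returns the number of records sent.
    """
    records = list(zip(ids, vectors))
    total = 0
    for batch in batched(records, batch_size):
        upsert(batch)
        total += len(batch)
    return total
```

Batching keeps each request small, which matters once the number of contexts runs into the tens of thousands.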
Given a question/query, the retriever creates a sparse/dense vector representation called a query vector. This query vector is compared against all of the already indexed context vectors in the database. The n most similar are returned.
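As a rough illustration of what the database does with a dense query vector, the sketch below compares the query against every indexed vector using cosine similarity and keeps the n best matches. The toy two-dimensional vectors and the function names are assumptions. A real vector database uses approximate search rather than this exhaustive loop.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n(query_vector, indexed, n=3):
    """Return the n (id, score) pairs most similar to the query.

    `indexed` maps a context id to its vector.
    """
    scored = [(cosine(query_vector, vec), doc_id) for doc_id, vec in indexed.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:n]]


indexed = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
matches = top_n([1.0, 0.0], indexed, n=2)
```

With these toy vectors the query points in the same direction as `a`, partly overlaps with `c`, and is orthogonal to `b`, so `a` and `c` are returned in that order.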
These most similar contexts are passed (one at a time) to the reader model alongside the original question. Given a question and context, the reader predicts an answer’s start and end positions.
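The start and end prediction can be pictured as the reader assigning every token a start score and an end score, then picking the span with the highest combined score. The sketch below shows only that selection step, with made-up scores. The function name and the `max_len` cap are assumptions, and real readers also handle subword tokens and unanswerable questions.

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) indices maximizing start_score + end_score.

    Only spans where end >= start and the length is at most `max_len`
    tokens are considered.
    """
    best_score = float("-inf")
    best = (0, 0)
    for i, start_score in enumerate(start_scores):
        last = min(i + max_len, len(end_scores))
        for j in range(i, last):
            score = start_score + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best


tokens = ["the", "Denver", "Broncos", "won"]
start_scores = [0.1, 5.0, 0.2, 0.0]  # made-up values for illustration
end_scores = [0.0, 0.3, 4.0, 0.1]
start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
```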
We will use the deepset/electra-base-squad2 model from Hugging Face transformers as our reader model. All we do is set up a 'question-answering' pipeline and pass our query and contexts to it one by one.
The reader prediction is repeated for each context. If preferred, we can order the ‘answers’ from here using the scores output by the retriever and/or reader models.
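One way to run the reader over each retrieved context and order the results is sketched below. The helper names are illustrative. The reader is passed in as a callable taking `question` and `context` keyword arguments and returning a dict with `answer` and `score` keys, which is the shape a transformers 'question-answering' pipeline returns for a single input. Sorting by reader score alone is one choice among several, and combining it with the retriever score is equally valid.

```python
def load_reader(model_name="deepset/electra-base-squad2"):
    """Create a question-answering pipeline (downloads weights on first use)."""
    # Imported lazily so rank_answers can be used without the library.
    from transformers import pipeline
    return pipeline("question-answering", model=model_name)


def rank_answers(question, contexts, reader):
    """Run the reader on each context and sort answers by score, best first."""
    answers = []
    for context in contexts:
        result = reader(question=question, context=context)
        answers.append({
            "answer": result["answer"],
            "score": result["score"],
            "context": context,
        })
    return sorted(answers, key=lambda item: item["score"], reverse=True)


# Typical usage (commented out because it downloads the model):
# reader = load_reader()
# ranked = rank_answers(query, retrieved_contexts, reader)
```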
As we can see, the model returns the correct answer of 'Denver Broncos' with a score of 0.99. Most other answers return only minuscule scores, showing that our reader model easily distinguishes between good and bad answers.
As we saw before, abstractive QA can be split into two types: open-book and closed-book. We will start with open-book as the natural continuation of the previous extractive QA pipeline.
Being open-book abstractive QA, we can use the same database and retriever components used for extractive QA. These components work in the same way and deliver a set of contexts to our generator model, which replaces the reader from extractive QA.
Rather than extracting answers, contexts are used as input (alongside the question) to a generative sequence-to-sequence (seq2seq) model. The model uses the question and context to generate an answer.
Large transformer models store ‘representations’ of knowledge in their parameters. By passing relevant contexts and questions into the model, we hope that the model will use the context alongside its ‘stored knowledge’ to answer more abstract questions.
The seq2seq model used is commonly BART or T5-based. We will go ahead and initialize a seq2seq pipeline using a BART model fine-tuned for abstractive QA, yjernite/bart_eli5.
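A sketch of how the question and retrieved contexts could be fed to such a model is shown below. The `question: ... context: <P> ...` layout is an assumption based on the convention commonly used with ELI5-style BART models, so check the model card before relying on it. The helper names are illustrative, and the generator is injected so that any callable returning a list of dicts with a `generated_text` key will work.

```python
def build_seq2seq_input(question, contexts):
    """Combine a question and its contexts into one input string.

    Assumed format: each context is prefixed with a <P> marker.
    """
    joined = " ".join(f"<P> {context}" for context in contexts)
    return f"question: {question} context: {joined}"


def load_generator(model_name="yjernite/bart_eli5"):
    """Create a text2text-generation pipeline (downloads weights on first use)."""
    # Imported lazily so the other helpers work without the library.
    from transformers import pipeline
    return pipeline("text2text-generation", model=model_name)


def generate_answer(question, contexts, generator, **generate_kwargs):
    """Generate a single answer from the question and contexts."""
    prompt = build_seq2seq_input(question, contexts)
    outputs = generator(prompt, **generate_kwargs)
    return outputs[0]["generated_text"]


# Typical usage (commented out because it downloads the model):
# generator = load_generator()
# answer = generate_answer(query, retrieved_contexts, generator, max_length=64)
```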
The question we asked before is specific. We’re looking for a short and concise answer of Denver Broncos. Abstractive QA is not ideal for these types of questions.
Instead, the benefit of abstractive QA comes with more ‘abstract’ questions like "Do NFL teams only care about playing at the Super Bowl?" Here, we’re almost asking for an opinion. There is unlikely to be an exact answer. Let’s see what the abstractive QA method thinks about this.
These answers look much better than those for our ‘specific’ question. The returned contexts don’t include direct information about whether the teams care about being in the Super Bowl. Instead, they contain snippets of concrete NFL/Super Bowl details.
The seq2seq model combines those details and its own internal ‘knowledge’ to produce some insightful thoughts on the question:
- “No, because it is the pinnacle of professional football” — points out that teams in the Super Bowl (whether they win or not) already know they’re at the top; in a way, they’ve ‘already won’.
- “They don’t care if they lose, they just care if they get a nice, big crowd to cheer” — players are happy that they get to entertain their fans; that is, the Super Bowl is less important.
- “They are paid a lot of money to be in the Superbowl” — points out the more obvious ‘who wouldn’t want bucket loads of money?’.
There is plenty of contradiction and opinion, but that is often the case with more abstract questioning, particularly with the question we asked.
Although these results are interesting, they’re not perfect. We can tweak parameters such as temperature to increase/decrease randomness in the answers, but abstractive QA can be limited in its coherence.
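To see why temperature changes randomness, consider how it rescales the model's scores before they become sampling probabilities. The following is a standalone sketch of a temperature-scaled softmax with made-up logits, not the internals of any particular library.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, scaled by a temperature.

    Lower temperatures sharpen the distribution (less random sampling),
    higher temperatures flatten it (more random sampling).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    # Subtract the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
```

In `sharp` the top token takes a larger share of the probability mass than in `flat`, so sampling at the lower temperature picks it more consistently.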
The final architecture we will look at is closed-book abstractive QA. In reality, this is nothing more than a generative model that takes a question and relies on nothing more than its own internal knowledge. There is no retrieval step.
Although we’re dropping the retriever model, that doesn’t mean we stick with the same generator model. As we saw before, the yjernite/bart_eli5 model requires input that contains both the question and the retrieved contexts.
Without the context input, the previous model does not perform as well. This is to be expected. The seq2seq model is optimized to produce coherent answers when given both question and context. If our input is in a new, unexpected format, performance suffers.
The model doesn’t know the answer and flips the direction of questioning. Unfortunately, this isn’t really what we want. However, there are many alternative models we can try. The GPT models from OpenAI are well-known examples of generative transformers and can produce good results.
GPT-3, the most recent GPT from OpenAI at the time of writing, is locked behind an API, but there are open-source alternatives like GPT-Neo from EleutherAI. Let’s try one of the smaller GPT-Neo models.
Here we’re using the 'text-generation' pipeline. All we do here is generate text following a question. We get an interesting, true answer, but it doesn’t necessarily answer the question. We can try a few more questions.
We can tweak parameters to reduce the likelihood of repetition.
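A hedged sketch of closed-book generation with repetition controls follows. The model id `EleutherAI/gpt-neo-125M` is an assumption, chosen as one of the smaller GPT-Neo checkpoints. The parameter values are illustrative starting points rather than tuned settings, and the helper names are not from the original article. `no_repeat_ngram_size` and `repetition_penalty` are standard generation arguments in transformers.

```python
# Illustrative defaults; tune these for your own model and questions.
DEFAULT_GENERATION_KWARGS = {
    "max_length": 64,
    "do_sample": True,
    "temperature": 0.7,
    "no_repeat_ngram_size": 3,   # never repeat the same 3-gram
    "repetition_penalty": 1.2,   # down-weight tokens already generated
}


def load_text_generator(model_name="EleutherAI/gpt-neo-125M"):
    """Create a text-generation pipeline (downloads weights on first use)."""
    # Imported lazily so closed_book_answer can be used without the library.
    from transformers import pipeline
    return pipeline("text-generation", model=model_name)


def closed_book_answer(question, generator, **overrides):
    """Generate a continuation of the question and return only the new text."""
    kwargs = {**DEFAULT_GENERATION_KWARGS, **overrides}
    outputs = generator(question, **kwargs)
    text = outputs[0]["generated_text"]
    # Text-generation pipelines echo the prompt by default, so strip it.
    if text.startswith(question):
        text = text[len(question):]
    return text.strip()


# Typical usage (commented out because it downloads the model):
# generator = load_text_generator()
# print(closed_book_answer("Where is Levi's Stadium?", generator))
```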
We do get some interesting results, although it is clear that closed-book abstractive QA is a challenging task. Larger models store more internal knowledge; thus, closed-book performance is very much tied to model size.
With bigger models, we can get better results, but for consistent answers, the open-book alternatives tend to outperform the closed-book approach.