Semantic search¶
2 min intro¶
Semantic matching is a recent alternative to the classic lexical (or keyword-based) matching that has powered most search solutions for years. It is also called vector search, or search on dense vectors, in comparison to keyword search, which relies on sparse vector representations of textual information.
In a nutshell, semantic search converts text documents into one or several vectors. First, it uses tokenization to split long texts into small chunks that can be understood as words (more on that later). These tokens are then encoded with an embedding model to get a vector for each token. Finally these vectors are assembled into a single one to represent a piece of text.
User queries follow the same process. To search for specific content, the system compares the user's query vector to all the ones stored for documents and then returns the search results as an ordered list of items using a specific scoring function.
The LLM piece in that setup is the embedding model, which tries to sum up the semantic information in a vector (more details later on how this is done and how one can adapt an existing model to improve the quality of the embedding process for a specific use-case).
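To make this pipeline concrete, here is a minimal illustrative sketch of the encode-then-compare idea, using the same sentence-transformers model that is set up later in this tutorial (this cell is only an illustration; the actual setup starts in the next section).
from sentence_transformers import SentenceTransformer, util
# Illustrative only: encode one document and one query, then compare them.
model = SentenceTransformer('all-MiniLM-L6-v2')  # same embedding model as used later on
doc_vector = model.encode("The Eiffel Tower is a landmark in Paris.")
query_vector = model.encode("Where is the Eiffel Tower located?")
# Cosine similarity acts as the scoring function: the higher, the more semantically similar.
print(util.cos_sim(query_vector, doc_vector))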
Getting started¶
Install dependencies¶
The most important dependencies are:
- datasets from Hugging Face, which allows loading data from their hub
- sentence-transformers, which provides some pretrained embedding models dedicated to sentence similarity
- nltk to apply preprocessing to text
!pip install datasets sentence-transformers nltk tqdm matplotlib
Prepare resources for language analysis done in preprocessing.
import nltk
nltk.download('punkt')
Download a dataset¶
The selected dataset is a preprocessed extract from Wikipedia. More details on the HF hub: https://huggingface.co/datasets/Cohere/wikipedia-22-12 Articles from Wikipedia have been extracted from the 2022 dump and split into paragraphs. It contains several languages but we will focus on the 'simple-english' wiki: purely in English, not too complicated wordings (focusing on simple comprehension) and not too long articles.
All this preprocessing makes this dataset very suitable for a first implementation of semantic search:
- only one language
- reduced vocabulary to avoid rare words
- short "document" (in the form of paragraph)
from datasets import load_dataset
cohere_wikipedia_simple_dataset = load_dataset("Cohere/wikipedia-22-12", 'simple', split='train')
Checking the content of one sample:
for idx, row in enumerate(cohere_wikipedia_simple_dataset):
print('#'*50)
for k, v in row.items():
print(f"{k}: \t{v}")
if idx == 3:
break
We can see each sample contains:
- the title of the article,
- the paragraph, which consists of several sentences over a few lines
- the paragraph id (i.e. the index of the paragraph in the article, in this case '0' for the first paragraph)
- the link to the article as a URL
len(cohere_wikipedia_simple_dataset)
Text extraction¶
Extracting the raw text for each sample will allow us to make a quick check on the vocabulary.
Simple approach¶
from tqdm.notebook import tqdm
paragraphs = list()
for row in tqdm(cohere_wikipedia_simple_dataset):
paragraphs.append(row['text'])
Brief linguistic exploration (OPTIONAL)¶
Vocabulary analysis (OPTIONAL)¶
We do a very simplistic word and vocabulary extraction here (ignoring that a paragraph contains multiple sentences) but that will be good enough for now.
from nltk.tokenize import word_tokenize
vocabulary = dict()
for paragraph in tqdm(paragraphs):
words = word_tokenize(paragraph)
# normalize words and remove non alphabetic ones (including numbers and punctuation)
words=[word.lower() for word in words if word.isalpha()]
for word in words:
if word not in vocabulary:
vocabulary[word] = 1
else:
vocabulary[word] = vocabulary[word] + 1
print(f"Vocabulary size: {len(vocabulary)}")
sorted_vocabulary = dict(sorted(vocabulary.items(), key=lambda item: item[1]))
sorted_vocabulary_list = list(sorted_vocabulary.items())
K = 20
print(f"\nTop {K} most common words:\n\n{sorted_vocabulary_list[-K:]}")
print(f"\nTop {K} rarest words:\n\n{sorted_vocabulary_list[:K]}")
min_occurence = 500
limited_voc = { k: v for k,v in sorted_vocabulary.items() if v > min_occurence}
print(f"Limited vocuabulary size: {len(limited_voc)}")
K = 20
print(f"\nTop {K} rarest words:\n\n{list(limited_voc)[:K]}")
Sentence statistics (OPTIONAL)¶
Looking into basic statistics:
- number of characters per sentence
- number of words per sentence
- number of tokens per sentence
from nltk.tokenize import word_tokenize
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
def sentence_stats(sample_text):
stats = dict()
stats['char_count'] = len(sample_text)
stats['word_count'] = len(word_tokenize(sample_text))
stats['token_count'] = len(model.tokenizer.tokenize(sample_text))
return stats
cohere_wikipedia_simple_dataset = cohere_wikipedia_simple_dataset.map(
lambda sample: sentence_stats(sample),
input_columns='text',
num_proc=6
)
avg_char_count = sum(cohere_wikipedia_simple_dataset['char_count'])/len(cohere_wikipedia_simple_dataset)
avg_word_count = sum(cohere_wikipedia_simple_dataset['word_count'])/len(cohere_wikipedia_simple_dataset)
avg_token_count = sum(cohere_wikipedia_simple_dataset['token_count'])/len(cohere_wikipedia_simple_dataset)
print(f"avg_char_count={avg_char_count}")
print(f"avg_word_count={avg_word_count}")
print(f"avg_token_count={avg_token_count}")
cohere_wikipedia_simple_dataset[0]
model.tokenize(cohere_wikipedia_simple_dataset[0]['text'])
import matplotlib.pyplot as plt
# your three lists of integers
char_count = cohere_wikipedia_simple_dataset['char_count']
word_count = cohere_wikipedia_simple_dataset['word_count']
token_count = cohere_wikipedia_simple_dataset['token_count']
# combine the three lists into one
data = [char_count, word_count, token_count]
# plot histogram for each list and combine into one graph
plt.hist(data, bins=range(min(min(lst) for lst in data), max(max(lst) for lst in data) + 1, 1), alpha=0.8, label=['Characters', 'Words', 'Tokens'])
plt.yscale('log')
plt.grid()
# set x and y axis labels and title
plt.xlabel('Count')
plt.ylabel('Occurrences')
plt.title('Histograms of linguistic elements counts - log scale')
# show legend
plt.legend()
# show plot
plt.show()
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
The dataset, with almost 500K samples, is rather large for a 'getting started' tutorial. Let's quickly check how long it would take to encode everything.
import time
N = 500
tic = time.time()
model.encode(paragraphs[0:N])
toc = time.time()
duration_in_seconds = (toc-tic)/N
print(f"Rough time to encode one paragraph: {duration_in_seconds*1000:.3} ms")
dataset_duration = len(paragraphs)*duration_in_seconds
hours = int(dataset_duration // 3600)
minutes = int((dataset_duration % 3600) // 60)
seconds = dataset_duration % 60
print(f"Rough time to encode the dataset: {hours} hours, {minutes} min and {int(seconds)} seconds")
print(f"Nb encoded paragraphs per second: {int(1/duration_in_seconds)}")
time_budget_in_min = 3
print(f"Nb encoded paragraphs in {time_budget_in_min}min: {int(time_budget_in_min*60/duration_in_seconds)}")
nb_paragraph_to_encode = int((time_budget_in_min*60)/duration_in_seconds)
print(f"Number of paragraphs to encode: {nb_paragraph_to_encode}")
sentence_embeddings = model.encode(paragraphs[:int(nb_paragraph_to_encode)], show_progress_bar=True)
len(sentence_embeddings)
So, we end up with an array of vectors in sentence_embeddings that strictly follows the ordering of the texts in paragraphs. It can thus be considered to be the index of our documents. Next, we will search against this index.
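As a quick sanity check (reusing sentence_embeddings and paragraphs from the cells above), the index is a 2D array with one row per encoded paragraph, in the same order as the texts:
# One 384-dimensional vector per encoded paragraph for all-MiniLM-L6-v2.
print(sentence_embeddings.shape)  # (nb_paragraph_to_encode, 384)
print(paragraphs[0][:80])         # the text whose embedding is stored in sentence_embeddings[0]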
Searching in the embeddings¶
Basic search¶
from sentence_transformers import SentenceTransformer, util
search_query = "climate change"
# query is encoded by calling model.encode()
query_embedding = model.encode(search_query)
# compute cosine similarity between all pairs
cos_sim = util.cos_sim(query_embedding, sentence_embeddings)
# add all pairs to a list with their cosine similarity score
all_sentence_scores = []
for i in range(len(sentence_embeddings)):
all_sentence_scores.append([cos_sim[0, i], i])
# sort list by the highest cosine similarity score
all_sentence_scores = sorted(all_sentence_scores, key=lambda x: x[0], reverse=True)
K = 5
print(f"Top-{K} sentences similar to search query: {search_query}")
for score, i in all_sentence_scores[0:K]:
print(f"\n - ({i}): {score:.4f} - {cohere_wikipedia_simple_dataset[i]}")
Search function¶
model = SentenceTransformer('all-MiniLM-L6-v2')
def search(query, index, dataset, K=5):
# query is encoded by calling model.encode()
query_embedding = model.encode(query)
# compute cosine similarity between all pairs
cos_sim = util.cos_sim(query_embedding, index)
#Add all pairs to a list with their cosine similarity score
all_sentence_scores = []
for i in range(len(index)):
all_sentence_scores.append([float(cos_sim[0, i]), i])
#Sort list by the highest cosine similarity score
all_sentence_scores = sorted(all_sentence_scores, key=lambda x: x[0], reverse=True)
search_results = list()
for score, i in all_sentence_scores[0:K]:
search_results.append(
dict(
score=score,
doc_index=i,
doc_id=dataset[i]['id'],
wiki_id=dataset[i]['wiki_id'],
title=dataset[i]['title'],
text=dataset[i]['text'],
)
)
return search_results
results = search('climate change', sentence_embeddings, cohere_wikipedia_simple_dataset)
for idx, res in enumerate(results):
print(f"\n{idx}: @{res['score']:.3f} - [{res['title']}] - {res['text']}")
The search function retrieves the top 5 paragraphs, not pages. Thus this result set contains multiple matches on the same article Global warming. A more user-friendly system might prefer to group matches per page.
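As an illustration, here is one possible way to group the previous results per article (a sketch reusing the results list from the cell above; the grouping and presentation choices are just an assumption, not a prescribed approach):
from collections import defaultdict
# Group the flat result list by article title and keep the best score per article.
results_per_page = defaultdict(list)
for res in results:
    results_per_page[res['title']].append(res)
for title, hits in results_per_page.items():
    best_score = max(hit['score'] for hit in hits)
    print(f"[{title}] best score {best_score:.3f} - {len(hits)} matching paragraph(s)")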
Evaluation¶
Creating an evaluation dataset¶
To create an evaluation dataset, we need some query/response samples over a similar set of documents. In the Hugging Face hub, one can find several question answering datasets that are just like this. The wiki_qa dataset covers just that, with samples of questions and sentences extracted from Wikipedia articles.
One should note that the most important quality of this dataset is that the questions are "real" in the sense that they were extracted from Bing search logs. Thus they have been asked by real users, without any specific instructions, which makes them "natural". The selected answers are based on users' clickthrough data and again can be considered natural annotations from real users.
wiki_qa_dataset = load_dataset("wiki_qa")
wiki_qa_dataset['train']
So each sample combines a question and a sentence from a specific Wikipedia document identified by its title. For each question/sentence pair, a 0/1 label assesses whether the sentence can be considered to be an answer to the question.
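For instance, peeking at one raw sample (reusing the wiki_qa_dataset loaded above):
# Each row carries a 'question', the 'document_title' of the source article,
# a candidate 'answer' sentence and a 0/1 'label'.
print(wiki_qa_dataset['train'][0])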
Since we are building a search system (and not a general sentence similarity engine), we will focus the evaluation on the positive Q/A pairs, mixing the train/dev/test data from this dataset. Keep in mind that it means we shouldn't retrain or finetune our model on this set (or the same sentences from any dataset using Wikipedia as a source) to avoid label contamination.
Also, since our use-case is a pure 'search system', the negative samples (with label 0) are not really useful as queries. We will thus only keep, as positive Q/A pairs, the ones whose sentence is labelled as an answer. We still keep the negatively labelled sentences of the same documents in the corpus to have a larger reference set.
documents_set = set()
sentences = set()
positive_qa_pairs = set()
for split in wiki_qa_dataset:
print("Reconstructing documents for", split)
for row in tqdm(wiki_qa_dataset[split]):
question = row['question']
title = row['document_title']
answer_sentence = row['answer']
label = row['label']
doc = (title, answer_sentence)
documents_set.add(doc)
sentences.add(answer_sentence)
if label == 1:
positive_qa_pairs.add((question, title, answer_sentence))
documents = list(documents_set)
print(f"Number of unique documents {len(documents)}")
print(f"Number of unique positive Q/A pairs (ie unique question with specific answer) {len(positive_qa_pairs)}")
print(f"Number of unique sentences {len(sentences)} (ie {len(documents) - len(sentences)} duplicated sentences)")
One can note that there are 66 duplicated sentences in the dataset (~0.02%). These are mostly anecdotal and won't have a deep impact on the evaluation.
documents[:3]
Evaluation index¶
To create the evaluation index, we will process all the sentences from the documents. This will be used as our reference corpus for the evaluation.
from sentence_transformers import SentenceTransformer
all_MiniLM_L6_v2_model = SentenceTransformer('all-MiniLM-L6-v2')
evaluation_documents = documents
evaluation_embeddings_index = all_MiniLM_L6_v2_model.encode([doc[1] for doc in evaluation_documents], show_progress_bar=True)
def search_func(search_query, model, index, K=10):
if isinstance(search_query, str):
search_query = [search_query]
# query is encoded by calling model.encode()
query_embedding = model.encode(search_query, show_progress_bar=False)
# compute cosine similarity between all pairs
cos_sim = util.cos_sim(query_embedding, index)
all_results = list()
for q_i, query in enumerate(tqdm(search_query)):
# add all pairs to a list with their cosine similarity score
all_sentence_scores = []
for i in range(len(index)):
all_sentence_scores.append((float(cos_sim[q_i, i]), i))
# sort list by the highest cosine similarity score
all_sentence_scores = sorted(all_sentence_scores, key=lambda x: x[0], reverse=True)
# keep top K only
all_sentence_scores = all_sentence_scores[:K]
query_results = list()
for score, i in all_sentence_scores:
result = (
score, # score
i, # document index in the corpus
evaluation_documents[i][0], # title
evaluation_documents[i][1], # text
)
query_results.append(result)
all_results.append(query_results)
return all_results
search_query = "machine pumping water"
search_results = search_func(search_query, all_MiniLM_L6_v2_model, evaluation_embeddings_index)
print(f"############\n Search query: {search_query}")
for rank, res in enumerate(search_results[0]):
i = res[1]
print(f"{rank+1}- [{evaluation_documents[i][0]}] @ {res[0]:2.3f} - {evaluation_documents[i][1]}")
/!\ Important to note: in order to allow multiple queries to be searched at the same time, the search_func() input is a list of queries (thus the use of search_query = ["machine pumping water"]); a single string passed as input is simply wrapped into a list inside the function.
search_query = ["machine pumping water", "water pump"]
search_results = search_func(search_query, all_MiniLM_L6_v2_model, evaluation_embeddings_index)
for q_i, query in enumerate(search_query):
print(f"############\n Search query: {query}")
for rank, res in enumerate(search_results[q_i]):
print(f"{rank+1}- [{res[2]}] @ {res[0]:2.3f} - {res[3]}")
Evaluation metrics¶
Several metrics exist for search systems, and how to get the best estimation of a system's performance is often an interesting challenge that depends on the specific usage. The most important thing to remember is that in most cases the search problem is multi-factorial, and it is often a good idea to use more than one metric to get a contrasted view of the system's performance.
For our case, we will focus on 2 aspects:
- Recall, which evaluates the capacity of a search system to capture the correct response to a query (more generally, it can be seen as the true positive rate)
- Mean reciprocal rank (MRR), which evaluates the ability to rank the correct answer higher in the retrieved list of results
For both metrics, we will assume that the relevance of a document in response to a query is binary (either relevant or not). This is true in our evaluation set, where answers are labelled 0 or 1, but it is not always the case since relevance can be gradual.
Finally, we will compute the metrics over a result set fixed to the top K results, measuring recall@K and MRR@K. This is a classic approach for search or recommender systems where users are shown a limited number of results in 'one result page'.
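For reference, writing Q for the set of evaluation queries and rank_q for the (1-based) position of the relevant document in the top-K result list of query q, the two metrics implemented below can be written as:
$$\mathrm{recall@}K = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}[\mathrm{rank}_q \le K] \qquad \mathrm{MRR@}K = \frac{1}{|Q|} \sum_{q \in Q} \frac{\mathbb{1}[\mathrm{rank}_q \le K]}{\mathrm{rank}_q}$$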
def recall(result_set_list, positive_result, K=5):
for result in result_set_list[0:K]:
if result == positive_result:
return 1
return 0
def avg_recall(queries, result_set_list_per_query, positive_results_per_query, K=5):
total_recall = 0
for q in queries:
total_recall += recall(result_set_list_per_query[q], positive_results_per_query[q], K)
return 1.0*total_recall/len(queries)
def mrr(result_set_list, positive_result, K=5):
for rank, result in enumerate(result_set_list[0:K]):
if result == positive_result:
return 1/(1+rank)
return 0
def avg_mrr(queries, result_set_list_per_query, positive_results_per_query, K=5):
total_mrr = 0
for q in queries:
total_mrr += mrr(result_set_list_per_query[q], positive_results_per_query[q], K)
return 1.0*total_mrr/len(queries)
def run_search_evaluation(positive_qa_pairs, model, documents, K):
print(f"Creating the evaluation index by computing {len(documents)} embeddings...")
index = model.encode([doc[1] for doc in documents], show_progress_bar=True)
result_set_list_per_query = dict()
positive_results_per_query = dict()
queries = [ qa_pair[0] for qa_pair in positive_qa_pairs]
print(f"Running search for {len(queries)} queries...")
search_results = search_func(queries, model, index, K)
print(f"Running the evaluation {len(queries)} queries...")
for q_i, qa_pair in enumerate(tqdm(positive_qa_pairs)):
query = qa_pair[0]
doc_title = qa_pair[1]
answer_sentence = qa_pair[2]
query_results = search_results[q_i]
result_set_list = list()
for res in query_results:
result_set_list.append((res[2], res[3]))
result_set_list_per_query[query] = result_set_list
positive_results_per_query[query] = (doc_title, answer_sentence)
return queries, result_set_list_per_query, positive_results_per_query
def compute_metrics(queries, result_set_list_per_query, positive_results_per_query, K):
recall_metric = avg_recall(queries, result_set_list_per_query, positive_results_per_query, K)
mrr_metric = avg_mrr(queries, result_set_list_per_query, positive_results_per_query, K)
return recall_metric, mrr_metric
It is important to take into account the size of the result set K in the metric. K is usually defined by the application based on the number of results that can be shown to a user in one screen, but also on the profile of users (advanced/trained users are usually more persistent, but so are users in some e-commerce segments).
def end_to_end_evaluation(positive_qa_pairs, model, documents, K=[3, 5]):
max_K = max(K)
queries, result_set_list_per_query, positive_results_per_query = run_search_evaluation(positive_qa_pairs, model, documents, max_K)
print("Computing metrics...")
metrics = dict()
for k in K:
recall_metric, mrr_metric = compute_metrics(queries, result_set_list_per_query, positive_results_per_query, k)
metrics[k] = (recall_metric, mrr_metric)
return metrics
Running the evaluation¶
Let's get the results from the system with a result list of 10 so we can compute multiple metrics for K = 3, 5 and 10.
from sentence_transformers import SentenceTransformer
all_MiniLM_L6_v2_model = SentenceTransformer('all-MiniLM-L6-v2')
metrics = end_to_end_evaluation(positive_qa_pairs, all_MiniLM_L6_v2_model, evaluation_documents, K=[3, 5, 10])
for k in [3, 5, 10]:
(recall_metric, mrr_metric) = metrics[k]
print(f"Recall@{k} = {recall_metric:0.2f} \t MRR@{k} = {mrr_metric:0.2f}")
In our case, the "raw" performance of the embedding model is pretty decent. With a recall@5 above 75%, the correct answer to the query/question can be found in the top 5 results more than 3/4 of the time.
Going further¶
This simple intro allows you to get your own dedicated vector similarity search engine. Depending on your goals, a few directions are possible to go further:
- scaling the system to a larger dataset or a higher query throughput: looking into using a vector database would be the most promising next step (see the sketch after this list for a first taste with an in-memory nearest-neighbour index).
- improving the relevance of the search system for a specific vertical or domain of interest: fine-tuning a pretrained embedding model would be the way to go.
- integrating the search feature into a more interactive conversational Q/A system: retrieval augmented generation (or RAG) is the right path there.
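As a first taste of the scaling direction, here is a minimal sketch that moves the index into FAISS, an in-memory nearest-neighbour library (an assumption for illustration: it requires pip install faiss-cpu, which is not among the dependencies above, and it reuses model, paragraphs and sentence_embeddings from earlier cells).
import numpy as np
import faiss
# Normalize the embeddings so that inner product equals cosine similarity.
embeddings = np.asarray(sentence_embeddings, dtype='float32')
faiss.normalize_L2(embeddings)
# Build a flat (exact) inner-product index; approximate indexes (IVF, HNSW) scale further.
faiss_index = faiss.IndexFlatIP(embeddings.shape[1])
faiss_index.add(embeddings)
# Encode and normalize the query, then retrieve the top-5 paragraphs.
query_embedding = model.encode(["climate change"]).astype('float32')
faiss.normalize_L2(query_embedding)
scores, indices = faiss_index.search(query_embedding, 5)
for score, i in zip(scores[0], indices[0]):
    print(f"{score:.3f} - {paragraphs[i][:80]}")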