Semantic Search and LLM¶
2 min intro¶
In the previous chapters, we explored using embeddings for semantic search and fine-tuning large language models. Now we're ready to combine these techniques in a practical and, dare I say, caffeinated way. Imagine a chatbot dedicated to troubleshooting coffee machines, those essential contraptions that keep developers fueled during marathon coding sessions. With countless coffee disasters to address, from "Why isn't my espresso shot coming out?" to "Help, my cappuccino is more like a latte!", we'll build a system that answers specific questions by retrieving relevant information from a curated set of private documents. Get ready to brew up a solution that's both strong and smart, keeping your coding energy as high as your coffee consumption!
Installing dependencies¶
The key dependencies for this project include:
sentence-transformers: offers a variety of pre-trained embedding models optimized for sentence similarity, enabling effective comparison and retrieval of text data.
chromadb: a vector database designed for efficiently storing and managing embeddings, facilitating quick lookups and scalable operations.
langchain: a library for building and managing language model workflows, integrating the various components seamlessly.
transformers: provides access to a wide range of pre-trained models and tools for NLP tasks, crucial for fine-tuning and leveraging large language models.
%%capture
!pip install --quiet chromadb langchain_community langchain-huggingface loguru pydantic sentence-transformers
!pip install --quiet unstructured_client
!pip install --quiet transformers accelerate bitsandbytes einops
!pip install --quiet gradio
Resources¶
To support our chatbot's troubleshooting capabilities, we'll be utilizing two key documents related to popular coffee machines:
Nespresso Inissia Manual: This document provides detailed instructions for the Nespresso Inissia model, including setup, operation, and maintenance procedures. It’s a valuable resource for understanding common issues and solutions specific to this machine.
Nespresso Essenza Mini User Manual: This manual covers the Nespresso Essenza Mini model, offering comprehensive guidance on its features, usage, and troubleshooting tips. It's essential for addressing problems and providing solutions for this particular machine.
These manuals will serve as the foundation for the knowledge base that our chatbot will leverage to assist users with coffee machine issues.
%%capture
!curl https://www.nespresso.com/shared_res/manuals/inissia/inissia_C_breville.pdf > sample_data/inissia_C_breville.pdf
!curl https://www.nespresso.com/shared_res/manuals/essenza-mini/2017/UM_NESPRESSO_ESSENZA_MINI_BREVILLE_PROD_WEB_2017_02_20.pdf > sample_data/essenza_mini_breville.pdf
RAG concept in a nutshell¶
Retrieval-Augmented Generation (RAG) is an architecture that combines retrieval-based and generation-based approaches to enhance natural language processing tasks. RAG integrates two main components:
Retriever: This module is responsible for fetching relevant documents from a large corpus based on the input query. It utilizes methods such as dense retrieval to identify and retrieve pertinent information that can help answer the query.
Generator: Once the relevant documents are retrieved, the generator takes these documents along with the original query and produces a coherent and contextually accurate response. The generation process involves leveraging pre-trained language models to synthesize information and generate human-like text.
In essence, RAG combines the strengths of both retrieval and generation to produce more accurate and contextually informed answers, enhancing the overall performance of language understanding and generation tasks.
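To make the flow concrete, here is a minimal sketch of the retrieve-then-generate loop. The function and variable names are illustrative placeholders; the rest of this chapter builds the real pipeline with LangChain components.
def answer_with_rag(query, retriever, llm):
    # 1. Retrieval: fetch the document chunks most relevant to the query.
    context_docs = retriever.invoke(query)
    context = "\n\n".join(doc.page_content for doc in context_docs)
    # 2. Generation: ask the language model to answer using only that context.
    prompt = f"Answer the question using this context:\n{context}\n\nQuestion: {query}"
    return llm.invoke(prompt)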
Data extraction and storage¶
Let's use the FREE Unstructured service¶
import os, json
import unstructured_client
from unstructured_client.models import operations, shared
from typing import List
from google.colab import userdata
elements = []
# Set to True to partition the PDFs through the hosted Unstructured API (requires an UNSTRUCTURED_API_KEY secret).
USE_UNSTRUCTURED_API = False
if USE_UNSTRUCTURED_API:
client = unstructured_client.UnstructuredClient(
api_key_auth=userdata.get("UNSTRUCTURED_API_KEY"),
server_url="https://api.unstructuredapp.io/general/v0/general"
)
for filename in ["sample_data/inissia_C_breville.pdf", "sample_data/essenza_mini_breville.pdf"]:
        # Loop through the downloaded PDFs and send each one to the Unstructured API for partitioning
with open(filename, "rb") as f:
data = f.read()
req = operations.PartitionRequest(
partition_parameters=shared.PartitionParameters(
files=shared.Files(
content=data,
file_name=filename,
),
# --- Other partition parameters ---
# Note: Defining `strategy`, `chunking_strategy`, and `output_format`
# parameters as strings is accepted, but will not pass strict type checking. It is
# advised to use the defined enum classes as shown below.
strategy=shared.Strategy.HI_RES,
languages=['eng'],
coordinates=True,
chunking_strategy=shared.ChunkingStrategy.BY_TITLE,
max_characters=1024,
# --- PDF partition parameters ---
split_pdf_page=True, # If True, splits the PDF file into smaller chunks of pages.
split_pdf_allow_failed=True, # If True, the partitioning continues even if some pages fail.
split_pdf_concurrency_level=15 # Set the number of concurrent request to the maximum value: 15.
),
)
try:
res = client.general.partition(request=req)
element_dicts = [element for element in res.elements]
json_elements = json.dumps(element_dicts, indent=2)
# Split filename in path, filename and extension
filename_split = os.path.splitext(filename)
filename_no_ext = filename_split[0].split('/')[-1]
            # Write the processed data to a local file with the same name but a .json extension
with open(f"/content/drive/MyDrive/{filename_no_ext}.json", "w") as file:
file.write(json_elements)
with open(f"{filename_no_ext}.json", "w") as file:
file.write(json_elements)
elements.extend(element_dicts)
except Exception as e:
print(e)
with open(f"all_elements.json", "w") as file:
file.write(json.dumps(elements, indent=2))
f"{len(elements)} chunks have been retrieved for those 2 documents"
At the time this notebook was written, the Unstructured service returned only a few chunks for these documents. Let's use a strategy based on PyMuPDF instead.
Alternate strategy¶
!pip install --quiet pymupdf
from typing import List
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
def process_docs(docs: List[str]) -> List[Document]:
    '''
    Consume a list of file names and apply preprocessing to get LangChain document chunks.
    '''
# preparing the doc splitter
text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=32)
# preparing the chunks
chunked_documents = list()
# reading file one by one
for doc in docs:
# loading the file with langchain loader
doc_loader = PyMuPDFLoader(doc)
# splitting the document in chunks
chunks = doc_loader.load_and_split(text_splitter)
        # adding these chunks to the returned list
chunked_documents.extend(chunks)
return chunked_documents
elements = process_docs(['sample_data/inissia_C_breville.pdf', 'sample_data/essenza_mini_breville.pdf'])
f"{len(elements)} chunks have been retrieved for those 2 documents"
Wrap the extracted elements into a list of LangChain documents¶
documents = []
# With the PyMuPDF strategy, the elements are already LangChain Documents;
# this loop simply re-wraps their content and metadata into a fresh list.
for element in elements:
    metadata = element.metadata
    documents.append(Document(page_content=element.page_content, metadata=metadata))
Create an embedding¶
from tqdm.autonotebook import tqdm, trange
from langchain_huggingface import HuggingFaceEmbeddings

# all-MiniLM-L6-v2 is a small, fast sentence-transformers model that maps text to 384-dimensional vectors.
embeddings_model_name = "all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
Embed elements and store them in ChromaDB store¶
from langchain.vectorstores import utils as chromautils
from langchain_community.vectorstores import Chroma
# ChromaDB doesn't support complex metadata, e.g. lists, so we drop it here.
# If you're using a different vector store, you may not need to do this
docs = chromautils.filter_complex_metadata(documents)
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory="./db_1")
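As a sanity check (a minimal sketch; the query string is just an example), we can query the store directly before wiring it into a retriever:
# Return the 2 chunks whose embeddings are closest to the example query.
for doc in vectorstore.similarity_search("machine is leaking water", k=2):
    print(doc.metadata.get("page"), doc.page_content[:120])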
Define a retriever¶
#retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
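# MMR (maximal marginal relevance) fetches fetch_k candidates by similarity, then keeps the k chunks that are both relevant and mutually diverse.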
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 3, "fetch_k": 5})
for k, v in vectorstore.get().items():
print(k, v)
# query the retriever with simple input
query = "cold coffee"
retrieved_docs = retriever.invoke(input=query)
# print the chunk matching the query
for i, doc in enumerate(retrieved_docs):
print('#'*30)
print(f'\n<<{i}>> on page {doc.metadata["page"]}: \n{doc.page_content}')
Define & Run the Chatbot¶
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
Define the LLM to be used¶
# Optional: flash attention can speed up inference on supported GPUs (the PyPI package is flash-attn).
!pip install --quiet flash-attn
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace, HuggingFacePipeline
model_name = "microsoft/Phi-3-mini-4k-instruct"
#model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# model_name = "microsoft/Phi-3.5-mini-instruct"
# task shall be one of:
# - "text-generation",
# - "text2text-generation",
# - "summarization",
# - "translation"
# This class uses serverless API (hosting by HF)
# llm = HuggingFaceEndpoint(
# repo_id=model_name,
# task="text-generation",
# max_new_tokens=512,
# do_sample=False,
# repetition_penalty=1.03,
# )
device = 0 if torch.cuda.is_available() else -1
#
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
# Quantization is disabled here; reuse the BitsAndBytesConfig above if you want to load the model in 4-bit.
bnb_config = None
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Build the text-generation pipeline (renamed to avoid shadowing the imported `pipeline` function).
text_gen_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,  # sampling must be enabled for top_k/temperature to take effect
    top_k=50,
    temperature=0.1
)
llm = HuggingFacePipeline(pipeline=text_gen_pipeline)
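# ChatHuggingFace wraps the pipeline so that chat messages are rendered with the model's chat template before generation.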
chat = ChatHuggingFace(llm=llm, verbose=True)
#del llm
import gc
gc.collect()
# clear GPU memory
import torch
torch.cuda.empty_cache()
Define the prompt¶
from langchain_core.globals import set_debug
set_debug(False)
# "You are a troubleshooting chatbot that talks like a pirate."
system_prompt = (
    "You are a troubleshooting chatbot.\n"
    "Your goal is to help the user solve the problem they have with their appliance.\n"
    "Use the following pieces of retrieved context to answer the question.\n"
    "If you don't know the answer, say that you don't know.\n"
    "Use three sentences maximum and keep the answer concise.\n"
    "\n\n"
    "{context}\n"
)
prompt = ChatPromptTemplate.from_messages(
[
("system", system_prompt),
("human", "{input}"),
]
)
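# create_stuff_documents_chain "stuffs" the retrieved chunks into the {context} slot of the prompt;
# create_retrieval_chain wires the retriever output into that question-answering chain.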
question_answer_chain = create_stuff_documents_chain(chat, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
import os
from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')
os.environ["HUGGINGFACEHUB_API_TOKEN"] = hf_token
response = rag_chain.invoke({"input": "I have an Inissia coffee machine and my coffee is cold. What should I do?"})
response["answer"]
import gradio as gr
def add_message(history, message):
    # Only the text part of the multimodal input is used; uploaded files are ignored in this demo.
    if message["text"] is not None:
        history.append((message["text"], None))
    return history, gr.MultimodalTextbox(value=None, interactive=True)
import random
color_map = {
"harmful": "crimson",
"neutral": "gray",
"beneficial": "green",
}
def html_src(harm_level):
return f"""
<div style="display: flex; gap: 5px;">
<div style="background-color: {color_map[harm_level]}; padding: 2px; border-radius: 5px;">
{harm_level}
</div>
</div>
"""
def fake_bot_response(history):
response_type = random.choice(["text", "gallery", "image", "video", "audio", "html"])
if response_type == "gallery":
history[-1][1] = gr.Gallery(
[
"https://raw.githubusercontent.com/gradio-app/gradio/main/test/test_files/bus.png",
"https://raw.githubusercontent.com/gradio-app/gradio/main/test/test_files/bus.png",
]
)
elif response_type == "image":
history[-1][1] = gr.Image(
"https://raw.githubusercontent.com/gradio-app/gradio/main/test/test_files/bus.png"
)
elif response_type == "video":
history[-1][1] = gr.Video(
"https://github.com/gradio-app/gradio/raw/main/demo/video_component/files/world.mp4"
)
elif response_type == "audio":
history[-1][1] = gr.Audio(
"https://github.com/gradio-app/gradio/raw/main/test/test_files/audio_sample.wav"
)
elif response_type == "html":
history[-1][1] = gr.HTML(
html_src(random.choice(["harmful", "neutral", "beneficial"]))
)
else:
history[-1][1] = "Cool!"
return history
def llm_response(history):
response = rag_chain.invoke({"input": history[-1][0]})
history[-1][1] = response["answer"]
return history
bot_response = llm_response
with gr.Blocks() as demo:
chatbot = gr.Chatbot(
[[None, "Hi, I'm your assistant. Ask me anything or upload a product manual to start."]],
elem_id="chatbot",
bubble_full_width=False,
)
with gr.Row():
chat_input = gr.MultimodalTextbox(
scale=4,
interactive=True,
placeholder="Enter message or upload file...",
show_label=False
)
clear_button = gr.ClearButton([chat_input, chatbot], value="Clear chat")
chat_msg = chat_input.submit(add_message, [chatbot, chat_input], [chatbot, chat_input])
bot_msg = chat_msg.then(bot_response, [chatbot], chatbot)
bot_msg.then(lambda: gr.MultimodalTextbox(interactive=True), None, [chat_input])
demo.launch(share=True)