# Semantic Search for texts

To enable searching over a large collection of texts, we can frame the problem as a Sentence Similarity task.

Sentence Similarity is the task of determining how similar two texts are. Sentence similarity models convert input texts into vectors (embeddings) that capture semantic information, and then measure how close (similar) those vectors are to each other. This task is particularly useful for information retrieval and clustering/grouping.

![img](images/text-sim.png)

1. What happens underneath? Obtaining embeddings



![img2](https://sentence-transformers-embeddings-semantic-search.hf.space/media/c2fb8304521f374106be37112a4851659f485ffc45c44a736a2b9fa9.png)

2. Obtaining the closest observations in the vector space
   
We now have two numerical representations of texts (embeddings): one for our original text database and one for our query (here, the description of a Python function). Our goal is to get the texts in the database whose meaning is closest to our query.

![img3](https://sentence-transformers-embeddings-semantic-search.hf.space/media/c96bd26de397e9c7cf142268ba32b3b0963016f9a3be1e3c86bbf209.png)
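The search step above can be sketched in a few lines. This is a minimal illustration, assuming cosine similarity as the closeness measure (a common choice, though the text does not name a specific metric). The three-dimensional vectors and the helper names `cosine_similarity` and `top_k` are made up for the example; a real embedding model would produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_embedding, corpus_embeddings, k=2):
    """Return (index, score) pairs for the k corpus vectors closest to the query."""
    scored = [
        (i, cosine_similarity(query_embedding, emb))
        for i, emb in enumerate(corpus_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings for three texts in the database.
corpus_embeddings = [
    [0.9, 0.1, 0.0],  # text 0
    [0.0, 1.0, 0.2],  # text 1
    [0.8, 0.3, 0.1],  # text 2
]
# Hypothetical embedding of the query.
query_embedding = [1.0, 0.2, 0.0]

results = top_k(query_embedding, corpus_embeddings, k=2)
for idx, score in results:
    print(f"text {idx}: similarity {score:.3f}")
```

Texts 0 and 2 point in almost the same direction as the query, so they rank above text 1. With a large database, a vector index would replace the exhaustive loop, but the ranking idea is the same.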

# Application: Retrieval-Augmented Generation (RAG)

As we saw in the previous session, the context of an LLM is limited (typically a few thousand tokens), so if we want to introduce external knowledge (for example, internal documentation from our company), not all of the information will fit in the context.

Thus, we can combine the power of an LLM with a semantic search engine to retrieve the most relevant information from a large collection of texts. This is what is called **Retrieval-Augmented Generation**.

Here are two diagrams of this process:

![img4](images/rag1.png)

![rag2](https://blogs.nvidia.com/wp-content/uploads/2023/11/LangChain-2-LLM-with-a-retriveal-process.jpg)

### Why RAG?

LLMs offer a natural language interface between humans and data. Widely available models come pre-trained on huge amounts of publicly available data like Wikipedia, mailing lists, textbooks, source code and more.

However, **while LLMs are trained on a great deal of data, they are not trained on your data, which may be private or specific to the problem you're trying to solve**. It's behind APIs, in SQL databases, or trapped in PDFs and slide decks.

You may choose to fine-tune an LLM with your data, but:

- Training an LLM is expensive.
- Because training is costly, it's hard to keep an LLM updated with the latest information.
- Observability is lacking. When you ask an LLM a question, it's not obvious how the LLM arrived at its answer.

Instead of fine-tuning, one can use a context augmentation pattern called **Retrieval-Augmented Generation (RAG) to obtain more accurate text generation relevant to your specific data**. RAG involves the following high-level steps:

1. Retrieve information from your data sources first,
2. Add it to your question as context, and
3. Ask the LLM to answer based on the enriched prompt.

In doing so, RAG overcomes all three weaknesses of the fine-tuning approach:

- There's no training involved, so it's cheap.
- Data is fetched only when you ask for it, so it's always up to date.
- It can show you the retrieved documents, so it's more trustworthy.
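The three RAG steps can be sketched without any external service. In this minimal illustration the retriever ranks documents by word overlap with the question, which is only a stand-in for the embedding search described earlier. The function names `retrieve` and `build_prompt`, the prompt wording, and the sample documents are all assumptions made for the example.

```python
def retrieve(question, documents, k=2):
    """Step 1: rank documents by how many question words they share (toy retriever)."""
    question_words = set(question.lower().split())

    def overlap(doc):
        return len(question_words & set(doc.lower().split()))

    return sorted(documents, key=overlap, reverse=True)[:k]


def build_prompt(question, context_chunks):
    """Step 2: add the retrieved text to the question as context."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )


# Hypothetical private documents the LLM was never trained on.
documents = [
    "The office is closed on public holidays.",
    "Expense reports are due on the fifth day of each month.",
    "The cafeteria serves lunch from noon to two.",
]
question = "When are expense reports due?"

chunks = retrieve(question, documents, k=1)
prompt = build_prompt(question, chunks)
print(prompt)
# Step 3 would send `prompt` to the LLM of your choice.
```

Because the retrieved chunks are ordinary strings, they can be shown to the user next to the answer, which is what makes the result easier to verify.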

### Stages within RAG

There are five key stages within RAG, which in turn will be a part of any larger application you build. These are:

* Loading: this refers to getting your data from where it lives -- whether it's text files, PDFs, another website, a database, or an API -- into your pipeline. LlamaHub provides hundreds of connectors to choose from.

* Indexing: this means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.

* Storing: once your data is indexed you will almost always want to store your index, as well as other metadata, to avoid having to re-index it.

* Querying: for any given indexing strategy there are many ways you can utilize LLMs and LlamaIndex data structures to query, including sub-queries, multi-step queries and hybrid strategies.

* Evaluation: a critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures of how accurate, faithful and fast your responses to queries are.


![stages](https://docs.llamaindex.ai/en/stable/_static/getting_started/stages.png)

### Libraries

Though we could build our own RAG application from scratch (using the transformers library or the OpenAI API), RAG is so popular that many wrapper libraries already exist to simplify the job.

One of the most popular libraries for RAG is LlamaIndex, which is compatible with the major LLM providers (Hugging Face, OpenAI, Anthropic, etc.).

https://docs.llamaindex.ai/en/stable/

In [1]:
#!pip install llama-index

For the moment, we use the OpenAI models for our RAG examples, so let's load the API key:

In [1]:
import os

os.environ["OPENAI_API_KEY"] = "sk-..."

## Example 1 - Chat with a book

Let's first develop an example of how we can use RAG to answer queries about any book. In this case, it will be:

![hp](images/hp.jpg)

A friend has shared with us the actual book in txt format, located at `data/J. K. Rowling - Harry Potter 1 - Sorcerer's Stone.txt`. Let's load it and see how we can use RAG to answer questions about it.

### Load data and build an index

In [2]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

This builds an index over the documents in the data folder (which in this case just consists of the book text, but could contain many documents).

Your directory structure should look like this:

```
├── notebook or python script
└── data
    ├── document1
    ├── document2
    └── ...
```

### Query your data

This creates an engine for Q&A over your index and asks a simple question. 

In [3]:
query_engine = index.as_query_engine()
response = query_engine.query("On which date did Harry start classes at Hogwarts?")
print(response)

Harry started classes at Hogwarts on Thursday.


In [5]:
query_engine = index.as_query_engine()
response = query_engine.query("Generate a list of the books required for first year students at Hogwarts. Use bullet points.")
print(response)

- List of necessary books and equipment for first-year students at Hogwarts:
  - The Standard Book of Spells (Grade 1) by Miranda Goshawk
  - A History of Magic by Bathilda Bagshot
  - Magical Theory by Adalbert Waffling
  - A Beginner's Guide to Transfiguration by Emeric Switch
  - One Thousand Magical Herbs and Fungi by Phyllida Spore
  - Magical Drafts and Potions by Arsenius Jigger
  - Fantastic Beasts and Where to Find Them by Newt Scamander
  - The Dark Forces: A Guide to Self-Protection by Quentin Trimble


In [6]:
query_engine = index.as_query_engine()
response = query_engine.query("How many points is the Golden Snitch worth in a game of Quidditch?")
print(response)

The Golden Snitch is worth an extra hundred and fifty points in a game of Quidditch.


In [7]:
query_engine = index.as_query_engine()
response = query_engine.query("How many points did the Gryffindor house achieved at the end of the first year?")
print(response)

Gryffindor house achieved a total of four hundred and seventy-two points at the end of the first year.
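To check which passages an answer was based on, you can inspect the retrieved chunks that LlamaIndex attaches to the response as `response.source_nodes`, each with a similarity `score` and a `node` holding the text. The helper below is a small sketch; the name `summarize_sources` and the stand-in response object are assumptions made so the example runs without an API key.

```python
from types import SimpleNamespace


def summarize_sources(response, max_chars=80):
    """Return one line per retrieved chunk: similarity score plus a text preview."""
    lines = []
    for source in response.source_nodes:
        preview = source.node.get_content()[:max_chars].replace("\n", " ")
        score = "n/a" if source.score is None else f"{source.score:.3f}"
        lines.append(f"[{score}] {preview}")
    return lines


# Stand-in object that mimics the attributes used above; in the notebook,
# pass the `response` returned by query_engine.query(...) instead.
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            score=0.83,
            node=SimpleNamespace(get_content=lambda: "First retrieved passage"),
        ),
        SimpleNamespace(
            score=None,
            node=SimpleNamespace(get_content=lambda: "Second retrieved passage"),
        ),
    ]
)

for line in summarize_sources(fake_response):
    print(line)
```

If an answer looks wrong, reading these previews usually shows whether the retriever fetched the wrong passage or the LLM misread the right one.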


### Storing your index

By default, the data you just loaded is stored in memory as a series of vector embeddings. You can save time (and requests to OpenAI) by saving the embeddings to disk. That can be done with this line:

```
index.storage_context.persist()
```

Of course, you don't get the benefits of persisting unless you load the data. So let's modify the code to generate and store the index if it doesn't exist, but load it if it does:

In [23]:
import os.path
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

# check if storage already exists
PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

In [24]:
query_engine = index.as_query_engine()
response = query_engine.query("How does Harry first learn that he is a wizard?")
print(response)

Harry first learns that he is a wizard when Hagrid arrives to deliver his acceptance letter to Hogwarts School of Witchcraft and Wizardry.


### Debugging

To see what's happening under the hood, you can set the logging level to DEBUG:

In [25]:
import logging
import sys

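# basicConfig attaches a handler that prints "LEVEL:logger:message" to stdout.
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# This second handler prints the bare message again, so each record appears twice below.
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))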

In [26]:
response = query_engine.query("How does Harry first learn that he is a wizard?")
print(response)

DEBUG:openai._base_client:Request options: {'method': 'post', 'url': '/embeddings', 'files': None, 'post_parser': <function Embeddings.create.<locals>.parser at 0x31be1b130>, 'json_data': {'input': ['How does Harry first learn that he is a wizard?'], 'model': 'text-embedding-ada-002', 'encoding_format': 'base64'}}
Request options: {'method': 'post', 'url': '/embeddings', 'files': None, 'post_parser': <function Embeddings.create.<locals>.parser at 0x31be1b130>, 'json_data': {'input': ['How does Harry first learn that he is a wizard?'], 'model': 'text-embedding-ada-002', 'encoding_format': 'base64'}}
DEBUG:httpcore.connection:close.started
close.started
DEBUG:httpcore.connection:close.complete
close.complete
DEBUG:httpcore.connection:connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=60.0 socket_options=None
connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=60.0 socket_options=None
DEBUG:httpcore.connection:connect_tcp.complete retur

To disable it, you can restart the notebook kernel.
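Restarting is not the only option. A lighter alternative, sketched here with the standard `logging` module only, is to raise the level of the root logger again so that DEBUG and INFO records are dropped. The choice of `WARNING` as the new threshold is an assumption; pick whatever level suits you.

```python
import logging

root_logger = logging.getLogger()
# Raise the threshold: records below WARNING are no longer emitted.
root_logger.setLevel(logging.WARNING)

# This message is now filtered out and prints nothing.
logging.debug("this debug message is suppressed")
```

Handlers that were attached earlier stay in place, so warnings and errors will still be printed through them.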

## Example 2 - Financial Reporting

Now, let's analyze financial data from Apple. We can download such reports from the SEC filings page of the US government.

https://www.sec.gov/edgar/searchedgar/companysearch

A report has already been downloaded to the `data_apple` directory.

In [4]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data_apple").load_data()
index = VectorStoreIndex.from_documents(documents)

### Some sample questions...

In [5]:
query_engine = index.as_query_engine()
response = query_engine.query("Where are the company's headquarters located?")
print(response)

The company's headquarters are located in Cupertino, California.


In [13]:
response = query_engine.query("Why Epic Games sued Apple?")
print(response)

Epic Games sued Apple alleging violations of federal and state antitrust laws and California’s unfair competition law based on the operation of Apple's App Store.


In [14]:
response = query_engine.query("What were the total revenues of Apple in 2023?")
print(response)

Apple's total revenues in 2023 were $383.3 billion.


In [15]:
response = query_engine.query("And the net income?")
print(response)

The net income for the year ended September 30, 2023, was $96,995 million.


In [17]:
response = query_engine.query("Generate a list of the products that Apple has launched in that year. Use bullet points.")
print(response)

- iPad and iPad Pro
- Next-generation Apple TV 4K
- MLS Season Pass
- MacBook Pro 14”, MacBook Pro 16” and Mac mini
- Second-generation HomePod
- MacBook Air 15”, Mac Studio and Mac Pro
- Apple Vision Pro™
- iOS 17, macOS Sonoma, iPadOS 17, tvOS 17 and watchOS 10
- iPhone 15, iPhone 15 Plus, iPhone 15 Pro and iPhone 15 Pro Max
- Apple Watch Series 9 and Apple Watch Ultra 2


## Example 3 - User Support Assistant

Now let's build a User Support Assistant that can answer questions about a product. In this case, we will use the Apple Watch as an example.

The manual is a PDF almost 400 pages long!

https://help.apple.com/pdf/watch/10/en_US/apple-watch-user-guide-watchos10.pdf

It has already been downloaded to the `data_user_support` directory.

In [3]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data_user_support").load_data()
index = VectorStoreIndex.from_documents(documents)

### Some sample queries...

In [19]:
query_engine = index.as_query_engine()
response = query_engine.query("Describe the steps to change notification settings")
print(response)

Open the Apple Watch app on your iPhone, tap My Watch, then tap Notifications. Select the specific app you want to adjust, tap Custom, and then choose from options like Allow Notifications, Send to Notification Center, or Notifications Off. Additionally, you can customize notification grouping preferences for the app.


In [20]:
response = query_engine.query("How can i zoom in in the maps?")
print(response)

To zoom in on the maps, you can double-tap the map on the spot you want to zoom in on.


In [21]:
response = query_engine.query("Does my watch support Mirroring")
print(response)

Your watch supports Mirroring if it is an Apple Watch Series 6, Apple Watch Series 7, Apple Watch Series 8, or Apple Watch Series 9.


### Exercise

Build a simple RAG example for a set of documents of your choice.

## Using local models

Instead of using the OpenAI API models for both the embeddings and the LLM, you can use local, open-source models. This is particularly useful when you have a large collection of documents and you want to avoid the costs of the API.

The code is almost the same; you only need to specify the concrete models:

https://docs.llamaindex.ai/en/stable/getting_started/starter_example_local/
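As a rough sketch of what "specifying the concrete models" can look like, the function below sets the global LlamaIndex `Settings` to a Hugging Face embedding model and an LLM served by Ollama. It assumes the optional packages `llama-index-embeddings-huggingface` and `llama-index-llms-ollama` are installed and that an Ollama server is running locally. The function name `configure_local_models` and both model names are illustrative assumptions, not recommendations from this notebook.

```python
def configure_local_models(
    embed_model_name="BAAI/bge-small-en-v1.5",  # assumed example embedding model
    llm_name="llama3",  # assumed model already pulled into Ollama
):
    """Point LlamaIndex at local models instead of the OpenAI defaults."""
    # Imports live inside the function because these are optional extras:
    #   pip install llama-index-embeddings-huggingface llama-index-llms-ollama
    from llama_index.core import Settings
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.llms.ollama import Ollama

    Settings.embed_model = HuggingFaceEmbedding(model_name=embed_model_name)
    Settings.llm = Ollama(model=llm_name, request_timeout=360.0)


# Call once before building the index; the rest of the code stays the same:
# configure_local_models()
# index = VectorStoreIndex.from_documents(documents)
```

Note that an index built with one embedding model should be queried with the same model, so rebuild any stored index after switching.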