Quickstart: Retrieval Augmented Generation (RAG)

How to build a RAG workflow in under 5 mins!

In this Quickstart, you'll learn how to build a RAG workflow with Together AI in six quick steps that can be run in under 5 minutes!

We'll use the embedding, reranking, and inference endpoints.

1. Register for an account

First, register for an account to get an API key. New accounts come with $1 to get started.

Once you've registered, set your account's API key to an environment variable named TOGETHER_API_KEY:

export TOGETHER_API_KEY=xxxxx
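
If you're working in a Python notebook where exporting a shell variable is awkward, you can set the key from within Python instead (a minimal alternative; replace the placeholder with your actual key):

import os

# Makes the key available to the Together client in this process
os.environ["TOGETHER_API_KEY"] = "xxxxx"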

2. Install your preferred library

Together provides an official library for Python:

pip install together --upgrade

import os

from together import Together

# Read the API key from the environment variable set above
client = Together(api_key=os.environ.get("TOGETHER_API_KEY"))

3. Data Processing and Chunking

We will RAG over Paul Graham's essay titled Founder Mode. The code below scrapes the essay and loads it into memory.

import requests
from bs4 import BeautifulSoup

def scrape_pg_essay():
    url = 'https://paulgraham.com/foundermode.html'

    try:
        # Send GET request to the URL
        response = requests.get(url)
        response.raise_for_status()  # Raise an error for bad status codes

        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')

        # Paul Graham's essays typically have the main content in a font tag
        # You might need to adjust this selector based on the actual HTML structure
        content = soup.find('font')

        if content:
            # Extract and clean the text
            text = content.get_text()
            # Remove extra whitespace and normalize line breaks
            text = ' '.join(text.split())
            return text
        else:
            return "Could not find the main content of the essay."

    except requests.RequestException as e:
        return f"Error fetching the webpage: {e}"

# Scrape the essay
pg_essay = scrape_pg_essay()

Chunk the essay:

# Naive fixed-size chunking with overlap

def create_chunks(document, chunk_size=300, overlap=50):
    # Step forward by (chunk_size - overlap) characters so consecutive chunks share an overlap
    return [document[i : i + chunk_size] for i in range(0, len(document), chunk_size - overlap)]


chunks = create_chunks(pg_essay, chunk_size=250, overlap=30)
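
A quick sanity check on the chunking (exact counts will vary with the scraped text):

# Inspect how many chunks were produced and what one looks like
print(f"Number of chunks: {len(chunks)}")
print(chunks[0])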

4. Generate Vector Index and Perform Retrieval

We will now use bge-large-en-v1.5 to embed the chunks above and build a vector index.

from typing import List
import numpy as np

def generate_embeddings(input_texts: List[str], model_api_string: str) -> List[List[float]]:
    """Generate embeddings from Together python library.

    Args:
        input_texts: a list of string input texts.
        model_api_string: str. An API string for a specific embedding model of your choice.

    Returns:
        embeddings_list: a list of embeddings. Each element corresponds to an input text.
    """
    outputs = client.embeddings.create(
        input=input_texts,
        model=model_api_string,
    )
    return [x.embedding for x in outputs.data]
  
embeddings = generate_embeddings(chunks, "BAAI/bge-large-en-v1.5")
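
The retrieval function below does its similarity math with NumPy, so stack the list of embeddings into a matrix first; a quick shape check makes a handy sanity test (bge-large-en-v1.5 produces 1024-dimensional vectors):

# One row per chunk, one column per embedding dimension
vector_index = np.array(embeddings)
print(vector_index.shape)  # (number_of_chunks, 1024)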

The function below will help us perform vector search:

def vector_retrieval(query: str, top_k: int = 5, vector_index: np.ndarray = None) -> List[int]:
    """
    Retrieve the top-k most similar items from an index based on a query.
    Args:
        query (str): The query string to search for.
        top_k (int, optional): The number of top similar items to retrieve. Defaults to 5.
        vector_index (np.ndarray, optional): The index array containing embeddings to search against. Defaults to None.
    Returns:
        List[int]: A list of indices corresponding to the top-k most similar items in the index.
    """
    # Embed the query with the same model used for the chunks
    query_embedding = np.array(generate_embeddings([query], 'BAAI/bge-large-en-v1.5')[0])

    # Score every chunk by dot-product similarity with the query embedding
    similarity_scores = np.dot(query_embedding, vector_index.T)

    # Sort scores in descending order and return the indices of the top-k chunks
    return list(np.argsort(-similarity_scores)[:top_k])
  
top_k_indices = vector_retrieval(query="What are 'skip-level' meetings?", top_k=5, vector_index=vector_index)
top_k_chunks = [chunks[i] for i in top_k_indices]

We now have a way to retrieve from the vector index given a query.
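
Before moving on, it can help to eyeball what vector search alone surfaced (output will vary with the scraped text and chunking):

# Print each retrieved chunk alongside its index position
for i, chunk in zip(top_k_indices, top_k_chunks):
    print(f"Chunk {i}: {chunk[:80]}...")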

5. Rerank To Improve Quality

We will use a reranker model to improve retrieved chunk relevance quality:

def rerank(query: str, chunks: List[str], top_k: int = 3) -> List[int]:
    response = client.rerank.create(
        model="Salesforce/Llama-Rank-V1",
        query=query,
        documents=chunks,
        top_n=top_k,
    )

    return [result.index for result in response.results]

rerank_indices = rerank("What are 'skip-level' meetings?", chunks=top_k_chunks, top_k=3)

reranked_chunks = ''

for index in rerank_indices:
    reranked_chunks += top_k_chunks[index] + '\n\n'

print(reranked_chunks)

6. Call the Generative Model - Llama 3.1 405B

We will pass the three reranked, concatenated chunks into an LLM to generate our final answer.

query = "What are 'skip-level' meetings?"

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    messages=[
      {"role": "system", "content": "You are a helpful chatbot."},
      {"role": "user", "content": f"Answer the question: {query}. Use only information provided here: {reranked_chunks}"},
    ],
)

print(response.choices[0].message.content)

If you want to learn more about how to best use open models, refer to our docs!