Simple RAG with a Locally Running LLM

This is a simple example of a RAG (Retrieval-Augmented Generation) application with a locally running LLM (Large Language Model).

For this example we will use Mistral running with Ollama on macOS.

See this post for more details on how to get it up and running.
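
If Ollama is already installed, pulling and starting the model is typically a single command (Ollama serves its HTTP API on localhost:11434 by default):

$ ollama run mistral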

First, ensure the model is running and responding to queries over HTTP:

$ curl -X POST http://localhost:11434/api/generate \
       -d '{"model":"mistral", "prompt":"Hello"}'

This should reply with a stream of tokens.
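
Each token arrives as a small JSON object on its own line; the exact fields can differ between Ollama versions, but a chunk looks roughly like this:

{"model":"mistral","created_at":"...","response":"Hello","done":false}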

The idea of Retrieval-Augmented Generation is to append information to the prompt that is not otherwise available to the model.

A simple example of such data is the current system time. Normally, a language model does not have access to that information. If we ask:

>>> What time is it?
I don't have access to the current time, 
but you can use a world clock website or app 
to find out the current time in your location.

The following script uses RAG to append the current time to the prompt, so the LLM can answer with this new context.

simple-rag-request.py:

import json
import requests

from datetime import datetime

# Function to get extra data for RAG.
def getRAGData():
  currentTime = datetime.now().strftime("%I:%M %p")
  return "Current time is: " + currentTime + ". "

# Main program.
inputPrompt = input("Prompt: ")

API_URI = "http://localhost:11434/api/generate"

# API request body.
postBody = dict()
postBody["model"] = "mistral"

# Prepend the retrieved data to the user's prompt.
combinedPrompt = getRAGData() + inputPrompt
postBody["prompt"] = combinedPrompt
postBody["stream"] = False  # Return the full response as a single JSON object.

result = requests.post(API_URI, json=postBody)

jsonResult = json.loads(result.text)
finalResponse = jsonResult["response"]

print(finalResponse)
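
For reference, with streaming disabled the body the script POSTs to the API ends up looking like this (the time will of course differ):

{
  "model": "mistral",
  "prompt": "Current time is: 09:23 PM. what time is it?",
  "stream": false
}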

Now we can run the script and see how the extra information informs the result:

$ python simple-rag-request.py
Prompt: what time is it?

The current time is 9:23 PM.

This idea is easily extended to querying proprietary data in our own databases, or any other data we wish to inject.
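
As a minimal sketch of that, only getRAGData() needs to change. Assuming, purely for illustration, a local SQLite file orders.db with an orders table (neither is part of the original example), the retrieval step could look like this:

import sqlite3

# Hypothetical retrieval function: the database file, table, and column
# names are placeholders for whatever proprietary data we want to expose.
def getRAGData():
  connection = sqlite3.connect("orders.db")
  cursor = connection.cursor()
  cursor.execute("SELECT COUNT(*) FROM orders WHERE order_date = DATE('now')")
  orderCount = cursor.fetchone()[0]
  connection.close()
  return "Number of orders placed today: " + str(orderCount) + ". "

The rest of the script stays the same; the model simply sees whatever extra text we choose to put in front of the prompt.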