Building a Multi-Hop Search System: Integrating Databricks Vector Search and Model Serving with DSPy in Databricks Notebooks

Matthew McCoy, Ph.D.
9 min read · Jul 10, 2024

--

This is a technical post that combines knowledge from multiple sources to demonstrate how to build a multi-hop search system entirely within Databricks. The goal is partly exploratory (understanding how compound AI systems work), but also to review DSPy and show how easy it is to build a pipeline with it.

There are 2 components I wish to integrate in the future:

  1. DSPy — a programming model that mimics the structure of neural networks: given a metric and some data, the DSPy compiler can optimize your program to perform a specific task by tuning its internal parameters
  2. Mosaic AI Agents — A framework offered by Databricks to build high-quality agentic RAG applications with the Databricks Data Intelligence Platform.

The main idea is to generate prompts using DSPy and feed them to Mosaic AI Agents to enhance the quality of responses. Before tackling that, it's useful to understand how to build a DSPy pipeline and how easily DSPy fits within the Databricks ecosystem, which is what you will find in this post.

A word on prompt engineering from a compound AI system perspective

I like to think of prompt engineering as akin to grid searching in machine learning to fine-tune your hyperparameters. In the end, you're after the perfect set of words or phrases that will get the foundation model to behave in a specific way, such as adopting a particular tone or producing JSON output. Within the past year, the conversation has shifted from prompt-engineering habits and cheat sheets to a more declarative approach that optimizes language model pipelines, which are often composed of non-differentiable components (a real problem for optimization tasks). DSPy provides a strong argument for why compound AI systems are the future of prompting for language models (LMs).

What is Demonstrate-Search-Predict (DSPy)?

The following was taken from [2] and [4]. DSPy is a programming model that abstracts LM pipelines as text transformation graphs. That is, it is a programming model whose purpose is to design AI systems via pipelines of language models (LLMs or SLMs) along with other tools. Taken from the abstract of the DSPy paper: “We design a compiler that will optimize any DSPy pipeline to maximize a given metric… succinct DSPy programs can express and optimize sophisticated LM pipelines that reason about math word problems, tackle multi-hop retrieval, answer complex questions, and control agent loops.” The end-goal of building a DSPy program is to optimize a complex LM system.

A deeper look at DSPy

To understand DSPy, we will need to look at how to build the compiler system. At a high level, there are 8 steps involved in using DSPy [7]:

  1. Define your task
  2. Define your pipeline
  3. Provide examples to determine what is possible (swapping between LMs is simple). The more examples the better (at least 10)
  4. Define the data
  5. Define the relevant metric
  6. Zero-Shot evaluation (we’ll stop here)
  7. Optimize the DSPy program
  8. Iterate
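
Step 5 deserves a quick concrete illustration before moving on. In DSPy, a metric is just a Python function that scores a prediction against a labeled example; the sketch below is a hypothetical minimal version (the `answer` field names are assumptions for illustration), not code from the pipeline built later in this post.

# A minimal, hypothetical DSPy metric: it receives a labeled example, the
# program's prediction, and (during optimization) an execution trace, and
# returns a score.
def validate_answer(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()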

DSPy Module: A single module is a foundational building block for programs that use LMs, and each module has three attributes [6] (a short sketch follows the list below). Those familiar with PyTorch should recall that a module's main responsibilities are to define the structure of the network and to specify the forward pass, detailing how input data flows through the network. Similarly, a module in DSPy details how an initial task flows from one layer (module) to the next throughout the network (program):

  1. Each module abstracts a prompting technique and can handle signatures (see below).
  2. Has learnable parameters of three different types: LM weights, instructions, and demonstrations of input/output behavior (like few-shot examples, but more powerful). The magic is that DSPy can optimize all three of these automatically, often producing better results than a human could, given all the possible permutations.
  3. Modules can be composed to make larger modules in the same way modules are the foundational components used to build neural networks with PyTorch.
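
To make that PyTorch analogy concrete, here is a minimal sketch (the module name and field names are hypothetical, not part of the pipeline built later): a larger module declares sub-modules in __init__ and wires them together in forward, exactly like composing layers.

# Hypothetical sketch: composing two built-in modules into a larger one.
class QueryThenAnswer(dspy.Module):
    def __init__(self):
        super().__init__()
        # each sub-module wraps a prompting technique around a signature
        self.rewrite = dspy.ChainOfThought("question -> search_query")
        self.respond = dspy.ChainOfThought("search_query -> answer")

    def forward(self, question):
        query = self.rewrite(question=question).search_query
        return self.respond(search_query=query)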

Signatures: When you wish to assign a task to an LM, you do so through a DSPy Module via a signature. A signature is a declarative specification of the input/output behavior of a module. Signatures are made for modular, clean code, in which LM calls can be optimized into high-quality prompts (or automatic finetunes) [5]. Those who are familiar with OpenAI's function calling will find this similar to defining the tools parameter, except that DSPy modules are more declarative, composable, and contain learnable parameters, while function calling is imperative and geared toward flexible, on-the-fly programming.
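
As a quick illustration (a sketch with made-up field names, not code used later in this post), the same behavior can be declared either inline as a string or as a class whose docstring and field descriptions DSPy can fold into the prompt:

# Inline, string-form signature
summarize = dspy.ChainOfThought("document -> summary")

# Class-form signature: the docstring and field descriptions become part of the prompt
class Summarize(dspy.Signature):
    """Summarize the document in two sentences."""

    document = dspy.InputField()
    summary = dspy.OutputField(desc="two sentences")

summarize = dspy.ChainOfThought(Summarize)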

DSPy Program: This is simply a stack of modules that consists of multiple calls to LMs. Given a metric and data, we seek to optimize the overall program.

Putting modules, signatures, and a program together gives you an uncompiled program, i.e. a zero-shot program.

Example: Zero-shot (uncompiled) program

Here, we will provide the code that can be found in [8], but we’ll do so within an Azure Databricks environment. I am using the simplest cluster available — a single user with the latest ML runtime and a Standard_DS3_v2 node type (0.75 DBU/h).

Notice that in the code below, we don't need any additional configuration to set up the LLM. Model serving completely abstracts this away for the user, even when using an external model like gpt-4o.

Code

Install and import the necessary requirements. This is a little tricky if you're simply copying the code found on the source website or various Databricks blogs, so I will explain how this works based on my own research, and as clearly as possible from a small POC perspective (which is always my goal with my posts).

# Cell 1: install dependencies, then restart Python so the new packages are picked up
%pip install dspy-ai
%pip install openai==1.3.9
dbutils.library.restartPython()

# Cell 2 (after the restart): imports
import dspy
from dspy.retrieve.databricks_rm import DatabricksRM

import openai
from openai import OpenAI

Next, get the environment variables needed for connecting to the various components. This step assumes you have already created your index table, which can be done swiftly using the UI (see my previous post on how this works) or programmatically, as sketched below.
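
If you prefer to create the index in code rather than through the UI, a rough sketch with the databricks-vectorsearch client looks like the following. All names here (endpoint, catalog, table, embedding endpoint) are placeholders, and you should check the current Databricks documentation for the exact arguments.

# requires: %pip install databricks-vectorsearch
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

# Placeholder names throughout; a Delta Sync index computes embeddings for the
# text column using an embedding model serving endpoint.
vsc.create_delta_sync_index(
    endpoint_name="my_vector_search_endpoint",
    index_name="my_catalog.my_schema.dspy_paper_chunks_index",
    source_table_name="my_catalog.my_schema.dspy_paper_chunks",
    pipeline_type="TRIGGERED",
    primary_key="chunk_id",
    embedding_source_column="chunk_text",
    embedding_model_endpoint_name="databricks-bge-large-en",
)

With the index in place, the next snippet pulls the workspace token, endpoint, and index name that DSPy will need.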

# Retrieve the access token and url for our model serving endpoints
databricks_api_key = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
databricks_endpoint = '<https_my_workspace_info_.net>' #Copy your workspace url up to ".net"
databricks_model_serving_endpoint_url = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get() + '/serving-endpoints'
databricks_index_name = '{catalog}.{database}.{name_of_indexed_table}'

Now, this next part is essential to understanding how to build DSPy pipelines. We need to configure the LM (lm) and the Retriever Model (rm). Luckily, the configuration is very simple. One thing to notice is that I am grabbing two columns: chunk_id and chunk_text. I have taken the original DSPy paper [4], chunked it (chunk_text), and assigned each chunk a primary key (chunk_id). The LM goes through Model Serving, so you can use external models such as OpenAI's; to swap models, simply change the model parameter to the name you gave the endpoint when registering the model, for example databricks-meta-llama-3-70b-instruct instead of gpt_4o.

lm = dspy.Databricks(
    # model="databricks-dbrx-instruct",
    model="gpt_4o",
    model_type="chat",
    api_key=databricks_api_key,
    api_base=databricks_model_serving_endpoint_url,
)

rm = DatabricksRM(
    databricks_index_name=databricks_index_name,
    databricks_endpoint=databricks_endpoint,
    databricks_token=databricks_api_key,
    columns=["chunk_id", "chunk_text"],
    k=3,
    docs_id_column_name="chunk_id",
    text_column_name="chunk_text",
)

dspy.settings.configure(lm=lm, rm=rm)
lm("Why is DSPy useful for LLMs?")

Here is the output, which is clearly way off the mark: DSPy (DeepSpeed Python) is a useful tool for Large Language Models (LLMs) because it provides a set of optimized algorithms and techniques to accelerate the training and inference of these models. Here are some reasons why DSPy is useful for LLMs:\n\n1. **Faster Training**: DSPy provides optimized implementations of various deep learning algorithms, such as Adam, SGD, and LAMB, which can significantly speed up the training process of LLMs. This is particularly important for large models that require massive computational resources and time to train.\n2. **Memory Efficiency**: LLMs often require large amounts of memory to store their model weights and activations. DSPy provides techniques like gradient checkpointing, activation checkpointing, and memory-efficient

Next, we build two signatures that will be used to generate the query and answer. Again, this is how we design the input and output of modules. Pay attention to how we are using natural language to accomplish this task (declarative framework).

class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer a complex question."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    query = dspy.OutputField()

class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 sentences")
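
Before assembling the full pipeline, you can sanity-check a signature by wrapping it in a module and calling it directly. This is just a quick aside, not a step from [8]:

# Quick sanity check: wrap GenerateSearchQuery in a ChainOfThought module and call it.
generate_query = dspy.ChainOfThought(GenerateSearchQuery)
print(generate_query(context=[], question="What does the DSPy compiler optimize?").query)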

We can now build the Baleen pipeline, i.e. the pipeline that takes in a question, generates a search query for each hop, retrieves passages for those queries, and produces the final answer from the accumulated context.

from dsp.utils import deduplicate

class SimplifiedBaleen(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()

        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        # self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.retrieve = rm
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops

    def forward(self, question):
        context = []

        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            # passages = self.retrieve(query).passages
            passages = self.retrieve(query).docs
            context = deduplicate(context + passages)

        pred = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=pred.answer)

We’re now ready to send in our question and get the answer.

# Ask any question you like to this simple RAG program.
my_question = "What are the benefits of DSPy?"


# Get the prediction. This contains `pred.context` and `pred.answer`.
uncompiled_baleen = SimplifiedBaleen() # uncompiled (i.e., zero-shot) program
pred = uncompiled_baleen(my_question)

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")

And here is the answer: DSPy is useful for LLMs because it enables the creation of optimized pipelines that can outperform standard few-shot prompting and expert-created demonstrations. It allows for the rapid development of highly effective systems using relatively small LLMs, and it provides a systematic approach to designing AI pipelines by abstracting away from manipulating free-form strings and instead using modular operators to build text transformation graphs.

But what about the prompts and intermediate steps? How do we get insight into how the model reasoned its way to this answer? We can use the following command to find out.

lm.inspect_history(n=10)

This lets us view the 10 most recent LM calls (the prompts and completions) the program made on its way to the answer. Here are some screenshots of the reasoning steps for this particular question.

Reasoning Example 1
Reasoning Example 2

As you can see, the pipeline is reading in context from the vector search in order to arrive at its next reasoning step. In this example, the model was able to arrive at the answer in 6 seconds.
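
If you want to measure that yourself, wrapping the forward pass in a simple wall-clock timer is enough. This is just a sketch, and not necessarily how the six-second figure above was obtained:

import time

start = time.perf_counter()
pred = uncompiled_baleen(my_question)
print(f"Answered in {time.perf_counter() - start:.1f} seconds")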

Alternate Steps

What about without using vector search? In that case, we can follow the code from [8], but because we would be using gpt-3.5-turbo for our lm, we need to pull the OpenAI key from a key vault in Azure Databricks; without this, you will not be able to send in your query. Following the code in [8] cost me about $0.20 overall, so it is totally affordable if you want to run that code inside Databricks but without Vector Search.

import openai
openai.api_key = dbutils.secrets.get(scope = "key-vault-scope", key = "openai-db-secret")
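
From there, the wiring roughly follows [8]: the key from the secret scope configures an OpenAI LM, and the retriever becomes the hosted ColBERTv2 index the tutorial uses instead of Databricks Vector Search. This is a sketch; the ColBERTv2 URL is the public demo server referenced in the DSPy docs, so verify it against [8] before relying on it.

turbo = dspy.OpenAI(model="gpt-3.5-turbo", api_key=openai.api_key)

# Hosted ColBERTv2 index over Wikipedia 2017 abstracts, as used in [8]
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")

dspy.settings.configure(lm=turbo, rm=colbertv2_wiki17_abstracts)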

Conclusion and Next Steps

Here we've demonstrated how to build a DSPy pipeline using Model Serving and Vector Search, all within a Databricks Notebook. This was a fairly simple process with minimal configuration. The next step is, of course, to add in examples so that one can perform validation analysis and optimization (a sketch of that step follows below). We stopped short of that since the goal of this post is to provide clarity on how to integrate the components offered by Databricks. After that, the plan is to integrate DSPy with the Mosaic AI Agent framework as well.
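
To give a flavor of what that next step looks like, here is a hedged sketch following the optimization step in [8]: a small training set of dspy.Example objects plus a metric lets the BootstrapFewShot teleprompter compile the program. The question/answer values below are placeholders, not real training data.

from dspy.teleprompt import BootstrapFewShot

# Placeholder training data: each example marks `question` as the input field.
trainset = [
    dspy.Example(question="What does the DSPy compiler optimize?",
                 answer="LM pipelines").with_inputs("question"),
    # ... more labeled examples (the DSPy docs suggest at least 10)
]

# Same idea as the metric sketched earlier: does the gold answer appear in the prediction?
def validate_answer(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

teleprompter = BootstrapFewShot(metric=validate_answer)
compiled_baleen = teleprompter.compile(SimplifiedBaleen(), trainset=trainset)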

References

[1] Models Don’t Matter: Building Compound AI Systems … — Databricks Community — 75729

[2] The Shift from Models to Compound AI Systems — The Berkeley Artificial Intelligence Research Blog

[3] AI Prompt Engineering Is Dead — IEEE Spectrum

[4] [2310.03714] DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines (arxiv.org)

[5] Signatures | DSPy (dspy-docs.vercel.app)

[6] Modules | DSPy (dspy-docs.vercel.app)

[7] Using DSPy in 8 Steps | DSPy (dspy-docs.vercel.app)

[8] https://dspy-docs.vercel.app/docs/tutorials/simplified-baleen
