How to Utilize Databricks’ AI-Generated Comments and DBRX for Crafting NL-to-SQL Scripts in Your Databricks Notebooks
By now, it is well understood that using an LLM to query your database can be tricky. Often, domain knowledge needs to be injected into the prompt as part of the engineering process. Think of this as a hint that nudges the LLM toward the right tables and columns whenever a domain-specific question is asked. For example:
semantic_prompt = 'Use column 1 from table A to answer questions about billing time.'
So far, the best practice for Natural-Language-to-SQL (NL-to-SQL) seems to be to create downstream tables: engineers and analysts build additional tables that encode LLM-specific information, and those tables must be kept high quality, low dimensionality (or backed by some streaming architecture), and so on. However, Databricks records every piece of metadata about your tables and stores it in an auto-generated schema called information_schema. So there's no need to spend time constructing these additional tables — we can simply use the comment columns in the table overview (in Catalog Explorer) to add meaningful context.
In addition to the complexity of enriching your table with semantic information, working with stakeholders to get all the necessary configurations and permissions can also be a challenge. However, Databricks’ Unity Catalog cuts right through the mess of orchestrating proper permissions. Here, you can ensure that the user has READ ONLY permission so that no data is being altered via an LLM prompt.
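For example, a read-only setup in Unity Catalog might look like the following (the catalog, schema, and principal names here are hypothetical — substitute your own):

```sql
-- Grant read-only access: the principal can discover and query data, but never modify it
GRANT USE CATALOG ON CATALOG ai_gen_cat TO `nl2sql-app`;
GRANT USE SCHEMA ON SCHEMA ai_gen_cat.ai_gen_db TO `nl2sql-app`;
GRANT SELECT ON SCHEMA ai_gen_cat.ai_gen_db TO `nl2sql-app`;
```

With only USE and SELECT privileges granted, any destructive SQL the LLM might emit (DROP, DELETE, UPDATE) will simply fail at the permission check.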
Goal: The purpose of this post is to exhibit how one can leverage the power of AI-generated comments within Databricks to accomplish the task of building an NL-to-SQL assistant without leaving Databricks. I will dive into how Databricks appears to be generating these comments as well as provide some code for a simple use-case using auto-generated tables within Unity Catalog.
Article Summary:
- Generate column comments and a table comment using the suggested text produced by Databricks.
- These comments are stored as metadata inside the database's information_schema. We can read this information and inject it into our prompt.
- Create a text script that describes how you would like your answer formatted. Additional cleaning might be necessary.
- Run the response using spark.sql().
- Optional: send the response back through the LLM for a natural-language response.
Generating Column Comments:
Each time you create a catalog, a series of views is generated automatically that stores metadata about everything we need to tell our LLM in order to form a proper query. We can use this information as part of our RAG architecture to generate SQL code and pass that code through Spark to return our answer inside a Databricks notebook. Go to the view information_schema.columns; there, you will find that the comments have been automatically updated. Additionally, the information_schema.tables view contains the comment for the table itself.
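You can confirm what the LLM will see with a quick query against these views (assuming a catalog named ai_gen_cat, as in the example later in this post):

```sql
-- Inspect the AI-generated column comments for one table
SELECT column_name, data_type, comment
FROM ai_gen_cat.information_schema.columns
WHERE table_name = 'sample_customer';

-- Inspect the AI-generated table-level comment
SELECT table_name, comment
FROM ai_gen_cat.information_schema.tables
WHERE table_name = 'sample_customer';
```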
How to generate quality comments using the assistance of Databricks’ recommendation: To get quality results from the recommendation of Databricks’ AI assistant inside the Catalog Explorer, the key lies in the quality naming of catalogs, tables, views, and columns. The generated text is based heavily on the schema information as well as the naming. Without proper naming, you will need to do a lot of heavy lifting in providing context to the columns and tables.
If these two items, schema and enriched information about the columns, are of high enough quality, NL-to-SQL is reduced to a prompt engineering task that’s been made simpler. Because Databricks is an end-to-end solution for all AI/ML tasks and governance, we don’t need to leave the platform in order to accomplish this task. Essentially, Databricks has handed us everything we need to accomplish this task. This is similar to their AutoML offering — the recommended AI-generated text can get you started and you can tweak it as you see fit.
Testing AI-Generated Text: Let's test how Databricks formulates these suggested comments. Here are two scenarios: one with poorly named columns and one with descriptive column names.
The main observation here is that Databricks appears to be pulling from the schema and the column names to provide context to the underlying LLM that generates these comments. Even if the table contains columns with generic names like those shown above, we still get a generated comment, though it refers only to the datatype. On the other hand, the table comment pulls from the table name as well as the column names and schema information for a more complete view of the data. All of this makes sense and is unsurprising, but it's encouraging to see it in practice.
Recommendation: Instead of building downstream tables, use the views automatically generated in information_schema. Give tables, columns, and other objects proper, enriching names. This will guide Databricks' AI-generated comments to provide the right context when they are sent to the LLM. Let Databricks do the heavy lifting at first, and tweak the comments as you see fit.
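When a generated comment does need tweaking, you can overwrite it directly with SQL (the table and column here come from the TPC-H sample used below; the comment text is illustrative):

```sql
-- Replace the table-level comment
COMMENT ON TABLE ai_gen_cat.ai_gen_db.sample_customer
IS 'One row per customer, including account balance and market segment.';

-- Replace a single column comment
ALTER TABLE ai_gen_cat.ai_gen_db.sample_customer
ALTER COLUMN c_acctbal COMMENT 'Customer account balance in dollars.';
```

The edited comments land in information_schema immediately, so the prompt-building code below picks them up with no extra work.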
Examples:
Let's use the sample data provided by Databricks in the samples database. We will use the AI-generated comments shown earlier to build an assistant that answers questions based on two tables, customer and orders.
df = spark.sql('select * from samples.tpch.customer')
df2 = spark.sql('select * from samples.tpch.orders')
df.write.format("delta").saveAsTable('ai_gen_cat.ai_gen_db.sample_customer')
df2.write.format("delta").saveAsTable('ai_gen_cat.ai_gen_db.sample_orders')
After copying the data, let's create a function that retrieves column_name, comment, and data_type.
import json

def get_metadata(catalog, table_name):
    # Pull the AI-generated table-level comment from information_schema.tables
    table_meta = spark.sql(f"select * from {catalog}.information_schema.tables where table_name = '{table_name}'")
    table_meta_text = table_meta.select('comment').collect()[0]['comment']
    # Pull the per-column comments and data types from information_schema.columns
    column_meta = spark.sql(f"select * from {catalog}.information_schema.columns where table_name = '{table_name}'")
    column_meta_text_preprocess = column_meta.select('column_name', 'comment', 'data_type').collect()
    # Build a list of dictionaries, one per column
    data_list = []
    for row in column_meta_text_preprocess:
        data_list.append({
            'column_name': row.column_name,
            'comment': row.comment,
            'data_type': row.data_type
        })
    # Serialize the column metadata as JSON for the prompt
    column_meta_text = json.dumps(data_list, indent=4)
    return table_meta_text, column_meta_text
catalog = 'ai_gen_cat'
table_meta_text1, column_meta_text1 = get_metadata(catalog, 'sample_customer')
table_meta_text2, column_meta_text2 = get_metadata(catalog, 'sample_orders')
Now that we have the metadata of the individual tables, let’s combine them and feed them through our prompt:
table_meta_text = table_meta_text1 + '\n' + table_meta_text2
column_meta_text = column_meta_text1 + '\n' + column_meta_text2
SQL_TEMPLATE = f"""You are a Databricks SQL assistant. Your ONLY job is to write CORRECT SQL code that can be run in Databricks notebooks. Do not provide any characters in your answer other than the SQL code.
YOU DO NOT DO ANYTHING OTHER THAN WRITE SQL CODE.
CONTEXT TO USE TO CONSTRUCT SQL QUERY:
TABLE INFORMATION:
{table_meta_text}
COLUMN INFORMATION:
{column_meta_text}
YOU DO NOT DO ANYTHING OTHER THAN WRITE SQL CODE.
EXAMPLE:
USER: Give me the total count of all employees
CORRECT OUTPUT:
SELECT COUNT(*)
FROM my_table;
"""
Next, we’ll initialize our model. DBRX is the model of choice for the ChatCompletion call.
from databricks_genai_inference import ChatCompletion
# Only required when running this example outside of a Databricks Notebook
DATABRICKS_HOST="https://<workspace>.databricks.com"
DATABRICKS_TOKEN="<your_token>"
def get_response(SQL_TEMPLATE, query_text):
    completions_response = ChatCompletion.create(model="databricks-dbrx-instruct",
                                                 messages=[{"role": "system", "content": SQL_TEMPLATE},
                                                           {"role": "user", "content": query_text}],
                                                 max_tokens=50)
    # The SDK returns a response object; .message holds the generated text
    return completions_response.message
# query_text (defined just below) holds the user's natural-language question
sql_result = get_response(SQL_TEMPLATE, query_text)
cleaned_sql_query = sql_result.replace("\\", "")  # strip stray escape characters
df = spark.sql(cleaned_sql_query)
print(sql_result)
display(df)
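LLMs sometimes wrap their answer in markdown code fences even when instructed not to. A slightly more defensive cleaning step (a sketch of my own, not part of the original workflow) handles both the fences and the stray backslashes:

```python
def clean_llm_sql(raw: str) -> str:
    """Strip markdown code fences and stray backslashes from an LLM's SQL answer."""
    text = raw.strip()
    if text.startswith("```"):
        lines = text.splitlines()[1:]           # drop the opening ``` or ```sql line
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]                  # drop the closing fence
        text = "\n".join(lines)
    return text.replace("\\", "").strip()       # remove escape characters, as above
```

You could then use cleaned_sql_query = clean_llm_sql(sql_result) in place of the simple .replace() call.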
Here is our query:
query_text = 'How many orders did customer with name Customer#000412445 have?'
The printed SQL result is below:
SELECT COUNT(*)
FROM sample_orders
WHERE o_custkey = (SELECT c_custkey FROM sample_customer WHERE c_name = 'Customer#000412445');
To use an external model via a configured model serving endpoint, here’s some sample code:
import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")
completions_response = client.predict(
endpoint="openai-completions-endpoint",
inputs={
"prompt": SQL_TEMPLATE,
"temperature": 0.1,
"max_tokens": 100,
"n": 1
}
)
# Extract the generated SQL and execute it
sql_result = completions_response["choices"][0]["text"]
print(sql_result)
df = spark.sql(sql_result)
display(df)
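To complete the optional last step from the summary — turning the query result back into plain English — one approach is to send the question, the SQL, and the result through the model a second time. This is a sketch; the helper name and prompt wording are my own:

```python
def build_nl_messages(question: str, sql_query: str, result_text: str) -> list:
    """Assemble chat messages asking the model to phrase a SQL result in English."""
    system = ("You are a helpful assistant. Given a user question, the SQL that was run, "
              "and the query result, answer the question in one plain-English sentence.")
    user = f"Question: {question}\nSQL: {sql_query}\nResult:\n{result_text}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

# Hypothetical usage inside the notebook:
# from databricks_genai_inference import ChatCompletion
# messages = build_nl_messages(query_text, cleaned_sql_query, str(df.collect()))
# nl_answer = ChatCompletion.create(model="databricks-dbrx-instruct",
#                                   messages=messages, max_tokens=100)
# print(nl_answer.message)
```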
Conclusions
We've demonstrated how you can leverage the AI-assisted text generation feature within Databricks to enrich your table metadata. Coupling these AI-generated comments with a foundation model lets users build NL-to-SQL scripts within Databricks notebooks, safely governed by Unity Catalog.