Tips for Grounding Databricks Assistant Autocomplete
The future of Jupyter-style notebooks (and even script-based environments) very likely includes some sort of assistant to help speed up code writing, regardless of the underlying platform. We already see this with both Databricks and Microsoft Fabric. I like to think of it as trying to name a movie when you know the plot and some character names but keep blanking on the title. Translating that to the world of coding: maybe you want to use a Python package you don’t normally reach for, you’re blanking on the exact parameters, or there is some new code out there that optimizes your workloads and you want to roll it out across a set of notebooks. The assistant is there to nudge you (you were just going to Google it anyway).
This past week, Databricks released its Assistant Autocomplete. Taking a quote from the announcement, “Databricks Assistant Autocomplete automatically provides fast code suggestions as you type in SQL and Python. AI code completion uses context from current code cells, Unity Catalog metadata, DataFrame data, and more to generate highly relevant suggestions as you type… Databricks Assistant Autocomplete is a powerful tool, but can be enhanced by providing extra context. In order to get the most relevant and useful suggestions you’ll need to think about how to convey your intent properly. You can accomplish this through some of the following best practices:
- Provide comments for your functions and code
- Use clear and meaningful names for functions and variables
- Follow consistent and standard code style conventions”
The goal of this post is to explore the new assistant and demonstrate one way to ground it in new information, or information it might not be familiar with.
- Use a prompt table (stored as a delta table) to organize sample code. Not every assistant is going to know all the latest or custom code. Saving these code snippets can provide context for the assistant and reduce hallucinations.
- Keep reproducibility in mind when building your sample code database.
- Providing context in small, focused chunks appears to help the assistant, much like Copilot in Visual Studio.
Keep in mind that this assistant does not see your data, nor does it see the output of running a cell. This is why the three bullets above matter, along with understanding that the assistant views the metadata stored in Unity Catalog.
Example
I want to focus on the first bullet from the quoted text provided by Databricks: providing comments for your functions and code. Let’s consider liquid clustering as an example (which is now Generally Available). Adding a comment is essentially sending a prompt to the assistant. For example, we might use it like so:
#read in the table imdb_dataset
df = spark.read.table("main.default.imdb_dataset")
#enable liquid clustering on imdb_dataset using CTAS statement with df
However, with a zero-shot autocomplete attempt, we receive the following. Note that this is a fairly vague comment without a lot of context, but I’ll demonstrate how to improve the result.
#read in the table imdb_dataset
df = spark.read.table("main.default.imdb_dataset")
#enable liquid clustering on imdb_dataset using CTAS statement with df
spark.sql("CREATE OR REPLACE TEMPORARY VIEW imdb_dataset_clustered USING com.databricks.spark.sqldw.DeltaTable CLUSTER BY (title) AS SELECT * FROM main.default.imdb_dataset")
Clearly, the assistant is hallucinating. Instead, let’s add an additional layer for grounding by using the documentation code provided by Databricks themselves. We can create another notebook that contains relevant information and code snippets surrounding liquid clustering and store that inside a prompt template table for retrieval at any time. We can use the tag liquid clustering to help search for this information later. Here’s an example.
EXAMPLE = """
Here is an example of enabling liquid clustering in python:
# Create an empty table
(DeltaTable.create()
.tableName("table1")
.addColumn("col0", dataType = "INT")
.addColumn("col1", dataType = "STRING")
.clusterBy("col0")
.execute())
# Using a CTAS statement
df = spark.read.table("table1")
df.write.format("delta").clusterBy("col0").saveAsTable("table2")
# CTAS using DataFrameWriterV2
df = spark.read.table("table1")
df.writeTo("table1").using("delta").clusterBy("col0").create()
"""
df = spark.createDataFrame([(EXAMPLE, "liquid clustering")], ["prompt_template", "category"])
df.write.format("delta").mode('overwrite').saveAsTable("main.default.prompt_table")
display(df)
Here, I propose a strategy that keeps documentation and working notebooks disjoint: the code above lives in a separate documentation-only notebook used to store grounding information, while our working notebook, where we’re actually developing code, pulls relevant information in from the prompt table.
In our working notebook, we can search for this information as follows:
# define the keyword to search for
keyword = 'clustering'
# use spark sql and LIKE to find similar word
df = spark.sql(f"""
SELECT prompt_template
FROM main.default.prompt_table
WHERE category LIKE '%{keyword}%'
""")
print(df.collect()[0]['prompt_template'])
and the output is
Here is an example of enabling liquid clustering in python:
# Create an empty table
(DeltaTable.create()
.tableName("table1")
.addColumn("col0", dataType = "INT")
.addColumn("col1", dataType = "STRING")
.clusterBy("col0")
.execute())
# Using a CTAS statement
df = spark.read.table("table1")
df.write.format("delta").clusterBy("col0").saveAsTable("table2")
# CTAS using DataFrameWriterV2
df = spark.read.table("table1")
df.writeTo("table1").using("delta").clusterBy("col0").create()
Now, say I’m comfortable with the middle statement. Simply copy and paste this as a comment. I’ve found that when the comment section is too large, the assistant loses track of the relevant information.
# Here is an example of enabling liquid clustering in python:
# # Using a CTAS statement
# df = spark.read.table("table1")
# df.write.format("delta").clusterBy("col0").saveAsTable("table2")
I’ve found that using a comment or a variable to store the string works just as well.
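To keep that copy-and-paste step consistent, a small helper can turn any retrieved snippet into a comment block and cap its length. This is a hypothetical convenience function, not part of any Databricks API, and the `max_lines` default is an assumption you would tune to whatever the assistant handles well:

```python
def as_grounding_comment(snippet: str, max_lines: int = 8) -> str:
    """Prefix each line of a code snippet with '# ' so it can be pasted
    into a cell as grounding comments, keeping only the first max_lines
    lines so the comment block stays small enough to track."""
    lines = snippet.strip().splitlines()[:max_lines]
    return "\n".join("# " + line for line in lines)

snippet = """# Using a CTAS statement
df = spark.read.table("table1")
df.write.format("delta").clusterBy("col0").saveAsTable("table2")"""

print(as_grounding_comment(snippet))
```

The printed result is exactly the comment block pasted above, ready to drop at the top of a cell.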
Let’s pause for a moment and recount the example so far.
- The assistant is not helpful with a simple prompt like #enable liquid clustering on imdb_dataset using CTAS statement with df. Instead, we need to provide more context.
- Why not just Google the answer? This comes down to reusability. If we copy and paste the sample code from above as a comment into one notebook, the assistant will not be grounded in another notebook. Additionally, any time you restart the cluster, you will need to ground the assistant in this context again, so that specific cell will need to be rerun to provide the context. Instead, if we store this information in a prompt template table, we can just read the table and copy and paste exactly what is needed.
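The retrieval step itself is just a keyword match over (category, snippet) pairs. Here is a minimal plain-Python sketch of that pattern, with a dict standing in for the main.default.prompt_table Delta table (the categories and snippets are illustrative, not a real catalog):

```python
# Plain-Python stand-in for the prompt table; in Databricks these rows
# would live in the Delta table main.default.prompt_table.
prompt_table = {
    "liquid clustering": 'df.write.format("delta").clusterBy("col0").saveAsTable("table2")',
    "optimize": "OPTIMIZE main.default.imdb_dataset",
}

def find_prompts(keyword: str) -> list[str]:
    """Return every stored snippet whose category contains the keyword,
    mirroring the SQL LIKE '%keyword%' filter (case-insensitive here)."""
    return [
        snippet
        for category, snippet in prompt_table.items()
        if keyword.lower() in category.lower()
    ]

print(find_prompts("clustering"))
```

In practice the Spark SQL query shown earlier does this against the Delta table, which is what makes the snippets reusable across notebooks and cluster restarts.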
Now, putting in this comment
# Here is an example of enabling liquid clustering in python:
# # Using a CTAS statement
# df = spark.read.table("table1")
# df.write.format("delta").clusterBy("col0").saveAsTable("table2")
#########################################################
#read in the imdb table
df = spark.read.table("main.default.imdb_dataset")
and then typing the same comment as before (this time with a target table name appended), we get the correct recommendation:
#enable liquid clustering on imdb_dataset using CTAS statement with df - save to table main.default.imdb_dataset_LC
df.write.format("delta").clusterBy("label").saveAsTable("main.default.imdb_dataset_LC")
Future Development
I think there is a ton of potential when it comes to these types of assistants. The difficult part will be keeping them up to date with the latest information, which is essential for companies like Databricks that rapidly develop new features optimized for industry use cases. With that said, here are some ideas:
- Environment/local prompts — these would be prompts like providing documentation attached to a specific cluster to help with grounding the assistant in company-specific information or maybe just sample code that is relevant to the overall task of the folder you’re working in, cluster you’re running on, etc.
- Move outside of in-cell comments. For example, uploading documentation about liquid clustering through the UI while working in a notebook, to give the assistant additional information, would be transformative. This is essentially recreating a GPT-4o-style experience inside the notebook.
Conclusion
In this post, I covered some of the limitations of the new assistant and a strategy that can be used to ground the underlying LLM when performing a task the assistant might not be familiar with.