Generating a single CSV file in Unity Catalog
A common ask from non-technical professionals is: can you please send me a report that shows X? A common format for ad-hoc reporting is CSV, since it's quick and easy. Usually, this request comes from industry professionals who have been receiving the same report for years and are very comfortable with their daily or weekly updates from data analysts. Fulfilling it is understandable yet tedious when file names are automatically generated as part of the partitioning process in Databricks. This post resolves that issue without leaving Databricks, while staying within the Unity Catalog data governance layer.
For example, the following code generates four files, all with names that will be completely mysterious to non-Spark folks. That means you can't simply download the partitioned CSV files and email them out, especially if you have thousands of rows.
import random

# Function to generate a random name
def generate_name():
    names = ['John', 'Emma', 'Michael', 'Sophia', 'William', 'Olivia', 'James', 'Ava', 'Alexander', 'Mia']
    return random.choice(names)

# Generate random data
random_data = []
for _ in range(100):
    id = random.randint(1, 1000)
    name = generate_name()
    salary = random.randint(1000, 5000)
    random_data.append((id, name, salary))

# Build the DataFrame, save it as a Delta table, and write it out as CSV
schema = ['id', 'name', 'salary']
df = spark.createDataFrame(random_data, schema)
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(f"{my_catalog}.{my_schema}.{my_table}")
df.write.csv(f'/Volumes/{my_catalog}/{my_schema}/{my_volume}/{dev_folder}', header=True, mode="overwrite")
Going over to the file location in Catalog will reveal something like this:
Notice that Databricks has handled the partitioning and chose four partitions for this particular DataFrame. You can download each of these files and see how the data was split.
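If you want to confirm this from a notebook rather than the Catalog UI, here is a quick sketch (assuming the same df and path variables defined above) that prints the partition count and the auto-generated file names:
# Sketch: inspect how many partitions Spark used and which files it wrote.
# Assumes df, my_catalog, my_schema, my_volume and dev_folder from the snippet above.
print(df.rdd.getNumPartitions())  # e.g. 4 for this DataFrame

for f in dbutils.fs.ls(f'/Volumes/{my_catalog}/{my_schema}/{my_volume}/{dev_folder}'):
    print(f.name)  # part-00000-...csv, part-00001-...csv, _SUCCESS, ...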
Singular Partitioning
Instead of letting Spark choose the (optimal) number of partitions, we can force a single partition. This is the first step toward generating our single CSV file.
df.coalesce(1).write.mode("overwrite").csv(f'/Volumes/{my_catalog}/{my_schema}/{my_volume}/{dev_folder}', header=True)
And the result:
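You can verify this from a notebook as well. This short sketch (again assuming the variables above) should list exactly one part-*.csv file alongside Spark's bookkeeping files:
# Sketch: confirm that only one CSV part file was produced.
files = dbutils.fs.ls(f'/Volumes/{my_catalog}/{my_schema}/{my_volume}/{dev_folder}')
print([f.name for f in files if f.name.endswith('.csv')])  # expect a single part-...csv entry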
Putting it all together
Great! Now it would be nice to grab that CSV file, rename it, and place it in its own designated folder for delivery, all without leaving the Databricks environment. We can do that using dbutils. Below is a single function, plus an example call, that produces one CSV file with a custom name. The best part is that everything stays within Unity Catalog, so we can be as dynamic as we like about permissions. Essentially, we grab the name of the single part file seen in the previous image (the one ending in .csv) and copy it to a new folder under a new name, which effectively renames it. In the code sample below, the final CSV lands in {prod_folder}. It's a pretty simple solution to a tedious task!
def one_csv_file(delta_path, vol_path_dev, vol_path_copy):
    # Read the Delta table into a PySpark DataFrame
    df = spark.read.format("delta").table(delta_path)
    # Coalesce to a single partition (one CSV file)
    df.coalesce(1).write.mode("overwrite").csv(vol_path_dev, header=True)
    # List the files that were just written
    file_list = dbutils.fs.ls(vol_path_dev)
    # Grab the name of the CSV part file
    for filename in file_list:
        if filename.name.endswith('csv'):
            name = filename.name
    # Copy the file to its new location and name, with permissions controlled by UC
    dbutils.fs.cp(vol_path_dev + name, vol_path_copy)
delta_path = f"{my_catalog}.{my_schema}.{my_table}"
vol_path_dev = f'/Volumes/{my_catalog}/{my_schema}/{my_volume}/{dev_folder}/'
vol_path_copy = f'/Volumes/{my_catalog}/{my_schema}/{my_volume}/{prod_folder}/' + 'test_file.csv'
one_csv_file(delta_path, vol_path_dev, vol_path_copy)
Going to the location for your target folder will reveal your single CSV. Congrats!
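Optionally, and this is a workflow assumption rather than part of the original approach, you can clean up the scratch folder once the copy succeeds, since the part-*.csv and _SUCCESS files in {dev_folder} are no longer needed:
# Optional cleanup sketch (assumes the dev folder is only used as scratch space)
dbutils.fs.rm(vol_path_dev, True)  # recurse=True removes the temporary partition output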
Conclusion
In this post, we’ve shown a method for generating a single CSV file within Unity Catalog without the tedious task of renaming partitioned files.
All credit goes to WafaStudio on YouTube (an extremely informative channel for all things PySpark) for the overall idea (see References). The only difference is that this code is implemented within the data governance layer provided by Databricks (Unity Catalog).