Welcome to the third part of our Terraforming Databricks series, where we dive into setting up automated workflows in Databricks. In this part, we’ll create a notebook, configure it to execute a Python script, and use it as the backbone of a Databricks job. This approach will streamline data processing and help manage complex workflows efficiently, leveraging the power of Terraform and Databricks together.

Setting Up the Notebook

The first step in creating a job is defining a notebook. In Databricks, notebooks are a foundational tool for running scripts and workflows. To place our notebook in the appropriate directory, we’ll retrieve the home directory of the current user using the databricks_current_user data source:

data "databricks_current_user" "current" {}

Once we have the home directory, we can define the notebook resource. The databricks_notebook resource allows us to specify the path, source file, and programming language. Here’s how we set it up:

resource "databricks_notebook" "main" {
  provider = databricks
  source   = "${path.module}/scripts/Foo.py"
  path     = "${data.databricks_current_user.current.home}/Foo"
  language = "PYTHON"
}

The source attribute points to the Python script on your local file system. The notebook is then placed in the user’s home directory under the specified path. This setup ensures the script is readily available for use in the Databricks environment.
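If you'd like quick access to the uploaded notebook, you can also expose its workspace URL as a Terraform output. This is an optional convenience, a minimal sketch using the url attribute exported by the databricks_notebook resource (the output name is our own choice):

output "foo_notebook_url" {
  description = "Workspace URL of the uploaded Foo notebook"
  value       = databricks_notebook.main.url
}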

Configuring the Databricks Job

With the notebook in place, we can define a job that uses this notebook to perform tasks. The databricks_job resource specifies the job’s name, description, and tasks. Each task is configured to use an existing cluster and execute the notebook with appropriate parameters.

resource "databricks_job" "foo" {

  name        = "Foo_Job"
  description = "This is a job to do foo."

  task {
    task_key = "Foo_task"

    existing_cluster_id = var.cluster_id

    notebook_task {
      notebook_path = databricks_notebook.main.path
      base_parameters = {
        "input_table"  = var.input_table
        "output_table" = var.output_table
      }
    }
  }
}

Here, the notebook_task block links the job to the notebook, passing parameters such as input_table and output_table. These parameters provide flexibility in managing input and output locations, which can be critical in scenarios like disaster recovery.
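The job above references three input variables: var.cluster_id, var.input_table, and var.output_table. A minimal variables.tf sketch covering those references might look like the following (the descriptions are ours; adjust types and defaults to your environment):

variable "cluster_id" {
  description = "ID of the existing cluster the job runs on"
  type        = string
}

variable "input_table" {
  description = "Fully qualified name of the table the job reads from"
  type        = string
}

variable "output_table" {
  description = "Fully qualified name of the table the job writes to"
  type        = string
}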

Writing the Python Script

The Python script referenced in the notebook contains the logic for processing data. It uses PySpark to perform transformations, calculate updates, and write the results to the output table:

from pyspark.sql.functions import col, date_add, lit, max

# Read the table names from the job's base_parameters, exposed to the notebook as widgets
input_table = dbutils.widgets.get("input_table")
output_table = dbutils.widgets.get("output_table")

# Label the rows we write; "Foo_Job" matches the job name defined in Terraform
job_type = "Foo_Job"

df = spark.table(input_table)

# Get the distinct items by id with the latest updated_on timestamp
df_latest = df.groupBy("id").agg(max("updated_on").alias("latest_update"))

# Join back with the original table to get the full record with the latest update
df_distinct_latest = df.join(
    df_latest,
    (df.id == df_latest.id) & (df.updated_on == df_latest.latest_update),
).select(df["*"])

# Increase the price by 10 for each record, advance the updated_on date, and record who made the change
df_new = (
    df_distinct_latest
    .withColumn("price", col("price") + 10)
    .withColumn("updated_on", date_add(col("updated_on"), 1))
    .withColumn("updated_by", lit(job_type))
)

# Append the updated data as new rows to the output table
df_new.write.mode("append").saveAsTable(output_table)

This script processes the data by:

  1. Extracting the latest updates for each record.
  2. Modifying values (e.g., increasing the price and updating timestamps).
  3. Appending the updated data as new rows to the output table.

Notice that the script reads two parameters supplied by the job:

  • Input table
  • Output table

By passing input_table and output_table as parameters, the script supports dynamic configurations, making it adaptable to various scenarios.

Leveraging Parameters for Flexibility

Separating input and output tables via parameters enables flexibility in handling different use cases. For example, in a business continuity and disaster recovery (BCDR) scenario, you may only have read-only access to the primary data source. By updating the parameters, you can redirect writes to an alternate location without altering the core logic.
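One way to manage this with Terraform is to keep a set of table names per environment in .tfvars files and switch between them at apply time with terraform apply -var-file=bcdr.tfvars. The file names and table names below are hypothetical, purely to illustrate the idea:

# prod.tfvars
input_table  = "main.sales.items"
output_table = "main.sales.items"

# bcdr.tfvars -- primary catalog is read-only, so writes are redirected to a replica catalog
input_table  = "main.sales.items"
output_table = "bcdr.sales.items"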

Conclusion

In this part of the series, we demonstrated how to create a Databricks notebook, link it to a job, and execute Python-based workflows. By integrating notebooks and jobs, Terraform provides a seamless way to automate data processing in Databricks. The flexibility of parameterized scripts ensures your workflows can adapt to changing requirements, such as BCDR needs. Stay tuned for the final part, where we’ll explore advanced job orchestration and monitoring techniques to take your Databricks automation to the next level.