This is the second part of our series on setting up Databricks on Azure using Terraform. In the previous article, we laid the foundation by creating the Azure resources required for a Databricks environment. Now, we move forward to configure the Databricks provider, set up Unity Catalog, and provision a Databricks cluster, diving deeper into the nuances of automating Databricks with Terraform.

Setting Up the Terraform Databricks Provider

The first step is configuring the Terraform provider for Databricks. Unlike many providers, which are published under the HashiCorp organization, Databricks maintains its own, so its source must be set explicitly in the required_providers block:

terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = "1.58.0"
    }
  }
  required_version = ">= 1.9.0"
}

Azure Databricks uses an account to manage features and settings that span all of your workspaces, which leads to a two-step provider configuration. First, establish connectivity with the Databricks Accounts API, a multi-tenant endpoint shared by all Azure Databricks customers:

provider "databricks" {
  alias      = "accounts"
  host       = "https://accounts.azuredatabricks.net"
  account_id = var.databricks_account_id
}

Next, configure the provider for the specific Databricks workspace. If you’ve already provisioned the workspace using a separate Terraform root module, use a data source to retrieve its Workspace URL:

data "azurerm_databricks_access_connector" "main" {
  name                = "adbw-${var.application_name}-${var.environment_name}-${var.location}"
  resource_group_name = "rg-${var.application_name}-${var.environment_name}-${var.location}"
}

With the data source in place, we can reference the Workspace URL it exposes:

provider "databricks" {
  host  = data.azurerm_databricks_workspace.main.workspace_url
}

This two-layered provider configuration makes both account-level and workspace-level functionality available.
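
The provider blocks above assume a handful of input variables. As a point of reference, a minimal sketch of matching variable declarations might look like the following (the variable names come from the configuration above; the types and descriptions are assumptions):

variable "databricks_account_id" {
  description = "Databricks account ID, found in the account console"
  type        = string
}

variable "application_name" {
  description = "Short application name used in resource naming"
  type        = string
}

variable "environment_name" {
  description = "Environment identifier, such as dev or prod"
  type        = string
}

variable "location" {
  description = "Azure region used for resource naming and placement"
  type        = string
}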

Setting Up Unity Catalog

Unity Catalog is a critical component for managing metadata and governance in Databricks. To enable it, we begin by creating a metastore:

resource "databricks_metastore" "main" {
  name                                              = "metastore-${var.location}"
  force_destroy                                     = true
  storage_root                                      = "abfss://${var.container_name}@${var.storage_account_name}.dfs.core.windows.net/"
  region                                            = var.location
  delta_sharing_scope                               = "INTERNAL"
  delta_sharing_recipient_token_lifetime_in_seconds = var.delta_sharing_token_expiry
}

This resource points the metastore's storage root at the Azure Data Lake Storage Gen2 container created in the previous part. Next, we use a data source to reference the access connector whose managed identity will be granted access to that storage:

data "azurerm_databricks_access_connector" "main" {
  name                = "adbc-${var.application_name}-${var.environment_name}-${var.location}"
  resource_group_name = "rg-${var.application_name}-${var.environment_name}-${var.location}"
}

Then we reference that data source when creating the data access resource that links the connector's managed identity to our metastore:

resource "databricks_metastore_data_access" "main" {
  metastore_id = databricks_metastore.main.id
  name         = "${var.application_name}-${var.environment_name}-${var.location}-connector"
  is_default   = true
  azure_managed_identity {
    access_connector_id = data.azurerm_databricks_access_connector.main.id
  }
}

Finally, assign the metastore to the Databricks workspace:

resource "databricks_metastore_assignment" "main" {
  workspace_id = data.azurerm_databricks_workspace.main.workspace_id
  metastore_id = databricks_metastore.main.id
}

Configuring Governance and Data Structures

Once the metastore is configured, set up governance by granting account users access:

resource "databricks_grants" "grant_all_users" {
  metastore = databricks_metastore.main.id

  grant {
    principal  = "account users"
    privileges = ["CREATE_CATALOG", "CREATE_PROVIDER", "CREATE_RECIPIENT", "CREATE_SHARE", "USE_SHARE", "USE_PROVIDER", "USE_RECIPIENT"]
  }
}

From there, create catalogs, schemas, and tables to define the data structure hierarchy:

resource "databricks_catalog" "main" {
  metastore_id = databricks_metastore.main.id
  name         = var.catalog_name
  comment      = "this catalog is managed by terraform"
}

Assign permissions on the catalog using databricks_grants:

resource "databricks_grants" "catalog_grants" {
  catalog = databricks_catalog.main.name
  grant {
    principal  = "account users"
    privileges = ["USE_CATALOG", "ALL_PRIVILEGES"]
  }
}

Now we can create a schema:

resource "databricks_schema" "main" {
  catalog_name = databricks_catalog.main.id
  name         = var.schema_name
  comment      = "this database is managed by terraform"
}

And grant the corresponding access permissions:

resource "databricks_grants" "schema_grants" {
  schema = databricks_schema.main.id
  grant {
    principal  = "account users"
    privileges = ["USE_SCHEMA", "ALL_PRIVILEGES"]
  }
}

Metastores contain catalogs, catalogs contain schemas, and schemas contain tables, so guess what’s coming next? A table:

resource "databricks_sql_table" "main_table" {
  name               = var.table_name
  catalog_name       = databricks_catalog.main.name
  schema_name        = databricks_schema.main.name
  table_type         = "MANAGED"
  data_source_format = "DELTA"

  column {
    name = "id"
    type = "int"
  }
  column {
    name = "name"
    type = "string"
  }
  column {
    name = "price"
    type = "int"
  }
  column {
    name = "updated_on"
    type = "date"
  }
  column {
    name = "updated_by"
    type = "string"
  }
  comment = "this table is managed by terraform"
}

Provisioning a Databricks Cluster

The final step is provisioning a Databricks cluster to allocate compute capacity. Start by retrieving the latest long-term support version of Spark:

data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}
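
The VM SKU for the worker nodes is passed in through var.cluster_vm_sku_type. As an alternative, the Databricks provider also ships a databricks_node_type data source that can pick a node type based on criteria you specify; a minimal sketch, assuming the smallest node type with local disks is acceptable, looks like this:

data "databricks_node_type" "smallest" {
  # Select the smallest node type that has local disks attached
  local_disk = true
}

Its id attribute could then be used for node_type_id in place of the variable.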

Then, create the cluster:

resource "databricks_cluster" "main" {
  cluster_name            = "${var.application_name}-${var.environment_name}-${var.location}"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = var.cluster_vm_sku_type
  autotermination_minutes = var.auto_termination_minutes
  data_security_mode      = "USER_ISOLATION"

  autoscale {
    min_workers = 1
    max_workers = 2
  }

  enable_local_disk_encryption = true
  runtime_engine               = "PHOTON"
}

This cluster is ready to execute jobs with user isolation and local disk encryption enabled.
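
Since the next part of the series will build a job on top of this cluster, it can be useful to expose a couple of values from this root module. A small sketch of such outputs (the output names here are illustrative, not part of the original configuration):

output "databricks_workspace_url" {
  description = "URL of the Databricks workspace"
  value       = data.azurerm_databricks_workspace.main.workspace_url
}

output "databricks_cluster_id" {
  description = "ID of the provisioned Databricks cluster"
  value       = databricks_cluster.main.id
}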

Conclusion

In this part, we explored configuring the Databricks provider, setting up Unity Catalog, and provisioning clusters. The process highlights Terraform’s power in simplifying complex resource setups while maintaining governance and modularity. With Databricks now operational, we can look forward to further enhancing this environment in the next part of the series. Stay tuned as we take on the process of creating our first job!