Terraforming Azure Databricks Part 2: Unlocking Unity Catalog and Provisioning Clusters
This is the second part of our series on setting up Databricks on Azure using Terraform. In the previous article, we laid the foundation by creating the Azure resources required for a Databricks environment. Now, we move forward to configure the Databricks provider, set up Unity Catalog, and provision a Databricks cluster, diving deeper into the nuances of automating Databricks with Terraform.
Setting Up the Terraform Databricks Provider
The first step is configuring the Terraform provider for Databricks. Unlike the many providers published under the “HashiCorp” organization, Databricks maintains its own provider, so it needs an explicit source in the required_providers block:
terraform {
required_providers {
databricks = {
source = "databricks/databricks"
version = "1.58.0"
}
}
required_version = ">= 1.9.0"
}
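Later steps also read Azure resources through azurerm data sources, so if this root module doesn’t already declare that provider, you’ll want it alongside databricks. A minimal sketch, shown as a variant of the block above (the version pin is an assumption; match whatever your workspace module uses):

terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = "1.58.0"
    }
    # Backs the azurerm data sources used later in this module.
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0" # assumed pin
    }
  }
  required_version = ">= 1.9.0"
}

# The azurerm provider needs its (empty) features block declared once per module.
provider "azurerm" {
  features {}
}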
Azure Databricks relies on a Databricks account to manage features and governance across all of your workspaces, which introduces a two-step provider configuration. First, establish connectivity with the Databricks Accounts API, a multi-tenant endpoint shared by all Azure Databricks customers:
provider "databricks" {
alias = "accounts"
host = "https://accounts.azuredatabricks.net"
account_id = var.databricks_account_id
}
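How this provider authenticates is up to you: when you’re logged in with the Azure CLI it can pick up those credentials automatically, or you can pass a service principal explicitly. A hedged variant of the block above (the variable names are assumptions; omit these arguments to fall back to Azure CLI credentials):

provider "databricks" {
  alias      = "accounts"
  host       = "https://accounts.azuredatabricks.net"
  account_id = var.databricks_account_id

  # Assumed service principal credentials for non-interactive runs.
  azure_client_id     = var.client_id
  azure_client_secret = var.client_secret
  azure_tenant_id     = var.tenant_id
}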
Next, configure the provider for the specific Databricks workspace. If you’ve already provisioned the workspace using a separate Terraform root module, use a data source to retrieve its Workspace URL:
data "azurerm_databricks_access_connector" "main" {
name = "adbw-${var.application_name}-${var.environment_name}-${var.location}"
resource_group_name = "rg-${var.application_name}-${var.environment_name}-${var.location}"
}
With that data source in place, we can reference the workspace URL when configuring the workspace-level provider:
provider "databricks" {
host = data.azurerm_databricks_workspace.main.workspace_url
}
This two-layered provider configuration ensures both account-level and workspace-specific functionality is available.
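One practical implication: account-level resources must reference the aliased provider explicitly via the provider meta-argument, while workspace-level resources use the default. As an illustration only (this group is hypothetical, not part of this build):

# Hypothetical account-level group, created through the accounts-aliased provider.
resource "databricks_group" "data_engineers" {
  provider     = databricks.accounts
  display_name = "data-engineers"
}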
Setting Up Unity Catalog
Unity Catalog is a critical component for managing metadata and governance in Databricks. To enable it, we begin by creating a metastore:
resource "databricks_metastore" "main" {
name = "metastore-${var.location}"
force_destroy = true
storage_root = "abfss://${var.container_name}@${var.storage_account_name}.dfs.core.windows.net/"
region = var.location
delta_sharing_scope = "INTERNAL"
delta_sharing_recipient_token_lifetime_in_seconds = var.delta_sharing_token_expiry
}
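The metastore block leans on a few input variables; if they aren’t already declared elsewhere in your root module, a minimal sketch (the types and the default token lifetime are assumptions):

variable "container_name" {
  type        = string
  description = "Storage container backing the Unity Catalog metastore"
}

variable "storage_account_name" {
  type        = string
  description = "ADLS Gen2 storage account backing the metastore"
}

variable "delta_sharing_token_expiry" {
  type        = number
  description = "Delta Sharing recipient token lifetime, in seconds"
  default     = 86400 # assumed: 24 hours
}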
This resource points the metastore’s storage root at an Azure Data Lake Storage Gen2 container. We use a data source to reference the access connector whose managed identity will be granted the necessary permissions on that storage:
data "azurerm_databricks_access_connector" "main" {
name = "adbc-${var.application_name}-${var.environment_name}-${var.location}"
resource_group_name = "rg-${var.application_name}-${var.environment_name}-${var.location}"
}
Then we can reference that data source when creating a databricks_metastore_data_access resource, which links the connector’s managed identity to our metastore.
resource "databricks_metastore_data_access" "main" {
metastore_id = databricks_metastore.main.id
name = "${var.application_name}-${var.environment_name}-${var.location}-connector"
is_default = true
azure_managed_identity {
access_connector_id = data.azurerm_databricks_access_connector.main.id
}
}
Finally, assign the metastore to the Databricks workspace:
resource "databricks_metastore_assignment" "main" {
workspace_id = data.azurerm_databricks_workspace.main.workspace_id
metastore_id = databricks_metastore.main.id
}
Configuring Governance and Data Structures
Once the metastore is configured, set up governance by granting account users access:
resource "databricks_grants" "grant_all_users" {
metastore = databricks_metastore.main.id
grant {
principal = "account users"
privileges = ["CREATE_CATALOG", "CREATE_PROVIDER", "CREATE_RECIPIENT", "CREATE_SHARE", "USE_SHARE", "USE_PROVIDER", "USE_RECIPIENT"]
}
}
From there, create catalogs, schemas, and tables to define the data structure hierarchy:
resource "databricks_catalog" "main" {
metastore_id = databricks_metastore.main.id
name = var.catalog_name
comment = "this catalog is managed by terraform"
}
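Because the catalog can only be created once the workspace is attached to the metastore, you may want an explicit depends_on so Terraform’s ordering stays deterministic; a hedged variant of the block above:

# Variant of the catalog above with an explicit dependency on the
# metastore assignment, so the workspace is attached before creation.
resource "databricks_catalog" "main" {
  metastore_id = databricks_metastore.main.id
  name         = var.catalog_name
  comment      = "this catalog is managed by terraform"

  depends_on = [databricks_metastore_assignment.main]
}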
Assign permissions on the catalog with another databricks_grants resource.
resource "databricks_grants" "catalog_grants" {
catalog = databricks_catalog.main.name
grant {
principal = "account users"
privileges = ["USE_CATALOG", "ALL_PRIVILEGES"]
}
}
Now we can create a schema:
resource "databricks_schema" "main" {
catalog_name = databricks_catalog.main.id
name = var.schema_name
comment = "this database is managed by terraform"
}
And corresponding access permissions.
resource "databricks_grants" "schema_grants" {
schema = databricks_schema.main.id
grant {
principal = "account users"
privileges = ["USE_SCHEMA", "ALL_PRIVILEGES"]
}
}
Metastores contain catalogs, catalogs contain schemas, and schemas contain tables, so guess what’s coming next?
resource "databricks_sql_table" "main_table" {
name = var.table_name
catalog_name = databricks_catalog.main.name
schema_name = databricks_schema.main.name
table_type = "MANAGED"
data_source_format = "DELTA"
column {
name = "id"
type = "int"
}
column {
name = "name"
type = "string"
}
column {
name = "price"
type = "int"
}
column {
name = "updated_on"
type = "date"
}
column {
name = "updated_by"
type = "string"
}
comment = "this table is managed by terraform"
}
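Following the same pattern, grants can be scoped all the way down to the table; a sketch (the SELECT and MODIFY privileges here are illustrative):

resource "databricks_grants" "table_grants" {
  # Fully qualified name: catalog.schema.table
  table = "${databricks_catalog.main.name}.${databricks_schema.main.name}.${databricks_sql_table.main_table.name}"
  grant {
    principal  = "account users"
    privileges = ["SELECT", "MODIFY"]
  }
}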
Provisioning a Databricks Cluster
The final step is provisioning a Databricks cluster to allocate compute capacity. Start by retrieving the latest long-term support (LTS) Databricks Runtime version:
data "databricks_spark_version" "latest_lts" {
long_term_support = true
}
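If you’d rather not hardcode a VM SKU in var.cluster_vm_sku_type, the provider also offers a databricks_node_type data source that picks the smallest node type matching your criteria; a sketch:

# Picks the smallest available node type with a local disk;
# reference it as data.databricks_node_type.smallest.id in the cluster below.
data "databricks_node_type" "smallest" {
  local_disk = true
}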
Then, create the cluster:
resource "databricks_cluster" "main" {
cluster_name = "${var.application_name}-${var.environment_name}-${var.location}"
spark_version = data.databricks_spark_version.latest_lts.id
node_type_id = var.cluster_vm_sku_type
autotermination_minutes = var.auto_termination_minutes
data_security_mode = "USER_ISOLATION"
autoscale {
min_workers = 1
max_workers = 2
}
enable_local_disk_encryption = true
runtime_engine = "PHOTON"
}
This cluster is ready to execute jobs with user isolation, local disk encryption, and the Photon runtime engine enabled.
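If other modules or a CI pipeline need to reach this environment, it can be handy to surface a couple of outputs; a minimal sketch:

output "databricks_workspace_url" {
  value = data.azurerm_databricks_workspace.main.workspace_url
}

output "databricks_cluster_id" {
  value = databricks_cluster.main.id
}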
Conclusion
In this part, we explored configuring the Databricks provider, setting up Unity Catalog, and provisioning clusters. The process highlights Terraform’s power in simplifying complex resource setups while maintaining governance and modularity. With Databricks now operational, we can look forward to further enhancing this environment in the next part of the series. Stay tuned as we take on the process of creating our first job!