This is Part Three of a four-part series about mashing up the Azure and Grafana Terraform providers. In Part One, I introduced the genesis of this topic (how I ended up presenting it at HashiDays 2024) and covered the technical bits of setting up an Azure Managed Grafana instance using the azurerm provider. In Part Two, I demonstrated how to connect the Terraform on the Grafana side to the Terraform on the Azure side from Part One, using the amg Azure CLI extension to create a Service Account and an authentication token that grant the grafana provider access to talk to our Azure Managed Grafana instance.
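As a quick recap, here is a minimal sketch of the provider wiring that leaves us with, assuming you feed the instance endpoint and the Service Account token from Part Two in as input variables (the variable names here are my own):

terraform {
  required_providers {
    grafana = {
      source = "grafana/grafana"
    }
  }
}

# Endpoint of the Azure Managed Grafana instance from Part One.
variable "grafana_url" {
  type = string
}

# Service Account token created with the amg CLI extension in Part Two.
variable "grafana_auth" {
  type      = string
  sensitive = true
}

provider "grafana" {
  url  = var.grafana_url
  auth = var.grafana_auth
}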

As I mentioned before, the Grafana Terraform provider is absolutely fantastic. It has so many resources. I have definitely not explored all of its nooks and crannies, but the area of the provider I found most useful was the “Grafana OSS” section. This is where you can find things like grafana_dashboard, grafana_folder, grafana_service_account and the like.

The first thing you should do to test that you have properly set up your provider is provision something rather innocuous (something without many moving parts) just to kick the tires. A good resource for that is grafana_folder. It has no dependencies, it is drop-dead simple, and best of all: it is quickly verifiable! Just refresh the Grafana UI after Terraform Apply and you will see your new folder!

resource "grafana_folder" "day2ops" {
  title = "Day 2 Ops"
}

Besides, folders are great for keeping things organized. The dashboards you provision will go into one of your folders, so it’s an easy way to do something useful (not throwaway) while also testing out connectivity through the Grafana provider.

Now that we are able to talk to Grafana, we need to commence with the dashboard-making business, right?! Right. Well, provisioning and managing Grafana Dashboards is a bit different from provisioning and managing traditional infrastructure. It’s all just configuration at the end of the day, but the structure of the configuration is a bit different.

Grafana dashboards are layered, almost like a set of those Matryoshka dolls. You know the ones? Where the smaller doll goes inside the medium-sized doll, which goes inside the large doll, and so on? Yeah, those things.

[Image: a set of Matryoshka nesting dolls]

Well, the Dashboard is like the big momma Matryoshka doll. Inside it are one or more Panels, each Panel contains one or more Queries, and Queries reference Data Sources.

[Image: the layers of a Grafana dashboard: Dashboard → Panels → Queries → Data Sources]

The reason I even started looking into the Grafana provider for Terraform is that we needed to maintain some dashboards for a Day 2 Ops solution that monitored the health of some Virtual Machines. We had to stamp out different environments, and while we didn’t want to hand-maintain a separate dashboard (or Grafana Managed instance) for each one, we did end up having to maintain one Grafana dashboard for “dev” and another for “prod”. So, any time we produced a new Grafana dashboard in “dev”, somebody would go export it, copy-pasta that thing into “prod”, and then make a bunch of manual tweaks.

The stuff they were manually tweaking was a ginormous JSON file. You see, when you export a Grafana Dashboard, that’s how it comes out. When you are in the Grafana Dashboard Editor, every little click you make spits out more and more JSON configuration, so these things can get a bit hairy. The high-level structure, however, is not so bad.

{
  "annotations": {},
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": 267,
  "links": [],
  "liveNow": false,
  "panels": [ ***YOUR_PANELS_GO_HERE*** ],
  "refresh": "",
  "schemaVersion": 38,
  "style": "dark",
  "tags": [],
  "templating": {},
  "time": {
    "from": "now-1h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "utc",
  "title": "Node Infra Metrics",
  "uid": "uXZc2eBVk",
  "version": 1,
  "weekStart": ""
}

As you can see, there are quite a few bells and whistles even within the root document of the Dashboard’s JSON. I’ve cut out most of the sub-elements to make the structure clearer and easier to grok. The panels element is where most of the guts of a Dashboard go. Note that it is an array, which allows you to supply one or more Panel JSON objects. Below is an example of one that I have used:

{
    "panels": [
      {
        "collapsed": false,
        "gridPos": {
          "h": 1,
          "w": 24,
          "x": 0,
          "y": 0
        },
        "id": 37,
        "panels": [],
        "title": "Summary",
        "type": "row"
      },
      {
        "datasource": {
          "type": "grafana-azure-data-explorer-datasource",
          "uid": "$${stamp_adx}"
        },
        "gridPos": {
          "h": 12,
          "w": 24,
          "x": 0,
          "y": 1
        },
        "id": 39,
        "pluginVersion": "9.2.7.1",
        "targets": [
          {
            "database": "${stamp_database_name}",
            "datasource": {
              "type": "grafana-azure-data-explorer-datasource",
              "uid": "$${stamp_adx}"
            },
            "expression": {
              "from": {
                "property": {
                  "name": "Cassandra_Health_Signal_Features",
                  "type": "string"
                },
                "type": "property"
              },
              "groupBy": {
                "expressions": [],
                "type": "and"
              },
              "reduce": {
                "expressions": [],
                "type": "and"
              },
              "where": {
                "expressions": [],
                "type": "and"
              }
            },
            "pluginVersion": "4.1.6",
            "query": "${query}",
            "querySource": "raw",
            "rawMode": true,
            "refId": "A",
            "resultFormat": "time_series"
          }
        ],
        "title": "Aggregate IO Rates",
        "type": "timeseries"
      }
    ]
}

You’ll notice that there are a couple of embedded symbols. In Grafana, symbols begin with $. Coincidentally, the same is true in HashiCorp Configuration Language’s template syntax. If you’ve ever used the templatefile() function, you’ll know that when you want Terraform to swap out symbol foo with 123, you need a ${foo} placeholder in the file. This gets super tricky because, guess what?! Grafana uses the same syntax for placeholders when you want to reference internal variables within a Grafana dashboard! FUN!

There is a safe word, though: prefix your symbol with yet another $. The $$ sequence tells HashiCorp Configuration Language to treat the symbol that follows as a literal rather than interpolating it, which leaves it for the Grafana engine to process instead of Terraform.
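Here’s a minimal sketch of the difference, using hypothetical local names. Inside any Terraform string or template, ${...} is interpolated by Terraform, while $${...} survives as a literal ${...}:

locals {
  database = "telemetry"

  # Terraform resolves ${local.database} below; the escaped $${stamp_adx}
  # comes out as the literal text ${stamp_adx} for Grafana to resolve later.
  fragment = <<-EOT
    "database": "${local.database}",
    "uid": "$${stamp_adx}"
  EOT
}

# Rendered value of local.fragment:
#   "database": "telemetry",
#   "uid": "${stamp_adx}"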

As you can see, I am using the $$ prefix when referencing the uid of the desired Data Source. The $${stamp_adx} is actually a reference to a Grafana variable, ${stamp_adx}, which is the value of a drop-down in the Dashboard that lets the user select which environment they want to view data for (each environment has its own Azure Data Explorer cluster, hence ADX).

The symbols that Terraform actually parses are ${stamp_database_name}, ${datasource_filter} and ${query}. The database name stays configurable because that database is itself provisioned by Terraform using the azurerm provider (more on automating Azure Data Explorer clusters later).

In order to provision this Dashboard, we store the JSON that we export from Grafana in a file called “dashboard.json” and put it into a folder corresponding to the Dashboard it represents. This also gives us the option to break the file up into smaller, more maintainable components if we choose.

We then render that JSON file with templatefile, making sure to pass in the necessary parameters.

resource "grafana_dashboard" "node_infra_metrics" {
  folder = grafana_folder.day2ops.id
  config_json = templatefile("${path.module}/dashboards/node-infra-metrics/dashboard.json",
    {
      stamp_database_name = "${local.stamp_database_name}"
      datasource_filter   = "${local.datasource_filter}"
      query               = "${file("${path.module}/dashboards/node-infra-metrics/query.kql")}"
    }
  )
}

You’ll notice that I am using String Interpolation on each of the templatefile parameters. This looks very unnecessary, and on recent Terraform versions (0.12 and later) bare expressions like local.stamp_database_name are normally accepted, with the wrapped “${...}” form flagged as redundant. However, I received errors when I tried the “normal” HCL syntax here, so this is just the way it worked for me at the time.
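For what it’s worth, the unwrapped form below should be equivalent on recent Terraform versions, and terraform fmt will typically strip the redundant interpolation for you. Treat this as a sketch to try, not a guarantee:

resource "grafana_dashboard" "node_infra_metrics" {
  folder = grafana_folder.day2ops.id
  config_json = templatefile("${path.module}/dashboards/node-infra-metrics/dashboard.json",
    {
      stamp_database_name = local.stamp_database_name
      datasource_filter   = local.datasource_filter
      query               = file("${path.module}/dashboards/node-infra-metrics/query.kql")
    }
  )
}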

So let’s explain these parameters. First, the Datasource Filter. This populates a drop-down list of data sources whose names match a certain pattern. All of our Azure Data Explorer (ADX) clusters’ names started with “synth”, and when we generate the grafana_data_source resources we make sure the data source names follow suit (more on that later).

locals {
  # Regex for the dashboard's data source variable; matches our "synth..." data source names
  datasource_filter = "synth[^0].*"
}
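For context, ${datasource_filter} lands inside the templating element that I elided from the dashboard JSON earlier. It ends up on a data source variable, roughly like this (the exact field set varies by Grafana version, so treat this as an approximation):

"templating": {
  "list": [
    {
      "includeAll": false,
      "name": "stamp_adx",
      "query": "grafana-azure-data-explorer-datasource",
      "regex": "${datasource_filter}",
      "type": "datasource"
    }
  ]
}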

Next, we have the Stamp Database Name. This is actually set by another Terraform deployment that provisions the Azure Data Explorer (ADX) cluster and databases. The cluster is provisioned with azurerm_kusto_cluster and the database with azurerm_kusto_database. Like many database services, the database itself is dependent on the cluster it’s provisioned to. While we could add some connective tissue to pipe this value in from the previous azurerm Terraform Apply, we took the lazy way and just created a local for it. Maybe Terraform Stacks will make this easier in the future so we don’t have to take the “lazy way”.

locals {
  # Name of the ADX database provisioned by the upstream azurerm deployment
  stamp_database_name = "telemetry"
}
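For reference, the upstream deployment looks roughly like this. This is a hedged sketch with hypothetical names (the resource group, cluster name and SKU are made up), not the actual production module:

resource "azurerm_resource_group" "stamp" {
  name     = "rg-synth1"
  location = "eastus"
}

resource "azurerm_kusto_cluster" "stamp" {
  name                = "synth1"
  location            = azurerm_resource_group.stamp.location
  resource_group_name = azurerm_resource_group.stamp.name

  sku {
    name     = "Standard_D13_v2"
    capacity = 2
  }
}

# The database our dashboards query; its name feeds local.stamp_database_name.
resource "azurerm_kusto_database" "telemetry" {
  name                = "telemetry"
  location            = azurerm_resource_group.stamp.location
  resource_group_name = azurerm_resource_group.stamp.name
  cluster_name        = azurerm_kusto_cluster.stamp.name
}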

Lastly, I dump the query into its own text file called query.kql in the corresponding dashboard’s directory. The nice thing about this is that you can very easily read the query and make changes to it. It lets you make targeted updates to Dashboard queries without attempting inline edits within the Dashboard JSON file itself (not for the faint of heart!). However, another approach would be to avoid embedding queries in Grafana Dashboards altogether and encapsulate each query in a stored function in the database (ADX’s equivalent of a stored procedure), then call that function from Grafana. This keeps the Grafana query simple and lets you manage the database schema in one place. Much more preferable!

let per_node_max=node_lookup_v3
| where CassandraDataCenter in ($datacenter)
| where CassandraCluster == "$stamp_adx"
| join (
    cassandra_ClientRequest
    | where $__timeFilter(timestamp)
    | where tag_name == "Latency"
    | where tag_scope == "Write" or tag_scope == "Read-LOCAL_ONE" or tag_scope == "Read-LOCAL_QUORUM"
    | extend metric_timestamp = timestamp
    | extend OneMinuteRate = todouble(field_OneMinuteRate)
) on $left.ComputerName == $right.tag_host
| summarize max(OneMinuteRate) by bin(metric_timestamp, 1m), CassandraDataCenter, tag_host, tag_scope
| order by metric_timestamp asc;
per_node_max
| summarize sum(max_OneMinuteRate) by bin(metric_timestamp, 1m), CassandraDataCenter, tag_host
| order by metric_timestamp asc

If I had more panels, I would probably break those out into their own directories as well, each with its own panel.json file and its own query.kql file. The folder structure would look something like this:

- dashboards
  - node-infra-metrics
    - dashboard.json
    - panel1
      - panel.json
      - query.kql
    - panel2
      - panel.json
      - query.kql
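One possible way to stitch that structure back together (a hypothetical sketch, not what we ran in production) is to render each panel template first and splice the results into the dashboard template, whose panels element would then contain just ${panels}:

locals {
  panel_dirs = ["panel1", "panel2"]

  # Render each panel template. jsonencode() turns the raw KQL file into a
  # properly escaped JSON string literal (quotes and newlines included), so
  # panel.json would reference it as "query": ${query} with no surrounding quotes.
  rendered_panels = [
    for p in local.panel_dirs :
    templatefile("${path.module}/dashboards/node-infra-metrics/${p}/panel.json", {
      query = jsonencode(file("${path.module}/dashboards/node-infra-metrics/${p}/query.kql"))
    })
  ]
}

resource "grafana_dashboard" "node_infra_metrics" {
  folder = grafana_folder.day2ops.id
  config_json = templatefile("${path.module}/dashboards/node-infra-metrics/dashboard.json", {
    # dashboard.json would contain: "panels": [ ${panels} ]
    panels              = join(",", local.rendered_panels)
    stamp_database_name = local.stamp_database_name
    datasource_filter   = local.datasource_filter
  })
}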

Anyways, as you can see, creating a Grafana Dashboard using Terraform is pretty straightforward. It does involve some fairly intricate HCL, but if you keep the code consistent it helps with readability.

There are a few areas of complexity that we have to deal with when automating Grafana Dashboards:

  1. Working with the Grafana JSON schema. It’s JSON, so it’s verbose; plus we have layers of JSON, with both the Dashboard and the Panels having their own schema.
  2. Working with the overlapping symbol syntax of HCL and Grafana, and using an escape character to signal when Grafana should do the processing instead of HCL.
  3. Working with our queries, in whatever query language they happen to be in. Because I am using Azure Data Explorer (ADX), I am using KQL. Depending on your database, YMMV.

In Part Four, we’ll get into how to automate both Azure and Grafana using two rounds of Terraform Apply to execute two different root modules.

Until then — Happy Azure Terraforming!