Spark 3.0 Monitoring with Prometheus


Monitoring prior to 3.0

Prior to Apache Spark 3.0, there were different approaches to expose metrics to Prometheus:

1- Using Spark’s JmxSink and Prometheus’s JMXExporter (see Monitoring Apache Spark on Kubernetes with Prometheus and Grafana)

2- Using Spark’s GraphiteSink and Prometheus’s GraphiteExporter

3- Using custom sinks and Prometheus’s Pushgateway

Monitoring in 3.0

Apache Spark 3.0 introduced two resources to expose metrics natively: the PrometheusServlet metrics sink (configured in metrics.properties) and the PrometheusResource REST endpoint for executor metrics.

These features are more convenient than the agent-based approaches, which require an extra port to be open (something that may not be possible in every environment). The following table summarizes the newly exposed endpoints for each node:

Node    Port  Prometheus Endpoint                JSON Endpoint
Driver  4040  /metrics/prometheus/               /metrics/json/
Driver  4040  /metrics/executors/prometheus/     /api/v1/applications/{id}/executors/
Worker  8081  /metrics/prometheus/               /metrics/json/
Master  8080  /metrics/master/prometheus/        /metrics/master/json/
Master  8080  /metrics/applications/prometheus/  /metrics/applications/json/
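
With these endpoints in place, a Prometheus server can scrape the cluster directly. A minimal prometheus.yml sketch for a single-host standalone cluster might look as follows (job names are arbitrary, and the targets assume the default ports from the table above; adjust for your deployment):

```yaml
# prometheus.yml — hypothetical scrape jobs for a local standalone cluster
scrape_configs:
  - job_name: spark-master
    metrics_path: /metrics/master/prometheus
    static_configs:
      - targets: ['localhost:8080']
  - job_name: spark-worker
    metrics_path: /metrics/prometheus
    static_configs:
      - targets: ['localhost:8081']
  - job_name: spark-driver
    metrics_path: /metrics/prometheus
    static_configs:
      - targets: ['localhost:4040']
  - job_name: spark-executors
    metrics_path: /metrics/executors/prometheus
    static_configs:
      - targets: ['localhost:4040']
```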

Copy $SPARK_HOME/conf/metrics.properties.template to $SPARK_HOME/conf/metrics.properties and add/uncomment the following lines (they should be at the end of the template file):

*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus

For testing, start a Spark cluster as follows:

$ sbin/start-master.sh
$ sbin/start-slave.sh spark://`hostname`:7077
$ bin/spark-shell --master spark://`hostname`:7077

Note: to enable executor metrics we need to set spark.ui.prometheus.enabled to true (and optionally spark.executor.processTreeMetrics.enabled for process-tree metrics):

$ bin/spark-shell --master spark://`hostname`:7077 \
    --conf spark.ui.prometheus.enabled=true \
    --conf spark.executor.processTreeMetrics.enabled=true

Master metrics

Now we can query the metrics of the Master node in JSON or in a Prometheus-compatible format:

$ curl -s http://localhost:8080/metrics/master/json/ | jq
{
  "version": "4.0.0",
  "gauges": {
    "master.aliveWorkers": {
      "value": 1
    },
    "master.apps": {
      "value": 1
    },
    ...
  }
}
$ curl -s http://localhost:8080/metrics/master/prometheus/ | head
metrics_master_aliveWorkers_Number{type="gauges"} 1
metrics_master_aliveWorkers_Value{type="gauges"} 1
metrics_master_apps_Number{type="gauges"} 1
metrics_master_apps_Value{type="gauges"} 1
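
The Prometheus servlet emits one name{labels} value line per metric, as shown above. If you want to consume these endpoints outside of Prometheus (for an ad-hoc script, say), such lines can be parsed with a few lines of Python. This is a hedged sketch, not an official client; the sample line is taken from the output above:

```python
import re

# Matches one line of the Prometheus text exposition format:
#   metric_name{label1="v1", label2="v2"} value
LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')

def parse_metric(line):
    """Return (name, labels_dict, value) for one exposition line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized metric line: {line!r}")
    labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group('labels')))
    return m.group('name'), labels, float(m.group('value'))

# Sample line from the Master endpoint output above
sample = 'metrics_master_aliveWorkers_Value{type="gauges"} 1'
name, labels, value = parse_metric(sample)
print(name, labels, value)  # metrics_master_aliveWorkers_Value {'type': 'gauges'} 1.0
```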

Worker metrics

The metrics of the Worker node, in JSON or in a Prometheus-compatible format:

$ curl -s http://localhost:8081/metrics/json/ | jq
{
  "version": "4.0.0",
  "gauges": {
    "worker.coresFree": {
      "value": 0
    },
    ...
  }
}
$ curl -s http://localhost:8081/metrics/prometheus/ | head
metrics_worker_coresFree_Number{type="gauges"} 0
metrics_worker_coresFree_Value{type="gauges"} 0
metrics_worker_coresUsed_Number{type="gauges"} 8
metrics_worker_coresUsed_Value{type="gauges"} 8

Driver metrics

And the metrics of the Driver, in JSON or in Prometheus format:

$ curl -s http://localhost:4040/metrics/json/ | jq
{
  "version": "4.0.0",
  "gauges": {
    "local-1593797764926.driver.BlockManager.disk.diskSpaceUsed_MB": {
      "value": 0
    },
    ...
  }
}
$ curl -s http://localhost:4040/metrics/prometheus/ | head
metrics_local_1593797764926_driver_BlockManager_disk_diskSpaceUsed_MB_Number{type="gauges"} 0
metrics_local_1593797764926_driver_BlockManager_disk_diskSpaceUsed_MB_Value{type="gauges"} 0
metrics_local_1593797764926_driver_BlockManager_memory_maxMem_MB_Number{type="gauges"} 366
metrics_local_1593797764926_driver_BlockManager_memory_maxMem_MB_Value{type="gauges"} 366
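
Note that the application ID (local-1593797764926 here) is baked into every driver metric name, which makes the names unstable across runs and awkward for dashboards. Spark's spark.metrics.namespace configuration can replace that prefix; one common choice is to use the application name instead, e.g. in spark-defaults.conf:

```
spark.metrics.namespace=${spark.app.name}
```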

Executors metrics

The Executors metrics in Prometheus format can be accessed as follows:

$ curl -s http://localhost:4040/metrics/executors/prometheus | head
spark_info{version="3.0.0", revision="3fdfce3120f307147244e5eaf46d61419a723d50"} 1.0
metrics_executor_rddBlocks{application_id="app-20200703115147-0001", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_memoryUsed_bytes{application_id="app-20200703115147-0001", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_diskUsed_bytes{application_id="app-20200703115147-0001", application_name="Spark shell", executor_id="driver"} 0

The Executors metrics in JSON format can be accessed as follows (an application ID needs to be provided):

$ curl -s http://localhost:4040/api/v1/applications/app-20200703115147-0001/executors | jq
[
  {
    "id": "driver",
    "hostPort": "10.0.0.242:57429",
    "isActive": true,
    ...
  }
]
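
Each entry in that JSON array describes one executor; the driver itself appears with the id "driver". A short Python sketch for summarizing such a response — since fetching requires a running application, a hand-written sample payload (with illustrative field values) stands in for the HTTP call here:

```python
import json

# Sample payload shaped like the /api/v1/applications/{id}/executors output;
# a real call would fetch it with e.g. urllib.request.urlopen(url).read()
payload = '''
[
  {"id": "driver", "hostPort": "10.0.0.242:57429", "isActive": true,
   "totalCores": 0, "memoryUsed": 0},
  {"id": "0", "hostPort": "10.0.0.242:57501", "isActive": true,
   "totalCores": 8, "memoryUsed": 1024}
]
'''

def active_executors(raw):
    """Return the ids of all active executors in an /executors response."""
    return [e["id"] for e in json.loads(raw) if e["isActive"]]

print(active_executors(payload))  # ['driver', '0']
```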