Monitoring Guide
Alluxio exposes metrics via the Java Management Extensions (JMX). Once JMX is enabled according to the installation instructions, you can use the metrics to monitor the cluster. See the full list of metrics for details.
Setup Prometheus and Grafana
Adding the Prometheus Java agent to the Trino JVM configuration exposes metrics for each individual Trino process. A Prometheus server then collects the metrics from the exposed agent endpoints, and a visualization tool such as Grafana can run aggregate queries against Prometheus to provide cluster-level monitoring.
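As a sketch, the agent is typically added as an extra line in each Trino node's etc/jvm.config, alongside whatever JVM options are already there; the -server and -Xmx values below are placeholders, and ${TRINO_HOME} stands for the actual installation path:

```
-server
-Xmx16G
-javaagent:${TRINO_HOME}/lib/jmx_prometheus_javaagent-0.20.0.jar=9696
```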
Start Prometheus
Download Prometheus and extract the contents. Create a prometheus.yml
configuration file in the extracted directory:
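A minimal configuration might look like the following sketch; the job name, scrape interval, and target hostnames are placeholders to adapt to your deployment:

```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "trino"
    static_configs:
      - targets:
          - "trino-coordinator.example.com:9696"
          - "trino-worker-1.example.com:9696"
          - "trino-worker-2.example.com:9696"
```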
Replace the values of the targets entries with the hostnames of your Trino processes. The 9696 port should match the port configured for the Prometheus Java agent, e.g. -javaagent:${TRINO_HOME}/lib/jmx_prometheus_javaagent-0.20.0.jar=9696.
Note that this configuration assumes the node running Prometheus can reach each of the Trino processes on this port.
Start the Prometheus server process with the command: ${PROMETHEUS_HOME}/prometheus --config.file=${PROMETHEUS_HOME}/prometheus.yml &
Start Grafana
Download Grafana and extract the contents.
Start the Grafana server process with the command: ${GRAFANA_HOME}/bin/grafana-server --homepath=${GRAFANA_HOME} web &
Assuming port 3000 is exposed, navigate to the Grafana UI by browsing to http://<HOSTNAME>:3000, replacing <HOSTNAME> with the running server's hostname. The default login credentials are admin for both the username and password.
Through the UI, configure Prometheus as a data source. Assuming Prometheus and Grafana are running on the same node, the Prometheus URL will be http://localhost:9090.
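If you prefer file-based provisioning over the UI, a data source definition along these lines achieves the same result; this is a sketch, and the file would typically live under ${GRAFANA_HOME}/conf/provisioning/datasources/:

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```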
Now you can either build your own dashboard or import the provided dashboard to monitor the cluster.
Monitoring the Cluster
It is recommended to monitor the following metrics, both for particular Trino workers in question and cluster-wide. Below we list scenarios where you may find certain metrics helpful.
Monitoring Cache Health
To check the health of the cache, check the metric CacheState, which reports one of the following values:
- 0 NOT_IN_USE: the cache failed to initialize or restore and cannot recover, e.g. the disk is not accessible
- 1 READ_ONLY: get operations work, but put and delete do not, e.g. the cache is still recovering from a previous start
- 2 READ_WRITE: the normal and expected state
Each cache reports its own status, so it is easy to determine whether any cache, and which one, is in an undesired state.
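If the exporter exposes this metric to Prometheus under the name CacheState (the exposed name depends on your JMX exporter rules, so treat this purely as a sketch), an alerting rule along the following lines can flag any cache that leaves the READ_WRITE state:

```yaml
groups:
  - name: alluxio-cache   # hypothetical rule group name
    rules:
      - alert: CacheNotWritable
        # Fires when a cache reports a state other than READ_WRITE (2)
        expr: CacheState < 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache on {{ $labels.instance }} has left the READ_WRITE state"
```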
Understanding PUT Errors (Writing Into Cache)
If you see a high number of PUT errors, collect the following metrics to understand the reasons for the failed puts.
CachePutErrors: total number of put errors. This should be the sum of the following detailed put errors:
- CachePutNotReadyErrors: the cache is still recovering and not yet in READ_WRITE mode.
- CachePutStoreWriteErrors: failed to write new pages to the local disk.
- CachePutStoreDeleteErrors: failed to delete evicted pages (data) from the local disk.
- CachePutBenignRacingErrors: a benign race condition occurred that does not affect the system, e.g. two threads trying to delete the same page.
- CachePutEvictionErrors: cache metadata eviction failures.
- CachePutAsyncRejectionErrors: async write is enabled and the thread pool limit was reached. This should not happen with the latest implementation; if this metric is > 0 there might be an issue.
Understanding GET Errors (Reading From Cache)
If you see a high number of GET errors, collect the following metrics to understand the reasons for the failed gets.
CacheGetErrors: total number of get errors. This should be the sum of the following detailed get errors:
- CacheGetNotReadyErrors: the cache is in NOT_IN_USE mode.
- CacheGetStoreReadErrors: failed to read pages from the local disk.
Understanding Cache Hit Ratio
There are two kinds of cache hit ratio here. They may differ slightly, but in general we expect them to track each other closely.
Data Cache Hit Ratio
CacheBytesReadCache / (CacheBytesReadCache + CacheBytesRequestedExternal)
The percentage of bytes the cache is able to fulfill compared to the total bytes requested.
Data Request Cache Hit Ratio
CacheHitRequests / (CacheHitRequests + CacheExternalRequests)
The percentage of data requests the cache is able to fulfill compared to the total data requests.
We want these two rates to be 80% or higher to make sure the cache is used efficiently. A rate of 90% or 95%+ is ideal.
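As a sketch, assuming the counters above are exported to Prometheus under the same names (the JMX exporter may rewrite them, so adjust to the names you actually see), the data cache hit ratio over the last five minutes could be graphed in Grafana with a query such as:

```
sum(rate(CacheBytesReadCache[5m]))
  /
(sum(rate(CacheBytesReadCache[5m])) + sum(rate(CacheBytesRequestedExternal[5m])))
```

The request hit ratio follows the same pattern with CacheHitRequests and CacheExternalRequests.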
Hardware metrics
These hardware metrics indicate the cache utilization level:
- CacheSpaceUsed
- CacheSpaceAvailable
If the cache utilization rate is high and the cache hit rate is low, consider increasing the size of the cache, or use a filter to control what goes into the cache to make it more efficient.
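Under the same naming assumption as above, the overall utilization rate can be derived from the two space metrics, for example:

```
sum(CacheSpaceUsed) / (sum(CacheSpaceUsed) + sum(CacheSpaceAvailable))
```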
Others
- Young GC and Old GC (part of your JVM setup): a spike in these may indicate an abnormality; you may need to check your GC configuration.
- CacheBytesRequestedExternal and CacheExternalRequests: a spike in these may lead to read failures.
- CacheBytesEvicted: constant and steady eviction is expected, but a spike may indicate an issue or lead to failures.
- CachePages: this is expected to be close to full in normal operation. During worker initialization the cache is reloaded, so you should see this metric start from 0 and gradually climb to almost full.
Understanding the Impact of Cache Financially
Cloud vendors charge for requests made against your cloud storage and objects. Request costs are usually based on the request type (two tiers of pricing) and are charged per quantity of requests. We have provided counts of the relevant requests to help you understand the financial impact of the cache.
Currently, cloud vendors define the tiers as follows (e.g. AWS S3 pricing):
Tier 1: PUT, COPY, POST, LIST requests
Tier 2: GET, SELECT, and all other requests
In the Alluxio Edge context, the relevant requests are:
- Tier 1: ListStatusExternalRequests (LIST)
- Tier 2: CacheExternalRequests (GET), GetFileInfoExternalRequests (GET)
Each request that comes to Alluxio Edge is either a cache hit, served by Alluxio Edge (free), or a cache miss, which goes directly to the under storage (the same as before adding Alluxio Edge). So the cost after adding Alluxio can be calculated by removing the cache hit request counts. For example:
- Original Tier 1 cost: (ListStatusHitRequests + ListStatusExternalRequests) * <price of tier 1 requests>
- Original Tier 2 cost: (CacheHitRequests + CacheExternalRequests + GetFileInfoHitRequests + GetFileInfoExternalRequests) * <price of tier 2 requests>
- New Tier 1 cost with cache: (ListStatusExternalRequests) * <price of tier 1 requests>
- New Tier 2 cost with cache: (CacheExternalRequests + GetFileInfoExternalRequests) * <price of tier 2 requests>
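As a purely hypothetical illustration: if, over a billing period, the workers report a combined 8 million CacheHitRequests and GetFileInfoHitRequests against a combined 2 million CacheExternalRequests and GetFileInfoExternalRequests, the billable Tier 2 requests drop from 10 million to 2 million, i.e. the Tier 2 request cost is roughly 20% of what it would have been without the cache.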
NOTE:
- This is an estimate for guidance purposes only.
- Different data lake formats may affect the kinds of requests that are sent to Alluxio Edge.