Monitoring Guide
Alluxio exposes metrics via the Java Management Extensions (JMX). Once JMX is enabled according to the installation instructions, you can use the metrics to monitor the cluster. See the full list of metrics for details.
Setup Prometheus and Grafana
Adding the Prometheus Java agent to the Trino JVM configuration exposes metrics for each individual Trino process. A Prometheus server then collects the metrics from the exposed agent endpoints, and a visualization tool such as Grafana can run aggregate queries against Prometheus to provide cluster-level monitoring.
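As a sketch, the agent is typically added as an extra line in each Trino node's etc/jvm.config, alongside whatever JVM options are already there; the -server and -Xmx values below are placeholders, and ${TRINO_HOME} stands for the actual installation path:

```
-server
-Xmx16G
-javaagent:${TRINO_HOME}/lib/jmx_prometheus_javaagent-0.20.0.jar=9696
```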
Start Prometheus
Download Prometheus and extract the contents. Create a prometheus.yml
configuration file in the extracted directory:
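A minimal configuration might look like the following sketch; the job name, scrape interval, and target hostnames are placeholders to adapt to your deployment:

```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "trino"
    static_configs:
      - targets:
          - "trino-coordinator.example.com:9696"
          - "trino-worker-1.example.com:9696"
          - "trino-worker-2.example.com:9696"
```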
Replace the values of the targets entries with the hostnames of your Trino processes. The 9696 port should match the port configured for the Prometheus Java agent, e.g. -javaagent:${TRINO_HOME}/lib/jmx_prometheus_javaagent-0.20.0.jar=9696.
Note that this configuration assumes the node running Prometheus can reach each of the Trino processes on this port.
Start the Prometheus server process with the command: ${PROMETHEUS_HOME}/prometheus --config.file=${PROMETHEUS_HOME}/prometheus.yml &
Start Grafana
Download Grafana and extract the contents.
Start the Grafana server process with the command: ${GRAFANA_HOME}/bin/grafana-server --homepath=${GRAFANA_HOME} web &
Assuming port 3000 is exposed, navigate to the Grafana UI by browsing to http://<HOSTNAME>:3000, replacing <HOSTNAME> with the running server's hostname. The default login credentials are admin for both the username and password.
Through the UI, configure Prometheus as a data source. Assuming Prometheus and Grafana are running on the same node, the Prometheus URL will be http://localhost:9090.
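If you prefer file-based provisioning over the UI, a data source definition along these lines achieves the same result; this is a sketch, and the file would typically live under ${GRAFANA_HOME}/conf/provisioning/datasources/:

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```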
Now you can either build your own dashboard or import the provided dashboard to monitor the cluster.
Monitoring the Cluster
It is recommended to monitor the following metrics, both for particular Trino workers in question and cluster-wide. Below we list scenarios where you may find certain metrics helpful.
Monitoring Cache Health
To check the health of the cache, check the metric CacheState, which reports one of the following values:
- 0 NOT_IN_USE: the cache failed to initialize or restore and cannot recover, e.g. the disk is not accessible
- 1 READ_ONLY: get operations work, but put and delete do not, e.g. the cache is still recovering from a previous start
- 2 READ_WRITE: the normal and expected state
Each cache reports its own status, so it is easy to determine whether any cache, and which one, is in an undesired state.
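If the exporter exposes this metric to Prometheus under the name CacheState (the exposed name depends on your JMX exporter rules, so treat this purely as a sketch), an alerting rule along the following lines can flag any cache that leaves the READ_WRITE state:

```yaml
groups:
  - name: alluxio-cache   # hypothetical rule group name
    rules:
      - alert: CacheNotWritable
        # Fires when a cache reports a state other than READ_WRITE (2)
        expr: CacheState < 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache on {{ $labels.instance }} has left the READ_WRITE state"
```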
Understanding PUT Errors (Writing Into Cache)
If you see a high number of PUT errors, collect the following metrics to understand the reasons for the failed puts.
CachePutErrors: total number of put errors. This should be the sum of the following detailed put errors:
- CachePutNotReadyErrors: the cache is still recovering and not yet in READ_WRITE mode.
- CachePutStoreWriteErrors: failed to write new pages to the local disk.
- CachePutStoreDeleteErrors: failed to delete evicted pages (data) from the local disk.
- CachePutBenignRacingErrors: a benign race condition occurred that does not affect the system, e.g. two threads trying to delete the same page.
- CachePutEvictionErrors: cache metadata eviction failures.
- CachePutAsyncRejectionErrors: async write is enabled and the thread pool limit was reached. This should not happen with the latest implementation; if this metric is > 0 there might be an issue.
Understanding GET Errors (Reading From Cache)
If you see a high number of GET errors, collect the following metrics to understand the reasons for the failed gets.
CacheGetErrors: total number of get errors. This should be the sum of the following detailed get errors:
- CacheGetNotReadyErrors: the cache is in NOT_IN_USE mode.
- CacheGetStoreReadErrors: failed to read pages from the local disk.
Understanding Cache Hit Ratio
There are two kinds of cache hit ratio here. They may differ slightly, but in general we expect them to track each other closely.
Data Cache Hit Ratio
CacheBytesReadCache / (CacheBytesReadCache + CacheBytesRequestedExternal)
The percentage of bytes the cache is able to fulfill compared to the total bytes requested.
Data Request Cache Hit Ratio
CacheHitRequests / (CacheHitRequests + CacheExternalRequests)
The percentage of data requests the cache is able to fulfill compared to the total data requests.
We want these two rates to be 80% or higher to make sure the cache is used efficiently. A rate of 90% or 95%+ is ideal.
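As a sketch, assuming the counters above are exported to Prometheus under the same names (the JMX exporter may rewrite them, so adjust to the names you actually see), the data cache hit ratio over the last five minutes could be graphed in Grafana with a query such as:

```
sum(rate(CacheBytesReadCache[5m]))
  /
(sum(rate(CacheBytesReadCache[5m])) + sum(rate(CacheBytesRequestedExternal[5m])))
```

The request hit ratio follows the same pattern with CacheHitRequests and CacheExternalRequests.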
Hardware metrics
These hardware metrics indicate the cache utilization level:
- CacheSpaceUsed
- CacheSpaceAvailable
If the cache utilization rate is high and the cache hit rate is low, consider increasing the size of the cache, or use a filter to control what goes into the cache to make it more efficient.
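Under the same naming assumption as above, the overall utilization rate can be derived from the two space metrics, for example:

```
sum(CacheSpaceUsed) / (sum(CacheSpaceUsed) + sum(CacheSpaceAvailable))
```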
Others
- Young GC and Old GC (part of your JVM setup): a spike in these may indicate an abnormality; you may need to check your GC configuration.
- CacheBytesRequestedExternal and CacheExternalRequests: a spike in these may lead to read failures.
- CacheBytesEvicted: constant and steady eviction is expected, but a spike may indicate an issue or lead to failures.
- CachePages: this is expected to be close to full in normal operation. During worker initialization the cache is reloaded, so you should see this metric start from 0 and gradually climb to almost full.
Understanding the Impact of Cache Financially
Cloud vendors charge for requests made against your cloud storage and objects. Request costs are usually based on the request type (two tiers of pricing) and are charged per quantity of requests. We have provided counts of the relevant requests to help you understand the financial impact of the cache.
Currently, cloud vendors define the tiers as follows (e.g. AWS S3 pricing):
Tier 1: PUT, COPY, POST, LIST requests
Tier 2: GET, SELECT, and all other requests
In the Alluxio Edge context, the relevant requests are:
- Tier 1: ListStatusExternalRequests (LIST)
- Tier 2: CacheExternalRequests (GET), GetFileInfoExternalRequests (GET)
Each request that comes to Alluxio Edge is either a cache hit, served by Alluxio Edge (free), or a cache miss, which goes directly to the under storage (the same as before adding Alluxio Edge). So the cost after adding Alluxio can be calculated by removing the cache hit request counts. For example:
- Original Tier 1 cost: (ListStatusHitRequests + ListStatusExternalRequests) * <price of tier 1 requests>
- Original Tier 2 cost: (CacheHitRequests + CacheExternalRequests + GetFileInfoHitRequests + GetFileInfoExternalRequests) * <price of tier 2 requests>
- New Tier 1 cost with cache: (ListStatusExternalRequests) * <price of tier 1 requests>
- New Tier 2 cost with cache: (CacheExternalRequests + GetFileInfoExternalRequests) * <price of tier 2 requests>
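As a purely hypothetical illustration: if, over a billing period, the workers report a combined 8 million CacheHitRequests and GetFileInfoHitRequests against a combined 2 million CacheExternalRequests and GetFileInfoExternalRequests, the billable Tier 2 requests drop from 10 million to 2 million, i.e. the Tier 2 request cost is roughly 20% of what it would have been without the cache.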
NOTE:
- This is an estimate for guidance purposes only.
- Different data lake formats may affect the kinds of requests that are sent to Alluxio Edge.