Monitoring Alluxio
Metrics provide invaluable insight into your Alluxio cluster's health and performance. Alluxio exposes metrics in the Prometheus exposition format, allowing for easy integration with modern monitoring stacks.
This guide covers how to monitor your Alluxio cluster, from using the pre-configured dashboards provided by the Alluxio Operator to setting up your own monitoring manually.
Default Monitoring with the Alluxio Operator
The easiest way to monitor Alluxio on Kubernetes is with the Alluxio Operator. By default, the operator deploys a complete monitoring stack alongside your Alluxio cluster, including Prometheus for metrics collection and Grafana for visualization.
Accessing the Grafana Dashboard
The Grafana dashboard is the primary tool for visualizing your cluster's metrics. You can access it in two ways:
1. Accessing via Port Forwarding (Recommended)
Use kubectl port-forward to securely access the Grafana UI from your local machine.
```shell
# Find the Grafana pod and forward port 3000
kubectl -n alx-ns port-forward $(kubectl -n alx-ns get pod -l app.kubernetes.io/component=grafana -o jsonpath="{.items[0].metadata.name}") 3000:3000
```

You can then open your browser and navigate to http://localhost:3000.
2. Accessing via Node Hostname
If your Kubernetes nodes are directly accessible on your network, you can reach Grafana via its NodePort.
```shell
# Get the hostname of the node where Grafana is running
kubectl -n alx-ns get pod $(kubectl -n alx-ns get pod -l app.kubernetes.io/component=grafana --no-headers -o custom-columns=:metadata.name) -o jsonpath='{.spec.nodeName}'
```

Assuming the hostname is foo.kubernetes.org, you can access the Grafana service at http://foo.kubernetes.org:8080/.
Understanding the Dashboard
The default dashboard provides a comprehensive overview of your cluster's state.

The Cluster section gives a high-level summary of the cluster status.
The Process section details resource consumption (CPU, memory) and JVM metrics for each Alluxio component.
Other sections provide detailed metrics for specific components like the coordinator and workers.
Disabling the Default Grafana
If you wish to use your own Grafana instance, you can disable the default one by setting spec.grafana.enabled to false in your AlluxioCluster definition. Prometheus is a core component and cannot be disabled.
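As a minimal sketch, the relevant fragment of an AlluxioCluster definition would look like the following (only the field described above is shown; the rest of the spec is omitted):

```yaml
# Fragment of an AlluxioCluster resource; other spec fields omitted
spec:
  grafana:
    enabled: false   # disable the bundled Grafana and point your own Grafana at the built-in Prometheus
```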
Setting Up Metric Alert Rules
The following rules cover the key health signals to alert on. Each entry lists the metric, the query to use, the recommended trigger condition, and what a firing alert means.
Process Availability - ETCD
| Component | Metric | Metric Explanation | Query | Query Explanation | Trigger Condition | Threshold/Value | Meaning | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Process Availability - ETCD | `etcd_server_has_leader` | Shows whether each etcd member currently has a leader | `sum(etcd_server_has_leader{job="etcd"})` | Sums all members that currently have a leader | value < 3 | 3 members expected | One or more etcd pods are down or quorum is lost | |
| Process Availability - ETCD | `etcd_server_leader_changes_seen_total` | Counts how many times the leader has changed | `changes(etcd_server_leader_changes_seen_total{job="etcd"}[5m])` | Calculates the number of leader changes (elections) that occurred within the last 5 minutes | > 0 for 5+ min | Any change > 0 | Leader flapping; indicates etcd instability or network issues | Query needs to be modified on the dashboard from 1d to 5m |
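If you manage alerts with Prometheus rather than Grafana, the conditions above can be expressed as alerting rules. The following is a minimal sketch; the rule names, severities, and the 5-minute hold on the quorum alert are illustrative choices, not part of the dashboard:

```yaml
groups:
  - name: alluxio-etcd-availability
    rules:
      # Fires when fewer than 3 etcd members report having a leader
      - alert: EtcdQuorumAtRisk
        expr: sum(etcd_server_has_leader{job="etcd"}) < 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Fewer than 3 etcd members have a leader; one or more pods may be down or quorum lost"
      # Fires when the etcd leader has changed within the last 5 minutes
      - alert: EtcdLeaderFlapping
        expr: changes(etcd_server_leader_changes_seen_total{job="etcd"}[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "etcd leader changes observed; possible instability or network issues"
```

The same pattern applies to the other rules in this section: put the query in expr, the trigger condition in the comparison, and the "for 5+ min" requirement in the for field.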
Process Availability - Worker count
| Component | Metric | Metric Explanation | Query | Query Explanation | Trigger Condition | Threshold/Value | Meaning | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Process Availability - Worker count | `up{job="worker"}` | Shows how many workers are alive (responding to Prometheus scrapes) | `sum(up{job="worker"})` | Counts the number of live worker targets | value < desired worker count | < desired worker count | One or more workers are down or not responding | Set desired worker count to match production cluster size |
Process Resource
| Component | Metric | Metric Explanation | Query | Query Explanation | Trigger Condition | Threshold/Value | Meaning | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Process Resource | `jvm_memory_used_bytes` | Shows current JVM heap usage as % of max | `jvm_memory_used_bytes{area="heap"}/jvm_memory_max_bytes{area="heap"}` | Calculates current heap usage as a percentage of the maximum heap | > 0.75 for 5+ min | 75–80% | Component is using a high percentage of its heap memory, indicating potential memory pressure or impending GC thrash | Applies to all components (coordinator, workers, fuse, etc.) |
| Process Resource | `jvm_gc_collection_seconds_sum` | Time spent in old GC collections | `rate(jvm_gc_collection_seconds_sum{gc="G1 Old Generation"}[5m])` | Calculates time spent in old/full GC over 5 minutes | > 5s/min for 5+ min | > 0.083 | JVM doing frequent full GCs → major pause risk | Combine with old GC count to confirm |
| Process Resource | `jvm_gc_collection_seconds_count` | Frequency of old GC collections | `rate(jvm_gc_collection_seconds_count{gc="G1 Old Generation"}[5m])` | Calculates the rate of old/full GCs over the last 5 minutes | > 1/min for 5+ min | > 1 | JVM doing many full GCs, likely due to memory pressure | Early memory pressure warning |
| Process Resource | `jvm_gc_collection_seconds_sum` | Time spent in young GC collections | `rate(jvm_gc_collection_seconds_sum{gc="G1 Young Generation"}[5m])` | Calculates time spent in young GC over 5 minutes | > 10s/min for 5+ min | > 0.166 | High GC overhead slowing throughput | Only alert if persistent |
| Process Resource | `process_cpu_seconds_total` | Measures total user + system CPU time consumed by the process | `irate(process_cpu_seconds_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | Calculates the per-second CPU usage rate over 5 minutes | stays consistently high for 5+ min | > 80% of 1 CPU core (≈ 0.8) | Process is CPU bound or stuck consuming full CPU | Tune threshold based on node vCPU cores; alert if usage is flat and near saturation |
Cache - Cache Hit Rate
| Component | Metric | Metric Explanation | Query | Query Explanation | Trigger Condition | Threshold/Value | Meaning | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cache - Cache Hit Rate | `alluxio_cached_data_read_bytes_total` & `alluxio_missed_data_read_bytes_total` | Measures how much read data was served from cache vs fetched from UFS | `sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) / (sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) + sum(irate(alluxio_missed_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])))` | Calculates cache hit ratio over 5 minutes | cache hit % stays low for 5+ min | < 80% | High UFS reads, cache not being utilized effectively | Adjust threshold based on workload (e.g. 70–90%) |
Cache - Utilization
| Component | Metric | Metric Explanation | Query | Query Explanation | Trigger Condition | Threshold/Value | Meaning | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cache - Utilization | `alluxio_cached_storage_bytes` & `alluxio_cached_capacity_bytes` | Shows how much of the configured cache capacity is currently used | `sum(alluxio_cached_storage_bytes{job="worker",cluster_name=~"$cluster"}) / sum(alluxio_cached_capacity_bytes{job="worker",cluster_name=~"$cluster"})` | Calculates current used/total cache ratio | > 0.85 (warning), > 0.95 (critical) for 5+ min | 85–95% utilization | Cache is nearly full, risk of eviction thrash or write failures | Adjust thresholds based on cluster size and workload pattern |
Cache - Cache Eviction - Correlation
| Component | Metric | Metric Explanation | Query | Query Explanation | Trigger Condition | Threshold/Value | Meaning | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cache - Cache Eviction - Correlation | `alluxio_cached_evicted_data_bytes_total` & `alluxio_block_store_used_bytes` | Tracks evicted bytes and current cache usage to detect cache pressure | `(sum(irate(alluxio_cached_evicted_data_bytes_total{job="worker"}[5m])) > 0) and ((sum(alluxio_block_store_used_bytes{job="worker"}) / sum(alluxio_block_store_capacity_bytes{job="worker"})) > 0.8)` | Checks if evictions are occurring while cache usage is above 80% | Evictions > 0 while usage > 80% for 5+ minutes | Usage > 80% and Evictions > 0 | Indicates cache thrashing or pressure (evictions happening despite high cache utilization) | Needs to be created manually as a new panel |
Fuse - UFS Fallback
| Component | Metric | Metric Explanation | Query | Query Explanation | Trigger Condition | Threshold/Value | Meaning | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Fuse - UFS Fallback | `alluxio_ufs_data_access_bytes_total` | Tracks read traffic from Fuse pods going directly to the UFS (bypassing the Alluxio cache) | `irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="read",cluster_name=~"$cluster"}[5m])` | Calculates Fuse-driven UFS read throughput over 5 minutes | sustained Fuse UFS read traffic increases | > 10 MiB/s sustained for > 5 min | Fuse clients are bypassing the Alluxio cache; high fallback | Correlate with cache hit % and request rate; fallback > 10–20 MiB/s is usually worth investigating |
Read Throughput
| Component | Metric | Metric Explanation | Query | Query Explanation | Trigger Condition | Threshold/Value | Meaning | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Read Throughput | `alluxio_data_throughput_bytes_total` | Measures read throughput served by workers | `sum(irate(alluxio_data_throughput_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m]))` | Calculates worker read throughput over 5 minutes | worker read throughput drops | < set baseline (e.g. < 10 MiB/s) while UFS reads go up | Cache not serving data, workload hitting UFS | Tune threshold based on normal workload pattern |
Data
| Component | Metric | Metric Explanation | Query | Query Explanation | Trigger Condition | Threshold/Value | Meaning | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Data | `alluxio_data_access_bytes_count{method="read"}` | Counts the number of read operations (requests) served by workers | `irate(alluxio_data_access_bytes_count{method="read",job="worker"}[5m])` | Calculates read request rate (req/s) over 5 minutes | Alert when rate drops to 0 while workload is expected | near 0 for > 5 min | Worker not serving data; possible worker crash or cache unavailable | Correlate with workload schedule to avoid false positives |
License - Expiration
| Component | Metric | Metric Explanation | Query | Query Explanation | Trigger Condition | Threshold/Value | Meaning | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| License - Expiration | `alluxio_license_expiration_date` | Shows the UNIX timestamp when the Alluxio license will expire | `(max by (cluster_name) (alluxio_license_expiration_date) - time()) / 86400` | Calculates the number of days remaining until license expiration by subtracting the current time from the license expiry timestamp | < 30 (Warning), < 7 (Critical) | 30 days, 7 days | License is about to expire; renew before it lapses | Needs to be created manually as a new panel |
License - Version Mismatch
| Component | Metric | Metric Explanation | Query | Query Explanation | Trigger Condition | Threshold/Value | Meaning | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| License - Version Mismatch | `alluxio_version_info` | Shows the version of each running Alluxio component (via the version label) | `count(count by (version) (alluxio_version_info)) > 1` | Checks if more than one unique Alluxio version is running across components | > 1 | More than 1 version | Version mismatch between Alluxio components | Needs to be created manually as a new panel |
Advanced: Querying Metrics Directly
For advanced analysis or debugging, you can query the Prometheus and component endpoints directly.
Querying with Promtool
You can execute queries directly against the Prometheus server running in your cluster.
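For example, you can run promtool inside the Prometheus pod and issue an instant query. The namespace and pod label below follow the Grafana examples above and are assumptions; adjust them to your deployment (the official Prometheus image ships promtool):

```shell
# Locate the Prometheus pod (label/namespace assumed; adjust as needed)
PROM_POD=$(kubectl -n alx-ns get pod -l app.kubernetes.io/component=prometheus -o jsonpath="{.items[0].metadata.name}")

# Count live worker targets, as in the "Process Availability - Worker count" rule
kubectl -n alx-ns exec -it "$PROM_POD" -- promtool query instant http://localhost:9090 'sum(up{job="worker"})'
```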
Querying Component Endpoints
Alluxio components (coordinator, workers, FUSE) expose a /metrics/ endpoint for scraping.
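For example, to inspect a worker's raw metrics you can port-forward to its web port and fetch the endpoint with curl. The worker label below follows the conventions used elsewhere on this page and is an assumption; the port matches the default worker web port noted in the Datadog section:

```shell
# In one terminal: forward the worker web port (30000 by default)
kubectl -n alx-ns port-forward $(kubectl -n alx-ns get pod -l app.kubernetes.io/component=worker -o jsonpath="{.items[0].metadata.name}") 30000:30000

# In another terminal: fetch the raw Prometheus metrics
curl -s http://localhost:30000/metrics/ | grep alluxio_data_throughput_bytes_total
```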
Refer to the Metrics Reference for a complete list of available metrics.
Integrating with an Existing Monitoring System
If you are not using the Alluxio Operator or have an existing monitoring infrastructure, you can integrate Alluxio with it manually.
Integrating with Prometheus
Add the following scrape jobs to your prometheus.yml to collect metrics from Alluxio.
Standalone Prometheus
For a standalone Prometheus instance, use static_configs:
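A minimal sketch, assuming the default web ports listed in the Datadog section below (19999 for the coordinator, 30000 for workers) and placeholder hostnames; the job names should match the job labels used by the dashboard queries:

```yaml
scrape_configs:
  - job_name: "coordinator"
    metrics_path: /metrics/
    static_configs:
      - targets: ["<coordinator-host>:19999"]   # replace with your coordinator address
  - job_name: "worker"
    metrics_path: /metrics/
    static_configs:
      - targets:                                # replace with your worker addresses
          - "<worker-1-host>:30000"
          - "<worker-2-host>:30000"
```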
Prometheus in Kubernetes
For Prometheus running in Kubernetes, use kubernetes_sd_configs to automatically discover Alluxio pods. Ensure your Alluxio pods have the required labels and annotations.
Your Alluxio pods must have the following metadata:
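The exact labels and annotations depend on how your cluster is deployed; the sketch below shows one common prometheus.io-style pattern and should be treated as an assumption rather than the operator's exact metadata:

```yaml
# Pod metadata (sketch): annotations a prometheus.io-style scrape config keys on
metadata:
  labels:
    app.kubernetes.io/component: worker   # identifies the Alluxio component
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "30000"           # 19999 for the coordinator
    prometheus.io/path: "/metrics/"
```

A matching scrape job, also a sketch, could then discover and relabel the pods like this:

```yaml
# prometheus.yml (sketch): discover Alluxio pods via the Kubernetes API
scrape_configs:
  - job_name: "alluxio"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods that opt in via the annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use the annotated metrics path (e.g. /metrics/)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: (.+)
        target_label: __metrics_path__
      # Use the annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # Map the component label to the job label so queries like up{job="worker"} keep working
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        action: replace
        target_label: job
```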
Integrating with Grafana
1. Add Prometheus as a data source: In Grafana, add your Prometheus server as a new data source.
2. Import the Alluxio dashboard: Download the official Alluxio dashboard template (alluxio-ai-dashboard-template.json) and import it into Grafana, following the Grafana import guide.
Integrating with Datadog
Datadog can ingest metrics directly from Alluxio's Prometheus endpoints.
Ensure your Datadog Agent can reach the Alluxio components' metrics ports (19999 for the coordinator, 30000 for workers). Then add the Alluxio endpoints to your Datadog Prometheus check configuration.
Example conf.d/prometheus.d/conf.yaml snippet:
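The snippet below is a minimal sketch using Datadog's Prometheus check; the hostnames are placeholders and the metric list is just a sample of metrics referenced on this page, so adjust both to your environment:

```yaml
init_config:

instances:
  # Coordinator metrics endpoint
  - prometheus_url: http://<coordinator-host>:19999/metrics/
    namespace: alluxio
    metrics:
      - jvm_memory_used_bytes
      - alluxio_license_expiration_date
  # Worker metrics endpoint
  - prometheus_url: http://<worker-host>:30000/metrics/
    namespace: alluxio
    metrics:
      - alluxio_data_throughput_bytes_total
      - alluxio_cached_storage_bytes
```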
This configuration allows Datadog to collect, monitor, and alert on your Alluxio cluster's metrics.