Monitoring Alluxio

Metrics provide invaluable insight into your Alluxio cluster's health and performance. Alluxio exposes metrics in the Prometheus exposition format, allowing for easy integration with modern monitoring stacks.

This guide covers how to monitor your Alluxio cluster, from using the pre-configured dashboards provided by the Alluxio Operator to setting up your own monitoring manually.

Default Monitoring with the Alluxio Operator

The easiest way to monitor Alluxio on Kubernetes is with the Alluxio Operator. By default, the operator deploys a complete monitoring stack alongside your Alluxio cluster, including Prometheus for metrics collection and Grafana for visualization.

Accessing the Grafana Dashboard

The Grafana dashboard is the primary tool for visualizing your cluster's metrics. You can access it in two ways:

1. Accessing via Port Forwarding

Use kubectl port-forward to securely access the Grafana UI from your local machine.

# Find the Grafana pod and forward port 3000
kubectl -n alx-ns port-forward $(kubectl -n alx-ns get pod -l app.kubernetes.io/component=grafana -o jsonpath="{.items[0].metadata.name}") 3000:3000

You can then open your browser and navigate to http://localhost:3000.

2. Accessing via Node Hostname

If your Kubernetes nodes are directly accessible on your network, you can reach Grafana via its NodePort.

# Get the hostname of the node where Grafana is running
kubectl -n alx-ns get pod $(kubectl -n alx-ns get pod -l app.kubernetes.io/component=grafana --no-headers -o custom-columns=:metadata.name) -o jsonpath='{.spec.nodeName}'

Assuming the hostname is foo.kubernetes.org, you can access the Grafana service at http://foo.kubernetes.org:8080/.
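If you are not sure which NodePort the Grafana service exposes, you can look it up directly. This is a minimal sketch; it assumes the Grafana Service carries the same app.kubernetes.io/component=grafana label used above, so adjust the selector if your deployment labels it differently.

# Look up the NodePort assigned to the Grafana service (label selector is an assumption)
kubectl -n alx-ns get svc -l app.kubernetes.io/component=grafana -o jsonpath='{.items[0].spec.ports[0].nodePort}'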

Understanding the Dashboard

The default dashboard provides a comprehensive overview of your cluster's state.

  • The Cluster section gives a high-level summary of the cluster status.

  • The Process section details resource consumption (CPU, memory) and JVM metrics for each Alluxio component.

  • Other sections provide detailed metrics for specific components like the coordinator and workers.

Disabling the Default Grafana

If you wish to use your own Grafana instance, you can disable the default one by setting spec.grafana.enabled to false in your AlluxioCluster definition. Prometheus is a core component and cannot be disabled.
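For reference, a minimal sketch of the relevant part of the AlluxioCluster resource is shown below. Only spec.grafana.enabled comes from this guide; the apiVersion, name, and namespace are assumptions to be replaced with the values from your own deployment.

# Sketch: disable the bundled Grafana in the AlluxioCluster definition
apiVersion: k8s-operator.alluxio.com/v1   # assumption; use the apiVersion from your operator's CRD
kind: AlluxioCluster
metadata:
  name: alluxio                           # placeholder name
  namespace: alx-ns                       # placeholder namespace
spec:
  grafana:
    enabled: false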

Setting Up Metric Alert Rules

The tables below list recommended alert rules. Each entry gives the underlying metric, the PromQL query to evaluate, the trigger condition and threshold, and what a firing alert means.

Process Availability - ETCD

Component: Process Availability - ETCD
Metric: etcd_server_has_leader
Metric Explanation: Shows if each etcd member currently has a leader
Query: sum(etcd_server_has_leader{job="etcd"})
Query Explanation: Sums all members that currently have a leader
Trigger Condition: value < 3
Threshold/Value: 3 members expected
Meaning: One or more etcd pods are down or quorum is lost
Note: None

Component: Process Availability - ETCD
Metric: etcd_server_leader_changes_seen_total
Metric Explanation: Counts how many times the leader has changed
Query: changes(etcd_server_leader_changes_seen_total{job="etcd"}[5m])
Query Explanation: Calculates the number of leader changes (elections) that occurred within the last 5 minutes
Trigger Condition: > 0 for 5+ min
Threshold/Value: Any change > 0
Meaning: Leader flapping; indicates etcd instability or network issues
Note: The query needs to be modified on the dashboard from 1d to 5m
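As an illustration, the two etcd alerts above could be expressed as Prometheus alerting rules roughly as follows. This is a sketch only: it assumes your Prometheus loads rule files, keeps the job="etcd" label from the queries above, and uses placeholder group, alert, and severity names.

# Sketch of a Prometheus rule file implementing the etcd availability alerts above
groups:
  - name: alluxio-etcd-availability       # placeholder group name
    rules:
      - alert: EtcdMemberWithoutLeader
        expr: sum(etcd_server_has_leader{job="etcd"}) < 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "One or more etcd pods are down or quorum is lost"
      - alert: EtcdLeaderFlapping
        expr: changes(etcd_server_leader_changes_seen_total{job="etcd"}[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Frequent etcd leader changes; possible instability or network issues"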

Process Availability - Worker count

Component: Process Availability - Worker count
Metric: up{job="worker"}
Metric Explanation: Shows how many workers are alive (responding to Prometheus scrapes)
Query: sum(up{job="worker"})
Query Explanation: Counts the number of live worker targets
Trigger Condition: value < desired worker count
Threshold/Value: < desired worker count
Meaning: One or more workers are down or not responding
Note: Set desired worker count to match production cluster size

Process Resource

Component: Process Resource
Metric: jvm_memory_used_bytes
Metric Explanation: Shows current JVM heap usage as % of max
Query: jvm_memory_used_bytes{area="heap"}/jvm_memory_max_bytes{area="heap"}
Query Explanation: Calculates current heap usage as a percentage of the maximum heap
Trigger Condition: > 0.75 for 5+ min
Threshold/Value: 75–80%
Meaning: Component is using a high percentage of its heap memory, indicating potential memory pressure or impending GC thrash
Note: Applies to all components (coordinator, workers, fuse, etc.)

Component: Process Resource
Metric: jvm_gc_collection_seconds_sum
Metric Explanation: Time spent in old GC collections
Query: rate(jvm_gc_collection_seconds_sum{gc="G1 Old Generation"}[5m])
Query Explanation: Calculates time spent in old/full GC over 5 minutes
Trigger Condition: > 5s/min for 5+ min
Threshold/Value: > 0.083
Meaning: JVM is doing frequent full GCs, creating a risk of major pauses
Note: Combine with old GC count to confirm

Component: Process Resource
Metric: jvm_gc_collection_seconds_count
Metric Explanation: Frequency of old GC collections
Query: rate(jvm_gc_collection_seconds_count{gc="G1 Old Generation"}[5m])
Query Explanation: Calculates the number of old/full GCs per minute
Trigger Condition: > 1/min for 5+ min
Threshold/Value: > 1
Meaning: JVM is doing many full GCs, likely due to memory pressure
Note: Early memory pressure warning

Component: Process Resource
Metric: jvm_gc_collection_seconds_sum
Metric Explanation: Time spent in young GC collections
Query: rate(jvm_gc_collection_seconds_sum{gc="G1 Young Generation"}[5m])
Query Explanation: Calculates time spent in young GC over 5 minutes
Trigger Condition: > 10s/min for 5+ min
Threshold/Value: > 0.166
Meaning: High GC overhead slowing throughput
Note: Only alert if persistent

Component: Process Resource
Metric: process_cpu_seconds_total
Metric Explanation: Measures total user + system CPU time consumed by the process
Query: irate(process_cpu_seconds_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])
Query Explanation: Calculates the per-second CPU usage rate over 5 minutes
Trigger Condition: Stays consistently high for 5+ min
Threshold/Value: > 80% of 1 CPU core (≈ 0.8)
Meaning: Process is CPU bound or stuck consuming full CPU
Note: Tune threshold based on node vCPU cores; alert if usage is flat and near saturation

Cache - Cache Hit Rate

Component: Cache - Cache Hit Rate
Metric: alluxio_cached_data_read_bytes_total & alluxio_missed_data_read_bytes_total
Metric Explanation: Measures how much read data was served from cache vs fetched from UFS
Query: sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) / (sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) + sum(irate(alluxio_missed_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])))
Query Explanation: Calculates cache hit ratio over 5 minutes
Trigger Condition: Cache hit % stays low for 5+ min
Threshold/Value: < 80%
Meaning: High UFS reads, cache not being utilized effectively
Note: Adjust threshold based on workload (e.g. 70–90%)
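A corresponding alerting rule might look like the sketch below. The Grafana dashboard variable ($cluster) is dropped because alerting rules cannot use dashboard variables; add a concrete cluster_name matcher if you monitor several clusters.

# Sketch of an alerting rule for a sustained low cache hit ratio (80% threshold from the table above)
groups:
  - name: alluxio-cache                   # placeholder group name
    rules:
      - alert: AlluxioLowCacheHitRatio
        expr: |
          sum(irate(alluxio_cached_data_read_bytes_total{job="worker"}[5m]))
            /
          (sum(irate(alluxio_cached_data_read_bytes_total{job="worker"}[5m]))
            + sum(irate(alluxio_missed_data_read_bytes_total{job="worker"}[5m]))) < 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit ratio below 80%; reads are falling through to the UFS"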

Cache - Utilization

Component: Cache - Utilization
Metric: alluxio_cached_storage_bytes & alluxio_cached_capacity_bytes
Metric Explanation: Shows how much of the configured cache capacity is currently used
Query: sum(alluxio_cached_storage_bytes{job="worker",cluster_name=~"$cluster"}) / sum(alluxio_cached_capacity_bytes{job="worker",cluster_name=~"$cluster"})
Query Explanation: Calculates the current used/total cache ratio
Trigger Condition: > 0.85 (warning), > 0.95 (critical) for 5+ min
Threshold/Value: 85–95% utilization
Meaning: Cache is nearly full, risk of eviction thrash or write failures
Note: Adjust thresholds based on cluster size and workload pattern

Cache - Cache Eviction - Correlation

Component: Cache - Cache Eviction - Correlation
Metric: alluxio_cached_evicted_data_bytes_total + alluxio_block_store_used_bytes
Metric Explanation: Tracks evicted bytes and current cache usage to detect cache pressure
Query: (sum(irate(alluxio_cached_evicted_data_bytes_total{job="worker"}[5m])) > 0) and ((sum(alluxio_block_store_used_bytes{job="worker"}) / sum(alluxio_block_store_capacity_bytes{job="worker"})) > 0.8)
Query Explanation: Checks if evictions are occurring while cache usage is above 80%
Trigger Condition: Evictions > 0 while usage > 80% for 5+ minutes
Threshold/Value: Usage > 80% and evictions > 0
Meaning: Indicates cache thrashing or pressure (evictions happening despite high cache utilization)
Note: Needs to be created manually as a new panel

Fuse - UFS Fallback

Component: Fuse - UFS Fallback
Metric: alluxio_ufs_data_access_bytes_total
Metric Explanation: Tracks read traffic from Fuse pods going directly to the UFS (bypassing the Alluxio cache)
Query: irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="read",cluster_name=~"$cluster"}[5m])
Query Explanation: Calculates Fuse-driven UFS read throughput over 5 minutes
Trigger Condition: Sustained increase in Fuse UFS read traffic
Threshold/Value: > 10 MiB/s sustained for > 5 min
Meaning: Fuse clients are bypassing the Alluxio cache (high fallback)
Note: Correlate with cache hit % and request rate; fallback above 10–20 MiB/s is usually worth investigating

Read Throughput

Component: Read Throughput
Metric: alluxio_data_throughput_bytes_total
Metric Explanation: Measures read throughput served by workers
Query: sum(irate(alluxio_data_throughput_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m]))
Query Explanation: Calculates worker read throughput over 5 minutes
Trigger Condition: Worker read throughput drops
Threshold/Value: < set baseline (e.g. < 10 MiB/s) while UFS reads go up
Meaning: Cache not serving data, workload hitting UFS
Note: Tune threshold based on normal workload pattern

Data

Component: Data
Metric: alluxio_data_access_bytes_count{method="read"}
Metric Explanation: Counts the number of read operations (requests) served by workers
Query: irate(alluxio_data_access_bytes_count{method="read",job="worker"}[5m])
Query Explanation: Calculates the read request rate (req/s) over 5 minutes
Trigger Condition: Alert when the rate drops to 0 while workload is expected
Threshold/Value: Near 0 for > 5 min
Meaning: Worker not serving data; possible worker crash or cache unavailable
Note: Correlate with workload schedule to avoid false positives

License - Expiration

Component: License - Expiration
Metric: alluxio_license_expiration_date
Metric Explanation: Shows the UNIX timestamp when the Alluxio license will expire
Query: (max by (cluster_name) (alluxio_license_expiration_date) - time()) / 86400
Query Explanation: Calculates the number of days remaining until license expiration by subtracting the current time from the license expiry timestamp
Trigger Condition: < 30 (warning), < 7 (critical)
Threshold/Value: 30 days, 7 days
Meaning: License is about to expire; renew before it lapses
Note: Needs to be created manually as a new panel
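For example, the days-remaining expression above could drive a pair of alerts along the following lines (a sketch; the 30-day and 7-day thresholds and severities follow the table above).

# Sketch of license expiration alerts based on the days-remaining query above
groups:
  - name: alluxio-license                 # placeholder group name
    rules:
      - alert: AlluxioLicenseExpiringSoon
        expr: (max by (cluster_name) (alluxio_license_expiration_date) - time()) / 86400 < 30
        labels:
          severity: warning
        annotations:
          summary: "Alluxio license expires in less than 30 days"
      - alert: AlluxioLicenseExpiringImminently
        expr: (max by (cluster_name) (alluxio_license_expiration_date) - time()) / 86400 < 7
        labels:
          severity: critical
        annotations:
          summary: "Alluxio license expires in less than 7 days"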

License - Version Mismatch

Component: License - Version Mismatch
Metric: alluxio_version_info
Metric Explanation: Shows the version of each running Alluxio component (via the version label)
Query: count(count by (version) (alluxio_version_info)) > 1
Query Explanation: Checks if more than one unique Alluxio version is running across components
Trigger Condition: > 1
Threshold/Value: More than 1 version
Meaning: Version mismatch between Alluxio components
Note: Needs to be created manually as a new panel

Advanced: Querying Metrics Directly

For advanced analysis or debugging, you can query the Prometheus and component endpoints directly.

Querying with Promtool

You can execute queries directly against the Prometheus server running in your cluster.
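A minimal sketch of one way to do this with kubectl exec and promtool follows; the Prometheus pod label selector and the example query are assumptions, so substitute your own selector and PromQL expression.

# Run an instant PromQL query against the in-cluster Prometheus with promtool
# (pod label selector and example query are assumptions)
PROM_POD=$(kubectl -n alx-ns get pod -l app.kubernetes.io/component=prometheus -o jsonpath="{.items[0].metadata.name}")
kubectl -n alx-ns exec "$PROM_POD" -- promtool query instant http://localhost:9090 'sum(up{job="worker"})'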

Querying Component Endpoints

Alluxio components (coordinator, workers, FUSE) expose a /metrics/ endpoint for scraping.
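For example, you could port-forward to a worker pod and fetch its metrics with curl. This sketch assumes the worker label selector shown and the default metrics ports mentioned later in this guide (19999 for the coordinator, 30000 for workers).

# Fetch raw Prometheus-format metrics from a worker pod (label selector is an assumption)
WORKER_POD=$(kubectl -n alx-ns get pod -l app.kubernetes.io/component=worker -o jsonpath="{.items[0].metadata.name}")
kubectl -n alx-ns port-forward "$WORKER_POD" 30000:30000 &
curl -s http://localhost:30000/metrics/ | head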

Refer to the Metrics Reference for a complete list of available metrics.

Integrating with an Existing Monitoring System

If you are not using the Alluxio Operator or have an existing monitoring infrastructure, you can integrate Alluxio with it manually.

Integrating with Prometheus

Add the following scrape jobs to your prometheus.yml to collect metrics from Alluxio.

Standalone Prometheus

For a standalone Prometheus instance, use static_configs:
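The exact job layout depends on your deployment, but a sketch along the following lines works, assuming the default metrics ports noted in the Datadog section below (19999 for the coordinator, 30000 for workers), the /metrics/ path, and placeholder hostnames.

# Sketch of prometheus.yml scrape jobs for an Alluxio cluster (hostnames are placeholders)
scrape_configs:
  - job_name: "coordinator"
    metrics_path: /metrics/
    static_configs:
      - targets: ["coordinator-host:19999"]
  - job_name: "worker"
    metrics_path: /metrics/
    static_configs:
      - targets: ["worker-host-1:30000", "worker-host-2:30000"]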

Prometheus in Kubernetes

For Prometheus running in Kubernetes, use kubernetes_sd_configs to automatically discover Alluxio pods. Ensure your Alluxio pods have the required labels and annotations.

Your Alluxio pods must have the following metadata:
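The required labels and annotations come from how the cluster was deployed. As one common convention (an assumption here, not a requirement stated by this guide), pods are annotated for Prometheus discovery and the scrape job filters on those annotations:

# Sketch: prometheus.io-style pod annotations and a matching kubernetes_sd_configs job
# (annotation names follow the widespread prometheus.io convention and are an assumption)
#
# Pod metadata:
#   annotations:
#     prometheus.io/scrape: "true"
#     prometheus.io/path: "/metrics/"
#     prometheus.io/port: "30000"         # 19999 for the coordinator
#
# prometheus.yml:
scrape_configs:
  - job_name: "alluxio-pods"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ["alx-ns"]
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: '(.+)'
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__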

Integrating with Grafana

  1. Add Prometheus as a Data Source: In Grafana, add your Prometheus server as a new data source.

  2. Import the Alluxio Dashboard: Download the official Alluxio dashboard template and import it into Grafana.

Integrating with Datadog

Datadog can ingest metrics directly from Alluxio's Prometheus endpoints.

  1. Ensure your Datadog agent can reach the Alluxio component's metrics port (19999 for coordinator, 30000 for workers).

  2. In your Datadog configuration, add the Alluxio endpoints to your prometheus.yml check configuration.

Example conf.d/prometheus.d/conf.yaml snippet:
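A minimal sketch using Datadog's Prometheus check with placeholder hostnames; adjust the URLs, namespace, and metric filters for your environment.

# Sketch of a Datadog Prometheus check configuration for Alluxio endpoints
init_config:

instances:
  - prometheus_url: http://coordinator-host:19999/metrics/   # placeholder hostname
    namespace: "alluxio"
    metrics:
      - "alluxio_*"
  - prometheus_url: http://worker-host:30000/metrics/        # placeholder hostname
    namespace: "alluxio"
    metrics:
      - "alluxio_*"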

This configuration allows Datadog to collect, monitor, and alert on your Alluxio cluster's metrics.
