Monitoring

Alluxio exposes metrics in the Prometheus exposition format, enabling integration with standard monitoring stacks. This guide covers Prometheus setup, Grafana dashboard import, alert rules, and direct metric queries for both Kubernetes (Operator) and Docker/Bare-Metal deployments.

Prometheus Setup

The Alluxio Operator deploys a Prometheus instance alongside your cluster automatically. No manual configuration is required.

Verify Prometheus is running:

kubectl -n alx-ns get pod -l app.kubernetes.io/component=prometheus
NAME                                          READY   STATUS    RESTARTS   AGE
alluxio-cluster-prometheus-6f697b6db8-sbvvg   1/1     Running   0          2m
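
For Docker or bare-metal deployments there is no Operator to provision Prometheus, so scrape targets are usually listed by hand. The following is a minimal sketch rather than an official configuration. The hostnames are hypothetical placeholders, the ports are the Alluxio metrics ports listed in the Datadog section of this guide (19999 coordinator, 30000 workers, 49999 FUSE), and /metrics is assumed as the metrics path.

```yaml
# prometheus.yml (sketch): static targets for a non-Kubernetes deployment.
# All hostnames below are hypothetical; replace them with your own.
global:
  scrape_interval: 30s              # assumed interval; tune as needed
scrape_configs:
  - job_name: 'coordinator'         # assumed job name
    metrics_path: /metrics          # assumption: Prometheus default path
    static_configs:
      - targets: ['coordinator-host:19999']
  - job_name: 'worker'              # matches job="worker" in the alert queries
    static_configs:
      - targets: ['worker-host-1:30000', 'worker-host-2:30000']
  - job_name: 'fuse'                # matches job="fuse" in the alert queries
    static_configs:
      - targets: ['fuse-host-1:49999']
```

Keeping the job names worker and fuse lets the alert queries later in this guide work without changing their label matchers.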

Kubernetes: Bring Your Own Prometheus

If your cluster already has a Prometheus instance, you can disable the Operator-managed one and use Kubernetes service discovery instead.

Disable the Operator-managed Prometheus:

apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  prometheus:
    enabled: false

Add the following scrape config to your existing prometheus.yml to automatically discover Alluxio pods by annotation:

scrape_configs:
  - job_name: 'alluxio-components'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods with prometheus.io/scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Keep only Alluxio components
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: alluxio
      # Use the annotated metrics path, default to /metrics
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use the annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Set job label from the component name
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        action: replace
        target_label: job

Your Alluxio pods must carry the following labels and annotations for discovery to work:
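
The required metadata can be read off the relabel rules in the scrape config above. The sketch below shows a pod metadata block that would satisfy them. The component name, port, and path values are assumptions chosen for illustration (a worker on port 30000 serving /metrics), not values confirmed by the Operator.

```yaml
# Pod metadata (sketch) matching the relabel rules in the scrape config.
metadata:
  labels:
    app.kubernetes.io/name: alluxio        # required by the "keep" rule on the name label
    app.kubernetes.io/component: worker    # copied into the Prometheus "job" label
  annotations:
    prometheus.io/scrape: "true"           # required by the "keep" rule on the scrape annotation
    prometheus.io/port: "30000"            # assumed metrics port; rewritten into __address__
    prometheus.io/path: "/metrics"         # assumed path; copied into __metrics_path__
```

Annotation values are quoted because Kubernetes requires them to be strings.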

Grafana Setup

The Operator deploys Grafana automatically alongside your cluster.

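To reach it from your workstation, forward the Grafana port to local port 3000 (for example with kubectl port-forward), then open http://localhost:3000 in your browser.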

Access via Node Hostname

If Kubernetes nodes are directly accessible on your network, look up the node where the Grafana pod is scheduled (for example, the NODE column of kubectl -n alx-ns get pod -o wide), then access Grafana at http://<node-hostname>:8080/.

Disabling the Default Grafana

To use your own Grafana instance, disable the Operator-managed one:
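
A plausible shape for this setting is sketched below. It assumes the AlluxioCluster spec has a grafana block that mirrors the prometheus block shown earlier in this guide. That field name is an assumption, so check it against the Operator's CRD before applying.

```yaml
# Assumption: spec.grafana.enabled mirrors the documented spec.prometheus.enabled.
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  grafana:
    enabled: false
```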

Prometheus is a core component of the Operator deployment and cannot be disabled independently.

Dashboard Import

Download the official Alluxio dashboard template, save it as /tmp/alluxio-dashboard.json, and import it into Grafana.

In Grafana: Dashboards → Import → Upload JSON file → select /tmp/alluxio-dashboard.json → select Prometheus as the data source → click Import. For detailed import options, see the Grafana import guide.

Understanding the Dashboard

  • The Cluster section gives a high-level summary of the cluster status.

  • The Process section shows resource consumption (CPU, memory) and JVM metrics for each component.

  • Additional sections provide detailed metrics for the coordinator, workers, and cache.

Alert Rules

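The queries below can be used to build Prometheus alert rules or Grafana alert panels. Thresholds are recommended starting points; tune them to your workload and cluster size. Queries that contain $cluster, $service, or $instance rely on Grafana dashboard variables, so substitute concrete label values (or drop the matcher) before using them in a Prometheus rule file.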

Process Availability — ETCD

| Field | Value |
| --- | --- |
| Component | Process Availability - ETCD |
| Metric | etcd_server_has_leader |
| Metric Explanation | Shows if each etcd member currently has a leader |
| Query | sum(etcd_server_has_leader{job="etcd"}) |
| Query Explanation | Sums all members that currently have a leader |
| Trigger Condition | value < 3 |
| Threshold/Value | 3 members expected |
| Meaning | One or more etcd pods are down or quorum is lost |
| Note | |

| Field | Value |
| --- | --- |
| Component | Process Availability - ETCD |
| Metric | etcd_server_leader_changes_seen_total |
| Metric Explanation | Counts how many times the leader has changed |
| Query | changes(etcd_server_leader_changes_seen_total{job="etcd"}[5m]) |
| Query Explanation | Calculates the number of leader changes (elections) that occurred within the last 5 minutes |
| Trigger Condition | > 0 for 5+ min |
| Threshold/Value | Any change > 0 |
| Meaning | Leader flapping; indicates etcd instability or network issues |
| Note | On the dashboard, the query range needs to be changed from 1d to 5m |

Process Availability — Worker Count

| Field | Value |
| --- | --- |
| Component | Process Availability - Worker count |
| Metric | up{job="worker"} |
| Metric Explanation | Shows how many workers are alive (responding to scrapes) |
| Query | sum(up{job="worker"}) |
| Query Explanation | Counts the number of live worker targets |
| Trigger Condition | value < desired worker count |
| Threshold/Value | < desired worker count |
| Meaning | One or more workers are down or not responding |
| Note | Set desired worker count to match production cluster size |
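
As an illustration of how a table like this maps onto a Prometheus rule file, here is a minimal sketch. The group and alert names, the desired worker count of 3, the 5-minute hold time, and the severity labels are all hypothetical. The second rule covers a case the sum() query cannot detect: when no worker targets exist at all, sum() returns nothing and the first rule never fires.

```yaml
groups:
  - name: alluxio-availability          # hypothetical group name
    rules:
      - alert: AlluxioWorkerCountLow
        # 3 is a hypothetical desired worker count; set it to your cluster size.
        expr: sum(up{job="worker"}) < 3
        for: 5m                         # assumed hold time before firing
        labels:
          severity: critical
        annotations:
          summary: "Only {{ $value }} Alluxio workers are responding to scrapes"
      - alert: AlluxioWorkerTargetsMissing
        # Fires when no worker series exist at all, which sum() above cannot see.
        expr: absent(up{job="worker"})
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No Alluxio worker targets are known to Prometheus"
```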

Process Resource

| Field | Value |
| --- | --- |
| Component | Process Resource |
| Metric | jvm_memory_used_bytes |
| Metric Explanation | Reports current JVM memory usage in bytes; divided by the maximum, it gives heap usage as a fraction of max |
| Query | jvm_memory_used_bytes{area="heap"}/jvm_memory_max_bytes{area="heap"} |
| Query Explanation | Calculates current heap usage as a fraction (0–1) of the maximum heap |
| Trigger Condition | > 0.75 for 5+ min |
| Threshold/Value | 0.75–0.80 (75–80%) |
| Meaning | Component is using a high percentage of its heap memory, indicating potential memory pressure or impending GC thrash |
| Note | Applies to all components (coordinator, workers, fuse, etc.) |
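
Because this query returns one series per process, an alert built on it can name the affected instance in its annotation. The sketch below assumes the instance and job labels are present on the series (Prometheus attaches both to scraped targets by default). The group name, alert name, and severity are hypothetical.

```yaml
groups:
  - name: alluxio-process-resource      # hypothetical group name
    rules:
      - alert: AlluxioHeapUsageHigh
        # Ratio of used to max heap per process; 0.75 is the starting threshold from the table.
        expr: |
          jvm_memory_used_bytes{area="heap"}
            / jvm_memory_max_bytes{area="heap"} > 0.75
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Heap usage on {{ $labels.instance }} ({{ $labels.job }}) is {{ $value | humanizePercentage }}"
```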

| Field | Value |
| --- | --- |
| Component | Process Resource |
| Metric | jvm_gc_collection_seconds_sum |
| Metric Explanation | Time spent in old GC collections |
| Query | rate(jvm_gc_collection_seconds_sum{gc="G1 Old Generation"}[5m]) |
| Query Explanation | Calculates the seconds spent in old/full GC per second, averaged over 5 minutes |
| Trigger Condition | > 5s/min for 5+ min |
| Threshold/Value | > 0.083 (5 s/min ÷ 60) |
| Meaning | JVM doing frequent full GCs → major pause risk |
| Note | Combine with old GC count to confirm |

| Field | Value |
| --- | --- |
| Component | Process Resource |
| Metric | jvm_gc_collection_seconds_count |
| Metric Explanation | Frequency of old GC collections |
| Query | rate(jvm_gc_collection_seconds_count{gc="G1 Old Generation"}[5m]) |
| Query Explanation | Calculates the number of old/full GCs per second, averaged over 5 minutes (multiply by 60 for GCs per minute) |
| Trigger Condition | > 1/min for 5+ min |
| Threshold/Value | > 0.0167 (1/min ÷ 60) |
| Meaning | JVM doing many full GCs, likely due to memory pressure |
| Note | Early memory pressure warning |
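
Note that rate() always returns a per-second value, so a per-minute trigger needs a unit conversion. The sketch below multiplies by 60 so the comparison reads directly as "more than 1 old-generation collection per minute". The group name, alert name, and severity are hypothetical.

```yaml
groups:
  - name: alluxio-gc                    # hypothetical group name
    rules:
      - alert: AlluxioFrequentOldGC
        # rate() is per second; * 60 converts it to collections per minute.
        expr: |
          rate(jvm_gc_collection_seconds_count{gc="G1 Old Generation"}[5m]) * 60 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is running more than one old-generation GC per minute"
```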

| Field | Value |
| --- | --- |
| Component | Process Resource |
| Metric | jvm_gc_collection_seconds_sum |
| Metric Explanation | Time spent in young GC collections |
| Query | rate(jvm_gc_collection_seconds_sum{gc="G1 Young Generation"}[5m]) |
| Query Explanation | Calculates the seconds spent in young GC per second, averaged over 5 minutes |
| Trigger Condition | > 10s/min for 5+ min |
| Threshold/Value | > 0.167 (10 s/min ÷ 60) |
| Meaning | High GC overhead slowing throughput |
| Note | Only alert if persistent |

| Field | Value |
| --- | --- |
| Component | Process Resource |
| Metric | process_cpu_seconds_total |
| Metric Explanation | Measures total user + system CPU time consumed by the process |
| Query | irate(process_cpu_seconds_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) |
| Query Explanation | Calculates the per-second CPU usage rate from the two most recent samples within a 5-minute lookback window |
| Trigger Condition | stays consistently high for 5+ min |
| Threshold/Value | > 80% of 1 CPU core (≈ 0.8) |
| Meaning | Process is CPU bound or stuck consuming full CPU |
| Note | Tune threshold based on node vCPU cores; alert if usage is flat and near saturation |

Cache — Cache Hit Rate

| Field | Value |
| --- | --- |
| Component | Cache - Cache Hit Rate |
| Metric | alluxio_cached_data_read_bytes_total & alluxio_missed_data_read_bytes_total |
| Metric Explanation | Measures how much read data was served from cache vs fetched from UFS |
| Query | sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) / (sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) + sum(irate(alluxio_missed_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m]))) |
| Query Explanation | Calculates the cache hit ratio (0–1) over 5 minutes |
| Trigger Condition | cache hit ratio stays low for 5+ min |
| Threshold/Value | < 0.8 (80%) |
| Meaning | High UFS reads, cache not being utilized effectively |
| Note | Adjust threshold based on workload (e.g. 70–90%) |
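
For an alert rule, as opposed to a dashboard panel, the Prometheus documentation recommends rate() over irate(), because irate() looks only at the last two samples and can flap. The sketch below is a variant of the query above under that substitution. It also drops the cluster_name=~"$cluster" matcher, since $cluster is a Grafana variable that a rule file cannot resolve. The group name, alert name, and severity are hypothetical.

```yaml
groups:
  - name: alluxio-cache                 # hypothetical group name
    rules:
      - alert: AlluxioCacheHitRatioLow
        # hit ratio = cached bytes / (cached bytes + missed bytes), averaged over 5m
        expr: |
          sum(rate(alluxio_cached_data_read_bytes_total{job="worker"}[5m]))
            /
          (
            sum(rate(alluxio_cached_data_read_bytes_total{job="worker"}[5m]))
            + sum(rate(alluxio_missed_data_read_bytes_total{job="worker"}[5m]))
          ) < 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Alluxio cache hit ratio is {{ $value | humanizePercentage }}"
```

When there is no read traffic at all, the denominator is zero and the ratio is NaN. The comparison is then false, so the rule stays silent during idle periods.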

Cache — Utilization

| Field | Value |
| --- | --- |
| Component | Cache - Utilization |
| Metric | alluxio_cached_storage_bytes & alluxio_cached_capacity_bytes |
| Metric Explanation | Shows how much of the configured cache capacity is currently used |
| Query | sum(alluxio_cached_storage_bytes{job="worker",cluster_name=~"$cluster"}) / sum(alluxio_cached_capacity_bytes{job="worker",cluster_name=~"$cluster"}) |
| Query Explanation | Calculates current used/total cache ratio |
| Trigger Condition | > 0.85 (warning), > 0.95 (critical) for 5+ min |
| Threshold/Value | 85–95% utilization |
| Meaning | Cache is nearly full, risk of eviction thrash or write failures |
| Note | Adjust thresholds based on cluster size and workload pattern |

Cache — Eviction Correlation

| Field | Value |
| --- | --- |
| Component | Cache - Cache Eviction - Correlation |
| Metric | alluxio_cached_evicted_data_bytes_total, alluxio_block_store_used_bytes & alluxio_block_store_capacity_bytes |
| Metric Explanation | Tracks evicted bytes and current cache usage to detect cache pressure |
| Query | (sum(irate(alluxio_cached_evicted_data_bytes_total{job="worker"}[5m])) > 0) and ((sum(alluxio_block_store_used_bytes{job="worker"}) / sum(alluxio_block_store_capacity_bytes{job="worker"})) > 0.8) |
| Query Explanation | Checks if evictions are occurring while cache usage is above 80% |
| Trigger Condition | Evictions > 0 while usage > 80% for 5+ minutes |
| Threshold/Value | Usage > 80% and Evictions > 0 |
| Meaning | Indicates cache thrashing or pressure (evictions happening while cache utilization is high) |
| Note | Needs to be created manually as a new panel |

FUSE — UFS Fallback

| Field | Value |
| --- | --- |
| Component | Fuse - UFS Fallback |
| Metric | alluxio_ufs_data_access_bytes_total |
| Metric Explanation | Tracks read traffic from Fuse pods going directly to the UFS (bypassing Alluxio cache) |
| Query | irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="read",cluster_name=~"$cluster"}[5m]) |
| Query Explanation | Calculates Fuse-driven UFS read throughput over 5 minutes |
| Trigger Condition | sustained Fuse UFS read traffic increases |
| Threshold/Value | > 10 MiB/s sustained for > 5 min |
| Meaning | Fuse clients are bypassing Alluxio cache, high fallback |
| Note | Correlate with cache hit % and request rate; fallback > 10–20 MiB/s usually worth investigating |

Read Throughput

| Field | Value |
| --- | --- |
| Component | Read Throughput |
| Metric | alluxio_data_throughput_bytes_total |
| Metric Explanation | Measures read throughput served by workers |
| Query | sum(irate(alluxio_data_throughput_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) |
| Query Explanation | Calculates worker read throughput over 5m |
| Trigger Condition | worker read throughput drops |
| Threshold/Value | < set baseline (e.g. < 10 MiB/s) while UFS read goes up |
| Meaning | Cache not serving data, workload hitting UFS |
| Note | Tune threshold based on normal workload pattern |

Data — Read Request Rate

| Field | Value |
| --- | --- |
| Component | Data |
| Metric | alluxio_data_access_bytes_count{method="read"} |
| Metric Explanation | Counts the number of read operations (requests) served by workers |
| Query | irate(alluxio_data_access_bytes_count{method="read",job="worker"}[5m]) |
| Query Explanation | Calculates read request rate (req/s) over 5 minutes |
| Trigger Condition | Alert when rate drops to 0 while workload is expected |
| Threshold/Value | near 0 for > 5 min |
| Meaning | Worker not serving data; possible worker crash or cache unavailable |
| Note | Correlate with workload schedule to avoid false positives |

License — Expiration

| Field | Value |
| --- | --- |
| Component | License - Expiration |
| Metric | alluxio_license_expiration_date |
| Metric Explanation | Shows the UNIX timestamp when the Alluxio license will expire |
| Query | (max by (cluster_name) (alluxio_license_expiration_date) - time()) / 86400 |
| Query Explanation | Calculates the number of days remaining until license expiration by subtracting current time from the license expiry timestamp |
| Trigger Condition | < 30 (Warning), < 7 (Critical) |
| Threshold/Value | 30 days, 7 days |
| Meaning | License is about to expire; renew before it lapses |
| Note | Needs to be created manually as a new panel |
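
The two-tier trigger can be expressed as a pair of rules that share the same expression and differ only in threshold and severity. This is a minimal sketch. The group name, alert names, severity labels, and the 1-hour hold time are hypothetical.

```yaml
groups:
  - name: alluxio-license               # hypothetical group name
    rules:
      - alert: AlluxioLicenseExpiringSoon
        # Days remaining = (expiry timestamp - now) / 86400 seconds per day
        expr: |
          (max by (cluster_name) (alluxio_license_expiration_date) - time()) / 86400 < 30
        for: 1h                         # assumed; expiry changes slowly, so a long hold is safe
        labels:
          severity: warning
        annotations:
          summary: 'License for {{ $labels.cluster_name }} expires in {{ $value | printf "%.0f" }} days'
      - alert: AlluxioLicenseExpiryCritical
        expr: |
          (max by (cluster_name) (alluxio_license_expiration_date) - time()) / 86400 < 7
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: 'License for {{ $labels.cluster_name }} expires in {{ $value | printf "%.0f" }} days'
```

Both rules fire once fewer than 7 days remain, so an Alertmanager inhibition rule keyed on severity can be used to suppress the warning in that case.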

License — Version Mismatch

| Field | Value |
| --- | --- |
| Component | License - Version Mismatch |
| Metric | alluxio_version_info |
| Metric Explanation | Shows the version of each running Alluxio component (via version label) |
| Query | count(count by (version) (alluxio_version_info)) > 1 |
| Query Explanation | Checks if more than one unique Alluxio version is running across components |
| Trigger Condition | > 1 |
| Threshold/Value | More than 1 version |
| Meaning | Version mismatch between Alluxio components |
| Note | Needs to be created manually as a new panel |

Querying Metrics Directly

For advanced analysis or debugging, query Prometheus or component endpoints directly.

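To query Prometheus, open a shell in the Prometheus pod (for example with kubectl exec), then run instant queries against the local server with promtool query instant.

To query a component endpoint directly, send an HTTP request from within a pod to that component's metrics port (19999 for the coordinator, 30000 for workers, 49999 for FUSE).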

Refer to the Metrics Reference for a complete list of available metrics and their descriptions.

Datadog Integration

Datadog can ingest metrics directly from Alluxio's Prometheus endpoints.

  1. Ensure your Datadog agent can reach the Alluxio metrics ports: 19999 (coordinator), 30000 (workers), 49999 (FUSE).

  2. Add the following to your conf.d/prometheus.d/conf.yaml:
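
A minimal sketch of such a file is shown below, using the prometheus_url, namespace, and metrics keys of Datadog's Prometheus check. The hostnames in angle brackets are placeholders, the namespace values are hypothetical names chosen for illustration, and /metrics is assumed as the metrics path.

```yaml
# conf.d/prometheus.d/conf.yaml (sketch)
init_config:

instances:
  - prometheus_url: http://<coordinator-host>:19999/metrics   # assumed path
    namespace: alluxio_coordinator                            # hypothetical metric prefix
    metrics:
      - "*"                                                   # forward every metric
  - prometheus_url: http://<worker-host>:30000/metrics
    namespace: alluxio_worker
    metrics:
      - "*"
  - prometheus_url: http://<fuse-host>:49999/metrics
    namespace: alluxio_fuse
    metrics:
      - "*"
```

The wildcard forwards everything each endpoint exposes. Listing specific metric names instead reduces the volume ingested by Datadog.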

This configuration instructs the Datadog agent to scrape and forward all Alluxio metrics.
