Grafana Dashboard

Overview

This document provides detailed descriptions of all panels in the Alluxio Grafana dashboard, including metric names, calculation methods, and threshold configurations.


Cluster Row

Panel Name
Description
Calculation Method
Unit
Threshold

Storage

Displays the total capacity, used storage, and usage percentage of the Alluxio cache. This panel is critical for monitoring overall cache utilization and preventing out-of-storage issues.

sum(alluxio_cached_capacity_bytes{job="worker",cluster_name=~"$cluster"}), sum(alluxio_cached_storage_bytes{job="worker",cluster_name=~"$cluster"}), sum(alluxio_cached_storage_bytes{job="worker",cluster_name=~"$cluster"}) / sum(alluxio_cached_capacity_bytes{job="worker",cluster_name=~"$cluster"})

bytes

yellow: > 0.9, red: > 0.95
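
As a rough illustration of how the usage ratio relates to its thresholds, the sketch below computes used / capacity from per-worker values and maps the result to a color. The function names, the sample numbers, and the "green" base color are hypothetical; only the 0.9 and 0.95 thresholds come from the panel definition.

```python
def usage_ratio(used_bytes, capacity_bytes):
    """Cluster-wide usage: sum of used bytes over sum of capacity bytes."""
    total_capacity = sum(capacity_bytes)
    if total_capacity == 0:
        return 0.0  # assumption: treat a cluster with no capacity as 0% used
    return sum(used_bytes) / total_capacity


def threshold_color(ratio, yellow=0.9, red=0.95):
    """Map a usage ratio to a color using the panel's thresholds."""
    if ratio > red:
        return "red"
    if ratio > yellow:
        return "yellow"
    return "green"  # assumption: base color when no threshold is crossed


# Two hypothetical workers with 50 bytes of capacity each.
ratio = usage_ratio([50, 42], [50, 50])
print(ratio, threshold_color(ratio))
```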

Read - Throughput

Shows the read throughput from different sources: Fuse, S3 API, and direct Worker access. This helps identify the primary data access interfaces for read operations.

sum(irate(alluxio_data_access_bytes_sum{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_throughput_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), avg(avg_over_time(alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-reader"}[5m])) or on() vector(0)

binBps

yellow: > 100, red: > 2000
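
The throughput expressions above rely on irate(), which in Prometheus derives a per-second rate from the last two samples inside the range window. The following is a minimal sketch of that idea, not the Prometheus implementation; the function name and sample data are hypothetical.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, value) samples.

    samples: list of (timestamp, counter_value) pairs ordered by timestamp.
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    interval = t_last - t_prev
    if interval <= 0:
        return None
    # A counter that went down is assumed to have reset, so the last value
    # itself is taken as the increase since the reset.
    increase = v_last if v_last < v_prev else v_last - v_prev
    return increase / interval


# Hypothetical byte counter scraped every 15 seconds.
print(irate([(0, 100), (15, 250), (30, 400)]))
```

Because only the last two samples are used, the result reacts quickly to bursts but is noisier than an average over the whole window.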

Read - Load (5m)

Displays the average read load (thread utilization) across all workers over the last 5 minutes. A high average load may indicate a system-wide performance bottleneck.

avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]))

percentunit

yellow: > 0.8
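
To make the load calculation concrete, here is a small sketch that divides active threads by maximum threads per worker and then averages the ratios, mirroring the structure of the expression above. The function name and inputs are hypothetical.

```python
def average_load(workers):
    """Average thread utilization across workers.

    workers: list of (avg_active_threads, avg_max_threads) tuples, one per
    worker, where each value is already averaged over the 5-minute window.
    """
    ratios = [
        active / max_threads
        for active, max_threads in workers
        if max_threads > 0  # assumption: skip workers that report no pool
    ]
    if not ratios:
        return None
    return sum(ratios) / len(ratios)


# Two hypothetical workers: one half busy, one fully busy.
print(average_load([(8, 16), (16, 16)]))
```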

Read - Hotspot (load > 50%)

Displays per-worker read load (for workers exceeding 50% utilization), and highlights the most heavily used workers. This helps identify uneven I/O distribution and potential hotspots within the cluster.

avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5

percentunit

yellow: > 0.8
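
The "> 0.5" comparison in the expression above acts as a filter: series at or below 50% utilization are dropped from the result. A minimal sketch of the same filtering, with hypothetical worker names and an added descending sort for readability:

```python
def hotspots(load_by_worker, threshold=0.5):
    """Return workers whose load exceeds the threshold, busiest first."""
    ranked = sorted(load_by_worker.items(), key=lambda item: item[1], reverse=True)
    return {worker: load for worker, load in ranked if load > threshold}


# Hypothetical per-worker read load (active threads / max threads).
print(hotspots({"worker-1": 0.3, "worker-2": 0.9, "worker-3": 0.6}))
```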

Cache File

Displays the total number of files and pages currently stored in the Alluxio cache. This helps in understanding the composition and granularity of the cached data.

sum(alluxio_data_cached_files{job="worker",cluster_name=~"$cluster"}), sum(alluxio_data_cached_pages{job="worker",cluster_name=~"$cluster"})

short

Write - Throughput

Shows the write throughput from different sources: Fuse, S3 API, and direct Worker access. This helps identify the primary data access interfaces for write operations.

sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_throughput_bytes_total{job="worker",destination="write_buffer",method="write"}[5m])) or on() vector(0), avg(avg_over_time(alluxio_rpc_executor_current_queue_length{job="worker",executor_name="grpc-metadata"}[5m]))

binBps

yellow: > 100, red: > 2000

Write - Load (5m)

Displays the average write load (thread utilization) across all workers over the last 5 minutes. A high average load may indicate a system-wide performance bottleneck.

avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]))

percentunit

yellow: > 0.8

Write - Hotspot (load > 50%)

Displays per-worker write load (for workers exceeding 50% utilization), and highlights the most heavily used workers. This helps identify uneven I/O distribution and potential hotspots within the cluster.

avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5

percentunit

yellow: > 0.8

Workers

Displays the number of live (healthy) and lost (unhealthy) workers in the cluster. This is a key indicator of cluster health and stability.

sum(up{job="worker",cluster_name=~"$cluster"}), sum by () (prometheus_target_scrape_pool_targets{scrape_job="worker"}) - sum by () (up{job="worker",cluster_name=~"$cluster"})

yellow: > 1
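
The lost-worker count is derived rather than measured: it is the number of configured scrape targets minus the number of targets that report up. A short sketch of that arithmetic, using hypothetical inputs:

```python
def worker_counts(up_values, configured_targets):
    """Return (live, lost) worker counts.

    up_values: one 0/1 value per scraped worker target (the `up` series).
    configured_targets: total number of worker targets in the scrape pool.
    """
    live = sum(up_values)
    lost = configured_targets - live
    return live, lost


# Four hypothetical worker targets, one of which is down.
print(worker_counts([1, 1, 0, 1], 4))
```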

Jobs

Displays the number of currently running and waiting jobs in the job service queue. This helps monitor the status of asynchronous operations like data loading.

sum(alluxio_active_job_count{job="coordinator",type="running"}), sum(alluxio_active_job_count{job="coordinator",type="waiting"})

short

yellow: > 100

Meta - RPS

Displays the metadata Requests Per Second (RPS) from different sources: Fuse, S3 API, and direct Worker access. This is important for monitoring metadata workload intensity.

sum(alluxio_fuse_concurrency{job="fuse",method=~"Fuse\.Create|Fuse\.Getattr|Fuse\.Readdir|…

Meta - Load (5m)

Displays the average metadata operation load (thread utilization) across all workers over the last 5 minutes. High load can indicate a metadata-intensive workload.

avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]))

percentunit

yellow: > 0.8

Meta - Hotspot (load > 50%)

Displays per-worker metadata operation load (for workers exceeding 50% utilization), and highlights the most heavily used workers. This helps identify uneven metadata load distribution and potential hotspots within the cluster.

avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5

percentunit

yellow: > 0.8

License (Valid To)

Displays the expiration date of the Alluxio enterprise license. This is important for ensuring uninterrupted access to enterprise features.

min(alluxio_license_expiration_date{job=~"coordinator|worker"}) * 1000

dateTimeFromNow
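
The multiplication by 1000 suggests that the metric is reported in epoch seconds and converted to milliseconds for the dateTimeFromNow unit. Under that assumption, the conversion and a days-remaining check can be sketched as follows; both function names are hypothetical.

```python
def to_epoch_millis(expiration_epoch_seconds):
    """Convert an epoch timestamp in seconds to milliseconds."""
    return expiration_epoch_seconds * 1000


def days_until(expiration_epoch_seconds, now_epoch_seconds):
    """Days remaining until expiration (negative once expired)."""
    return (expiration_epoch_seconds - now_epoch_seconds) / 86400


# Hypothetical license that expires ten days after "now".
now = 1_700_000_000
expires = now + 10 * 86400
print(to_epoch_millis(expires), days_until(expires, now))
```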

Component Version

Displays the distribution of Alluxio versions running across the cluster. This is useful for verifying version consistency during upgrades or troubleshooting.

count (alluxio_version_info{cluster_name=~"$cluster"}) by (version)

Component Uptime

Displays the uptime of each Alluxio component (coordinator, worker, fuse). This helps track service stability and identify recent restarts.

timestamp(process_start_time_seconds{job=~"coordinator|worker|fuse",cluster_name="$cluster"}) - process_start_time_seconds{job="coordinator|…

Cache Hit(%)

Displays the percentage of data/metadata read requests served from the Alluxio cache. A higher value indicates better cache efficiency and reduced load on the underlying storage.

sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) / (sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) + (sum(irate(alluxio_missed_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0))), sum(irate(alluxio_metadata_cache_hit_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) / (sum(irate(alluxio_metadata_cache_hit_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) + sum(irate(alluxio_metadata_cache_miss_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0))

percentunit

green: > 0.8
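
The hit ratio above is hits / (hits + misses), with a missing miss series treated as zero through "or on() vector(0)". A minimal sketch of that calculation follows; the function name is hypothetical, and returning None for an idle cache is an assumption (PromQL would yield NaN for 0 / 0).

```python
def hit_ratio(hit_rate, miss_rate=None):
    """Fraction of reads served from cache.

    hit_rate, miss_rate: per-second rates. A miss_rate of None models a
    miss series that does not exist yet and is treated as 0.
    """
    misses = 0.0 if miss_rate is None else miss_rate
    total = hit_rate + misses
    if total == 0:
        return None  # assumption: no traffic, so the ratio is undefined
    return hit_rate / total


print(hit_ratio(80.0, 20.0))
print(hit_ratio(5.0))
```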

Cache Eviction

Displays the rate at which data is being evicted from the cache. A high eviction rate may indicate insufficient cache capacity for the workload.

sum(irate(alluxio_cached_evicted_data_bytes_total{job="worker",cluster_name=~"$cluster"}[5m]))

binBps

Throughput - Read

Provides a detailed breakdown of read throughput across different data paths (FUSE, S3, Worker, UFS). Essential for understanding data flow and identifying bottlenecks.

sum(irate(alluxio_data_access_bytes_sum{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_throughput_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_ufs_data_access_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)

binBps

Throughput - Write

Provides a detailed breakdown of write throughput across different data paths (FUSE, S3, Worker, UFS). Critical for understanding write efficiency and patterns.

sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_throughput_bytes_total{job="worker",destination="write_buffer",method="write"}[5m])) or on() vector(0), sum(irate(alluxio_write_buffer_async_persist_throughput_bytes_total{job="worker"}[5m])) or on() vector(0), sum(irate(alluxio_ufs_data_access_bytes_total{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)

binBps

Request/s - Read

Displays the total read request rate across all interfaces (FUSE, S3, Worker). Important for understanding read workload intensity and traffic patterns.

sum(irate(alluxio_data_access_bytes_count{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_access_bytes_count{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)

reqps

Request/s - Write

Displays the total write request rate across all interfaces. Critical for understanding write workload patterns and identifying write-heavy phases.

sum(irate(alluxio_data_access_bytes_count{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(sum(irate(alluxio_s3_api_throughput_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0)) or on() vector(0), sum(irate(alluxio_data_access_bytes_count{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)

reqps

Request/s - Metadata

Displays the metadata operation request rate across all interfaces. Important for understanding metadata workload intensity, which can impact overall cluster performance.

sum(alluxio_fuse_concurrency{job="fuse",method=~"Fuse\.Create|Fuse\.Getattr|Fuse\.Readdir|…

Jobs

Displays a time series of running and waiting jobs. This helps in understanding background activity levels in the cluster and identifying periods of heavy maintenance activity that might affect performance.

sum(alluxio_active_job_count{job="coordinator",type="running"}), sum(alluxio_active_job_count{job="coordinator",type="waiting"})

short

Request Latency - P90

Displays the 90th percentile latency for all request types. This metric is key to understanding the typical user experience and identifying performance degradation.

histogram_quantile(0.90, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",cluster_name=~"$cluster"}[5m])) by (le, method)), histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, op)), histogram_quantile(0.90, sum(rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method)), histogram_quantile(0.90, sum(rate(alluxio_ufs_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method, ufs_type))

ms
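
The percentile panels depend on histogram_quantile(), which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified illustration of that idea, not the Prometheus implementation; it assumes positive bucket bounds, and the sample buckets are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs that includes a
    final (math.inf, total_count) bucket. Assumes positive upper bounds,
    as with latency histograms.
    """
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return prev_bound  # empty bucket at rank 0: avoid dividing by zero
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in milliseconds: (le, cumulative count).
SAMPLE_BUCKETS = [(10, 50), (50, 90), (100, 100), (math.inf, 100)]
print(histogram_quantile(0.90, SAMPLE_BUCKETS))
print(histogram_quantile(0.99, SAMPLE_BUCKETS))
```

Because of the interpolation, the reported percentile is only as precise as the bucket boundaries allow.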

Request Latency - P99

Displays the 99th percentile latency for all request types. This metric is critical for identifying worst-case performance and outliers that may indicate system stress.

histogram_quantile(0.99, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",cluster_name=~"$cluster"}[5m])) by (le, method)), histogram_quantile(0.99, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, op)), histogram_quantile(0.99, sum(rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method)), histogram_quantile(0.99, sum(rate(alluxio_ufs_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method, ufs_type))

ms

RPC (Client<->Worker) Latency - P90

Displays the estimated network latency for client GetStatus gRPC calls (metadata operations) at the 90th percentile, calculated as the difference between the client-observed P90 latency and the worker metadata processing P90 latency. This helps identify network-related delays and connectivity issues between clients and the cluster.

(histogram_quantile(0.90, sum(rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",method="GetStatus",instance=~"$instance",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0, (histogram_quantile(0.90, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",method="Fuse.Getattr",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0

ms
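
The estimate above subtracts the worker-side percentile from the client-side percentile and keeps only positive results (the trailing "> 0"). A minimal sketch with hypothetical values is shown below. Note that a difference of two percentiles only approximates the percentile of the per-request difference.

```python
def estimated_network_latency_ms(client_latency_ms, worker_latency_ms):
    """Client-observed latency minus worker processing latency.

    Returns None when the difference is not positive, mirroring the
    "> 0" filter that drops such series from the panel.
    """
    difference = client_latency_ms - worker_latency_ms
    return difference if difference > 0 else None


print(estimated_network_latency_ms(12.0, 4.5))
print(estimated_network_latency_ms(3.0, 5.0))
```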

RPC (Client<->Worker) Latency - P99

Displays the estimated network latency for client GetStatus gRPC calls (metadata operations) at the 99th percentile, calculated as the difference between the client-observed P99 latency and the worker metadata processing P99 latency. This helps identify network-related delays and connectivity issues between clients and the cluster.

(histogram_quantile(0.99, sum(rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",method="GetStatus",instance=~"$instance",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.99, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0

ms

Request Call - Success Rate

Displays the success rate for all request types across interfaces. A value below 99% (excluding Not Found errors) typically indicates system issues that require investigation.

(sum by (method) (irate(alluxio_fuse_call_latency_ms_count{job="fuse", success="true",cluster_name=~"$cluster"}[5m]))) / (sum by (method) (irate(alluxio_fuse_call_latency_ms_count{job="fuse", cluster_name=~"$cluster"}[5m]))), (sum by (method) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="true",cluster_name=~"$cluster"}[5m]))) / (sum by (method) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", cluster_name=~"$cluster"}[5m]))), (sum by (op) (irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m]) - irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster"}[5m]))) / (sum by (op) (irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m]))), (sum by (ufs_type) (irate(alluxio_ufs_total{cluster_name=~"$cluster"}[5m]) - irate(alluxio_ufs_error_total{cluster_name=~"$cluster"}[5m]))) / (sum by (ufs_type) (irate(alluxio_ufs_latency_ms_count{cluster_name=~"$cluster"}[5m])))

percentunit

green: > 0.9
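
For the metadata and UFS series, the success rate above is derived as (total - errors) / total rather than read from a dedicated success counter. A short sketch of that arithmetic follows; the function name and rates are hypothetical, and returning None for idle periods is an assumption.

```python
def success_rate(total_rate, error_rate):
    """Share of successful calls given per-second total and error rates."""
    if total_rate == 0:
        return None  # assumption: no calls, so the rate is undefined
    return (total_rate - error_rate) / total_rate


# Hypothetical: 200 calls per second, 1 of which fails.
print(success_rate(200.0, 1.0))
```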

Request Call - Failures

Displays the failure rate across all request types and interfaces. Spikes in this metric require immediate investigation to identify system or application errors.

sum by (method, state) (irate(alluxio_fuse_result_total{job="fuse",state!="SUCCESS", cluster_name=~"$cluster"}[5m])), sum by (method, status) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="false",cluster_name=~"$cluster"}[5m])), sum by (op) (irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster"}[5m])), sum by (ufs_type,error_code) (irate(alluxio_ufs_error_total{cluster_name=~"$cluster"}[5m]))

short


Fuse Row

Panel Name
Description
Calculation Method
Unit
Threshold

Request Latency - P90

Displays the 90th percentile latency for individual Fuse mount operations. This helps isolate performance issues specific to a particular Fuse client.

histogram_quantile(0.90, rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m])), histogram_quantile( 0.90, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )

ms

Request Latency - P99

Displays the 99th percentile latency for individual Fuse mount operations. This helps isolate performance issues specific to a particular Fuse client.

histogram_quantile(0.99, rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m])), histogram_quantile( 0.99, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )

ms

Throughput - Read

Displays the read throughput for an individual Fuse mount. This is useful for monitoring the read activity of a specific client.

irate(alluxio_data_access_bytes_sum{job="fuse",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Throughput - Write

Displays the write throughput for an individual Fuse mount. This is useful for monitoring the write activity of a specific client.

irate(alluxio_data_access_bytes_sum{job="fuse",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Request/s - Read

Displays the read request rate for an individual Fuse mount. This helps understand the read workload from a specific client.

irate(alluxio_data_access_bytes_count{job="fuse",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])

reqps

Request/s - Write

Displays the write request rate for an individual Fuse mount. This helps understand the write workload from a specific client.

irate(alluxio_data_access_bytes_count{job="fuse",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])

reqps

FUSE Request Failure

Displays the failure rate for an individual Fuse mount. This is critical for diagnosing client-specific connectivity or operational issues.

irate(alluxio_fuse_result_total{job="fuse",state!="SUCCESS", instance=~"$instance",cluster_name=~"$cluster"}[5m])

short

Throughput - UFS Fallback

Displays the read throughput when FUSE operations fall back to the underlying file system (UFS). A high fallback rate indicates cache misses or data that is not in Alluxio, which reduces the performance benefit. This value should stay minimal for optimal performance.

irate(alluxio_ufs_data_access_bytes_total{job="fuse",instance=~"$instance",method="read",cluster_name=~"$cluster"}[5m]) or on() vector(0)

binBps

Client Request Latency (P99)

Displays the 99th percentile latency of gRPC (metadata ops) and Netty (data ops) client calls. This helps identify RPC communication delays, client-side bottlenecks, and potential network or worker responsiveness issues.

histogram_quantile( 0.99, sum( rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, method, instance) ), histogram_quantile( 0.99, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )

ms

Client Request Concurrency

Displays the number of concurrent gRPC client calls (metadata ops). This helps identify spikes in metadata request load and potential client-side bottlenecks within the worker.

alluxio_grpc_client_concurrency{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}

reqps

Client Request Failure Rate

Displays the error rate of gRPC client calls (metadata operations) and Netty operation calls (data operations). This helps identify failing metadata requests and potential reliability issues in client-worker communication within the cluster.

sum by (instance,method) (irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) / (sum by (instance,method) (irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) + sum by (instance,method) (irate(alluxio_grpc_client_successes_total{instance=~"$instance",cluster_name=~"$cluster"}[5m]))), sum by (instance,op) (irate(alluxio_netty_operation_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) / (sum by (instance,op) (irate(alluxio_netty_operations_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])))

percentunit

green: > 0.9

Client Request Errors

Displays the total number of gRPC client errors (metadata operations) and Netty operation errors (data operations). This helps track failing requests and potential reliability issues in client-worker communication within the cluster.

irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m]), irate(alluxio_netty_operation_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])

short


S3 Row

Panel Name
Description
Calculation Method
Unit
Threshold

Request Latency - P90

Displays the 90th percentile latency for S3 API operations on a per-worker basis. This helps isolate S3 performance issues to specific workers.

histogram_quantile(0.90, rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]))

ms

Request Latency - P99

Displays the 99th percentile latency for S3 API operations on a per-worker basis. This helps identify worst-case S3 performance on specific workers.

histogram_quantile(0.99, rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]))

ms

Throughput - Read

Displays the read throughput for S3 API operations on a per-worker basis. This is useful for monitoring S3 read activity on specific workers.

irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Throughput - Write

Displays the write throughput for S3 API operations on a per-worker basis. This is useful for monitoring S3 write activity on specific workers.

irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Request/s - Read

Displays the read request rate for S3 API operations on a per-worker basis. This helps understand the S3 read workload on specific workers.

irate(alluxio_s3_api_throughput_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])

reqps

Request/s - Write

Displays the write request rate for S3 API operations on a per-worker basis. This helps understand the S3 write workload on specific workers.

irate(alluxio_s3_api_throughput_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])

reqps

Request Failure

Displays the failure rate for S3 API operations on a per-worker basis. This is critical for diagnosing S3-specific issues on individual workers.

irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="false",instance=~"$instance",cluster_name=~"$cluster"}[5m])

short


Worker Row

Panel Name
Description
Calculation Method
Unit
Threshold

Storage (Data) Used

Displays the amount of cache data storage used by each individual worker. This is useful for identifying uneven storage distribution.

alluxio_cached_storage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}

bytes

Storage (Meta) Used

Displays the amount of cache metadata storage used by each individual worker. This is useful for identifying uneven storage distribution.

alluxio_metastore_storage_size_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}

bytes

Storage Device Usage

Displays the page store device capacity utilization per directory on workers. This helps identify disk space consumption trends and potential storage exhaustion or imbalance within the cluster.

1- (sum by (instance, dir) (alluxio_page_store_device_available_capacity_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) / sum by (instance, dir) (alluxio_page_store_device_total_capacity_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}))

percentunit

yellow: > 0.8, red: > 0.9
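
The device usage above is 1 - available / total, aggregated per (instance, dir) pair. The following sketch reproduces that grouping with hypothetical worker names, directories, and byte counts.

```python
from collections import defaultdict


def device_usage(samples):
    """Usage ratio per (instance, dir).

    samples: iterable of (instance, directory, available_bytes, total_bytes).
    Groups with zero total capacity are skipped.
    """
    available = defaultdict(float)
    total = defaultdict(float)
    for instance, directory, available_bytes, total_bytes in samples:
        key = (instance, directory)
        available[key] += available_bytes
        total[key] += total_bytes
    return {
        key: 1 - available[key] / total[key]
        for key in total
        if total[key] > 0
    }


print(device_usage([
    ("worker-1", "/data1", 25, 100),
    ("worker-1", "/data2", 50, 100),
]))
```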

Files

Displays the number of cached files for each individual worker. This helps in understanding the cache composition on a per-worker basis.

alluxio_data_cached_files{job="worker",instance=~"$instance",cluster_name=~"$cluster"}

short

Pages

Displays the number of cached pages for each individual worker. This helps in understanding the cache composition on a per-worker basis.

alluxio_data_cached_pages{job="worker",instance=~"$instance",cluster_name=~"$cluster"}

short

Cache Evicted

Displays the rate of data eviction from the cache for each individual worker. This helps identify which workers are under the most cache capacity pressure.

irate(alluxio_cached_evicted_data_bytes_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Request Latency - P90

Displays the 90th percentile latency for internal metadata operations, and for the time to send the first packet of a Netty read, on a per-worker basis.

histogram_quantile(0.90, rate(alluxio_meta_operation_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])), histogram_quantile( 0.90, sum( rate(alluxio_worker_netty_read_time_to_send_first_packet_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )

ms

Request Latency - P99

Displays the 99th percentile latency for internal metadata operations, and for the time to send the first packet of a Netty read, on a per-worker basis.

histogram_quantile(0.99, rate(alluxio_meta_operation_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])), histogram_quantile( 0.99, sum( rate(alluxio_worker_netty_read_time_to_send_first_packet_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )

ms

Request (from Page Store or UFS) Latency - P90

Displays the 90th percentile latency for read operations served by Alluxio workers from the page store or the underlying file system (UFS). This helps identify read performance bottlenecks and distinguish slow storage backends from overloaded workers within the cluster.

histogram_quantile( 0.90, sum( rate(alluxio_worker_netty_read_storage_response_time_ms_bucket{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) ) by (le, instance) )

ms

PageStore IO Latency - P90

Displays the 90th percentile latency of page store I/O operations (data operations) on workers. This helps identify slow disk I/O performance and potential storage bottlenecks within the cluster.

histogram_quantile( 0.90, sum( rate(alluxio_page_store_io_latency_microseconds_bucket{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) ) by (le, op, instance) )

µs

Read - Throughput

Displays the read throughput for each individual worker. This is useful for monitoring the read activity and load on each worker.

irate(alluxio_data_throughput_bytes_total{job="worker",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Write - Throughput

Displays the write throughput for each individual worker. This is useful for monitoring the write activity and load on each worker.

irate(alluxio_data_throughput_bytes_total{job="worker",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Read - Request/s

Displays the read request rate for each individual worker. This helps understand the read workload distribution across workers.

irate(alluxio_data_access_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])

reqps

Write - Request/s

Displays the write request rate for each individual worker. This helps understand the write workload distribution across workers.

irate(alluxio_data_access_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])

reqps

Read - Threads

Displays the maximum, active, and current thread counts and the queue length for the read thread pool on each worker. This is useful for diagnosing read performance bottlenecks.

alluxio_rpc_executor_max_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_active_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}

short

Write - Threads

Displays the maximum and active thread counts and the queue length for the write thread pool on each worker. This is useful for diagnosing write performance bottlenecks.

alluxio_rpc_executor_max_threads{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_active_threads{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}

short

Metadata - Request/s

Displays the metadata request rate for each individual worker. This helps identify workers with high metadata load.

irate(alluxio_meta_operation_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])

reqps

Metadata - Threads

Displays the maximum, active, and current thread counts and the queue length for the metadata thread pool on each worker. This is useful for diagnosing metadata performance bottlenecks.

alluxio_rpc_executor_max_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_active_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_queue_length{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}

short

Request Failure

Displays the failure rate for internal metadata operations on a per-worker basis. This helps pinpoint workers that are experiencing internal errors.

irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster", instance=~"$instance"}[5m])

short

Off Heap Memory

Displays the off-heap memory usage for each worker. This is important for monitoring memory resources and preventing out-of-memory errors.

sum(alluxio_rocksdb_memory_usage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) by (instance) + sum(alluxio_netty_direct_memory_usage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) by (instance) + sum(jvm_memory_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",area="nonheap"}) by (instance) + sum(jvm_buffer_pool_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",pool="direct"}) by (instance) + sum(jvm_buffer_pool_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",pool="mapped"}) by (instance)

bytes
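
The Off Heap Memory expression adds five per-instance terms. As a rough illustration of that arithmetic only, the sketch below sums made-up sample values for one worker; the function name, dictionary keys, and byte counts are assumptions for the example, not Alluxio output.

```python
# Illustrative only: the keys mirror the five terms of the panel query and
# the byte values are invented samples for a single worker instance.
GIB = 1024 ** 3
MIB = 1024 ** 2

def off_heap_bytes(components):
    """Sum the per-instance components that the panel query adds together."""
    return sum(components.values())

sample = {
    "rocksdb_memory_usage": 2 * GIB,
    "netty_direct_memory_usage": 1 * GIB,
    "jvm_nonheap_used": 512 * MIB,
    "jvm_buffer_pool_direct": 256 * MIB,
    "jvm_buffer_pool_mapped": 256 * MIB,
}

total = off_heap_bytes(sample)
print(f"{total / GIB:.2f} GiB")
```

Note that in PromQL the `+` operator only returns a result for instances present in every operand, so a worker that is missing any one of the five series does not appear in the panel.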

PageStore Errors

Displays the error rate for PageStore operations on each worker. This can indicate issues with the local cache storage layer.

irate(alluxio_page_store_operation_errors_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m])

short

PageStore Disk Error Rate

Displays the error rate for disk-related PageStore operations on each worker. This can indicate underlying disk hardware or file system issues.

sum by (instance, dir) (irate(alluxio_page_store_dir_operation_errors_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) or on() vector(0)) / sum by (instance, dir) (irate(alluxio_page_store_dir_operations_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]))

percentunit

yellow: > 0.1, red: > 0.3
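
As a minimal sketch of how the ratio and thresholds above combine, the function below divides an error rate by an operation rate and maps the result to a color. The function name, the "green" base color, and the "no data" handling for a zero denominator are assumptions made for illustration; Grafana applies the thresholds itself.

```python
def disk_error_status(errors_per_sec, operations_per_sec):
    """Return (ratio, color) using the panel thresholds: yellow > 0.1, red > 0.3."""
    if operations_per_sec == 0:
        # Simplification: with no operations there is no meaningful ratio.
        return None, "no data"
    ratio = errors_per_sec / operations_per_sec
    if ratio > 0.3:
        return ratio, "red"
    if ratio > 0.1:
        return ratio, "yellow"
    return ratio, "green"

# Hypothetical rates for one (instance, dir) pair.
print(disk_error_status(2, 10))
print(disk_error_status(4, 10))
```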


UFS Row

Panel Name
Description
Calculation Method
Unit
Threshold

Request Latency - P90

Displays the 90th percentile latency for UFS operations on a per-worker basis. This helps identify slow UFS interactions from specific workers.

histogram_quantile(0.90, rate(alluxio_ufs_latency_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]))

ms

Request Latency - P99

Displays the 99th percentile latency for UFS operations on a per-worker basis. This helps identify worst-case UFS performance from specific workers.

histogram_quantile(0.99, rate(alluxio_ufs_latency_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]))

ms
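
The P90 and P99 panels both rely on histogram_quantile, which estimates a percentile from cumulative bucket counts. The sketch below is a simplified re-implementation for intuition, assuming linear interpolation inside the matching bucket and a lower bound of 0 for the first bucket; the bucket boundaries and counts are invented, and Prometheus handles additional edge cases not shown here.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative counts for latency buckets in milliseconds.
latency_buckets = [(10, 50), (50, 90), (100, 98), (500, 100), (math.inf, 100)]
p90 = histogram_quantile(0.90, latency_buckets)
p99 = histogram_quantile(0.99, latency_buckets)
print(p90, p99)
```

Because the result is interpolated, its precision depends on how finely the buckets are spaced around the percentile of interest.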

Throughput - Read

Displays the read throughput from the UFS on a per-worker basis. This is useful for monitoring how much data is being read from the underlying storage.

irate(alluxio_ufs_data_access_bytes_total{method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Throughput - Write

Displays the write throughput to the UFS on a per-worker basis. This is useful for monitoring how much data is being written to the underlying storage.

irate(alluxio_ufs_data_access_bytes_total{method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Errors

Displays the error rate for UFS operations on a per-worker basis. This is critical for diagnosing issues with the underlying storage system.

irate(alluxio_ufs_error_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])

none


Job Row

Panel Name
Description
Calculation Method
Unit
Threshold

Jobs

Displays the number of running and waiting jobs over time. This is important for understanding cluster background activity levels and identifying periods of high maintenance activity that might impact performance.

alluxio_active_job_count{job="coordinator",type="running"}, alluxio_active_job_count{job="coordinator",type="waiting"}

short

Job Tasks per Worker

Displays the number of job tasks on each worker over time. This is important for understanding worker activity levels and identifying bottlenecks that might impact performance.

alluxio_worker_job_task_count{}

short

Job Threads per Worker

Displays the active threads and queue length for the load job thread pool on each worker. This is useful for diagnosing load job performance bottlenecks.

alluxio_rpc_executor_max_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_active_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_queue_length{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}

short

Job Dispatched Per Second

Displays the rate of jobs dispatched per worker.

sum by(worker)(irate(alluxio_distributed_load_job_dispatched_size_total{job="coordinator", instance=~"$instance"} [5m]))

Distributed Load Throughput

Displays the cluster-wide load job throughput alongside the throughput for each individual worker. This is useful for monitoring the load job activity on each worker.

sum(irate(alluxio_distributed_load_data_loaded_from_ufs_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])), irate(alluxio_distributed_load_data_loaded_from_ufs_bytes_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Bytes Loaded on Workers Per Second

Displays the rate of bytes loaded on each worker, alongside the total loaded-bytes rate reported by the coordinator.

sum by(instance) (irate(alluxio_distributed_worker_bytes_loaded_bytes_total{job="worker", instance=~"$instance"} [5m])), sum by()(irate(alluxio_distributed_load_job_loaded_bytes_total{job="coordinator", instance=~"$instance"} [5m]))

binBps

Distributed Load Operation Counts Per Second

Displays the per-second rates of scanned, processed, skipped, and failed distributed load operations.

irate(alluxio_distributed_load_job_scanned_total{job="coordinator", instance=~"$instance"}[5m]), irate(alluxio_distributed_load_job_processed_total{job="coordinator", instance=~"$instance"}[5m]), irate(alluxio_distributed_load_job_skipped_total{job="coordinator", instance=~"$instance"}[5m]), sum by()(irate(alluxio_distributed_load_job_failure_total{job="coordinator", instance=~"$instance"}[5m]))

short

Distributed Load Failure Breakdowns

Displays the rate of distributed load failures, broken down by reason and worker.

sum by(reason, worker) (irate(alluxio_distributed_load_job_failure_total{job="coordinator", instance=~"$instance"} [5m]))


Process Row

Panel Name
Description
Calculation Method
Unit
Threshold

Active Worker Membership

Displays the rate of worker membership refreshes on each instance; only non-zero values are shown. This helps track cluster topology updates and potential instability or frequent reconfiguration of workers within the cluster.

irate(alluxio_worker_membership_refresh_count_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) > 0

short

Resource Pool Usage

Displays the utilization ratio of dynamic resource pools, calculated as current resources over maximum capacity. This helps identify over- or under-utilized pools and potential resource bottlenecks within the cluster.

sum by (instance, pool_kind, pool_instance) (alluxio_dynamic_resource_pool_current_resources{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}) / sum by (instance, pool_kind, pool_instance) (alluxio_dynamic_resource_pool_capacity{type="max",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"})

percentunit

yellow: > 0.8
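
The Resource Pool Usage expression sums both sides by (instance, pool_kind, pool_instance) and then divides the matching groups. The sketch below mimics that grouping under stated assumptions: the function name, label values, and numbers are hypothetical, and groups with a zero maximum are simply skipped here, which is a simplification.

```python
from collections import defaultdict

def pool_usage(current, maximum):
    """Sum both inputs by (instance, pool_kind, pool_instance), then divide."""
    cur_sum, max_sum = defaultdict(float), defaultdict(float)
    for key, value in current:
        cur_sum[key] += value
    for key, value in maximum:
        max_sum[key] += value
    # Only label sets present on both sides produce a ratio.
    return {k: cur_sum[k] / max_sum[k] for k in cur_sum if k in max_sum and max_sum[k]}

# Hypothetical samples keyed by (instance, pool_kind, pool_instance).
current = [
    (("worker-1", "kind-a", "pool-0"), 40),
    (("worker-1", "kind-a", "pool-0"), 45),
    (("worker-2", "kind-b", "pool-1"), 10),
]
maximum = [
    (("worker-1", "kind-a", "pool-0"), 100),
    (("worker-2", "kind-b", "pool-1"), 50),
]

usage = pool_usage(current, maximum)
over_threshold = {k: v for k, v in usage.items() if v > 0.8}  # yellow: > 0.8
print(over_threshold)
```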

Resource Pool - Acquisition Timeout

Displays the rate of dynamic resource pool acquisition timeouts. This helps identify contention or delays in resource allocation and potential performance bottlenecks within the cluster.

irate(alluxio_dynamic_resource_pool_acquisition_timeouts_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])

short

Resource Pool - Resource Creation Latency (P99)

Displays the 99th percentile latency for creating new resources in dynamic resource pools. This helps identify slow resource allocation and potential performance bottlenecks within the cluster.

histogram_quantile( 0.99, sum( rate(alluxio_dynamic_resource_pool_create_new_resource_latency_ms_bucket{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, pool_kind, pool_instance, instance) )

ms

CPU time spent

Displays the CPU time spent by each process.

irate(process_cpu_seconds_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])

µs

Threads

Displays the number of threads for each process.

jvm_threads_current{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}

Heap Usage

Displays the heap memory usage for each process.

jvm_memory_used_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}

bytes

Heap Usage(%)

Displays the heap memory usage percentage for each process.

jvm_memory_used_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"} / jvm_memory_max_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}

percentunit

yellow: > 0.9

young GC time(per minute)

Displays the young generation garbage collection time per minute.

irate(jvm_gc_collection_seconds_sum{gc="G1 Young Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60

s
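
The GC panels multiply a per-second irate by 60 to get a per-minute figure. As a small sketch of that conversion, the function below computes a rate from the last two samples of a counter, treating a decrease as a counter reset; the function is a simplified stand-in for PromQL's irate, and the sample timestamps and values are invented.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, value) samples."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    # A decrease is treated as a counter reset, so the new value is the increase.
    delta = v1 - v0 if v1 >= v0 else v1
    return delta / (t1 - t0)

# Hypothetical samples of cumulative GC seconds, scraped every 15 seconds.
gc_seconds_sum = [(0, 1.0), (15, 1.2), (30, 1.5)]
gc_seconds_per_minute = irate(gc_seconds_sum) * 60
print(f"{gc_seconds_per_minute:.2f} s of GC per minute")
```

Because only the last two samples inside the window are used, the value reacts quickly to short bursts but can look noisy between scrapes.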

young GC rate(per minute)

Displays the number of young generation garbage collections per minute.

irate(jvm_gc_collection_seconds_count{gc="G1 Young Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60

old GC time(per minute)

Displays the old generation garbage collection time per minute.

irate(jvm_gc_collection_seconds_sum{gc="G1 Old Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60

s

old GC rate(per minute)

Displays the number of old generation garbage collections per minute.

irate(jvm_gc_collection_seconds_count{gc="G1 Old Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60


ETCD Row

Panel Name
Description
Calculation Method
Unit
Threshold

Up

Displays the number of nodes in the ETCD cluster that currently report having a leader.

sum(etcd_server_has_leader{job="etcd"})

yellow: > 1, green: > 3

RPC Rate

Displays the RPC request and failure rates.

sum(rate(grpc_server_started_total{job="etcd",grpc_type="unary"}[5m])), sum(rate(grpc_server_handled_total{job="etcd",grpc_type="unary",grpc_code!="OK"}[5m]))

ops

Active Streams

Displays the number of active watch and lease streams.

sum(grpc_server_started_total{job="etcd",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{job="etcd",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}), sum(grpc_server_started_total{job="etcd",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{job="etcd",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"})

ETCD Client Call Failure Rate

Displays the error rate of ETCD client calls. This helps identify failing ETCD requests and potential reliability or connectivity issues between the cluster and the ETCD service.

(sum by (instance, server) (irate(alluxio_etcd_call_errors_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0)) / (sum by (instance, server) (irate(alluxio_etcd_client_calls_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])))

percentunit

yellow: > 0.05, red: > 0.2

ETCD Client Call Latency (P99)

Displays the 99th percentile latency of ETCD client calls. This helps identify slow ETCD operations and potential performance bottlenecks in ETCD communication within the cluster.

histogram_quantile( 0.99, sum( rate(alluxio_etcd_client_call_latency_ms_bucket{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )

ms

DB Size

Displays the size of the ETCD database.

etcd_mvcc_db_total_size_in_bytes{job="etcd"}

bytes

Disk Sync Duration

Displays the 99th percentile duration of WAL fsync and backend commit operations on each ETCD instance.

histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le)), histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le))

Memory

Displays the resident memory usage of the ETCD process.

process_resident_memory_bytes{job="etcd"}

bytes

Client Traffic In

Displays the rate of inbound gRPC client traffic received by ETCD.

rate(etcd_network_client_grpc_received_bytes_total{job="etcd"}[5m])

binBps

Client Traffic Out

Displays the rate of outbound gRPC client traffic sent by ETCD.

rate(etcd_network_client_grpc_sent_bytes_total{job="etcd"}[5m])

binBps

Peer Traffic In

Displays the rate of inbound peer traffic received by each ETCD instance.

sum(rate(etcd_network_peer_received_bytes_total{job="etcd"}[5m])) by (instance)

binBps

Peer Traffic Out

Displays the rate of outbound peer traffic sent by each ETCD instance.

sum(rate(etcd_network_peer_sent_bytes_total{job="etcd"}[5m])) by (instance)

binBps

Raft Proposals

Displays the raft proposal metrics including failure rate, pending total, commit rate, and apply rate.

sum(rate(etcd_server_proposals_failed_total{job="etcd"}[5m])), sum(etcd_server_proposals_pending{job="etcd"}), sum(rate(etcd_server_proposals_committed_total{job="etcd"}[5m])), sum(rate(etcd_server_proposals_applied_total{job="etcd"}[5m]))

none

Total Leader Elections Per Day

Displays the total number of leader elections per day.

changes(etcd_server_leader_changes_seen_total{job="etcd"}[1d])
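
The expression above counts how many times the leader-change counter took a new value within the last day. A minimal sketch of that counting logic, using invented sample values and a hypothetical function name, might look like this:

```python
def changes(values):
    """Count how many times consecutive samples differ."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)

# Hypothetical samples of the leader-change counter over one day.
leader_changes_seen = [3, 3, 4, 4, 4, 5]
print(changes(leader_changes_seen))
```

Since the panel expression is not aggregated, each ETCD instance gets its own count.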


Last updated