Grafana Dashboard
Overview
This document provides detailed descriptions of all panels in the Alluxio Grafana dashboard, including metric names, calculation methods, and threshold configurations.
Cluster Row
Storage
Displays the total capacity, used storage, and usage percentage of the Alluxio cache. This panel is critical for monitoring overall cache utilization and preventing out-of-storage issues.
sum(alluxio_cached_capacity_bytes{job="worker",cluster_name=~"$cluster"}), sum(alluxio_cached_storage_bytes{job="worker",cluster_name=~"$cluster"}), sum(alluxio_cached_storage_bytes{job="worker",cluster_name=~"$cluster"}) / sum(alluxio_cached_capacity_bytes{job="worker",cluster_name=~"$cluster"})
bytes
yellow: > 0.9, red: > 0.95
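As an illustration of how the usage percentage relates to its thresholds, the following minimal Python sketch reproduces the arithmetic of the third query outside Prometheus. The function name `storage_status`, the sample byte values, and the `"ok"` label for the unthresholded state are hypothetical; only the ratio and the 0.9 / 0.95 thresholds come from the panel definition above.

```python
def storage_status(used_bytes: float, capacity_bytes: float):
    """Return (usage ratio, color) using the Storage panel's thresholds."""
    if capacity_bytes <= 0:
        raise ValueError("capacity must be positive")
    ratio = used_bytes / capacity_bytes
    if ratio > 0.95:
        color = "red"
    elif ratio > 0.9:
        color = "yellow"
    else:
        color = "ok"  # placeholder label; the panel lists only yellow and red
    return ratio, color


# Hypothetical cluster: 920 GiB used out of 1000 GiB of cache capacity.
ratio, color = storage_status(920 * 2**30, 1000 * 2**30)
print(f"{ratio:.2%} -> {color}")
```

Both thresholds are strict comparisons, so a ratio of exactly 0.9 does not yet count as yellow in this sketch.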
Read - Throughput
Shows the read throughput from different sources: Fuse, S3 API, and direct Worker access. This helps identify the primary data access interfaces for read operations.
sum(irate(alluxio_data_access_bytes_sum{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_throughput_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), avg(avg_over_time(alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-reader"}[5m])) or on() vector(0)
binBps
yellow: > 100, red: > 2000
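The throughput queries above combine `irate(...[5m])` with an `or on() vector(0)` fallback. The sketch below approximates both in Python to show the calculation: `irate` looks only at the last two samples in the window, and the fallback substitutes 0 when a query returns no data. The helper names and the sample values are hypothetical, and the code is a simplified model of the PromQL behavior, not the Prometheus implementation.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def irate(samples: List[Sample]) -> Optional[float]:
    """Per-second rate computed from the last two samples in the window."""
    if len(samples) < 2:
        return None  # not enough data: the series yields no result
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        return None
    # A decreasing counter is treated as a reset, so the increase is the new value.
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / (t_last - t_prev)


def or_zero(value: Optional[float]) -> float:
    """Mimic `or on() vector(0)`: fall back to 0 when there is no result."""
    return 0.0 if value is None else value


# Hypothetical byte counter scraped every 15 seconds.
window = [(0.0, 100.0), (15.0, 400.0), (30.0, 1000.0)]
print(or_zero(irate(window)))        # bytes per second over the last interval
print(or_zero(irate([(0.0, 1.0)])))  # a single sample falls back to 0
```

Because only the last two samples matter, the `[5m]` range mainly controls how far back the query may look for those two samples.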
Read - Load (5m)
Displays the average read load (thread utilization) across all workers over the last 5 minutes. A high average load may indicate a system-wide performance bottleneck.
avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]))
percentunit
yellow: > 0.8
Read - Hotspot (load > 50%)
Displays the read load of each worker whose utilization exceeds 50% and highlights the most heavily used workers. This helps identify uneven I/O distribution and potential hotspots within the cluster.
avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5
percentunit
yellow: > 0.8
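The load value used by the Load and Hotspot panels is the ratio of average active threads to average maximum threads over the window. The following Python sketch walks through that calculation for two workers; the worker names and thread counts are invented for illustration, and the helper functions are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def worker_load(active: List[float], max_threads: List[float]) -> float:
    """avg_over_time(active_threads) / avg_over_time(max_threads) for one worker."""
    return mean(active) / mean(max_threads)


def cluster_load(loads: Dict[str, float]) -> float:
    """The outer avg(...) used by the Load panels."""
    return mean(loads.values())


def hotspots(loads: Dict[str, float], cutoff: float = 0.5) -> Dict[str, float]:
    """Keep only workers above the cutoff, busiest first (the `> 0.5` filter)."""
    hot = {worker: load for worker, load in loads.items() if load > cutoff}
    return dict(sorted(hot.items(), key=lambda item: item[1], reverse=True))


# Hypothetical samples collected over a 5-minute window.
loads = {
    "worker-a": worker_load([10, 14, 12], [16, 16, 16]),
    "worker-b": worker_load([2, 4, 6], [16, 16, 16]),
}
print(cluster_load(loads))
print(hotspots(loads))
```

A cluster average can look moderate while a single worker is saturated, which is why the Hotspot panels filter per worker instead of averaging.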
Cache File
Displays the total number of files and pages currently stored in the Alluxio cache. This helps in understanding the composition and granularity of the cached data.
sum(alluxio_data_cached_files{job="worker",cluster_name=~"$cluster"}), sum(alluxio_data_cached_pages{job="worker",cluster_name=~"$cluster"})
short
Write - Throughput
Shows the write throughput from different sources: Fuse, S3 API, and direct Worker access. This helps identify the primary data access interfaces for write operations.
sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_throughput_bytes_total{job="worker",destination="write_buffer",method="write"}[5m])) or on() vector(0), avg(avg_over_time(alluxio_rpc_executor_current_queue_length{job="worker",executor_name="grpc-metadata"}[5m]))
binBps
yellow: > 100, red: > 2000
Write - Load (5m)
Displays the average write load (thread utilization) across all workers over the last 5 minutes. A high average load may indicate a system-wide performance bottleneck.
avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]))
percentunit
yellow: > 0.8
Write - Hotspot (load > 50%)
Displays the write load of each worker whose utilization exceeds 50% and highlights the most heavily used workers. This helps identify uneven I/O distribution and potential hotspots within the cluster.
avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5
percentunit
yellow: > 0.8
Workers
Displays the number of live (healthy) and lost (unhealthy) workers in the cluster. This is a key indicator of cluster health and stability.
sum(up{job="worker",cluster_name=~"$cluster"}), sum by () (prometheus_target_scrape_pool_targets{scrape_job="worker"}) - sum by () (up{job="worker",cluster_name=~"$cluster"})
yellow: > 1
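The lost-worker count is derived rather than measured directly: it is the number of targets in the scrape pool minus the number of targets whose `up` value is 1. A minimal Python sketch of that subtraction, with a hypothetical helper name and sample values:

```python
from typing import Iterable, Tuple


def worker_counts(up_values: Iterable[int], targets_in_pool: int) -> Tuple[int, int]:
    """Return (live, lost) from per-target `up` values and the scrape pool size."""
    live = sum(up_values)           # `up` is 1 for a successful scrape, 0 otherwise
    lost = targets_in_pool - live   # everything in the pool that is not live
    return live, lost


# Hypothetical pool of four worker targets, one of which failed its last scrape.
print(worker_counts([1, 1, 0, 1], 4))
```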
Jobs
Displays the number of currently running and waiting jobs in the job service queue. This helps monitor the status of asynchronous operations like data loading.
sum(alluxio_active_job_count{job="coordinator",type="running"}), sum(alluxio_active_job_count{job="coordinator",type="waiting"})
short
yellow: > 100
Meta - RPS
Displays the metadata Requests Per Second (RPS) from different sources: Fuse, S3 API, and direct Worker access. This is important for monitoring metadata workload intensity.
sum(alluxio_fuse_concurrency{job="fuse",method=~"Fuse\.Create|Fuse\.Getattr|Fuse\.Readdir|…
Meta - Load (5m)
Displays the average metadata operation load (thread utilization) across all workers over the last 5 minutes. High load can indicate a metadata-intensive workload.
avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]))
percentunit
yellow: > 0.8
Meta - Hotspot (load > 50%)
Displays the metadata operation load of each worker whose utilization exceeds 50% and highlights the most heavily used workers. This helps identify uneven request distribution and potential hotspots within the cluster.
avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5
percentunit
yellow: > 0.8
License (Valid To)
Displays the expiration date of the Alluxio enterprise license. This is important for ensuring uninterrupted access to enterprise features.
min(alluxio_license_expiration_date{job=~"coordinator|worker"}) * 1000
dateTimeFromNow
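The `* 1000` in the query suggests a unit conversion: Grafana's date and time units generally interpret the value as epoch milliseconds. The Python sketch below shows that conversion under the assumption that the metric reports the expiration date as Unix epoch seconds, which the panel definition does not state explicitly. The function names and timestamps are hypothetical.

```python
from datetime import datetime, timezone


def to_grafana_millis(epoch_seconds: float) -> float:
    """Convert epoch seconds to the epoch milliseconds used by date units."""
    return epoch_seconds * 1000


def days_until(expiry_epoch_seconds: float, now_epoch_seconds: float) -> float:
    """Remaining validity in days; negative once the license has expired."""
    return (expiry_epoch_seconds - now_epoch_seconds) / 86400


def expiry_iso(epoch_seconds: float) -> str:
    """Human-readable UTC timestamp for a given expiry."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


# Hypothetical values: expiry ten days after the current time.
now = 1_700_000_000
expiry = now + 10 * 86400
print(to_grafana_millis(expiry), days_until(expiry, now), expiry_iso(expiry))
```

Taking `min(...)` across components means the panel shows the earliest expiration reported by any of them.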
Component Version
Displays the distribution of Alluxio versions running across the cluster. This is useful for verifying version consistency during upgrades or troubleshooting.
count (alluxio_version_info{cluster_name=~"$cluster"}) by (version)
Component Uptime
Displays the uptime of each Alluxio component (coordinator, worker, fuse). This helps track service stability and identify recent restarts.
timestamp(process_start_time_seconds{job=~"coordinator|worker|fuse",cluster_name="$cluster"}) - process_start_time_seconds{job="coordinator|…
Cache Hit(%)
Displays the percentage of data/metadata read requests served from the Alluxio cache. A higher value indicates better cache efficiency and reduced load on the underlying storage.
sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) / (sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) + (sum(irate(alluxio_missed_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0))), sum(irate(alluxio_metadata_cache_hit_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) / (sum(irate(alluxio_metadata_cache_hit_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) + sum(irate(alluxio_metadata_cache_miss_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0))
percentunit
green: > 0.8
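The data hit ratio above is the cached read rate divided by the sum of cached and missed read rates, with the missed rate defaulting to 0 when no miss series exists. A small Python sketch of that formula follows; the function name and the sample rates are hypothetical.

```python
from typing import Optional


def hit_ratio(hit_rate: float, miss_rate: Optional[float]) -> Optional[float]:
    """hits / (hits + misses), treating a missing miss series as 0."""
    miss = 0.0 if miss_rate is None else miss_rate  # the `or on() vector(0)` fallback
    total = hit_rate + miss
    if total == 0:
        return None  # no traffic; PromQL would evaluate 0/0 to NaN here
    return hit_rate / total


# Hypothetical rates in bytes per second.
print(hit_ratio(80.0, 20.0))  # mixed traffic
print(hit_ratio(5.0, None))   # no miss series reported yet
```

Without the fallback, a cluster that has never recorded a miss would produce no ratio at all, because adding a missing operand yields an empty result in PromQL.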
Cache Eviction
Displays the rate at which data is being evicted from the cache. A high eviction rate may indicate insufficient cache capacity for the workload.
sum(irate(alluxio_cached_evicted_data_bytes_total{job="worker",cluster_name=~"$cluster"}[5m]))
binBps
Throughput - Read
Provides a detailed breakdown of read throughput across different data paths (FUSE, S3, Worker, UFS). Essential for understanding data flow and identifying bottlenecks.
sum(irate(alluxio_data_access_bytes_sum{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_throughput_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_ufs_data_access_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)
binBps
Throughput - Write
Provides a detailed breakdown of write throughput across different data paths (FUSE, S3, Worker, UFS). Critical for understanding write efficiency and patterns.
sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_throughput_bytes_total{job="worker",destination="write_buffer",method="write"}[5m])) or on() vector(0), sum(irate(alluxio_write_buffer_async_persist_throughput_bytes_total{job="worker"}[5m])) or on() vector(0), sum(irate(alluxio_ufs_data_access_bytes_total{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)
binBps
Request/s - Read
Displays the total read request rate across all interfaces (FUSE, S3, Worker). Important for understanding read workload intensity and traffic patterns.
sum(irate(alluxio_data_access_bytes_count{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_access_bytes_count{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)
reqps
Request/s - Write
Displays the total write request rate across all interfaces. Critical for understanding write workload patterns and identifying write-heavy phases.
sum(irate(alluxio_data_access_bytes_count{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(sum(irate(alluxio_s3_api_throughput_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0)) or on() vector(0), sum(irate(alluxio_data_access_bytes_count{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)
reqps
Request/s - Metadata
Displays the metadata operation request rate across all interfaces. Important for understanding metadata workload intensity, which can impact overall cluster performance.
sum(alluxio_fuse_concurrency{job="fuse",method=~"Fuse\.Create|Fuse\.Getattr|Fuse\.Readdir|…
Jobs
Displays the number of running and waiting jobs as a time series. This is important for understanding background activity levels in the cluster and identifying periods of high maintenance activity that might impact performance.
sum(alluxio_active_job_count{job="coordinator",type="running"}), sum(alluxio_active_job_count{job="coordinator",type="waiting"})
short
Request Latency - P90
Displays the 90th percentile latency for all request types. This metric is key to understanding the typical user experience and identifying performance degradation.
histogram_quantile(0.90, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",cluster_name=~"$cluster"}[5m])) by (le, method)), histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, op)), histogram_quantile(0.90, sum(rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method)), histogram_quantile(0.90, sum(rate(alluxio_ufs_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method, ufs_type))
ms
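The latency panels rely on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below is a simplified model of that estimation for classic histograms, not the Prometheus implementation; the bucket boundaries and counts are invented for illustration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper bound `le`, cumulative count) pairs."""
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    # A usable histogram needs at least two buckets, the last one being +Inf.
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        return buckets[-2][0]  # rank falls in +Inf: report the highest finite bound
    upper = buckets[idx][0]
    lower = buckets[idx - 1][0] if idx > 0 else 0.0
    below = buckets[idx - 1][1] if idx > 0 else 0.0
    in_bucket = buckets[idx][1] - below
    if in_bucket == 0:
        return lower
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Hypothetical latency histogram in milliseconds.
sample = [(10.0, 50.0), (50.0, 90.0), (100.0, 99.0), (math.inf, 100.0)]
print(histogram_quantile(0.95, sample))
```

The estimate can never be more precise than the bucket layout, so a P99 that sits exactly on a bucket boundary often means the real value lies somewhere inside or beyond that bucket.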
Request Latency - P99
Displays the 99th percentile latency for all request types. This metric is critical for identifying worst-case performance and outliers that may indicate system stress.
histogram_quantile(0.99, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",cluster_name=~"$cluster"}[5m])) by (le, method)), histogram_quantile(0.99, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, op)), histogram_quantile(0.99, sum(rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method)), histogram_quantile(0.99, sum(rate(alluxio_ufs_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method, ufs_type))
ms
RPC (Client<->Worker) Latency - P90
Displays the estimated network latency for client GetStatus gRPC calls (metadata operations), calculated as the difference between client-observed latency and worker metadata processing latency. This helps identify network-related delays and connectivity issues between clients and the cluster.
(histogram_quantile(0.90, sum(rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",method="GetStatus",instance=~"$instance",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0, (histogram_quantile(0.90, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",method="Fuse.Getattr",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0
ms
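The estimate described above is a plain subtraction followed by a `> 0` filter, which drops the result whenever the worker-side percentile is at least as large as the client-side one. A minimal Python sketch with hypothetical names and latency values:

```python
from typing import Optional


def estimated_network_latency_ms(client_ms: float, worker_ms: float) -> Optional[float]:
    """Client-observed percentile minus worker-side percentile, kept only if positive."""
    diff = client_ms - worker_ms
    return diff if diff > 0 else None  # the `> 0` filter removes non-positive results


# Hypothetical P90 values in milliseconds.
print(estimated_network_latency_ms(12.5, 4.0))
print(estimated_network_latency_ms(3.0, 4.0))
```

The difference of two percentiles is not the percentile of the per-request differences, so the panel value is best read as a rough indicator and not as an exact network measurement.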
RPC (Client<->Worker) Latency - P99
Displays the estimated network latency for client GetStatus gRPC calls (metadata operations), calculated as the difference between client-observed latency and worker metadata processing latency. This helps identify network-related delays and connectivity issues between clients and the cluster.
(histogram_quantile(0.99, sum(rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",method="GetStatus",instance=~"$instance",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.99, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0
ms
Request Call - Success Rate
Displays the success rate for all request types across interfaces. A value below 99% (excluding Not Found errors) typically indicates system issues that require investigation.
(sum by (method) (irate(alluxio_fuse_call_latency_ms_count{job="fuse", success="true",cluster_name=~"$cluster"}[5m]))) / (sum by (method) (irate(alluxio_fuse_call_latency_ms_count{job="fuse", cluster_name=~"$cluster"}[5m]))), (sum by (method) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="true",cluster_name=~"$cluster"}[5m]))) / (sum by (method) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", cluster_name=~"$cluster"}[5m]))), (sum by (op) (irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m]) - irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster"}[5m]))) / (sum by (op) (irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m]))), (sum by (ufs_type) (irate(alluxio_ufs_total{cluster_name=~"$cluster"}[5m]) - irate(alluxio_ufs_error_total{cluster_name=~"$cluster"}[5m]))) / (sum by (ufs_type) (irate(alluxio_ufs_latency_ms_count{cluster_name=~"$cluster"}[5m])))
percentunit
green: > 0.9
Request Call - Failures
Displays the failure rate across all request types and interfaces. Spikes in this metric require immediate investigation to identify system or application errors.
sum by (method, state) (irate(alluxio_fuse_result_total{job="fuse",state!="SUCCESS", cluster_name=~"$cluster"}[5m])), sum by (method, status) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="false",cluster_name=~"$cluster"}[5m])), sum by (op) (irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster"}[5m])), sum by (ufs_type,error_code) (irate(alluxio_ufs_error_total{cluster_name=~"$cluster"}[5m]))
short
Fuse Row
Request Latency - P90
Displays the 90th percentile latency for individual Fuse mount operations. This helps isolate performance issues specific to a particular Fuse client.
histogram_quantile(0.90, rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m])), histogram_quantile( 0.90, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )
ms
Request Latency - P99
Displays the 99th percentile latency for individual Fuse mount operations. This helps isolate performance issues specific to a particular Fuse client.
histogram_quantile(0.99, rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m])), histogram_quantile( 0.99, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )
ms
Throughput - Read
Displays the read throughput for an individual Fuse mount. This is useful for monitoring the read activity of a specific client.
irate(alluxio_data_access_bytes_sum{job="fuse",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Throughput - Write
Displays the write throughput for an individual Fuse mount. This is useful for monitoring the write activity of a specific client.
irate(alluxio_data_access_bytes_sum{job="fuse",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Request/s - Read
Displays the read request rate for an individual Fuse mount. This helps understand the read workload from a specific client.
irate(alluxio_data_access_bytes_count{job="fuse",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])
reqps
Request/s - Write
Displays the write request rate for an individual Fuse mount. This helps understand the write workload from a specific client.
irate(alluxio_data_access_bytes_count{job="fuse",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])
reqps
FUSE Request Failure
Displays the failure rate for an individual Fuse mount. This is critical for diagnosing client-specific connectivity or operational issues.
irate(alluxio_fuse_result_total{job="fuse",state!="SUCCESS", instance=~"$instance",cluster_name=~"$cluster"}[5m])
short
Throughput - UFS Fallback
Displays the read throughput when FUSE operations fall back to the underlying file system (UFS). A high fallback rate indicates cache misses or data that is not in Alluxio, which reduces the performance benefit of the cache. This value should stay minimal for optimal performance.
irate(alluxio_ufs_data_access_bytes_total{job="fuse",instance=~"$instance",method="read",cluster_name=~"$cluster"}[5m]) or on() vector(0)
binBps
Client Request Latency (P99)
Displays the latency distribution of gRPC (metadata ops) and Netty (data ops) client calls. This helps identify RPC communication delays, client-side bottlenecks, and potential network or service responsiveness issues within the workers.
histogram_quantile( 0.99, sum( rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, method, instance) ), histogram_quantile( 0.99, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )
ms
Client Request Concurrency
Displays the number of concurrent gRPC client calls (metadata ops). This helps identify spikes in metadata request load and potential client-side bottlenecks within the worker.
alluxio_grpc_client_concurrency{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}
reqps
Client Request Failure Rate
Displays the error rate of gRPC client calls (metadata operations) and Netty operation calls (data operations). This helps identify failing metadata requests and potential reliability issues in client-worker communication within the cluster.
sum by (instance,method) (irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) / (sum by (instance,method) (irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) + sum by (instance,method) (irate(alluxio_grpc_client_successes_total{instance=~"$instance",cluster_name=~"$cluster"}[5m]))), sum by (instance,op) (irate(alluxio_netty_operation_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) / (sum by (instance,op) (irate(alluxio_netty_operations_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])))
percentunit
green: > 0.9
Client Request Errors
Displays the total number of gRPC client errors (metadata operations) and Netty operation errors (data operations). This helps track failing requests and potential reliability issues in client-worker communication within the cluster.
irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m]), irate(alluxio_netty_operation_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])
short
S3 Row
Request Latency - P90
Displays the 90th percentile latency for S3 API operations on a per-worker basis. This helps isolate S3 performance issues to specific workers.
histogram_quantile(0.90, rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]))
ms
Request Latency - P99
Displays the 99th percentile latency for S3 API operations on a per-worker basis. This helps identify worst-case S3 performance on specific workers.
histogram_quantile(0.99, rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]))
ms
Throughput - Read
Displays the read throughput for S3 API operations on a per-worker basis. This is useful for monitoring S3 read activity on specific workers.
irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Throughput - Write
Displays the write throughput for S3 API operations on a per-worker basis. This is useful for monitoring S3 write activity on specific workers.
irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Request/s - Read
Displays the read request rate for S3 API operations on a per-worker basis. This helps understand the S3 read workload on specific workers.
irate(alluxio_s3_api_throughput_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])
reqps
Request/s - Write
Displays the write request rate for S3 API operations on a per-worker basis. This helps understand the S3 write workload on specific workers.
irate(alluxio_s3_api_throughput_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])
reqps
Request Failure
Displays the failure rate for S3 API operations on a per-worker basis. This is critical for diagnosing S3-specific issues on individual workers.
irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="false",instance=~"$instance",cluster_name=~"$cluster"}[5m])
short
Worker Row
Storage (Data) Used
Displays the amount of cache data storage used by each individual worker. This is useful for identifying uneven storage distribution.
alluxio_cached_storage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}
bytes
Storage (Meta) Used
Displays the amount of cache metadata storage used by each individual worker. This is useful for identifying uneven storage distribution.
alluxio_metastore_storage_size_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}
bytes
Storage Device Usage
Displays the page store device capacity utilization per directory on workers. This helps identify disk space consumption trends and potential storage exhaustion or imbalance within the cluster.
1- (sum by (instance, dir) (alluxio_page_store_device_available_capacity_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) / sum by (instance, dir) (alluxio_page_store_device_total_capacity_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}))
percentunit
yellow: > 0.8, red: > 0.9
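The utilization query groups by `(instance, dir)` and computes one minus the available-to-total capacity ratio for each group. The following Python sketch mirrors that grouping; the worker name, directory paths, and byte counts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, float, float]  # (instance, dir, available_bytes, total_bytes)


def device_usage(rows: Iterable[Row]) -> Dict[Tuple[str, str], float]:
    """1 - sum(available) / sum(total), grouped by (instance, dir)."""
    available = defaultdict(float)
    total = defaultdict(float)
    for instance, directory, avail_bytes, total_bytes in rows:
        key = (instance, directory)
        available[key] += avail_bytes
        total[key] += total_bytes
    # Skip groups without capacity to avoid dividing by zero.
    return {key: 1 - available[key] / total[key] for key in total if total[key] > 0}


# Hypothetical worker with two page store directories.
rows = [
    ("worker-a:30000", "/mnt/cache0", 200.0, 1000.0),
    ("worker-a:30000", "/mnt/cache1", 50.0, 1000.0),
]
for key, usage in device_usage(rows).items():
    print(key, f"{usage:.0%}")
```

Keeping `dir` in the grouping is what lets the panel surface a single full disk on a worker whose overall usage still looks healthy.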
Files
Displays the number of cached files for each individual worker. This helps in understanding the cache composition on a per-worker basis.
alluxio_data_cached_files{job="worker",instance=~"$instance",cluster_name=~"$cluster"}
short
Pages
Displays the number of cached pages for each individual worker. This helps in understanding the cache composition on a per-worker basis.
alluxio_data_cached_pages{job="worker",instance=~"$instance",cluster_name=~"$cluster"}
short
Cache Evicted
Displays the rate of data eviction from the cache for each individual worker. This helps identify which workers are under the most cache capacity pressure.
irate(alluxio_cached_evicted_data_bytes_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Request Latency - P90
Displays the 90th percentile latency for internal metadata operations, along with the 90th percentile Netty read time to send the first packet, on a per-worker basis.
histogram_quantile(0.90, rate(alluxio_meta_operation_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])), histogram_quantile( 0.90, sum( rate(alluxio_worker_netty_read_time_to_send_first_packet_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )
ms
Request Latency - P99
Displays the 99th percentile latency for internal metadata operations, along with the 99th percentile Netty read time to send the first packet, on a per-worker basis.
histogram_quantile(0.99, rate(alluxio_meta_operation_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])), histogram_quantile( 0.99, sum( rate(alluxio_worker_netty_read_time_to_send_first_packet_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )
ms
Request (from Page Store or UFS) Latency - P90
Displays the latency distribution for read operations served by Alluxio workers from the page store or underlying file system (UFS). This helps identify read performance bottlenecks and distinguish slow storage backends or overloaded workers within the cluster.
histogram_quantile( 0.90, sum( rate(alluxio_worker_netty_read_storage_response_time_ms_bucket{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) ) by (le, instance) )
ms
PageStore IO Latency - P90
Displays the 90th percentile latency of page store I/O operations (data operations) on workers. This helps identify slow disk I/O performance and potential storage bottlenecks within the cluster.
histogram_quantile( 0.90, sum( rate(alluxio_page_store_io_latency_microseconds_bucket{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) ) by (le, op, instance) )
µs
Read - Throughput
Displays the read throughput for each individual worker. This is useful for monitoring the read activity and load on each worker.
irate(alluxio_data_throughput_bytes_total{job="worker",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Write - Throughput
Displays the write throughput for each individual worker. This is useful for monitoring the write activity and load on each worker.
irate(alluxio_data_throughput_bytes_total{job="worker",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Read - Request/s
Displays the read request rate for each individual worker. This helps understand the read workload distribution across workers.
irate(alluxio_data_access_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])
reqps
Write - Request/s
Displays the write request rate for each individual worker. This helps understand the write workload distribution across workers.
irate(alluxio_data_access_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])
reqps
Read - Threads
Displays the active threads and queue length for the read thread pool on each worker. This is useful for diagnosing read performance bottlenecks.
alluxio_rpc_executor_max_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_active_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}
short
Write - Threads
Displays the active threads and queue length for the write thread pool on each worker. This is useful for diagnosing write performance bottlenecks.
alluxio_rpc_executor_max_threads{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_active_threads{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}
short
Metadata - Request/s
Displays the metadata request rate for each individual worker. This helps identify workers with high metadata load.
irate(alluxio_meta_operation_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])
reqps
Metadata - Threads
Displays the active threads and queue length for the metadata thread pool on each worker. This is useful for diagnosing metadata performance bottlenecks.
alluxio_rpc_executor_max_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_active_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_queue_length{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}
short
Request Failure
Displays the failure rate for internal metadata operations on a per-worker basis. This helps pinpoint workers that are experiencing internal errors.
irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster", instance=~"$instance"}[5m])
short
Off Heap Memory
Displays the off-heap memory usage for each worker. This is important for monitoring memory resources and preventing out-of-memory errors.
sum(alluxio_rocksdb_memory_usage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) by (instance) + sum(alluxio_netty_direct_memory_usage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) by (instance) + sum(jvm_memory_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",area="nonheap"}) by (instance) + sum(jvm_buffer_pool_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",pool="direct"}) by (instance) + sum(jvm_buffer_pool_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",pool="mapped"}) by (instance)
bytes
PageStore Errors
Displays the error rate for PageStore operations on each worker. This can indicate issues with the local cache storage layer.
irate(alluxio_page_store_operation_errors_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m])
short
PageStore Disk Error Rate
Displays the error rate for disk-related PageStore operations on each worker. This can indicate underlying disk hardware or file system issues.
sum by (instance, dir) (irate(alluxio_page_store_dir_operation_errors_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) or on() vector(0)) / sum by (instance, dir) (irate(alluxio_page_store_dir_operations_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]))
percentunit
yellow: > 0.1, red: > 0.3
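The ratio and thresholds for this panel can be sketched in Python as follows. This is only an illustration of the arithmetic, not the dashboard's implementation: the function names are hypothetical, and "green" as the base color below the yellow threshold is an assumption.

```python
from typing import Optional


def error_ratio(errors_per_sec: float, ops_per_sec: float) -> Optional[float]:
    """Return errors / operations, or None when there were no operations."""
    if ops_per_sec == 0:
        return None  # nothing meaningful to plot for an idle directory
    return errors_per_sec / ops_per_sec


def threshold_color(ratio: Optional[float],
                    yellow: float = 0.1,
                    red: float = 0.3) -> str:
    """Map a ratio to the panel's bands (yellow: > 0.1, red: > 0.3)."""
    if ratio is None:
        return "no data"
    if ratio > red:
        return "red"
    if ratio > yellow:
        return "yellow"
    return "green"  # assumed base color below the yellow threshold


if __name__ == "__main__":
    # 20 failed operations out of 100 per second falls in the yellow band.
    print(threshold_color(error_ratio(20.0, 100.0)))
```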
UFS Row
Request Latency - P90
Displays the 90th percentile latency for UFS operations on a per-worker basis. This helps identify slow UFS interactions from specific workers.
histogram_quantile(0.90, rate(alluxio_ufs_latency_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]))
ms
Request Latency - P99
Displays the 99th percentile latency for UFS operations on a per-worker basis. This helps identify worst-case UFS performance from specific workers.
histogram_quantile(0.99, rate(alluxio_ufs_latency_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]))
ms
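For readers unfamiliar with how histogram_quantile derives the P90 and P99 values above, the following is a simplified Python sketch of the bucket interpolation it performs. The bucket boundaries and counts are made up for illustration, and the sketch omits edge cases that Prometheus handles (such as negative bucket bounds), so treat it as an approximation of the idea rather than the exact algorithm.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) buckets.

    The bucket list must include the +Inf bucket and at least one finite one.
    """
    if not 0 < q < 1:
        raise ValueError("this sketch only handles 0 < q < 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need a +Inf bucket and at least one finite bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, cum) in enumerate(buckets) if cum >= rank)
    upper, cum = buckets[index]
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    prev = buckets[index - 1][1] if index > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev) / (cum - prev)


if __name__ == "__main__":
    # Hypothetical per-second bucket rates for one worker (le in ms).
    latency = [(10.0, 50.0), (50.0, 80.0), (100.0, 95.0), (math.inf, 100.0)]
    print(histogram_quantile(0.90, latency))  # roughly 83.3 ms
    print(histogram_quantile(0.99, latency))  # 100.0, the last finite bound
```

One consequence worth noting: when the requested quantile falls in the +Inf bucket, the result is capped at the highest finite bucket boundary, so a flat P99 line at that boundary may understate the real latency.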
Throughput - Read
Displays the read throughput from the UFS on a per-worker basis. This is useful for monitoring how much data is being read from the underlying storage.
irate(alluxio_ufs_data_access_bytes_total{method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Throughput - Write
Displays the write throughput to the UFS on a per-worker basis. This is useful for monitoring how much data is being written to the underlying storage.
irate(alluxio_ufs_data_access_bytes_total{method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Errors
Displays the error rate for UFS operations on a per-worker basis. This is critical for diagnosing issues with the underlying storage system.
irate(alluxio_ufs_error_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])
none
Job Row
Jobs
Displays the number of running and waiting jobs over time. This helps gauge background activity in the cluster and identify periods of heavy maintenance activity that might affect performance.
alluxio_active_job_count{job="coordinator",type="running"}, alluxio_active_job_count{job="coordinator",type="waiting"}
short
Job Tasks per Worker
Displays the number of job tasks on each worker over time. This helps gauge per-worker activity and identify bottlenecks that might affect performance.
alluxio_worker_job_task_count{}
short
Job Threads per Worker
Displays the maximum, active, and current thread counts and the queue length for the load job thread pool on each worker. This is useful for diagnosing load job performance bottlenecks.
alluxio_rpc_executor_max_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_active_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_queue_length{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}
short
Job Dispatched Per Second
Displays the rate of jobs dispatched per worker.
sum by(worker)(irate(alluxio_distributed_load_job_dispatched_size_total{job="coordinator", instance=~"$instance"} [5m]))
Distributed Load Throughput
Displays the throughput of data loaded from the UFS by load jobs, both summed across all workers and for each individual worker. This is useful for monitoring the load job activity on each worker.
sum(irate(alluxio_distributed_load_data_loaded_from_ufs_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])), irate(alluxio_distributed_load_data_loaded_from_ufs_bytes_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Bytes Loaded on Workers Per Second
Displays the rate of bytes loaded, both per worker as reported by the workers and in total as reported by the coordinator.
sum by(instance) (irate(alluxio_distributed_worker_bytes_loaded_bytes_total{job="worker", instance=~"$instance"} [5m])), sum by()(irate(alluxio_distributed_load_job_loaded_bytes_total{job="coordinator", instance=~"$instance"} [5m]))
binBps
Distributed Load Operation Counts Per Second
Displays the per-second rates of scanned, processed, skipped, and failed distributed load operations.
irate(alluxio_distributed_load_job_scanned_total{job="coordinator", instance=~"$instance"}[5m]), irate(alluxio_distributed_load_job_processed_total{job="coordinator", instance=~"$instance"}[5m]), irate(alluxio_distributed_load_job_skipped_total{job="coordinator", instance=~"$instance"}[5m]), sum by()(irate(alluxio_distributed_load_job_failure_total{job="coordinator", instance=~"$instance"}[5m]))
short
Distributed Load Failure Breakdowns
Displays the failure breakdown by reason and worker.
sum by(reason, worker) (irate(alluxio_distributed_load_job_failure_total{job="coordinator", instance=~"$instance"} [5m]))
Process Row
Active Worker Membership
Displays the rate of worker membership refreshes on each instance; only instances with a nonzero rate are shown. This helps track cluster topology updates and potential instability or frequent reconfiguration of workers within the cluster.
irate(alluxio_worker_membership_refresh_count_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) > 0
short
Resource Pool Usage
Displays the utilization ratio of dynamic resource pools, calculated as current resources over maximum capacity. This helps identify over- or under-utilized pools and potential resource bottlenecks within the cluster.
sum by (instance, pool_kind, pool_instance) (alluxio_dynamic_resource_pool_current_resources{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}) / sum by (instance, pool_kind, pool_instance) (alluxio_dynamic_resource_pool_capacity{type="max",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"})
percentunit
yellow: > 0.8
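The query above sums both metrics by the same three labels and then divides matching groups. A minimal Python sketch of that grouping follows; the sample label values and function names are hypothetical, and zero-capacity groups are simply skipped here, whereas PromQL would return +Inf or NaN for them.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]  # (labels, value)
GROUP_BY = ("instance", "pool_kind", "pool_instance")


def sum_by(samples: List[Sample], keys=GROUP_BY) -> Dict[tuple, float]:
    """Sum sample values per group, like `sum by (...)` in PromQL."""
    totals: Dict[tuple, float] = defaultdict(float)
    for labels, value in samples:
        totals[tuple(labels[k] for k in keys)] += value
    return dict(totals)


def pool_usage(current: List[Sample], capacity: List[Sample]) -> Dict[tuple, float]:
    """Divide current resources by max capacity for groups present on both sides."""
    cur, cap = sum_by(current), sum_by(capacity)
    return {k: cur[k] / cap[k] for k in cur if k in cap and cap[k] != 0}


if __name__ == "__main__":
    pool = {"instance": "worker-1", "pool_kind": "netty", "pool_instance": "a"}
    current = [(pool, 3.0), (pool, 5.0)]  # two series in the same group
    capacity = [(pool, 10.0)]
    for group, ratio in pool_usage(current, capacity).items():
        print(group, ratio, "yellow" if ratio > 0.8 else "ok")
```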
Resource Pool - Acquisition Timeout
Displays the rate of dynamic resource pool acquisition timeouts. This helps identify contention or delays in resource allocation and potential performance bottlenecks within the cluster.
irate(alluxio_dynamic_resource_pool_acquisition_timeouts_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])
short
Resource Pool - Resource Creation Latency (P99)
Displays the 99th percentile latency for creating new resources in dynamic resource pools. This helps identify slow resource allocation and potential performance bottlenecks within the cluster.
histogram_quantile( 0.99, sum( rate(alluxio_dynamic_resource_pool_create_new_resource_latency_ms_bucket{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, pool_kind, pool_instance, instance) )
ms
CPU time spent
Displays the CPU time spent by each process.
irate(process_cpu_seconds_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])
µs
Threads
Displays the number of threads for each process.
jvm_threads_current{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}
Heap Usage
Displays the heap memory usage for each process.
jvm_memory_used_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}
bytes
Heap Usage(%)
Displays the heap memory usage percentage for each process.
jvm_memory_used_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"} / jvm_memory_max_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}
percentunit
yellow: > 0.9
young GC time(per minute)
Displays the young generation garbage collection time per minute.
irate(jvm_gc_collection_seconds_sum{gc="G1 Young Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60
s
young GC rate(per minute)
Displays the young generation garbage collection rate per minute.
irate(jvm_gc_collection_seconds_count{gc="G1 Young Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60
old GC time(per minute)
Displays the old generation garbage collection time per minute.
irate(jvm_gc_collection_seconds_sum{gc="G1 Old Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60
s
old GC rate(per minute)
Displays the old generation garbage collection rate per minute.
irate(jvm_gc_collection_seconds_count{gc="G1 Old Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60
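The four GC panels above all multiply an irate value by 60 to convert a per-second rate into a per-minute figure. The Python sketch below illustrates that calculation under simplifying assumptions: irate is modeled as the increase between the last two samples in the window divided by their time difference, with a counter reset treated as a restart from zero. The sample values are invented.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def irate(samples: List[Sample]) -> Optional[float]:
    """Per-second rate from the last two samples, oldest sample first."""
    if len(samples) < 2:
        return None  # not enough data in the window
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        return None
    delta = v_last - v_prev
    if delta < 0:
        # The counter went backwards, so assume the process restarted.
        delta = v_last
    return delta / (t_last - t_prev)


def per_minute(samples: List[Sample]) -> Optional[float]:
    """Scale the per-second rate to a per-minute value, as the panels do."""
    rate = irate(samples)
    return None if rate is None else rate * 60


if __name__ == "__main__":
    # Hypothetical jvm_gc_collection_seconds_sum samples scraped every 15 s.
    gc_seconds = [(0.0, 1.20), (15.0, 1.25), (30.0, 1.28)]
    print(per_minute(gc_seconds))  # about 0.12 s of GC time per minute
```

Because only the last two samples are used, these panels react quickly to short GC bursts but can look noisy compared with a rate averaged over the whole window.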
ETCD Row
Up
Displays the number of nodes in the ETCD cluster that currently have a leader.
sum(etcd_server_has_leader{job="etcd"})
yellow: > 1, green: > 3
RPC Rate
Displays the RPC request and failure rates.
sum(rate(grpc_server_started_total{job="etcd",grpc_type="unary"}[5m])), sum(rate(grpc_server_handled_total{job="etcd",grpc_type="unary",grpc_code!="OK"}[5m]))
ops
Active Streams
Displays the number of active watch and lease streams.
sum(grpc_server_started_total{job="etcd",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{job="etcd",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}), sum(grpc_server_started_total{job="etcd",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{job="etcd",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"})
ETCD Client Call Failure Rate
Displays the error rate of Etcd client calls. This helps identify failing Etcd requests and potential reliability or connectivity issues between the cluster and the Etcd service.
(sum by (instance, server) (irate(alluxio_etcd_call_errors_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0)) / (sum by (instance, server) (irate(alluxio_etcd_client_calls_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])))
percentunit
yellow: > 0.05, red: > 0.2
ETCD Client Call Latency (P99)
Displays the 99th percentile latency of Etcd client calls. This helps identify slow Etcd operations and potential performance bottlenecks in Etcd communication within the cluster.
histogram_quantile( 0.99, sum( rate(alluxio_etcd_client_call_latency_ms_bucket{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )
ms
DB Size
Displays the size of the ETCD database.
etcd_mvcc_db_total_size_in_bytes{job="etcd"}
bytes
Disk Sync Duration
Displays the 99th percentile duration of WAL fsync and backend commit operations for each ETCD instance.
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le)), histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le))
Memory
Displays the resident memory usage of the ETCD process.
process_resident_memory_bytes{job="etcd"}
bytes
Client Traffic In
Displays the rate of inbound client gRPC traffic.
rate(etcd_network_client_grpc_received_bytes_total{job="etcd"}[5m])
binBps
Client Traffic Out
Displays the rate of outbound client gRPC traffic.
rate(etcd_network_client_grpc_sent_bytes_total{job="etcd"}[5m])
binBps
Peer Traffic In
Displays the rate of inbound peer traffic for each ETCD instance.
sum(rate(etcd_network_peer_received_bytes_total{job="etcd"}[5m])) by (instance)
binBps
Peer Traffic Out
Displays the rate of outbound peer traffic for each ETCD instance.
sum(rate(etcd_network_peer_sent_bytes_total{job="etcd"}[5m])) by (instance)
binBps
Raft Proposals
Displays the raft proposal metrics including failure rate, pending total, commit rate, and apply rate.
sum(rate(etcd_server_proposals_failed_total{job="etcd"}[5m])), sum(etcd_server_proposals_pending{job="etcd"}), sum(rate(etcd_server_proposals_committed_total{job="etcd"}[5m])), sum(rate(etcd_server_proposals_applied_total{job="etcd"}[5m]))
none
Total Leader Elections Per Day
Displays the total number of leader elections per day.
changes(etcd_server_leader_changes_seen_total{job="etcd"}[1d])
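The changes() function counts how many times a series' value differs from the previous sample within the window. A small Python sketch of that counting follows; the sample values are invented for illustration.

```python
from typing import List


def changes(values: List[float]) -> int:
    """Count transitions where a sample differs from the one before it."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)


if __name__ == "__main__":
    # Hypothetical etcd_server_leader_changes_seen_total samples over one day.
    seen = [3, 3, 4, 4, 4, 6]
    print(changes(seen))  # 2
```

In this example the counter rose by three (from 3 to 6) but only two value changes were observed, because the jump from 4 to 6 happened between two scrapes. The panel can therefore undercount elections that occur in quick succession.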