# Grafana Dashboard

## Overview

This document provides detailed descriptions of all panels in the Alluxio Grafana dashboard, including metric names, calculation methods, and threshold configurations.
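
As a reading aid for the Threshold column, the sketch below parses an entry such as `yellow: > 0.9, red: > 0.95` and returns the color that applies to a given value. It is a minimal illustration, not Grafana's implementation: the function names, the `"base"` fallback color, and the strict greater-than comparison (taken literally from the table notation) are all assumptions.

```python
import re
from typing import List, Tuple


def parse_thresholds(spec: str) -> List[Tuple[str, float]]:
    """Parse 'yellow: > 0.9, red: > 0.95' into [('yellow', 0.9), ('red', 0.95)]."""
    steps = [
        (color, float(bound))
        for color, bound in re.findall(r"(\w+)\s*:\s*>\s*([0-9.]+)", spec)
    ]
    # Sort by bound so the highest exceeded step wins in color_for().
    return sorted(steps, key=lambda step: step[1])


def color_for(value: float, spec: str, base: str = "base") -> str:
    """Return the color of the highest step whose bound the value exceeds.

    'base' is a hypothetical placeholder for the panel's default color.
    """
    result = base
    for color, bound in parse_thresholds(spec):
        if value > bound:
            result = color
    return result


storage_spec = "yellow: > 0.9, red: > 0.95"
print(color_for(0.92, storage_spec))
print(color_for(0.97, storage_spec))
```

Under these assumptions, a Storage usage ratio of 0.92 falls in the yellow step and 0.97 falls in the red step.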

***

### Cluster Row

| Panel Name                              | Description                                                                                                                                                                                                                                                                                            | Calculation Method                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | Unit                | Threshold                                                                               |
| --------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------- | --------------------------------------------------------------------------------------- |
| **Storage**                             | Displays the total capacity, used storage, and usage percentage of the Alluxio cache. This panel is critical for monitoring overall cache utilization and preventing out-of-storage issues.                                                                                                            | `sum(alluxio_cached_capacity_bytes{job="worker",cluster_name=~"$cluster"})`, `sum(alluxio_cached_storage_bytes{job="worker",cluster_name=~"$cluster"})`, `sum(alluxio_cached_storage_bytes{job="worker",cluster_name=~"$cluster"}) / sum(alluxio_cached_capacity_bytes{job="worker",cluster_name=~"$cluster"})`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | bytes               | yellow: > 0.9, red: > 0.95                                                              |
| **Read - Throughput**                   | Shows the read throughput from different sources: Fuse, S3 API, and direct Worker access. This helps identify the primary data access interfaces for read operations.                                                                                                                                  | `sum(irate(alluxio_data_access_bytes_sum{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_data_throughput_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `avg(avg_over_time(alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-reader"}[5m])) or on() vector(0)`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | binBps              | yellow: > 100, red: > 2000                                                              |
| **Read - Load (5m)**                    | Displays the average read load (thread utilization) across all workers over the last 5 minutes. A high average load may indicate a system-wide performance bottleneck.                                                                                                                                 | `avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]))`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | percentunit         | yellow: > 0.8                                                                           |
| **Read - Hotspot (load > 50%)**         | Displays per-worker read load (for workers exceeding 50% utilization), and highlights the most heavily used workers. This helps identify uneven I/O distribution and potential hotspots within the cluster.                                                                                            | `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5`, `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5`, `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5`                                                                                                                                                                                                                                                                                                                                                                               | percentunit         | yellow: > 0.8                                                                           |
| **Cache File**                          | Displays the total number of files and pages currently stored in the Alluxio cache. This helps in understanding the composition and granularity of the cached data.                                                                                                                                    | `sum(alluxio_data_cached_files{job="worker",cluster_name=~"$cluster"})`, `sum(alluxio_data_cached_pages{job="worker",cluster_name=~"$cluster"})`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | short               |                                                                                         |
| **Write - Throughput**                  | Shows the write throughput from different sources: Fuse, S3 API, and direct Worker access. This helps identify the primary data access interfaces for write operations.                                                                                                                                | `sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_data_throughput_bytes_total{job="worker",destination="write_buffer",method="write"}[5m])) or on() vector(0)`, `avg(avg_over_time(alluxio_rpc_executor_current_queue_length{job="worker",executor_name="grpc-metadata"}[5m]))`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | binBps              | yellow: > 100, red: > 2000                                                              |
| **Write - Load (5m)**                   | Displays the average write load (thread utilization) across all workers over the last 5 minutes. A high average load may indicate a system-wide performance bottleneck.                                                                                                                                | `avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]))`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | percentunit         | yellow: > 0.8                                                                           |
| **Write - Hotspot (load > 50%)**        | Displays per-worker write load (for workers exceeding 50% utilization), and highlights the most heavily used workers. This helps identify uneven I/O distribution and potential hotspots within the cluster.                                                                                           | `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5`, `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5`, `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5`                                                                                                                                                                                                                                                                                                                                                                               | percentunit         | yellow: > 0.8                                                                           |
| **Workers**                             | Displays the number of live (healthy) and lost (unhealthy) workers in the cluster. This is a key indicator of cluster health and stability.                                                                                                                                                            | `sum(up{job="worker",cluster_name=~"$cluster"})`, `sum by () (prometheus_target_scrape_pool_targets{scrape_job="worker"}) - sum by () (up{job="worker",cluster_name=~"$cluster"})`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |                     | yellow: > 1                                                                             |
| **Jobs**                                | Displays the number of currently running and waiting jobs in the job service queue. This helps monitor the status of asynchronous operations like data loading.                                                                                                                                        | `sum(alluxio_active_job_count{job="coordinator",type="running"})`, `sum(alluxio_active_job_count{job="coordinator",type="waiting"})`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | short               | yellow: > 100                                                                           |
| **Meta - RPS**                          | Displays the metadata Requests Per Second (RPS) from different sources: Fuse, S3 API, and direct Worker access. This is important for monitoring metadata workload intensity.                                                                                                                          | \`sum(alluxio\_fuse\_concurrency{job="fuse",method=\~"Fuse\\.Create\\                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | Fuse\\.Getattr\\    | Fuse\\.Readdir\\                                                                        |
| **Meta - Load (5m)**                    | Displays the average metadata operation load (thread utilization) across all workers over the last 5 minutes. High load can indicate a metadata-intensive workload.                                                                                                                                    | `avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]))`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | percentunit         | yellow: > 0.8                                                                           |
| **Meta - Hotspot (load > 50%)**         | Displays per-worker metadata operation load (for workers exceeding 50% utilization), and highlights the most heavily used workers. This helps identify uneven I/O distribution and potential hotspots within the cluster.                                                                              | `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5`, `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5`, `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5`                                                                                                                                                                                                                                                                                                                                                                               | percentunit         | yellow: > 0.8                                                                           |
| **License (Valid To)**                  | Displays the expiration date of the Alluxio enterprise license. This is important for ensuring uninterrupted access to enterprise features.                                                                                                                                                            | \`min(alluxio\_license\_expiration\_date{job=\~"coordinator\\                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | worker"}) \* 1000\` | dateTimeFromNow                                                                         |
| **Component Version**                   | Displays the distribution of Alluxio versions running across the cluster. This is useful for verifying version consistency during upgrades or troubleshooting.                                                                                                                                         | `count (alluxio_version_info{cluster_name=~"$cluster"}) by (version)`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |                     |                                                                                         |
| **Component Uptime**                    | Displays the uptime of each Alluxio component (coordinator, worker, fuse). This helps track service stability and identify recent restarts.                                                                                                                                                            | \`timestamp(process\_start\_time\_seconds{job=\~"coordinator\\                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | worker\\            | fuse",cluster\_name=~~"$cluster"}) - process\_start\_time\_seconds{job=~~"coordinator\\ |
| **Cache Hit(%)**                        | Displays the percentage of data/metadata read requests served from the Alluxio cache. A higher value indicates better cache efficiency and reduced load on the underlying storage.                                                                                                                     | `sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) / (sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) + (sum(irate(alluxio_missed_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0)))`, `sum(irate(alluxio_metadata_cache_hit_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) / (sum(irate(alluxio_metadata_cache_hit_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) + sum(irate(alluxio_metadata_cache_miss_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0))`                                                                                                                                                                                                                                                                                                                                                                              | percentunit         | green: > 0.8                                                                            |
| **Cache Eviction**                      | Displays the rate at which data is being evicted from the cache. A high eviction rate may indicate insufficient cache capacity for the workload.                                                                                                                                                       | `sum(irate(alluxio_cached_evicted_data_bytes_total{job="worker",cluster_name=~"$cluster"}[5m]))`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | binBps              |                                                                                         |
| **Throughput - Read**                   | Provides a detailed breakdown of read throughput across different data paths (FUSE, S3, Worker, UFS). Essential for understanding data flow and identifying bottlenecks.                                                                                                                               | `sum(irate(alluxio_data_access_bytes_sum{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector`, `sum(irate(alluxio_data_throughput_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_ufs_data_access_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)`                                                                                                                                                                                                                                                                                                                                                                                          | binBps              |                                                                                         |
| **Throughput - Write**                  | Provides a detailed breakdown of write throughput across different data paths (FUSE, S3, Worker, UFS). Critical for understanding write efficiency and patterns.                                                                                                                                       | `sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector`, `sum(irate(alluxio_data_throughput_bytes_total{job="worker",destination="write_buffer",method="write"}[5m])) or on() vector(0)`, `sum(irate(alluxio_write_buffer_async_persist_throughput_bytes_total{job="worker"}[5m])) or on() vector(0)`, `sum(irate(alluxio_ufs_data_access_bytes_total{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)`                                                                                                                                                                                                                                                                                                                                                                               | binBps              |                                                                                         |
| **Request/s - Read**                    | Displays the total read request rate across all interfaces (FUSE, S3, Worker). Important for understanding read workload intensity and traffic patterns.                                                                                                                                               | `sum(irate(alluxio_data_access_bytes_count{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_s3_api_throughput_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_data_access_bytes_count{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | reqps               |                                                                                         |
| **Request/s - Write**                   | Displays the total write request rate across all interfaces. Critical for understanding write workload patterns and identifying write-heavy phases.                                                                                                                                                    | `sum(irate(alluxio_data_access_bytes_count{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(sum(irate(alluxio_s3_api_throughput_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0)) or on() vector(0)`, `sum(irate(alluxio_data_access_bytes_count{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | reqps               |                                                                                         |
| **Request/s - Metadata**                | Displays the metadata operation request rate across all interfaces. Important for understanding metadata workload intensity, which can impact overall cluster performance. | `sum(alluxio_fuse_concurrency{job="fuse",method=~"Fuse\\.Create\|Fuse\\.Getattr\|Fuse\\.Readdir\|…` |                     |                                                                                         |
| **Jobs**                                | Time series visualization of job execution patterns. Important for understanding cluster background activity levels and identifying periods of high maintenance activity that might impact performance.                                                                                                | `sum(alluxio_active_job_count{job="coordinator",type="running"})`, `sum(alluxio_active_job_count{job="coordinator",type="waiting"})`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | short               |                                                                                         |
| **Request Latency - P90**               | Displays the 90th percentile latency for all request types. This metric is key to understanding the typical user experience and identifying performance degradation.                                                                                                                                   | `histogram_quantile(0.90, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",cluster_name=~"$cluster"}[5m])) by (le, method))`, `histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, op))`, `histogram_quantile(0.90, sum(rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method))`, `histogram_quantile(0.90, sum(rate(alluxio_ufs_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method, ufs_type))`                                                                                                                                                                                                                                                                                                                                                                                                                                                    | ms                  |                                                                                         |
| **Request Latency - P99**               | Displays the 99th percentile latency for all request types. This metric is critical for identifying worst-case performance and outliers that may indicate system stress.                                                                                                                               | `histogram_quantile(0.99, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",cluster_name=~"$cluster"}[5m])) by (le, method))`, `histogram_quantile(0.99, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, op))`, `histogram_quantile(0.99, sum(rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method))`, `histogram_quantile(0.99, sum(rate(alluxio_ufs_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method, ufs_type))`                                                                                                                                                                                                                                                                                                                                                                                                                                                    | ms                  |                                                                                         |
| **RPC (Client<->Worker) Latency - P90** | Displays the estimated network latency for client GetStatus gRPC calls (metadata operations), calculated as the difference between client-observed latency and worker metadata processing latency. This helps identify network-related delays and connectivity issues between clients and the cluster. | `(histogram_quantile(0.90, sum(rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",method="GetStatus",instance=~"$instance",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0`, `(histogram_quantile(0.90, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",method="Fuse.Getattr",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0`                                                                                                                                                                                                                                                                                                                                                                     | ms                  |                                                                                         |
| **RPC (Client<->Worker) Latency - P99** | Displays the estimated network latency for client GetStatus gRPC calls (metadata operations), calculated as the difference between client-observed latency and worker metadata processing latency. This helps identify network-related delays and connectivity issues between clients and the cluster. | `(histogram_quantile(0.99, sum(rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",method="GetStatus",instance=~"$instance",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.99, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | ms                  |                                                                                         |
| **Request Call - Success Rate**         | Displays the success rate for all request types across interfaces. A value below 99% (excluding Not Found errors) typically indicates system issues that require investigation.                                                                                                                        | `(sum by (method) (irate(alluxio_fuse_call_latency_ms_count{job="fuse", success="true",cluster_name=~"$cluster"}[5m]))) / (sum by (method) (irate(alluxio_fuse_call_latency_ms_count{job="fuse", cluster_name=~"$cluster"}[5m])))`, `(sum by (method) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="true",cluster_name=~"$cluster"}[5m]))) / (sum by (method) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", cluster_name=~"$cluster"}[5m])))`, `(sum by (op) (irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m]) - irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster"}[5m]))) / (sum by (op) (irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m])))`, `(sum by (ufs_type) (irate(alluxio_ufs_total{cluster_name=~"$cluster"}[5m]) - irate(alluxio_ufs_error_total{cluster_name=~"$cluster"}[5m]))) / (sum by (ufs_type) (irate(alluxio_ufs_latency_ms_count{cluster_name=~"$cluster"}[5m])))` | percentunit         | green: > 0.9                                                                            |
| **Request Call - Failures**             | Displays the failure rate across all request types and interfaces. Spikes in this metric require immediate investigation to identify system or application errors.                                                                                                                                     | `sum by (method, state) (irate(alluxio_fuse_result_total{job="fuse",state!="SUCCESS", cluster_name=~"$cluster"}[5m]))`, `sum by (method, status) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="false",cluster_name=~"$cluster"}[5m]))`, `sum by (op) (irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster"}[5m]))`, `sum by (ufs_type,error_code) (irate(alluxio_ufs_error_total{cluster_name=~"$cluster"}[5m]))`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | short               |                                                                                         |
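
The latency panels above rely on `histogram_quantile`, and the RPC (Client<->Worker) panels subtract the worker-side quantile from the client-side quantile and keep only positive results (`> 0`). The following Python sketch illustrates that arithmetic under simplifying assumptions: it reimplements the linear interpolation used for classic Prometheus histograms in a reduced form, and the bucket boundaries and counts are invented for illustration rather than taken from a real cluster.

```python
import math


def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_ms, cumulative_count) pairs that
    includes a (math.inf, total_count) entry, as a classic Prometheus
    histogram would expose through its `le` label.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if buckets[-1][0] != math.inf or total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The quantile falls in the +Inf bucket: report the
                # highest finite upper bound instead.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


def estimated_network_latency(client_ms, worker_ms):
    """Mirror `(client - worker) > 0`: drop non-positive differences."""
    diff = client_ms - worker_ms
    return diff if diff > 0 else None


# Hypothetical bucket data for one 5-minute window.
client_buckets = [(1, 10), (5, 60), (10, 95), (50, 100), (math.inf, 100)]
worker_buckets = [(1, 50), (5, 95), (10, 100), (math.inf, 100)]

client_p90 = histogram_quantile(0.90, client_buckets)
worker_p90 = histogram_quantile(0.90, worker_buckets)
print(estimated_network_latency(client_p90, worker_p90))
```

Because the two quantiles come from different histograms, the difference is only an estimate. Dropping non-positive values hides the points where the worker-side quantile exceeds the client-side one.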

***

### Fuse Row

| Panel Name                       | Description                                                                                                                                                                                                                                       | Calculation Method                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | Unit        | Threshold    |
| -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------- | ------------ |
| **Request Latency - P90**        | Displays the 90th percentile latency for individual Fuse mount operations. This helps isolate performance issues specific to a particular Fuse client.                                                                                            | `histogram_quantile(0.90, rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m]))`, `histogram_quantile( 0.90, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )`                                                                                                                                                                                                                                                                                                        | ms          |              |
| **Request Latency - P99**        | Displays the 99th percentile latency for individual Fuse mount operations. This helps isolate performance issues specific to a particular Fuse client.                                                                                            | `histogram_quantile(0.99, rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m]))`, `histogram_quantile( 0.99, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )`                                                                                                                                                                                                                                                                                                        | ms          |              |
| **Throughput - Read**            | Displays the read throughput for an individual Fuse mount. This is useful for monitoring the read activity of a specific client.                                                                                                                  | `irate(alluxio_data_access_bytes_sum{job="fuse",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | binBps      |              |
| **Throughput - Write**           | Displays the write throughput for an individual Fuse mount. This is useful for monitoring the write activity of a specific client.                                                                                                                | `irate(alluxio_data_access_bytes_sum{job="fuse",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | binBps      |              |
| **Request/s - Read**             | Displays the read request rate for an individual Fuse mount. This helps understand the read workload from a specific client.                                                                                                                      | `irate(alluxio_data_access_bytes_count{job="fuse",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | reqps       |              |
| **Request/s - Write**            | Displays the write request rate for an individual Fuse mount. This helps understand the write workload from a specific client.                                                                                                                    | `irate(alluxio_data_access_bytes_count{job="fuse",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | reqps       |              |
| **FUSE Request Failure**         | Displays the failure rate for an individual Fuse mount. This is critical for diagnosing client-specific connectivity or operational issues.                                                                                                       | `irate(alluxio_fuse_result_total{job="fuse",state!="SUCCESS", instance=~"$instance",cluster_name=~"$cluster"}[5m])`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | short       |              |
| **Throughput - UFS Fallback**    | Displays the read throughput served when FUSE operations fall back to the underlying file system (UFS). A high fallback rate indicates cache misses or data that is not in Alluxio, which reduces the performance benefit of caching. This should stay minimal for optimal performance. | `irate(alluxio_ufs_data_access_bytes_total{job="fuse",instance=~"$instance",method="read",cluster_name=~"$cluster"}[5m]) or on() vector(0)` | binBps      |              |
| **Client Request Latency (P99)** | Displays the 99th percentile latency of gRPC (metadata ops) and Netty (data ops) client calls. This helps identify RPC communication delays, client-side bottlenecks, and network or worker responsiveness issues. | `histogram_quantile( 0.99, sum( rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, method, instance) )`, `histogram_quantile( 0.99, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )` | ms          |              |
| **Client Request Concurrency**   | Displays the number of concurrent gRPC client calls (metadata ops). This helps identify spikes in metadata request load and potential client-side bottlenecks within the worker.                                                                  | `alluxio_grpc_client_concurrency{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | reqps       |              |
| **Client Request Failure Rate**  | Displays the error rate of gRPC client calls (metadata operations) and Netty operation calls (data operations). This helps identify failing metadata requests and potential reliability issues in client-worker communication within the cluster. | `sum by (instance,method) (irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) / (sum by (instance,method) (irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) + sum by (instance,method) (irate(alluxio_grpc_client_successes_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])))`, `sum by (instance,op) (irate(alluxio_netty_operation_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) / (sum by (instance,op) (irate(alluxio_netty_operations_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])))` | percentunit | green: > 0.9 |
| **Client Request Errors**        | Displays the per-second rate of gRPC client errors (metadata operations) and Netty operation errors (data operations). This helps track failing requests and potential reliability issues in client-worker communication within the cluster. | `irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])`, `irate(alluxio_netty_operation_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])` | short       |              |
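
The gRPC query in the **Client Request Failure Rate** panel divides the error rate by the sum of the error and success rates for each `(instance, method)` pair. The following Python sketch shows the same ratio on plain dictionaries. The label values and rates are hypothetical, and the handling of unmatched series and of `0 / 0` is a simplified approximation of PromQL's one-to-one vector matching.

```python
import math


def failure_ratio(error_rates, success_rates):
    """Compute errors / (errors + successes) per label set.

    Both arguments map a (instance, method) tuple to a per-second rate.
    Label sets present on only one side are dropped, similar to how a
    PromQL binary operation ignores series without a match.
    """
    ratios = {}
    for key, errors in error_rates.items():
        if key not in success_rates:
            continue
        total = errors + success_rates[key]
        # PromQL yields NaN for 0 / 0; do the same here.
        ratios[key] = errors / total if total > 0 else math.nan
    return ratios


# Hypothetical per-second rates.
error_rates = {
    ("fuse-0", "GetStatus"): 2.0,
    ("fuse-0", "ListStatus"): 0.0,
    ("fuse-1", "GetStatus"): 1.0,
}
success_rates = {
    ("fuse-0", "GetStatus"): 98.0,
    ("fuse-0", "ListStatus"): 0.0,
}

for labels, ratio in failure_ratio(error_rates, success_rates).items():
    print(labels, ratio)
```

Idle label sets show up as `NaN`, which Grafana typically renders as a gap rather than as zero.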

***

### S3 Row

| Panel Name                | Description                                                                                                                                          | Calculation Method                                                                                                                       | Unit   | Threshold |
| ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- | ------ | --------- |
| **Request Latency - P90** | Displays the 90th percentile latency for S3 API operations on a per-worker basis. This helps isolate S3 performance issues to specific workers.      | `histogram_quantile(0.90, rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]))` | ms     |           |
| **Request Latency - P99** | Displays the 99th percentile latency for S3 API operations on a per-worker basis. This helps identify worst-case S3 performance on specific workers. | `histogram_quantile(0.99, rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]))` | ms     |           |
| **Throughput - Read**     | Displays the read throughput for S3 API operations on a per-worker basis. This is useful for monitoring S3 read activity on specific workers.        | `irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])`              | binBps |           |
| **Throughput - Write**    | Displays the write throughput for S3 API operations on a per-worker basis. This is useful for monitoring S3 write activity on specific workers.      | `irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])`             | binBps |           |
| **Request/s - Read**      | Displays the read request rate for S3 API operations on a per-worker basis. This helps understand the S3 read workload on specific workers.          | `irate(alluxio_s3_api_throughput_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])`            | reqps  |           |
| **Request/s - Write**     | Displays the write request rate for S3 API operations on a per-worker basis. This helps understand the S3 write workload on specific workers.        | `irate(alluxio_s3_api_throughput_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])`           | reqps  |           |
| **Request Failure**       | Displays the failure rate for S3 API operations on a per-worker basis. This is critical for diagnosing S3-specific issues on individual workers.     | `irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="false",instance=~"$instance",cluster_name=~"$cluster"}[5m])`          | short  |           |
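
The throughput and request-rate panels above use `irate(...[5m])`, which derives an instantaneous per-second rate from the last two samples inside the range window. The following Python sketch shows that calculation, including the counter-reset case. The sample values are invented, and the window boundary handling is simplified (both ends are treated as inclusive, which may differ between Prometheus versions).

```python
def irate(samples, window_seconds, now):
    """Instantaneous rate from the last two samples in the window.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Returns None when fewer than two samples fall inside the window.
    """
    in_window = [(t, v) for t, v in samples if now - window_seconds <= t <= now]
    if len(in_window) < 2:
        return None
    (t1, v1), (t2, v2) = in_window[-2], in_window[-1]
    if t2 == t1:
        return None
    # A counter that went down is treated as a reset: the new value
    # is taken as the increase since the reset.
    delta = v2 - v1 if v2 >= v1 else v2
    return delta / (t2 - t1)


# Hypothetical byte counter scraped every 15 seconds.
samples = [(0, 100), (15, 400), (30, 1000)]
print(irate(samples, window_seconds=300, now=30))
```

Because only the last two samples matter, `irate` reacts quickly to bursts but can look spiky. The `[5m]` range only bounds how far back those two samples may be found.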

***

### Worker Row

| Panel Name                                         | Description                                                                                                                                                                                                                                                           | Calculation Method                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | Unit        | Threshold                 |
| -------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------- | ------------------------- |
| **Storage (Data) Used**                            | Displays the amount of cache data storage used by each individual worker. This is useful for identifying uneven storage distribution.                                                                                                                                 | `alluxio_cached_storage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | bytes       |                           |
| **Storage (Meta) Used**                            | Displays the amount of cache metadata storage used by each individual worker. This is useful for identifying uneven storage distribution.                                                                                                                             | `alluxio_metastore_storage_size_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | bytes       |                           |
| **Storage Device Usage**                           | Displays the page store device capacity utilization per directory on workers. This helps identify disk space consumption trends and potential storage exhaustion or imbalance within the cluster.                                                                     | `1- (sum by (instance, dir) (alluxio_page_store_device_available_capacity_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) / sum by (instance, dir) (alluxio_page_store_device_total_capacity_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}))`                                                                                                                                                                                                                                                                                                                                      | percentunit | yellow: > 0.8, red: > 0.9 |
| **Files**                                          | Displays the number of cached files for each individual worker. This helps in understanding the cache composition on a per-worker basis.                                                                                                                              | `alluxio_data_cached_files{job="worker",instance=~"$instance",cluster_name=~"$cluster"}`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | short       |                           |
| **Pages**                                          | Displays the number of cached pages for each individual worker. This helps in understanding the cache composition on a per-worker basis.                                                                                                                              | `alluxio_data_cached_pages{job="worker",instance=~"$instance",cluster_name=~"$cluster"}`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | short       |                           |
| **Cache Evicted**                                  | Displays the rate of data eviction from the cache for each individual worker. This helps identify which workers are under the most memory pressure.                                                                                                                   | `irate(alluxio_cached_evicted_data_bytes_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | binBps      |                           |
| **Request Latency - P90**                          | Displays the 90th percentile latency for internal metadata operations and the time to send the first packet of Netty reads, on a per-worker basis.                                                                                                                    | `histogram_quantile(0.90, rate(alluxio_meta_operation_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]))`, `histogram_quantile( 0.90, sum( rate(alluxio_worker_netty_read_time_to_send_first_packet_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )`                                                                                                                                                                                                                                                                                    | ms          |                           |
| **Request Latency - P99**                          | Displays the 99th percentile latency for internal metadata operations and the time to send the first packet of Netty reads, on a per-worker basis.                                                                                                                    | `histogram_quantile(0.99, rate(alluxio_meta_operation_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]))`, `histogram_quantile( 0.99, sum( rate(alluxio_worker_netty_read_time_to_send_first_packet_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )`                                                                                                                                                                                                                                                                                    | ms          |                           |
| **Request (from Page Store or UFS) Latency - P90** | Displays the latency distribution for read operations served by Alluxio workers from the page store or underlying file system (UFS). This helps identify read performance bottlenecks and distinguish slow storage backends or overloaded workers within the cluster. | `histogram_quantile( 0.90, sum( rate(alluxio_worker_netty_read_storage_response_time_ms_bucket{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) ) by (le, instance) )`                                                                                                                                                                                                                                                                                                                                                                                                                                   | ms          |                           |
| **PageStore IO Latency - P90**                     | Displays the 90th percentile latency of page store I/O operations (data operations) on workers. This helps identify slow disk I/O performance and potential storage bottlenecks within the cluster.                                                                   | `histogram_quantile( 0.90, sum( rate(alluxio_page_store_io_latency_microseconds_bucket{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) ) by (le, op, instance) )`                                                                                                                                                                                                                                                                                                                                                                                                                                       | µs          |                           |
| **Read - Throughput**                              | Displays the read throughput for each individual worker. This is useful for monitoring the read activity and load on each worker.                                                                                                                                     | `irate(alluxio_data_throughput_bytes_total{job="worker",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | binBps      |                           |
| **Write - Throughput**                             | Displays the write throughput for each individual worker. This is useful for monitoring the write activity and load on each worker.                                                                                                                                   | `irate(alluxio_data_throughput_bytes_total{job="worker",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | binBps      |                           |
| **Read - Request/s**                               | Displays the read request rate for each individual worker. This helps understand the read workload distribution across workers.                                                                                                                                       | `irate(alluxio_data_access_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | reqps       |                           |
| **Write - Request/s**                              | Displays the write request rate for each individual worker. This helps understand the write workload distribution across workers.                                                                                                                                     | `irate(alluxio_data_access_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | reqps       |                           |
| **Read - Threads**                                 | Displays the maximum, current, and active thread counts and the queue length for the read thread pool on each worker. This is useful for diagnosing read performance bottlenecks.                                                                                     | `alluxio_rpc_executor_max_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_active_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_current_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}`                                                                                      | short       |                           |
| **Write - Threads**                                | Displays the maximum and active thread counts and the queue length for the write thread pool on each worker. This is useful for diagnosing write performance bottlenecks.                                                                                             | `alluxio_rpc_executor_max_threads{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_active_threads{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}`                                                                                                                                                                                                                        | short       |                           |
| **Metadata - Request/s**                           | Displays the metadata request rate for each individual worker. This helps identify workers with high metadata load.                                                                                                                                                   | `irate(alluxio_meta_operation_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | reqps       |                           |
| **Metadata - Threads**                             | Displays the maximum, current, and active thread counts and the queue length for the metadata thread pool on each worker. This is useful for diagnosing metadata performance bottlenecks.                                                                             | `alluxio_rpc_executor_max_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_active_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_current_queue_length{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_current_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}`                                                                                  | short       |                           |
| **Request Failure**                                | Displays the failure rate for internal metadata operations on a per-worker basis. This helps pinpoint workers that are experiencing internal errors.                                                                                                                  | `irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster", instance=~"$instance"}[5m])`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | short       |                           |
| **Off Heap Memory**                                | Displays the off-heap memory usage for each worker. This is important for monitoring memory resources and preventing out-of-memory errors.                                                                                                                            | `sum(alluxio_rocksdb_memory_usage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) by (instance) + sum(alluxio_netty_direct_memory_usage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) by (instance) + sum(jvm_memory_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",area="nonheap"}) by (instance) + sum(jvm_buffer_pool_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",pool="direct"}) by (instance) + sum(jvm_buffer_pool_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",pool="mapped"}) by (instance)` | bytes       |                           |
| **PageStore Errors**                               | Displays the error rate for PageStore operations on each worker. This can indicate issues with the local cache storage layer.                                                                                                                                         | `irate(alluxio_page_store_operation_errors_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m])`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | short       |                           |
| **PageStore Disk Error Rate**                      | Displays the error rate for disk-related PageStore operations on each worker. This can indicate underlying disk hardware or file system issues.                                                                                                                       | `sum by (instance, dir) (irate(alluxio_page_store_dir_operation_errors_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) or on() vector(0)) / sum by (instance, dir) (irate(alluxio_page_store_dir_operations_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]))`                                                                                                                                                                                                                                                                                                           | percentunit | yellow: > 0.1, red: > 0.3 |

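The two thresholded panels in this row, **Storage Device Usage** and **PageStore Disk Error Rate**, both reduce to a ratio that is compared against a yellow and a red limit. The sketch below mirrors that arithmetic in Python for illustration only. The function names, the sample numbers, and the `green` base color are hypothetical, and the strict `>` comparison is an assumption based on how the thresholds are written in the table, not on the dashboard JSON.

```python
def device_usage(available_bytes: float, total_bytes: float) -> float:
    """Utilization ratio for one (instance, dir): 1 - available / total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 1.0 - available_bytes / total_bytes


def classify(value: float, yellow: float, red: float) -> str:
    """Map a ratio to a color using strict '>' comparisons (assumed)."""
    if value > red:
        return "red"
    if value > yellow:
        return "yellow"
    return "green"  # assumed base color when no threshold is exceeded


# Hypothetical sample: 150 GiB free on a 1 TiB cache directory.
usage = device_usage(150 * 2**30, 1024 * 2**30)
print(f"{usage:.3f}", classify(usage, yellow=0.8, red=0.9))

# PageStore Disk Error Rate has the same shape with limits 0.1 and 0.3.
error_ratio = 5 / 100  # hypothetical: 5 failed out of 100 operations per second
print(classify(error_ratio, yellow=0.1, red=0.3))
```

Because both panels use the `percentunit` unit, the limits are fractions (0.8 means 80%), not percentages.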
***

### UFS Row

| Panel Name                | Description                                                                                                                                             | Calculation Method                                                                                                  | Unit   | Threshold |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- | ------ | --------- |
| **Request Latency - P90** | Displays the 90th percentile latency for UFS operations on a per-worker basis. This helps identify slow UFS interactions from specific workers.         | `histogram_quantile(0.90, rate(alluxio_ufs_latency_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]))` | ms     |           |
| **Request Latency - P99** | Displays the 99th percentile latency for UFS operations on a per-worker basis. This helps identify worst-case UFS performance from specific workers.    | `histogram_quantile(0.99, rate(alluxio_ufs_latency_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]))` | ms     |           |
| **Throughput - Read**     | Displays the read throughput from the UFS on a per-worker basis. This is useful for monitoring how much data is being read from the underlying storage. | `irate(alluxio_ufs_data_access_bytes_total{method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])`      | binBps |           |
| **Throughput - Write**    | Displays the write throughput to the UFS on a per-worker basis. This is useful for monitoring how much data is being written to the underlying storage. | `irate(alluxio_ufs_data_access_bytes_total{method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])`     | binBps |           |
| **Errors**                | Displays the error rate for UFS operations on a per-worker basis. This is critical for diagnosing issues with the underlying storage system.            | `irate(alluxio_ufs_error_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])`                                | none   |           |

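The **Request Latency** panels rely on `histogram_quantile`, which estimates a percentile from cumulative `_bucket` series rather than from raw observations. The following is a minimal Python sketch of that estimate, assuming the linear interpolation within a bucket that Prometheus documents for classic histograms. The bucket bounds and counts are invented for the example, and the sketch skips edge cases such as `q` outside the open interval (0, 1).

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count or rate)


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets."""
    if not 0 < q < 1:
        raise ValueError("this sketch only handles 0 < q < 1")
    buckets = sorted(buckets)
    if len(buckets) < 2:
        return math.nan
    if not math.isinf(buckets[-1][0]):
        raise ValueError("the highest bucket must have le = +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile lies in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical UFS latency buckets in milliseconds.
latency_ms = [(10, 50.0), (50, 80.0), (100, 95.0), (math.inf, 100.0)]
print(histogram_quantile(0.90, latency_ms))
print(histogram_quantile(0.99, latency_ms))
```

One consequence worth keeping in mind when reading these panels: the reported value can never exceed the highest finite bucket bound, so a flat P99 line at that bound may mean the real latency is higher.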
***

### Job Row

| Panel Name                                       | Description                                                                                                                                                                                             | Calculation Method                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | Unit   | Threshold |
| ------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ | --------- |
| **Jobs**                                         | Time series visualization of job execution patterns. Important for understanding cluster background activity levels and identifying periods of high maintenance activity that might impact performance. | `alluxio_active_job_count{job="coordinator",type="running"}`, `alluxio_active_job_count{job="coordinator",type="waiting"}`                                                                                                                                                                                                                                                                                                                                                                                                                 | short  |           |
| **Job Tasks per Worker**                         | Time series visualization of job execution tasks on each worker. Important for understanding worker activity levels and identifying bottlenecks that might impact performance.                          | `alluxio_worker_job_task_count{}`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | short  |           |
| **Job Threads per Worker**                       | Displays the maximum, current, and active thread counts and the queue length for the load job thread pool on each worker. This is useful for diagnosing load job performance bottlenecks.               | `alluxio_rpc_executor_max_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_active_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_current_queue_length{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_current_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}` | short  |           |
| **Job Dispatched Per Second**                    | Displays the rate of jobs dispatched per worker.                                                                                                                                                        | `sum by(worker)(irate(alluxio_distributed_load_job_dispatched_size_total{job="coordinator", instance=~"$instance"} [5m]))`                                                                                                                                                                                                                                                                                                                                                                                                                 |        |           |
| **Distributed Load Throughput**                  | Displays the load job throughput for each individual worker. This is useful for monitoring the load job activity on each worker.                                                                        | `sum(irate(alluxio_distributed_load_data_loaded_from_ufs_bytes_total{job="worker",cluster_name=~"$cluster"}[5m]))`, `irate(alluxio_distributed_load_data_loaded_from_ufs_bytes_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])`                                                                                                                                                                                                                                                                                    | binBps |           |
| **Bytes Loaded on Workers Per Second**           | Displays the rate of bytes loaded on workers.                                                                                                                                                           | `sum by(instance) (irate(alluxio_distributed_worker_bytes_loaded_bytes_total{job="worker", instance=~"$instance"} [5m]))`, `sum by()(irate(alluxio_distributed_load_job_loaded_bytes_total{job="coordinator", instance=~"$instance"} [5m]))`                                                                                                                                                                                                                                                                                               | binBps |           |
| **Distributed Load Operation Counts Per Second** | Displays the rate of distributed load operations.                                                                                                                                                       | `irate(alluxio_distributed_load_job_scanned_total{job="coordinator", instance=~"$instance"}[5m])`, `irate(alluxio_distributed_load_job_processed_total{job="coordinator", instance=~"$instance"}[5m])`, `irate(alluxio_distributed_load_job_skipped_total{job="coordinator", instance=~"$instance"}[5m])`, `sum by()(irate(alluxio_distributed_load_job_failure_total{job="coordinator", instance=~"$instance"}[5m]))`                                                                                                                     | short  |           |
| **Distributed Load Failure Breakdowns**          | Displays the failure breakdown by reason and worker.                                                                                                                                                    | `sum by(reason, worker) (irate(alluxio_distributed_load_job_failure_total{job="coordinator", instance=~"$instance"} [5m]))`                                                                                                                                                                                                                                                                                                                                                                                                                |        |           |

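Most rate panels in this row use `irate(...[5m])`, which in Prometheus derives a per-second rate from only the last two samples inside the range window, so short bursts of load activity show up sharply. A minimal Python sketch of that behavior follows. The sample timestamps and counter values are invented, and `window_average_rate` is a deliberately simplified stand-in for `rate()` (no extrapolation, no reset handling) included only for contrast.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp_seconds, counter_value)


def irate(samples: Sequence[Sample]) -> float:
    """Per-second rate computed from the last two samples in the window."""
    if len(samples) < 2:
        # Prometheus returns no result here; the sketch raises instead.
        raise ValueError("need at least two samples in the range window")
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        raise ValueError("timestamps must be strictly increasing")
    # A drop in value is treated as a counter reset: count up from zero.
    increase = v_last if v_last < v_prev else v_last - v_prev
    return increase / (t_last - t_prev)


def window_average_rate(samples: Sequence[Sample]) -> float:
    """Simplified average over the whole window (not the real rate())."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


# Hypothetical counter scraped every 15 seconds, with a burst at the end.
window = [(0, 100), (15, 130), (30, 190), (45, 200), (60, 500)]
print(irate(window))                # reflects only the final 15-second step
print(window_average_rate(window))  # smooths the burst across 60 seconds
```

Because only two samples matter, the length of the `[5m]` window mainly controls how far back the query may look for those samples, not how much smoothing is applied.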
***

### Process Row

| Panel Name                                          | Description                                                                                                                                                                                                               | Calculation Method                                                                                                                                                                                                                                                                                                                | Unit        | Threshold     |
| --------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------- | ------------- |
| **Active Worker Membership**                        | Displays the per-second rate of worker membership refreshes on each instance, showing only instances where the rate is non-zero. This helps track cluster topology updates and detect instability or frequent reconfiguration of workers within the cluster. | `irate(alluxio_worker_membership_refresh_count_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) > 0`                                                                                                                                                                                                    | short       |               |
| **Resource Pool Usage**                             | Displays the utilization ratio of dynamic resource pools, calculated as current resources over maximum capacity. This helps identify over- or under-utilized pools and potential resource bottlenecks within the cluster. | `sum by (instance, pool_kind, pool_instance) (alluxio_dynamic_resource_pool_current_resources{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}) / sum by (instance, pool_kind, pool_instance) (alluxio_dynamic_resource_pool_capacity{type="max",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"})` | percentunit | yellow: > 0.8 |
| **Resource Pool - Acquisition Timeout**             | Displays the rate of dynamic resource pool acquisition timeouts. This helps identify contention or delays in resource allocation and potential performance bottlenecks within the cluster.                                | `irate(alluxio_dynamic_resource_pool_acquisition_timeouts_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])`                                                                                                                                                                                             | short       |               |
| **Resource Pool - Resource Creation Latency (P99)** | Displays the 99th percentile latency for creating new resources in dynamic resource pools. This helps identify slow resource allocation and potential performance bottlenecks within the cluster.                         | `histogram_quantile( 0.99, sum( rate(alluxio_dynamic_resource_pool_create_new_resource_latency_ms_bucket{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, pool_kind, pool_instance, instance) )`                                                                                                    | ms          |               |
| **CPU time spent**                                  | Displays the CPU time spent by each process.                                                                                                                                                                              | `irate(process_cpu_seconds_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])`                                                                                                                                                                                                                            | µs          |               |
| **Threads**                                         | Displays the number of threads for each process.                                                                                                                                                                          | `jvm_threads_current{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}`                                                                                                                                                                                                                                             |             |               |
| **Heap Usage**                                      | Displays the heap memory usage for each process.                                                                                                                                                                          | `jvm_memory_used_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}`                                                                                                                                                                                                                               | bytes       |               |
| **Heap Usage(%)**                                   | Displays the heap memory usage percentage for each process.                                                                                                                                                               | `jvm_memory_used_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"} / jvm_memory_max_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}`                                                                                                                            | percentunit | yellow: > 0.9 |
| **young GC time(per minute)**                       | Displays the young generation garbage collection time per minute.                                                                                                                                                         | `irate(jvm_gc_collection_seconds_sum{gc="G1 Young Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60`                                                                                                                                                                                          | s           |               |
| **young GC rate(per minute)**                       | Displays the young generation garbage collection rate per minute.                                                                                                                                                         | `irate(jvm_gc_collection_seconds_count{gc="G1 Young Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60`                                                                                                                                                                                        |             |               |
| **old GC time(per minute)**                         | Displays the old generation garbage collection time per minute.                                                                                                                                                           | `irate(jvm_gc_collection_seconds_sum{gc="G1 Old Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60`                                                                                                                                                                                            | s           |               |
| **old GC rate(per minute)**                         | Displays the old generation garbage collection rate per minute.                                                                                                                                                           | `irate(jvm_gc_collection_seconds_count{gc="G1 Old Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60`                                                                                                                                                                                          |             |               |
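
Most panels in this row use `irate(...[5m])`, and the GC panels multiply the result by 60 to express it per minute. The following is a minimal Python sketch of that arithmetic, assuming the documented Prometheus behavior of using only the last two samples in the window and treating a drop in value as a counter reset. The sample timestamps and values are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# One sample: (timestamp in seconds, cumulative counter value).
Sample = Tuple[float, float]


def irate(samples: Sequence[Sample]) -> Optional[float]:
    """Per-second rate computed from the last two samples in the window."""
    if len(samples) < 2:
        return None  # not enough data points to compute a rate
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        return None
    # A counter that went down is assumed to have been reset to zero.
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / (t_last - t_prev)


# Hypothetical jvm_gc_collection_seconds_sum samples scraped every 15 s.
gc_seconds_sum = [(0.0, 1.20), (15.0, 1.25), (30.0, 1.31)]

per_second = irate(gc_seconds_sum)  # GC seconds spent per second of wall time
per_minute = per_second * 60        # the "* 60" scaling used by the GC panels
```

With these samples, only the last two points matter: 0.06 s of GC time over 15 s, which scales to roughly 0.24 s of GC time per minute. Because `irate` ignores earlier samples in the window, it reacts quickly to spikes but is noisier than `rate`.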

***

### ETCD Row

| Panel Name                         | Description                                                                                                                                                                         | Calculation Method                                                                                                                                                                                                                                                                                                                                                                                                           | Unit        | Threshold                  |
| ---------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------- | -------------------------- |
| **Up**                             | Displays the number of node in ETCD cluster.                                                                                                                                        | `sum(etcd_server_has_leader{job="etcd"})`                                                                                                                                                                                                                                                                                                                                                                                    |             | yellow: > 1, green: > 3    |
| **RPC Rate**                       | Displays the RPC request and failure rates.                                                                                                                                         | `sum(rate(grpc_server_started_total{job="etcd",grpc_type="unary"}[5m]))`, `sum(rate(grpc_server_handled_total{job="etcd",grpc_type="unary",grpc_code!="OK"}[5m]))`                                                                                                                                                                                                                                                           | ops         |                            |
| **Active Streams**                 | Displays the number of active watch and lease streams.                                                                                                                              | `sum(grpc_server_started_total{job="etcd",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{job="etcd",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"})`, `sum(grpc_server_started_total{job="etcd",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{job="etcd",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"})` |             |                            |
| **ETCD Client Call Failure Rate**  | Displays the error rate of ETCD client calls. This helps identify failing ETCD requests and potential reliability or connectivity issues between the cluster and the ETCD service.  | `(sum by (instance, server) (irate(alluxio_etcd_call_errors_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0)) / (sum by (instance, server) (irate(alluxio_etcd_client_calls_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])))`                                                                                                                       | percentunit | yellow: > 0.05, red: > 0.2 |
| **ETCD Client Call Latency (P99)** | Displays the 99th percentile latency of ETCD client calls. This helps identify slow ETCD operations and potential performance bottlenecks in ETCD communication within the cluster. | `histogram_quantile( 0.99, sum( rate(alluxio_etcd_client_call_latency_ms_bucket{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )`                                                                                                                                                                                                                                                  | ms          |                            |
| **DB Size**                        | Displays the size of the ETCD database.                                                                                                                                             | `etcd_mvcc_db_total_size_in_bytes{job="etcd"}`                                                                                                                                                                                                                                                                                                                                                                               | bytes       |                            |
| **Disk Sync Duration**             | Displays the disk sync duration for WAL and backend operations.                                                                                                                     | `histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le))`, `histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le))`                                                                                                                                                                              |             |                            |
| **Memory**                         | Displays the resident memory usage of the ETCD process.                                                                                                                             | `process_resident_memory_bytes{job="etcd"}`                                                                                                                                                                                                                                                                                                                                                                                  | bytes       |                            |
| **Client Traffic In**              | Displays the client traffic in rates.                                                                                                                                               | `rate(etcd_network_client_grpc_received_bytes_total{job="etcd"}[5m])`                                                                                                                                                                                                                                                                                                                                                        | binBps      |                            |
| **Client Traffic Out**             | Displays the client traffic out rates.                                                                                                                                              | `rate(etcd_network_client_grpc_sent_bytes_total{job="etcd"}[5m])`                                                                                                                                                                                                                                                                                                                                                            | binBps      |                            |
| **Peer Traffic In**                | Displays the peer traffic in rates.                                                                                                                                                 | `sum(rate(etcd_network_peer_received_bytes_total{job="etcd"}[5m])) by (instance)`                                                                                                                                                                                                                                                                                                                                            | binBps      |                            |
| **Peer Traffic Out**               | Displays the peer traffic out rates.                                                                                                                                                | `sum(rate(etcd_network_peer_sent_bytes_total{job="etcd"}[5m])) by (instance)`                                                                                                                                                                                                                                                                                                                                                | binBps      |                            |
| **Raft Proposals**                 | Displays the raft proposal metrics including failure rate, pending total, commit rate, and apply rate.                                                                              | `sum(rate(etcd_server_proposals_failed_total{job="etcd"}[5m]))`, `sum(etcd_server_proposals_pending{job="etcd"})`, `sum(rate(etcd_server_proposals_committed_total{job="etcd"}[5m]))`, `sum(rate(etcd_server_proposals_applied_total{job="etcd"}[5m]))`                                                                                                                                                                      | none        |                            |
| **Total Leader Elections Per Day** | Displays the total number of leader elections per day.                                                                                                                              | `changes(etcd_server_leader_changes_seen_total{job="etcd"}[1d])`                                                                                                                                                                                                                                                                                                                                                             |             |                            |
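
The P99 panels above rely on `histogram_quantile`, which estimates a quantile from cumulative `le` buckets by linear interpolation inside the bucket that contains the target rank. The following is a simplified Python sketch of that estimation, assuming positive bucket bounds and a final `+Inf` bucket. It omits several edge cases that Prometheus handles, and the bucket bounds and counts are hypothetical.

```python
import math
from typing import List, Tuple

# One bucket: (upper bound `le`, cumulative count or rate up to that bound).
Bucket = Tuple[float, float]


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan  # a +Inf bucket is required
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, prev_count = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Assume observations are spread evenly within the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical latency buckets in milliseconds.
latency_buckets = [(10.0, 90.0), (50.0, 98.0), (100.0, 100.0), (math.inf, 100.0)]
p99 = histogram_quantile(0.99, latency_buckets)
```

With these buckets the target rank is 99 out of 100, which falls halfway through the 50 to 100 ms bucket, so the estimate is about 75 ms. The accuracy of any such estimate depends on how finely the bucket bounds cover the latency range of interest.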

***

