# Grafana Dashboard

## Overview

This document describes every panel in the Alluxio Grafana dashboard, including the metric names, calculation methods, and threshold configuration.

***

### Cluster Row

| Panel | Description | Calculation | Unit | Threshold |
| --- | --- | --- | --- | --- |
| **Storage** | Shows the total capacity of the Alluxio cache, the amount of storage in use, and the usage percentage. This panel is essential for monitoring overall cache utilization and preventing storage exhaustion. | `sum(alluxio_cached_capacity_bytes{job="worker",cluster_name=~"$cluster"})`, `sum(alluxio_cached_storage_bytes{job="worker",cluster_name=~"$cluster"})`, `sum(alluxio_cached_storage_bytes{job="worker",cluster_name=~"$cluster"}) / sum(alluxio_cached_capacity_bytes{job="worker",cluster_name=~"$cluster"})` | bytes | yellow: > 0.9, red: > 0.95 |
| **Read - Throughput** | Shows read throughput by source: Fuse, S3 API, and direct Worker access. Helps identify the primary data access interface for read operations. | `sum(irate(alluxio_data_access_bytes_sum{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_data_throughput_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `avg(avg_over_time(alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-reader"}[5m])) or on() vector(0)` | binBps | yellow: > 100, red: > 2000 |
| **Read - Load (5m)** | Shows the average read load (thread utilization) across all workers over the past 5 minutes. A high average load may indicate a cluster-wide performance bottleneck. | `avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]))` | percentunit | yellow: > 0.8 |
| **Read - Hotspot (load > 50%)** | Shows the read load of each worker (only workers with utilization above 50% are shown) and highlights the most heavily loaded worker. Helps identify uneven I/O distribution and potential hotspots in the cluster. | `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5`, `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5`, `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5` | percentunit | yellow: > 0.8 |
| **Cache File** | Shows the total number of files and pages currently stored in the Alluxio cache. Helps in understanding the composition and granularity of the cached data. | `sum(alluxio_data_cached_files{job="worker",cluster_name=~"$cluster"})`, `sum(alluxio_data_cached_pages{job="worker",cluster_name=~"$cluster"})` | short | |
| **Write - Throughput** | Shows write throughput by source: Fuse, S3 API, and direct Worker access. Helps identify the primary data access interface for write operations. | `sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_data_throughput_bytes_total{job="worker",destination="write_buffer",method="write"}[5m])) or on() vector(0)`, `avg(avg_over_time(alluxio_rpc_executor_current_queue_length{job="worker",executor_name="grpc-metadata"}[5m]))` | binBps | yellow: > 100, red: > 2000 |
| **Write - Load (5m)** | Shows the average write load (thread utilization) across all workers over the past 5 minutes. A high average load may indicate a cluster-wide performance bottleneck. | `avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]))` | percentunit | yellow: > 0.8 |
| **Write - Hotspot (load > 50%)** | Shows the write load of each worker (only workers with utilization above 50% are shown) and highlights the most heavily loaded worker. Helps identify uneven I/O distribution and potential hotspots in the cluster. | `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5`, `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5`, `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5` | percentunit | yellow: > 0.8 |
| **Workers** | Shows the number of live (healthy) and lost (unhealthy) workers in the cluster. This is a key indicator of cluster health and stability. | `sum(up{job="worker",cluster_name=~"$cluster"})`, `sum by () (prometheus_target_scrape_pool_targets{scrape_job="worker"}) - sum by () (up{job="worker",cluster_name=~"$cluster"})` | | yellow: > 1 |
| **Jobs** | Shows the number of jobs currently running and waiting in the job service queue. Helps monitor the status of asynchronous operations such as data loading. | `sum(alluxio_active_job_count{job="coordinator",type="running"})`, `sum(alluxio_active_job_count{job="coordinator",type="waiting"})` | short | yellow: > 100 |
| **Meta - RPS** | Shows metadata requests per second (RPS) by source: Fuse, S3 API, and direct Worker access. Important for monitoring the intensity of the metadata workload. | `sum(alluxio_fuse_concurrency{job="fuse",method=~"Fuse\\.Create\|Fuse\\.Getattr\|Fuse\\.Readdir\|Fuse\\.Statfs\|Fuse\\.Unlink",cluster_name=~"$cluster"}) or on() vector(0)`, `sum(irate(alluxio_s3_api_call_latency_ms_count{job="worker", method!~"GetObject\|PutObject",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `avg(avg_over_time(alluxio_rpc_executor_current_queue_length{job="worker",executor_name="grpc-metadata"}[5m]))` | reqps | yellow: > 100, red: > 2000 |
| **Meta - Load (5m)** | Shows the average metadata operation load (thread utilization) across all workers over the past 5 minutes. A high load may indicate a metadata-heavy workload. | `avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]))` | percentunit | yellow: > 0.8 |
| **Meta - Hotspot (load > 50%)** | Shows the metadata operation load of each worker (only workers with utilization above 50% are shown) and highlights the most heavily loaded worker. Helps identify uneven I/O distribution and potential hotspots in the cluster. | `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5`, `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5`, `avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5` | percentunit | yellow: > 0.8 |
| **License (Valid To)** | Shows the expiration date of the Alluxio Enterprise license. Important for ensuring continued access to enterprise features. | `min(alluxio_license_expiration_date{job=~"coordinator\|worker"}) * 1000` | dateTimeFromNow | |
| **Component Version** | Shows the distribution of Alluxio versions running in the cluster. Used to verify version consistency during upgrades or troubleshooting. | `count (alluxio_version_info{cluster_name=~"$cluster"}) by (version)` | | |
| **Component Uptime** | Shows the uptime of each Alluxio component (coordinator, worker, fuse). Helps track service stability and identify recent restarts. | `timestamp(process_start_time_seconds{job=~"coordinator\|worker\|fuse",cluster_name=~"$cluster"}) - process_start_time_seconds{job=~"coordinator\|worker\|fuse",cluster_name=~"$cluster"}` | s | |
| **Cache Hit(%)**                        | 显示从 Alluxio 缓存中响应的数据/元数据读取请求百分比。值越高表示缓存效率越好，底层存储的压力也越小。                                       | `sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) / (sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) + (sum(irate(alluxio_missed_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0)))`, `sum(irate(alluxio_metadata_cache_hit_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) / (sum(irate(alluxio_metadata_cache_hit_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) + sum(irate(alluxio_metadata_cache_miss_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0))`                                                                                                                                                                                                                                                                                                                                                                              | percentunit         | green: > 0.8                                                                            |                 |                                                                                                                                                           |                                                                                                                                                                                                                                                                                                            |       |                            |
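| **Cache Eviction** | Shows the rate at which data is evicted from the cache. A high eviction rate may indicate that cache capacity is insufficient for the current workload. | `sum(irate(alluxio_cached_evicted_data_bytes_total{job="worker",cluster_name=~"$cluster"}[5m]))` | binBps |  |  |  |  |  |  |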
| **Throughput - Read** | Breaks down read throughput by data path (FUSE, S3, Worker, UFS). Essential for understanding data flow and identifying bottlenecks. | `sum(irate(alluxio_data_access_bytes_sum{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_data_throughput_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_ufs_data_access_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)` | binBps |  |  |  |  |  |  |
| **Throughput - Write** | Breaks down write throughput by data path (FUSE, S3, Worker, UFS). Essential for understanding write efficiency and patterns. | `sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_data_throughput_bytes_total{job="worker",destination="write_buffer",method="write"}[5m])) or on() vector(0)`, `sum(irate(alluxio_write_buffer_async_persist_throughput_bytes_total{job="worker"}[5m])) or on() vector(0)`, `sum(irate(alluxio_ufs_data_access_bytes_total{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)` | binBps |  |  |  |  |  |  |
| **Request/s - Read** | Shows the total read request rate across all interfaces (FUSE, S3, Worker). Important for understanding read workload intensity and traffic patterns. | `sum(irate(alluxio_data_access_bytes_count{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_s3_api_throughput_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_data_access_bytes_count{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)` | reqps |  |  |  |  |  |  |
| **Request/s - Write** | Shows the total write request rate across all interfaces. Essential for understanding write workload patterns and identifying write-intensive phases. | `sum(irate(alluxio_data_access_bytes_count{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(sum(irate(alluxio_s3_api_throughput_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0)) or on() vector(0)`, `sum(irate(alluxio_data_access_bytes_count{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)` | reqps |  |  |  |  |  |  |
| **Request/s - Metadata** | Shows the request rate of metadata operations across all interfaces. Important for understanding metadata workload intensity, which affects overall cluster performance. | `sum(alluxio_fuse_concurrency{job="fuse",method=~"Fuse\.Create\|Fuse\.Getattr\|Fuse\.Readdir\|Fuse\.Statfs\|Fuse\.Unlink",cluster_name=~"$cluster"}) or on() vector(0)`, `sum(irate(alluxio_s3_api_call_latency_ms_count{job="worker", method!~"GetObject\|PutObject",cluster_name=~"$cluster"}[5m])) or on() vector(0)`, `sum(irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0)` | reqps |  |  |  |  |  |  |
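| **Jobs** | Time-series view of job execution patterns. Important for understanding the level of background activity in the cluster and identifying periods of heavy maintenance activity that may affect performance. | `sum(alluxio_active_job_count{job="coordinator",type="running"})`, `sum(alluxio_active_job_count{job="coordinator",type="waiting"})` | short |  |  |  |  |  |  |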
| **Request Latency - P90** | Shows the 90th percentile latency for all request types. A key metric for understanding typical user experience and identifying performance degradation. | `histogram_quantile(0.90, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",cluster_name=~"$cluster"}[5m])) by (le, method))`, `histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, op))`, `histogram_quantile(0.90, sum(rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method))`, `histogram_quantile(0.90, sum(rate(alluxio_ufs_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method, ufs_type))` | ms |  |  |  |  |  |  |
| **Request Latency - P99** | Shows the 99th percentile latency for all request types. Essential for identifying worst-case performance and outliers that may indicate system stress. | `histogram_quantile(0.99, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",cluster_name=~"$cluster"}[5m])) by (le, method))`, `histogram_quantile(0.99, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, op))`, `histogram_quantile(0.99, sum(rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method))`, `histogram_quantile(0.99, sum(rate(alluxio_ufs_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method, ufs_type))` | ms |  |  |  |  |  |  |
| **RPC (Client<->Worker) Latency - P90** | Shows the estimated network latency of client GetStatus gRPC calls (metadata operations), calculated as the difference between the client-observed latency and the worker's metadata processing latency. Helps identify network latency and connectivity issues between clients and the cluster. | `(histogram_quantile(0.90, sum(rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",method="GetStatus",instance=~"$instance",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0`, `(histogram_quantile(0.90, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",method="Fuse.Getattr",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0` | ms |  |  |  |  |  |  |
| **RPC (Client<->Worker) Latency - P99** | Shows the estimated network latency of client GetStatus gRPC calls (metadata operations), calculated as the difference between the client-observed latency and the worker's metadata processing latency. Helps identify network latency and connectivity issues between clients and the cluster. | `(histogram_quantile(0.99, sum(rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",method="GetStatus",instance=~"$instance",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.99, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0` | ms |  |  |  |  |  |  |
| **Request Call - Success Rate** | Shows the success rate of each request type across all interfaces. A value below 99% (excluding Not Found errors) usually indicates a system problem that needs investigation. | `(sum by (method) (irate(alluxio_fuse_call_latency_ms_count{job="fuse", success="true",cluster_name=~"$cluster"}[5m]))) / (sum by (method) (irate(alluxio_fuse_call_latency_ms_count{job="fuse", cluster_name=~"$cluster"}[5m])))`, `(sum by (method) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="true",cluster_name=~"$cluster"}[5m]))) / (sum by (method) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", cluster_name=~"$cluster"}[5m])))`, `(sum by (op) (irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m]) - irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster"}[5m]))) / (sum by (op) (irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m])))`, `(sum by (ufs_type) (irate(alluxio_ufs_total{cluster_name=~"$cluster"}[5m]) - irate(alluxio_ufs_error_total{cluster_name=~"$cluster"}[5m]))) / (sum by (ufs_type) (irate(alluxio_ufs_latency_ms_count{cluster_name=~"$cluster"}[5m])))` | percentunit | green: > 0.9 |  |  |  |  |  |
| **Request Call - Failures** | Shows the failure rate for all request types and interfaces. Spikes in this metric call for immediate investigation to identify system or application errors. | `sum by (method, state) (irate(alluxio_fuse_result_total{job="fuse",state!="SUCCESS", cluster_name=~"$cluster"}[5m]))`, `sum by (method, status) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="false",cluster_name=~"$cluster"}[5m]))`, `sum by (op) (irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster"}[5m]))`, `sum by (ufs_type,error_code) (irate(alluxio_ufs_error_total{cluster_name=~"$cluster"}[5m]))` | short |  |  |  |  |  |  |

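The arithmetic behind two Cluster-row panels can be sketched in plain Python. This is only an illustration of the expressions in the table, assuming the per-second rates and quantiles have already been computed by Prometheus. The function names and parameters are hypothetical and are not part of Alluxio or Grafana; the 0.8 cutoff comes from the threshold column of **Cache Hit(%)**.

```python
import math
from typing import Optional


def cache_hit_ratio(hit_bytes_per_s: float,
                    miss_bytes_per_s: Optional[float] = None) -> float:
    """Hypothetical helper mirroring hit / (hit + (miss or vector(0)))."""
    # `or on() vector(0)`: an absent miss series is treated as zero.
    miss = 0.0 if miss_bytes_per_s is None else miss_bytes_per_s
    total = hit_bytes_per_s + miss
    if total == 0:
        # PromQL evaluates 0/0 to NaN; return NaN instead of raising.
        return math.nan
    return hit_bytes_per_s / total


def is_green(ratio: float, threshold: float = 0.8) -> bool:
    """True when the ratio is above the panel threshold (green: > 0.8)."""
    return ratio > threshold


def estimated_rpc_latency_ms(client_ms: float,
                             worker_ms: float) -> Optional[float]:
    """Client-observed latency minus worker processing latency.

    Mirrors the trailing `> 0` filter: non-positive differences
    produce no sample, represented here as None.
    """
    diff = client_ms - worker_ms
    return diff if diff > 0 else None
```

For example, `cache_hit_ratio(80.0, 20.0)` gives `0.8`, which sits exactly on the threshold and is therefore not green.
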
***

### Fuse Row

| 面板名称                             | 描述                                                                                    | 计算方法                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | 单位          | 阈值           |
| -------------------------------- | ------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------- | ------------ |
| **Request Latency - P90** | Shows the 90th percentile latency of operations on individual Fuse mounts. Helps pinpoint performance issues on a specific Fuse client. | `histogram_quantile(0.90, rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m]))`, `histogram_quantile( 0.90, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )` | ms |  |
| **Request Latency - P99** | Shows the 99th percentile latency of operations on individual Fuse mounts. Helps pinpoint performance issues on a specific Fuse client. | `histogram_quantile(0.99, rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m]))`, `histogram_quantile( 0.99, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )` | ms |  |
| **Throughput - Read** | Shows the read throughput of a Fuse mount. Used to monitor the read activity of a specific client. | `irate(alluxio_data_access_bytes_sum{job="fuse",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | binBps |  |
| **Throughput - Write** | Shows the write throughput of a Fuse mount. Used to monitor the write activity of a specific client. | `irate(alluxio_data_access_bytes_sum{job="fuse",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | binBps |  |
| **Request/s - Read** | Shows the read request rate of a Fuse mount. Helps understand the read workload of a specific client. | `irate(alluxio_data_access_bytes_count{job="fuse",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | reqps |  |
| **Request/s - Write** | Shows the write request rate of a Fuse mount. Helps understand the write workload of a specific client. | `irate(alluxio_data_access_bytes_count{job="fuse",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | reqps |  |
| **FUSE Request Failure** | Shows the failure rate of a Fuse mount. Essential for diagnosing connectivity or operation issues on a specific client. | `irate(alluxio_fuse_result_total{job="fuse",state!="SUCCESS", instance=~"$instance",cluster_name=~"$cluster"}[5m])` | short |  |
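| **Throughput - UFS Fallback** | Read throughput when FUSE operations fall back to the under file system (UFS). A high fallback rate indicates cache misses or data that is not in Alluxio, which reduces the performance benefit. For best performance, this value should be as small as possible. | `irate(alluxio_ufs_data_access_bytes_total{job="fuse",instance=~"$instance",method="read",cluster_name=~"$cluster"}[5m]) or on() vector(0)` | binBps |  |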
| **Client Request Latency (P99)** | Shows the latency distribution of gRPC (metadata operations) and Netty (data operations) client calls. Helps identify RPC communication latency, client bottlenecks, and potential network or service response issues within workers. | `histogram_quantile( 0.99, sum( rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, method, instance) )`, `histogram_quantile( 0.99, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )` | ms |  |
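| **Client Request Concurrency** | Shows the number of concurrent gRPC client calls (metadata operations). Helps identify spikes in metadata request load and potential client bottlenecks within workers. | `alluxio_grpc_client_concurrency{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}` | reqps |  |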
| **Client Request Failure Rate** | Shows the error rate of gRPC client calls (metadata operations) and Netty operation calls (data operations). Helps identify failed metadata requests and potential reliability issues in client-worker communication within the cluster. | `sum by (instance,method) (irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) / (sum by (instance,method) (irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) + sum by (instance,method) (irate(alluxio_grpc_client_successes_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])))`, `sum by (instance,op) (irate(alluxio_netty_operation_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) / (sum by (instance,op) (irate(alluxio_netty_operations_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])))` | percentunit | green: > 0.9 |
| **Client Request Errors**        | 显示 gRPC 客户端错误（元数据操作）和 Netty 操作错误（数据操作）的总数。有助于跟踪失败请求以及集群内客户端与 worker 通信中潜在的可靠性问题。      | `irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])`, `irate(alluxio_netty_operation_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])`                                                                                                                                                                                                                                                                                                                                                                                                                               | short       |              |

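The gRPC expression of the **Client Request Failure Rate** panel divides the error rate by the sum of the error and success rates, while the Netty expression divides errors by `alluxio_netty_operations_total`. The following is a minimal Python sketch of the gRPC arithmetic only. The function name, the sample values, and the choice to return `None` when there is no traffic are assumptions made for illustration; they are not part of the dashboard.

```python
from typing import Optional


def failure_ratio(error_rate: float, success_rate: float) -> Optional[float]:
    """Return errors / (errors + successes) for one (instance, method) series.

    Both inputs are per-second rates, as irate() would produce them.
    Returning None for zero traffic is a choice made in this sketch.
    """
    total = error_rate + success_rate
    if total == 0:
        return None
    return error_rate / total


# Hypothetical sample: 2 failed and 98 successful calls per second.
print(failure_ratio(2.0, 98.0))
```

With these sample values the ratio is 0.02, which Grafana would render as 2% under the `percentunit` unit.
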
***

### S3 Row

| Panel name | Description | Calculation | Unit | Thresholds |
| --- | --- | --- | --- | --- |
| **Request Latency - P90** | Shows the 90th-percentile latency of S3 API operations on each worker. Helps narrow S3 performance problems down to specific workers. | `histogram_quantile(0.90, rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]))` | ms |  |
| **Request Latency - P99** | Shows the 99th-percentile latency of S3 API operations on each worker. Helps identify worst-case S3 performance on specific workers. | `histogram_quantile(0.99, rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]))` | ms |  |
| **Throughput - Read** | Shows the read throughput of S3 API operations on each worker. Used to monitor S3 read activity on specific workers. | `irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | binBps |  |
| **Throughput - Write** | Shows the write throughput of S3 API operations on each worker. Used to monitor S3 write activity on specific workers. | `irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | binBps |  |
| **Request/s - Read** | Shows the read request rate of S3 API operations on each worker. Helps characterize the S3 read workload on specific workers. | `irate(alluxio_s3_api_throughput_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | reqps |  |
| **Request/s - Write** | Shows the write request rate of S3 API operations on each worker. Helps characterize the S3 write workload on specific workers. | `irate(alluxio_s3_api_throughput_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | reqps |  |
| **Request Failure** | Shows the per-second rate of failed S3 API operations on each worker. Essential for diagnosing S3-related problems on individual workers. | `irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="false",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | short |  |

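The **Request Latency** panels rely on `histogram_quantile`, which estimates a percentile from cumulative `le` buckets rather than from raw samples. The Python sketch below is a simplified illustration of the linear interpolation that Prometheus documents for classic histograms. The function name `estimate_quantile` and the bucket values are hypothetical, and the sketch assumes at least one finite bucket plus a `+Inf` bucket.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket, whose count is the total.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_le, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # The quantile falls into the +Inf bucket:
                # report the highest finite upper bound instead.
                return buckets[-2][0]
            if count == prev_count:
                return le
            # Interpolate linearly inside the bucket that contains the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count


# Hypothetical latency buckets in milliseconds.
sample = [(10.0, 50.0), (50.0, 90.0), (100.0, 99.0), (math.inf, 100.0)]
print(estimate_quantile(0.95, sample))
```

Because the estimate is interpolated within a bucket, its accuracy depends on the bucket boundaries; a P99 that lands in the `+Inf` bucket is reported as the largest finite bound.
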
***

### Worker Row

| Panel name | Description | Calculation | Unit | Thresholds |
| --- | --- | --- | --- | --- |
| **Storage (Data) Used** | Shows the amount of cached data storage used by each worker. Helps identify uneven storage distribution. | `alluxio_cached_storage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}` | bytes |  |
| **Storage (Meta) Used** | Shows the amount of cached metadata storage used by each worker. Helps identify uneven storage distribution. | `alluxio_metastore_storage_size_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}` | bytes |  |
| **Storage Device Usage** | Shows the utilization of page store device capacity for each worker directory. Helps identify disk consumption trends and potential storage exhaustion or imbalance within the cluster. | `1 - (sum by (instance, dir) (alluxio_page_store_device_available_capacity_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) / sum by (instance, dir) (alluxio_page_store_device_total_capacity_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}))` | percentunit | yellow: > 0.8, red: > 0.9 |
| **Files** | Shows the number of cached files on each worker. Helps characterize the cache composition of each worker. | `alluxio_data_cached_files{job="worker",instance=~"$instance",cluster_name=~"$cluster"}` | short |  |
| **Pages** | Shows the number of cached pages on each worker. Helps characterize the cache composition of each worker. | `alluxio_data_cached_pages{job="worker",instance=~"$instance",cluster_name=~"$cluster"}` | short |  |
| **Cache Evicted** | Shows the rate at which each worker evicts cached data. Helps identify which workers are under the most cache pressure. | `irate(alluxio_cached_evicted_data_bytes_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | binBps |  |
| **Request Latency - P90** | Shows the 90th-percentile latency of internal metadata operations on each worker, together with the 90th-percentile time to send the first packet of a Netty read. | `histogram_quantile(0.90, rate(alluxio_meta_operation_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]))`, `histogram_quantile( 0.90, sum( rate(alluxio_worker_netty_read_time_to_send_first_packet_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )` | ms |  |
| **Request Latency - P99** | Shows the 99th-percentile latency of internal metadata operations on each worker, together with the 99th-percentile time to send the first packet of a Netty read. | `histogram_quantile(0.99, rate(alluxio_meta_operation_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]))`, `histogram_quantile( 0.99, sum( rate(alluxio_worker_netty_read_time_to_send_first_packet_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )` | ms |  |
| **Request (from Page Store or UFS) Latency - P90** | Shows the 90th-percentile latency with which Alluxio workers serve reads from the page store or the underlying file system (UFS). Helps identify read performance bottlenecks and distinguish slow storage backends from overloaded workers. | `histogram_quantile( 0.90, sum( rate(alluxio_worker_netty_read_storage_response_time_ms_bucket{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) ) by (le, instance) )` | ms |  |
| **PageStore IO Latency - P90** | Shows the 90th-percentile latency of page store I/O operations (data operations) on workers. Helps identify slow disk I/O and potential storage bottlenecks within the cluster. | `histogram_quantile( 0.90, sum( rate(alluxio_page_store_io_latency_microseconds_bucket{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) ) by (le, op, instance) )` | µs |  |
| **Read - Throughput** | Shows the read throughput of each worker. Used to monitor read activity and load per worker. | `irate(alluxio_data_throughput_bytes_total{job="worker",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | binBps |  |
| **Write - Throughput** | Shows the write throughput of each worker. Used to monitor write activity and load per worker. | `irate(alluxio_data_throughput_bytes_total{job="worker",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | binBps |  |
| **Read - Request/s** | Shows the read request rate of each worker. Helps show how the read workload is distributed across workers. | `irate(alluxio_data_access_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | reqps |  |
| **Write - Request/s** | Shows the write request rate of each worker. Helps show how the write workload is distributed across workers. | `irate(alluxio_data_access_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | reqps |  |
| **Read - Threads** | Shows the active thread count and queue length of each worker's read thread pool, alongside the maximum and current pool size. Used to diagnose read performance bottlenecks. | `alluxio_rpc_executor_max_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_active_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_current_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}` | short |  |
| **Write - Threads** | Shows the active thread count and queue length of each worker's write thread pool, alongside the maximum pool size. Used to diagnose write performance bottlenecks. | `alluxio_rpc_executor_max_threads{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_active_threads{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}` | short |  |
| **Metadata - Request/s** | Shows the metadata request rate of each worker. Helps identify workers with a high metadata load. | `irate(alluxio_meta_operation_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | reqps |  |
| **Metadata - Threads** | Shows the active thread count and queue length of each worker's metadata thread pool, alongside the maximum and current pool size. Used to diagnose metadata performance bottlenecks. | `alluxio_rpc_executor_max_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_active_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_current_queue_length{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_current_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}` | short |  |
| **Request Failure** | Shows the per-second rate of failed internal metadata operations on each worker. Helps pinpoint workers that are hitting internal errors. | `irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster", instance=~"$instance"}[5m])` | short |  |
| **Off Heap Memory** | Shows the off-heap memory usage of each worker. Important for monitoring memory resources and preventing out-of-memory errors. | `sum(alluxio_rocksdb_memory_usage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) by (instance) + sum(alluxio_netty_direct_memory_usage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) by (instance) + sum(jvm_memory_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",area="nonheap"}) by (instance) + sum(jvm_buffer_pool_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",pool="direct"}) by (instance) + sum(jvm_buffer_pool_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",pool="mapped"}) by (instance)` | bytes |  |
| **PageStore Errors** | Shows the per-second rate of PageStore operation errors on each worker. May indicate a problem in the local cache storage layer. | `irate(alluxio_page_store_operation_errors_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m])` | short |  |
| **PageStore Disk Error Rate** | Shows the error ratio of disk-related PageStore operations for each worker directory. May indicate a problem with the underlying disk hardware or file system. | `sum by (instance, dir) (irate(alluxio_page_store_dir_operation_errors_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) or on() vector(0)) / sum by (instance, dir) (irate(alluxio_page_store_dir_operations_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]))` | percentunit | yellow: > 0.1, red: > 0.3 |

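The **Storage Device Usage** panel computes utilization as one minus the ratio of available to total capacity, then colors the value with the listed thresholds. A minimal Python sketch of that arithmetic follows. The function names and sample capacities are hypothetical, and the `green` base color is an assumption, since the table lists only the yellow and red thresholds.

```python
def device_usage(available_bytes: float, total_bytes: float) -> float:
    """Return the used fraction of a page store device: 1 - available / total."""
    return 1 - available_bytes / total_bytes


def threshold_color(usage: float) -> str:
    """Map a usage fraction to a color using the panel's thresholds."""
    if usage > 0.9:
        return "red"
    if usage > 0.8:
        return "yellow"
    return "green"  # assumed base color; not stated in the table


GIB = 1024 ** 3
# Hypothetical directories on one worker: (available, total) in bytes.
for available, total in [(500 * GIB, 1000 * GIB), (150 * GIB, 1000 * GIB), (50 * GIB, 1000 * GIB)]:
    usage = device_usage(available, total)
    print(f"{usage:.2f} {threshold_color(usage)}")
```

The three sample directories come out at 0.50, 0.85, and 0.95 usage, which map to green, yellow, and red respectively.
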
***

### UFS Row

| Panel name | Description | Calculation | Unit | Thresholds |
| --- | --- | --- | --- | --- |
| **Request Latency - P90** | Shows the 90th-percentile latency of UFS operations on each worker. Helps identify slow UFS interactions on specific workers. | `histogram_quantile(0.90, rate(alluxio_ufs_latency_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]))` | ms |  |
| **Request Latency - P99** | Shows the 99th-percentile latency of UFS operations on each worker. Helps identify worst-case UFS performance on specific workers. | `histogram_quantile(0.99, rate(alluxio_ufs_latency_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]))` | ms |  |
| **Throughput - Read** | Shows the throughput of reads from the UFS on each worker. Used to monitor the volume of data read from underlying storage. | `irate(alluxio_ufs_data_access_bytes_total{method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | binBps |  |
| **Throughput - Write** | Shows the throughput of writes to the UFS on each worker. Used to monitor the volume of data written to underlying storage. | `irate(alluxio_ufs_data_access_bytes_total{method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | binBps |  |
| **Errors** | Shows the per-second rate of UFS operation errors on each worker. Essential for diagnosing problems in the underlying storage system. | `irate(alluxio_ufs_error_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])` | none |  |

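Most throughput and error panels in this document wrap a counter in `irate(...[5m])`. As documented by Prometheus, `irate` derives a per-second rate from only the last two samples inside the range window, which makes it responsive but spiky. The Python sketch below illustrates that calculation, including the counter-reset case. The function name and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple


def instant_rate(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Per-second rate from the last two (timestamp_seconds, counter_value) samples.

    Samples are assumed to be ordered by time and to lie inside the range window.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    # If the counter went down, treat it as a restart from zero.
    delta = v2 - v1 if v2 >= v1 else v2
    return delta / (t2 - t1)


# Hypothetical byte counter scraped every 15 seconds.
print(instant_rate([(0.0, 100.0), (15.0, 400.0), (30.0, 1000.0)]))
```

Only the samples at 15 s and 30 s contribute here, giving 600 bytes over 15 seconds, or 40 bytes per second. The earlier sample at 0 s is ignored, which is why widening the `[5m]` window does not smooth an `irate` graph.
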
***

### Job Row

| Panel name | Description | Calculation | Unit | Thresholds |
| --- | --- | --- | --- | --- |
| **Jobs** | Time series of job execution, showing the number of running and waiting jobs. Important for understanding the level of background activity in the cluster and for spotting periods of heavy maintenance activity that may affect performance. | `alluxio_active_job_count{job="coordinator",type="running"}`, `alluxio_active_job_count{job="coordinator",type="waiting"}` | short |  |
| **Job Tasks per Worker** | Time series of job tasks executing on each worker. Important for understanding worker activity levels and identifying bottlenecks that may affect performance. | `alluxio_worker_job_task_count{}` | short |  |
| **Job Threads per Worker** | Shows the active thread count and queue length of the load job thread pool on each worker, alongside the maximum and current pool size. Used to diagnose load job performance bottlenecks. | `alluxio_rpc_executor_max_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_active_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_current_queue_length{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}`, `alluxio_rpc_executor_current_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}` | short |  |
| **Job Dispatched Per Second** | Shows the job dispatch rate for each worker. | `sum by(worker)(irate(alluxio_distributed_load_job_dispatched_size_total{job="coordinator", instance=~"$instance"} [5m]))` |  |  |
| **Distributed Load Throughput** | Shows the load job throughput, both summed across the cluster and per worker. Used to monitor load job activity on each worker. | `sum(irate(alluxio_distributed_load_data_loaded_from_ufs_bytes_total{job="worker",cluster_name=~"$cluster"}[5m]))`, `irate(alluxio_distributed_load_data_loaded_from_ufs_bytes_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | binBps |  |
| **Bytes Loaded on Workers Per Second** | Shows the bytes loaded per second on each worker, together with the total reported by the coordinator. | `sum by(instance) (irate(alluxio_distributed_worker_bytes_loaded_bytes_total{job="worker", instance=~"$instance"} [5m]))`, `sum by()(irate(alluxio_distributed_load_job_loaded_bytes_total{job="coordinator", instance=~"$instance"} [5m]))` | binBps |  |
| **Distributed Load Operation Counts Per Second** | Shows the per-second rate of distributed load operations: scanned, processed, skipped, and failed. | `irate(alluxio_distributed_load_job_scanned_total{job="coordinator", instance=~"$instance"}[5m])`, `irate(alluxio_distributed_load_job_processed_total{job="coordinator", instance=~"$instance"}[5m])`, `irate(alluxio_distributed_load_job_skipped_total{job="coordinator", instance=~"$instance"}[5m])`, `sum by()(irate(alluxio_distributed_load_job_failure_total{job="coordinator", instance=~"$instance"}[5m]))` | short |  |
| **Distributed Load Failure Breakdowns** | Shows a breakdown of failures by reason and worker. | `sum by(reason, worker) (irate(alluxio_distributed_load_job_failure_total{job="coordinator", instance=~"$instance"} [5m]))` |  |  |

***

### Process Row

| Panel Name | Description | Calculation | Unit | Threshold |
| ---------- | ----------- | ----------- | ---- | --------- |
| **Active Worker Membership** | Shows how often worker membership is refreshed on each instance, as the per-second rate of the refresh counter. Helps track cluster topology updates and spot frequent worker reconfiguration or potential instability in the cluster. | `irate(alluxio_worker_membership_refresh_count_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) > 0` | short | |
| **Resource Pool Usage** | Shows the utilization of dynamic resource pools, computed as the current resource count divided by the maximum capacity. Helps identify overused or underused pools and potential resource bottlenecks in the cluster. | `sum by (instance, pool_kind, pool_instance) (alluxio_dynamic_resource_pool_current_resources{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}) / sum by (instance, pool_kind, pool_instance) (alluxio_dynamic_resource_pool_capacity{type="max",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"})` | percentunit | yellow: > 0.8 |
| **Resource Pool - Acquisition Timeout** | Shows the rate of acquisition timeouts in dynamic resource pools. Helps identify contention or delays in resource allocation and potential performance bottlenecks in the cluster. | `irate(alluxio_dynamic_resource_pool_acquisition_timeouts_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | short | |
| **Resource Pool - Resource Creation Latency (P99)** | Shows the 99th-percentile latency of creating a new resource in a dynamic resource pool. Helps identify slow resource allocation and potential performance bottlenecks in the cluster. | `histogram_quantile( 0.99, sum( rate(alluxio_dynamic_resource_pool_create_new_resource_latency_ms_bucket{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, pool_kind, pool_instance, instance) )` | ms | |
| **CPU time spent** | Shows the CPU time consumed by each process. | `irate(process_cpu_seconds_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])` | µs | |
| **Threads** | Shows the number of threads in each process. | `jvm_threads_current{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}` | | |
| **Heap Usage** | Shows the heap memory used by each process. | `jvm_memory_used_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}` | bytes | |
| **Heap Usage(%)** | Shows the heap utilization of each process, computed as used heap divided by maximum heap. | `jvm_memory_used_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"} / jvm_memory_max_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}` | percentunit | yellow: > 0.9 |
| **young GC time(per minute)** | Shows the time spent in young-generation garbage collection per minute. | `irate(jvm_gc_collection_seconds_sum{gc="G1 Young Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60` | s | |
| **young GC rate(per minute)** | Shows the number of young-generation garbage collections per minute. | `irate(jvm_gc_collection_seconds_count{gc="G1 Young Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60` | | |
| **old GC time(per minute)** | Shows the time spent in old-generation garbage collection per minute. | `irate(jvm_gc_collection_seconds_sum{gc="G1 Old Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60` | s | |
| **old GC rate(per minute)** | Shows the number of old-generation garbage collections per minute. | `irate(jvm_gc_collection_seconds_count{gc="G1 Old Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60` | | |

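The GC panels above multiply `irate(...[5m])` by 60 to turn a per-second rate into a per-minute value. The following Python sketch approximates that arithmetic. It assumes the usual Prometheus behavior of `irate`: only the last two samples in the window are used, and a decrease in the counter is treated as a reset. The function name `irate_per_minute` and the sample layout are illustrative and are not part of Alluxio or Prometheus.

```python
from typing import Optional, Sequence, Tuple

# One scraped sample: (unix timestamp in seconds, counter value).
Sample = Tuple[float, float]


def irate_per_minute(samples: Sequence[Sample]) -> Optional[float]:
    """Approximate `irate(counter[5m]) * 60` from the raw samples in the window.

    Hypothetical helper for illustration only.
    """
    if len(samples) < 2:
        return None  # irate needs at least two samples in the range
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    elapsed = t_last - t_prev
    if elapsed <= 0:
        return None  # samples must be strictly ordered in time
    # A counter that went down is assumed to have been reset to zero,
    # so the increase since the reset is simply the last value.
    delta = v_last if v_last < v_prev else v_last - v_prev
    # Per-second rate over the last two samples, scaled to one minute.
    return delta / elapsed * 60.0


if __name__ == "__main__":
    # Counter grows by 1 over 30 seconds: about 2 events per minute.
    print(irate_per_minute([(0.0, 10.0), (30.0, 11.0), (60.0, 12.0)]))
```

For example, two consecutive samples of `jvm_gc_collection_seconds_count` taken 30 seconds apart that differ by 1 correspond to roughly 2 collections per minute. Because only the last two samples matter, these panels react quickly to bursts but can look spiky compared with `rate`.
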
***

### ETCD Row

| Panel Name | Description | Calculation | Unit | Threshold |
| ---------- | ----------- | ----------- | ---- | --------- |
| **Up** | Shows the number of nodes in the ETCD cluster, counted as the members that report having a leader. | `sum(etcd_server_has_leader{job="etcd"})` | | yellow: > 1, green: > 3 |
| **RPC Rate** | Shows the RPC request rate and the RPC failure rate. | `sum(rate(grpc_server_started_total{job="etcd",grpc_type="unary"}[5m]))`, `sum(rate(grpc_server_handled_total{job="etcd",grpc_type="unary",grpc_code!="OK"}[5m]))` | ops | |
| **Active Streams** | Shows the number of active watch streams and lease streams. | `sum(grpc_server_started_total{job="etcd",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{job="etcd",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"})`, `sum(grpc_server_started_total{job="etcd",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{job="etcd",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"})` | | |
| **ETCD Client Call Failure Rate** | Shows the error rate of Etcd client calls. Helps identify failed Etcd requests and potential reliability or connectivity issues between the cluster and the Etcd service. | `(sum by (instance, server) (irate(alluxio_etcd_call_errors_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0)) / (sum by (instance, server) (irate(alluxio_etcd_client_calls_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])))` | percentunit | yellow: > 0.05, red: > 0.2 |
| **ETCD Client Call Latency (P99)** | Shows the 99th-percentile latency of Etcd client calls. Helps identify slow Etcd operations and potential performance bottlenecks in Etcd communication within the cluster. | `histogram_quantile( 0.99, sum( rate(alluxio_etcd_client_call_latency_ms_bucket{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )` | ms | |
| **DB Size** | Shows the size of the ETCD database. | `etcd_mvcc_db_total_size_in_bytes{job="etcd"}` | bytes | |
| **Disk Sync Duration** | Shows the 99th-percentile disk sync duration of WAL and backend operations. | `histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le))`, `histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le))` | | |
| **Memory** | Shows the resident memory used by the ETCD process. | `process_resident_memory_bytes{job="etcd"}` | bytes | |
| **Client Traffic In** | Shows the inbound client traffic rate. | `rate(etcd_network_client_grpc_received_bytes_total{job="etcd"}[5m])` | binBps | |
| **Client Traffic Out** | Shows the outbound client traffic rate. | `rate(etcd_network_client_grpc_sent_bytes_total{job="etcd"}[5m])` | binBps | |
| **Peer Traffic In** | Shows the inbound peer-to-peer traffic rate. | `sum(rate(etcd_network_peer_received_bytes_total{job="etcd"}[5m])) by (instance)` | binBps | |
| **Peer Traffic Out** | Shows the outbound peer-to-peer traffic rate. | `sum(rate(etcd_network_peer_sent_bytes_total{job="etcd"}[5m])) by (instance)` | binBps | |
| **Raft Proposals** | Shows raft proposal metrics: failure rate, total pending, commit rate, and apply rate. | `sum(rate(etcd_server_proposals_failed_total{job="etcd"}[5m]))`, `sum(etcd_server_proposals_pending{job="etcd"})`, `sum(rate(etcd_server_proposals_committed_total{job="etcd"}[5m]))`, `sum(rate(etcd_server_proposals_applied_total{job="$cluster"}[5m]))` | none | |
| **Total Leader Elections Per Day** | Shows the total number of leader elections per day. | `changes(etcd_server_leader_changes_seen_total{job="etcd"}[1d])` | | |

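The P99 panels above rely on `histogram_quantile`, which estimates a quantile from cumulative `le` buckets by linear interpolation inside the bucket that contains the target rank. The Python sketch below illustrates that idea under simplifying assumptions: classic (non-native) histograms, a `+Inf` bucket that is always present, and a lowest bucket that starts at 0. The function name and the bucket layout are illustrative, and the real Prometheus implementation handles more edge cases.

```python
import math
from typing import List, Optional, Tuple

# One histogram bucket: (upper bound `le`, cumulative count or rate).
Bucket = Tuple[float, float]


def histogram_quantile(q: float, buckets: List[Bucket]) -> Optional[float]:
    """Estimate the q-quantile from cumulative buckets (simplified sketch)."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        return None  # this sketch requires the +Inf bucket
    total = buckets[-1][1]
    if total == 0:
        return None  # no observations in the window
    rank = q * total
    # Find the first bucket whose cumulative value reaches the rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the highest finite bound, so that
        # bound is the best available answer.
        return buckets[-2][0] if len(buckets) > 1 else None
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


if __name__ == "__main__":
    latency_ms = [(10.0, 50.0), (50.0, 90.0), (100.0, 99.0), (math.inf, 100.0)]
    print(histogram_quantile(0.99, latency_ms))
```

Because the result is interpolated, its precision is limited by the bucket boundaries: a reported P99 can never exceed the highest finite `le` bound, so a latency panel that flattens at that bound may be hiding larger values.
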
***

