Grafana Dashboard

Overview

This document describes every panel in the Alluxio Grafana dashboard, including metric names, calculation methods, and threshold settings.


Cluster row

Panel name
Description
Calculation
Unit
Threshold

Storage

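Shows the total capacity of the Alluxio cache, the amount of storage used, and the usage percentage. This panel is essential for monitoring overall cache utilization and preventing storage exhaustion.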

sum(alluxio_cached_capacity_bytes{job="worker",cluster_name=~"$cluster"}), sum(alluxio_cached_storage_bytes{job="worker",cluster_name=~"$cluster"}), sum(alluxio_cached_storage_bytes{job="worker",cluster_name=~"$cluster"}) / sum(alluxio_cached_capacity_bytes{job="worker",cluster_name=~"$cluster"})

bytes

yellow: > 0.9, red: > 0.95

Read - Throughput

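Shows read throughput from the different data sources: Fuse, the S3 API, and direct Worker access. Helps identify the main data access interface for read operations.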

sum(irate(alluxio_data_access_bytes_sum{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_throughput_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), avg(avg_over_time(alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-reader"}[5m])) or on() vector(0)

binBps

yellow: > 100, red: > 2000

Read - Load (5m)

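Shows the average read load (thread utilization) across all workers over the past 5 minutes. A high average load may indicate a system-wide performance bottleneck.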

avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]))

percentunit

yellow: > 0.8

Read - Hotspot (load > 50%)

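Shows the read load of each worker (only workers above 50% utilization are displayed) and highlights the most heavily loaded worker. Helps identify uneven I/O distribution and potential hotspots in the cluster.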

avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5

percentunit

yellow: > 0.8

Cache File

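Shows the total number of files and pages currently stored in the Alluxio cache. Helps in understanding the composition and granularity of the cached data.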

sum(alluxio_data_cached_files{job="worker",cluster_name=~"$cluster"}), sum(alluxio_data_cached_pages{job="worker",cluster_name=~"$cluster"})

short

Write - Throughput

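Shows write throughput from the different data sources: Fuse, the S3 API, and direct Worker access. Helps identify the main data access interface for write operations.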

sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_throughput_bytes_total{job="worker",destination="write_buffer",method="write"}[5m])) or on() vector(0), avg(avg_over_time(alluxio_rpc_executor_current_queue_length{job="worker",executor_name="grpc-metadata"}[5m]))

binBps

yellow: > 100, red: > 2000

Write - Load (5m)

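Shows the average write load (thread utilization) across all workers over the past 5 minutes. A high average load may indicate a system-wide performance bottleneck.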

avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]))

percentunit

yellow: > 0.8

Write - Hotspot (load > 50%)

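Shows the write load of each worker (only workers above 50% utilization are displayed) and highlights the most heavily loaded worker. Helps identify uneven I/O distribution and potential hotspots in the cluster.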

avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5

percentunit

yellow: > 0.8

Workers

Shows the number of live (healthy) and lost (unhealthy) workers in the cluster. This is a key indicator of cluster health and stability.

sum(up{job="worker",cluster_name=~"$cluster"}), sum by () (prometheus_target_scrape_pool_targets{scrape_job="worker"}) - sum by () (up{job="worker",cluster_name=~"$cluster"})

yellow: > 1

Jobs

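Shows the number of jobs currently running and waiting in the job service queue. Helps monitor the status of asynchronous operations such as data loading.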

sum(alluxio_active_job_count{job="coordinator",type="running"}), sum(alluxio_active_job_count{job="coordinator",type="waiting"})

short

yellow: > 100

Meta - RPS

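Shows metadata requests per second (RPS) from the different data sources: Fuse, the S3 API, and direct Worker access. Important for monitoring the intensity of the metadata workload.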

sum(alluxio_fuse_concurrency{job="fuse",method=~"Fuse\\.Create|Fuse\\.Getattr|Fuse\\.Readdir|Fuse\\.Statfs|Fuse\\.Unlink",cluster_name=~"$cluster"}) or on() vector(0), sum(irate(alluxio_s3_api_call_latency_ms_count{job="worker", method!~"GetObject|PutObject",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0), avg(avg_over_time(alluxio_rpc_executor_current_queue_length{job="worker",executor_name="grpc-metadata"}[5m]))

reqps

yellow: > 100, red: > 2000

Meta - Load (5m)

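Shows the average metadata operation load (thread utilization) across all workers over the past 5 minutes. A high load may indicate a workload dominated by metadata operations.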

avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]))

percentunit

yellow: > 0.8

Meta - Hotspot (load > 50%)

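Shows the metadata operation load of each worker (only workers above 50% utilization are displayed) and highlights the most heavily loaded worker. Helps identify uneven I/O distribution and potential hotspots in the cluster.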

avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5

percentunit

yellow: > 0.8

License (Valid To)

Shows the expiration date of the Alluxio Enterprise Edition license. Important for ensuring continued access to enterprise features.

min(alluxio_license_expiration_date{job=~"coordinator|worker"}) * 1000

dateTimeFromNow

Component Version

Shows the distribution of Alluxio versions running in the cluster. Used to verify version consistency during upgrades or troubleshooting.

count (alluxio_version_info{cluster_name=~"$cluster"}) by (version)

Component Uptime

Shows the uptime of each Alluxio component (coordinator, worker, fuse). Helps track service stability and identify recent restarts.

timestamp(process_start_time_seconds{job=~"coordinator|worker|fuse",cluster_name=~"$cluster"}) - process_start_time_seconds{job=~"coordinator|worker|fuse",cluster_name=~"$cluster"}

s

Cache Hit(%)

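Shows the percentage of data and metadata read requests served from the Alluxio cache. Higher values mean better cache efficiency and less pressure on the underlying storage.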

sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) / (sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) + (sum(irate(alluxio_missed_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0))), sum(irate(alluxio_metadata_cache_hit_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) / (sum(irate(alluxio_metadata_cache_hit_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) + sum(irate(alluxio_metadata_cache_miss_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0))

percentunit

green: > 0.8

Cache Eviction

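Shows the rate at which data is evicted from the cache. A high eviction rate may indicate that the cache capacity is insufficient for the current workload.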

sum(irate(alluxio_cached_evicted_data_bytes_total{job="worker",cluster_name=~"$cluster"}[5m]))

binBps

Throughput - Read

Provides a detailed breakdown of read throughput across the different data paths (FUSE, S3, Worker, UFS). Essential for understanding data flow and identifying bottlenecks.

sum(irate(alluxio_data_access_bytes_sum{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_throughput_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_ufs_data_access_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)

binBps

Throughput - Write

Provides a detailed breakdown of write throughput across the different data paths (FUSE, S3, Worker, UFS). Essential for understanding write efficiency and patterns.

sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_throughput_bytes_total{job="worker",destination="write_buffer",method="write"}[5m])) or on() vector(0), sum(irate(alluxio_write_buffer_async_persist_throughput_bytes_total{job="worker"}[5m])) or on() vector(0), sum(irate(alluxio_ufs_data_access_bytes_total{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)

binBps

Request/s - Read

Shows the total read request rate across all interfaces (FUSE, S3, Worker). Important for understanding read workload intensity and traffic patterns.

sum(irate(alluxio_data_access_bytes_count{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_access_bytes_count{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)

reqps

Request/s - Write

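Shows the total write request rate across all interfaces. Essential for understanding write workload patterns and identifying write-intensive phases.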

sum(irate(alluxio_data_access_bytes_count{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(sum(irate(alluxio_s3_api_throughput_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0)) or on() vector(0), sum(irate(alluxio_data_access_bytes_count{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)

reqps

Request/s - Metadata

Shows the metadata operation request rate across all interfaces. Important for understanding metadata workload intensity, which affects overall cluster performance.

sum(alluxio_fuse_concurrency{job="fuse",method=~"Fuse\\.Create|Fuse\\.Getattr|Fuse\\.Readdir|Fuse\\.Statfs|Fuse\\.Unlink",cluster_name=~"$cluster"}) or on() vector(0), sum(irate(alluxio_s3_api_call_latency_ms_count{job="worker", method!~"GetObject|PutObject",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0)

reqps

Jobs

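A time-series visualization of job execution patterns. Important for understanding the level of background activity in the cluster and identifying periods of heavy maintenance activity that may affect performance.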

sum(alluxio_active_job_count{job="coordinator",type="running"}), sum(alluxio_active_job_count{job="coordinator",type="waiting"})

short

Request Latency - P90

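Shows the 90th percentile latency for all request types. This metric is key to understanding the typical user experience and identifying performance degradation.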

histogram_quantile(0.90, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",cluster_name=~"$cluster"}[5m])) by (le, method)), histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, op)), histogram_quantile(0.90, sum(rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method)), histogram_quantile(0.90, sum(rate(alluxio_ufs_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method, ufs_type))

ms

Request Latency - P99

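Shows the 99th percentile latency for all request types. This metric is essential for identifying worst-case performance and outliers that may indicate system stress.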

histogram_quantile(0.99, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",cluster_name=~"$cluster"}[5m])) by (le, method)), histogram_quantile(0.99, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, op)), histogram_quantile(0.99, sum(rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method)), histogram_quantile(0.99, sum(rate(alluxio_ufs_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method, ufs_type))

ms

RPC (Client<->Worker) Latency - P90

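Shows the estimated network latency of client GetStatus gRPC calls (metadata operations), calculated as the difference between the latency observed by the client and the worker's metadata processing latency. Helps identify network latency and connectivity issues between clients and the cluster.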

(histogram_quantile(0.90, sum(rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",method="GetStatus",instance=~"$instance",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0, (histogram_quantile(0.90, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",method="Fuse.Getattr",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0

ms

RPC (Client<->Worker) Latency - P99

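Shows the estimated network latency of client GetStatus gRPC calls (metadata operations), calculated as the difference between the latency observed by the client and the worker's metadata processing latency. Helps identify network latency and connectivity issues between clients and the cluster.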

(histogram_quantile(0.99, sum(rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",method="GetStatus",instance=~"$instance",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.99, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0

ms

Request Call - Success Rate

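Shows the success rate of each request type across all interfaces. A value below 99% (excluding Not Found errors) usually indicates a problem that needs investigation.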

(sum by (method) (irate(alluxio_fuse_call_latency_ms_count{job="fuse", success="true",cluster_name=~"$cluster"}[5m]))) / (sum by (method) (irate(alluxio_fuse_call_latency_ms_count{job="fuse", cluster_name=~"$cluster"}[5m]))), (sum by (method) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="true",cluster_name=~"$cluster"}[5m]))) / (sum by (method) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", cluster_name=~"$cluster"}[5m]))), (sum by (op) (irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m]) - irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster"}[5m]))) / (sum by (op) (irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m]))), (sum by (ufs_type) (irate(alluxio_ufs_total{cluster_name=~"$cluster"}[5m]) - irate(alluxio_ufs_error_total{cluster_name=~"$cluster"}[5m]))) / (sum by (ufs_type) (irate(alluxio_ufs_latency_ms_count{cluster_name=~"$cluster"}[5m])))

percentunit

green: > 0.9

Request Call - Failures

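Shows the failure rate for all request types and interfaces. Spikes in this metric require immediate investigation to identify system or application errors.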

sum by (method, state) (irate(alluxio_fuse_result_total{job="fuse",state!="SUCCESS", cluster_name=~"$cluster"}[5m])), sum by (method, status) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="false",cluster_name=~"$cluster"}[5m])), sum by (op) (irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster"}[5m])), sum by (ufs_type,error_code) (irate(alluxio_ufs_error_total{cluster_name=~"$cluster"}[5m]))

short
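The Request Latency panels in this row are built on PromQL's `histogram_quantile`, which estimates a percentile from cumulative histogram buckets. The following is a minimal Python sketch of that estimate, assuming positive bucket bounds (as with latencies) and evenly spread observations inside each bucket. The function name and the sample bucket values are illustrative and not taken from the dashboard.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with a +Inf bucket,
    mirroring the `le` label of a Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound: report that bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical latency buckets in milliseconds: (le, cumulative count).
sample_buckets = [(10.0, 50.0), (50.0, 90.0), (100.0, 98.0), (math.inf, 100.0)]
p75 = histogram_quantile(0.75, sample_buckets)
p99 = histogram_quantile(0.99, sample_buckets)
```

Because the result is interpolated, its precision is bounded by the bucket layout: a P99 that lands in the +Inf bucket is reported as the highest finite bound, so a flat line at that value may hide larger latencies.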

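Most throughput and request-rate panels wrap a counter in `irate(...[5m])`. As a rough illustration, `irate` looks only at the last two samples inside the window and divides their difference by the time between them, treating a decrease as a counter reset. The sketch below assumes samples are given as `(timestamp_seconds, value)` pairs; the function name and sample values are hypothetical.

```python
def irate(samples):
    """Per-second instant rate from the last two (timestamp, value) samples.

    Returns None when fewer than two samples fall inside the window.
    """
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    interval = t_last - t_prev
    if interval <= 0:
        return None
    # A counter that went down is assumed to have reset to zero,
    # so the increase since the reset is the latest value itself.
    increase = v_last if v_last < v_prev else v_last - v_prev
    return increase / interval


# Hypothetical byte counter scraped every 15 seconds.
window = [(0, 100.0), (15, 400.0), (30, 1000.0)]
bytes_per_second = irate(window)
```

Since only the last two samples count, the 5m range mainly controls how far back Prometheus may look for them. The value reacts quickly to bursts but is noisier than an average over the whole window.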

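The Cache Hit(%) panel divides the cached read rate by the sum of the cached and missed read rates, and substitutes 0 for the missed rate when that series is absent (the `or on() vector(0)` part of the data expression). A small Python sketch of the same arithmetic follows. The function name is illustrative, and returning `None` for an idle cache is a choice made here, whereas PromQL would yield NaN.

```python
def cache_hit_ratio(hit_rate, miss_rate=None):
    """Fraction of reads served from cache.

    `miss_rate=None` models a missing miss series, which is treated as 0.
    """
    miss = 0.0 if miss_rate is None else miss_rate
    total = hit_rate + miss
    if total == 0:
        return None  # no traffic, so the ratio is undefined
    return hit_rate / total


def is_green(ratio, threshold=0.8):
    """Apply the panel's documented threshold (green: > 0.8)."""
    return ratio is not None and ratio > threshold


ratio_with_misses = cache_hit_ratio(90.0, 10.0)
ratio_without_miss_series = cache_hit_ratio(5.0)
```

Without the zero fallback, a cluster with no misses at all would produce an empty result instead of a 100% hit ratio.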
Fuse row

Panel name
Description
Calculation
Unit
Threshold

Request Latency - P90

Shows the 90th percentile latency of operations on each Fuse mount. Helps pinpoint performance issues on specific Fuse clients.

histogram_quantile(0.90, rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m])), histogram_quantile( 0.90, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )

ms

Request Latency - P99

Shows the 99th percentile latency of operations on each Fuse mount. Helps pinpoint performance issues on specific Fuse clients.

histogram_quantile(0.99, rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m])), histogram_quantile( 0.99, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )

ms

Throughput - Read

Shows the read throughput of a Fuse mount. Used to monitor the read activity of a specific client.

irate(alluxio_data_access_bytes_sum{job="fuse",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Throughput - Write

Shows the write throughput of a Fuse mount. Used to monitor the write activity of a specific client.

irate(alluxio_data_access_bytes_sum{job="fuse",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Request/s - Read

Shows the read request rate of a Fuse mount. Helps in understanding the read workload of a specific client.

irate(alluxio_data_access_bytes_count{job="fuse",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])

reqps

Request/s - Write

Shows the write request rate of a Fuse mount. Helps in understanding the write workload of a specific client.

irate(alluxio_data_access_bytes_count{job="fuse",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])

reqps

FUSE Request Failure

Shows the failure rate of a Fuse mount. Essential for diagnosing connectivity or operational issues on a specific client.

irate(alluxio_fuse_result_total{job="fuse",state!="SUCCESS", instance=~"$instance",cluster_name=~"$cluster"}[5m])

short

Throughput - UFS Fallback

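Shows the read throughput when FUSE operations fall back to the underlying file system (UFS). A high fallback rate indicates cache misses or data that is not in Alluxio, which reduces the performance benefit. For best performance, this value should be as small as possible.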

irate(alluxio_ufs_data_access_bytes_total{job="fuse",instance=~"$instance",method="read",cluster_name=~"$cluster"}[5m]) or on() vector(0)

binBps

Client Request Latency (P99)

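Shows the latency distribution of gRPC (metadata operations) and Netty (data operations) client calls. Helps identify RPC communication latency, client-side bottlenecks, and potential network or service response issues within workers.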

histogram_quantile( 0.99, sum( rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, method, instance) ), histogram_quantile( 0.99, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )

ms

Client Request Concurrency

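Shows the number of concurrent gRPC client calls (metadata operations). Helps identify spikes in metadata request load and potential client-side bottlenecks within workers.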

alluxio_grpc_client_concurrency{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}

reqps

Client Request Failure Rate

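Shows the error rate of gRPC client calls (metadata operations) and Netty operation calls (data operations). Helps identify failed metadata requests and potential reliability issues in client-to-worker communication within the cluster.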

sum by (instance,method) (irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) / (sum by (instance,method) (irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) + sum by (instance,method) (irate(alluxio_grpc_client_successes_total{instance=~"$instance",cluster_name=~"$cluster"}[5m]))), sum by (instance,op) (irate(alluxio_netty_operation_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) / (sum by (instance,op) (irate(alluxio_netty_operations_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])))

percentunit

green: > 0.9

Client Request Errors

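Shows the total number of gRPC client errors (metadata operations) and Netty operation errors (data operations). Helps track failed requests and potential reliability issues in client-to-worker communication within the cluster.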

irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m]), irate(alluxio_netty_operation_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])

short


S3 row

Panel name
Description
Calculation
Unit
Threshold

Request Latency - P90

Shows the 90th percentile latency of S3 API operations on each worker. Helps narrow S3 performance issues down to specific workers.

histogram_quantile(0.90, rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]))

ms

Request Latency - P99

Shows the 99th percentile latency of S3 API operations on each worker. Helps identify worst-case S3 performance on specific workers.

histogram_quantile(0.99, rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]))

ms

Throughput - Read

Shows the read throughput of S3 API operations on each worker. Used to monitor S3 read activity on specific workers.

irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Throughput - Write

Shows the write throughput of S3 API operations on each worker. Used to monitor S3 write activity on specific workers.

irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Request/s - Read

Shows the read request rate of S3 API operations on each worker. Helps in understanding the S3 read workload on specific workers.

irate(alluxio_s3_api_throughput_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])

reqps

Request/s - Write

Shows the write request rate of S3 API operations on each worker. Helps in understanding the S3 write workload on specific workers.

irate(alluxio_s3_api_throughput_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])

reqps

Request Failure

Shows the failure rate of S3 API operations on each worker. Essential for diagnosing S3-related issues on individual workers.

irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="false",instance=~"$instance",cluster_name=~"$cluster"}[5m])

short


Worker row

Panel name
Description
Calculation
Unit
Threshold

Storage (Data) Used

Shows the amount of cached data storage used by each worker. Helps identify uneven storage distribution.

alluxio_cached_storage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}

bytes

Storage (Meta) Used

Shows the amount of cached metadata storage used by each worker. Helps identify uneven storage distribution.

alluxio_metastore_storage_size_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}

bytes

Storage Device Usage

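Shows the utilization of page store device capacity for each directory on a worker. Helps identify disk space consumption trends and potential storage exhaustion or imbalance in the cluster.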

1- (sum by (instance, dir) (alluxio_page_store_device_available_capacity_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) / sum by (instance, dir) (alluxio_page_store_device_total_capacity_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}))

percentunit

yellow: > 0.8, red: > 0.9

Files

Shows the number of cached files on each worker. Helps in understanding the cache composition of each worker.

alluxio_data_cached_files{job="worker",instance=~"$instance",cluster_name=~"$cluster"}

short

Pages

Shows the number of cached pages on each worker. Helps in understanding the cache composition of each worker.

alluxio_data_cached_pages{job="worker",instance=~"$instance",cluster_name=~"$cluster"}

short

Cache Evicted

Shows the rate at which cached data is evicted on each worker. Helps identify which workers are under the greatest memory pressure.

irate(alluxio_cached_evicted_data_bytes_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Request Latency - P90

Shows the 90th percentile latency of internal metadata operations on each worker.

histogram_quantile(0.90, rate(alluxio_meta_operation_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])), histogram_quantile( 0.90, sum( rate(alluxio_worker_netty_read_time_to_send_first_packet_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )

ms

Request Latency - P99

Shows the 99th percentile latency of internal metadata operations on each worker.

histogram_quantile(0.99, rate(alluxio_meta_operation_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])), histogram_quantile( 0.99, sum( rate(alluxio_worker_netty_read_time_to_send_first_packet_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )

ms

Request (from Page Store or UFS) Latency - P90

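Shows the latency distribution of read operations that Alluxio workers serve from the page store or the underlying file system (UFS). Helps identify read performance bottlenecks and distinguish between slow storage backends and overloaded workers in the cluster.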

histogram_quantile( 0.90, sum( rate(alluxio_worker_netty_read_storage_response_time_ms_bucket{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) ) by (le, instance) )

ms

PageStore IO Latency - P90

Shows the 90th percentile latency of page store I/O operations (data operations) on workers. Helps identify slow disk I/O and potential storage bottlenecks in the cluster.

histogram_quantile( 0.90, sum( rate(alluxio_page_store_io_latency_microseconds_bucket{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) ) by (le, op, instance) )

µs

Read - Throughput

Shows the read throughput of each worker. Used to monitor read activity and load on each worker.

irate(alluxio_data_throughput_bytes_total{job="worker",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Write - Throughput

Shows the write throughput of each worker. Used to monitor write activity and load on each worker.

irate(alluxio_data_throughput_bytes_total{job="worker",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Read - Request/s

Shows the read request rate of each worker. Helps in understanding how the read workload is distributed across workers.

irate(alluxio_data_access_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])

reqps

Write - Request/s

Shows the write request rate of each worker. Helps in understanding how the write workload is distributed across workers.

irate(alluxio_data_access_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])

reqps

Read - Threads

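Shows the number of active threads and the queue length of the read thread pool on each worker. Used to diagnose read performance bottlenecks.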

alluxio_rpc_executor_max_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_active_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}

short

Write - Threads

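Shows the number of active threads and the queue length of the write thread pool on each worker. Used to diagnose write performance bottlenecks.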

alluxio_rpc_executor_max_threads{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_active_threads{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}

short

Metadata - Request/s

Shows the metadata request rate of each worker. Helps identify workers with a high metadata load.

irate(alluxio_meta_operation_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])

reqps

Metadata - Threads

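Shows the number of active threads and the queue length of the metadata thread pool on each worker. Used to diagnose metadata performance bottlenecks.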

alluxio_rpc_executor_max_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_active_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_queue_length{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}

short

Request Failure

Shows the failure rate of internal metadata operations on each worker. Helps locate workers that are experiencing internal errors.

irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster", instance=~"$instance"}[5m])

short

Off Heap Memory

Shows the off-heap memory usage of each worker. Important for monitoring memory resources and preventing out-of-memory errors.

sum(alluxio_rocksdb_memory_usage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) by (instance) + sum(alluxio_netty_direct_memory_usage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) by (instance) + sum(jvm_memory_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",area="nonheap"}) by (instance) + sum(jvm_buffer_pool_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",pool="direct"}) by (instance) + sum(jvm_buffer_pool_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",pool="mapped"}) by (instance)

bytes

PageStore Errors

Shows the error rate of PageStore operations on each worker. May indicate a problem with the local cache storage layer.

irate(alluxio_page_store_operation_errors_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m])

short

PageStore Disk Error Rate

Shows the error rate of disk-related PageStore operations on each worker. May indicate a problem with the underlying disk hardware or file system.

sum by (instance, dir) (irate(alluxio_page_store_dir_operation_errors_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) or on() vector(0)) / sum by (instance, dir) (irate(alluxio_page_store_dir_operations_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]))

percentunit

yellow: > 0.1, red: > 0.3
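The Storage Device Usage panel in this row computes usage as one minus the ratio of available to total capacity, then colors it with the thresholds yellow > 0.8 and red > 0.9. The sketch below reproduces that arithmetic in Python. The function names, the "green" base color, and the sample capacities are assumptions made for illustration.

```python
def device_usage(available_bytes, total_bytes):
    """Usage fraction of a page store device: 1 - available / total."""
    if total_bytes <= 0:
        raise ValueError("total capacity must be positive")
    return 1 - available_bytes / total_bytes


def severity(usage, yellow=0.8, red=0.9):
    """Map a usage fraction to a color using the documented thresholds.

    "green" as the base color is an assumption; the table lists only
    the yellow and red steps.
    """
    if usage > red:
        return "red"
    if usage > yellow:
        return "yellow"
    return "green"


GIB = 1024 ** 3
# Hypothetical directories on one worker: (available, total) in bytes.
directories = {
    "/mnt/cache0": (50 * GIB, 200 * GIB),
    "/mnt/cache1": (30 * GIB, 200 * GIB),
    "/mnt/cache2": (10 * GIB, 200 * GIB),
}
report = {
    path: severity(device_usage(available, total))
    for path, (available, total) in directories.items()
}
```

Evaluating each directory separately matters because a single full device can cause write failures even when the worker's aggregate usage looks healthy.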


UFS row

Panel name
Description
Calculation
Unit
Threshold

Request Latency - P90

Shows the 90th percentile latency of UFS operations on each worker. Helps identify slow UFS interactions on specific workers.

histogram_quantile(0.90, rate(alluxio_ufs_latency_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]))

ms

Request Latency - P99

Shows the 99th percentile latency of UFS operations on each worker. Helps identify worst-case UFS performance on specific workers.

histogram_quantile(0.99, rate(alluxio_ufs_latency_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]))

ms

Throughput - Read

Shows the throughput of reads from the UFS on each worker. Used to monitor the amount of data read from the underlying storage.

irate(alluxio_ufs_data_access_bytes_total{method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Throughput - Write

Shows the throughput of writes to the UFS on each worker. Used to monitor the amount of data written to the underlying storage.

irate(alluxio_ufs_data_access_bytes_total{method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Errors

Shows the error rate of UFS operations on each worker. Essential for diagnosing problems with the underlying storage system.

irate(alluxio_ufs_error_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])

none


Job row

Panel name
Description
Calculation
Unit
Threshold

Jobs

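A time-series visualization of job execution patterns. Important for understanding the level of background activity in the cluster and identifying periods of heavy maintenance activity that may affect performance.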

alluxio_active_job_count{job="coordinator",type="running"}, alluxio_active_job_count{job="coordinator",type="waiting"}

short

Job Tasks per Worker

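A time-series visualization of job tasks executing on each worker. Important for understanding worker activity levels and identifying bottlenecks that may affect performance.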

alluxio_worker_job_task_count{}

short

Job Threads per Worker

Shows the number of active threads and the queue length of the load job thread pool on each worker. Used to diagnose load job performance bottlenecks.

alluxio_rpc_executor_max_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_active_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_queue_length{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}

short

Job Dispatched Per Second

Shows the job dispatch rate for each worker.

sum by(worker)(irate(alluxio_distributed_load_job_dispatched_size_total{job="coordinator", instance=~"$instance"} [5m]))

Distributed Load Throughput

Shows the load job throughput of each worker. Used to monitor load job activity on each worker.

sum(irate(alluxio_distributed_load_data_loaded_from_ufs_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])), irate(alluxio_distributed_load_data_loaded_from_ufs_bytes_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])

binBps

Bytes Loaded on Workers Per Second

Shows the number of bytes loaded per second on each worker.

sum by(instance) (irate(alluxio_distributed_worker_bytes_loaded_bytes_total{job="worker", instance=~"$instance"} [5m])), sum by()(irate(alluxio_distributed_load_job_loaded_bytes_total{job="coordinator", instance=~"$instance"} [5m]))

binBps

Distributed Load Operation Counts Per Second

Shows the rate of distributed load operations.

irate(alluxio_distributed_load_job_scanned_total{job="coordinator", instance=~"$instance"}[5m]), irate(alluxio_distributed_load_job_processed_total{job="coordinator", instance=~"$instance"}[5m]), irate(alluxio_distributed_load_job_skipped_total{job="coordinator", instance=~"$instance"}[5m]), sum by()(irate(alluxio_distributed_load_job_failure_total{job="coordinator", instance=~"$instance"}[5m]))

short

Distributed Load Failure Breakdowns

Shows a breakdown of failures by reason and worker.

sum by(reason, worker) (irate(alluxio_distributed_load_job_failure_total{job="coordinator", instance=~"$instance"} [5m]))


Process row

Panel name
Description
Calculation
Unit
Threshold

Active Worker Membership

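Shows the total number of worker membership refreshes on each instance. Helps track cluster topology updates and detect frequent worker reconfiguration or potential instability in the cluster.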

irate(alluxio_worker_membership_refresh_count_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) > 0

short

Resource Pool Usage

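Shows the utilization of dynamic resource pools, calculated as the current number of resources divided by the maximum capacity. Helps identify overused or underused resource pools and potential resource bottlenecks in the cluster.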

sum by (instance, pool_kind, pool_instance) (alluxio_dynamic_resource_pool_current_resources{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}) / sum by (instance, pool_kind, pool_instance) (alluxio_dynamic_resource_pool_capacity{type="max",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"})

percentunit

yellow: > 0.8

Resource Pool - Acquisition Timeout

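Shows the rate of acquisition timeouts in dynamic resource pools. Helps identify contention or delays in resource allocation and potential performance bottlenecks in the cluster.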

irate(alluxio_dynamic_resource_pool_acquisition_timeouts_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])

short

Resource Pool - Resource Creation Latency (P99)

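Shows the 99th percentile latency of creating new resources in dynamic resource pools. Helps identify slow resource allocation and potential performance bottlenecks in the cluster.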

histogram_quantile( 0.99, sum( rate(alluxio_dynamic_resource_pool_create_new_resource_latency_ms_bucket{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, pool_kind, pool_instance, instance) )

ms

CPU time spent

Shows the CPU time consumed by each process.

irate(process_cpu_seconds_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])

µs

Threads

Shows the number of threads in each process.

jvm_threads_current{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}

Heap Usage

Shows the heap memory usage of each process.

jvm_memory_used_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}

bytes

Heap Usage(%)

Shows the heap memory usage of each process as a percentage of the maximum heap size.

jvm_memory_used_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"} / jvm_memory_max_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}

percentunit

yellow: > 0.9

young GC time(per minute)

Shows the time spent in young generation garbage collection per minute.

irate(jvm_gc_collection_seconds_sum{gc="G1 Young Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60

s

young GC rate(per minute)

Shows the number of young generation garbage collections per minute.

irate(jvm_gc_collection_seconds_count{gc="G1 Young Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60

old GC time(per minute)

Shows the time spent in old generation garbage collection per minute.

irate(jvm_gc_collection_seconds_sum{gc="G1 Old Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60

s

old GC rate(per minute)

Shows the number of old generation garbage collections per minute.

irate(jvm_gc_collection_seconds_count{gc="G1 Old Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60
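The Resource Pool Usage panel in this row divides two `sum by (instance, pool_kind, pool_instance)` aggregations. In PromQL, such a division pairs series whose label sets match and drops the rest. The following Python sketch imitates that grouping and matching under simplified assumptions. The helper names and all label values are hypothetical.

```python
from collections import defaultdict

GROUP_LABELS = ("instance", "pool_kind", "pool_instance")


def sum_by(series, labels=GROUP_LABELS):
    """Sum (label_dict, value) samples grouped by the given label names."""
    grouped = defaultdict(float)
    for label_set, value in series:
        key = tuple(label_set[name] for name in labels)
        grouped[key] += value
    return dict(grouped)


def pool_usage(current, capacity):
    """Current resources divided by max capacity, per matching group.

    Groups that appear on only one side are dropped, as in a PromQL
    one-to-one vector match. Zero capacities are skipped here.
    """
    used = sum_by(current)
    limit = sum_by(capacity)
    return {
        key: used[key] / limit[key]
        for key in used
        if key in limit and limit[key] > 0
    }


pool_a = {"instance": "worker-1", "pool_kind": "netty", "pool_instance": "a"}
pool_b = {"instance": "worker-2", "pool_kind": "grpc", "pool_instance": "b"}
usage = pool_usage(
    current=[(pool_a, 3.0), (pool_a, 1.0), (pool_b, 2.0)],
    capacity=[(pool_a, 8.0)],
)
```

A pool that reports current resources but no `type="max"` capacity series disappears from the panel rather than showing an error, which is worth remembering when a pool seems to be missing.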


ETCD row

Panel name
Description
Calculation
Unit
Threshold

Up

Shows the number of nodes in the ETCD cluster.

sum(etcd_server_has_leader{job="etcd"})

yellow: > 1, green: > 3

RPC Rate

Shows the RPC request rate and failure rate.

sum(rate(grpc_server_started_total{job="etcd",grpc_type="unary"}[5m])), sum(rate(grpc_server_handled_total{job="etcd",grpc_type="unary",grpc_code!="OK"}[5m]))

ops

Active Streams

Shows the number of active watch streams and lease streams.

sum(grpc_server_started_total{job="etcd",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{job="etcd",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}), sum(grpc_server_started_total{job="etcd",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{job="etcd",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"})

ETCD Client Call Failure Rate

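Shows the error rate of Etcd client calls. Helps identify failed Etcd requests and potential reliability or connectivity issues between the cluster and the Etcd service.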

(sum by (instance, server) (irate(alluxio_etcd_call_errors_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0)) / (sum by (instance, server) (irate(alluxio_etcd_client_calls_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])))

percentunit

yellow: > 0.05, red: > 0.2

ETCD Client Call Latency (P99)

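Shows the 99th percentile latency of Etcd client calls. Helps identify slow Etcd operations and potential performance bottlenecks in Etcd communication within the cluster.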

histogram_quantile( 0.99, sum( rate(alluxio_etcd_client_call_latency_ms_bucket{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )

ms

DB Size

Shows the size of the ETCD database.

etcd_mvcc_db_total_size_in_bytes{job="etcd"}

bytes

Disk Sync Duration

Shows the disk sync duration for WAL and backend operations.

histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le)), histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le))

Memory

Shows the resident memory usage of the ETCD process.

process_resident_memory_bytes{job="etcd"}

bytes

Client Traffic In

Shows the rate of inbound client traffic.

rate(etcd_network_client_grpc_received_bytes_total{job="etcd"}[5m])

binBps

Client Traffic Out

Shows the rate of outbound client traffic.

rate(etcd_network_client_grpc_sent_bytes_total{job="etcd"}[5m])

binBps

Peer Traffic In

Shows the rate of inbound peer traffic.

sum(rate(etcd_network_peer_received_bytes_total{job="etcd"}[5m])) by (instance)

binBps

Peer Traffic Out

Shows the rate of outbound peer traffic.

sum(rate(etcd_network_peer_sent_bytes_total{job="etcd"}[5m])) by (instance)

binBps

Raft Proposals

Shows raft proposal metrics, including the failure rate, the total number pending, the commit rate, and the apply rate.

sum(rate(etcd_server_proposals_failed_total{job="etcd"}[5m])), sum(etcd_server_proposals_pending{job="etcd"}), sum(rate(etcd_server_proposals_committed_total{job="etcd"}[5m])), sum(rate(etcd_server_proposals_applied_total{job="etcd"}[5m]))

none

Total Leader Elections Per Day

Shows the total number of leader elections per day.

changes(etcd_server_leader_changes_seen_total{job="etcd"}[1d])
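The Total Leader Elections Per Day panel uses PromQL's `changes()`, which counts how many times a series' value differs from the previous sample within the range. A minimal Python sketch of that counting rule is shown below. The function name mirrors the PromQL function, and the sample values are hypothetical.

```python
def changes(values):
    """Count how many times consecutive samples differ within a window."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)


# Hypothetical samples of a leader-change counter over one day.
leader_changes_seen = [3, 3, 4, 4, 4, 6]
elections_observed = changes(leader_changes_seen)
```

Note that a jump from 4 to 6 between two scrapes counts as a single change, so the panel can undercount when several elections happen within one scrape interval.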

