Grafana Dashboard
Overview
This document describes every panel in the Alluxio Grafana dashboard, including metric names, calculation methods, and threshold configuration.
Cluster Row
Storage
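Shows the total capacity of the Alluxio cache, the storage in use, and the usage percentage. This panel is essential for monitoring overall cache utilization and preventing storage exhaustion.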
sum(alluxio_cached_capacity_bytes{job="worker",cluster_name=~"$cluster"}), sum(alluxio_cached_storage_bytes{job="worker",cluster_name=~"$cluster"}), sum(alluxio_cached_storage_bytes{job="worker",cluster_name=~"$cluster"}) / sum(alluxio_cached_capacity_bytes{job="worker",cluster_name=~"$cluster"})
bytes
yellow: > 0.9, red: > 0.95
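The usage ratio and its thresholds can be illustrated with a small Python sketch. It mirrors the third query (summed used bytes divided by summed capacity) and the yellow/red thresholds listed above. The function name, the sample values, and the "green" base color are assumptions made for illustration; they are not taken from the dashboard definition.

```python
def storage_status(used_bytes, capacity_bytes, yellow=0.9, red=0.95):
    """Return (usage_ratio, color) using the Storage panel thresholds.

    "green" as the color below all thresholds is an assumption.
    """
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    ratio = used_bytes / capacity_bytes
    if ratio > red:
        color = "red"
    elif ratio > yellow:
        color = "yellow"
    else:
        color = "green"
    return ratio, color


# Hypothetical per-worker samples, in bytes.
workers_used = [800 * 2**30, 950 * 2**30]
workers_capacity = [1000 * 2**30, 1000 * 2**30]

# Sum across workers first, then divide, as the panel query does.
ratio, color = storage_status(sum(workers_used), sum(workers_capacity))
print(f"{ratio:.3f} {color}")
```

Summing before dividing weights each worker by its capacity, which differs from averaging per-worker ratios when workers have unequal cache sizes.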
Read - Throughput
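Shows read throughput from the different data sources: Fuse, the S3 API, and direct worker access. Helps identify the main data access interface for read operations.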
sum(irate(alluxio_data_access_bytes_sum{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_throughput_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), avg(avg_over_time(alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-reader"}[5m])) or on() vector(0)
binBps
yellow: > 100, red: > 2000
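Most throughput queries in this document combine `irate(...[5m])` with `or on() vector(0)`. As a rough illustration, the Python sketch below computes an instant rate from the last two samples of a counter and falls back to 0 when no rate is available. The sample data and function names are hypothetical, and the sketch simplifies what Prometheus does internally.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_s, value) samples.

    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if t1 <= t0:
        return None
    # A counter that went down is treated as a reset: count from zero.
    delta = v1 - v0 if v1 >= v0 else v1
    return delta / (t1 - t0)


def rate_or_zero(samples):
    """Mimic `... or on() vector(0)`: show 0 instead of an empty result."""
    rate = irate(samples)
    return 0.0 if rate is None else rate


# Hypothetical byte counter scraped every 15 seconds.
fuse_read_bytes = [(0, 100), (15, 400), (30, 1000)]
print(rate_or_zero(fuse_read_bytes))  # (1000 - 400) / 15 bytes per second
print(rate_or_zero([]))               # no data, falls back to 0.0
```

Because only the last two samples are used, the result reacts quickly to bursts but is noisier than an average over the whole window.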
Read - Load (5m)
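Shows the average read load (thread utilization) across all workers over the past 5 minutes. A high average load may indicate a system-wide performance bottleneck.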
avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]))
percentunit
yellow: > 0.8
Read - Hotspot (load > 50%)
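Shows the read load of each worker (only workers above 50% utilization are displayed) and highlights the most heavily loaded worker. Helps identify uneven I/O distribution and potential hotspots in the cluster.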
avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5
percentunit
yellow: > 0.8
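The Load and Hotspot panels both derive from the same per-worker ratio: averaged active threads divided by averaged maximum threads. The sketch below shows the cluster average used by the Load panel and the `> 0.5` filter used by the Hotspot panel. Worker names and values are hypothetical, and the function is only an illustration of the arithmetic, not dashboard code.

```python
def load_summary(active_avg, max_avg, threshold=0.5):
    """Return (cluster_average, hot_workers, busiest_worker).

    active_avg / max_avg map a worker name to its 5-minute average of
    active threads and maximum threads for one executor.
    """
    util = {
        worker: active_avg[worker] / max_avg[worker]
        for worker in active_avg
        if max_avg.get(worker)  # skip workers with no or zero max threads
    }
    cluster_average = sum(util.values()) / len(util) if util else None
    # The Hotspot panel keeps only series above the threshold.
    hot = {worker: u for worker, u in util.items() if u > threshold}
    busiest = max(hot, key=hot.get) if hot else None
    return cluster_average, hot, busiest


# Hypothetical netty-reader samples for three workers.
active = {"worker-1": 10, "worker-2": 60, "worker-3": 90}
maximum = {"worker-1": 100, "worker-2": 100, "worker-3": 100}

average, hot, busiest = load_summary(active, maximum)
print(average, sorted(hot), busiest)
```

A moderate cluster average combined with one worker near 100% points to skewed load rather than an undersized cluster.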
Cache File
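Shows the total number of files and pages currently stored in the Alluxio cache. Helps in understanding the composition and granularity of the cached data.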
sum(alluxio_data_cached_files{job="worker",cluster_name=~"$cluster"}), sum(alluxio_data_cached_pages{job="worker",cluster_name=~"$cluster"})
short
Write - Throughput
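Shows write throughput from the different data sources: Fuse, the S3 API, and direct worker access. Helps identify the main data access interface for write operations.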
sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_throughput_bytes_total{job="worker",destination="write_buffer",method="write"}[5m])) or on() vector(0), avg(avg_over_time(alluxio_rpc_executor_current_queue_length{job="worker",executor_name="grpc-metadata"}[5m]))
binBps
yellow: > 100, red: > 2000
Write - Load (5m)
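Shows the average write load (thread utilization) across all workers over the past 5 minutes. A high average load may indicate a system-wide performance bottleneck.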
avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]))
percentunit
yellow: > 0.8
Write - Hotspot (load > 50%)
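Shows the write load of each worker (only workers above 50% utilization are displayed) and highlights the most heavily loaded worker. Helps identify uneven I/O distribution and potential hotspots in the cluster.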
avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5
percentunit
yellow: > 0.8
Workers
Shows the number of live (healthy) and lost (unhealthy) workers in the cluster. This is a key indicator of cluster health and stability.
sum(up{job="worker",cluster_name=~"$cluster"}), sum by () (prometheus_target_scrape_pool_targets{scrape_job="worker"}) - sum by () (up{job="worker",cluster_name=~"$cluster"})
yellow: > 1
Jobs
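Shows the number of jobs currently running and waiting in the job service queue. Helps monitor the status of asynchronous operations such as data loading.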
sum(alluxio_active_job_count{job="coordinator",type="running"}), sum(alluxio_active_job_count{job="coordinator",type="waiting"})
short
yellow: > 100
Meta - RPS
Shows metadata requests per second (RPS) from the different data sources: Fuse, the S3 API, and direct worker access. Important for monitoring the intensity of the metadata workload.
sum(alluxio_fuse_concurrency{job="fuse",method=~"Fuse\\.Create|Fuse\\.Getattr|Fuse\\.Readdir|Fuse\\.Statfs|Fuse\\.Unlink",cluster_name=~"$cluster"}) or on() vector(0), sum(irate(alluxio_s3_api_call_latency_ms_count{job="worker", method!~"GetObject|PutObject",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0), avg(avg_over_time(alluxio_rpc_executor_current_queue_length{job="worker",executor_name="grpc-metadata"}[5m]))
reqps
yellow: > 100, red: > 2000
Meta - Load (5m)
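Shows the average metadata operation load (thread utilization) across all workers over the past 5 minutes. A high load may indicate a workload dominated by metadata operations.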
avg(avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]))
percentunit
yellow: > 0.8
Meta - Hotspot (load > 50%)
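Shows the metadata operation load of each worker (only workers above 50% utilization are displayed) and highlights the most heavily loaded worker. Helps identify uneven I/O distribution and potential hotspots in the cluster.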
avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-reader"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-reader"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="netty-writer"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="netty-writer"}[5m]) > 0.5, avg_over_time(alluxio_rpc_executor_active_threads{job="worker", executor_name="grpc-metadata"}[5m])/avg_over_time(alluxio_rpc_executor_max_threads{job="worker", executor_name="grpc-metadata"}[5m]) > 0.5
percentunit
yellow: > 0.8
License (Valid To)
Shows the expiration date of the Alluxio Enterprise Edition license. Important for ensuring continued access to enterprise features.
min(alluxio_license_expiration_date{job=~"coordinator|worker"}) * 1000
dateTimeFromNow
Component Version
Shows the distribution of Alluxio versions running in the cluster. Used to verify version consistency during upgrades or troubleshooting.
count (alluxio_version_info{cluster_name=~"$cluster"}) by (version)
Component Uptime
Shows the uptime of each Alluxio component (coordinator, worker, fuse). Helps track service stability and identify recent restarts.
timestamp(process_start_time_seconds{job=~"coordinator|worker|fuse",cluster_name=~"$cluster"}) - process_start_time_seconds{job=~"coordinator|worker|fuse",cluster_name=~"$cluster"}
s
Cache Hit(%)
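Shows the percentage of data and metadata read requests served from the Alluxio cache. Higher values indicate better cache efficiency and less pressure on the underlying storage.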
sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) / (sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) + (sum(irate(alluxio_missed_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0))), sum(irate(alluxio_metadata_cache_hit_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) / (sum(irate(alluxio_metadata_cache_hit_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) + sum(irate(alluxio_metadata_cache_miss_calls_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0))
percentunit
green: > 0.8
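The hit ratio above is hits divided by hits plus misses, with the miss term defaulting to 0 when no miss series exists. A minimal Python sketch of that arithmetic follows. The function name and sample rates are hypothetical, and returning `None` for an idle cache is a choice made for this sketch (the PromQL expression would yield NaN for 0/0).

```python
def hit_ratio(hit_rate, miss_rate=None):
    """Cache hit ratio from per-second hit and miss rates.

    A missing miss series is treated as 0, as in the
    `or on() vector(0)` fallback of the data cache query.
    """
    miss = 0.0 if miss_rate is None else miss_rate
    total = hit_rate + miss
    if total == 0:
        return None  # no traffic: the ratio is undefined
    return hit_rate / total


# Hypothetical rates in bytes per second.
print(hit_ratio(80.0, 20.0))  # some reads went to the UFS
print(hit_ratio(5.0))         # no miss series yet: everything was a hit
print(hit_ratio(0.0, 0.0))    # idle cache
```

Note that the data ratio is weighted by bytes while the metadata ratio is weighted by calls, so the two series are not directly comparable.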
Cache Eviction
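Shows the rate at which data is evicted from the cache. A high eviction rate may indicate that the cache capacity is insufficient for the current workload.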
sum(irate(alluxio_cached_evicted_data_bytes_total{job="worker",cluster_name=~"$cluster"}[5m]))
binBps
Throughput - Read
Provides a detailed breakdown of read throughput across the different data paths (FUSE, S3, Worker, UFS). Essential for understanding data flow and identifying bottlenecks.
sum(irate(alluxio_data_access_bytes_sum{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_throughput_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_ufs_data_access_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)
binBps
Throughput - Write
Provides a detailed breakdown of write throughput across the different data paths (FUSE, S3, Worker, UFS). Essential for understanding write efficiency and patterns.
sum(irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_throughput_bytes_total{job="worker",destination="write_buffer",method="write"}[5m])) or on() vector(0), sum(irate(alluxio_write_buffer_async_persist_throughput_bytes_total{job="worker"}[5m])) or on() vector(0), sum(irate(alluxio_ufs_data_access_bytes_total{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)
binBps
Request/s - Read
Shows the total read request rate across all interfaces (FUSE, S3, Worker). Important for understanding read workload intensity and traffic patterns.
sum(irate(alluxio_data_access_bytes_count{job="fuse",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_s3_api_throughput_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_data_access_bytes_count{job="worker",method="read",cluster_name=~"$cluster"}[5m])) or on() vector(0)
reqps
Request/s - Write
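Shows the total write request rate across all interfaces. Essential for understanding write workload patterns and identifying write-intensive phases.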
sum(irate(alluxio_data_access_bytes_count{job="fuse",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(sum(irate(alluxio_s3_api_throughput_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0)) or on() vector(0), sum(irate(alluxio_data_access_bytes_count{job="worker",method="write",cluster_name=~"$cluster"}[5m])) or on() vector(0)
reqps
Request/s - Metadata
Shows the metadata operation request rate across all interfaces. Important for understanding metadata workload intensity, which affects overall cluster performance.
sum(alluxio_fuse_concurrency{job="fuse",method=~"Fuse\\.Create|Fuse\\.Getattr|Fuse\\.Readdir|Fuse\\.Statfs|Fuse\\.Unlink",cluster_name=~"$cluster"}) or on() vector(0), sum(irate(alluxio_s3_api_call_latency_ms_count{job="worker", method!~"GetObject|PutObject",cluster_name=~"$cluster"}[5m])) or on() vector(0), sum(irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m])) or on() vector(0)
reqps
Jobs
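A time-series view of job execution patterns. Important for understanding the level of background activity in the cluster and for identifying periods of heavy maintenance activity that may affect performance.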
sum(alluxio_active_job_count{job="coordinator",type="running"}), sum(alluxio_active_job_count{job="coordinator",type="waiting"})
short
Request Latency - P90
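Shows the 90th percentile latency for all request types. This metric is key to understanding the typical user experience and identifying performance degradation.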
histogram_quantile(0.90, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",cluster_name=~"$cluster"}[5m])) by (le, method)), histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, op)), histogram_quantile(0.90, sum(rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method)), histogram_quantile(0.90, sum(rate(alluxio_ufs_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method, ufs_type))
ms
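The latency panels rely on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The Python sketch below is a simplified approximation of that behavior for classic histograms with non-negative bounds; the bucket boundaries and rates are hypothetical, and edge cases handled by Prometheus (for example negative bounds or q = 0) are deliberately left out.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_rate) pairs.

    The last bucket must have upper bound +Inf. Assumes 0 < q <= 1 and
    non-negative bounds, which holds for latency histograms.
    """
    buckets = sorted(buckets)
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least two buckets ending with +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if upper == math.inf:
        # The quantile lies beyond the largest finite bound.
        return buckets[-2][0]
    lower, prev = (0.0, 0.0) if i == 0 else buckets[i - 1]
    return lower + (upper - lower) * (rank - prev) / (count - prev)


# Hypothetical per-second bucket rates for a latency histogram in ms.
latency_buckets = [(10, 50), (50, 80), (100, 100), (math.inf, 100)]
p90 = histogram_quantile(0.90, latency_buckets)
print(p90)  # falls inside the (50, 100] bucket
```

Because the estimate is interpolated, its accuracy is bounded by the bucket width: a reported P90 only says which bucket the true value falls in, not its exact position.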
Request Latency - P99
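Shows the 99th percentile latency for all request types. This metric is essential for identifying worst-case performance and outliers that may indicate system stress.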
histogram_quantile(0.99, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",cluster_name=~"$cluster"}[5m])) by (le, method)), histogram_quantile(0.99, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, op)), histogram_quantile(0.99, sum(rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method)), histogram_quantile(0.99, sum(rate(alluxio_ufs_latency_ms_bucket{job="worker",cluster_name=~"$cluster"}[5m])) by (le, method, ufs_type))
ms
RPC (Client<->Worker) Latency - P90
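Shows the estimated network latency of client GetStatus gRPC calls (metadata operations), calculated as the difference between the latency observed by the client and the metadata processing latency on the worker. Helps identify network latency and connectivity problems between clients and the cluster.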
(histogram_quantile(0.90, sum(rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",method="GetStatus",instance=~"$instance",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0, (histogram_quantile(0.90, sum(rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",method="Fuse.Getattr",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.90, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0
ms
RPC (Client<->Worker) Latency - P99
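Shows the estimated network latency of client GetStatus gRPC calls (metadata operations), calculated as the difference between the latency observed by the client and the metadata processing latency on the worker. Helps identify network latency and connectivity problems between clients and the cluster.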
(histogram_quantile(0.99, sum(rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",method="GetStatus",instance=~"$instance",cluster_name=~"$cluster"}[5m])) by (le)) - histogram_quantile(0.99, sum(rate(alluxio_meta_operation_latency_ms_bucket{job="worker",op="getStatus",cluster_name=~"$cluster"}[5m])) by (le))) > 0
ms
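The estimate used by the RPC latency panels is a plain subtraction of two percentiles followed by a `> 0` filter, which hides the series whenever the difference is not positive. A minimal sketch, with hypothetical values and function name:

```python
def estimated_network_latency_ms(client_ms, worker_ms):
    """Client-observed latency minus worker processing latency.

    Returns None when the difference is not positive, mirroring the
    `> 0` filter that drops such points from the panel.
    """
    diff = client_ms - worker_ms
    return diff if diff > 0 else None


print(estimated_network_latency_ms(12.0, 4.5))  # network share of the call
print(estimated_network_latency_ms(3.0, 5.0))   # filtered out
```

Subtracting two independently estimated percentiles is only an approximation, so gaps in the panel do not necessarily mean the network latency was zero.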
Request Call - Success Rate
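Shows the success rate of each request type across all interfaces. A value below 99% (excluding Not Found errors) usually indicates a problem that needs investigation.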
(sum by (method) (irate(alluxio_fuse_call_latency_ms_count{job="fuse", success="true",cluster_name=~"$cluster"}[5m]))) / (sum by (method) (irate(alluxio_fuse_call_latency_ms_count{job="fuse", cluster_name=~"$cluster"}[5m]))), (sum by (method) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="true",cluster_name=~"$cluster"}[5m]))) / (sum by (method) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", cluster_name=~"$cluster"}[5m]))), (sum by (op) (irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m]) - irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster"}[5m]))) / (sum by (op) (irate(alluxio_meta_operation_total{job="worker",cluster_name=~"$cluster"}[5m]))), (sum by (ufs_type) (irate(alluxio_ufs_total{cluster_name=~"$cluster"}[5m]) - irate(alluxio_ufs_error_total{cluster_name=~"$cluster"}[5m]))) / (sum by (ufs_type) (irate(alluxio_ufs_latency_ms_count{cluster_name=~"$cluster"}[5m])))
percentunit
green: > 0.9
Request Call - Failures
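Shows the failure rate across all request types and interfaces. Spikes in this metric should be investigated immediately to identify system or application errors.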
sum by (method, state) (irate(alluxio_fuse_result_total{job="fuse",state!="SUCCESS", cluster_name=~"$cluster"}[5m])), sum by (method, status) (irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="false",cluster_name=~"$cluster"}[5m])), sum by (op) (irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster"}[5m])), sum by (ufs_type,error_code) (irate(alluxio_ufs_error_total{cluster_name=~"$cluster"}[5m]))
short
Fuse Row
Request Latency - P90
Shows the 90th percentile latency of operations on each Fuse mount. Helps pinpoint performance problems on specific Fuse clients.
histogram_quantile(0.90, rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m])), histogram_quantile( 0.90, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )
ms
Request Latency - P99
Shows the 99th percentile latency of operations on each Fuse mount. Helps pinpoint performance problems on specific Fuse clients.
histogram_quantile(0.99, rate(alluxio_fuse_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m])), histogram_quantile( 0.99, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )
ms
Throughput - Read
Shows the read throughput of a Fuse mount. Used to monitor read activity on a specific client.
irate(alluxio_data_access_bytes_sum{job="fuse",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Throughput - Write
Shows the write throughput of a Fuse mount. Used to monitor write activity on a specific client.
irate(alluxio_data_access_bytes_sum{job="fuse",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Request/s - Read
Shows the read request rate of a Fuse mount. Helps in understanding the read workload of a specific client.
irate(alluxio_data_access_bytes_count{job="fuse",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])
reqps
Request/s - Write
Shows the write request rate of a Fuse mount. Helps in understanding the write workload of a specific client.
irate(alluxio_data_access_bytes_count{job="fuse",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])
reqps
FUSE Request Failure
Shows the failure rate of a Fuse mount. Essential for diagnosing connectivity or operational problems on a specific client.
irate(alluxio_fuse_result_total{job="fuse",state!="SUCCESS", instance=~"$instance",cluster_name=~"$cluster"}[5m])
short
Throughput - UFS Fallback
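Read throughput when FUSE operations fall back to the under file system (UFS). A high fallback rate indicates cache misses or data that is not in Alluxio, which reduces the performance benefit. For best performance, this value should be as small as possible.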
irate(alluxio_ufs_data_access_bytes_total{job="fuse",instance=~"$instance",method="read",cluster_name=~"$cluster"}[5m]) or on() vector(0)
binBps
Client Request Latency (P99)
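Shows the latency distribution of gRPC (metadata operations) and Netty (data operations) client calls. Helps identify RPC communication delays, client-side bottlenecks, and potential network or service response problems within workers.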
histogram_quantile( 0.99, sum( rate(alluxio_grpc_client_call_latency_ms_bucket{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, method, instance) ), histogram_quantile( 0.99, sum( rate(alluxio_client_netty_read_time_to_receive_first_packet_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )
ms
Client Request Concurrency
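Shows the number of concurrent gRPC client calls (metadata operations). Helps identify spikes in metadata request load and potential client-side bottlenecks within workers.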
alluxio_grpc_client_concurrency{job="fuse",instance=~"$instance",cluster_name=~"$cluster"}
reqps
Client Request Failure Rate
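Shows the error rate of gRPC client calls (metadata operations) and Netty operation calls (data operations). Helps identify failed metadata requests and potential reliability problems in client-to-worker communication within the cluster.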
sum by (instance,method) (irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) / (sum by (instance,method) (irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) + sum by (instance,method) (irate(alluxio_grpc_client_successes_total{instance=~"$instance",cluster_name=~"$cluster"}[5m]))), sum by (instance,op) (irate(alluxio_netty_operation_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])) / (sum by (instance,op) (irate(alluxio_netty_operations_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])))
percentunit
green: > 0.9
Client Request Errors
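Shows the per-second rate of gRPC client errors (metadata operations) and Netty operation errors (data operations). Helps track failed requests and potential reliability problems in client-to-worker communication within the cluster.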
irate(alluxio_grpc_client_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m]), irate(alluxio_netty_operation_errors_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])
short
S3 Row
Request Latency - P90
Shows the 90th percentile latency of S3 API operations on each worker. Helps narrow S3 performance problems down to specific workers.
histogram_quantile(0.90, rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]))
ms
Request Latency - P99
Shows the 99th percentile latency of S3 API operations on each worker. Helps identify worst-case S3 performance on specific workers.
histogram_quantile(0.99, rate(alluxio_s3_api_call_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]))
ms
Throughput - Read
Shows the read throughput of S3 API operations on each worker. Used to monitor S3 read activity on specific workers.
irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Throughput - Write
Shows the write throughput of S3 API operations on each worker. Used to monitor S3 write activity on specific workers.
irate(alluxio_s3_api_throughput_bytes_sum{job="worker",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Request/s - Read
Shows the read request rate of S3 API operations on each worker. Helps in understanding the S3 read workload on specific workers.
irate(alluxio_s3_api_throughput_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])
reqps
Request/s - Write
Shows the write request rate of S3 API operations on each worker. Helps in understanding the S3 write workload on specific workers.
irate(alluxio_s3_api_throughput_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])
reqps
Request Failure
Shows the failure rate of S3 API operations on each worker. Essential for diagnosing S3-related problems on individual workers.
irate(alluxio_s3_api_call_latency_ms_count{job="worker", success="false",instance=~"$instance",cluster_name=~"$cluster"}[5m])
short
Worker Row
Storage (Data) Used
Shows the amount of storage used for cached data on each worker. Helps identify uneven storage distribution.
alluxio_cached_storage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}
bytes
Storage (Meta) Used
Shows the amount of storage used for cached metadata on each worker. Helps identify uneven storage distribution.
alluxio_metastore_storage_size_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}
bytes
Storage Device Usage
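Shows the capacity utilization of the page store device for each directory on a worker. Helps identify disk space consumption trends and potential storage exhaustion or imbalance in the cluster.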
1- (sum by (instance, dir) (alluxio_page_store_device_available_capacity_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) / sum by (instance, dir) (alluxio_page_store_device_total_capacity_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}))
percentunit
yellow: > 0.8, red: > 0.9
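Device usage is computed per (instance, dir) pair as one minus the available fraction. The sketch below reproduces that arithmetic together with the yellow/red thresholds above. Instance names, directories, byte values, and the "green" base color are hypothetical.

```python
def device_usage(available_bytes, total_bytes):
    """Usage ratio per (instance, dir): 1 - available / total."""
    usage = {}
    for key, total in total_bytes.items():
        if total > 0 and key in available_bytes:
            usage[key] = 1 - available_bytes[key] / total
    return usage


def usage_color(ratio, yellow=0.8, red=0.9):
    """Map a usage ratio to the panel thresholds ("green" is assumed)."""
    if ratio > red:
        return "red"
    if ratio > yellow:
        return "yellow"
    return "green"


# Hypothetical page store directories on one worker.
available = {("worker-1", "/mnt/a"): 50, ("worker-1", "/mnt/b"): 5}
total = {("worker-1", "/mnt/a"): 100, ("worker-1", "/mnt/b"): 100}

usage = device_usage(available, total)
for key, ratio in sorted(usage.items()):
    print(key, f"{ratio:.2f}", usage_color(ratio))
```

This measures how full the underlying device is, which can differ from cache usage when the device is shared with other data.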
Files
Shows the number of cached files on each worker. Helps in understanding the cache composition of each worker.
alluxio_data_cached_files{job="worker",instance=~"$instance",cluster_name=~"$cluster"}
short
Pages
Shows the number of cached pages on each worker. Helps in understanding the cache composition of each worker.
alluxio_data_cached_pages{job="worker",instance=~"$instance",cluster_name=~"$cluster"}
short
Cache Evicted
Shows the eviction rate of cached data on each worker. Helps identify which workers are under the most cache capacity pressure.
irate(alluxio_cached_evicted_data_bytes_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Request Latency - P90
Shows the 90th percentile latency of internal metadata operations on each worker.
histogram_quantile(0.90, rate(alluxio_meta_operation_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])), histogram_quantile( 0.90, sum( rate(alluxio_worker_netty_read_time_to_send_first_packet_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )
ms
Request Latency - P99
Shows the 99th percentile latency of internal metadata operations on each worker.
histogram_quantile(0.99, rate(alluxio_meta_operation_latency_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])), histogram_quantile( 0.99, sum( rate(alluxio_worker_netty_read_time_to_send_first_packet_ms_bucket{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )
ms
Request (from Page Store or UFS) Latency - P90
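Shows the latency distribution of read operations served by Alluxio workers from the page store or the under file system (UFS). Helps identify read performance bottlenecks and distinguish slow storage backends from overloaded workers in the cluster.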
histogram_quantile( 0.90, sum( rate(alluxio_worker_netty_read_storage_response_time_ms_bucket{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) ) by (le, instance) )
ms
PageStore IO Latency - P90
Shows the 90th percentile latency of page store I/O operations (data operations) on workers. Helps identify slow disk I/O and potential storage bottlenecks in the cluster.
histogram_quantile( 0.90, sum( rate(alluxio_page_store_io_latency_microseconds_bucket{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) ) by (le, op, instance) )
µs
Read - Throughput
Shows the read throughput of each worker. Used to monitor read activity and load on each worker.
irate(alluxio_data_throughput_bytes_total{job="worker",method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Write - Throughput
Shows the write throughput of each worker. Used to monitor write activity and load on each worker.
irate(alluxio_data_throughput_bytes_total{job="worker",method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Read - Request/s
Shows the read request rate of each worker. Helps in understanding how the read workload is distributed across workers.
irate(alluxio_data_access_bytes_count{method="read",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])
reqps
Write - Request/s
Shows the write request rate of each worker. Helps in understanding how the write workload is distributed across workers.
irate(alluxio_data_access_bytes_count{method="write",job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])
reqps
Read - Threads
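Shows the number of active threads and the queue length of the read thread pool on each worker. Used to diagnose read performance bottlenecks.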
alluxio_rpc_executor_max_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_active_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_threads{job="worker",executor_name="netty-reader",instance=~"$instance",cluster_name=~"$cluster"}
short
Write - Threads
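Shows the number of active threads and the queue length of the write thread pool on each worker. Used to diagnose write performance bottlenecks.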
alluxio_rpc_executor_max_threads{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_active_threads{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_queue_length{job="worker",executor_name="netty-writer",instance=~"$instance",cluster_name=~"$cluster"}
short
Metadata - Request/s
Shows the metadata request rate of each worker. Helps identify workers with a high metadata load.
irate(alluxio_meta_operation_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])
reqps
Metadata - Threads
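Shows the number of active threads and the queue length of the metadata thread pool on each worker. Used to diagnose metadata performance bottlenecks.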
alluxio_rpc_executor_max_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_active_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_queue_length{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_threads{job="worker",executor_name="grpc-metadata",instance=~"$instance",cluster_name=~"$cluster"}
short
Request Failure
Shows the failure rate of internal metadata operations on each worker. Helps locate workers that are experiencing internal errors.
irate(alluxio_meta_operation_errors_total{job="worker",cluster_name=~"$cluster", instance=~"$instance"}[5m])
short
Off Heap Memory
Shows the off-heap memory usage of each worker. Important for monitoring memory resources and preventing out-of-memory errors.
sum(alluxio_rocksdb_memory_usage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) by (instance) + sum(alluxio_netty_direct_memory_usage_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster"}) by (instance) + sum(jvm_memory_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",area="nonheap"}) by (instance) + sum(jvm_buffer_pool_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",pool="direct"}) by (instance) + sum(jvm_buffer_pool_used_bytes{job="worker",instance=~"$instance",cluster_name=~"$cluster",pool="mapped"}) by (instance)
bytes
PageStore Errors
Shows the error rate of PageStore operations on each worker. May indicate a problem in the local cache storage layer.
irate(alluxio_page_store_operation_errors_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m])
short
PageStore Disk Error Rate
Shows the error rate of disk-related PageStore operations on each worker. May indicate a problem with the underlying disk hardware or file system.
sum by (instance, dir) (irate(alluxio_page_store_dir_operation_errors_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]) or on() vector(0)) / sum by (instance, dir) (irate(alluxio_page_store_dir_operations_total{job="worker", cluster_name=~"$cluster", instance=~"$instance"}[5m]))
percentunit
yellow: > 0.1, red: > 0.3
UFS Row
Request Latency - P90
Shows the 90th percentile latency of UFS operations on each worker. Helps identify slow UFS interactions on specific workers.
histogram_quantile(0.90, rate(alluxio_ufs_latency_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]))
ms
Request Latency - P99
Shows the 99th percentile latency of UFS operations on each worker. Helps identify worst-case UFS performance on specific workers.
histogram_quantile(0.99, rate(alluxio_ufs_latency_ms_bucket{instance=~"$instance",cluster_name=~"$cluster"}[5m]))
ms
Throughput - Read
Shows the throughput of reads from the UFS on each worker. Used to monitor the amount of data read from the underlying storage.
irate(alluxio_ufs_data_access_bytes_total{method="read",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Throughput - Write
Shows the throughput of writes to the UFS on each worker. Used to monitor the amount of data written to the underlying storage.
irate(alluxio_ufs_data_access_bytes_total{method="write",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Errors
Shows the error rate of UFS operations on each worker. Essential for diagnosing problems in the underlying storage system.
irate(alluxio_ufs_error_total{instance=~"$instance",cluster_name=~"$cluster"}[5m])
none
Job Row
Jobs
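A time-series view of job execution patterns. Important for understanding the level of background activity in the cluster and for identifying periods of heavy maintenance activity that may affect performance.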
alluxio_active_job_count{job="coordinator",type="running"}, alluxio_active_job_count{job="coordinator",type="waiting"}
short
Job Tasks per Worker
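A time-series view of job tasks executing on each worker. Important for understanding worker activity levels and identifying bottlenecks that may affect performance.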
alluxio_worker_job_task_count{}
short
Job Threads per Worker
Shows the number of active threads and the queue length of the load job thread pool on each worker. Used to diagnose load job performance bottlenecks.
alluxio_rpc_executor_max_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_active_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_queue_length{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}, alluxio_rpc_executor_current_threads{job="worker",executor_name="load-executor",instance=~"$instance",cluster_name=~"$cluster"}
short
Job Dispatched Per Second
Shows the job dispatch rate for each worker.
sum by(worker)(irate(alluxio_distributed_load_job_dispatched_size_total{job="coordinator", instance=~"$instance"} [5m]))
Distributed Load Throughput
Shows the load job throughput of each worker. Used to monitor load job activity on each worker.
sum(irate(alluxio_distributed_load_data_loaded_from_ufs_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])), irate(alluxio_distributed_load_data_loaded_from_ufs_bytes_total{job="worker",instance=~"$instance",cluster_name=~"$cluster"}[5m])
binBps
Bytes Loaded on Workers Per Second
Shows the number of bytes loaded per second on each worker.
sum by(instance) (irate(alluxio_distributed_worker_bytes_loaded_bytes_total{job="worker", instance=~"$instance"} [5m])), sum by()(irate(alluxio_distributed_load_job_loaded_bytes_total{job="coordinator", instance=~"$instance"} [5m]))
binBps
Distributed Load Operation Counts Per Second
Shows the rate of distributed load operations.
irate(alluxio_distributed_load_job_scanned_total{job="coordinator", instance=~"$instance"}[5m]), irate(alluxio_distributed_load_job_processed_total{job="coordinator", instance=~"$instance"}[5m]), irate(alluxio_distributed_load_job_skipped_total{job="coordinator", instance=~"$instance"}[5m]), sum by()(irate(alluxio_distributed_load_job_failure_total{job="coordinator", instance=~"$instance"}[5m]))
short
Distributed Load Failure Breakdowns
Shows a breakdown of failures by reason and worker.
sum by(reason, worker) (irate(alluxio_distributed_load_job_failure_total{job="coordinator", instance=~"$instance"} [5m]))
Process Row
Active Worker Membership
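Shows the rate of worker membership refreshes on each instance. Helps track cluster topology updates and detect frequent worker reconfiguration or potential instability in the cluster.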
irate(alluxio_worker_membership_refresh_count_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) > 0
short
Resource Pool Usage
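Shows the utilization of dynamic resource pools, calculated as the current number of resources divided by the maximum capacity. Helps identify overused or underused resource pools and potential resource bottlenecks in the cluster.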
sum by (instance, pool_kind, pool_instance) (alluxio_dynamic_resource_pool_current_resources{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}) / sum by (instance, pool_kind, pool_instance) (alluxio_dynamic_resource_pool_capacity{type="max",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"})
percentunit
yellow: > 0.8
Resource Pool - Acquisition Timeout
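Shows the rate of acquisition timeouts in dynamic resource pools. Helps identify contention or delays in resource allocation and potential performance bottlenecks in the cluster.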
irate(alluxio_dynamic_resource_pool_acquisition_timeouts_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])
short
Resource Pool - Resource Creation Latency (P99)
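Shows the 99th percentile latency of creating a new resource in a dynamic resource pool. Helps identify slow resource allocation and potential performance bottlenecks in the cluster.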
histogram_quantile( 0.99, sum( rate(alluxio_dynamic_resource_pool_create_new_resource_latency_ms_bucket{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, pool_kind, pool_instance, instance) )
ms
CPU time spent
Shows the CPU time consumed by each process.
irate(process_cpu_seconds_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])
µs
Threads
Shows the number of threads in each process.
jvm_threads_current{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}
Heap Usage
Shows the heap memory usage of each process.
jvm_memory_used_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}
bytes
Heap Usage(%)
Shows the heap memory usage of each process as a percentage.
jvm_memory_used_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"} / jvm_memory_max_bytes{area="heap",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}
percentunit
yellow: > 0.9
young GC time(per minute)
Shows the time spent in young-generation garbage collection per minute.
irate(jvm_gc_collection_seconds_sum{gc="G1 Young Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60
s
young GC rate(per minute)
Shows the number of young-generation garbage collections per minute.
irate(jvm_gc_collection_seconds_count{gc="G1 Young Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60
old GC time(per minute)
Shows the time spent in old-generation garbage collection per minute.
irate(jvm_gc_collection_seconds_sum{gc="G1 Old Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60
s
old GC rate(per minute)
Shows the number of old-generation garbage collections per minute.
irate(jvm_gc_collection_seconds_count{gc="G1 Old Generation",job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) * 60
ETCD Row
Up
Shows the number of nodes in the ETCD cluster, counted as the members that currently report having a leader.
sum(etcd_server_has_leader{job="etcd"})
yellow: > 1, green: > 3
RPC Rate
Shows the RPC request rate and the RPC failure rate.
sum(rate(grpc_server_started_total{job="etcd",grpc_type="unary"}[5m])), sum(rate(grpc_server_handled_total{job="etcd",grpc_type="unary",grpc_code!="OK"}[5m]))
ops
Active Streams
Shows the number of active watch streams and lease streams.
sum(grpc_server_started_total{job="etcd",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{job="etcd",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}), sum(grpc_server_started_total{job="etcd",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{job="etcd",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"})
ETCD Client Call Failure Rate
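Shows the error rate of Etcd client calls. Helps identify failed Etcd requests and potential reliability or connectivity problems between the cluster and the Etcd service.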
(sum by (instance, server) (irate(alluxio_etcd_call_errors_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])) or on() vector(0)) / (sum by (instance, server) (irate(alluxio_etcd_client_calls_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])))
percentunit
yellow: > 0.05, red: > 0.2
ETCD Client Call Latency (P99)
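Shows the 99th percentile latency of Etcd client calls. Helps identify slow Etcd operations and potential performance bottlenecks in Etcd communication within the cluster.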
histogram_quantile( 0.99, sum( rate(alluxio_etcd_client_call_latency_ms_bucket{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m]) ) by (le, instance) )
ms
DB Size
Shows the size of the ETCD database.
etcd_mvcc_db_total_size_in_bytes{job="etcd"}
bytes
Disk Sync Duration
Shows the disk sync duration of WAL and backend operations.
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le)), histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le))
Memory
Shows the resident memory usage of the ETCD process.
process_resident_memory_bytes{job="etcd"}
bytes
Client Traffic In
Shows the rate of inbound client traffic.
rate(etcd_network_client_grpc_received_bytes_total{job="etcd"}[5m])
binBps
Client Traffic Out
Shows the rate of outbound client traffic.
rate(etcd_network_client_grpc_sent_bytes_total{job="etcd"}[5m])
binBps
Peer Traffic In
Shows the rate of inbound peer traffic.
sum(rate(etcd_network_peer_received_bytes_total{job="etcd"}[5m])) by (instance)
binBps
Peer Traffic Out
Shows the rate of outbound peer traffic.
sum(rate(etcd_network_peer_sent_bytes_total{job="etcd"}[5m])) by (instance)
binBps
Raft Proposals
Shows raft proposal metrics: the failure rate, the total number pending, the commit rate, and the apply rate.
sum(rate(etcd_server_proposals_failed_total{job="etcd"}[5m])), sum(etcd_server_proposals_pending{job="etcd"}), sum(rate(etcd_server_proposals_committed_total{job="etcd"}[5m])), sum(rate(etcd_server_proposals_applied_total{job="etcd"}[5m]))
none
Total Leader Elections Per Day
Shows the total number of leader elections per day.
changes(etcd_server_leader_changes_seen_total{job="etcd"}[1d])