# Monitoring

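Alluxio exposes metrics in the [Prometheus exposition format](https://prometheus.io/docs/instrumenting/exposition_formats/), so it integrates with standard monitoring stacks. This guide covers Prometheus configuration, Grafana dashboard import, alerting rules, and ways to query metrics directly, for both Kubernetes (Operator) and Docker/bare-metal deployments.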

## Prometheus Configuration

{% tabs %}
{% tab title="Kubernetes (Operator)" %}
The Alluxio Operator deploys a Prometheus instance automatically when it deploys the cluster; no manual configuration is needed.

Verify that Prometheus is running:

```shell
kubectl -n alx-ns get pod -l app.kubernetes.io/component=prometheus
```

```console
NAME                                          READY   STATUS    RESTARTS   AGE
alluxio-cluster-prometheus-6f697b6db8-sbvvg   1/1     Running   0          2m
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}
Run Prometheus on the Coordinator node and configure static scrape targets that point to every Alluxio component.

**Step 1: Create the Prometheus configuration file**

```shell
mkdir -p ~/monitoring/prometheus
```

Create `~/monitoring/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 60s

scrape_configs:
  - job_name: "coordinator"
    static_configs:
      - targets: ["<COORDINATOR_PRIVATE_IP>:19999"]
  - job_name: "workers"
    static_configs:
      - targets: ["<WORKER1_PRIVATE_IP>:30000", "<WORKER2_PRIVATE_IP>:30000"]
  - job_name: "fuse"
    static_configs:
      - targets: ["<FUSE_PRIVATE_IP>:49999"]
```

Add one entry to the `workers` job's `targets` list for each worker.

**Step 2: Start Prometheus**

```shell
docker run -d --net=host --name=prometheus \
  -v ~/monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus --config.file=/etc/prometheus/prometheus.yml
```

**Step 3: Verify that the targets are UP**

{% hint style="info" %}
Prometheus scrapes metrics at the `scrape_interval` configured above (60 seconds). Until the first scrape completes, targets show as `unknown`; wait up to 60 seconds before checking.
{% endhint %}

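Open `http://localhost:9090/targets` in a browser (or reach it through an SSH tunnel; see the access notes under [Grafana Configuration](#grafana-configuration)), or query the API, which prints one line per target: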

```shell
curl -s 'http://localhost:9090/api/v1/targets' | \
  python3 -c "import sys,json; [print(t['labels']['job'], t['health']) for t in json.load(sys.stdin)['data']['activeTargets']]"
```

```console
coordinator up
workers up
fuse up
```
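
When a cluster has many workers, a per-job summary can be easier to read than one line per target. The following is a minimal sketch, assuming the `/api/v1/targets` response shape used by the one-liner above (`data.activeTargets[].labels.job` and `.health`). The function names and the sample payload are illustrative only; they are not part of Alluxio or Prometheus.

```python
from collections import Counter


def summarize_targets(payload: dict) -> dict:
    """Count targets per (job, health) pair in a /api/v1/targets response body."""
    counts = Counter()
    for target in payload["data"]["activeTargets"]:
        counts[(target["labels"]["job"], target["health"])] += 1
    return dict(counts)


def unhealthy_jobs(payload: dict) -> list:
    """Return the sorted job names that have at least one target not reporting 'up'."""
    return sorted(
        {
            target["labels"]["job"]
            for target in payload["data"]["activeTargets"]
            if target["health"] != "up"
        }
    )


# Hand-written sample (an assumption for illustration), not real cluster output.
sample = {
    "data": {
        "activeTargets": [
            {"labels": {"job": "coordinator"}, "health": "up"},
            {"labels": {"job": "workers"}, "health": "up"},
            {"labels": {"job": "workers"}, "health": "down"},
            {"labels": {"job": "fuse"}, "health": "up"},
        ]
    }
}

print(summarize_targets(sample))
print(unhealthy_jobs(sample))
```

In practice the payload would come from `json.load()` on the response of `http://localhost:9090/api/v1/targets`.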

{% endtab %}
{% endtabs %}

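### Kubernetes: Using an Existing Prometheus

If the cluster already has a Prometheus instance, you can disable the Operator-managed Prometheus and scrape Alluxio metrics through Kubernetes service discovery.

Disable the Operator-managed Prometheus: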

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  prometheus:
    enabled: false
```

Add the following scrape configuration to your existing `prometheus.yml` to discover Alluxio Pods automatically through their annotations:

```yaml
scrape_configs:
  - job_name: 'alluxio-components'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only Pods annotated with prometheus.io/scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Keep only Alluxio components
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: alluxio
      # Use the metrics path from the annotation (defaults to /metrics)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use the port from the annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Set the job label from the component name
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        action: replace
        target_label: job
      # Propagate the cluster name for multi-cluster Grafana dashboards
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: replace
        target_label: cluster_name

  - job_name: 'etcd'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: etcd
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - target_label: job
        replacement: etcd
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: replace
        target_label: cluster_name
```
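
The port rewrite in the configuration above is the least obvious relabel rule. Prometheus joins the `source_labels` values with `;` (the default separator), matches the regex against the whole joined string, and writes `$1:$2` into `__address__`. The sketch below mimics that behavior in Python to show what the rule produces; `rewrite_address` is a hypothetical helper for illustration, not a Prometheus API.

```python
import re

# Prometheus anchors relabel regexes at both ends, so fullmatch() mirrors that.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Mimic the __address__ relabel rule: swap in the port from the annotation."""
    joined = f"{address};{port_annotation}"  # source labels joined with ';'
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        # No match: a 'replace' action leaves the target label unchanged.
        return address
    return f"{match.group(1)}:{match.group(2)}"


print(rewrite_address("10.0.0.5:8080", "30000"))  # 10.0.0.5:30000
print(rewrite_address("10.0.0.5", "19999"))       # 10.0.0.5:19999
print(rewrite_address("10.0.0.5:8080", ""))       # 10.0.0.5:8080
```

The last call shows why the `prometheus.io/port` annotation matters: without it the regex does not match and the discovered address is scraped as-is.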

Alluxio Pods must carry the following labels and annotations for service discovery to work:

```yaml
# Example metadata for an Alluxio worker Pod
metadata:
  labels:
    app.kubernetes.io/name: alluxio
    app.kubernetes.io/component: worker   # or coordinator, fuse
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "30000"           # 19999 for coordinator, 49999 for fuse
    prometheus.io/path: "/metrics/"
```

## Grafana Configuration

{% tabs %}
{% tab title="Kubernetes (Operator)" %}
The Operator deploys Grafana automatically along with the cluster.

#### Access through port forwarding (recommended)

```shell
kubectl -n alx-ns port-forward \
  $(kubectl -n alx-ns get pod -l app.kubernetes.io/component=grafana -o jsonpath="{.items[0].metadata.name}") \
  3000:3000
```

Then open `http://localhost:3000` in a browser.

#### Access through the node hostname

If the Kubernetes nodes are directly reachable on your network, look up the node that hosts Grafana:

```shell
kubectl -n alx-ns get pod \
  $(kubectl -n alx-ns get pod -l app.kubernetes.io/component=grafana --no-headers -o custom-columns=:metadata.name) \
  -o jsonpath='{.spec.nodeName}'
```

Then access Grafana at `http://<node-hostname>:8080/`.

#### Disable the default Grafana

To use your own Grafana instance, disable the Operator-managed Grafana:

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  grafana:
    enabled: false
```

{% hint style="info" %}
Prometheus is a core component of the Operator deployment and cannot be disabled separately.
{% endhint %}
{% endtab %}

{% tab title="Docker / Bare-Metal" %}
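Run Grafana on the Coordinator node alongside Prometheus. With a provisioned data source, Grafana needs no manual Prometheus connection setup after it starts.

**Step 1: Create the data source provisioning file**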

```shell
mkdir -p ~/monitoring/grafana/provisioning/datasources
```

Create `~/monitoring/grafana/provisioning/datasources/prometheus.yml`:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://localhost:9090
    isDefault: true
    access: proxy
    editable: true
```

**Step 2: Start Grafana**

```shell
docker run -d --net=host --name=grafana \
  -v ~/monitoring/grafana/provisioning:/etc/grafana/provisioning \
  -e GF_SECURITY_ADMIN_USER=admin \
  -e GF_SECURITY_ADMIN_PASSWORD=grafana \
  grafana/grafana
```

**Step 3: Access Grafana**

{% hint style="info" %}
EC2 security groups close ports 3000 (Grafana) and 9090 (Prometheus) by default. Either open these ports in the security group or use an SSH tunnel:

```shell
ssh -L 3000:localhost:3000 -L 9090:localhost:9090 user@<COORDINATOR_PUBLIC_IP>
```

Then access Grafana at `http://localhost:3000`.
{% endhint %}

If the security group already allows these ports, access `http://<COORDINATOR_PUBLIC_IP>:3000` directly (login: `admin` / `grafana`).
{% endtab %}
{% endtabs %}

## Importing the Dashboard

Download the official Alluxio dashboard template and import it into Grafana:

```shell
wget -O /tmp/alluxio-dashboard.json \
  https://alluxio-binaries.s3.amazonaws.com/artifactsBundle/ee/AI-3.8-15.1.0/alluxio-ai-dashboard-template.json
```

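In Grafana: **Dashboards → Import → Upload JSON file** → select `/tmp/alluxio-dashboard.json` → select the **Prometheus** data source → click **Import**. For detailed import options, see the [Grafana import guide](https://grafana.com/docs/grafana/latest/dashboards/export-import/#importing-a-dashboard).

### Dashboard Overview

* The **Cluster** section gives an overall view of cluster status.
* The **Process** section shows per-component resource consumption (CPU, memory) and JVM metrics.
* The remaining sections provide detailed metrics for the Coordinator, Workers, and the cache.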

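## Alerting Rules

The following queries can be used to build Prometheus alerting rules or Grafana alert panels. The thresholds are suggested starting values; tune them to your workload and cluster size.

The queries select `job="worker"`, which matches the job label produced by the Kubernetes configuration above. The Docker/bare-metal static configuration names that job `workers`, so adjust the selector there. Placeholders such as `$cluster`, `$service`, and `$instance` are Grafana dashboard variables; replace them with concrete label matchers when writing Prometheus rule files.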

### Process Availability - etcd

| Field | Value |
| --- | --- |
| Component | Process availability - etcd |
| Metric | `etcd_server_has_leader` |
| Metric description | Whether each etcd member currently has a leader |
| Query | `sum(etcd_server_has_leader{job="etcd"})` |
| Query description | Counts the etcd members that currently have a leader |
| Trigger condition | value < 3 |
| Threshold / reference | 3 members expected |
| Meaning | One or more etcd Pods are unhealthy, or the etcd cluster has lost quorum |
| Notes | |

| Field | Value |
| --- | --- |
| Component | Process availability - etcd |
| Metric | `etcd_server_leader_changes_seen_total` |
| Metric description | Number of leader changes seen |
| Query | `changes(etcd_server_leader_changes_seen_total{job="etcd"}[5m])` |
| Query description | Counts the leader election events in the last 5 minutes |
| Trigger condition | > 0 for more than 5 minutes |
| Threshold / reference | Any change greater than 0 |
| Meaning | The leader is unstable, which usually points to etcd instability or network problems |
| Notes | Change the query range on the dashboard from 1d to 5m |

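### Process Availability - Worker Count

| Field | Value |
| --- | --- |
| Component | Process availability - worker count |
| Metric | `up{job="worker"}` |
| Metric description | Whether a worker instance is alive (whether it responds to Prometheus scrapes) |
| Query | `sum(up{job="worker"})` |
| Query description | Counts the worker targets that are currently alive (up=1) |
| Trigger condition | value < expected worker count |
| Threshold / reference | Fewer than the expected number of workers |
| Meaning | One or more workers are down or not responding to scrapes |
| Notes | Set the expected worker count to the actual size of the production cluster |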

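### Process Resources

| Field | Value |
| --- | --- |
| Component | Process resources |
| Metric | `jvm_memory_used_bytes` |
| Metric description | Current JVM memory usage in bytes; divided by `jvm_memory_max_bytes`, it gives heap usage as a fraction of the maximum heap |
| Query | `jvm_memory_used_bytes{area="heap"}/jvm_memory_max_bytes{area="heap"}` |
| Query description | Current heap usage / maximum heap size, giving the heap usage ratio |
| Trigger condition | > 0.75 for more than 5 minutes |
| Threshold / reference | 75–80% |
| Meaning | The component is using a high share of its heap; it may be under memory pressure or about to enter frequent GC thrashing |
| Notes | Applies to all JVM components (coordinator, workers, fuse, and so on) |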

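| Field | Value |
| --- | --- |
| Component | Process resources |
| Metric | `jvm_gc_collection_seconds_sum` |
| Metric description | Total time spent in old-generation GC (Old GC) |
| Query | `rate(jvm_gc_collection_seconds_sum{gc="G1 Old Generation"}[5m])` |
| Query description | Seconds spent in old-generation / Full GC per second, averaged over the last 5 minutes |
| Trigger condition | > 5 s/min for more than 5 minutes |
| Threshold / reference | > 0.083 (5 s / 60 s) |
| Meaning | The JVM runs Full GC frequently, with a clear risk of pauses |
| Notes | Use together with the old-generation GC count to confirm the problem |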

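| Field | Value |
| --- | --- |
| Component | Process resources |
| Metric | `jvm_gc_collection_seconds_count` |
| Metric description | How often old-generation GC is triggered |
| Query | `rate(jvm_gc_collection_seconds_count{gc="G1 Old Generation"}[5m])` |
| Query description | Old-generation / Full GC runs per second, averaged over the last 5 minutes (multiply by 60 for runs per minute) |
| Trigger condition | > 1 run per minute for more than 5 minutes |
| Threshold / reference | > 0.0167 (1 run / 60 s) |
| Meaning | The JVM triggers Full GC often, usually because of memory pressure |
| Notes | Serves as an early warning of memory pressure |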

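| Field | Value |
| --- | --- |
| Component | Process resources |
| Metric | `jvm_gc_collection_seconds_sum` |
| Metric description | Total time spent in young-generation GC (Young GC) |
| Query | `rate(jvm_gc_collection_seconds_sum{gc="G1 Young Generation"}[5m])` |
| Query description | Seconds spent in young-generation GC per second, averaged over the last 5 minutes |
| Trigger condition | > 10 s/min for more than 5 minutes |
| Threshold / reference | > 0.166 (10 s / 60 s) |
| Meaning | GC overhead is high and noticeably reduces throughput |
| Notes | Alert only when the problem persists |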

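The GC-time thresholds in the tables above are stated per minute, while `rate()` returns a per-second value. As a sketch of the conversion (the helper name is illustrative):

```python
def per_minute_to_per_second(value_per_minute: float) -> float:
    """Convert a per-minute budget into the per-second value that rate() returns."""
    return value_per_minute / 60.0


old_gc_threshold = per_minute_to_per_second(5)     # 5 s of old-gen GC per minute
young_gc_threshold = per_minute_to_per_second(10)  # 10 s of young-gen GC per minute

print(f"{old_gc_threshold:.3f}")    # 0.083
print(f"{young_gc_threshold:.3f}")  # 0.167
```
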
| Field | Value |
| --- | --- |
| Component | Process resources |
| Metric | `process_cpu_seconds_total` |
| Metric description | Cumulative user + system CPU time consumed by the process |
| Query | `irate(process_cpu_seconds_total{job=~"$service",instance=~"$instance",cluster_name=~"$cluster"}[5m])` |
| Query description | Per-second CPU usage of the process, computed from the two most recent samples in the 5-minute window |
| Trigger condition | CPU usage stays high for more than 5 minutes |
| Threshold / reference | > 80% of a single CPU core (about 0.8) |
| Meaning | The process is CPU bound or stuck in CPU-intensive logic |
| Notes | Adjust the threshold to the node's vCPU count; alert only when usage stays steadily near saturation |

### Cache - Hit Ratio

| Field | Value |
| --- | --- |
| Component | Cache - hit ratio |
| Metric | `alluxio_cached_data_read_bytes_total` and `alluxio_missed_data_read_bytes_total` |
| Metric description | How much of the data read was served from the cache and how much was fetched from the UFS |
| Query | `sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) / (sum(irate(alluxio_cached_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])) + sum(irate(alluxio_missed_data_read_bytes_total{job="worker",cluster_name=~"$cluster"}[5m])))` |
| Query description | Cache hit ratio over a 5-minute window |
| Trigger condition | Cache hit ratio stays low for more than 5 minutes |
| Threshold / reference | < 80% |
| Meaning | A large share of read traffic goes to the UFS; the cache is not being used effectively |
| Notes | Adjust the threshold to the workload (for example, 70–90%) |

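The hit-ratio query above reduces to `cached / (cached + missed)`. A minimal sketch of the same arithmetic, using made-up byte rates rather than real metric values:

```python
from typing import Optional


def cache_hit_ratio(cached_bytes_per_sec: float, missed_bytes_per_sec: float) -> Optional[float]:
    """Return cached / (cached + missed), or None when there is no read traffic."""
    total = cached_bytes_per_sec + missed_bytes_per_sec
    if total == 0:
        # With no reads the ratio is undefined, so the caller should not alert on it.
        return None
    return cached_bytes_per_sec / total


# Illustrative numbers: 80 MiB/s served from cache, 20 MiB/s fetched from the UFS.
ratio = cache_hit_ratio(80 * 1024**2, 20 * 1024**2)
print(ratio)         # 0.8
print(ratio < 0.80)  # False: exactly at the suggested threshold, so no alert
```

An idle cluster has no read traffic, so the ratio is undefined rather than low; an alert on this query should account for that case.
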
### Cache - Utilization

| Field | Value |
| --- | --- |
| Component | Cache - utilization |
| Metric | `alluxio_cached_storage_bytes` and `alluxio_cached_capacity_bytes` |
| Metric description | Cache space currently in use as a fraction of the configured total capacity |
| Query | `sum(alluxio_cached_storage_bytes{job="worker",cluster_name=~"$cluster"}) / sum(alluxio_cached_capacity_bytes{job="worker",cluster_name=~"$cluster"})` |
| Query description | Current cache usage / total cache capacity |
| Trigger condition | > 0.85 (warning) or > 0.95 (critical), for more than 5 minutes |
| Threshold / reference | 85–95% utilization |
| Meaning | The cache is nearly full and may face frequent evictions or write failures |
| Notes | Adjust the thresholds to the cluster size and workload pattern |

### Cache - Eviction Correlation

| Field | Value |
| --- | --- |
| Component | Cache - eviction and pressure correlation |
| Metric | `alluxio_cached_evicted_data_bytes_total`, `alluxio_block_store_used_bytes`, and `alluxio_block_store_capacity_bytes` |
| Metric description | Watches evicted bytes together with current cache utilization to judge whether the cache is under pressure |
| Query | `(sum(irate(alluxio_cached_evicted_data_bytes_total{job="worker"}[5m])) > 0) and ((sum(alluxio_block_store_used_bytes{job="worker"}) / sum(alluxio_block_store_capacity_bytes{job="worker"})) > 0.8)` |
| Query description | Checks whether evictions occur while cache utilization is above 80% |
| Trigger condition | Evictions > 0 while utilization > 80%, for more than 5 minutes |
| Threshold / reference | Utilization > 80% and evicted bytes > 0 |
| Meaning | Continuous eviction at high utilization indicates that the cache is thrashing or under heavy pressure |
| Notes | This panel must be added to the monitoring dashboard manually |

### FUSE - UFS Fallback

| Field | Value |
| --- | --- |
| Component | FUSE - UFS fallback |
| Metric | `alluxio_ufs_data_access_bytes_total` |
| Metric description | Tracks read traffic that FUSE Pods send directly to the UFS (bypassing the Alluxio cache) |
| Query | `irate(alluxio_ufs_data_access_bytes_total{job="fuse",method="read",cluster_name=~"$cluster"}[5m])` |
| Query description | Throughput of UFS reads issued by FUSE over a 5-minute window |
| Trigger condition | FUSE fallback read traffic to the UFS keeps rising or stays high |
| Threshold / reference | > 10 MiB/s sustained for more than 5 minutes |
| Meaning | FUSE clients frequently bypass the Alluxio cache and read the UFS directly; the fallback share is high |
| Notes | Watch together with the cache hit ratio and request rate; sustained fallback above 10–20 MiB/s usually deserves investigation |

### Read Throughput

| Field | Value |
| --- | --- |
| Component | Read throughput |
| Metric | `alluxio_data_throughput_bytes_total` |
| Metric description | Read throughput served by the workers |
| Query | `sum(irate(alluxio_data_throughput_bytes_total{job="worker",method="read",cluster_name=~"$cluster"}[5m]))` |
| Query description | Total worker read throughput over a 5-minute window |
| Trigger condition | Worker read throughput drops significantly |
| Threshold / reference | Below the chosen baseline (for example, < 10 MiB/s) while UFS read throughput rises |
| Meaning | The cache is not serving data effectively, and more traffic is hitting the UFS directly |
| Notes | Tune the threshold to the workload's normal access pattern |

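### Data - Read Request Rate

| Field | Value |
| --- | --- |
| Component | Data access |
| Metric | `alluxio_data_access_bytes_count{method="read"}` |
| Metric description | Number of read requests (read operations) handled by the workers |
| Query | `irate(alluxio_data_access_bytes_count{method="read",job="worker"}[5m])` |
| Query description | Read request rate (requests per second) over a 5-minute window |
| Trigger condition | The request rate drops to 0 during a period when traffic is expected |
| Threshold / reference | Near 0 for more than 5 minutes |
| Meaning | Workers are not serving data; a worker may have crashed or be unavailable, or the whole cache may be unavailable |
| Notes | Combine with the workload's scheduling windows to avoid false alarms (for example, normal idle time outside business hours) |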

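### License - Expiration

| Field | Value |
| --- | --- |
| Component | License - expiration |
| Metric | `alluxio_license_expiration_date` |
| Metric description | Expiration time of the Alluxio license (UNIX timestamp) |
| Query | `(max by (cluster_name) (alluxio_license_expiration_date) - time()) / 86400` |
| Query description | Subtracts the current time from the license expiration time to get the number of days until expiry |
| Trigger condition | < 30 (warning), < 7 (critical) |
| Threshold / reference | 30 days, 7 days |
| Meaning | The license is about to expire and should be renewed promptly to avoid service interruption |
| Notes | This panel must be added to the monitoring system manually |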

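The expiration query above divides the remaining seconds by 86400 to get days. A minimal sketch of the same calculation with the two suggested severity levels; the function names and the severity labels are illustrative, not Alluxio APIs:

```python
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(expiration_ts: float, now: Optional[float] = None) -> float:
    """Days between now and a UNIX expiration timestamp (negative once expired)."""
    if now is None:
        now = time.time()
    return (expiration_ts - now) / SECONDS_PER_DAY


def severity(days_left: float) -> str:
    """Map remaining days to the suggested alert levels (30-day and 7-day thresholds)."""
    if days_left < 7:
        return "critical"
    if days_left < 30:
        return "warning"
    return "ok"


# Fixed 'now' so the example is reproducible: a license expiring 10 days from t=0.
remaining = days_until_expiry(10 * SECONDS_PER_DAY, now=0)
print(remaining, severity(remaining))  # 10.0 warning
```
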
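### License - Version Mismatch

| Field | Value |
| --- | --- |
| Component | License - version mismatch |
| Metric | `alluxio_version_info` |
| Metric description | Exposes the running version of each Alluxio component through the `version` label |
| Query | `count(count by (version) (alluxio_version_info)) > 1` |
| Query description | Checks whether more than one distinct Alluxio version is currently running |
| Trigger condition | > 1 |
| Threshold / reference | More than 1 distinct version |
| Meaning | Alluxio components in the cluster are running inconsistent versions |
| Notes | This panel must be added to the monitoring system manually |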

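The nested `count(count by (version) (...))` in the query above counts distinct values of the `version` label. The sketch below shows the same logic over a list of label sets; the sample series and version strings are made up for illustration.

```python
def distinct_versions(series: list) -> set:
    """Collect distinct 'version' label values; a missing label counts as ''."""
    return {labels.get("version", "") for labels in series}


def has_version_mismatch(series: list) -> bool:
    """True when more than one distinct version is present, like the query's '> 1'."""
    return len(distinct_versions(series)) > 1


# Hypothetical label sets, as they might appear on alluxio_version_info series.
consistent = [
    {"job": "coordinator", "version": "A"},
    {"job": "worker", "version": "A"},
]
mixed = consistent + [{"job": "worker", "version": "B"}]

print(has_version_mismatch(consistent))  # False
print(has_version_mismatch(mixed))       # True
```
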
## Querying Metrics Directly

For advanced analysis or debugging, you can query Prometheus or the component endpoints directly.

{% tabs %}
{% tab title="Kubernetes (Operator)" %}
Open a shell in the Prometheus Pod:

```shell
kubectl -n alx-ns exec -it \
  $(kubectl -n alx-ns get pod -l app.kubernetes.io/component=prometheus --no-headers -o custom-columns=:metadata.name) \
  -- /bin/sh
```

Run instant queries with `promtool`:

```shell
# List all available Alluxio metrics
promtool query instant http://localhost:9090 'count({__name__=~".+"}) by (__name__)' | grep alluxio_

# Query the cache capacity reported by each worker
promtool query instant http://localhost:9090 'alluxio_cached_capacity_bytes'
# Example output:
# alluxio_cached_capacity_bytes{instance="worker:30000", job="worker"} => 10737418240 @[...]
```

Query the component endpoints directly from inside the Pods:

```shell
# Coordinator metrics
kubectl -n alx-ns exec alluxio-cluster-coordinator-0 -- curl -s http://localhost:19999/metrics/ | head -20

# Worker metrics
kubectl -n alx-ns exec alluxio-cluster-worker-0 -- curl -s http://localhost:30000/metrics/ | head -20
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}
Query through the Prometheus HTTP API (run on the Coordinator host):

```shell
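# Query the cache capacity of all workers
curl -s 'http://localhost:9090/api/v1/query?query=alluxio_cached_capacity_bytes' | python3 -m json.tool

# Query the number of online workers
# -G with --data-urlencode encodes the braces and quotes, which curl would otherwise treat as URL globbing
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(up{job="workers"})' | python3 -m json.tool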
```

Query the component endpoints directly:

```shell
# Coordinator metrics (run on the Coordinator host)
curl http://localhost:19999/metrics/

# Worker metrics (run on a Worker host)
curl http://localhost:30000/metrics/

# FUSE metrics (run on a FUSE host)
curl http://localhost:49999/metrics/
```

{% endtab %}
{% endtabs %}

For the full list of available metrics and their descriptions, see the [metrics reference](/ee-ai-cn/reference/metrics.md).

## Datadog Integration

Datadog can collect metrics directly from Alluxio's Prometheus endpoints.

1. Make sure the Datadog agent can reach the Alluxio metrics ports: `19999` (coordinator), `30000` (workers), and `49999` (FUSE).
2. Add the following configuration to `conf.d/prometheus.d/conf.yaml`:

```yaml
instances:
  - prometheus_url: http://<alluxio-coordinator-hostname>:19999/metrics
    namespace: alluxio
    metrics:
      - "*"
  - prometheus_url: http://<alluxio-worker-1-hostname>:30000/metrics
    namespace: alluxio
    metrics:
      - "*"
  # Add one entry per worker
```

This configuration instructs the Datadog agent to collect and report all Alluxio metrics.