# Monitoring and Metrics

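Metrics provide insight into what is happening inside the cluster. They are a valuable resource for monitoring and debugging. Alluxio has a configurable metrics system based on the [official Prometheus client library](https://github.com/prometheus/client_java). The metrics system exposes metrics in the Prometheus exposition format.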

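Alluxio's metrics are divided into instances that correspond to Alluxio components. The following instances are currently supported:

* Master: the Alluxio master process.
* Worker: the Alluxio worker process.
* FUSE process: the Alluxio FUSE process, whether it runs as a DaemonSet process or through CSI.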

## Usage

Send an HTTP request to `/metrics/` on the target Alluxio process to get a snapshot of all its metrics.

```shell
# Get the metrics from Alluxio processes
$ curl <MASTER_HOSTNAME>:<MASTER_WEB_PORT>/metrics/
$ curl <WORKER_HOSTNAME>:<WORKER_WEB_PORT>/metrics/
$ curl <FUSE_HOSTNAME>:<FUSE_WEB_PORT>/metrics/
```

For example, for processes running locally:

```shell
# Get the local master metrics with its default web port 19999
$ curl 127.0.0.1:19999/metrics/
# Get the local worker metrics with its default web port 30000
$ curl 127.0.0.1:30000/metrics/
# Get the local fuse metrics with its default web port 49999
$ curl 127.0.0.1:49999/metrics/
```
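
The snapshot is plain text in the Prometheus exposition format: lines starting with `#` carry `HELP` and `TYPE` metadata, and the remaining lines are samples. As a minimal sketch, the hypothetical helper `filter_metrics` below (not part of Alluxio) drops the metadata lines and keeps only the samples whose names begin with a given prefix. The `jvm_` prefix in the usage comment is an assumption about which metric names your processes expose.

```shell
# Hypothetical helper, not shipped with Alluxio: read a metrics snapshot on
# stdin, drop the "# HELP" / "# TYPE" metadata lines, and keep only samples
# whose metric name starts with the given prefix.
filter_metrics() {
  prefix="$1"
  grep -v '^#' | grep "^${prefix}"
}

# Usage (assumes a local master on its default web port 19999):
#   curl -s 127.0.0.1:19999/metrics/ | filter_metrics jvm_
```

Keeping the filter separate from the `curl` call lets the same helper be reused for master, worker, and FUSE snapshots.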

## Integration

### Prometheus

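Configure the Prometheus service to scrape the metrics using the sample `prometheus.yml` below. Note that the `job_name` values should not be changed if Grafana integration is needed.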

```yaml
global:
  scrape_interval: 60s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: [ 'localhost:9090' ]
  - job_name: "master"
    static_configs:
      - targets: [ '<MASTER_HOSTNAME>:<MASTER_WEB_PORT>' ]
  - job_name: "worker"
    static_configs:
      - targets: [ '<WORKER_HOSTNAME>:<WORKER_WEB_PORT>' ]
  - job_name: "fuse"
    static_configs:
      - targets: [ '<FUSE_HOSTNAME>:<FUSE_WEB_PORT>' ]
```
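
Once Prometheus is running with this configuration, one way to confirm that a job is being scraped is to query the built-in `up` series, which Prometheus records as `1` when the last scrape of a target succeeded and `0` otherwise. The sketch below wraps that query in a hypothetical helper, `check_job_up`. It assumes Prometheus is reachable at `localhost:9090`, as in the sample configuration, and uses the standard `/api/v1/query` endpoint of the Prometheus HTTP API.

```shell
# Hypothetical helper: ask Prometheus for the built-in `up` series of one job.
# A sample value of "1" in the JSON response means the last scrape succeeded.
check_job_up() {
  job="$1"
  prometheus="${2:-localhost:9090}"   # assumed address, taken from the sample config
  curl -s -G "http://${prometheus}/api/v1/query" \
    --data-urlencode "query=up{job=\"${job}\"}"
}

# Usage:
#   check_job_up master
#   check_job_up worker localhost:9090
```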

### Grafana

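Grafana is metrics analytics and visualization software for time series data. You can use it to better visualize the metrics that Alluxio collects. It makes it easier to see changes in memory, storage, and completed operations in Alluxio.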

Grafana can visualize data from Prometheus. The following steps help you build an Alluxio monitoring system on Grafana and Prometheus.

1. Download the Grafana template JSON file for Alluxio: [alluxio-dashboard-template.json](https://alluxio-binaries.s3.amazonaws.com/artifactsBundle/ee/AI-3.1-3.3.2/alluxio-dashboard-template.json)
2. Import the template JSON file to create a dashboard. See this [example](https://grafana.com/docs/grafana/latest/dashboards/export-import/#importing-a-dashboard) for how to import a dashboard.
3. Add the Prometheus data source to Grafana with a custom name, for example *prometheus-alluxio*. See the [tutorial](https://grafana.com/docs/grafana/latest/datasources/add-a-data-source/#add-a-data-source) for help adding a data source.

If your Grafana dashboard looks like the screenshot below, you have successfully built the monitoring system.

<figure><img src="/files/ZXEsZiP6sFePKKgZ5RO9" alt=""><figcaption></figcaption></figure>

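By default, only the *Cluster* row is expanded, showing a summary of the current status. The *Process* row shows resource consumption and JVM-related metrics, and can be filtered by service or instance at the top. The other rows show details of specific components and can be filtered by instance.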

### Kubernetes Operator

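The Operator supports building a cluster with built-in Prometheus and Grafana. The configuration and the Grafana template are already included. Just set the following switch in the `AlluxioCluster` configuration: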

```yaml
spec:
  alluxio-monitor:
    enabled: true
```

Grafana exposes its service on port 8080 of its host. Use kubectl to get the hostname:

```shell
kubectl get pod $(kubectl get pod -l name=alluxio-monitor-grafana --no-headers -o custom-columns=:metadata.name) -o jsonpath='{.spec.nodeName}'
```

Assuming the hostname is `foo.kubernetes.org`, you can access the Grafana service at:

```
http://foo.kubernetes.org:8080/
```
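
To check from a terminal that the service is reachable before opening a browser, you can call Grafana's health endpoint. The sketch below is a hypothetical helper, `grafana_health`. It assumes the port 8080 described above and relies on the `/api/health` endpoint of Grafana's HTTP API. The hostname in the usage comment is the example value from this page.

```shell
# Hypothetical helper: call Grafana's health endpoint on the given host.
# Port 8080 is the port on which the Operator-deployed Grafana is exposed.
grafana_health() {
  host="$1"
  curl -s "http://${host}:8080/api/health"
}

# Usage:
#   grafana_health foo.kubernetes.org
```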

### Prometheus in Kubernetes

Add the following snippet to the Prometheus configuration. It makes Prometheus scrape pods in Kubernetes that carry specific labels and annotations.

```yaml
scrape_configs:
  - job_name: 'prometheus'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
    tls_config:
      insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: keep
        regex: (?:alluxio-monitor-prometheus)
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        action: keep
        regex: prometheus
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod_name
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: replace
        target_label: cluster_name

  - job_name: 'master'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
    tls_config:
      insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: keep
        regex: (?:alluxio)
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        action: keep
        regex: master
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod_name
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: replace
        target_label: cluster_name

  - job_name: 'worker'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
    tls_config:
      insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: keep
        regex: (?:alluxio)
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        action: keep
        regex: worker
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod_name
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: replace
        target_label: cluster_name

  - job_name: 'fuse'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
    tls_config:
      insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: keep
        regex: (?:alluxio)
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        action: keep
        regex: (csi-)?fuse
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod_name
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: replace
        target_label: cluster_name
```

Note that the `job_name` values in `scrape_configs` must stay unchanged, because the dashboard uses them as filters.

The following metadata is required on the pods:

```yaml
labels:
  app.kubernetes.io/instance: alluxio # distinguishes one Alluxio cluster from another
  app.kubernetes.io/component: worker # depends on the pod; an operator deployment uses master, worker, fuse, or csi-fuse
annotations:
  prometheus.io/scrape: "true"
  # must match the web port of the component: by default 19999 for master, 30000 for worker, and 49999 for fuse
  prometheus.io/port: "30000"
  prometheus.io/path: "/metrics/"
```
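
To see which pods actually carry this metadata, you can list the relevant label and annotations with kubectl. The sketch below is a hypothetical helper, `show_scrape_metadata`. It assumes the `default` namespace and the `alluxio` instance name used in the examples on this page, and prints one tab-separated line per pod.

```shell
# Hypothetical helper: list the pods of one Alluxio cluster together with the
# component label and the prometheus.io annotations the scrape jobs rely on.
# Output columns: pod name, component, scrape, port, path.
show_scrape_metadata() {
  instance="${1:-alluxio}"   # value of the app.kubernetes.io/instance label
  kubectl get pods -n default -l "app.kubernetes.io/instance=${instance}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.app\.kubernetes\.io/component}{"\t"}{.metadata.annotations.prometheus\.io/scrape}{"\t"}{.metadata.annotations.prometheus\.io/port}{"\t"}{.metadata.annotations.prometheus\.io/path}{"\n"}{end}'
}

# Usage:
#   show_scrape_metadata            # pods labeled with instance "alluxio"
#   show_scrape_metadata my-alluxio # a differently named cluster
```

An empty column in the output indicates a missing label or annotation on that pod.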

