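# Monitoring and Metrics

Metrics provide insight into what is happening inside the cluster and are a valuable resource for monitoring and debugging. Alluxio has a configurable metrics system based on the [official Prometheus client library](https://github.com/prometheus/client_java). The metrics system exposes metrics in the Prometheus exposition format.

Alluxio metrics are partitioned into instances that correspond to Alluxio components. The following instances are currently supported:

* Coordinator: the Alluxio coordinator process.
* Worker: the Alluxio worker process.

## Usage

Send an HTTP request to `/metrics/` on the target Alluxio process to get a snapshot of all metrics.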

```shell
# Get the metrics from Alluxio processes
$ curl <COORDINATOR_HOSTNAME>:<COORDINATOR_WEB_PORT>/metrics/
$ curl <WORKER_HOSTNAME>:<WORKER_WEB_PORT>/metrics/
```

For example, for local processes:

```shell
# Get the local coordinator metrics with its default web port 19999
$ curl 127.0.0.1:19999/metrics/
# Get the local worker metrics with its default web port 30000
$ curl 127.0.0.1:30000/metrics/
```
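The snapshot returned by `/metrics/` is plain text in the Prometheus exposition format, so it can also be read programmatically. The following is a minimal sketch, assuming the endpoint is reachable over plain HTTP; the function names `fetch_metrics` and `parse_metrics` are hypothetical helpers, not part of Alluxio.

```python
import urllib.request
from typing import Dict


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Return the raw text served by a /metrics/ endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each sample (metric name plus labels) to its value.

    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    An optional trailing timestamp is ignored.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        brace = line.rfind("}")
        if brace != -1:
            # Label values may contain spaces, so split after the closing brace.
            key, rest = line[: brace + 1], line[brace + 1:].split()
        else:
            fields = line.split()
            key, rest = fields[0], fields[1:]
        if rest:
            samples[key] = float(rest[0])
    return samples


# Example (requires a running local coordinator on the default web port):
# metrics = parse_metrics(fetch_metrics("http://127.0.0.1:19999/metrics/"))
```

For anything beyond a quick check, a maintained parser such as the one shipped with the Prometheus Python client is likely a better choice than this hand-rolled one.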

The [Metrics](/ee-da-cn/reference/metrics.md) page describes the individual metrics in more detail.

## Integrations

### Prometheus

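Configure the Prometheus service to scrape the relevant metrics using the sample `prometheus.yml` below. Note that `job_name` must not be changed if you need the Grafana integration.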

```yaml
global:
  scrape_interval: 60s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: [ 'localhost:9090' ]
  - job_name: "coordinator"
    static_configs:
      - targets: [ '<COORDINATOR_HOSTNAME>:<COORDINATOR_WEB_PORT>' ]
  - job_name: "worker"
    static_configs:
      - targets: [ '<WORKER_HOSTNAME>:<WORKER_WEB_PORT>' ]
```
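When a cluster has several workers, the static configuration above can be generated rather than edited by hand. The following is a minimal sketch, assuming the host and port of every process is known ahead of time; `render_prometheus_yml` and the hostnames in the usage comment are hypothetical. The job names are kept exactly as in the sample because the dashboard relies on them.

```python
from typing import List


def _job(name: str, targets: List[str]) -> str:
    """Render one scrape job with a static target list."""
    quoted = ", ".join(f"'{t}'" for t in targets)
    return (
        f'  - job_name: "{name}"\n'
        "    static_configs:\n"
        f"      - targets: [ {quoted} ]\n"
    )


def render_prometheus_yml(
    coordinators: List[str],
    workers: List[str],
    scrape_interval: str = "60s",
) -> str:
    """Return a prometheus.yml body with the same three jobs as the sample."""
    return (
        "global:\n"
        f"  scrape_interval: {scrape_interval}\n"
        "\n"
        "scrape_configs:\n"
        + _job("prometheus", ["localhost:9090"])
        + _job("coordinator", coordinators)
        + _job("worker", workers)
    )


# Example with made-up hostnames:
# print(render_prometheus_yml(["coord-0:19999"], ["worker-0:30000", "worker-1:30000"]))
```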

### Grafana

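Grafana is a metrics analytics and visualization tool for time series data. You can use it to visualize the metrics that Alluxio collects, which makes it easier to see changes in memory, storage, and completed operations in Alluxio.

Grafana supports visualizing data from Prometheus. The following steps help you build an Alluxio monitoring system based on Grafana and Prometheus.

1. Download the Grafana template JSON file for Alluxio: [alluxio-dashboard-template.json](https://alluxio-binaries.s3.amazonaws.com/artifactsBundle/ee/AI-3.5-10.0.0/alluxio-dashboard-template.json)
2. Import the template JSON file to create a dashboard. See this [example](https://grafana.com/docs/grafana/latest/dashboards/export-import/#importing-a-dashboard) for importing a dashboard.
3. Add the Prometheus data source to Grafana with a custom name, for example *prometheus-alluxio*. See this [tutorial](https://grafana.com/docs/grafana/latest/datasources/add-a-data-source/#add-a-data-source) for help with adding a data source.

If your Grafana dashboard looks like the screenshot below, you have built the monitoring system successfully.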

<figure><img src="/files/6usXdpnzhbTXtCm769YA" alt=""><figcaption></figcaption></figure>

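By default, only the *Cluster* row is expanded, showing a summary of the current status. The *Process* row shows resource consumption and JVM-related metrics, and can be filtered by service or instance at the top. The other rows show details of specific components and can be filtered by instance.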

### Kubernetes Operator

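The operator supports building a cluster with a bundled Prometheus and Grafana. The configuration and the Grafana template are already included. Just set the following switch in the `AlluxioCluster` configuration: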

```yaml
spec:
  alluxio-monitor:
    enabled: true
```

#### Accessing Grafana via the node hostname

Grafana exposes its service on port 8080 of its host. Use kubectl to get the hostname:

```shell
kubectl get pod $(kubectl get pod -l app.kubernetes.io/component=grafana --no-headers -o custom-columns=:metadata.name) -o jsonpath='{.spec.nodeName}'
```

Assuming the hostname is `foo.kubernetes.org`, you can access the Grafana service at:

```
http://foo.kubernetes.org:8080/
```

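#### Accessing Grafana via port forwarding

If network restrictions prevent you from reaching Grafana directly through the node hostname, you can use port forwarding to map Grafana's port to your local machine and access it through a local port.

Run the `kubectl port-forward` command: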

```console
kubectl port-forward $(kubectl get pod -l app.kubernetes.io/component=grafana -o jsonpath="{.items[0].metadata.name}") 3000:3000
```

You can then access the Grafana service locally at:

```
http://localhost:3000
```

### Prometheus in Kubernetes

Add the following snippet to the Prometheus configuration. It makes Prometheus scrape the Kubernetes pods that carry specific labels and annotations.

```yaml
scrape_configs:
  - job_name: 'prometheus'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
    tls_config:
      insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: keep
        regex: (?:alluxio)
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        action: keep
        regex: prometheus
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod_name
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: replace
        target_label: cluster_name

  - job_name: 'coordinator'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
    tls_config:
      insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: keep
        regex: (?:alluxio)
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        action: keep
        regex: coordinator
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod_name
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: replace
        target_label: cluster_name

  - job_name: 'worker'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
    tls_config:
      insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: keep
        regex: (?:alluxio)
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        action: keep
        regex: worker
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod_name
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: replace
        target_label: cluster_name

```

Note that the `job_name` values in `scrape_configs` must stay unchanged, because they are used as filters in the dashboard.
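The `__address__` rule in the configuration above is the least obvious one: Prometheus joins the source label values with `;`, matches the fully anchored regex against the joined string, and writes the replacement into the target label. The following sketch reproduces that behavior for illustration only; `rewrite_address` is a hypothetical name, and Python's `re` stands in for the RE2 engine that Prometheus uses.

```python
import re

# Same pattern as the relabel rule: host, optional existing port, ';', annotated port.
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Return the scrape address after the `replace` rule is applied.

    `address` is the discovered __address__ value and `port_annotation`
    is the value of the prometheus.io/port annotation ("" if it is absent).
    """
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RULE.fullmatch(joined)  # relabel regexes are fully anchored
    if match is None:
        # No match means the rule leaves the label untouched.
        return address
    return match.expand(r"\1:\2")  # equivalent of replacement "$1:$2"


print(rewrite_address("10.0.0.5:8080", "30000"))  # 10.0.0.5:30000
print(rewrite_address("10.0.0.5", "19999"))       # 10.0.0.5:19999
print(rewrite_address("10.0.0.5:8080", ""))       # 10.0.0.5:8080
```

In other words, whatever port service discovery found is replaced by the port from the `prometheus.io/port` annotation, and a pod without that annotation keeps its discovered address.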

The following metadata is required on the pods:

```yaml
labels:
  app.kubernetes.io/instance: alluxio # used to distinguish between different Alluxio clusters
  app.kubernetes.io/component: worker # depends on the pod; values from the operator deployment are coordinator and worker
annotations:
  prometheus.io/scrape: "true"
  # the value should match the port of the component; by default it is 19999 for the coordinator and 30000 for the worker
  prometheus.io/port: "30000"
  prometheus.io/path: "/metrics/"
```
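To see why this metadata matters, it helps to trace how pod labels and annotations become the `__meta_kubernetes_pod_*` labels that the `keep` rules test. Prometheus converts characters that are not valid in a label name (such as `.` and `/`) to underscores. The sketch below mimics that mapping and the three `keep` rules of the `worker` job; the helper names are hypothetical and Python's `re` is used in place of RE2.

```python
import re
from typing import Dict, List, Tuple


def meta_labels(labels: Dict[str, str], annotations: Dict[str, str]) -> Dict[str, str]:
    """Build the discovery labels Prometheus derives from pod metadata."""
    def sanitize(name: str) -> str:
        return re.sub(r"[^a-zA-Z0-9_]", "_", name)

    meta = {}
    for key, value in labels.items():
        meta["__meta_kubernetes_pod_label_" + sanitize(key)] = value
    for key, value in annotations.items():
        meta["__meta_kubernetes_pod_annotation_" + sanitize(key)] = value
    return meta


# The three `keep` rules of the worker job, as (source label, regex) pairs.
WORKER_KEEP_RULES: List[Tuple[str, str]] = [
    ("__meta_kubernetes_pod_label_app_kubernetes_io_instance", r"(?:alluxio)"),
    ("__meta_kubernetes_pod_label_app_kubernetes_io_component", r"worker"),
    ("__meta_kubernetes_pod_annotation_prometheus_io_scrape", r"true"),
]


def is_kept(meta: Dict[str, str], rules: List[Tuple[str, str]]) -> bool:
    """A target survives only if every keep rule fully matches its label value."""
    return all(re.fullmatch(rx, meta.get(label, "")) is not None for label, rx in rules)


pod = meta_labels(
    {"app.kubernetes.io/instance": "alluxio", "app.kubernetes.io/component": "worker"},
    {"prometheus.io/scrape": "true", "prometheus.io/port": "30000", "prometheus.io/path": "/metrics/"},
)
print(is_kept(pod, WORKER_KEEP_RULES))  # True
```

A pod that lacks the `prometheus.io/scrape` annotation, or whose component label is not `worker`, is dropped by this job before any scrape is attempted.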

### Datadog

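Alluxio exports metrics in the Prometheus format, which allows Datadog to integrate with Alluxio directly.

1. Make sure Datadog can reach the ports listed in the [Prometheus integration](#Prometheus).
2. Add multiple `prometheus_url` entries under the `instances` field in the Datadog configuration file.

The following configuration snippet collects metrics from multiple components: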

```yaml
instances:
  - prometheus_url: http://<alluxio-coordinator-instance>:19999/metrics
  - prometheus_url: http://<alluxio-worker-1-instance>:30000/metrics
  - prometheus_url: http://<alluxio-worker-2-instance>:30000/metrics
  ...
```

With these steps, Datadog can collect and monitor Alluxio metrics, giving you in-depth analysis and monitoring of the performance and health of your Alluxio cluster.

