# Monitoring and Metrics

Metrics provide insight into what is happening inside the cluster. They are a valuable resource for monitoring and debugging. Alluxio ships with a configurable metrics system based on the [official Prometheus metrics library](https://github.com/prometheus/client_java). The metrics system exposes metrics in the Prometheus exposition format.

Alluxio's metrics are partitioned into different instances corresponding to Alluxio components. The following instances are currently supported:

* Coordinator: the Alluxio coordinator process.
* Worker: the Alluxio worker process.
* FUSE: the Alluxio FUSE process, whether running as a DaemonSet or via CSI.

## Usage

Send an HTTP request to `/metrics/` on the target Alluxio process to get a snapshot of all its metrics:

```shell
# Get the metrics from Alluxio processes
$ curl <COORDINATOR_HOSTNAME>:<COORDINATOR_WEB_PORT>/metrics/
$ curl <WORKER_HOSTNAME>:<WORKER_WEB_PORT>/metrics/
$ curl <FUSE_HOSTNAME>:<FUSE_WEB_PORT>/metrics/
```

For example, for local processes:

```shell
# Get the local coordinator metrics with its default web port 19999
$ curl 127.0.0.1:19999/metrics/
# Get the local worker metrics with its default web port 30000
$ curl 127.0.0.1:30000/metrics/
# Get the local fuse metrics with its default web port 49999
$ curl 127.0.0.1:49999/metrics/
```
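The endpoint returns metrics in the Prometheus text exposition format, so standard text tools can be used to inspect a scrape. Below is a minimal sketch; the metric name and value are illustrative samples, not actual Alluxio metric names:

```shell
# Illustrative only: a sample payload shaped like the Prometheus text
# exposition format that /metrics/ returns (metric name is hypothetical).
sample='# HELP jvm_memory_bytes_used Used bytes of a given JVM memory area.
# TYPE jvm_memory_bytes_used gauge
jvm_memory_bytes_used{area="heap"} 1.23E8'

# Filter one metric family; against a live process this would be e.g.:
#   curl -s <WORKER_HOSTNAME>:<WORKER_WEB_PORT>/metrics/ | grep '^jvm_memory'
echo "$sample" | grep '^jvm_memory'
```

The `# HELP` and `# TYPE` comment lines describe each metric family, while the remaining lines carry the metric name, labels, and current value.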

## Integrations

### Prometheus

Configure the Prometheus server to scrape the relevant metrics using the example `prometheus.yml` below. Note that the `job_name` values should not be changed if Grafana integration is desired.

```yaml
global:
  scrape_interval: 60s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: [ 'localhost:9090' ]
  - job_name: "coordinator"
    static_configs:
      - targets: [ '<COORDINATOR_HOSTNAME>:<COORDINATOR_WEB_PORT>' ]
  - job_name: "worker"
    static_configs:
      - targets: [ '<WORKER_HOSTNAME>:<WORKER_WEB_PORT>' ]
  - job_name: "fuse"
    static_configs:
      - targets: [ '<FUSE_HOSTNAME>:<FUSE_WEB_PORT>' ]
```

### Grafana

Grafana is a metrics analytics and visualization tool for time-series data. You can use Grafana to better visualize the various metrics that Alluxio collects. It makes it easier to see changes in memory, storage, and completed operations in Alluxio.

Grafana supports visualizing data from Prometheus. The following steps will help you build an Alluxio monitoring system based on Grafana and Prometheus.

1. Download the Grafana template JSON file for Alluxio: [alluxio-dashboard-template.json](https://alluxio-binaries.s3.amazonaws.com/artifactsBundle/ee/AI-3.1-3.3.2/alluxio-dashboard-template.json)
2. Import the template JSON file to create a dashboard. See this [example](https://grafana.com/docs/grafana/latest/dashboards/export-import/#importing-a-dashboard) for importing a dashboard.
3. Add the Prometheus data source to Grafana with a custom name, for example *prometheus-alluxio*. See the [tutorial](https://grafana.com/docs/grafana/latest/datasources/add-a-data-source/#add-a-data-source) for help adding a data source.
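
Instead of adding the data source through the UI, Grafana can also provision it from a file in its provisioning directory. A minimal sketch of such a file is shown below; the Prometheus URL is an assumption for a local setup, and the data source name should match the one the dashboard expects:

```yaml
# e.g. /etc/grafana/provisioning/datasources/alluxio.yaml
apiVersion: 1
datasources:
  - name: prometheus-alluxio   # custom name, matching the dashboard's data source
    type: prometheus
    access: proxy
    url: http://localhost:9090 # assumed address of the local Prometheus server
```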

If your Grafana dashboard looks like the screenshot below, you have successfully built your monitoring system.

<figure><img src="https://389466660-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FnORAZCwFKk0vcsnYaCTW%2Fuploads%2Fgit-blob-f08ccc4154c11f5c1871e9ba42541a23eb6c4551%2Fscreenshot_grafana_webui.png?alt=media" alt=""><figcaption></figcaption></figure>

By default, only the *Cluster* row is expanded, showing a summary of the current status. The *Process* row shows resource consumption and JVM-related metrics, which can be filtered by service or instance at the top. The remaining rows show details of specific components and can be filtered by instance.

### Kubernetes Operator

The operator supports deploying the cluster with built-in Prometheus and Grafana. The configuration and Grafana template are already included. Simply set the following switch in the `AlluxioCluster` configuration:

```yaml
spec:
  alluxio-monitor:
    enabled: true
```

#### Accessing Grafana via the node hostname

Grafana exposes its service on port 8080 of its host. Use kubectl to get the hostname:

```shell
kubectl get pod $(kubectl get pod -l name=alluxio-monitor-grafana --no-headers -o custom-columns=:metadata.name) -o jsonpath='{.spec.nodeName}'
```

Assuming the hostname is `foo.kubernetes.org`, you can access the Grafana service at:

```
http://foo.kubernetes.org:8080/
```

#### Accessing Grafana via port forwarding

If Grafana cannot be reached directly via the node hostname due to network restrictions, you can use port forwarding to map Grafana's port to a local port and access it locally.

Run the `kubectl port-forward` command to forward the port:

```shell
kubectl port-forward $(kubectl get pod -l app.kubernetes.io/component=grafana -o jsonpath="{.items[0].metadata.name}") 3000:3000
```

You can then access the Grafana service locally at:

```
http://localhost:3000
```

### Prometheus in Kubernetes

Add the following snippet to the Prometheus configuration. It makes Prometheus scrape Kubernetes pods that carry specific annotations.

```yaml
scrape_configs:
  - job_name: 'prometheus'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
    tls_config:
      insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: keep
        regex: (?:alluxio-monitor-prometheus)
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        action: keep
        regex: prometheus
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod_name
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: replace
        target_label: cluster_name

  - job_name: 'coordinator'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
    tls_config:
      insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: keep
        regex: (?:alluxio)
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        action: keep
        regex: coordinator
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod_name
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: replace
        target_label: cluster_name

  - job_name: 'worker'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
    tls_config:
      insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: keep
        regex: (?:alluxio)
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        action: keep
        regex: worker
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod_name
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: replace
        target_label: cluster_name

  - job_name: 'fuse'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
    tls_config:
      insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: keep
        regex: (?:alluxio)
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        action: keep
        regex: (csi-)?fuse
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod_name
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
        action: replace
        target_label: cluster_name
```

Note that the `job_name` values in `scrape_configs` must remain unchanged, because they are used as filters in the dashboard.

The following metadata is required on the pods:

```yaml
labels:
  app.kubernetes.io/instance: alluxio # used to distinguish different Alluxio clusters
  app.kubernetes.io/component: worker # one of coordinator, worker, fuse, or csi-fuse from the operator deployment, depending on the pod
annotations:
  prometheus.io/scrape: "true"
  # must match the component's port. By default, it is 19999 for the coordinator, 30000 for the worker, and 49999 for FUSE
  prometheus.io/port: "30000"
  prometheus.io/path: "/metrics/"
```

### Datadog

Alluxio exports metrics in Prometheus format, which allows Datadog to integrate with Alluxio directly.

1. Ensure Datadog can access the ports listed in the [Prometheus integration](#prometheus) section
2. Add multiple `prometheus_url` entries under the `instances` field in the Datadog configuration file

Here is a configuration snippet that collects metrics from multiple components:

```yaml
instances:
  - prometheus_url: http://<alluxio-coordinator-instance>:19999/metrics
  - prometheus_url: http://<alluxio-worker-1-instance>:30000/metrics
  - prometheus_url: http://<alluxio-worker-2-instance>:30000/metrics
  ...
```
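
Datadog's Prometheus check typically also expects a metric namespace and a list of metrics to collect per instance. A sketch of a fuller instance entry is shown below; the namespace and metric pattern are assumptions, so consult Datadog's Prometheus/OpenMetrics integration documentation for the exact schema of your Agent version:

```yaml
instances:
  - prometheus_url: http://<alluxio-coordinator-instance>:19999/metrics
    namespace: alluxio # assumed prefix applied to metric names in Datadog
    metrics:
      - "*"            # collect all exposed metrics; narrow this list in production
```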

By following the steps above, Datadog can seamlessly collect and monitor Alluxio metrics, providing deep analysis and comprehensive monitoring of your Alluxio cluster's performance and health.
