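# Cluster Operations

This section gives a high-level overview of administering an Alluxio cluster, covering key areas from day-to-day management and monitoring to security and troubleshooting.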

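## 1. Managing the Cluster

Cluster management is split across several focused pages. Use the "Question / Task" column in the table below to find the matching page.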

| Question / Task | Page |
| --- | --- |
| Harden a basic installation into a production deployment (node pinning, HA, resource tuning) | [Cluster Management: Production Deployment Configuration](https://documentation.alluxio.io/ee-ai-cn/pages/cuG1KYD7QZM6zV5EoHQ1#id-1.-sheng-chan-bu-shu-pei-zhi) |
| Scale the cluster out or in | [Cluster Management: Scaling the Cluster](/ee-ai-cn/administration/managing-alluxio.md#kuo-zhan-ji-qun) |
| Upgrade Alluxio to a new version | [Cluster Management: Upgrading](/ee-ai-cn/administration/managing-alluxio.md#sheng-ji-alluxio) |
| Modify properties on a running cluster | [Cluster Management: Dynamic Configuration Updates](/ee-ai-cn/administration/managing-alluxio.md#dong-tai-geng-xin-pei-zhi) |
| Tune the consistent hash ring (mode, virtual nodes, capacity) | [Hash Ring and Worker Lifecycle: Configuration](https://documentation.alluxio.io/ee-ai-cn/pages/sUN6vEZj9iQUjo7KenSh#id-2.-ha-xi-huan-pei-zhi) |
| Diagnose stale `OFFLINE` Workers / hash ring bloat | [Hash Ring: Diagnosing Hash Ring Bloat](/ee-ai-cn/administration/managing-ring.md#zhen-duan-offline-tiao-mu-dao-zhi-de-ha-xi-huan-peng-zhang) |
| Add, remove, or restart Workers, or persist Worker identity | [Hash Ring: Worker Lifecycle on the Ring](https://documentation.alluxio.io/ee-ai-cn/pages/sUN6vEZj9iQUjo7KenSh#id-3.-worker-zai-huan-shang-de-sheng-ming-zhou-qi) |
| Configure the Worker page store (hostPath / PVC, capacity, multiple disks) | [Worker Configuration: Worker Storage](https://documentation.alluxio.io/ee-ai-cn/pages/c4TWyXEiWHUuqlML94ZH#id-1.-worker-cun-chu) |
| Configure heterogeneous Workers | [Worker Configuration: Heterogeneous Workers](/ee-ai-cn/administration/managing-worker.md#yi-gou-worker) |
| Worker resource / JVM tuning, OOM diagnosis | [Worker Configuration: Resource and JVM Tuning](https://documentation.alluxio.io/ee-ai-cn/pages/c4TWyXEiWHUuqlML94ZH#id-2.-zi-yuan-yu-jvm-diao-you) |
| Bind a Worker to a specific network interface | [Worker Configuration: Network](https://documentation.alluxio.io/ee-ai-cn/pages/c4TWyXEiWHUuqlML94ZH#id-3.-worker-wang-luo-pei-zhi) |
| Restore cache coverage after a Worker crash or restart | [Hash Ring: Cache Recovery After a Worker Restart](/ee-ai-cn/administration/managing-ring.md#worker-zhong-qi-hou-de-huan-cun-hui-fu) |
| Rebalance cached data after adding Workers | [Cluster Management: Scale-Out Step 4](/ee-ai-cn/administration/managing-alluxio.md#fang-an-b-job-rebalance-zhu-dong-jun-heng) |
| Job submission, scheduling, lifecycle, HA, recovery | [Job Service](/ee-ai-cn/administration/managing-job-service.md) |
| Tune load throughput for small files or large directories | [Job Service: Small-File Load Optimization](/ee-ai-cn/administration/managing-job-service.md#xiao-wen-jian-jia-zai-you-hua) |
| Multi-tenant isolation and cluster federation | [Cluster Management: Multi-Tenancy](https://documentation.alluxio.io/ee-ai-cn/pages/cuG1KYD7QZM6zV5EoHQ1#id-3.-duo-zu-hu-he-lian-bang) |

## 2. Monitoring and Observability

Alluxio exposes a broad set of metrics in Prometheus format, giving deep insight into cluster health and performance.

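* **Default monitoring stack**: The Alluxio Operator can automatically deploy a complete monitoring stack, including Prometheus for metrics collection and Grafana for visualization, with preconfigured dashboards.
* **Integration with existing systems**: You can integrate Alluxio with your existing monitoring infrastructure, whether that is a central Prometheus, Grafana, or a third-party service such as Datadog.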
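
As a rough illustration of what consuming Prometheus-format metrics looks like, the sketch below parses a small sample of the Prometheus text exposition format and sums one counter across its label sets. The metric names and label values are hypothetical placeholders, not actual Alluxio metric names, and the parser is deliberately simplified (it assumes no trailing timestamps on sample lines).

```python
from typing import Dict, Tuple

# Hypothetical sample in the Prometheus text exposition format.
# The metric names below are placeholders, NOT real Alluxio metric names.
SAMPLE = """\
# HELP example_cache_hits_total Hypothetical cache hit counter.
# TYPE example_cache_hits_total counter
example_cache_hits_total{worker="w1"} 120
example_cache_hits_total{worker="w2"} 80
example_cache_used_bytes 1048576
"""


def parse_metrics(text: str) -> Dict[Tuple[str, str], float]:
    """Map (metric name, raw label string) to the sample value.

    Simplification: assumes each sample line is `series value` with no
    optional timestamp after the value.
    """
    samples: Dict[Tuple[str, str], float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        samples[(name, labels.rstrip("}"))] = float(value)
    return samples


def total(samples: Dict[Tuple[str, str], float], name: str) -> float:
    """Sum a metric across all of its label sets."""
    return sum(v for (n, _), v in samples.items() if n == name)


metrics = parse_metrics(SAMPLE)
print(total(metrics, "example_cache_hits_total"))
```

In practice a Prometheus server or an existing client library would handle scraping and parsing; the sketch only shows the shape of the data.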

Learn more about [monitoring Alluxio](/ee-ai-cn/administration/monitoring-alluxio.md).

## 3. Security

Alluxio provides a multi-layered security model to protect your data and infrastructure.

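* **Authentication**: Secure your cluster by integrating with an OIDC-compliant identity provider (such as Okta) and authenticating users and services with JSON Web Tokens (JWT).
* **Authorization**: Enforce fine-grained access control. Use **Apache Ranger** for data access policies (S3, HDFS) and **Open Policy Agent (OPA)** for management API policies (Gateway).
* **Encryption**: Protect data in transit by enabling TLS to encrypt communication between Alluxio components and between clients and the cluster.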
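
To make the JWT-based authentication above more concrete, the following minimal sketch shows how the claims inside a JWT can be inspected using only the Python standard library. It builds an unsigned demo token with hypothetical claims (`sub`, `iss`) and a placeholder signature; it does not verify signatures and does not reflect how Alluxio itself validates tokens.

```python
import base64
import json


def b64url_encode(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments are encoded."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the stripped padding."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def read_claims(token: str) -> dict:
    """Return the payload claims of a JWT WITHOUT verifying its signature.

    Useful for debugging only; never trust unverified claims.
    """
    _header_b64, payload_b64, _signature = token.split(".")
    return json.loads(b64url_decode(payload_b64))


# Demo token with hypothetical claims and a placeholder signature.
header = b64url_encode(json.dumps({"alg": "RS256", "typ": "JWT"}).encode())
payload = b64url_encode(
    json.dumps({"sub": "alice", "iss": "https://idp.example.com"}).encode()
)
token = f"{header}.{payload}.placeholder-signature"

claims = read_claims(token)
print(claims["sub"], claims["iss"])
```

A real deployment would verify the signature against the identity provider's published keys before trusting any claim.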

Learn more about [security](/ee-ai-cn/administration/security.md).

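## 4. Audit and Access Logs

Alluxio records cluster activity through two complementary kinds of logs, supporting security auditing, compliance, and cache observability.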

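* **Audit logs**: Record management operations (Gateway) and data access (S3, HDFS, FUSE, Python SDK) as structured JSON.
* **Access logs**: Record Worker cache lifecycle events (`LOAD`, `HOT_READ`, `COLD_READ`, `EVICT`, `DELETE`), deduplicated, to support governance and cache analysis.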
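
As one example of the cache analysis these logs can support, the sketch below counts cache lifecycle events by type and derives a hot-read ratio. It assumes, purely for illustration, that each log record is a JSON object on its own line with hypothetical `event` and `path` fields; the actual log schema may differ.

```python
import json
from collections import Counter

# Hypothetical log sample. The event names come from the list above;
# the JSON-lines layout and the "event"/"path" field names are assumptions.
SAMPLE_LOG = """\
{"event": "LOAD", "path": "/data/a"}
{"event": "COLD_READ", "path": "/data/a"}
{"event": "HOT_READ", "path": "/data/a"}
{"event": "HOT_READ", "path": "/data/a"}
{"event": "EVICT", "path": "/data/a"}
"""


def count_events(text: str) -> Counter:
    """Count log records per event type, skipping blank lines."""
    counts: Counter = Counter()
    for line in text.splitlines():
        if line.strip():
            counts[json.loads(line)["event"]] += 1
    return counts


counts = count_events(SAMPLE_LOG)
reads = counts["HOT_READ"] + counts["COLD_READ"]
# Share of reads served from cache; 0.0 when there are no reads at all.
hot_ratio = counts["HOT_READ"] / reads if reads else 0.0
print(dict(counts), round(hot_ratio, 2))
```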

Learn more about [audit and access logs](/ee-ai-cn/administration/audit-access-logs.md).

## 5. Troubleshooting

When something goes wrong, Alluxio provides tools and procedures to help you diagnose and resolve the problem quickly.

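* **Health checks**: Start by checking the status of the Alluxio components (Coordinator, Worker, FUSE) and verifying connectivity to the UFS.
* **Diagnostics**: Inspect logs from the Alluxio processes and the Kubernetes CSI driver. For complex issues, generate a comprehensive diagnostic snapshot that bundles logs, configuration, and metrics for offline analysis.
* **Recovery**: Follow guided procedures to recover from common failures, such as Coordinator or Worker failures or etcd cluster corruption.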

Learn more about [troubleshooting Alluxio](/ee-ai-cn/administration/troubleshooting-alluxio.md).

