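# Cluster Management

This document describes how to perform administrative operations on an Alluxio cluster running on Kubernetes, such as upgrading to a new version and adding new workers.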

## Dynamically update the Alluxio configuration in a running cluster

1. Get the ConfigMap:

```
$ kubectl -n alx-ns get configmap
NAME                              DATA   AGE
alluxio-cluster-alluxio-conf      4      7m48s
...
```

2. Run the `edit configmap` command to update the Alluxio configuration:

```
$ kubectl -n alx-ns edit configmap alluxio-cluster-alluxio-conf
```

The ConfigMap should contain 4 files: `alluxio-env.sh`, `alluxio-site.properties`, `log4j2.xml`, and `metrics.properties`.\
Edit the contents as needed and save the ConfigMap. kubectl then confirms the change:

```
configmap/alluxio-cluster-alluxio-conf edited
```

3. Restart the Alluxio components as needed (using the `alluxio-cluster` cluster in the `alx-ns` namespace as an example):

* Coordinator: `kubectl -n alx-ns rollout restart statefulset alluxio-cluster-coordinator`
* Worker: `kubectl -n alx-ns rollout restart deployment alluxio-cluster-worker`
* DaemonSet FUSE (`fuse.type = daemonSet`): `kubectl -n alx-ns rollout restart daemonset alluxio-fuse`
* CSI FUSE (`fuse.type = csi`): CSI FUSE pods cannot be restarted with `rollout restart`. Restart them either by terminating the application pod that uses them, or by deleting the FUSE pod manually with `kubectl -n alx-ns delete pod alluxio-fuse-xxx`.
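The restart commands above can be wrapped in a small helper that also waits for the rollout to finish. This is a minimal sketch, not part of Alluxio: the function names are hypothetical, and the namespace and resource names are assumed to match the example cluster in this document.

```shell
# Sketch only: wraps the rollout restart commands listed above and then waits
# for the rollout to finish. NAMESPACE and CLUSTER default to the example
# values used in this document; adjust them for your deployment.
NAMESPACE="${NAMESPACE:-alx-ns}"
CLUSTER="${CLUSTER:-alluxio-cluster}"

# Map a component name to the Kubernetes resource that runs it.
component_resource() {
  case "$1" in
    coordinator) echo "statefulset/${CLUSTER}-coordinator" ;;
    worker)      echo "deployment/${CLUSTER}-worker" ;;
    fuse)        echo "daemonset/alluxio-fuse" ;;  # DaemonSet FUSE only
    *)           echo "unknown component: $1" >&2; return 1 ;;
  esac
}

# Restart one component and block until the rollout completes.
# The 10m timeout is an illustrative assumption.
restart_component() {
  local resource
  resource="$(component_resource "$1")" || return 1
  kubectl -n "$NAMESPACE" rollout restart "$resource"
  kubectl -n "$NAMESPACE" rollout status "$resource" --timeout=10m
}
```

For example, `restart_component worker` restarts the worker Deployment and returns once the rollout has finished. CSI FUSE pods are deliberately not covered, because they cannot be restarted through `rollout restart`.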

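## Upgrade to a new Alluxio version

### Upgrade the operator

1. Upload the new Docker image for the new Alluxio operator version to your image registry, and extract the operator's Helm chart.\
   See the [installation documentation](https://documentation.alluxio.io/ee-ai-cn/ai-3.6/install/install-alluxio-on-kubernetes#准备) for details.
2. Run the following commands to apply the changes to the cluster.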

```shell
# Uninstall the operator. The operator is independent: its state does not affect the existing Alluxio cluster
$ helm uninstall operator
release "operator" uninstalled

# Check that all resources have been removed. The namespace is the last resource to be removed
$ kubectl get ns alluxio-operator
Error from server (NotFound): namespaces "alluxio-operator" not found

# From the new Helm chart directory, upgrade the CRDs first
$ kubectl apply -f alluxio-operator/crds 2>/dev/null
customresourcedefinition.apiextensions.k8s.io/alluxioclusters.k8s-operator.alluxio.com configured
customresourcedefinition.apiextensions.k8s.io/underfilesystems.k8s-operator.alluxio.com configured

# Reinstall the operator with the same operator-config.yaml, changing only the image tag
$ helm install operator -f operator-config.yaml alluxio-operator
NAME: operator
LAST DEPLOYED: Thu Jun 27 15:47:44 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
```
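Because the namespace is the last resource to be removed, it can help to wait for it to disappear before reapplying the CRDs and reinstalling. The following is a minimal sketch of such a wait loop. The function name, the timeout, and the polling interval are assumptions made for illustration.

```shell
# Sketch only: poll until the operator namespace is gone before reinstalling.
# The default namespace name comes from the example above; the timeout and
# interval (in seconds) are illustrative assumptions.
wait_for_namespace_removed() {
  local ns="${1:-alluxio-operator}" timeout="${2:-120}" interval="${3:-5}"
  local waited=0
  # `kubectl get ns` exits non-zero once the namespace no longer exists.
  # Caveat: any other kubectl failure (for example, no connectivity) also
  # ends the loop, so check the cluster connection first.
  while kubectl get ns "$ns" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "namespace $ns still present after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "namespace $ns removed"
}
```

Calling `wait_for_namespace_removed alluxio-operator` between `helm uninstall operator` and `kubectl apply -f alluxio-operator/crds` would replace the manual `kubectl get ns` check.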

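### Upgrade the Alluxio cluster

Before you begin, note the following:

* Once the upgrade starts, the coordinator, workers, and DaemonSet FUSE perform a rolling upgrade to the new image. Existing CSI FUSE pods are not restarted or upgraded; only newly created pods use the new image.
* During the upgrade, the cache hit rate may drop slightly, but it recovers fully once the cluster is fully running again.

Follow these steps to upgrade the cluster:

1. Upload the new Docker image for the new Alluxio version to your image registry. See the [installation documentation](https://documentation.alluxio.io/ee-ai-cn/ai-3.6/install/install-alluxio-on-kubernetes#准备) for details.
2. Update the `imageTag` field in `alluxio-cluster.yaml` to the new Alluxio version. In the example below, the new `imageTag` is `AI-3.6-12.0.2`.
3. Run the following commands to apply the changes to the cluster.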

```shell
# Apply the changes to Kubernetes
$ kubectl apply -f alluxio-cluster.yaml
alluxiocluster.k8s-operator.alluxio.com/alluxio-cluster configured

# Verify that the upgrade has started. You should see new pods being created
$ kubectl -n alx-ns get pod
NAME                                          READY   STATUS     RESTARTS   AGE
alluxio-cluster-coordinator-0                 0/1     Init:0/2   0          7s
alluxio-cluster-etcd-0                        1/1     Running    0          10m
alluxio-cluster-etcd-1                        1/1     Running    0          10m
alluxio-cluster-etcd-2                        1/1     Running    0          10m
alluxio-cluster-grafana-b89bf9dbb-77pb6       1/1     Running    0          10m
alluxio-cluster-prometheus-59b7b8bd64-b95jh   1/1     Running    0          10m
alluxio-cluster-worker-58999f8ddd-cd6r2       0/1     Init:0/2   0          7s
alluxio-cluster-worker-5d6786f5bf-cxv5j       1/1     Running    0          10m

# Check the cluster status
$ kubectl -n alx-ns get alluxiocluster
NAME              CLUSTERPHASE   AGE
alluxio-cluster   Updating       10m

# Wait for the cluster to become ready again
$ kubectl -n alx-ns get alluxiocluster
NAME              CLUSTERPHASE   AGE
alluxio-cluster   Ready          12m

# Check the cluster's pods. The age of the Alluxio pods has changed
$ kubectl -n alx-ns get pod
NAME                                          READY   STATUS    RESTARTS   AGE
alluxio-cluster-coordinator-0                 1/1     Running   0          93s
alluxio-cluster-etcd-0                        1/1     Running   0          12m
alluxio-cluster-etcd-1                        1/1     Running   0          12m
alluxio-cluster-etcd-2                        1/1     Running   0          12m
alluxio-cluster-grafana-b89bf9dbb-77pb6       1/1     Running   0          12m
alluxio-cluster-prometheus-59b7b8bd64-b95jh   1/1     Running   0          12m
alluxio-cluster-worker-58999f8ddd-cd6r2       1/1     Running   0          93s
alluxio-cluster-worker-58999f8ddd-rtftk       1/1     Running   0          33s

# Verify the version string
$ kubectl -n alx-ns exec -it alluxio-cluster-coordinator-0 -- alluxio info version 2>/dev/null
AI-3.6-12.0.2
```
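Instead of rerunning `kubectl get alluxiocluster` by hand, the wait for the `Ready` phase can be scripted. The sketch below parses the second column of the table output shown above. This relies on the column layout in the example rather than on a documented status field, and the function names, timeout, and interval are hypothetical.

```shell
# Sketch only: poll the CLUSTERPHASE column until the cluster reports Ready.
# Parsing the second column of the table output is an assumption based on
# the example output above, not a documented API field.
cluster_phase() {
  local ns="$1" name="$2"
  kubectl -n "$ns" get alluxiocluster "$name" --no-headers | awk '{print $2}'
}

# Timeout and interval are in seconds; the defaults are illustrative.
wait_for_cluster_ready() {
  local ns="${1:-alx-ns}" name="${2:-alluxio-cluster}"
  local timeout="${3:-600}" interval="${4:-10}" waited=0 phase
  while :; do
    phase="$(cluster_phase "$ns" "$name")"
    if [ "$phase" = "Ready" ]; then
      echo "cluster $name is Ready"
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "cluster $name still in phase '${phase}' after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

Running `wait_for_cluster_ready alx-ns alluxio-cluster` after `kubectl apply` blocks until the phase changes from `Updating` to `Ready`, or fails once the timeout is reached.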

## Scale the cluster

### Scale out workers

Before you begin, note the following:

* While the cluster is scaling, the cache hit rate may drop slightly, but it recovers fully once the cluster is fully running again.

Follow these steps to scale out the workers:

1. Edit `alluxio-cluster.yaml` to increase `count` under `worker`. The example below scales from 2 workers to 3.
2. Run the following commands to apply the changes to the cluster.

```shell
# Apply the changes to Kubernetes
$ kubectl apply -f alluxio-cluster.yaml
alluxiocluster.k8s-operator.alluxio.com/alluxio-cluster configured

# Verify that the cluster is scaling out. You should see a new pod being created
$ kubectl -n alx-ns get pod
NAME                                          READY   STATUS            RESTARTS   AGE
alluxio-cluster-coordinator-0                 1/1     Running           0          4m51s
alluxio-cluster-etcd-0                        1/1     Running           0          15m
alluxio-cluster-etcd-1                        1/1     Running           0          15m
alluxio-cluster-etcd-2                        1/1     Running           0          15m
alluxio-cluster-grafana-b89bf9dbb-77pb6       1/1     Running           0          15m
alluxio-cluster-prometheus-59b7b8bd64-b95jh   1/1     Running           0          15m
alluxio-cluster-worker-58999f8ddd-cd6r2       1/1     Running           0          4m51s
alluxio-cluster-worker-58999f8ddd-rtftk       1/1     Running           0          3m51s
alluxio-cluster-worker-58999f8ddd-p6n59       0/1     PodInitializing   0          4s

# Check that the new instance is ready
$ kubectl -n alx-ns get pod
NAME                                          READY   STATUS    RESTARTS   AGE
alluxio-cluster-coordinator-0                 1/1     Running   0          5m21s
alluxio-cluster-etcd-0                        1/1     Running   0          16m
alluxio-cluster-etcd-1                        1/1     Running   0          16m
alluxio-cluster-etcd-2                        1/1     Running   0          16m
alluxio-cluster-grafana-b89bf9dbb-77pb6       1/1     Running   0          16m
alluxio-cluster-prometheus-59b7b8bd64-b95jh   1/1     Running   0          16m
alluxio-cluster-worker-58999f8ddd-cd6r2       1/1     Running   0          5m21s
alluxio-cluster-worker-58999f8ddd-rtftk       1/1     Running   0          4m21s
alluxio-cluster-worker-58999f8ddd-p6n59       1/1     Running   0          34s
```
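To confirm the scale-out without reading the table by eye, the ready workers can be counted from the same `kubectl get pod` output. This is a minimal sketch: the function name is hypothetical, and the pod name prefix and column positions are assumed to match the example output above.

```shell
# Sketch only: count worker pods that are Running with all containers ready,
# by parsing the table output shown above. The name prefix and the column
# positions (NAME, READY, STATUS) are assumptions based on the example.
count_ready_workers() {
  local ns="${1:-alx-ns}" prefix="${2:-alluxio-cluster-worker-}"
  kubectl -n "$ns" get pod --no-headers |
    awk -v p="$prefix" '
      index($1, p) == 1 && $3 == "Running" {
        split($2, r, "/")            # READY column, e.g. "1/1"
        if (r[1] == r[2]) n++
      }
      END { print n + 0 }'
}
```

Comparing the result of `count_ready_workers` with the `count` value set in `alluxio-cluster.yaml` shows whether all requested workers are up.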