Multi-Availability Zone (Multi-AZ) Deployments
Overview
Alluxio enhances high availability by supporting deployments across multiple availability zones (AZs). If an entire AZ becomes unavailable, Alluxio clients can automatically failover to a healthy cluster in another AZ, ensuring uninterrupted service and data access. This capability, combined with file replication and UFS fallback, provides robust I/O resiliency for mission-critical workloads.
How it Works
When an Alluxio client needs to read data, it prioritizes workers in its local AZ to minimize latency. If all local workers are unavailable (e.g., due to a zone-wide outage), the client automatically redirects its request to a worker in a different AZ. This failover is seamless to the application.
For files replicated across multiple AZs, the client will intelligently select the best data source, prioritizing fully cached replicas to optimize performance. If no Alluxio workers are available in any zone, the client will fall back to reading from the UFS as a final step.
Enable Multi-AZ on Kubernetes
The Alluxio Operator simplifies deploying multiple clusters in Kubernetes using the ClusterGroup custom resource. This allows you to manage multiple Alluxio clusters with a consistent configuration. For instructions on installing the operator, see the Alluxio Operator Installation Guide.
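Before proceeding, a quick sanity check that the operator and its custom resources are installed (exact CRD names depend on the operator version):
kubectl api-resources | grep -i clustergroup
kubectl get crds | grep alluxio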
Deploying a multi-AZ setup on Kubernetes involves the following steps:
Step 1: Prepare Namespaces
This guide assumes you are deploying three clusters (alluxio-a, alluxio-b, alluxio-c) across two namespaces (alx-ns, alx-ns-2). First, create the namespaces if they don't exist:
kubectl create namespace alx-ns
kubectl create namespace alx-ns-2
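To confirm both namespaces exist:
kubectl get namespace alx-ns alx-ns-2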
Step 2: Define and Deploy the ClusterGroup
Create a ClusterGroup manifest file. This YAML file defines the Alluxio clusters and includes the necessary properties to enable multi-AZ mode.
Note: Clusters created with a ClusterGroup share the same template for properties and resource allocation. To ensure clusters are deployed to different availability zones, use the nodeSelector field to assign each cluster to a distinct set of nodes.
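For example, nodes in each zone can be labeled to match the nodeSelector values used in the manifests below (node names are placeholders):
kubectl label node <node-in-az-1> region=az-1
kubectl label node <node-in-az-2> region=az-2
kubectl label node <node-in-az-3> region=az-3
If your nodes already carry the standard topology.kubernetes.io/zone label, that label can be referenced in nodeSelector instead of a custom region label.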
Kubernetes Deployment Patterns
There are three supported deployment patterns for the ETCD service that coordinates the clusters.
Pattern 1: Independent ETCD per Cluster
Each Alluxio cluster operates with its own dedicated ETCD instance. This pattern provides the highest isolation.

1. Define Topology: Create a multi-az-clusters.json file with unique ETCD endpoints for each cluster. The service names should follow the pattern <cluster-name>-etcd.<namespace>.
[
{
"clusterNames": ["alx-ns-alluxio-a"],
"endpoints": ["http://alluxio-a-etcd.alx-ns:2379"]
},
{
"clusterNames": ["alx-ns-alluxio-b"],
"endpoints": ["http://alluxio-b-etcd.alx-ns:2379"]
},
{
"clusterNames": ["alx-ns-2-alluxio-c"],
"endpoints": ["http://alluxio-c-etcd.alx-ns-2:2379"]
}
]
2. Create ConfigMap: Package the file into a ConfigMap named multi-cluster in each namespace.
kubectl create configmap multi-cluster --from-file=multi-az-clusters.json -n alx-ns
kubectl create configmap multi-cluster --from-file=multi-az-clusters.json -n alx-ns-2
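To confirm the file was packaged correctly in both namespaces:
kubectl get configmap multi-cluster -n alx-ns -o yaml
kubectl get configmap multi-cluster -n alx-ns-2 -o yaml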
3. Define the ClusterGroup: In the ClusterGroup manifest, do not include etcd in the dependencies section. Each cluster definition will automatically provision its own ETCD instance. Key points:
The multi-cluster JSON configuration path must be defined in each cluster’s properties.
The ConfigMap must be mounted to the /multi-az directory in all components of the Alluxio cluster.
apiVersion: k8s-operator.alluxio.com/v1
kind: ClusterGroup
metadata:
name: alluxio-cluster-group
namespace: alx-ns
spec:
dependencies:
dashboard:
image: alluxio/alluxio-dashboard
imageTag: AI-3.7-13.0.0
license: "licenseString"
gateway:
image: alluxio/alluxio-gateway
imageTag: AI-3.7-13.0.0
groups:
- name: alluxio-a
namespace: alx-ns
nodeSelector:
region: az-1
- name: alluxio-b
namespace: alx-ns
nodeSelector:
region: az-2
- name: alluxio-c
namespace: alx-ns-2
nodeSelector:
region: az-3
template:
spec:
image: alluxio/alluxio-enterprise
imageTag: AI-3.7-13.0.0
properties:
alluxio.multi.cluster.enabled: "true"
alluxio.multi.cluster.config.path: "/multi-az/multi-az-clusters.json"
worker:
count: 2
configMaps:
coordinator:
multi-cluster: /multi-az
worker:
multi-cluster: /multi-az
fuse:
multi-cluster: /multi-az
etcd:
replicaCount: 1
After applying this manifest, you can verify that each cluster has its own etcd pod in its respective namespace.
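For example, assuming the manifest above is saved as clustergroup.yaml (the file name is arbitrary) and that the ETCD pod names contain the string etcd:
kubectl apply -f clustergroup.yaml
kubectl get pods -n alx-ns | grep etcd    # expect pods for alluxio-a and alluxio-b
kubectl get pods -n alx-ns-2 | grep etcd  # expect pods for alluxio-c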
Pattern 2: Shared ETCD Managed by the Operator
All Alluxio clusters share a single ETCD cluster that is deployed and managed by the ClusterGroup.

1. Define Topology: Create a multi-az-clusters.json file where all clusters point to the same shared ETCD service. The service name follows the pattern <cluster-group-name>-etcd.<namespace>.
[
{
"clusterNames": ["alx-ns-alluxio-a"],
"endpoints": ["http://alluxio-cg-etcd.default:2379"]
},
{
"clusterNames": ["alx-ns-alluxio-b"],
"endpoints": ["http://alluxio-cg-etcd.default:2379"]
},
{
"clusterNames": ["alx-ns-2-alluxio-c"],
"endpoints": ["http://alluxio-cg-etcd.default:2379"]
}
]
2. Create ConfigMap: Create the multi-cluster ConfigMap in each namespace.
kubectl create configmap multi-cluster --from-file=multi-az-clusters.json -n alx-ns
kubectl create configmap multi-cluster --from-file=multi-az-clusters.json -n alx-ns-2
3. Define the ClusterGroup: In the ClusterGroup manifest, include etcd in the dependencies section. This instructs the operator to deploy a shared ETCD cluster. Key points:
The multi-cluster JSON configuration path must be defined in each cluster’s properties.
The ConfigMap must be mounted to the /multi-az directory in all components of the Alluxio cluster.
apiVersion: k8s-operator.alluxio.com/v1
kind: ClusterGroup
metadata:
name: alluxio-cg
namespace: default
spec:
dependencies:
etcd:
replicaCount: 3
dashboard:
image: alluxio/alluxio-dashboard
imageTag: AI-3.7-13.0.0
license: "licenseString"
gateway:
image: alluxio/alluxio-gateway
imageTag: AI-3.7-13.0.0
groups:
- name: alluxio-a
namespace: alx-ns
nodeSelector:
region: az-1
- name: alluxio-b
namespace: alx-ns
nodeSelector:
region: az-2
- name: alluxio-c
namespace: alx-ns-2
nodeSelector:
region: az-3
template:
spec:
image: alluxio/alluxio-enterprise
imageTag: AI-3.7-13.0.0
properties:
alluxio.multi.cluster.enabled: "true"
alluxio.multi.cluster.config.path: "/multi-az/multi-az-clusters.json"
worker:
count: 2
configMaps:
coordinator:
multi-cluster: /multi-az
worker:
multi-cluster: /multi-az
fuse:
multi-cluster: /multi-az
After applying, a single ETCD cluster will be created in the ClusterGroup's namespace (default in this case), and all Alluxio clusters will connect to it.
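To confirm the shared ETCD service is up, check the ClusterGroup's namespace (the service name alluxio-cg-etcd follows the <cluster-group-name>-etcd pattern described above):
kubectl get pods -n default | grep etcd
kubectl get service alluxio-cg-etcd -n default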
Pattern 3: External ETCD
All Alluxio clusters connect to a pre-existing, externally managed ETCD cluster. This is common in environments where a central ETCD is already available.

1. Define Topology: Create a multi-az-clusters.json file where all clusters point to the external ETCD endpoint.
[
{
"clusterNames": ["alx-ns-alluxio-a"],
"endpoints": ["http://external-etcd.default:2379"]
},
{
"clusterNames": ["alx-ns-alluxio-b"],
"endpoints": ["http://external-etcd.default:2379"]
},
{
"clusterNames": ["alx-ns-2-alluxio-c"],
"endpoints": ["http://external-etcd.default:2379"]
}
]
2. Create ConfigMap: Create the multi-cluster ConfigMap in each namespace.
kubectl create configmap multi-cluster --from-file=multi-az-clusters.json -n alx-ns
kubectl create configmap multi-cluster --from-file=multi-az-clusters.json -n alx-ns-2
3. Define the ClusterGroup: In the ClusterGroup manifest, disable the operator's ETCD deployment and point Alluxio to the external service. Key points:
Omit etcd from the dependencies section.
Set etcd.enabled: false in the template.spec.
Add the alluxio.etcd.endpoints property to the template.spec.properties.
The multi-cluster JSON configuration path must be defined in each cluster's properties.
The ConfigMap must be mounted to the /multi-az directory in all components of the Alluxio cluster.
apiVersion: k8s-operator.alluxio.com/v1
kind: ClusterGroup
metadata:
name: alluxio-cg
namespace: default
spec:
dependencies: # ETCD is not included here
dashboard:
image: alluxio/alluxio-dashboard
imageTag: AI-3.7-13.0.0
license: "licenseString"
gateway:
image: alluxio/alluxio-gateway
imageTag: AI-3.7-13.0.0
groups:
- name: alluxio-a
namespace: alx-ns
nodeSelector:
region: az-1
- name: alluxio-b
namespace: alx-ns
nodeSelector:
region: az-2
- name: alluxio-c
namespace: alx-ns-2
nodeSelector:
region: az-3
template:
spec:
image: alluxio/alluxio-enterprise
imageTag: AI-3.7-13.0.0
properties:
alluxio.multi.cluster.enabled: "true"
alluxio.multi.cluster.config.path: "/multi-az/multi-az-clusters.json"
alluxio.etcd.endpoints: "http://external-etcd.default:2379"
worker:
count: 2
configMaps:
coordinator:
multi-cluster: /multi-az
worker:
multi-cluster: /multi-az
fuse:
multi-cluster: /multi-az
etcd:
enabled: false # Explicitly disable internal ETCD
After applying, the Alluxio clusters will start and connect to the specified external ETCD service.
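One way to check that the external endpoint is reachable from inside Kubernetes is to run etcdctl from a temporary pod; the image below is just one option that bundles etcdctl:
kubectl run etcd-check --rm -it --restart=Never --image=bitnami/etcd --command -- \
  etcdctl --endpoints=http://external-etcd.default:2379 endpoint health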
Step 3: Verify the Deployment
You can verify that the clusters are connected by running the info nodes command from any coordinator pod:
$ kubectl exec -it -n <namespace> <coordinator-pod-name> -- alluxio info nodes
A successful configuration will list the worker nodes from all participating clusters.
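For instance, to run the check against every cluster in this example (pod names are placeholders and are assumed to contain "coordinator"; list them with kubectl get pods first):
kubectl get pods -n alx-ns -o name | grep coordinator
kubectl exec -it -n alx-ns <alluxio-a-coordinator-pod> -- alluxio info nodes
kubectl exec -it -n alx-ns <alluxio-b-coordinator-pod> -- alluxio info nodes
kubectl exec -it -n alx-ns-2 <alluxio-c-coordinator-pod> -- alluxio info nodes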
Optimizing I/O in Multi-AZ
Optimized I/O for Replicated Files
Alluxio supports optimized I/O for multi-replicated files. For files replicated across AZs, the client intelligently chooses the best data source, in the following order of preference:
A local worker that has fully cached the file.
A remote worker that has fully cached the file.
A local worker (if no fully cached replicas are available).
A remote worker (if no local workers are available).
The UFS (as a last resort).
This logic ensures that clients prioritize the fastest and most complete data copies, even across zones.
Passive Caching
In a multi-AZ deployment, it is recommended to enable Passive Cache. When a client reads a file from a remote cluster, passive caching will automatically create a replica of that file in the client's local cluster. This ensures that subsequent reads of the same file can be served locally, improving performance and data locality.
To enable optimized I/O and passive caching, add the following properties to alluxio-site.properties:
# Note that multi-replica optimized IO must be enabled for passive cache to take effect
alluxio.user.replica.prefer.cached.replicas=true
alluxio.user.file.passive.cache.enabled=true
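When deploying with the ClusterGroup described above, the same two properties can instead be set once in the template's properties map (shown in the earlier manifests) so they apply to every cluster; a minimal sketch:
template:
  spec:
    properties:
      alluxio.user.replica.prefer.cached.replicas: "true"
      alluxio.user.file.passive.cache.enabled: "true"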