# Docker Installation

Deploy Alluxio on bare-metal Linux hosts, EC2 instances, or Slurm-managed clusters using Docker — no Kubernetes required.

## Overview

### Architecture

Each Alluxio component runs in its own Docker container using `--net=host`, sharing the host network stack. Components communicate directly by IP — no port mapping needed.

| Host             | Containers                                     |
| ---------------- | ---------------------------------------------- |
| Coordinator node | ETCD, Alluxio Coordinator, Prometheus, Grafana |
| Worker node(s)   | Alluxio Worker (one per host)                  |
| FUSE client node | Alluxio FUSE (one per host)                    |

### Artifacts

You will receive a download link for two artifacts:

| Artifact      | Filename                                                  | Purpose                                              |
| ------------- | --------------------------------------------------------- | ---------------------------------------------------- |
| Alluxio image | `alluxio-enterprise-AI-3.8-15.1.2-linux-amd64-docker.tar` | Single image for coordinator, worker, and FUSE roles |
| License       | — (delivered as a string, not a file)                     | Required to activate the cluster                     |

> **Platform**: Use `-linux-amd64-docker.tar` for x86 hosts, `-linux-arm64-docker.tar` for ARM.

The same image is loaded on every node. The role — coordinator, worker, or fuse — is set by the argument passed to `docker run`.
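Every `docker run` in this guide follows the same shape — only the role argument and the `ALLUXIO_JAVA_OPTS` properties change (an illustrative pattern, not a runnable command):

```shell
docker run -d --net=host --name=alluxio-<ROLE> \
  -e ALLUXIO_JAVA_OPTS="<role-specific properties>" \
  alluxio/alluxio-enterprise:AI-3.8-15.1.2 <ROLE>   # coordinator | worker | fuse
```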

## Before You Start

Run these checks before starting. Skipping them is the most common cause of deployment failures.

* [ ] **Docker** installed and running on every host:

  ```shell
  docker --version
  docker info
  ```
* [ ] **Network connectivity** between all hosts
* [ ] **Firewall / security groups** open the required ports between all hosts. See [Prerequisites → Networking](https://documentation.alluxio.io/ee-ai-en/prerequisites#networking) for the full port list; a quick reachability spot check is sketched after this checklist.
* [ ] **Alluxio image `.tar` file** downloaded and available
* [ ] **Alluxio license string** available
* [ ] **UFS credentials** ready (S3 access key/secret, or IAM role attached to the instance)

> **EC2 + IAM roles**: Attach the IAM instance profile before launch. No access keys are needed in the `docker run` commands.
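As a quick spot check of the items above, verify from each worker and FUSE host that the coordinator's private IP and the ETCD client port are reachable (a sketch assuming `nc` is installed; 2379 is the ETCD port used throughout this guide — see the Prerequisites page for the full port list):

```shell
ping -c 1 <COORDINATOR_PRIVATE_IP>      # basic reachability
nc -zv <COORDINATOR_PRIVATE_IP> 2379    # ETCD client port
```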

## Installation Steps

### 0. Load the Alluxio Image

Run on **every host** (coordinator, each worker, FUSE client):

```shell
docker load -i alluxio-enterprise-AI-3.8-15.1.2-linux-amd64-docker.tar
```

**✅ Verify:**

```shell
docker images | grep alluxio
```

```console
alluxio/alluxio-enterprise   AI-3.8-15.1.2   4091c3d8dbc4   ...   2.58GB
```

Note the image name and tag — you will use them in every subsequent `docker run` command.
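If you are installing on many hosts, a loop like the following can distribute and load the image in one pass (a sketch assuming passwordless SSH and a `hosts.txt` file listing one hostname per line — adapt to your own tooling):

```shell
TAR=alluxio-enterprise-AI-3.8-15.1.2-linux-amd64-docker.tar
while read -r host; do
  scp "$TAR" "$host":/tmp/
  ssh "$host" "docker load -i /tmp/$TAR"
done < hosts.txt
```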

### 1. Start ETCD

SSH into the **coordinator node** (which hosts ETCD):

```shell
docker run -p 2379:2379 -p 2380:2380 -d --name etcd-standalone \
  quay.io/coreos/etcd:v3.5.9 etcd \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://<COORDINATOR_PRIVATE_IP>:2379
```

Use the **private IP** for `--advertise-client-urls` — not the public DNS or hostname — to ensure workers on other hosts can reach ETCD reliably.

> For production, run a 3-node ETCD cluster for high availability. Single-node is suitable for evaluation only.
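For reference, node 1 of such a cluster could be launched as follows (a sketch assuming private IPs `<IP1>`, `<IP2>`, `<IP3>`; repeat on each node with its own name and IPs, and point `alluxio.etcd.endpoints` at all three client URLs — comma-separated, assuming the plural property accepts a list):

```shell
docker run -d -p 2379:2379 -p 2380:2380 --name etcd-1 \
  quay.io/coreos/etcd:v3.5.9 etcd \
  --name etcd-1 \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://<IP1>:2379 \
  --listen-peer-urls http://0.0.0.0:2380 \
  --initial-advertise-peer-urls http://<IP1>:2380 \
  --initial-cluster etcd-1=http://<IP1>:2380,etcd-2=http://<IP2>:2380,etcd-3=http://<IP3>:2380 \
  --initial-cluster-state new
```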

**✅ Verify:**

```shell
docker exec etcd-standalone etcdctl endpoint status --cluster -w table
```

```console
+-----------------------------------+------------------+---------+---------+-----------+
|             ENDPOINT              |        ID        | VERSION | DB SIZE | IS LEADER |
+-----------------------------------+------------------+---------+---------+-----------+
| http://172.31.31.133:2379         | 8e9e05c52164694d |   3.5.9 |   20 kB |      true |
+-----------------------------------+------------------+---------+---------+-----------+
```

### 2. Start Coordinator (Coordinator host)

```shell
docker run -d --net=host --name=alluxio-coordinator \
  -e ALLUXIO_JAVA_OPTS="\
    -Dalluxio.license=<YOUR_LICENSE> \
    -Dalluxio.etcd.endpoints=http://<COORDINATOR_PRIVATE_IP>:2379 \
    -Dalluxio.coordinator.hostname=<COORDINATOR_PRIVATE_IP> \
    -Dalluxio.mount.table.source=ETCD" \
  alluxio/alluxio-enterprise:AI-3.8-15.1.2 coordinator
```

Key properties:

| Property                          | Purpose                                     |
| --------------------------------- | ------------------------------------------- |
| `alluxio.license`                 | License string                              |
| `alluxio.etcd.endpoints`          | ETCD address (private IP + port 2379)       |
| `alluxio.coordinator.hostname`    | Private IP workers use to register          |
| `alluxio.mount.table.source=ETCD` | Persist mount table in ETCD across restarts |

**✅ Verify:**

```shell
docker logs alluxio-coordinator 2>&1 | grep -i "started\|listening\|etcd"
```

If no `ERROR` lines appear in the first 30 seconds, the coordinator started cleanly.
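You can also probe the coordinator's metrics endpoint — port 19999, the same target the Monitoring section scrapes later (a sketch assuming the default `/metrics` path and that the port is open):

```shell
curl -s http://<COORDINATOR_PRIVATE_IP>:19999/metrics | head -5
```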

### 3. Start Workers (each Worker host)

SSH into each **worker host**:

```shell
# Create the cache directory
sudo mkdir -p /data/alluxio-cache

# Start the worker
docker run -d --net=host --name=alluxio-worker \
  -e ALLUXIO_JAVA_OPTS="\
    -Xmx8g -Xms2g -XX:MaxDirectMemorySize=8g \
    -Dalluxio.license=<YOUR_LICENSE> \
    -Dalluxio.etcd.endpoints=http://<COORDINATOR_PRIVATE_IP>:2379 \
    -Dalluxio.coordinator.hostname=<COORDINATOR_PRIVATE_IP> \
    -Dalluxio.mount.table.source=ETCD \
    -Dalluxio.worker.page.store.dirs=/data/alluxio-cache \
    -Dalluxio.worker.page.store.sizes=<CACHE_SIZE>" \
  alluxio/alluxio-enterprise:AI-3.8-15.1.2 worker
```

Set `<CACHE_SIZE>` to \~80% of available space on `/data/alluxio-cache` (e.g., `50GB`, `200GB`, `1TB`). Avoid `/tmp` — it is cleared on reboot.
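A quick way to compute that value on each worker host (a sketch using GNU `df`; assumes `/data/alluxio-cache` already exists):

```shell
# Available GiB on the cache filesystem, scaled to 80%:
avail_gb=$(df -BG --output=avail /data/alluxio-cache | tail -1 | tr -dc '0-9')
echo "Suggested CACHE_SIZE: $((avail_gb * 80 / 100))GB"
```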

> **Cache persistence**: `docker restart` preserves the page store — cached data survives. Only `docker rm` loses the container state (the host directory is unaffected).

> To update worker configuration later without recreating the container, see [Appendix D](#d-updating-configuration).

> **S3 API (optional)**: Add `-Dalluxio.worker.s3.api.enabled=true` to enable the S3-compatible endpoint on each worker (port 29998). Only needed if clients will access Alluxio via the S3 API.

> **JVM sizing** — Alluxio stores cached data on disk, not in heap. The JVM heap does not need to be large relative to cache size:

| Host RAM | `-Xmx` | `-XX:MaxDirectMemorySize` |
| -------- | ------ | ------------------------- |
| 16 GB    | 4g     | 4g                        |
| 32 GB    | 8g     | 8g                        |
| 64 GB    | 16g    | 16g                       |
| 128 GB+  | 32g    | 32g                       |
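The table follows a simple quarter-of-host-RAM rule, capped at 32g. To derive the values on a given host (a sketch reading `/proc/meminfo`, so Linux only; rounds down on hosts with slightly less than nominal RAM):

```shell
awk '/MemTotal/ { h = int($2 / 1048576 / 4); if (h > 32) h = 32;
  printf "-Xmx%dg -XX:MaxDirectMemorySize=%dg\n", h, h }' /proc/meminfo
```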

**✅ Verify (from coordinator host):**

```shell
docker exec alluxio-coordinator alluxio info nodes
```

```console
WorkerId                                         Address                          Status
worker-15ed4a17-2154-4454-ba0b-32b46ff06bfb     ip-172-31-26-247...:29999        ONLINE
worker-704c4d42-d189-41ff-b6f5-775a7d1551b3     ip-172-31-18-67...:29999         ONLINE
```

Workers may take 10–15 seconds to register after starting.
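If you are scripting the deployment, a simple wait loop avoids racing that registration window (a sketch; replace `<N>` with your expected worker count):

```shell
until [ "$(docker exec alluxio-coordinator alluxio info nodes | grep -c ONLINE)" -ge <N> ]; do
  echo "waiting for workers to register..."; sleep 5
done
```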

### 4. Mount Storage

Run `alluxio mount add` from any host that has access to the coordinator. For full UFS configuration options, see [Underlying Storage](https://documentation.alluxio.io/ee-ai-en/ufs).

**S3 with IAM role (recommended on EC2):**

```shell
docker exec alluxio-coordinator alluxio mount add \
  --path /s3 \
  --ufs-uri s3://<S3_BUCKET>/ \
  --option alluxio.underfs.s3.region=<S3_REGION>
```

**S3 with access key/secret:**

```shell
docker exec alluxio-coordinator alluxio mount add \
  --path /s3 \
  --ufs-uri s3://<S3_BUCKET>/ \
  --option alluxio.underfs.s3.region=<S3_REGION> \
  --option s3a.accessKeyId=<ACCESS_KEY> \
  --option s3a.secretKey=<SECRET_KEY>
```

**✅ Verify:**

```shell
docker exec alluxio-coordinator alluxio mount list
```

```console
Listing all mount points
s3://<S3_BUCKET>/  on  /s3/  properties={alluxio.underfs.s3.region=<S3_REGION>}
```

### 5. Verify Data Access

```shell
docker exec alluxio-coordinator alluxio fs ls /s3/
```

**✅ Success:** The command returns a listing of files and directories in your S3 bucket. An empty bucket returns an empty listing without errors. Example:

```console
-rwx------  0  0  0  B  PERSISTED  01-01-2024 00:00:00:000  /s3/dataset/
-rwx------  0  0  1234  B  PERSISTED  01-01-2024 00:00:00:000  /s3/README.md
```

If the command fails, see [Appendix A: Troubleshooting](#a-troubleshooting).

### 6. Start FUSE (FUSE client host)

SSH into the **FUSE client host**. Create the mount directory:

```shell
sudo mkdir -p /mnt/alluxio
sudo chown $(whoami) /mnt/alluxio
chmod 755 /mnt/alluxio
```

The `-o allow_other` flag (passed to FUSE at the end of the `docker run` command) requires `user_allow_other` to be set in `/etc/fuse.conf` on the FUSE host. Without that setting, FUSE fails with `fusermount: option allow_other only allowed if 'user_allow_other' is set in /etc/fuse.conf`. Enable it:

```shell
grep -q user_allow_other /etc/fuse.conf || echo user_allow_other | sudo tee -a /etc/fuse.conf
```

Start the FUSE container:

```shell
docker run -d --privileged --net=host --name=alluxio-fuse \
  -v /mnt/alluxio:/mnt/alluxio:shared \
  -e ALLUXIO_JAVA_OPTS="\
    -Xmx4g -Xms1g -XX:MaxDirectMemorySize=4g \
    -Dalluxio.etcd.endpoints=http://<COORDINATOR_PRIVATE_IP>:2379 \
    -Dalluxio.coordinator.hostname=<COORDINATOR_PRIVATE_IP> \
    -Dalluxio.mount.table.source=ETCD" \
  alluxio/alluxio-enterprise:AI-3.8-15.1.2 fuse -o allow_other /mnt/alluxio/fuse
```

> `--privileged` is required for FUSE to mount inside the container and propagate to the host via `-v /mnt/alluxio:/mnt/alluxio:shared`.

**✅ Verify** (wait \~10 seconds):

```shell
ls /mnt/alluxio/fuse/
```

```console
s3
```

Each Alluxio mount point appears as a subdirectory. `/mnt/alluxio/fuse/s3/` maps directly to `s3://<S3_BUCKET>/`.

**Test read and write:**

```shell
# Read an existing file
cat /mnt/alluxio/fuse/s3/test.txt

# Write a new file
echo "hello alluxio" > /mnt/alluxio/fuse/s3/hello.txt

# Verify it appeared in S3
aws s3 ls s3://<S3_BUCKET>/hello.txt
```

## Uninstall

Stop and remove containers on each host in reverse order:

**FUSE client host:**

```shell
docker stop alluxio-fuse && docker rm alluxio-fuse
sudo umount -l /mnt/alluxio/fuse   # if mount is stuck after container removal
```

**Each worker host:**

```shell
docker stop alluxio-worker && docker rm alluxio-worker
```

**Coordinator host:**

```shell
docker stop alluxio-coordinator && docker rm alluxio-coordinator
docker stop etcd-standalone && docker rm etcd-standalone
```

> **Mount table persistence**: ETCD stores the mount table in its container filesystem. Running `docker rm` on the ETCD container loses all mount points. To persist them across container recreation, add `-v /data/etcd:/etcd-data` to the `docker run` flags and pass `--data-dir /etcd-data` as an argument to `etcd`.
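Concretely, the Step 1 command with persistence added would look like this (a sketch; the host path `/data/etcd` is an example):

```shell
sudo mkdir -p /data/etcd
docker run -p 2379:2379 -p 2380:2380 -d --name etcd-standalone \
  -v /data/etcd:/etcd-data \
  quay.io/coreos/etcd:v3.5.9 etcd \
  --data-dir /etcd-data \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://<COORDINATOR_PRIVATE_IP>:2379
```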

## Monitoring (Optional)

```shell
mkdir -p ~/monitoring/prometheus ~/monitoring/grafana
```

Create `~/monitoring/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 60s

scrape_configs:
  - job_name: "coordinator"
    static_configs:
      - targets: ["<COORDINATOR_PRIVATE_IP>:19999"]
  - job_name: "workers"
    static_configs:
      - targets: ["<WORKER1_PRIVATE_IP>:30000", "<WORKER2_PRIVATE_IP>:30000"]
  - job_name: "fuse"
    static_configs:
      - targets: ["<FUSE_PRIVATE_IP>:49999"]
```

Create `~/monitoring/grafana/datasource.yml`:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
    access: proxy
    editable: true
```

Create `~/monitoring/compose.yaml`:

```yaml
services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - 9090:9090
    restart: unless-stopped
    volumes:
      - ./prometheus:/etc/prometheus
      - prom_data:/prometheus
  grafana:
    image: grafana/grafana
    container_name: grafana
    ports:
      - 3000:3000
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=grafana
    volumes:
      - ./grafana:/etc/grafana/provisioning/datasources
volumes:
  prom_data:
```

```shell
cd ~/monitoring && docker compose up -d
```

Access Grafana at `http://<COORDINATOR_PUBLIC_IP>:3000` (login: `admin` / `grafana`).

Import the Alluxio dashboard JSON from the [monitoring documentation](https://documentation.alluxio.io/ee-ai-en/administration/monitoring-alluxio).

## Appendix

### A. Troubleshooting

**Workers not appearing in `alluxio info nodes`**

1. Verify ETCD is reachable from the worker host:

   ```shell
   curl http://<COORDINATOR_PRIVATE_IP>:2379/health
   ```

   Expected: `{"health":"true"}`
2. Check worker logs:

   ```shell
   docker logs alluxio-worker 2>&1 | grep -i "error\|etcd\|register" | tail -20
   ```
3. Confirm `alluxio.coordinator.hostname` is set to an IP reachable from the worker — if it is unreachable, registration fails silently. A quick check for both is sketched below.
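A quick check for both (a sketch; run on the worker host):

```shell
# Confirm the hostname the worker was actually given:
docker inspect alluxio-worker --format '{{range .Config.Env}}{{println .}}{{end}}' | grep coordinator.hostname
# Confirm that IP is reachable from here:
ping -c 1 <COORDINATOR_PRIVATE_IP>
```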

**FUSE mount not visible after container starts**

1. Check that the mount directory exists:

   ```shell
   ls -la /mnt/alluxio/fuse
   ```

   If the directory is missing, the container still starts successfully, but the mount is never created.
2. Check container logs:

   ```shell
   docker logs alluxio-fuse 2>&1 | tail -20
   ```

   `Mount point '/mnt/alluxio/fuse' does not exist` confirms the directory was missing.
3. Fix the directory, then recreate the container:

   ```shell
   docker rm -f alluxio-fuse
   # re-run the docker run command from Step 6
   ```

**`Transport endpoint is not connected` after FUSE container is removed**

The FUSE filesystem stays registered with the kernel after the container exits. Unmount manually:

```shell
sudo umount -l /mnt/alluxio/fuse
```

**`alluxio mount add` fails with `unknown command`**

Use named flags — the old positional syntax is no longer supported:

```shell
alluxio mount add --path /s3 --ufs-uri s3://<BUCKET>/ --option alluxio.underfs.s3.region=<REGION>
```

### B. Worker Identity

Each worker generates a unique identity file (`/opt/alluxio/conf/worker_identity`) on first start. If this file is lost, the worker registers as a new instance and the old entry stays in ETCD as `OFFLINE`.
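To see which identity a running worker is using (a sketch):

```shell
docker exec alluxio-worker cat /opt/alluxio/conf/worker_identity
```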

**Triggers for identity loss:**

* Host reboot (container filesystem is reset)
* Container recreation: `docker rm` + `docker run` — the most common trigger when updating `ALLUXIO_JAVA_OPTS`

**Cleaning up OFFLINE entries** (after identity loss has already occurred):

```shell
docker exec alluxio-coordinator alluxio process remove-worker -n <WORKER_ID>
```

Worker IDs are shown in `alluxio info nodes`.

**Preventing identity loss** — persist the identity file via a volume mount so it survives container recreation:

```shell
# Copy the identity file out of the running container first:
docker cp alluxio-worker:/opt/alluxio/conf/worker_identity /etc/alluxio/worker_identity

# Then add this flag to the worker docker run command:
-v /etc/alluxio/worker_identity:/opt/alluxio/conf/worker_identity
```

With this mount, `docker restart` and even `docker rm` + `docker run` preserve the worker's identity.

For the full explanation of why identity persistence matters and the impact on the hash ring, see [Restarting a Worker](https://documentation.alluxio.io/ee-ai-en/administration/managing-ring#restarting-a-worker).

### C. Collecting Logs for Support

```shell
# Coordinator logs
docker cp alluxio-coordinator:/opt/alluxio/logs /tmp/coordinator-logs

# Worker logs (run on each worker host)
docker cp alluxio-worker:/opt/alluxio/logs /tmp/worker-logs

# Run collectInfo (from coordinator)
docker exec alluxio-coordinator bash -c "cd /tmp && /opt/alluxio/bin/collectinfo.sh"
docker cp alluxio-coordinator:/tmp/output.tar.gz /tmp/alluxio-collectinfo.tar.gz
```

### D. Updating Configuration

Changing `ALLUXIO_JAVA_OPTS` requires recreating the container, which generates a new worker ID and leaves stale `OFFLINE` entries in ETCD (see [Appendix B](#b-worker-identity)). To avoid this, use `alluxio-site.properties` instead and update it in-place — `docker restart` preserves the worker ID.

**Option A — `docker cp` (no volume mount needed):**

```shell
# Edit your local copy of alluxio-site.properties, then copy it in:
docker cp alluxio-site.properties alluxio-worker:/opt/alluxio/conf/alluxio-site.properties
docker restart alluxio-worker
```

**Option B — volume mount (recommended for repeated updates):**

```shell
# Add this flag to the original docker run command:
-v /etc/alluxio/alluxio-site.properties:/opt/alluxio/conf/alluxio-site.properties
```

Then edit the host file and restart:

```shell
# Edit /etc/alluxio/alluxio-site.properties, then:
docker restart alluxio-worker
```

After restart, verify the worker rejoins with the same identity:

```shell
docker exec alluxio-coordinator alluxio info nodes
# Expected: same WorkerId as before, status ONLINE
```

## Related Documentation

* [Cluster Management](https://documentation.alluxio.io/ee-ai-en/administration/managing-alluxio) — Post-deployment operations: scaling, hash ring tuning, worker lifecycle, and UFS mount management
* [Amazon S3 UFS](https://documentation.alluxio.io/ee-ai-en/ufs/s3) — S3 credentials and configuration options
* [POSIX API (FUSE)](https://documentation.alluxio.io/ee-ai-en/data-access/fuse-based-posix-api) — FUSE mount options and tuning
* [S3 API](https://documentation.alluxio.io/ee-ai-en/data-access/s3-api) — Using the Alluxio S3-compatible endpoint
* [Monitoring](https://documentation.alluxio.io/ee-ai-en/administration/monitoring-alluxio) — Grafana dashboard import and metrics reference
