# Docker Installation

Deploy Alluxio on bare-metal Linux hosts, EC2 instances, or Slurm-managed clusters using Docker — no Kubernetes required.

## Overview

### Architecture

Each Alluxio component runs in its own Docker container using `--net=host`, sharing the host network stack. Components communicate directly by IP — no port mapping needed.

| Host             | Containers                                     |
| ---------------- | ---------------------------------------------- |
| Coordinator node | ETCD, Alluxio Coordinator, Prometheus, Grafana |
| Worker node(s)   | Alluxio Worker (one per host)                  |
| FUSE client node | Alluxio FUSE (one per host)                    |

### Artifacts

You will receive a download link for two artifacts:

| Artifact      | Filename                                                  | Purpose                                              |
| ------------- | --------------------------------------------------------- | ---------------------------------------------------- |
| Alluxio image | `alluxio-enterprise-AI-3.8-15.1.2-linux-amd64-docker.tar` | Single image for coordinator, worker, and FUSE roles |
| License       | — (delivered as a string, not a file)                     | Required to activate the cluster                     |

> **Platform**: Use `-linux-amd64-docker.tar` for x86 hosts, `-linux-arm64-docker.tar` for ARM.

The same image is loaded on every node. The role — coordinator, worker, or fuse — is set by the argument passed to `docker run`.
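Every `docker run` in this guide follows the same shape — only the role argument and the `ALLUXIO_JAVA_OPTS` properties change (an illustrative pattern, not a runnable command):

```shell
docker run -d --net=host --name=alluxio-<ROLE> \
  -e ALLUXIO_JAVA_OPTS="<role-specific properties>" \
  alluxio/alluxio-enterprise:AI-3.8-15.1.2 <ROLE>   # coordinator | worker | fuse
```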

## Before You Start

Run these checks before starting. Skipping them is the most common cause of deployment failures.

* [ ] **Docker** installed and running on every host:

  ```shell
  docker --version
  docker info
  ```
* [ ] **Network connectivity** between all hosts
* [ ] **Firewall / security groups** open the required ports between all hosts. See [Prerequisites → Networking](https://documentation.alluxio.io/ee-ai-en/prerequisites#networking) for the full port list; a quick reachability spot check is sketched after this checklist.
* [ ] **Alluxio image `.tar` file** downloaded and available
* [ ] **Alluxio license string** available
* [ ] **UFS credentials** ready (S3 access key/secret, or IAM role attached to the instance)

> **EC2 + IAM roles**: Attach the IAM instance profile before launch. No access keys are needed in the `docker run` commands.
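As a quick spot check of the items above, verify from each worker and FUSE host that the coordinator's private IP and the ETCD client port are reachable (a sketch assuming `nc` is installed; 2379 is the ETCD port used throughout this guide — see the Prerequisites page for the full port list):

```shell
ping -c 1 <COORDINATOR_PRIVATE_IP>      # basic reachability
nc -zv <COORDINATOR_PRIVATE_IP> 2379    # ETCD client port
```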

## Installation Steps

### 0. Load the Alluxio Image

Run on **every host** (coordinator, each worker, FUSE client):

```shell
docker load -i alluxio-enterprise-AI-3.8-15.1.2-linux-amd64-docker.tar
```

**✅ Verify:**

```shell
docker images | grep alluxio
```

```console
alluxio/alluxio-enterprise   AI-3.8-15.1.2   4091c3d8dbc4   ...   2.58GB
```

Note the image name and tag — you will use them in every subsequent `docker run` command.
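If you are installing on many hosts, a loop like the following can distribute and load the image in one pass (a sketch assuming passwordless SSH and a `hosts.txt` file listing one hostname per line — adapt to your own tooling):

```shell
TAR=alluxio-enterprise-AI-3.8-15.1.2-linux-amd64-docker.tar
while read -r host; do
  scp "$TAR" "$host":/tmp/
  ssh "$host" "docker load -i /tmp/$TAR"
done < hosts.txt
```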

### 1. Start ETCD

SSH into the **coordinator node** (which hosts ETCD):

```shell
docker run -p 2379:2379 -p 2380:2380 -d --name etcd-standalone \
  quay.io/coreos/etcd:v3.5.9 etcd \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://<COORDINATOR_PRIVATE_IP>:2379
```

Use the **private IP** for `--advertise-client-urls` — not the public DNS or hostname — to ensure workers on other hosts can reach ETCD reliably.

> For production, run a 3-node ETCD cluster for high availability. Single-node is suitable for evaluation only.
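For reference, node 1 of such a cluster could be launched as follows (a sketch assuming private IPs `<IP1>`, `<IP2>`, `<IP3>`; repeat on each node with its own name and IPs, and point `alluxio.etcd.endpoints` at all three client URLs — comma-separated, assuming the plural property accepts a list):

```shell
docker run -d -p 2379:2379 -p 2380:2380 --name etcd-1 \
  quay.io/coreos/etcd:v3.5.9 etcd \
  --name etcd-1 \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://<IP1>:2379 \
  --listen-peer-urls http://0.0.0.0:2380 \
  --initial-advertise-peer-urls http://<IP1>:2380 \
  --initial-cluster etcd-1=http://<IP1>:2380,etcd-2=http://<IP2>:2380,etcd-3=http://<IP3>:2380 \
  --initial-cluster-state new
```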

**✅ Verify:**

```shell
docker exec etcd-standalone etcdctl endpoint status --cluster -w table
```

```console
+-----------------------------------+------------------+---------+---------+-----------+
|             ENDPOINT              |        ID        | VERSION | DB SIZE | IS LEADER |
+-----------------------------------+------------------+---------+---------+-----------+
| http://172.31.31.133:2379         | 8e9e05c52164694d |   3.5.9 |   20 kB |      true |
+-----------------------------------+------------------+---------+---------+-----------+
```

### 2. Start Coordinator (Coordinator host)

```shell
docker run -d --net=host --name=alluxio-coordinator \
  -e ALLUXIO_JAVA_OPTS="\
    -Dalluxio.license=<YOUR_LICENSE> \
    -Dalluxio.etcd.endpoints=http://<COORDINATOR_PRIVATE_IP>:2379 \
    -Dalluxio.coordinator.hostname=<COORDINATOR_PRIVATE_IP> \
    -Dalluxio.mount.table.source=ETCD" \
  alluxio/alluxio-enterprise:AI-3.8-15.1.2 coordinator
```

Key properties:

| Property                          | Purpose                                     |
| --------------------------------- | ------------------------------------------- |
| `alluxio.license`                 | License string                              |
| `alluxio.etcd.endpoints`          | ETCD address (private IP + port 2379)       |
| `alluxio.coordinator.hostname`    | Private IP workers use to register          |
| `alluxio.mount.table.source=ETCD` | Persist mount table in ETCD across restarts |

**✅ Verify:**

```shell
docker logs alluxio-coordinator 2>&1 | grep -i "started\|listening\|etcd"
```

If no `ERROR` lines appear in the first 30 seconds, the coordinator started cleanly.
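You can also probe the coordinator's metrics endpoint — port 19999, the same target the Monitoring section scrapes later (a sketch assuming the default `/metrics` path and that the port is open):

```shell
curl -s http://<COORDINATOR_PRIVATE_IP>:19999/metrics | head -5
```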

### 3. Start Workers (each Worker host)

SSH into each **worker host**:

```shell
# Create the cache directory
sudo mkdir -p /data/alluxio-cache

# Start the worker
docker run -d --net=host --name=alluxio-worker \
  -e ALLUXIO_JAVA_OPTS="\
    -Xmx8g -Xms2g -XX:MaxDirectMemorySize=8g \
    -Dalluxio.license=<YOUR_LICENSE> \
    -Dalluxio.etcd.endpoints=http://<COORDINATOR_PRIVATE_IP>:2379 \
    -Dalluxio.coordinator.hostname=<COORDINATOR_PRIVATE_IP> \
    -Dalluxio.mount.table.source=ETCD \
    -Dalluxio.worker.page.store.dirs=/data/alluxio-cache \
    -Dalluxio.worker.page.store.sizes=<CACHE_SIZE>" \
  alluxio/alluxio-enterprise:AI-3.8-15.1.2 worker
```

Set `<CACHE_SIZE>` to \~80% of available space on `/data/alluxio-cache` (e.g., `50GB`, `200GB`, `1TB`). Avoid `/tmp` — it is cleared on reboot.
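A quick way to compute that value on each worker host (a sketch using GNU `df`; assumes `/data/alluxio-cache` already exists):

```shell
# Available GiB on the cache filesystem, scaled to 80%:
avail_gb=$(df -BG --output=avail /data/alluxio-cache | tail -1 | tr -dc '0-9')
echo "Suggested CACHE_SIZE: $((avail_gb * 80 / 100))GB"
```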

> **Cache persistence**: `docker restart` preserves the page store — cached data survives. Only `docker rm` loses the container state (the host directory is unaffected).

> To update worker configuration later without recreating the container, see [Appendix D](#d-updating-configuration).

> **S3 API (optional)**: Add `-Dalluxio.worker.s3.api.enabled=true` to enable the S3-compatible endpoint on each worker (port 29998). Only needed if clients will access Alluxio via the S3 API.

> **JVM sizing** — Alluxio stores cached data on disk, not in heap. The JVM heap does not need to be large relative to cache size:

| Host RAM | `-Xmx` | `-XX:MaxDirectMemorySize` |
| -------- | ------ | ------------------------- |
| 16 GB    | 4g     | 4g                        |
| 32 GB    | 8g     | 8g                        |
| 64 GB    | 16g    | 16g                       |
| 128 GB+  | 32g    | 32g                       |
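The table follows a simple quarter-of-host-RAM rule, capped at 32g. To derive the values on a given host (a sketch reading `/proc/meminfo`, so Linux only; rounds down on hosts with slightly less than nominal RAM):

```shell
awk '/MemTotal/ { h = int($2 / 1048576 / 4); if (h > 32) h = 32;
  printf "-Xmx%dg -XX:MaxDirectMemorySize=%dg\n", h, h }' /proc/meminfo
```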

**✅ Verify (from coordinator host):**

```shell
docker exec alluxio-coordinator alluxio info nodes
```

```console
WorkerId                                         Address                          Status
worker-15ed4a17-2154-4454-ba0b-32b46ff06bfb     ip-172-31-26-247...:29999        ONLINE
worker-704c4d42-d189-41ff-b6f5-775a7d1551b3     ip-172-31-18-67...:29999         ONLINE
```

Workers may take 10–15 seconds to register after starting.
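If you are scripting the deployment, a simple wait loop avoids racing that registration window (a sketch; replace `<N>` with your expected worker count):

```shell
until [ "$(docker exec alluxio-coordinator alluxio info nodes | grep -c ONLINE)" -ge <N> ]; do
  echo "waiting for workers to register..."; sleep 5
done
```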

### 4. Mount Storage

Run `alluxio mount add` from any host that has access to the coordinator. For full UFS configuration options, see [Underlying Storage](https://documentation.alluxio.io/ee-ai-en/ufs).

**S3 with IAM role (recommended on EC2):**

```shell
docker exec alluxio-coordinator alluxio mount add \
  --path /s3 \
  --ufs-uri s3://<S3_BUCKET>/ \
  --option alluxio.underfs.s3.region=<S3_REGION>
```

**S3 with access key/secret:**

```shell
docker exec alluxio-coordinator alluxio mount add \
  --path /s3 \
  --ufs-uri s3://<S3_BUCKET>/ \
  --option alluxio.underfs.s3.region=<S3_REGION> \
  --option s3a.accessKeyId=<ACCESS_KEY> \
  --option s3a.secretKey=<SECRET_KEY>
```

**✅ Verify:**

```shell
docker exec alluxio-coordinator alluxio mount list
```

```console
Listing all mount points
s3://<S3_BUCKET>/  on  /s3/  properties={alluxio.underfs.s3.region=<S3_REGION>}
```

### 5. Verify Data Access

```shell
docker exec alluxio-coordinator alluxio fs ls /s3/
```

**✅ Success:** The command returns a listing of files and directories in your S3 bucket. An empty bucket returns an empty listing without errors. Example:

```console
-rwx------  0  0  0  B  PERSISTED  01-01-2024 00:00:00:000  /s3/dataset/
-rwx------  0  0  1234  B  PERSISTED  01-01-2024 00:00:00:000  /s3/README.md
```

If the command fails, see [Appendix A: Troubleshooting](#a-troubleshooting).

### 6. Start FUSE (FUSE client host)

SSH into the **FUSE client host**. Create the mount directory:

```shell
sudo mkdir -p /mnt/alluxio
sudo chown $(whoami) /mnt/alluxio
chmod 755 /mnt/alluxio
```

The `-o allow_other` flag (passed to FUSE at the end of the `docker run` command) requires `user_allow_other` to be set in `/etc/fuse.conf` on the FUSE host. Without that setting, FUSE fails with `fusermount: option allow_other only allowed if 'user_allow_other' is set in /etc/fuse.conf`. Enable it:

```shell
grep -q user_allow_other /etc/fuse.conf || echo user_allow_other | sudo tee -a /etc/fuse.conf
```

Start the FUSE container:

```shell
docker run -d --privileged --net=host --name=alluxio-fuse \
  -v /mnt/alluxio:/mnt/alluxio:shared \
  -e ALLUXIO_JAVA_OPTS="\
    -Xmx4g -Xms1g -XX:MaxDirectMemorySize=4g \
    -Dalluxio.etcd.endpoints=http://<COORDINATOR_PRIVATE_IP>:2379 \
    -Dalluxio.coordinator.hostname=<COORDINATOR_PRIVATE_IP> \
    -Dalluxio.mount.table.source=ETCD" \
  alluxio/alluxio-enterprise:AI-3.8-15.1.2 fuse -o allow_other /mnt/alluxio/fuse
```

> `--privileged` is required for FUSE to mount inside the container and propagate to the host via `-v /mnt/alluxio:/mnt/alluxio:shared`.

**✅ Verify** (wait \~10 seconds):

```shell
ls /mnt/alluxio/fuse/
```

```console
s3
```

Each Alluxio mount point appears as a subdirectory. `/mnt/alluxio/fuse/s3/` maps directly to `s3://<S3_BUCKET>/`.

**Test read and write:**

```shell
# Read an existing file
cat /mnt/alluxio/fuse/s3/test.txt

# Write a new file
echo "hello alluxio" > /mnt/alluxio/fuse/s3/hello.txt

# Verify it appeared in S3
aws s3 ls s3://<S3_BUCKET>/hello.txt
```

## Uninstall

Stop and remove containers on each host in reverse order:

**FUSE client host:**

```shell
docker stop alluxio-fuse && docker rm alluxio-fuse
sudo umount -l /mnt/alluxio/fuse   # if mount is stuck after container removal
```

**Each worker host:**

```shell
docker stop alluxio-worker && docker rm alluxio-worker
```

**Coordinator host:**

```shell
docker stop alluxio-coordinator && docker rm alluxio-coordinator
docker stop etcd-standalone && docker rm etcd-standalone
```

> **Mount table persistence**: ETCD stores the mount table in its container filesystem. Running `docker rm` on the ETCD container loses all mount points. To persist them across container recreation, add `-v /data/etcd:/etcd-data` to the `docker run` flags and pass `--data-dir /etcd-data` as an argument to `etcd`.
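Concretely, the Step 1 command with persistence added would look like this (a sketch; the host path `/data/etcd` is an example):

```shell
sudo mkdir -p /data/etcd
docker run -p 2379:2379 -p 2380:2380 -d --name etcd-standalone \
  -v /data/etcd:/etcd-data \
  quay.io/coreos/etcd:v3.5.9 etcd \
  --data-dir /etcd-data \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://<COORDINATOR_PRIVATE_IP>:2379
```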

## Monitoring (Optional)

```shell
mkdir -p ~/monitoring/prometheus ~/monitoring/grafana
```

Create `~/monitoring/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 60s

scrape_configs:
  - job_name: "coordinator"
    static_configs:
      - targets: ["<COORDINATOR_PRIVATE_IP>:19999"]
  - job_name: "workers"
    static_configs:
      - targets: ["<WORKER1_PRIVATE_IP>:30000", "<WORKER2_PRIVATE_IP>:30000"]
  - job_name: "fuse"
    static_configs:
      - targets: ["<FUSE_PRIVATE_IP>:49999"]
```

Create `~/monitoring/grafana/datasource.yml`:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
    access: proxy
    editable: true
```

Create `~/monitoring/compose.yaml`:

```yaml
services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - 9090:9090
    restart: unless-stopped
    volumes:
      - ./prometheus:/etc/prometheus
      - prom_data:/prometheus
  grafana:
    image: grafana/grafana
    container_name: grafana
    ports:
      - 3000:3000
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=grafana
    volumes:
      - ./grafana:/etc/grafana/provisioning/datasources
volumes:
  prom_data:
```

```shell
cd ~/monitoring && docker compose up -d
```

Access Grafana at `http://<COORDINATOR_PUBLIC_IP>:3000` (login: `admin` / `grafana`).

Import the Alluxio dashboard JSON from the [monitoring documentation](https://documentation.alluxio.io/ee-ai-en/administration/monitoring-alluxio).

## Appendix

### A. Troubleshooting

**Workers not appearing in `alluxio info nodes`**

1. Verify ETCD is reachable from the worker host:

   ```shell
   curl http://<COORDINATOR_PRIVATE_IP>:2379/health
   ```

   Expected: `{"health":"true"}`
2. Check worker logs:

   ```shell
   docker logs alluxio-worker 2>&1 | grep -i "error\|etcd\|register" | tail -20
   ```
3. Confirm `alluxio.coordinator.hostname` is set to an IP reachable from the worker — if it is unreachable, registration fails silently. A quick check for both is sketched below.
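A quick check for both (a sketch; run on the worker host):

```shell
# Confirm the hostname the worker was actually given:
docker inspect alluxio-worker --format '{{range .Config.Env}}{{println .}}{{end}}' | grep coordinator.hostname
# Confirm that IP is reachable from here:
ping -c 1 <COORDINATOR_PRIVATE_IP>
```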

**FUSE mount not visible after container starts**

1. Check that the mount directory exists:

   ```shell
   ls -la /mnt/alluxio/fuse
   ```

   If the directory is missing, the container still starts successfully, but the mount is never created.
2. Check container logs:

   ```shell
   docker logs alluxio-fuse 2>&1 | tail -20
   ```

   `Mount point '/mnt/alluxio/fuse' does not exist` confirms the directory was missing.
3. Fix the directory, then recreate the container:

   ```shell
   docker rm -f alluxio-fuse
   # re-run the docker run command from Step 6
   ```

**`Transport endpoint is not connected` after FUSE container is removed**

The FUSE filesystem stays registered with the kernel after the container exits. Unmount manually:

```shell
sudo umount -l /mnt/alluxio/fuse
```

**`alluxio mount add` fails with `unknown command`**

Use named flags — the old positional syntax is no longer supported:

```shell
alluxio mount add --path /s3 --ufs-uri s3://<BUCKET>/ --option alluxio.underfs.s3.region=<REGION>
```

### B. Worker Identity

Each worker generates a unique identity file (`/opt/alluxio/conf/worker_identity`) on first start. If this file is lost, the worker registers as a new instance and the old entry stays in ETCD as `OFFLINE`.
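To see which identity a running worker is using (a sketch):

```shell
docker exec alluxio-worker cat /opt/alluxio/conf/worker_identity
```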

**Triggers for identity loss:**

* Host reboot (container filesystem is reset)
* Container recreation: `docker rm` + `docker run` — the most common trigger when updating `ALLUXIO_JAVA_OPTS`

**Cleaning up OFFLINE entries** (after identity loss has already occurred):

```shell
docker exec alluxio-coordinator alluxio process remove-worker -n <WORKER_ID>
```

Worker IDs are shown in `alluxio info nodes`.

**Preventing identity loss** — persist the identity file via a volume mount so it survives container recreation:

```shell
# Copy the identity file out of the running container first:
docker cp alluxio-worker:/opt/alluxio/conf/worker_identity /etc/alluxio/worker_identity

# Then add this flag to the worker docker run command:
-v /etc/alluxio/worker_identity:/opt/alluxio/conf/worker_identity
```

With this mount, `docker restart` and even `docker rm` + `docker run` preserve the worker's identity.

For the full explanation of why identity persistence matters and the impact on the hash ring, see [Restarting a Worker](https://documentation.alluxio.io/ee-ai-en/administration/managing-ring#restarting-a-worker).

### C. Collecting Logs for Support

```shell
# Coordinator logs
docker cp alluxio-coordinator:/opt/alluxio/logs /tmp/coordinator-logs

# Worker logs (run on each worker host)
docker cp alluxio-worker:/opt/alluxio/logs /tmp/worker-logs

# Run collectInfo (from coordinator)
docker exec alluxio-coordinator bash -c "cd /tmp && /opt/alluxio/bin/collectinfo.sh"
docker cp alluxio-coordinator:/tmp/output.tar.gz /tmp/alluxio-collectinfo.tar.gz
```

### D. Updating Configuration

Changing `ALLUXIO_JAVA_OPTS` requires recreating the container, which generates a new worker ID and leaves stale `OFFLINE` entries in ETCD (see [Appendix B](#b-worker-identity)). To avoid this, use `alluxio-site.properties` instead and update it in-place — `docker restart` preserves the worker ID.

**Option A — `docker cp` (no volume mount needed):**

```shell
# Edit your local copy of alluxio-site.properties, then copy it in:
docker cp alluxio-site.properties alluxio-worker:/opt/alluxio/conf/alluxio-site.properties
docker restart alluxio-worker
```

**Option B — volume mount (recommended for repeated updates):**

```shell
# Add this flag to the original docker run command:
-v /etc/alluxio/alluxio-site.properties:/opt/alluxio/conf/alluxio-site.properties
```

Then edit the host file and restart:

```shell
# Edit /etc/alluxio/alluxio-site.properties, then:
docker restart alluxio-worker
```

After restart, verify the worker rejoins with the same identity:

```shell
docker exec alluxio-coordinator alluxio info nodes
# Expected: same WorkerId as before, status ONLINE
```

## Related Documentation

* [Cluster Management](https://documentation.alluxio.io/ee-ai-en/administration/managing-alluxio) — Post-deployment operations: scaling, hash ring tuning, worker lifecycle, and UFS mount management
* [Amazon S3 UFS](https://documentation.alluxio.io/ee-ai-en/ufs/s3) — S3 credentials and configuration options
* [POSIX API (FUSE)](https://documentation.alluxio.io/ee-ai-en/data-access/fuse-based-posix-api) — FUSE mount options and tuning
* [S3 API](https://documentation.alluxio.io/ee-ai-en/data-access/s3-api) — Using the Alluxio S3-compatible endpoint
* [Monitoring](https://documentation.alluxio.io/ee-ai-en/administration/monitoring-alluxio) — Grafana dashboard import and metrics reference
