Docker Installation

Deploy Alluxio on bare-metal Linux hosts, EC2 instances, or Slurm-managed clusters using Docker — no Kubernetes required.

Overview

Architecture

Each Alluxio component runs in its own Docker container using --net=host, sharing the host network stack. Components communicate directly by IP — no port mapping needed.

| Host | Containers |
| --- | --- |
| Coordinator node | ETCD, Alluxio Coordinator, Prometheus, Grafana |
| Worker node(s) | Alluxio Worker (one per host) |
| FUSE client node | Alluxio FUSE (one per host) |

Artifacts

You will receive a download link for two artifacts:

| Artifact | Filename | Purpose |
| --- | --- | --- |
| Alluxio image | alluxio-enterprise-AI-3.8-15.1.2-linux-amd64-docker.tar | Single image for coordinator, worker, and FUSE roles |
| License | | Required to activate the cluster |

Version: The version string in this filename is an example. The download link you receive will contain the exact version your sales representative provisioned.

Platform: Use -linux-amd64-docker.tar for x86 hosts and -linux-arm64-docker.tar for ARM hosts.
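As a small illustration of the platform rule, the following sketch maps the architecture reported by `uname -m` to the matching tarball suffix. The function name `alluxio_tar_suffix` is hypothetical and exists only for this example.

```shell
# Map a machine architecture (as printed by `uname -m`) to the image tarball suffix.
alluxio_tar_suffix() {
  case "$1" in
    x86_64|amd64)  echo "-linux-amd64-docker.tar" ;;
    aarch64|arm64) echo "-linux-arm64-docker.tar" ;;
    *)             echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Show the suffix for the current host (ignore the error on other architectures).
alluxio_tar_suffix "$(uname -m)" || true
```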

The same image is loaded on every node. The role — coordinator, worker, or fuse — is set by the argument passed to docker run.

Before You Start

Run these checks before starting. Skipping them is the most common cause of deployment failures.

EC2 + IAM roles: Attach the IAM instance profile before launch. No access keys are needed in the docker run commands.

FUSE client hosts: libfuse 3.10 or higher is required on any host running the Alluxio FUSE client (not needed on Kubernetes — libfuse is bundled in the container image).
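One way to check the libfuse requirement is sketched below. It assumes the host provides the `fusermount3` tool and that its reported version tracks the installed libfuse package. The helper `version_ge` is a hypothetical name, and it relies on `sort -V` being available.

```shell
# Return success when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Query the installed libfuse 3 userspace tools, if present.
if command -v fusermount3 >/dev/null 2>&1; then
  installed="$(fusermount3 --version | sed 's/.*version: *//')"
  if version_ge "$installed" "3.10"; then
    echo "libfuse $installed is new enough"
  else
    echo "libfuse $installed is older than 3.10" >&2
  fi
else
  echo "fusermount3 not found; install libfuse 3.10 or higher" >&2
fi
```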

For resource sizing (CPU, RAM, cache disk per component), see Prerequisites → Resource Sizing.

Installation Steps

0. Load the Alluxio Image

Run on every host (coordinator, each worker, FUSE client):

✅ Verify:

Note the image name and tag — you will use them in every subsequent docker run command.
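The load and verify steps above might look like the following sketch. The tarball name is the example filename from the Artifacts section, so substitute the file you actually downloaded. The guard skips the Docker calls when Docker or the file is missing.

```shell
# Tarball name taken from the Artifacts section; replace with the file you downloaded.
IMAGE_TAR="alluxio-enterprise-AI-3.8-15.1.2-linux-amd64-docker.tar"

if command -v docker >/dev/null 2>&1 && [ -f "$IMAGE_TAR" ]; then
  docker load -i "$IMAGE_TAR"   # import the image into the local Docker image store
  docker images                 # the loaded repository and tag appear in this listing
else
  echo "docker or $IMAGE_TAR not found on this host" >&2
fi
```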

1. Start ETCD

SSH into the ETCD node:

Use the private IP for --advertise-client-urls — not the public DNS or hostname — to ensure workers on other hosts can reach ETCD reliably.
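A minimal sketch of a single-node ETCD start command is shown below. It only prints the command so you can review it before running it. The image reference `quay.io/coreos/etcd:v3.5.9`, the container name `etcd`, and the IP `10.0.1.10` are assumptions for illustration, not values taken from this guide.

```shell
# Print (not execute) a single-node ETCD docker run command for a given private IP.
# ETCD_IMAGE and the container name "etcd" are illustrative assumptions.
ETCD_IMAGE="${ETCD_IMAGE:-quay.io/coreos/etcd:v3.5.9}"

etcd_run_cmd() {
  private_ip="$1"
  printf 'docker run -d --name etcd --net=host %s /usr/local/bin/etcd' "$ETCD_IMAGE"
  printf ' --listen-client-urls http://0.0.0.0:2379'
  printf ' --advertise-client-urls http://%s:2379\n' "$private_ip"
}

# 10.0.1.10 stands in for the ETCD node's private IP.
etcd_run_cmd 10.0.1.10
```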

For production, run a 3-node ETCD cluster for high availability. Single-node is suitable for evaluation only.

✅ Verify:
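A health probe along these lines could serve as the verification. It assumes ETCD serves its standard /health endpoint on port 2379 and that `curl` is installed. The function names are hypothetical.

```shell
# Succeed only when an ETCD /health response body reports a healthy member.
etcd_health_ok() {
  printf '%s' "$1" | grep -q '"health":"true"'
}

# Probe the /health endpoint of the ETCD node at the given IP address.
check_etcd() {
  body="$(curl -s --max-time 5 "http://$1:2379/health")" || return 1
  etcd_health_ok "$body"
}
```

For example, `check_etcd <ETCD_PRIVATE_IP> && echo healthy` prints `healthy` only when the endpoint answers with a healthy status.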

2. Start Coordinator (Coordinator host)

Put the license and cluster settings in alluxio-site.properties on the host, then mount the file into the container.

Key properties (set in alluxio-site.properties):

| Property | Purpose |
| --- | --- |
| alluxio.license | License string |
| alluxio.etcd.endpoints | ETCD address (private IP + port 2379) |
| alluxio.coordinator.hostname | Private IP workers use to register |
| alluxio.mount.table.source=ETCD | Persist mount table in ETCD across restarts |
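The properties above could be written to a file with a helper like the one sketched here. The `http://` scheme on the ETCD endpoint, the placeholder IPs, and the function name `write_site_properties` are assumptions. The demo writes to a scratch directory rather than /etc/alluxio.

```shell
# Write a coordinator alluxio-site.properties file.
# Arguments: output path, license string, ETCD private IP, coordinator private IP.
write_site_properties() {
  out="$1"; license="$2"; etcd_ip="$3"; coordinator_ip="$4"
  mkdir -p "$(dirname "$out")"
  # printf with %s copies the license verbatim, so +, / and = are written unchanged.
  {
    printf 'alluxio.license=%s\n' "$license"
    printf 'alluxio.etcd.endpoints=http://%s:2379\n' "$etcd_ip"
    printf 'alluxio.coordinator.hostname=%s\n' "$coordinator_ip"
    printf 'alluxio.mount.table.source=ETCD\n'
  } > "$out"
}

# Example with placeholder values, written to a scratch directory.
demo_dir="$(mktemp -d)"
write_site_properties "$demo_dir/alluxio-site.properties" "<LICENSE_STRING>" 10.0.1.10 10.0.1.10
cat "$demo_dir/alluxio-site.properties"
```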

✅ Verify:

No ERROR lines in the first 30 seconds indicates a healthy start.
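To make the "no ERROR lines" check concrete, a small filter such as the following counts matching log lines. The function name is hypothetical, and the container name in the usage comment is an assumption, so use whatever name you passed to docker run.

```shell
# Count log lines on stdin that contain the word ERROR (prints 0 when there are none).
count_error_lines() {
  grep -c 'ERROR' || true
}

# Usage (container name is an assumption):
#   docker logs alluxio-coordinator 2>&1 | count_error_lines
```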

3. Start Workers (each Worker host)

SSH into each worker host. Create the same alluxio-site.properties pattern as the coordinator, with two additional worker-specific properties for the page store, then mount it into the container:

Set <CACHE_SIZE> to ~80% of available space on the cache path (e.g., 50GB, 200GB, 1TB).
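The 80% guideline can be computed rather than estimated. The sketch below offers a pure arithmetic helper and a variant that reads available space from `df`. Both function names are hypothetical, and the results are rounded down to whole gigabytes.

```shell
# Pure arithmetic helper: 80% of a size given in GB.
eighty_percent_gb() {
  echo "$(( $1 * 80 / 100 ))GB"
}

# Print ~80% of the space available on the filesystem holding the given path, in whole GB.
cache_size_for_path() {
  avail_kb="$(df -Pk "$1" | awk 'NR==2 {print $4}')"
  echo "$(( avail_kb * 80 / 100 / 1024 / 1024 ))GB"
}

eighty_percent_gb 250   # prints 200GB
```

On a worker host, `cache_size_for_path /tmp` would give a candidate value for <CACHE_SIZE> on the evaluation cache path.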

/tmp note: /tmp/alluxio-cache is cleared on host reboot, making it suitable for evaluation. For production, use a persistent path (e.g., /data/alluxio-cache) — see Worker Configuration for setup and sizing guidance.

JVM sizing: The values above (-Xmx8g -XX:MaxDirectMemorySize=8g) are suitable for a 32 GB host. Alluxio stores cached data on disk, not in heap — scale -Xmx and -XX:MaxDirectMemorySize roughly to 25% of host RAM. See Worker Configuration for details.
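The 25% rule of thumb can be expressed as a one-line calculation. This sketch uses a hypothetical helper name and integer division, so treat the result as a starting point rather than a tuned value.

```shell
# Print JVM flags sized to roughly 25% of the given host RAM (in GB).
jvm_flags_for_ram_gb() {
  quarter=$(( $1 / 4 ))
  echo "-Xmx${quarter}g -XX:MaxDirectMemorySize=${quarter}g"
}

jvm_flags_for_ram_gb 32   # prints -Xmx8g -XX:MaxDirectMemorySize=8g
```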

✅ Verify (from coordinator host):

Workers may take 10–15 seconds to register after starting.

4. Mount Storage

Run alluxio mount add from any host that has access to the coordinator. For full UFS configuration options, see Underlying Storage.

S3 with IAM role (recommended on EC2):

If an IAM instance profile is attached to the EC2 host, no access keys are needed — the coordinator picks up credentials automatically.

If this fails with a credential error, see S3 credential errors in Appendix A.

S3 with access key/secret:

✅ Verify:

5. Start FUSE (FUSE client host)

SSH into the FUSE client host. If reinstalling after a previous run, unmount any stale FUSE first:

Create the mount directory:

The -o allow_other flag (passed to FUSE at the end of the docker run command) requires user_allow_other to be set in /etc/fuse.conf on the FUSE host. Without it, FUSE will fail with fusermount: option allow_other only allowed if 'user_allow_other' is set in /etc/fuse.conf. Enable it:
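One idempotent way to enable the option is sketched below. The function name is hypothetical. It appends the setting only when no uncommented entry exists, so running it twice does not duplicate the line.

```shell
# Append user_allow_other to a fuse.conf file unless an uncommented entry already exists.
enable_user_allow_other() {
  conf="$1"
  if ! grep -qs '^[[:space:]]*user_allow_other' "$conf"; then
    echo "user_allow_other" >> "$conf"
  fi
}
```

Run it as root against /etc/fuse.conf, for example `enable_user_allow_other /etc/fuse.conf` inside a root shell.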

Start the FUSE container:

--privileged is required for FUSE to mount inside the container and propagate to the host via -v /mnt/alluxio:/mnt/alluxio:shared.

✅ Verify (wait ~10 seconds):

Each Alluxio mount point appears as a subdirectory. /mnt/alluxio/fuse/s3/ maps directly to s3://<S3_BUCKET>/.

The FUSE container above is a minimal smoke test — --privileged + evaluation-sized JVM. For production deployment with fine-grained capabilities (no --privileged), high-throughput JVM sizing, tuned mount options, and multi-client setups, see POSIX API → Docker / Bare-Metal.

6. Verify Data Access

✅ Success: The command returns a listing of files and directories in your S3 bucket. An empty bucket returns an empty listing without errors. Example:

If the command fails, see Appendix A: Troubleshooting.

Test read and write via FUSE:

Restart / Recovery

After an EC2 reboot or host restart, Docker containers do not restart automatically unless configured to do so. The ETCD container also loses its data if it was not started with a persistent volume.

Recover After Host Reboot

Run the following on each host in order. If ETCD was started without a persistent volume (-v), the mount table is lost — re-run Step 4 after restarting the coordinator.

ETCD node:

Coordinator node:

Each worker node:

FUSE client node:

✅ Verify all components are back:

Persist the ETCD Mount Table

By default, ETCD stores data in its container filesystem. A docker rm or host reboot without --restart loses all mount points. To persist the mount table across restarts, add a volume mount to the ETCD docker run command:
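As a sketch of what the volume mount might look like, the helper below prints an ETCD start command that keeps its data directory on the host. The image reference, container name, host directory `/var/lib/etcd-data`, and container path `/etcd-data` are all assumptions for illustration. The command is printed, not executed.

```shell
# Print (not execute) an ETCD docker run command whose data directory lives on the host.
# ETCD_IMAGE, the container name, and both paths are illustrative assumptions.
ETCD_IMAGE="${ETCD_IMAGE:-quay.io/coreos/etcd:v3.5.9}"

etcd_persistent_run_cmd() {
  private_ip="$1"; host_dir="$2"
  printf 'docker run -d --name etcd --net=host -v %s:/etcd-data' "$host_dir"
  printf ' %s /usr/local/bin/etcd --data-dir /etcd-data' "$ETCD_IMAGE"
  printf ' --listen-client-urls http://0.0.0.0:2379'
  printf ' --advertise-client-urls http://%s:2379\n' "$private_ip"
}

etcd_persistent_run_cmd 10.0.1.10 /var/lib/etcd-data
```

Because the data directory is bind-mounted from the host, removing and recreating the container leaves the stored keys, including the mount table, in place.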

Uninstall

Stop and remove containers on each host in reverse order:

FUSE client host:

Each worker host:

Coordinator host:

Monitoring (Optional)

First, create the Prometheus scrape config and Grafana datasource files — see Monitoring → Prometheus Setup for the file contents and directory paths.

Then start Prometheus and Grafana on the coordinator node:

Open Grafana at http://localhost:3000 (EC2: use an SSH tunnel or open ports 3000 and 9090 in your Security Group). For dashboard import, alert rules, and Datadog integration, see Monitoring.
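For the SSH tunnel option, a command of the following shape forwards both ports to your workstation. The helper name and the `ec2-user` login are assumptions. The function only prints the command.

```shell
# Print an ssh command that forwards Grafana (3000) and Prometheus (9090) to localhost.
tunnel_cmd() {
  echo "ssh -N -L 3000:localhost:3000 -L 9090:localhost:9090 $1"
}

# ec2-user and the placeholder address are assumptions; use your own login and host.
tunnel_cmd "ec2-user@<COORDINATOR_PUBLIC_IP>"
```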

Appendix

A. Troubleshooting

License checksum error on coordinator or worker startup

This almost always means the license string was corrupted in transit. Base64-encoded Alluxio license strings contain +, /, and =, characters that survive one shell layer but get mangled through nested layers (typical path: local shell → ssh → docker -e ALLUXIO_JAVA_OPTS="...-Dalluxio.license=${LICENSE}..." → Java). Both Step 2 (coordinator) and Step 3 (workers) avoid this by writing the license into /etc/alluxio/alluxio-site.properties and mounting the file into the container with -v, which bypasses shell quoting entirely. If you set the license via -Dalluxio.license=... in ALLUXIO_JAVA_OPTS, switch to the file-mount pattern shown in the corresponding step.

Workers not appearing in alluxio info nodes

  1. Verify ETCD is reachable from the worker host:

    Expected: {"health":"true"}

  2. Check worker logs:

FUSE mount not visible after container starts

  1. Check that the mount directory exists:

    If missing, the container starts successfully but the mount is silently skipped.

  2. Check container logs:

    Mount point '/mnt/alluxio/fuse' does not exist confirms the directory was missing.

  3. Fix the directory, then recreate the container:

Transport endpoint is not connected after FUSE container is removed

The FUSE filesystem stays registered with the kernel after the container exits. Unmount manually:
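A guarded unmount could look like the sketch below. It assumes a Linux host with /proc/mounts and the `fusermount3` tool, and falls back to a lazy `umount -l`. The function names are hypothetical.

```shell
# Report whether a path is currently listed as a mount point in /proc/mounts.
is_mounted() {
  grep -qs " $1 " /proc/mounts
}

# Unmount a stale FUSE mount if one is registered; do nothing otherwise.
unmount_stale_fuse() {
  if is_mounted "$1"; then
    fusermount3 -u "$1" 2>/dev/null || sudo umount -l "$1"
  else
    echo "$1 is not mounted"
  fi
}
```

For the mount point used in this guide, the call would be `unmount_stale_fuse /mnt/alluxio/fuse`.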

alluxio mount add fails with unknown command

Use named flags — the old positional syntax is no longer supported:

S3 mount fails with credential error

The coordinator uses --net=host and inherits the host's network stack, so it can reach the EC2 Instance Metadata Service (IMDS) — a built-in endpoint on every EC2 instance that vends short-lived credentials to processes running on that host. First, verify that an IAM role is actually attached:
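A role check against IMDS might be sketched as follows, using the IMDSv2 token flow. The function name is hypothetical and `curl` is assumed to be installed. Run it on the EC2 host itself.

```shell
IMDS="http://169.254.169.254/latest"

# Print the IAM role name attached to this EC2 instance, or fail if none is found.
imds_role_name() {
  # Request a short-lived IMDSv2 session token.
  token="$(curl -s --max-time 2 -X PUT "$IMDS/api/token" \
    -H 'X-aws-ec2-metadata-token-ttl-seconds: 60')" || return 1
  # List the role name; -f turns the 404 returned when no role is attached into a failure.
  role="$(curl -s -f --max-time 2 -H "X-aws-ec2-metadata-token: $token" \
    "$IMDS/meta-data/iam/security-credentials/")" || return 1
  [ -n "$role" ] && echo "$role"
}
```

Calling `imds_role_name || echo "no IAM role attached"` prints the role name when one is attached and the fallback message otherwise.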

If no role is attached, remount using explicit access keys (see Step 4). If a role is attached but the error persists, check that the role's policy grants s3:GetObject, s3:ListBucket, and s3:PutObject on the target bucket.

B. Worker Identity

On first start, Alluxio automatically creates /opt/alluxio/conf/worker_identity and writes a UUID into it. If this file is lost, the worker restarts with a new UUID: its ring slots are remapped, previously cached data becomes unreachable, and the old UUID remains as a stale entry in ETCD until it is manually removed or automatically purged (dynamic mode only).

Triggers for identity loss:

  • Host reboot (container filesystem is reset)

  • Container recreation: docker rm + docker run — the most common trigger when updating ALLUXIO_JAVA_OPTS

Cleaning up OFFLINE entries (after identity loss has already occurred):

Worker IDs are shown in alluxio info nodes.

Preventing identity loss — redirect the identity file to a host-mounted directory by setting alluxio.worker.identity.uuid.file.path. On first start, Alluxio writes the UUID to that path; because the path is inside a bind-mounted directory, the file lands on the host immediately and survives all future docker rm + docker run cycles.

On each worker host, run the following before starting the worker:

Add the following to /etc/alluxio/alluxio-site.properties:

Add -v /etc/alluxio/identity:/etc/alluxio/identity to the worker docker run command (Step 3):

On first start, Alluxio creates /etc/alluxio/identity/worker_identity on the host. On every subsequent start, it reads the UUID back from that file — no separate copy step needed.
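The host-side preparation described above can be sketched as a single helper. The function name is hypothetical. The property name and paths come from this section, and the demo runs against a scratch directory rather than /etc/alluxio.

```shell
# Create the identity directory and point Alluxio at it.
# Arguments: identity directory, alluxio-site.properties path.
configure_worker_identity() {
  identity_dir="$1"; props="$2"
  mkdir -p "$identity_dir"
  key="alluxio.worker.identity.uuid.file.path"
  # Append the property only once so repeated runs stay idempotent.
  if ! grep -qs "^$key=" "$props"; then
    echo "$key=$identity_dir/worker_identity" >> "$props"
  fi
}

# Demo against a scratch directory; on a real worker host, run as root with
# /etc/alluxio/identity and /etc/alluxio/alluxio-site.properties.
scratch="$(mktemp -d)"
configure_worker_identity "$scratch/identity" "$scratch/alluxio-site.properties"
cat "$scratch/alluxio-site.properties"
```

The property value works unchanged inside the container because the bind mount uses the same path on both sides (-v /etc/alluxio/identity:/etc/alluxio/identity).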

If you cannot modify alluxio-site.properties before first start, an alternative is to start the worker once without any identity mount, copy the generated file out with docker cp alluxio-worker:/opt/alluxio/conf/worker_identity /etc/alluxio/worker_identity && sudo chmod 666 /etc/alluxio/worker_identity, then recreate the container with -v /etc/alluxio/worker_identity:/opt/alluxio/conf/worker_identity. Note that bind-mounting a file path that does not yet exist on the host causes Docker to create a directory there instead, which will make the worker fail with IsADirectoryException.

For the full explanation of why identity persistence matters and the impact on the hash ring, see Restarting a Worker.

C. Collecting Logs for Support

D. Updating Configuration

The main setup flow already mounts /etc/alluxio/alluxio-site.properties into both the coordinator and worker containers (see Start Coordinator and Start Workers). To change any Alluxio property after install, edit the host file and restart the container — docker restart preserves the worker ID, unlike recreating the container with -e ALLUXIO_JAVA_OPTS, which generates a new ID and leaves stale OFFLINE entries in ETCD (see Appendix B).

After restart, verify the worker rejoins with the same identity:

  • Cluster Management — Post-deployment operations: scaling, hash ring tuning, worker lifecycle, and UFS mount management

  • Amazon S3 UFS — S3 credentials and configuration options

  • POSIX API (FUSE) — FUSE mount options and tuning

  • S3 API — Using the Alluxio S3-compatible endpoint

  • Monitoring — Alert rules, Datadog integration, and metrics reference
