Docker Installation
Deploy Alluxio on bare-metal Linux hosts, EC2 instances, or Slurm-managed clusters using Docker — no Kubernetes required.
Overview
Architecture
Each Alluxio component runs in its own Docker container using --net=host, sharing the host network stack. Components communicate directly by IP — no port mapping needed.
Coordinator node: ETCD, Alluxio Coordinator, Prometheus, Grafana
Worker node(s): Alluxio Worker (one per host)
FUSE client node: Alluxio FUSE (one per host)
Artifacts
You will receive a download link for two artifacts:
Alluxio image: alluxio-enterprise-AI-3.8-15.1.2-linux-amd64-docker.tar (single image for coordinator, worker, and FUSE roles)
License: required to activate the cluster
Version: The version string in this filename is an example. The download link you receive will contain the exact version your sales representative provisioned.
Platform: Use -linux-amd64-docker.tar for x86 hosts and -linux-arm64-docker.tar for ARM hosts.
The same image is loaded on every node. The role — coordinator, worker, or fuse — is set by the argument passed to docker run.
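As a rough illustration of the platform rule above, the following sketch maps the architecture string reported by `uname -m` to the matching artifact suffix. The helper name `artifact_suffix` is hypothetical and not part of Alluxio; only the two suffixes come from this page.

```shell
# Map a machine architecture (as printed by `uname -m`) to the artifact suffix.
# artifact_suffix is an illustrative helper, not an Alluxio command.
artifact_suffix() {
  case "$1" in
    x86_64|amd64)  echo "linux-amd64-docker.tar" ;;
    aarch64|arm64) echo "linux-arm64-docker.tar" ;;
    *)             echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}
```

On a host, `artifact_suffix "$(uname -m)"` prints the suffix to look for in the download link.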
Before You Start
Run these checks before starting. Skipping them is the most common cause of deployment failures.
EC2 + IAM roles: Attach the IAM instance profile before launch. No access keys are needed in the docker run commands.
FUSE client hosts: libfuse 3.10 or higher is required on any host running the Alluxio FUSE client (not needed on Kubernetes — libfuse is bundled in the container image).
For resource sizing (CPU, RAM, cache disk per component), see Prerequisites → Resource Sizing.
Installation Steps
0. Load the Alluxio Image
Run on every host (coordinator, each worker, FUSE client):
✅ Verify:
Note the image name and tag — you will use them in every subsequent docker run command.
1. Start ETCD
SSH into the ETCD node:
Use the private IP for --advertise-client-urls — not the public DNS or hostname — to ensure workers on other hosts can reach ETCD reliably.
For production, run a 3-node ETCD cluster for high availability. Single-node is suitable for evaluation only.
✅ Verify:
2. Start Coordinator (Coordinator host)
Put the license and cluster settings in alluxio-site.properties on the host, then mount the file into the container.
Key properties (set in alluxio-site.properties):
alluxio.license
License string
alluxio.etcd.endpoints
ETCD address (private IP + port 2379)
alluxio.coordinator.hostname
Private IP workers use to register
alluxio.mount.table.source=ETCD
Persist mount table in ETCD across restarts
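A minimal sketch of generating this file on the host, using only the four properties listed above. The helper name `write_site_properties` and the `http://` scheme on the ETCD endpoint are assumptions; adjust both to match your environment. Writing the license with `printf '%s'` keeps the `+`, `/`, and `=` characters intact.

```shell
# Write a minimal alluxio-site.properties.
# Arguments: <output file> <license string> <ETCD private IP> <coordinator private IP>
# write_site_properties is an illustrative helper, not an Alluxio command.
write_site_properties() {
  mkdir -p "$(dirname "$1")"
  {
    printf 'alluxio.license=%s\n' "$2"
    # Assumption: ETCD is served over plain HTTP on port 2379.
    printf 'alluxio.etcd.endpoints=http://%s:2379\n' "$3"
    printf 'alluxio.coordinator.hostname=%s\n' "$4"
    printf 'alluxio.mount.table.source=ETCD\n'
  } > "$1"
}
```

For example, `write_site_properties /etc/alluxio/alluxio-site.properties "$LICENSE" 10.0.0.5 10.0.0.6` (run as a user that can write to /etc/alluxio) produces the file that is then mounted into the container.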
✅ Verify:
No ERROR lines in the first 30 seconds indicates a healthy start.
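One way to automate this check is to filter the container log for ERROR lines. The sketch below reads log text on stdin; the container name `alluxio-coordinator` in the usage note is an assumption, so substitute the name you passed to docker run.

```shell
# Succeed only when the log text on stdin contains no line with "ERROR".
# no_error_lines is an illustrative helper, not an Alluxio command.
no_error_lines() {
  if grep -q 'ERROR'; then
    return 1
  fi
  return 0
}
```

Usage, about 30 seconds after starting the container: `docker logs alluxio-coordinator 2>&1 | no_error_lines && echo "healthy start"`.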
3. Start Workers (each Worker host)
SSH into each worker host. Create the same alluxio-site.properties pattern as the coordinator, with two additional worker-specific properties for the page store, then mount it into the container:
Set <CACHE_SIZE> to ~80% of available space on the cache path (e.g., 50GB, 200GB, 1TB).
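The 80% guideline can be computed rather than estimated. This is a sketch under the assumption that whole-gigabyte precision is enough; both helper names are hypothetical.

```shell
# Print the space available at a path in whole GB.
# Uses POSIX `df -Pk`: 1 KB blocks, available space in column 4 of the second line.
avail_gb() {
  df -Pk "$1" | awk 'NR==2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Print ~80% of a size given in GB, formatted like the examples above (e.g. 200GB).
cache_size_for_gb() {
  echo "$(( $1 * 80 / 100 ))GB"
}
```

For example, `cache_size_for_gb "$(avail_gb /tmp)"` prints a candidate value for <CACHE_SIZE> on a host whose cache path lives under /tmp.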
/tmp note: /tmp/alluxio-cache is cleared on host reboot, so use it for evaluation only. For production, use a persistent path (e.g., /data/alluxio-cache); see Worker Configuration for setup and sizing guidance.
JVM sizing: The values above (-Xmx8g -XX:MaxDirectMemorySize=8g) are suitable for a 32 GB host. Alluxio stores cached data on disk, not in heap, so scale -Xmx and -XX:MaxDirectMemorySize roughly to 25% of host RAM. See Worker Configuration for details.
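The 25% rule works out as follows. This sketch takes the host RAM in GB and prints the two JVM flags; the helper name `jvm_opts_for_ram_gb` and the 1 GB floor for very small hosts are assumptions, not Alluxio recommendations.

```shell
# Print -Xmx and -XX:MaxDirectMemorySize sized to roughly 25% of host RAM.
# Argument: host RAM in whole GB.
jvm_opts_for_ram_gb() {
  quarter=$(( $1 / 4 ))
  # Assumption: never size below 1 GB on very small hosts.
  if [ "$quarter" -lt 1 ]; then
    quarter=1
  fi
  echo "-Xmx${quarter}g -XX:MaxDirectMemorySize=${quarter}g"
}
```

For a 32 GB host, `jvm_opts_for_ram_gb 32` prints `-Xmx8g -XX:MaxDirectMemorySize=8g`, matching the values used in this step.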
✅ Verify (from coordinator host):
Workers may take 10–15 seconds to register after starting.
4. Mount Storage
Run alluxio mount add from any host that has access to the coordinator. For full UFS configuration options, see Underlying Storage.
S3 with IAM role (recommended on EC2):
If an IAM instance profile is attached to the EC2 host, no access keys are needed — the coordinator picks up credentials automatically.
If this fails with a credential error, see S3 credential errors in Appendix A.
S3 with access key/secret:
✅ Verify:
5. Start FUSE (FUSE client host)
SSH into the FUSE client host. If reinstalling after a previous run, unmount any stale FUSE first:
Create the mount directory:
The -o allow_other flag (passed to FUSE at the end of the docker run command) requires user_allow_other to be set in /etc/fuse.conf on the FUSE host. Without it, FUSE will fail with fusermount: option allow_other only allowed if 'user_allow_other' is set in /etc/fuse.conf. Enable it:
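A minimal, idempotent sketch for this change: it appends an uncommented `user_allow_other` line only when one is not already present. The helper name is hypothetical, and the function must run as a user that can write to the file (root, for /etc/fuse.conf).

```shell
# Ensure an uncommented user_allow_other line exists in a fuse.conf file.
# Argument: path to the file (defaults to /etc/fuse.conf).
enable_user_allow_other() {
  conf="${1:-/etc/fuse.conf}"
  # Already enabled: nothing to do.
  if grep -q '^user_allow_other' "$conf" 2>/dev/null; then
    return 0
  fi
  echo 'user_allow_other' >> "$conf"
}
```

Running it twice leaves a single `user_allow_other` line, so it is safe to include in a provisioning script.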
Start the FUSE container:
--privileged is required for FUSE to mount inside the container and propagate to the host via -v /mnt/alluxio:/mnt/alluxio:shared.
✅ Verify (wait ~10 seconds):
Each Alluxio mount point appears as a subdirectory. /mnt/alluxio/fuse/s3/ maps directly to s3://<S3_BUCKET>/.
The FUSE container above is a minimal smoke test: --privileged plus an evaluation-sized JVM. For production deployment with fine-grained capabilities (no --privileged), high-throughput JVM sizing, tuned mount options, and multi-client setups, see POSIX API → Docker / Bare-Metal.
6. Verify Data Access
✅ Success: The command returns a listing of files and directories in your S3 bucket. An empty bucket returns an empty listing without errors. Example:
If the command fails, see Appendix A: Troubleshooting.
Test read and write via FUSE:
Restart / Recovery
After an EC2 reboot or host restart, Docker containers do not restart automatically unless configured to do so. The ETCD container also loses its data if it was not started with a persistent volume.
Recover After Host Reboot
Run the following on each host in order. If ETCD was started without a persistent volume (-v), the mount table is lost — re-run Step 4 after restarting the coordinator.
ETCD node:
Coordinator node:
Each worker node:
FUSE client node:
✅ Verify all components are back:
Persist the ETCD Mount Table
By default, ETCD stores data in its container filesystem. A docker rm or host reboot without --restart loses all mount points. To persist the mount table across restarts, add a volume mount to the ETCD docker run command:
Uninstall
Stop and remove containers on each host in reverse order:
FUSE client host:
Each worker host:
Coordinator host:
Monitoring (Optional)
First, create the Prometheus scrape config and Grafana datasource files — see Monitoring → Prometheus Setup for the file contents and directory paths.
Then start Prometheus and Grafana on the coordinator node:
Open Grafana at http://localhost:3000 (EC2: use an SSH tunnel or open ports 3000 and 9090 in your Security Group). For dashboard import, alert rules, and Datadog integration, see Monitoring.
Appendix
A. Troubleshooting
License checksum error on coordinator or worker startup
This almost always means the license string was corrupted in transit. Base64-encoded Alluxio license strings contain +, /, and = — characters that survive one shell layer but get mangled through nested layers (typical path: local shell → ssh → docker -e ALLUXIO_JAVA_OPTS="...-Dalluxio.license=${LICENSE}..." → Java). Both Step 2 (coordinator) and Step 3 (workers) avoid this by writing the license into /etc/alluxio/alluxio-site.properties and mounting the file into the container with -v, skipping all shell quoting. If you had set the license via -Dalluxio.license=... in ALLUXIO_JAVA_OPTS, switch to the file-mount pattern shown in the corresponding step.
Workers not appearing in alluxio info nodes
Verify ETCD is reachable from the worker host:
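One possible form of this check queries the ETCD `/health` endpoint on port 2379 and looks for the health field in the JSON response. The helper name `etcd_is_healthy`, the plain-HTTP scheme, and the 5-second timeout are assumptions.

```shell
# Succeed when the ETCD /health endpoint at the given IP reports "health":"true".
# Argument: ETCD private IP.
etcd_is_healthy() {
  # Assumption: ETCD client traffic is plain HTTP on port 2379.
  body=$(curl -s --max-time 5 "http://$1:2379/health") || return 1
  printf '%s\n' "$body" | grep -q '"health":[[:space:]]*"true"'
}
```

Usage from a worker host: `etcd_is_healthy <ETCD_PRIVATE_IP> && echo "ETCD reachable"`. A timeout or connection refusal usually points at a security group or firewall rule rather than at Alluxio.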
Expected:
{"health":"true"}
Check worker logs:
FUSE mount not visible after container starts
Check that the mount directory exists:
If missing, the container starts successfully but the mount is silently skipped.
Check container logs:
A log line reading Mount point '/mnt/alluxio/fuse' does not exist confirms the directory was missing.
Fix the directory, then recreate the container:
Transport endpoint is not connected after FUSE container is removed
The FUSE filesystem stays registered with the kernel after the container exits. Unmount manually:
alluxio mount add fails with unknown command
Use named flags — the old positional syntax is no longer supported:
S3 mount fails with credential error
The coordinator uses --net=host and inherits the host's network stack, so it can reach the EC2 Instance Metadata Service (IMDS) — a built-in endpoint on every EC2 instance that vends short-lived credentials to processes running on that host. First, verify that an IAM role is actually attached:
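A sketch of that check using IMDSv2, which requires a session token before any metadata query. The helper name `imds_role_name` and the timeout values are assumptions; the two endpoint paths are the standard EC2 metadata paths.

```shell
# Print the name of the IAM role attached to this EC2 instance.
# Fails (non-zero exit) when no role is attached or IMDS is unreachable.
imds_role_name() {
  # Step 1: request a short-lived IMDSv2 session token.
  token=$(curl -s -f --max-time 2 -X PUT \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60" \
    "http://169.254.169.254/latest/api/token") || return 1
  # Step 2: list the role name; -f turns an HTTP 404 (no role) into a failure.
  curl -s -f --max-time 2 \
    -H "X-aws-ec2-metadata-token: $token" \
    "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
}
```

Run `imds_role_name` on the coordinator host. A role name on stdout means a role is attached; a non-zero exit with no output suggests that no instance profile is attached or that IMDS access is blocked.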
If no role is attached, remount using explicit access keys (see Step 4). If a role is attached but the error persists, check that the role's policy grants s3:GetObject, s3:ListBucket, and s3:PutObject on the target bucket.
B. Worker Identity
On first start, Alluxio automatically creates /opt/alluxio/conf/worker_identity and writes a UUID into it. If this file is lost, the worker restarts with a new UUID — its ring slots are remapped, previously cached data becomes unreachable, and the old UUID remains as a stale entry in etcd until it is manually removed or automatically purged (dynamic mode only).
Triggers for identity loss:
Host reboot (container filesystem is reset)
Container recreation: docker rm + docker run. This is the most common trigger when updating ALLUXIO_JAVA_OPTS.
Cleaning up OFFLINE entries (after identity loss has already occurred):
Worker IDs are shown in alluxio info nodes.
Preventing identity loss — redirect the identity file to a host-mounted directory by setting alluxio.worker.identity.uuid.file.path. On first start, Alluxio writes the UUID to that path; because the path is inside a bind-mounted directory, the file lands on the host immediately and survives all future docker rm + docker run cycles.
On each worker host, run the following before starting the worker:
Add the following to /etc/alluxio/alluxio-site.properties:
Add -v /etc/alluxio/identity:/etc/alluxio/identity to the worker docker run command (Step 3):
On first start, Alluxio creates /etc/alluxio/identity/worker_identity on the host. On every subsequent start, it reads the UUID back from that file — no separate copy step needed.
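The host-side preparation described above can be sketched as a single idempotent helper. The function name is hypothetical; the property name and the /etc/alluxio/identity/worker_identity path are taken from this section. Ownership and permissions on the directory are not handled here and may need adjusting for the user the container runs as.

```shell
# Create the identity directory and point Alluxio at a UUID file inside it.
# Argument: host config directory (defaults to /etc/alluxio).
configure_worker_identity() {
  conf_dir="${1:-/etc/alluxio}"
  mkdir -p "$conf_dir/identity"
  # The value is the in-container path; it matches the host path because the
  # directory is bind-mounted at the same location.
  prop="alluxio.worker.identity.uuid.file.path=/etc/alluxio/identity/worker_identity"
  # Append the property only once, so repeated runs are harmless.
  if ! grep -Fqx "$prop" "$conf_dir/alluxio-site.properties" 2>/dev/null; then
    echo "$prop" >> "$conf_dir/alluxio-site.properties"
  fi
}
```

After running `configure_worker_identity` as root, start the worker with the additional `-v /etc/alluxio/identity:/etc/alluxio/identity` mount.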
If you cannot modify alluxio-site.properties before first start, an alternative is to start the worker once without any identity mount, copy the generated file out with docker cp alluxio-worker:/opt/alluxio/conf/worker_identity /etc/alluxio/worker_identity && sudo chmod 666 /etc/alluxio/worker_identity, then recreate the container with -v /etc/alluxio/worker_identity:/opt/alluxio/conf/worker_identity. Note that bind-mounting a file path that does not yet exist on the host causes Docker to create a directory there instead, which will make the worker fail with IsADirectoryException.
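To guard against the directory pitfall, a pre-flight check along these lines can run before the docker run command that bind-mounts the file. This is only a sketch; `require_regular_file` is a hypothetical helper.

```shell
# Succeed only when the path is an existing regular file.
# Prints a reason to stderr otherwise, so the caller can stop before docker run.
require_regular_file() {
  if [ -f "$1" ]; then
    return 0
  fi
  if [ -d "$1" ]; then
    echo "$1 is a directory; remove it and copy the identity file first" >&2
  else
    echo "$1 does not exist; copy the identity file out of the container first" >&2
  fi
  return 1
}
```

For example, `require_regular_file /etc/alluxio/worker_identity && docker run ...` refuses to start the container when the bind-mount source is missing or is a directory.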
For the full explanation of why identity persistence matters and the impact on the hash ring, see Restarting a Worker.
C. Collecting Logs for Support
D. Updating Configuration
The main setup flow already mounts /etc/alluxio/alluxio-site.properties into both the coordinator and worker containers (see Start Coordinator and Start Workers). To change any Alluxio property after install, edit the host file and restart the container — docker restart preserves the worker ID, unlike recreating the container with -e ALLUXIO_JAVA_OPTS, which generates a new ID and leaves stale OFFLINE entries in ETCD (see Appendix B).
After restart, verify the worker rejoins with the same identity:
Related Documentation
Cluster Management — Post-deployment operations: scaling, hash ring tuning, worker lifecycle, and UFS mount management
Amazon S3 UFS — S3 credentials and configuration options
POSIX API (FUSE) — FUSE mount options and tuning
S3 API — Using the Alluxio S3-compatible endpoint
Monitoring — Alert rules, Datadog integration, and metrics reference