Release Notes

Alluxio Enterprise AI 3.8

New Features

High-Performance Write Caching for S3 API

Alluxio now buffers S3 Put operations in a write cache before flushing to UFS, reducing write latency from ~50 ms to under 10 ms and delivering up to 6 GB/s per worker with linear scalability. The write cache supports full Multipart Upload (MPU) operations and drains safely before worker decommission. With persist-on-complete, cache storage is freed automatically once data reaches UFS. Write cache path policies are managed through the dynamic configuration system and take effect without restarts.

See S3 Write Cache for configuration details.

Optimized .safetensors Model Loading

Alluxio now merges thousands of small random reads into large sequential reads when loading .safetensors models, reaching load speeds within ~10% of local NVMe disk and up to 18× faster than Amazon FSx for Lustre. Symlinks to .safetensors files receive the same optimization as direct file access. This optimization is supported through the FUSE interface only.

See Optimizing AI Model Loading for configuration details.
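
The read-merging idea described above can be illustrated with a minimal sketch. This is not Alluxio's implementation: the function name `coalesce_reads` and the `max_gap` parameter are hypothetical, and the sketch only shows how many small (offset, length) requests can collapse into a few larger sequential ranges.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes


def coalesce_reads(reads: List[Range], max_gap: int = 0) -> List[Range]:
    """Merge byte ranges that overlap, touch, or lie within max_gap bytes.

    Hypothetical helper for illustration only; not an Alluxio API.
    """
    merged: List[Range] = []
    for offset, length in sorted(reads):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            if offset <= prev_end + max_gap:
                # Extend the previous range instead of issuing a new read.
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged


# Four small tensor reads issued out of order, some of them adjacent.
small_reads = [(4096, 1024), (0, 512), (512, 512), (5120, 256)]
print(coalesce_reads(small_reads))                # [(0, 1024), (4096, 1280)]
print(coalesce_reads(small_reads, max_gap=4096))  # [(0, 5376)]
```

A larger `max_gap` trades some wasted bytes for fewer, larger requests, which is the general trade-off behind this kind of optimization.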

Job Service High Availability

Multiple coordinators can now share a single etcd-backed job queue, eliminating the Job Service as a single point of failure. Job state survives coordinator restarts: jobs submitted before a restart are automatically restored to the waiting queue, so manual resubmission is no longer needed. Any coordinator can pick up work from a failed peer within seconds.

See Job Service for configuration details.


Enhancements

FUSE Reliability and Performance

Hard timeout for hung requests: alluxio.fuse.request.hard.timeout (default: disabled) forcibly terminates a stalled FUSE request after the configured duration and returns a clean error to the caller. Previously, a hung read from a slow or unreachable worker would block the FUSE mount indefinitely, so training jobs hung until NCCL or framework-level timeouts fired instead of receiving a clean failure.

Local worker preference for colocated deployments: alluxio.user.replica.prioritize.local.worker (default: false) reorders the candidate worker list to put the colocated worker first with zero additional RPCs. When a FUSE client and an Alluxio worker run on the same Kubernetes node, all reads that can be served locally are sent to that node-local worker, eliminating cross-node traffic for cache hits. This property is not available in AI-3.7.

Parallel directory operations: alluxio.fuse.parallel.dirops.enabled allows directory listing and metadata operations to execute in parallel through FUSE, improving throughput for workloads that scan large model repositories.

Non-disruptive FUSE hot migration: improved logging, cleaner shutdown sequencing, and a fix that prevents write streams from failing mid-migration during live FUSE mount migration to a new cluster.
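
The local worker preference described in this section amounts to a pure reordering of an existing candidate list, which is why it needs no extra RPCs. A minimal sketch of that idea follows. The function name `prioritize_local_worker` and the worker names are hypothetical; Alluxio's actual selection logic is not shown in these notes.

```python
from typing import List


def prioritize_local_worker(candidates: List[str], local_host: str) -> List[str]:
    """Return candidates with the colocated worker first.

    Hypothetical helper for illustration only. The remaining workers keep
    their original relative order, and no network call is made.
    """
    local = [w for w in candidates if w == local_host]
    remote = [w for w in candidates if w != local_host]
    return local + remote


workers = ["worker-a", "worker-b", "worker-c"]
print(prioritize_local_worker(workers, "worker-b"))  # ['worker-b', 'worker-a', 'worker-c']
print(prioritize_local_worker(workers, "worker-z"))  # ['worker-a', 'worker-b', 'worker-c']
```

If no colocated worker appears in the list, the order is left unchanged, so the client falls back to its normal candidate order.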

Security

Alluxio STS — AssumeRoleWithWebIdentity — Implements the full AWS STS AssumeRoleWithWebIdentity flow. An OIDC identity token (from Kubernetes service accounts, Okta, Keycloak, or any OIDC IdP) can be exchanged with the Alluxio STS endpoint for a short-lived session credential. S3 clients (Spark, Presto, or any SigV4-compatible client) can then authenticate to the Worker S3 API with per-user identity and no custom client JAR required.

TokenAuthenticator — Direct OIDC for Worker S3 API — An alternative to the STS flow: clients set an OIDC JWT directly as the S3 session token (X-Amz-Security-Token). Alluxio validates the JWT directly, bypassing SigV4 signature verification. No STS infrastructure required.

Independent HTTPS port for S3 Gateway — The S3 Gateway can listen on a dedicated HTTPS port separate from the plaintext HTTP port. Both ports can be active simultaneously, enabling gradual HTTP-to-HTTPS migration. TLS certificates can be rotated at runtime without restarting the gateway.

PEM certificate support — Multi-level CA PEM certificates are now supported for TLS on all Alluxio server and client connections, including etcd TLS. Previously only JKS/PKCS12 keystores were fully supported.

Authorization caching — Authorization decisions are cached with configurable TTL, reducing latency on hot paths where the same user repeatedly accesses the same paths.
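
The two credential flows in this section (STS exchange and direct OIDC token) can be sketched from the client side. The sketch below only builds request data and sends nothing. The form fields follow the public AWS STS AssumeRoleWithWebIdentity API; whether the Alluxio STS endpoint requires every field (for example RoleArn) is an assumption, and the role ARN and token values are placeholders.

```python
from typing import Dict
from urllib.parse import parse_qs, urlencode


def build_sts_form(oidc_token: str, role_arn: str, session_name: str,
                   duration_seconds: int = 3600) -> str:
    """Form-encoded body for an AssumeRoleWithWebIdentity call.

    Field names are those of the AWS STS API. Which of them the Alluxio
    STS endpoint actually requires is an assumption.
    """
    return urlencode({
        "Action": "AssumeRoleWithWebIdentity",
        "Version": "2011-06-15",
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "WebIdentityToken": oidc_token,
        "DurationSeconds": str(duration_seconds),
    })


def build_token_headers(oidc_token: str) -> Dict[str, str]:
    """Headers for the direct flow: the JWT is sent as the S3 session token."""
    return {"X-Amz-Security-Token": oidc_token}


# Placeholder values; a real JWT would come from the OIDC identity provider.
token = "header.payload.signature"
form = build_sts_form(token, "arn:aws:iam::123456789012:role/example-role", "demo-session")
print(parse_qs(form)["Action"])
print(build_token_headers(token))
```

In the STS flow the form body would be posted to the Alluxio STS endpoint and the returned session credential used for SigV4 signing. In the direct flow the header is attached to each S3 request.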

Dynamic Configuration

AI-3.7 required cluster restarts to change most configuration properties. AI-3.8 introduces a dynamic configuration system backed by etcd. The following can be changed at runtime and take effect across all workers and clients without restarts:

| What can change dynamically | Notes |
| --- | --- |
| Cache quota and TTL | Per-path quotas and expiry |
| Priority eviction policy | Which data gets evicted first |
| Path mapping / FUSE path config | Virtual path remapping |
| Replica rules | Which files get extra replicas |
| Job metadata store | All load/free/pin job state |
| Access and audit log config | Log verbosity without restart |
| Write cache filter policies | Which S3 paths use write cache |

CLI: alluxio config dynamic set/get/edit

REST API: GET/PUT /api/v1/config/dynamic

Requires etcd. Not available in single-node or non-HA deployments without etcd.
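
As a rough illustration of the REST interface mentioned above, the sketch below builds (but does not send) GET and PUT requests for the /api/v1/config/dynamic path. Only the path and HTTP methods come from these notes. The host, port, JSON content type, and payload shape are assumptions and should be checked against the dynamic configuration documentation.

```python
import json
from urllib.request import Request

# Hypothetical host and port; replace with the real endpoint of the deployment.
BASE_URL = "http://alluxio.example.com:8080"
DYNAMIC_CONFIG_PATH = "/api/v1/config/dynamic"  # path taken from the release notes


def build_get_request(base_url: str = BASE_URL) -> Request:
    """Request object that would read the current dynamic configuration."""
    return Request(base_url + DYNAMIC_CONFIG_PATH, method="GET")


def build_put_request(payload: dict, base_url: str = BASE_URL) -> Request:
    """Request object that would update the dynamic configuration.

    The JSON body and its schema are assumptions, not a documented format.
    """
    body = json.dumps(payload).encode("utf-8")
    return Request(
        base_url + DYNAMIC_CONFIG_PATH,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


put_req = build_put_request({"placeholder-key": "placeholder-value"})
print(put_req.get_method(), put_req.full_url)
```

Passing either object to `urllib.request.urlopen` would issue the call against a reachable endpoint.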


Bug Fixes

| Bug | Impact in AI-3.7 |
| --- | --- |
| Coordinator blocked under high concurrency | Coordinator could become unresponsive with many concurrent client requests |
| alluxio fs free hang | free could hang indefinitely on large directories |
| S3 data corruption on concurrent read+write | Reading and writing the same S3 object simultaneously could produce corrupted data |
| Worker client leak | Under certain error conditions, worker clients were not returned to the connection pool |
| NPE in UfsConnectivityMonitor | Worker could crash on UFS connectivity check if UFS was transiently unavailable |
| S3 list with non-existent prefix in write cache | Returned 500 instead of empty list |
| S3 rename fails with file not found | Rename on recently-written files could fail intermittently |
| OOM on listing large directories in write cache | List operations could exhaust heap on directories with >100k objects |
| Async prefetch cache race condition | Race in buffer handling could cause incorrect data to be returned |
| Parent+child UFS mount conflict | Could mount both a parent path and a child path simultaneously, causing undefined behavior |


Breaking Changes

Dynamic configuration requires etcd — The new dynamic config features require etcd. Customers running single-node or non-HA deployments without etcd cannot use dynamic config.


Configuration Properties Added

| Property | Default | Description |
| --- | --- | --- |
| alluxio.fuse.request.hard.timeout | -1 (disabled) | Kill hung FUSE requests after this duration |
| alluxio.user.replica.prioritize.local.worker | false | Route reads to the colocated worker first |
| alluxio.worker.s3.authenticator.classname | PassAllAuthenticator | Set TokenAuthenticator or AlluxioIamAuthenticator for auth |
| alluxio.worker.s3.authorization.enabled | false | Enable per-user S3 authorization |
| alluxio.worker.s3.authorization.cache.enabled | true | Cache authorization decisions |
| alluxio.worker.s3.http.port | | Independent HTTP port for S3 (separate from HTTPS) |
| alluxio.fuse.parallel.dirops.enabled | | Enable parallel FUSE directory operations |
