Release Notes
Alluxio Enterprise AI 3.8
New Features
High-Performance Write Caching for S3 API
Experimental since AI 3.8
Alluxio now buffers S3 Put operations in a write cache before flushing them to the under file system (UFS), reducing write latency from ~50 ms to under 10 ms and delivering up to 6 GB/s per worker with linear scalability. The write cache supports full Multipart Upload (MPU) operations and drains safely before a worker is decommissioned. With persist-on-complete, cache storage is freed automatically once data reaches UFS. Write cache path policies are managed through the dynamic configuration system and take effect without restarts.
See S3 Write Cache for configuration details.
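The write path described above can be pictured with a small toy model. This is only a conceptual sketch of write-back caching with persist-on-complete, not Alluxio's implementation: the class name, method names, and the dictionary standing in for UFS are all hypothetical.

```python
from collections import deque


class WriteCacheSketch:
    """Toy write-back cache; every name here is hypothetical, not an Alluxio API."""

    def __init__(self):
        self.cache = {}         # key -> bytes currently buffered in the cache
        self.pending = deque()  # keys waiting to be flushed, oldest first
        self.ufs = {}           # stand-in for the under file system

    def put(self, key, data):
        """Buffer the object and return; the UFS write happens later."""
        if key not in self.cache:
            self.pending.append(key)
        self.cache[key] = data

    def get(self, key):
        """Read from the cache while the object is buffered, otherwise from UFS."""
        if key in self.cache:
            return self.cache[key]
        return self.ufs[key]

    def flush_one(self):
        """Persist the oldest buffered object, then free its cache space."""
        if not self.pending:
            return False
        key = self.pending.popleft()
        self.ufs[key] = self.cache.pop(key)  # persist-on-complete: freed once in UFS
        return True

    def drain(self):
        """Flush everything, as would be needed before decommissioning a worker."""
        while self.flush_one():
            pass
```

In this model a `put` returns as soon as the data is buffered, which is where the latency reduction comes from, and `drain` is the step that makes decommissioning safe.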
Optimized .safetensors Model Loading
Alluxio now merges thousands of small random reads into large sequential reads when loading .safetensors models, reaching within ~10% of local NVMe disk speed and loading models up to 18× faster than AWS FSx Lustre. Symlinks to .safetensors files receive the same optimization as direct file access. This optimization is supported through the FUSE interface only.
See Optimizing AI Model Loading for configuration details.
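To illustrate the general idea of merging small reads, here is a minimal sketch of range coalescing. The function name and the `max_gap` parameter are hypothetical, and the merging policy Alluxio actually applies is not described in these notes.

```python
def coalesce_reads(ranges, max_gap=0):
    """Merge (offset, length) reads into fewer, larger (offset, length) reads.

    Two reads are merged when they overlap or when the gap between them
    is at most max_gap bytes.
    """
    merged = []  # list of [start, end) pairs, kept sorted by start
    for offset, length in sorted(ranges):
        end = offset + length
        if merged and offset - merged[-1][1] <= max_gap:
            # Overlapping or close enough: extend the previous read.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


# Four small reads collapse into two larger ones.
print(coalesce_reads([(0, 100), (100, 50), (4096, 10), (160, 40)], max_gap=16))
# [(0, 200), (4096, 10)]
```

A larger `max_gap` trades a few wasted bytes for fewer, more sequential requests.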
Job Service High Availability
Multiple coordinators can now share a single etcd-backed job queue, eliminating the Job Service as a single point of failure. Job state survives coordinator restarts — pre-restart jobs are automatically restored to the waiting queue, so manual resubmission is no longer needed. Any coordinator can pick up work from a failed peer within seconds.
See Job Service for configuration details.
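The recovery behavior can be sketched with an in-memory stand-in for the shared queue. This is a toy model under stated assumptions: the class, its methods, and the state names are hypothetical, and a real deployment would rely on etcd rather than a Python dictionary for atomic updates.

```python
class SharedJobQueueSketch:
    """In-memory stand-in for an etcd-backed job queue; not the Alluxio implementation."""

    def __init__(self):
        self.jobs = {}  # job_id -> {"state": ..., "owner": ...}, in submission order

    def submit(self, job_id):
        self.jobs[job_id] = {"state": "WAITING", "owner": None}

    def claim(self, coordinator):
        """Hand the oldest waiting job to a coordinator, or return None."""
        for job_id, job in self.jobs.items():
            if job["state"] == "WAITING":
                job.update(state="RUNNING", owner=coordinator)
                return job_id
        return None

    def complete(self, job_id):
        self.jobs[job_id] = {"state": "DONE", "owner": None}

    def recover(self, failed_coordinator):
        """Return the failed coordinator's running jobs to the waiting queue."""
        restored = []
        for job_id, job in self.jobs.items():
            if job["state"] == "RUNNING" and job["owner"] == failed_coordinator:
                job.update(state="WAITING", owner=None)
                restored.append(job_id)
        return restored
```

Because the job state lives in the shared store rather than in a coordinator's memory, any surviving coordinator can call `claim` and pick up the restored work.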
Enhancements
FUSE Reliability and Performance
Hard timeout for hung requests: alluxio.fuse.request.hard.timeout (default: disabled) forcibly terminates a stalled FUSE request after the configured duration and returns a clean error to the caller. Previously, a hung read from a slow or unreachable worker would block the FUSE mount indefinitely, so training jobs stalled until an NCCL or framework-level timeout fired rather than receiving a clean failure.
Local worker preference for colocated deployments: alluxio.user.replica.prioritize.local.worker (default: false) reorders the candidate worker list to put the colocated worker first, with zero additional RPCs. When a FUSE client and an Alluxio worker run on the same Kubernetes node, all reads that can be served locally go to that node-local worker, eliminating cross-node traffic for cache hits. This property is not available in AI 3.7.
Parallel directory operations: alluxio.fuse.parallel.dirops.enabled allows directory listing and metadata operations to execute in parallel through FUSE, improving throughput for workloads that scan large model repositories.
Non-disruptive FUSE hot migration: live migration of a FUSE mount to a new cluster now has improved logging, cleaner shutdown sequencing, and a fix that prevents write streams from failing mid-migration.
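The local worker preference amounts to reordering a list that the client already holds, which is why it needs no extra RPCs. A minimal sketch, assuming workers are identified by hostname (the function and parameter names are hypothetical):

```python
def prioritize_local_worker(candidates, local_host):
    """Return candidates with any worker on local_host moved to the front.

    The relative order of the remaining workers is preserved.
    """
    local = [w for w in candidates if w == local_host]
    remote = [w for w in candidates if w != local_host]
    return local + remote


print(prioritize_local_worker(["node-2", "node-1", "node-3"], "node-1"))
# ['node-1', 'node-2', 'node-3']
```

If no candidate runs on the local host, the list is returned in its original order.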
Security
Alluxio STS (AssumeRoleWithWebIdentity): Implements the full AWS STS AssumeRoleWithWebIdentity flow. An OIDC identity token (from Kubernetes service accounts, Okta, Keycloak, or any OIDC IdP) can be exchanged with the Alluxio STS endpoint for a short-lived session credential. S3 clients (Spark, Presto, or any SigV4-compatible client) can then authenticate to the Worker S3 API with per-user identity, without a custom client JAR.
TokenAuthenticator (direct OIDC for Worker S3 API): An alternative to the STS flow in which clients set an OIDC JWT as the S3 session token (X-Amz-Security-Token). Alluxio validates the JWT itself, bypassing SigV4 signature verification, so no STS infrastructure is required.
Independent HTTPS port for S3 Gateway: The S3 Gateway can listen on a dedicated HTTPS port separate from the plaintext HTTP port. Both ports can be active simultaneously, enabling gradual HTTP-to-HTTPS migration. TLS certificates can be rotated at runtime without restarting the gateway.
PEM certificate support: Multi-level CA PEM certificates are now supported for TLS on all Alluxio server and client connections, including etcd TLS. Previously only JKS/PKCS12 keystores were fully supported.
Authorization caching: Authorization decisions are cached with configurable TTL, reducing latency on hot paths where the same user repeatedly accesses the same paths.
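As a rough illustration of the token exchange step, the sketch below builds an AssumeRoleWithWebIdentity query URL using the parameter names from the AWS STS API. The endpoint, role ARN, and token values are placeholders, and which parameters the Alluxio STS endpoint requires is an assumption to verify against its documentation. Nothing is sent over the network.

```python
from urllib.parse import urlencode


def build_sts_request_url(sts_endpoint, role_arn, session_name, oidc_token):
    """Build an AssumeRoleWithWebIdentity query URL (request is not sent)."""
    params = {
        "Action": "AssumeRoleWithWebIdentity",
        "Version": "2011-06-15",          # AWS STS API version string
        "RoleArn": role_arn,              # assumption: the endpoint accepts a role ARN
        "RoleSessionName": session_name,
        "WebIdentityToken": oidc_token,   # the OIDC JWT issued by the IdP
    }
    return f"{sts_endpoint.rstrip('/')}/?{urlencode(params)}"


# Hypothetical endpoint and placeholder values, for illustration only.
url = build_sts_request_url(
    "https://alluxio-sts.example:8443",
    "arn:aws:iam::000000000000:role/example",
    "demo-session",
    "PLACEHOLDER.JWT.TOKEN",
)
```

In the AWS flow, the response carries a temporary access key, secret key, and session token, which an S3 client then uses for SigV4-signed requests.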
Dynamic Configuration
AI 3.7 required cluster restarts to change most configuration properties. AI 3.8 introduces a dynamic configuration system backed by etcd. The following can be changed at runtime and take effect across all workers and clients without restarts:
Cache quota and TTL: Per-path quotas and expiry
Priority eviction policy: Which data gets evicted first
Path mapping / FUSE path config: Virtual path remapping
Replica rules: Which files get extra replicas
Job metadata store: All load/free/pin job state
Access and audit log config: Log verbosity without restart
Write cache filter policies: Which S3 paths use write cache
CLI: alluxio config dynamic set/get/edit
REST API: GET/PUT /api/v1/config/dynamic
Requires etcd. Not available in single-node or non-HA deployments without etcd.
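A minimal sketch of how a client might prepare calls to the REST endpoint listed above. Only the path and HTTP methods come from these notes; the base URL is a placeholder, and the JSON body format is an assumption, since the payload schema is not described here. The requests are built but not sent.

```python
import json
from urllib.request import Request

DYNAMIC_CONFIG_PATH = "/api/v1/config/dynamic"  # path as listed in these notes


def build_get_request(base_url):
    """Prepare a GET for the current dynamic configuration."""
    return Request(base_url.rstrip("/") + DYNAMIC_CONFIG_PATH, method="GET")


def build_put_request(base_url, payload):
    """Prepare a PUT with a caller-supplied payload.

    Assumption: the endpoint accepts a JSON body. The payload schema is not
    specified in these notes, so it is passed through unchanged.
    """
    return Request(
        base_url.rstrip("/") + DYNAMIC_CONFIG_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Passing either request to `urllib.request.urlopen` would send it to a reachable cluster.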
Bug Fixes
Coordinator blocked under high concurrency: Coordinator could become unresponsive with many concurrent client requests
alluxio fs free hang: free could hang indefinitely on large directories
S3 data corruption on concurrent read+write: Reading and writing the same S3 object simultaneously could produce corrupted data
Worker client leak: Under certain error conditions, worker clients were not returned to the connection pool
NPE in UfsConnectivityMonitor: Worker could crash on UFS connectivity check if UFS was transiently unavailable
S3 list with non-existent prefix in write cache: Returned 500 instead of empty list
S3 rename fails with file not found: Rename on recently-written files could fail intermittently
OOM on listing large directories in write cache: List operations could exhaust heap on directories with >100k objects
Async prefetch cache race condition: Race in buffer handling could cause incorrect data to be returned
Parent+child UFS mount conflict: Could mount both a parent path and a child path simultaneously, causing undefined behavior
Breaking Changes
Dynamic configuration requires etcd: Deployments without etcd, such as single-node or non-HA clusters, cannot use the new dynamic configuration features.
Configuration Properties Added
alluxio.fuse.request.hard.timeout (default: -1, disabled): Kill hung FUSE requests after this duration
alluxio.user.replica.prioritize.local.worker (default: false): Route reads to the colocated worker first
alluxio.worker.s3.authenticator.classname (default: PassAllAuthenticator): Set TokenAuthenticator or AlluxioIamAuthenticator for auth
alluxio.worker.s3.authorization.enabled (default: false): Enable per-user S3 authorization
alluxio.worker.s3.authorization.cache.enabled (default: true): Cache authorization decisions
alluxio.worker.s3.http.port (default: not specified): Independent HTTP port for S3 (separate from HTTPS)
alluxio.fuse.parallel.dirops.enabled (default: not specified): Enable parallel FUSE directory operations