# Release Notes

## Alluxio Enterprise AI 3.8

### New Features

#### High-Performance Write Caching for S3 API

{% hint style="warning" %}
Experimental since AI 3.8
{% endhint %}

Alluxio now buffers S3 Put operations in a write cache before flushing to UFS, reducing write latency from \~50 ms to under 10 ms and delivering up to 6 GB/s per worker with linear scalability. The write cache supports full Multipart Upload (MPU) operations and drains safely before worker decommission. With persist-on-complete, cache storage is freed automatically once data reaches UFS. Write cache path policies are managed via the dynamic configuration system and take effect without restarts.

See [S3 Write Cache](/ee-ai-en/performance/s3-write-cache.md) for configuration details.

#### Optimized .safetensors Model Loading

Alluxio now merges thousands of small random reads into large sequential reads when loading `.safetensors` models, reaching load speeds within \~10% of local NVMe disk and up to 18× faster than AWS FSx Lustre. Symlinks to `.safetensors` files receive the same optimization as direct file access. This optimization is supported through the FUSE interface only.

See [Optimizing AI Model Loading](/ee-ai-en/performance/model-loading.md) for configuration details.
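
The idea of merging small reads into large sequential ones can be illustrated with a minimal sketch. This is not Alluxio's implementation; the function name `coalesce_reads` and the `max_gap` parameter are hypothetical, chosen only to show how adjacent or nearly adjacent byte ranges collapse into fewer, larger requests.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes


def coalesce_reads(ranges: List[Range], max_gap: int = 0) -> List[Range]:
    """Merge byte ranges whose gaps are at most max_gap into larger reads.

    Illustrative only: a hypothetical helper, not an Alluxio API.
    """
    merged: List[Range] = []
    for offset, length in sorted(ranges):
        if length <= 0:
            continue  # ignore empty reads
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            if offset <= prev_end + max_gap:
                # Extend the previous range instead of issuing a new read.
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged


# Five small tensor reads, four of which are contiguous.
tensor_reads = [(4096, 512), (0, 1024), (1024, 2048), (3072, 1024), (1_000_000, 256)]
print(coalesce_reads(tensor_reads))  # [(0, 4608), (1000000, 256)]
```

Allowing a non-zero `max_gap` trades a few wasted bytes for fewer round trips, which is usually worthwhile when per-request latency dominates.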

#### Job Service High Availability

Multiple coordinators can now share a single etcd-backed job queue, eliminating the Job Service as a single point of failure. Job state survives coordinator restarts — pre-restart jobs are automatically restored to the waiting queue, so manual resubmission is no longer needed. Any coordinator can pick up work from a failed peer within seconds.

See [Job Service](/ee-ai-en/administration/managing-job-service.md) for configuration details.

***

### Enhancements

#### FUSE Reliability and Performance

**Hard timeout for hung requests** — `alluxio.fuse.request.hard.timeout` (default: disabled) forcibly terminates a stalled FUSE request after the configured duration and returns a clean error to the caller. Previously, a hung read from a slow or unreachable worker would block the FUSE mount indefinitely, causing training jobs to hang through NCCL or framework-level timeouts rather than receiving a clean failure.

**Local worker preference for colocated deployments** — `alluxio.user.replica.prioritize.local.worker` (default: false) reorders the candidate worker list to put the colocated worker first with zero additional RPCs. When a FUSE client and an Alluxio worker run on the same Kubernetes node, all reads that can be served locally are sent to that node-local worker, eliminating cross-node traffic for cache hits. This property is not available in AI-3.7.
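
As a rough illustration of the reordering described above, the sketch below moves a colocated worker to the front of a candidate list without any network calls. The function name and the host strings are hypothetical; the actual selection logic inside Alluxio may differ.

```python
from typing import List, Optional


def prioritize_local_worker(candidates: List[str], local_host: Optional[str]) -> List[str]:
    """Return candidates with the colocated worker (if present) moved first.

    Illustrative only: a hypothetical helper, not an Alluxio API.
    The relative order of the remaining workers is preserved.
    """
    if local_host is None or local_host not in candidates:
        return list(candidates)
    return [local_host] + [w for w in candidates if w != local_host]


workers = ["node-a", "node-b", "node-c"]
print(prioritize_local_worker(workers, "node-c"))  # ['node-c', 'node-a', 'node-b']
```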

**Parallel directory operations** — `alluxio.fuse.parallel.dirops.enabled` allows directory listing and metadata operations to execute in parallel through FUSE, improving throughput for workloads that scan large model repositories.

**Non-disruptive FUSE hot migration** — improved logging, cleaner shutdown sequencing, and a fix that prevents write streams from failing mid-migration during live FUSE mount migration to a new cluster.

#### Security

**Alluxio STS — AssumeRoleWithWebIdentity** — Implements the full AWS STS `AssumeRoleWithWebIdentity` flow. An OIDC identity token (from Kubernetes service accounts, Okta, Keycloak, or any OIDC IdP) can be exchanged with the Alluxio STS endpoint for a short-lived session credential. S3 clients (Spark, Presto, or any SigV4-compatible client) can then authenticate to the Worker S3 API with per-user identity and no custom client JAR required.
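
The token exchange can be sketched as a form-encoded request body using the standard AWS STS parameter names for `AssumeRoleWithWebIdentity`. This assumes the Alluxio STS endpoint accepts the same parameters as AWS STS; the role ARN, session name, and token below are placeholders, and the endpoint URL is deliberately left out because it is deployment-specific.

```python
from urllib.parse import parse_qs, urlencode


def build_assume_role_body(role_arn: str, session_name: str,
                           oidc_token: str, duration_seconds: int = 3600) -> str:
    """Build the form-encoded body of an AssumeRoleWithWebIdentity request.

    Parameter names follow the AWS STS API; whether every field is
    required by the Alluxio STS endpoint is an assumption.
    """
    params = {
        "Action": "AssumeRoleWithWebIdentity",
        "Version": "2011-06-15",
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "WebIdentityToken": oidc_token,
        "DurationSeconds": str(duration_seconds),
    }
    return urlencode(params)


# Placeholder values for illustration only.
body = build_assume_role_body(
    role_arn="arn:aws:iam::123456789012:role/example-role",
    session_name="spark-job",
    oidc_token="header.payload.signature",
)
print(parse_qs(body)["Action"])  # ['AssumeRoleWithWebIdentity']
```

The body would be sent with an HTTP POST to the STS endpoint, and the temporary access key, secret key, and session token in the response would then be supplied to the S3 client.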

**TokenAuthenticator — Direct OIDC for Worker S3 API** — An alternative to the STS flow: clients set an OIDC JWT directly as the S3 session token (`X-Amz-Security-Token`). Alluxio validates the JWT directly, bypassing SigV4 signature verification. No STS infrastructure required.

**Independent HTTPS port for S3 Gateway** — The S3 Gateway can listen on a dedicated HTTPS port separate from the plaintext HTTP port. Both ports can be active simultaneously, enabling gradual HTTP-to-HTTPS migration. TLS certificates can be rotated at runtime without restarting the gateway.

**PEM certificate support** — Multi-level CA PEM certificates are now supported for TLS on all Alluxio server and client connections, including etcd TLS. Previously only JKS/PKCS12 keystores were fully supported.

**Authorization caching** — Authorization decisions are cached with configurable TTL, reducing latency on hot paths where the same user repeatedly accesses the same paths.
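
To show why a TTL cache helps on hot paths, here is a minimal sketch of caching authorization decisions per (user, path) pair. It is a generic illustration, not Alluxio's implementation; the class name, the `check` callback, and the injectable `clock` are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class TtlAuthorizationCache:
    """Cache allow/deny decisions for a fixed time-to-live (illustrative only)."""

    def __init__(self, check: Callable[[str, str], bool], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check          # the expensive authorization lookup
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries: Dict[Tuple[str, str], Tuple[bool, float]] = {}

    def is_allowed(self, user: str, path: str) -> bool:
        key = (user, path)
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now < cached[1]:
            return cached[0]         # fresh entry: skip the expensive check
        decision = self._check(user, path)
        self._entries[key] = (decision, now + self._ttl)
        return decision


calls = []


def slow_check(user: str, path: str) -> bool:
    calls.append((user, path))       # record how often the backend is hit
    return path.startswith("/models")


cache = TtlAuthorizationCache(slow_check, ttl_seconds=30)
cache.is_allowed("alice", "/models/llama")
cache.is_allowed("alice", "/models/llama")
print(len(calls))  # 1
```

A shorter TTL makes permission changes visible sooner, at the cost of more backend lookups.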

#### Dynamic Configuration

AI-3.7 required cluster restarts to change most configuration properties. AI-3.8 introduces a dynamic configuration system backed by etcd. The following can be changed at runtime and take effect across all workers and clients without restarts:

| What can change dynamically     | Notes                          |
| ------------------------------- | ------------------------------ |
| Cache quota and TTL             | Per-path quotas and expiry     |
| Priority eviction policy        | Which data gets evicted first  |
| Path mapping / FUSE path config | Virtual path remapping         |
| Replica rules                   | Which files get extra replicas |
| Job metadata store              | All load/free/pin job state    |
| Access and audit log config     | Log verbosity without restart  |
| Write cache filter policies     | Which S3 paths use write cache |

**CLI:** `alluxio config dynamic set/get/edit`

**REST API:** `GET/PUT /api/v1/config/dynamic`

Dynamic configuration requires etcd and is not available in single-node or non-HA deployments that run without it.
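
The REST endpoint could be called from a script along these lines. Only the path `/api/v1/config/dynamic` and the `GET`/`PUT` methods come from these notes; the base URL is a placeholder, and the payload format and JSON content type are assumptions that should be checked against the configuration reference.

```python
import urllib.request
from typing import Optional

API_PATH = "/api/v1/config/dynamic"  # path as listed in these release notes


def dynamic_config_request(base_url: str,
                           body: Optional[bytes] = None) -> urllib.request.Request:
    """Build a GET (no body) or PUT (with body) request for dynamic config.

    The body is passed through as opaque bytes because its schema is not
    described here; the JSON content type is an assumption.
    """
    url = base_url.rstrip("/") + API_PATH
    if body is None:
        return urllib.request.Request(url, method="GET")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Placeholder host and port; substitute the real API address of the cluster.
get_req = dynamic_config_request("http://alluxio.example.internal:8080/")
print(get_req.get_method(), get_req.full_url)
# To send it: urllib.request.urlopen(get_req, timeout=10)
```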

***

### Bug Fixes

| Bug                                             | Impact in AI-3.7                                                                           |
| ----------------------------------------------- | ------------------------------------------------------------------------------------------ |
| Coordinator blocked under high concurrency      | Coordinator could become unresponsive with many concurrent client requests                 |
| `alluxio fs free` hang                          | `free` could hang indefinitely on large directories                                        |
| S3 data corruption on concurrent read+write     | Reading and writing the same S3 object simultaneously could produce corrupted data         |
| Worker client leak                              | Under certain error conditions, worker clients were not returned to the connection pool    |
| NPE in UfsConnectivityMonitor                   | Worker could crash on UFS connectivity check if UFS was transiently unavailable            |
| S3 list with non-existent prefix in write cache | Returned 500 instead of empty list                                                         |
| S3 rename fails with file not found             | Rename on recently-written files could fail intermittently                                 |
| OOM on listing large directories in write cache | List operations could exhaust heap on directories with >100k objects                       |
| Async prefetch cache race condition             | Race in buffer handling could cause incorrect data to be returned                          |
| Parent+child UFS mount conflict                 | Could mount both a parent path and a child path simultaneously, causing undefined behavior |

***

### Breaking Changes

**Dynamic configuration requires etcd** — The new dynamic config features require etcd. Customers running single-node or non-HA deployments without etcd cannot use dynamic config.

***

### Configuration Properties Added

| Property                                        | Default                | Description                                                    |
| ----------------------------------------------- | ---------------------- | -------------------------------------------------------------- |
| `alluxio.fuse.request.hard.timeout`             | -1 (disabled)          | Kill hung FUSE requests after this duration                    |
| `alluxio.user.replica.prioritize.local.worker`  | false                  | Route reads to the colocated worker first                      |
| `alluxio.worker.s3.authenticator.classname`     | `PassAllAuthenticator` | Set `TokenAuthenticator` or `AlluxioIamAuthenticator` for auth |
| `alluxio.worker.s3.authorization.enabled`       | false                  | Enable per-user S3 authorization                               |
| `alluxio.worker.s3.authorization.cache.enabled` | true                   | Cache authorization decisions                                  |
| `alluxio.worker.s3.http.port`                   | —                      | Independent HTTP port for S3 (separate from HTTPS)             |
| `alluxio.fuse.parallel.dirops.enabled`          | —                      | Enable parallel FUSE directory operations                      |

