Release Notes

AI-3.7-13.0.0

Transparent Distributed S3 Cache with Sub-ms Latency

A growing number of AI/ML workloads (e.g., PyTorch, TensorFlow) rely on Amazon S3 (or S3-compatible storage) for scalable data access. In addition to throughput bottlenecks, these workloads also face latency challenges. Alluxio Enterprise AI bridges this gap: deployed colocated with GPUs and accessed through its S3-compatible interface, it delivers single-digit-millisecond latency while maintaining high-throughput data access.

Use Cases:

  • Model training accessing datasets through S3 interface

  • Model deployment loading model files through S3 interface

  • Model inference loading features from Parquet files on AWS S3

Key Benefits

  • Faster AI/ML Workloads

    • Cache data in NVMe disks on the GPU node, eliminating S3 fetch delays for repeated reads.

    • Achieve near-local NVMe throughput and latency for iterative workloads (e.g., model training).

  • Reduce Cloud Access Costs

    • Co-locating the cache with GPUs cuts S3 egress and API-call fees by up to 70%.

Performance Results

  • Single-digit-millisecond latency - Alluxio offers up to 45x lower latency than standard AWS S3 and up to 5x lower latency than AWS S3 Express One Zone

  • High throughput - Alluxio delivers up to 11.5 GiB/s (98.7 Gbps) on a 100 Gbps network, 2x the read throughput of same-region AWS S3

  • Performance scales linearly as more workers are added to the cluster

Alluxio transforms S3 into a high-throughput, low-latency data hub for AI, eliminating I/O slowdowns while slashing cloud costs. Read more about the supported S3 API operations and example code snippets showing how to configure various clients.
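
To make the configuration concrete, the sketch below shows one way a standard boto3 client could be pointed at an Alluxio S3-compatible endpoint; the endpoint address, port, credentials, bucket, and object key are illustrative placeholders rather than values documented in this release.

```python
# Minimal sketch: reading a dataset object through Alluxio's S3-compatible
# interface with a standard boto3 client. The endpoint URL, credentials,
# bucket, and key below are placeholders for illustration only.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="http://<alluxio-s3-endpoint>:<port>",  # assumed Alluxio S3 API address
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
    # Path-style addressing is typical for S3-compatible endpoints.
    config=Config(s3={"addressing_style": "path"}),
)

# Repeated reads of the same object are served from the colocated NVMe cache
# after the first fetch from the underlying S3 bucket.
obj = s3.get_object(Bucket="training-data", Key="datasets/imagenet/part-00000.parquet")
payload = obj["Body"].read()
print(f"read {len(payload)} bytes through the Alluxio S3 interface")
```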

High-performance AI Data Preprocessing Through Spark

Alluxio offers a new capability to boost AI data preprocessing through Spark Streaming and ETL pipelines. By leveraging native Spark integration and its distributed caching architecture, Alluxio accelerates data processing workflows for AI/ML applications.

Use Cases:

  • In an AI pipeline, accelerate dataset preprocessing using Spark

Key Benefits:

  • Faster AI Workloads: Reduces data loading/transformation time.

  • Simplified Scalability: Handles petabyte-scale datasets without pipeline redesigns.

  • Seamless Integration: Works with existing Spark code and storage systems (HDFS, S3, etc.).

Performance Results: In the TPC-DS SF100 (100GB) benchmark, Alluxio improves query performance by up to 3x compared to direct access to AWS S3. On average, it achieves a 32% speedup across 135 queries.

This enhancement is particularly valuable for ML engineers and data teams managing feature engineering at scale while maintaining compatibility with standard Spark ecosystems. Read more about how to leverage a combination of Alluxio features for this use case.
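
As an illustration of the integration pattern, the sketch below shows a small PySpark preprocessing job that reads and writes Parquet data through an assumed Alluxio S3-compatible endpoint via Hadoop's S3A connector; the endpoint, credentials, and paths are placeholders, and the exact properties to use should be taken from the documentation referenced above.

```python
# Minimal sketch: a Spark preprocessing step that reads raw Parquet data
# through an assumed Alluxio S3-compatible endpoint using the Hadoop S3A
# connector. Endpoint, credentials, and paths are illustrative placeholders,
# and the hadoop-aws jars are assumed to be on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("alluxio-preprocessing-sketch")
    # Point the S3A connector at the Alluxio endpoint instead of AWS S3 directly.
    .config("spark.hadoop.fs.s3a.endpoint", "http://<alluxio-s3-endpoint>:<port>")
    .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Existing Spark code is unchanged: only the endpoint configuration differs.
raw = spark.read.parquet("s3a://training-data/raw/events/")
features = (
    raw.filter(F.col("label").isNotNull())
       .withColumn("event_date", F.to_date("event_ts"))
       .groupBy("user_id", "event_date")
       .agg(F.count("*").alias("event_count"))
)
features.write.mode("overwrite").parquet("s3a://training-data/features/daily/")
```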

5x Faster Cache Preloading with Partitioned and Parallel Processing for Large Files

Alluxio supports preloading data from the underlying storage (UFS) into its cache. An enhancement to this functionality introduces a partitioned and parallel data loading mechanism, providing 5x faster performance for large files (typically >1 GB) and ensuring faster, more efficient transfers into Alluxio's cache.

Use Cases:

  • Model Training: Requires fast access to preloaded datasets to accelerate the training process and reduce data loading delays.

  • Model Deployment: Demands shorter cold start times by quickly loading large model files, ensuring faster inference and responsiveness.

Key Enhancements:

  • Partitioned Data Loading:

    • Large files are split into smaller, manageable chunks (partitions) for faster loading.

    • Partitioning ensures that each chunk can be handled independently, leading to better scalability and resource utilization.

  • Parallel Data Loading:

    • Each partition is loaded in parallel, drastically reducing the time required to load the entire file.

    • This parallelism maximizes available bandwidth and computational resources, leading to a performance boost.

  • Resource Efficiency:

    • The partitioned approach distributes work evenly across the available compute resources, ensuring balanced utilization.

    • This results in reduced bottlenecks and increased throughput.
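
For intuition, the sketch below illustrates the general partitioned, parallel loading pattern described above using ranged S3 GETs from a client; it is a conceptual illustration, not Alluxio's internal implementation, and the endpoint, object names, and partition size are assumptions.

```python
# Conceptual sketch of partitioned + parallel loading (not Alluxio internals):
# split a large object into fixed-size ranges and fetch the ranges concurrently
# with S3 ranged GETs. Endpoint, credentials, and object names are placeholders.
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3", endpoint_url="http://<s3-endpoint>")  # placeholder endpoint
BUCKET, KEY = "training-data", "models/llm-checkpoint.bin"
PARTITION_SIZE = 128 * 1024 * 1024  # 128 MiB per partition (illustrative choice)

def fetch_partition(start: int, end: int) -> bytes:
    """Fetch one byte range of the object; each range is independent."""
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")
    return resp["Body"].read()

# Determine the object size, then derive the partition boundaries.
size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
ranges = [(off, min(off + PARTITION_SIZE, size) - 1)
          for off in range(0, size, PARTITION_SIZE)]

# Load all partitions in parallel; in practice the work would be spread
# across cluster workers rather than threads in a single process.
with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(lambda r: fetch_partition(*r), ranges))

data = b"".join(parts)
assert len(data) == size
```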

Role-based Access Control (RBAC) S3 Access

Alluxio's new Role-based Access Control (RBAC) feature for S3 access enhances data security and control. This functionality allows administrators to define granular access permissions (read/write) or integrate with existing authentication and authorization services for S3 data through Alluxio's unified namespace.

  • Authentication: Supports OIDC/OAuth 2.0-based authentication with identity providers such as Okta, Cognito, and Microsoft AD

  • Authorization: Supports Apache Ranger

The feature bridges compliance gaps by extending enterprise-grade authentication and authorization to S3 data while maintaining Alluxio’s caching and acceleration benefits. Read more about the Authentication and Authorization capabilities.
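
As a rough illustration of how RBAC shows up on the client side, the sketch below assumes an administrator has mapped a key pair to a read-only role on an Alluxio S3 endpoint; the endpoint, credentials, role mapping, and returned error code are assumptions, not behavior documented in this release.

```python
# Illustrative sketch only: how a read-only role might behave against an
# Alluxio S3 endpoint once RBAC is configured. The endpoint, credentials,
# and the mapping of this key pair to a read-only role are assumptions.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    endpoint_url="http://<alluxio-s3-endpoint>:<port>",
    aws_access_key_id="<read-only-role-access-key>",
    aws_secret_access_key="<read-only-role-secret-key>",
)

# A role granted read permission can fetch data as usual.
s3.get_object(Bucket="training-data", Key="datasets/features.parquet")

# A write attempt from the same role is expected to be rejected.
try:
    s3.put_object(Bucket="training-data", Key="datasets/new.parquet", Body=b"...")
except ClientError as err:
    print("write denied as expected:", err.response["Error"]["Code"])
```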

FUSE Non-Disruptive Upgrade

Traditional FUSE updates present significant operational challenges for production environments. When updating Linux FUSE services, administrators must restart the service, which forcibly terminates all active connections and mounted filesystems. This mandatory downtime disrupts running applications and business workflows, and is particularly problematic for data-intensive operations that rely on continuous access to FUSE-mounted data.

Alluxio's new FUSE non-disruptive upgrade feature fundamentally changes this paradigm. The technology enables in-place upgrades of the FUSE service while maintaining all existing connections and mount points. Applications continue operating normally throughout the update process. This advancement is particularly valuable for enterprises running 24/7 data pipelines or customer-facing applications that cannot tolerate downtime.

Known limitations in this release: during a FUSE upgrade, read operations (read, stat) are maintained (they may hang and resume within tens of seconds), while write (write, mv, delete) and list (readdir) operations will still fail.
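
The sketch below illustrates what this limitation means for an application using an Alluxio FUSE mount during an upgrade; the mount point and retry policy are assumptions for illustration.

```python
# Minimal sketch of what the documented limitation means for an application
# reading and writing through an Alluxio FUSE mount during a non-disruptive
# upgrade. The mount path and retry policy are assumed placeholders.
import errno
import time

MOUNT = "/mnt/alluxio-fuse"  # assumed FUSE mount point

# Reads stay functional: they may stall for tens of seconds mid-upgrade
# and then resume, so no special handling is needed beyond tolerating delay.
with open(f"{MOUNT}/datasets/part-00000.parquet", "rb") as f:
    chunk = f.read(4 * 1024 * 1024)

# Writes can still fail during the upgrade window, so a simple retry loop
# is one way an application might cope with that limitation.
for attempt in range(5):
    try:
        with open(f"{MOUNT}/outputs/metrics.txt", "a") as f:
            f.write("epoch=3 loss=0.42\n")
        break
    except OSError as err:
        print(f"write failed ({errno.errorcode.get(err.errno, err.errno)}), retrying...")
        time.sleep(10)
```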

See more information about this capability here.

Cluster Management Console Enhancements

Support Deploying Cluster Through Management Console

After installing the Alluxio K8s Operator, the cluster deployment Web UI can guide users through the remaining steps of the setup process to deploy a cluster. The Web UI provides a user-friendly alternative to manual configuration, allowing administrators to visually manage cluster parameters, resource allocation, and deployment workflows. This feature significantly reduces deployment complexity while maintaining the flexibility of Alluxio's distributed architecture.

Read more about how to enable the cluster deployment UI with a walkthrough of the process.

Enhanced Job Management

In this release, the Alluxio Management Console enhances its job management functionality:

  • Add a meaningful name to a job

  • Support pagination when listing job history

Audit Log

Alluxio has introduced a new audit logging feature to enhance security and compliance monitoring. This functionality systematically records detailed access events, including user identities, operations performed (e.g., read/write), and timestamps. The logs enable administrators to analyze data access patterns, detect anomalies, and meet regulatory requirements. Read more about the audit log feature.
