Loading Data into the Cache
Alluxio provides two primary methods for loading data into its cache: passive caching on the first read and active preloading. Understanding these methods helps you optimize data access patterns and ensure high performance for your applications.
Passive Caching: Caching on First Read
Passive caching is the default and most common way data enters the Alluxio cache. The process is simple and automatic:
1. An application requests to read a file through Alluxio.
2. Alluxio checks if the data for that file is already in its cache.
3. If the data is not cached (a "cache miss"), Alluxio retrieves the data from the underlying file system (UFS).
4. As the data is streamed from the UFS to the application, Alluxio simultaneously writes it into the cache on the designated worker node.
Subsequent reads of the same file will be served directly from the Alluxio cache at memory speed, avoiding the need to access the slower UFS. This "cache-on-read" behavior requires no special configuration and works out of the box.
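For example, reading a file through the Alluxio CLI is enough to populate the cache. The sketch below assumes the fs cat subcommand and an illustrative file path under /data; the exact CLI syntax may vary between Alluxio versions:
$ bin/alluxio fs cat /data/part-00000 > /dev/null   # first read: cache miss, data is fetched from the UFS and cached
$ bin/alluxio fs cat /data/part-00000 > /dev/null   # repeat read: served directly from the Alluxio cache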
Active Caching: Preloading Data
In some scenarios, you may want to load data into the cache before an application needs it. This process, known as preloading or cache warming, is ideal for performance-critical workloads where the initial cache miss latency is unacceptable. For example, you can preload a large dataset before starting a machine learning training job to ensure the training process runs at maximum speed from the beginning.
Alluxio's distributed load feature allows you to efficiently load data from a UFS into the Alluxio cluster. The load operation is distributed across all worker nodes to maximize parallelism and speed. For scenarios with high concurrency, the distributed load can also leverage file segmentation and replication to further optimize data distribution and availability across the cluster.
Using the job load Command
The most common way to trigger a distributed load is through the job load command-line interface (CLI). The CLI sends a request to the Alluxio coordinator, which then orchestrates the load operation across the workers.
bin/alluxio job load [flags] <path>
Example:
To load the contents of the /data directory from the UFS into Alluxio:
$ bin/alluxio job load /data
Progress for loading path '/data':
Settings: bandwidth: unlimited verify: false
Job State: SUCCEEDED
Files Processed: 1000
Bytes Loaded: 125.00MB
Throughput: 2509.80KB/s
Block load failure rate: 0.00%
Files Failed: 0
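The Settings line in the output above corresponds to tunable options such as a bandwidth limit and post-load verification. As a sketch only (the flag spellings shown here are assumptions and may differ by Alluxio version), a bandwidth-limited, verified load might look like:
$ bin/alluxio job load --bandwidth 100MB --verify /data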
For a complete list of flags and options, refer to the job load CLI documentation.
Using the REST API
You can also initiate a distributed load programmatically via the REST API. This is useful for integrating cache preloading into automated workflows and data pipelines.
Please refer to the API reference page for more details on using the REST endpoint for distributed load.
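As a rough sketch of such an integration, submitting a load from a script could resemble the curl call below; the host, port, and endpoint path are placeholders rather than the documented route, so consult the API reference for the actual request format:
$ curl -X POST "http://<coordinator-host>:<port>/api/v1/load?path=/data"   # placeholder endpoint for illustration only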
Note: The list of historical load tasks is retained for a configurable period. By default, only tasks from the last seven days are shown. This retention time can be adjusted using the alluxio.job.retention.time property.
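For example, to retain load task history for 14 days, you could set the property in alluxio-site.properties; the duration value format shown here is an assumption, so check the configuration reference for accepted formats:
# illustrative value; the default retention corresponds to seven days
alluxio.job.retention.time=14d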