Cache Loading

Alluxio populates its cache in two ways: passively on first read (automatic, no setup) and actively via the job load command (explicit preload before your job runs).

Prerequisites

  • A running Alluxio cluster with at least one worker

  • At least one UFS mount configured (run alluxio mount list to verify)

Alluxio will automatically evict cached data to make room for new data according to the configured eviction policy. You do not need to pre-clear space before submitting a load job.

Passive Caching

On every cache miss, Alluxio fetches the file from UFS and writes it into the worker cache while streaming it to the application. No configuration is needed, and subsequent reads are served from cache.

This is the default behavior. Use active preloading when you cannot afford the first-read latency.

Active Preloading with job load

job load submits a distributed load job: the coordinator distributes work across all workers, each pulling its assigned files from UFS directly. For scheduling internals, HA, and advanced tuning, see Job Service.

Submit and Monitor

--path accepts either a UFS path (e.g. s3://my-bucket/dataset/) or an Alluxio virtual path (e.g. /mnt/dataset/). See the CLI reference for details.

# Submit (returns immediately)
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --submit

# Monitor progress
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --progress

Example progress output:

Stop a Running Job
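
A minimal sketch of stopping a job, assuming the same coordinator pod and invocation pattern as the submit and monitor commands above. The wrapper name stop_load and its argument order are hypothetical; --stop is the flag listed under Key Flags.

```shell
# Hypothetical wrapper: stop the load job that was submitted for a path.
# Usage: stop_load <namespace> <ufs-or-alluxio-path>
stop_load() {
  kubectl exec -n "$1" alluxio-cluster-coordinator-0 -- \
    alluxio job load --path "$2" --stop
}
```

Call it with the same path that was used at submit time, for example stop_load my-namespace s3://my-bucket/dataset/ (the namespace name is a placeholder).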

A stopped job can be resumed by submitting it again with --submit. Already-cached files will be skipped if --skip-if-exists is included.

Key Flags

| Flag | Description |
|---|---|
| --submit | Submit the job asynchronously (returns immediately) |
| --progress | Show progress of a submitted job |
| --stop | Stop a running job |
| --verify | After load completes, re-check every file and reload any that are not fully cached. |
| --replicas <n> | Load n replicas per file (default: 1); useful for high-concurrency reads |
| --skip-if-exists | Skip files that are already fully cached (safe to re-run a load job) |
| --load-policy IF_CHANGED | Re-check each cached file against UFS metadata; reload only files whose content has changed. Use for incremental sync of mutable datasets. |
| --metadata-only | Load file metadata without caching file data |
| --batch-size <n> | Number of files per batch per worker. Default: 200. For small files (< 1 MB), increase to 2000–5000 for better throughput. For large files (> 100 MB), keep at 200 or lower to avoid worker memory pressure. |
| --partial-listing | Start loading before the full directory listing completes; useful for very large directories |
| --index-file <ufs-path> | Load a specific list of files defined in a UFS index file (one path per line) |

For the full flag reference, see job load CLI documentation.
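
The --batch-size guidance can be turned into a small helper. This is only a sketch: the function name suggest_batch_size is hypothetical, and choosing 2000 (the low end of the 2000–5000 range) for small files is an assumption, not a documented recommendation.

```shell
# Hypothetical helper: suggest a --batch-size from the average file size in bytes.
# Thresholds follow the --batch-size guidance; the value picked inside the
# small-file range (2000) is an assumption.
suggest_batch_size() {
  avg_bytes=$1
  if [ "$avg_bytes" -lt 1048576 ]; then
    echo 2000   # small files (< 1 MB): low end of the 2000-5000 range
  else
    echo 200    # the default, and the ceiling suggested for files > 100 MB
  fi
}
```

The result can be passed when submitting, for example --batch-size "$(suggest_batch_size 4096)". For files larger than 100 MB, consider lowering the value further if workers show memory pressure.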

Loading from an Index File

Use --index-file for selective loading, or when the directory tree is too large to traverse upfront.

Index file format: one UFS path per line. Lines starting with # are comments, and directories must end with / to be loaded recursively.
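
As an illustration of the format, the sketch below writes a sample index file and lists the entries that are neither comments nor blank. The object names under s3://my-bucket/dataset/, the file name load-index.txt, and the helper index_entries are all hypothetical.

```shell
# Write a sample index file (all paths are hypothetical examples).
cat > load-index.txt <<'EOF'
# Training shards for the current run
s3://my-bucket/dataset/train/part-00000.parquet
s3://my-bucket/dataset/train/part-00001.parquet
# Trailing slash: load this directory recursively
s3://my-bucket/dataset/validation/
EOF

# Hypothetical helper: print the effective entries, i.e. every line
# that is neither blank nor a # comment.
index_entries() {
  grep -v -e '^#' -e '^[[:space:]]*$' "$1"
}

index_entries load-index.txt
```

Because --index-file takes a UFS path, the file has to be uploaded to the UFS before submitting the job. Whether --path must also be supplied alongside --index-file is not stated here, so check the CLI reference.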

Incremental Load for Mutable Data

When the underlying dataset changes periodically (e.g., daily model checkpoints, updated training splits), use --load-policy IF_CHANGED to sync only the files that have changed since the last load:
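
A minimal sketch of such a submission, assuming the same coordinator pod and invocation pattern as the submit command earlier on this page. The wrapper name sync_changed and its argument order are hypothetical.

```shell
# Hypothetical wrapper: load new files and refresh only those whose
# UFS content has changed since they were cached.
# Usage: sync_changed <namespace> <ufs-or-alluxio-path>
sync_changed() {
  kubectl exec -n "$1" alluxio-cluster-coordinator-0 -- \
    alluxio job load --path "$2" --load-policy IF_CHANGED --submit
}
```

A wrapper like this could be invoked from a scheduler after each dataset update.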

--load-policy IF_CHANGED re-checks UFS metadata for each already-cached file and reloads it only if the content has changed. Files that are not yet cached are loaded unconditionally. This makes it the right choice for periodic sync of a mutable dataset: new files get cached, changed files get refreshed, and unchanged files are skipped.

| Flag | Cached files | Uncached files |
|---|---|---|
| --submit (no flags) | Reload unconditionally | Load |
| --skip-if-exists | Skip | Load |
| --load-policy IF_CHANGED | Reload only if changed | Load |

Integrating with ML Training

A typical workflow: load data → verify → run training.
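
The workflow can be sketched as a small script that submits the load, polls until it finishes, and only then starts training. Several things here are assumptions: that --verify can be combined with --submit, that the --progress output contains a line of the form "Job State: SUCCEEDED", and the helper names load_job, job_state, and load_and_wait, which are hypothetical.

```shell
# Run `alluxio job load` on the coordinator pod.
# Usage: load_job <namespace> <path> <flags...>
load_job() {
  ns=$1; path=$2; shift 2
  kubectl exec -n "$ns" alluxio-cluster-coordinator-0 -- \
    alluxio job load --path "$path" "$@"
}

# Extract the job state from progress output read on stdin.
# Assumption: the output contains a line such as "Job State: SUCCEEDED".
job_state() {
  sed -n 's/.*Job State: *\([A-Z_]*\).*/\1/p' | head -n 1
}

# Submit with --verify, then poll until the job succeeds, fails, or we give up.
# Usage: load_and_wait <namespace> <path>
load_and_wait() {
  load_job "$1" "$2" --submit --verify || return 1
  tries=0
  while [ "$tries" -lt "${MAX_POLLS:-120}" ]; do
    state=$(load_job "$1" "$2" --progress | job_state)
    case "$state" in
      SUCCEEDED) return 0 ;;
      FAILED) echo "load job failed for $2" >&2; return 1 ;;
    esac
    tries=$((tries + 1))
    sleep "${POLL_SECONDS:-30}"
  done
  echo "timed out waiting for load of $2" >&2
  return 1
}
```

A training launcher could then gate on the result, for example load_and_wait my-namespace s3://my-bucket/dataset/ && python train.py, where the namespace and train.py are placeholders.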

Near-100% cache coverage: For critical datasets, run a second pass with --skip-if-exists after the first job reaches SUCCEEDED. In rare cases, such as transient worker failures or hash ring boundary timing, a single pass may miss a small fraction of files. A second pass fills those gaps without reloading already-cached data.
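
A minimal sketch of the second pass, assuming the same coordinator pod and invocation pattern as the submit command earlier on this page. The wrapper name fill_cache_gaps is hypothetical.

```shell
# Hypothetical wrapper: second pass that loads only files still missing
# from cache. Run it after --progress reports SUCCEEDED for the first job.
# Usage: fill_cache_gaps <namespace> <ufs-or-alluxio-path>
fill_cache_gaps() {
  kubectl exec -n "$1" alluxio-cluster-coordinator-0 -- \
    alluxio job load --path "$2" --skip-if-exists --submit
}
```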

Failure Modes

Job State: FAILED with Files Failed > 0

Check the file-level failure list:

Common causes: UFS access errors, network timeouts, or missing credentials. Fix the underlying issue, then resubmit with --skip-if-exists to avoid re-loading already-cached files.

Job State: FAILED immediately after submit

Run --progress --verbose for detail:
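
A minimal sketch of that invocation, assuming the same coordinator pod and path as the original submit. The wrapper name load_details is hypothetical.

```shell
# Hypothetical wrapper: print detailed progress for a submitted load job.
# Usage: load_details <namespace> <ufs-or-alluxio-path>
load_details() {
  kubectl exec -n "$1" alluxio-cluster-coordinator-0 -- \
    alluxio job load --path "$2" --progress --verbose
}
```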

Common causes: the path is not in the mount table (verify with alluxio mount list), or the cache quota is insufficient.

Load succeeds but reads still go to UFS

Verify that specific files are actually cached:

If files show as uncached after a successful load, the data may have been evicted. Check cache capacity and eviction settings (see Cache Eviction). For the cluster-wide cache hit rate, see Monitoring.

Retention of Historical Jobs

Completed job records are kept for a configurable period. The default is 7 days. To adjust:
