Cache Loading
Alluxio populates its cache in two ways: passively on first read (automatic, no setup) and actively via the job load command (explicit preload before your job runs).
Prerequisites
A running Alluxio cluster with at least one worker
At least one UFS mount configured (run alluxio mount list to verify)
Alluxio will automatically evict cached data to make room for new data according to the configured eviction policy. You do not need to pre-clear space before submitting a load job.
Passive Caching
On every cache miss, Alluxio fetches the file from UFS and writes it into the worker cache while streaming it to the application. No configuration needed — subsequent reads are served from cache.
This is the default behavior. Use active preloading when you cannot afford the first-read latency.
Active Preloading with job load
job load submits a distributed load job: the coordinator distributes work across all workers, and each worker pulls its assigned files directly from UFS. For scheduling internals, HA, and advanced tuning, see Job Service.
Submit and Monitor
--path accepts either a UFS path (e.g. s3://my-bucket/dataset/) or an Alluxio virtual path (e.g. /mnt/dataset/). See the CLI reference for details.
On Kubernetes:
# Submit (returns immediately)
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --submit
# Monitor progress
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --progress

On a host with Alluxio installed:
# Submit (returns immediately)
bin/alluxio job load --path <ufs-or-alluxio-path> --submit
# Monitor progress
bin/alluxio job load --path <ufs-or-alluxio-path> --progress

Example progress output:
Stop a Running Job
A stopped job can be resumed by submitting it again with --submit. Already-cached files will be skipped if --skip-if-exists is included.
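The stop-and-resume cycle described here can be sketched with the --stop, --submit, and --skip-if-exists flags listed under Key Flags. This is a minimal sketch for a host install; the helper names stop_load and resume_load and the ALLUXIO_BIN variable are hypothetical conveniences, not part of the Alluxio CLI.

```shell
# ALLUXIO_BIN is a hypothetical override; it defaults to the bin/alluxio launcher.
ALLUXIO_BIN="${ALLUXIO_BIN:-bin/alluxio}"

# Stop the load job that is running for the given path.
stop_load() {
  "$ALLUXIO_BIN" job load --path "$1" --stop
}

# Resume by submitting again; --skip-if-exists skips files that are already fully cached.
resume_load() {
  "$ALLUXIO_BIN" job load --path "$1" --submit --skip-if-exists
}
```

For example, stop_load s3://my-bucket/dataset/ halts the job, and resume_load s3://my-bucket/dataset/ picks it up again later. On Kubernetes, the same arguments would follow the kubectl exec prefix shown in Submit and Monitor.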
Key Flags
--submit: Submit the job asynchronously (returns immediately).
--progress: Show progress of a submitted job.
--stop: Stop a running job.
--verify: After load completes, re-check every file and reload any that are not fully cached.
--replicas <n>: Load n replicas per file (default: 1); useful for high-concurrency reads.
--skip-if-exists: Skip files that are already fully cached (safe to re-run a load job).
--load-policy IF_CHANGED: Re-check each cached file against UFS metadata; reload only files whose content has changed. Use for incremental sync of mutable datasets.
--metadata-only: Load file metadata without caching file data.
--batch-size <n>: Number of files per batch per worker. Default: 200. For small files (< 1 MB), increase to 2000–5000 for better throughput. For large files (> 100 MB), keep at 200 or lower to avoid worker memory pressure.
--partial-listing: Start loading before the full directory listing completes; useful for very large directories.
--index-file <ufs-path>: Load a specific list of files defined in a UFS index file (one path per line).
For the full flag reference, see job load CLI documentation.
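The --batch-size guidance above can be turned into a small helper. This is only a sketch: pick_batch_size is a hypothetical function, the choice of 2000 (the low end of the suggested range) for small files is an assumption, and files between 1 MB and 100 MB simply get the default because the guidance does not cover them.

```shell
# Hypothetical helper: choose a --batch-size from an average file size in bytes.
pick_batch_size() {
  avg_bytes=$1
  if [ "$avg_bytes" -lt 1048576 ]; then
    echo 2000   # small files (< 1 MB): low end of the suggested 2000-5000 range
  else
    echo 200    # the default, which is also the suggested ceiling for files > 100 MB
  fi
}
```

The result can be passed straight to the flag, for example: bin/alluxio job load --path <ufs-or-alluxio-path> --batch-size "$(pick_batch_size 4096)" --submit.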
Loading from an Index File
For selective loading or when the directory tree is too large to traverse upfront:
Index file format — one UFS path per line, lines starting with # are comments:
Directories must end with / to be loaded recursively.
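To illustrate the format, the sketch below writes a small index file and defines a helper that submits it. The bucket, file names, and the helper load_from_index are hypothetical. The index file has to be stored in the UFS, so the upload step is left to whatever tool your storage uses. Whether --path must accompany --index-file is not stated here, so treat the command shape as an assumption and confirm it against the CLI reference.

```shell
# Write an example index file locally (every path below is a placeholder).
cat > index.txt <<'EOF'
# Lines starting with # are comments
s3://my-bucket/dataset/train/part-00000.parquet
s3://my-bucket/dataset/train/part-00001.parquet
# A trailing slash loads the directory recursively
s3://my-bucket/dataset/validation/
EOF

# Count the entries that would be loaded (comment lines excluded).
grep -vc '^#' index.txt

# Hypothetical helper: submit a load job driven by an index file stored in the UFS.
# Assumption: --index-file is combined with --submit on the same job load command.
load_from_index() {
  "${ALLUXIO_BIN:-bin/alluxio}" job load --index-file "$1" --submit
}
```

After uploading index.txt to the UFS, the job would be submitted with load_from_index followed by the UFS path of the uploaded file.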
Incremental Load for Mutable Data
When the underlying dataset changes periodically (e.g., daily model checkpoints, updated training splits), use --load-policy IF_CHANGED to sync only the files that have changed since the last load:
--load-policy IF_CHANGED re-checks UFS metadata for each already-cached file and reloads it only if the content has changed. Files that are not yet cached are loaded unconditionally. This makes it the right choice for periodic sync of a mutable dataset: new files get cached, changed files get refreshed, and unchanged files are skipped.
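Behavior of each option for files that are already cached versus not yet cached:
--submit (no flags): already-cached files are reloaded unconditionally; uncached files are loaded.
--skip-if-exists: already-cached files are skipped; uncached files are loaded.
--load-policy IF_CHANGED: already-cached files are reloaded only if changed; uncached files are loaded.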
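A periodic sync of a mutable dataset could be wrapped as follows. This is a minimal sketch: sync_changed is a hypothetical helper name, and combining --load-policy IF_CHANGED with --submit in one invocation is an assumption based on the flag descriptions.

```shell
# Hypothetical helper: refresh changed files and cache new ones for the given path.
sync_changed() {
  "${ALLUXIO_BIN:-bin/alluxio}" job load --path "$1" --load-policy IF_CHANGED --submit
}
```

A scheduler such as cron could call sync_changed s3://my-bucket/dataset/ once per day, after the upstream job that writes new checkpoints has finished.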
Integrating with ML Training
A typical workflow: load data → verify → run training.
Near-100% cache coverage: For critical datasets, run a second pass with --skip-if-exists after the first job reaches SUCCEEDED. In rare cases — transient worker failures or hash ring boundary timing — a single pass may miss a small fraction of files. A second pass fills those gaps without re-loading already-cached data:
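The workflow and the optional second pass can be sketched as one script. Several things here are assumptions: that --progress output contains the text "Job State: SUCCEEDED" or "Job State: FAILED", that --verify can be combined with --submit, and the helper names wait_for_load and load_then_train, which are hypothetical. The polling loop has no timeout, so add one before relying on it.

```shell
ALLUXIO_BIN="${ALLUXIO_BIN:-bin/alluxio}"

# Poll --progress until the job reports a terminal state.
# Assumption: the output includes "Job State: SUCCEEDED" or "Job State: FAILED".
wait_for_load() {
  wl_path=$1
  wl_interval=${2:-30}   # seconds between polls
  while true; do
    wl_out=$("$ALLUXIO_BIN" job load --path "$wl_path" --progress)
    case "$wl_out" in
      *"Job State: SUCCEEDED"*) return 0 ;;
      *"Job State: FAILED"*) printf '%s\n' "$wl_out" >&2; return 1 ;;
    esac
    sleep "$wl_interval"
  done
}

# First pass with --verify, second pass with --skip-if-exists, then run the
# training command (all remaining arguments) only if both passes succeeded.
load_then_train() {
  lt_path=$1
  shift
  "$ALLUXIO_BIN" job load --path "$lt_path" --submit --verify &&
    wait_for_load "$lt_path" &&
    "$ALLUXIO_BIN" job load --path "$lt_path" --submit --skip-if-exists &&
    wait_for_load "$lt_path" &&
    "$@"
}
```

For example, load_then_train s3://my-bucket/dataset/ python train.py starts training only after both passes report SUCCEEDED; train.py stands in for your own entry point.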
Failure Modes
Job State: FAILED with Files Failed > 0
Check the file-level failure list:
Common causes: UFS access errors, network timeouts, or missing credentials. Fix the underlying issue, then resubmit with --skip-if-exists to avoid re-loading already-cached files.
Job State: FAILED immediately after submit
Run --progress --verbose for detail:
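On a host install, that invocation might look like the sketch below. The helper name load_detail is hypothetical; the flags are the ones named above.

```shell
# Hypothetical helper: print detailed progress for the load job on the given path.
load_detail() {
  "${ALLUXIO_BIN:-bin/alluxio}" job load --path "$1" --progress --verbose
}
```

Call it with the same path that was used at submit time, for example load_detail s3://my-bucket/dataset/.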
Often caused by: path not found in mount table (verify with alluxio mount list), or insufficient cache quota.
Load succeeds but reads still go to UFS
Verify that specific files are actually cached:
If files show as uncached after a successful load, data may have been evicted. Check cache capacity and eviction settings — see Cache Eviction. For cluster-wide cache hit rate, see Monitoring.
Retention of Historical Jobs
Completed job records are kept for a configurable period. The default is 7 days. To adjust:
Related
Cache Eviction: manual job free, version update patterns, and automatic eviction policies
Job Service: job list, job states, coordinator HA, failure recovery, and configuration tuning
Multiple Replicas: load multiple copies per file for fault tolerance