# Cache Loading

Alluxio populates its cache in two ways: **passively** on first read (automatic, no setup) and **actively** via the `job load` command (explicit preload before your job runs).

## Prerequisites

* A running Alluxio cluster with at least one worker
* At least one UFS mount configured (verify with `alluxio mount list`)

{% hint style="info" %}
Alluxio will automatically evict cached data to make room for new data according to the configured eviction policy. You do not need to pre-clear space before submitting a load job.
{% endhint %}

## Passive Caching

On every cache miss, Alluxio fetches the file from UFS and writes it into the worker cache while streaming it to the application. No configuration is needed, and subsequent reads of the same file are served from cache.

This is the default behavior. Use active preloading when you cannot afford the first-read latency.

## Active Preloading with `job load`

`job load` submits a distributed load job: the coordinator distributes work across all workers, each pulling its assigned files from UFS directly. For scheduling internals, HA, and advanced tuning, see [Job Service](/ee-ai-en/ai-3.8-15.1.x/administration/managing-job-service.md).

### Submit and Monitor

`--path` accepts either a UFS path (e.g. `s3://my-bucket/dataset/`) or an Alluxio virtual path (e.g. `/mnt/dataset/`). See the [CLI reference](/ee-ai-en/ai-3.8-15.1.x/reference/user-cli.md#job-load) for details.

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
# Submit (returns immediately)
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --submit

# Monitor progress
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --progress
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
# Submit (returns immediately)
bin/alluxio job load --path <ufs-or-alluxio-path> --submit

# Monitor progress
bin/alluxio job load --path <ufs-or-alluxio-path> --progress
```

{% endtab %}
{% endtabs %}

Example progress output:

```console
Progress for loading path 's3://my-bucket/dataset/':
        Settings:       replicas: ALL  batch-size: 200  verify: false  metadata-only: false  quota-check: false
        Time start: 2026-04-15T22:05:01  Time finished: 2026-04-15T22:05:08  Time Elapsed: 7s
        Job State: SUCCEEDED
        Inodes Scanned: 1000  Non Empty File Copies Loaded: 1000
        Bytes Scanned: 125.00MiB  Bytes Loaded: 125.00MiB  Throughput: 17.86MiB/s
        File Failure rate: 0.00%  Subtask Failure rate: 0.00%
        Files Failed: 0  Subtask Retry rate: 0.00%  Subtasks on Retry Dead Letter Queue: 0
```
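When scripting around a load job, the `Job State` and `Files Failed` fields can be pulled out of the progress text. The following is a minimal sketch that assumes the output keeps the field labels shown in the example above; `job_state` and `files_failed` are hypothetical helper names, not part of the Alluxio CLI.

```shell
# Hypothetical helpers: parse `job load --progress` output read from stdin.
# They assume the "Job State:" and "Files Failed:" labels from the example output.

# Print the value that follows "Job State:" (for example SUCCEEDED).
job_state() {
  sed -n 's/^[[:space:]]*Job State:[[:space:]]*\([A-Z_]*\).*/\1/p' | head -n 1
}

# Print the integer that follows "Files Failed:".
files_failed() {
  sed -n 's/.*Files Failed:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Example use (Docker / bare-metal form):
#   bin/alluxio job load --path <ufs-or-alluxio-path> --progress | job_state
```

Parsing human-readable output is fragile by nature, so treat these helpers as a convenience and re-check them if the progress format changes between releases.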

### Stop a Running Job

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --stop
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job load --path <ufs-or-alluxio-path> --stop
```

{% endtab %}
{% endtabs %}

To resume a stopped job, submit it again with `--submit`. Include `--skip-if-exists` so that files already in the cache are skipped; without it, cached files are reloaded.

### Key Flags

| Flag                       | Description                                                                                                                                                                                                  |
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `--submit`                 | Submit the job asynchronously (returns immediately)                                                                                                                                                          |
| `--progress`               | Show progress of a submitted job                                                                                                                                                                             |
| `--stop`                   | Stop a running job                                                                                                                                                                                           |
| `--verify`                 | After load completes, re-check every file and reload any that are not fully cached.                                                                                                                          |
| `--replicas <n>`           | Load `n` replicas per file (default: 1); useful for high-concurrency reads                                                                                                                                   |
| `--skip-if-exists`         | Skip files that are already fully cached (safe to re-run a load job)                                                                                                                                         |
| `--load-policy IF_CHANGED` | Re-check each cached file against UFS metadata; reload only files whose content has changed. Use for incremental sync of mutable datasets.                                                                   |
| `--metadata-only`          | Load file metadata without caching file data                                                                                                                                                                 |
| `--batch-size <n>`         | Number of files per batch per worker. Default: 200. For small files (< 1 MB), increase to 2000–5000 for better throughput. For large files (> 100 MB), keep at 200 or lower to avoid worker memory pressure. |
| `--partial-listing`        | Start loading before the full directory listing completes; useful for very large directories                                                                                                                 |
| `--index-file <ufs-path>`  | Load a specific list of files defined in a UFS index file (one path per line)                                                                                                                                |

For the full flag reference, see [`job load` CLI documentation](/ee-ai-en/ai-3.8-15.1.x/reference/user-cli.md#job-load).
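The `--batch-size` guidance in the table can be turned into a small helper for scripts. This is only a sketch: `suggest_batch_size` is a hypothetical function, the 1 MB threshold is taken here as 1 MiB, and the value 2000 is simply the lower end of the suggested 2000–5000 range.

```shell
# Hypothetical helper: suggest a --batch-size from the average file size in bytes.
# Thresholds follow the table's guidance; they are a starting point, not a tuned value.
suggest_batch_size() {
  avg_bytes=$1
  small=$((1024 * 1024))   # 1 MiB, standing in for the "< 1 MB" rule
  if [ "$avg_bytes" -lt "$small" ]; then
    echo 2000              # small files: lower end of the 2000-5000 range
  else
    echo 200               # default; large files should stay at 200 or lower
  fi
}

# Example use:
#   bin/alluxio job load --path <ufs-or-alluxio-path> --submit \
#     --batch-size "$(suggest_batch_size 65536)"
```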

### Loading from an Index File

Use an index file to load a selected subset of files, or when the directory tree is too large to traverse upfront:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --index-file s3://my-bucket/load-manifest.txt --submit
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job load --index-file s3://my-bucket/load-manifest.txt --submit
```

{% endtab %}
{% endtabs %}

The index file lists one UFS path per line. Lines starting with `#` are treated as comments:

```text
s3://my-bucket/dataset/train/
s3://my-bucket/dataset/val/file.parquet
# s3://my-bucket/dataset/test/   <- skipped
```

Directories must end with `/` to be loaded recursively.
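An index file in this format can be generated from a list of relative paths. The sketch below assumes a hypothetical helper named `build_manifest`. The upload step in the comment assumes the AWS CLI is available, but any tool that writes the file to the UFS would do.

```shell
# Hypothetical helper: turn relative paths (one per line on stdin) into
# index-file lines under a UFS prefix. Blank input lines are dropped.
build_manifest() {
  prefix=${1%/}                            # strip one trailing slash from the prefix
  while IFS= read -r rel; do
    [ -n "$rel" ] || continue              # skip blank lines
    printf '%s/%s\n' "$prefix" "${rel#/}"  # keep a trailing "/" on directories
  done
}

# Example use:
#   printf 'train/\nval/file.parquet\n' \
#     | build_manifest s3://my-bucket/dataset > load-manifest.txt
#   aws s3 cp load-manifest.txt s3://my-bucket/load-manifest.txt   # assumes AWS CLI
```

Directory entries keep their trailing `/`, so they are still loaded recursively.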

### Incremental Load for Mutable Data

When the underlying dataset changes periodically (e.g., daily model checkpoints, updated training splits), use `--load-policy IF_CHANGED` to sync only the files that have changed since the last load:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --submit --load-policy IF_CHANGED
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job load --path <ufs-or-alluxio-path> --submit --load-policy IF_CHANGED
```

{% endtab %}
{% endtabs %}

`--load-policy IF_CHANGED` re-checks UFS metadata for each **already-cached** file and reloads it only if the content has changed. Files that are not yet cached are loaded unconditionally. This makes it the right choice for periodic sync of a mutable dataset: new files get cached, changed files get refreshed, and unchanged files are skipped.

| Flag                       | Cached files           | Uncached files |
| -------------------------- | ---------------------- | -------------- |
| `--submit` only            | Reload unconditionally | Load           |
| `--skip-if-exists`         | Skip                   | Load           |
| `--load-policy IF_CHANGED` | Reload only if changed | Load           |

## Integrating with ML Training

A typical workflow: load data → verify → run training.

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
# 1. Submit load
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path s3://my-bucket/dataset/ --submit --verify

# 2. Poll until SUCCEEDED
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path s3://my-bucket/dataset/ --progress
# Repeat until "Job State: SUCCEEDED", then launch training pods
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
# 1. Submit load
bin/alluxio job load --path s3://my-bucket/dataset/ --submit --verify

# 2. Wait until SUCCEEDED
bin/alluxio job load --path s3://my-bucket/dataset/ --progress
# Repeat until "Job State: SUCCEEDED"

# 3. Start training
python train.py --data /mnt/alluxio/fuse/dataset/
```

{% endtab %}
{% endtabs %}
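The "repeat until SUCCEEDED" step can be automated with a small polling loop. The following is a sketch under stated assumptions: `wait_for_load` is a hypothetical helper, it recognizes only the `SUCCEEDED` and `FAILED` states named on this page, and it treats any other state as "still in progress" until a poll limit is reached.

```shell
# Hypothetical helper: poll a progress command until the job reaches a terminal state.
# Usage: wait_for_load <max_polls> <interval_seconds> <progress command...>
# Returns 0 on SUCCEEDED, 1 on FAILED, 2 if max_polls is exhausted.
wait_for_load() {
  max_polls=$1
  interval=$2
  shift 2
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    # Run the progress command and extract the value after "Job State:".
    state=$("$@" 2>/dev/null \
      | sed -n 's/^[[:space:]]*Job State:[[:space:]]*\([A-Z_]*\).*/\1/p' \
      | head -n 1)
    case "$state" in
      SUCCEEDED) return 0 ;;
      FAILED)    return 1 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  return 2
}

# Example use (Docker / bare-metal form): poll every 30 s, up to 120 times.
#   wait_for_load 120 30 bin/alluxio job load --path s3://my-bucket/dataset/ --progress \
#     && python train.py --data /mnt/alluxio/fuse/dataset/
```

On Kubernetes the same helper can wrap the `kubectl exec ... alluxio job load ... --progress` command, since it only inspects the text the command prints.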

{% hint style="info" %}
**Near-100% cache coverage:** For critical datasets, run a second pass with `--skip-if-exists` after the first job reaches `SUCCEEDED`. In rare cases — transient worker failures or hash ring boundary timing — a single pass may miss a small fraction of files. A second pass fills those gaps without re-loading already-cached data:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --submit --skip-if-exists
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job load --path <ufs-or-alluxio-path> --submit --skip-if-exists
```

{% endtab %}
{% endtabs %}
{% endhint %}

## Failure Modes

**`Job State: FAILED` with `Files Failed > 0`**

Check the file-level failure list:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <path> --progress --file-status FAILURE
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job load --path <path> --progress --file-status FAILURE
```

{% endtab %}
{% endtabs %}

Common causes: UFS access errors, network timeouts, or missing credentials. Fix the underlying issue, then resubmit with `--skip-if-exists` to avoid re-loading already-cached files.

**`Job State: FAILED` immediately after submit**

Run the progress command with `--verbose` for more detail:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <path> --progress --verbose
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job load --path <path> --progress --verbose
```

{% endtab %}
{% endtabs %}

Common causes: the path is not in the mount table (verify with `alluxio mount list`), or the cache quota is insufficient.

**Load succeeds but reads still go to UFS**

Verify that specific files are actually cached:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio fs check-cached <path>
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio fs check-cached <path>
```

{% endtab %}
{% endtabs %}

If files show as uncached after a successful load, data may have been evicted. Check cache capacity and eviction settings — see [Cache Eviction](/ee-ai-en/ai-3.8-15.1.x/cache/removing-data-from-the-cache.md). For cluster-wide cache hit rate, see [Monitoring](/ee-ai-en/ai-3.8-15.1.x/administration/monitoring-alluxio.md).

## Retention of Historical Jobs

Completed job records are kept for a configurable period. The default is 7 days. To adjust:

```properties
# retain completed job records for 3 days (default: 7d)
alluxio.job.retention.time=3d
```

## Related

* [Cache Eviction](/ee-ai-en/ai-3.8-15.1.x/cache/removing-data-from-the-cache.md) — manual `job free`, version update patterns, and automatic eviction policies
* [Job Service](/ee-ai-en/ai-3.8-15.1.x/administration/managing-job-service.md) — `job list`, job states, coordinator HA, failure recovery, and configuration tuning
* [Multiple Replicas](/ee-ai-en/ai-3.8-15.1.x/high-availability/multiple-replicas.md) — load multiple copies per file for fault tolerance
* [`job load` CLI Reference](/ee-ai-en/ai-3.8-15.1.x/reference/user-cli.md#job-load)


