# Cache Loading

Alluxio populates its cache in two ways: **passively** on first read (automatic, no setup) and **actively** via the `job load` command (explicit preload before your job runs).

## Prerequisites

* A running Alluxio cluster with at least one worker
* At least one UFS mount configured (`alluxio mount list` to verify)

{% hint style="info" %}
Alluxio will automatically evict cached data to make room for new data according to the configured eviction policy. You do not need to pre-clear space before submitting a load job.
{% endhint %}

## Passive Caching

On every cache miss, Alluxio fetches the file from UFS and writes it into the worker cache while streaming it to the application. No configuration needed — subsequent reads are served from cache.

This is the default behavior. Use active preloading when you cannot afford the first-read latency.
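
To see the effect, time a cold read against a warm one. This sketch assumes a FUSE mount at `/mnt/alluxio/fuse/` and an illustrative file path; substitute your own:

```shell
# First read misses the cache: data is fetched from UFS and cached as it streams
time cat /mnt/alluxio/fuse/dataset/part-0.parquet > /dev/null

# Second read is served from the worker cache and should be noticeably faster
time cat /mnt/alluxio/fuse/dataset/part-0.parquet > /dev/null
```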

## Active Preloading with `job load`

`job load` submits a distributed load job: the coordinator distributes work across all workers, each pulling its assigned files from UFS directly. For scheduling internals, HA, and advanced tuning, see [Job Service](https://documentation.alluxio.io/ee-ai-en/administration/managing-job-service).

### Submit and Monitor

`--path` accepts either a UFS path (e.g. `s3://my-bucket/dataset/`) or an Alluxio virtual path (e.g. `/mnt/dataset/`). See the [CLI reference](https://documentation.alluxio.io/ee-ai-en/reference/user-cli#job-load) for details.

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
# Submit (returns immediately)
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --submit

# Monitor progress
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --progress
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
# Submit (returns immediately)
bin/alluxio job load --path <ufs-or-alluxio-path> --submit

# Monitor progress
bin/alluxio job load --path <ufs-or-alluxio-path> --progress
```

{% endtab %}
{% endtabs %}

Example progress output:

```console
Progress for loading path 's3://my-bucket/dataset/':
        Settings:       replicas: ALL  batch-size: 200  verify: false  metadata-only: false  quota-check: false
        Time start: 2026-04-15T22:05:01  Time finished: 2026-04-15T22:05:08  Time Elapsed: 7s
        Job State: SUCCEEDED
        Inodes Scanned: 1000  Non Empty File Copies Loaded: 1000
        Bytes Scanned: 125.00MiB  Bytes Loaded: 125.00MiB  Throughput: 17.86MiB/s
        File Failure rate: 0.00%  Subtask Failure rate: 0.00%
        Files Failed: 0  Subtask Retry rate: 0.00%  Subtasks on Retry Dead Letter Queue: 0
```

### Stop a Running Job

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --stop
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job load --path <ufs-or-alluxio-path> --stop
```

{% endtab %}
{% endtabs %}

A stopped job can be resumed by submitting it again with `--submit`; include `--skip-if-exists` to avoid re-loading files that are already cached.
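
For example, to resume a previously stopped load without re-copying cached files (bare-metal form shown; the bucket path is illustrative):

```shell
# Resubmit the same path; files that are already fully cached are skipped
bin/alluxio job load --path s3://my-bucket/dataset/ --skip-if-exists --submit
```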

### Key Flags

| Flag                      | Description                                                                                  |
| ------------------------- | -------------------------------------------------------------------------------------------- |
| `--submit`                | Submit the job asynchronously (returns immediately)                                          |
| `--progress`              | Show progress of a submitted job                                                             |
| `--stop`                  | Stop a running job                                                                           |
| `--verify`                | After the load completes, re-check every file and reload any that are not fully cached       |
| `--replicas <n>`          | Load `n` replicas per file (default: 1); useful for high-concurrency reads                   |
| `--skip-if-exists`        | Skip files that are already fully cached (safe to re-run a load job)                         |
| `--metadata-only`         | Load file metadata without caching file data                                                 |
| `--batch-size <n>`        | Number of files per batch per worker; tune for large directories                             |
| `--partial-listing`       | Start loading before the full directory listing completes; useful for very large directories |
| `--index-file <ufs-path>` | Load a specific list of files defined in a UFS index file (one path per line)                |

For the full flag reference, see [`job load` CLI documentation](https://documentation.alluxio.io/ee-ai-en/reference/user-cli#job-load).
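
Flags can be combined in a single submission. A sketch for preloading a hot dataset ahead of a high-concurrency job (bare-metal form shown; the path and values are illustrative, not tuned recommendations):

```shell
# 3 cached copies per file, larger batches, start loading before listing completes
bin/alluxio job load --path s3://my-bucket/dataset/ \
  --replicas 3 --batch-size 500 --partial-listing --submit
```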

### Loading from an Index File

For selective loading, or when the directory tree is too large to traverse up front:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --index-file s3://my-bucket/load-manifest.txt --submit
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job load --index-file s3://my-bucket/load-manifest.txt --submit
```

{% endtab %}
{% endtabs %}

Index file format: one UFS path per line; lines starting with `#` are treated as comments:

```
s3://my-bucket/dataset/train/
s3://my-bucket/dataset/val/file.parquet
# s3://my-bucket/dataset/test/   <- skipped
```

Directories must end with `/` to be loaded recursively.
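
Before submitting, it can help to preview which entries a manifest will actually load. This is a hypothetical helper, not part of the Alluxio CLI; it simply applies the format rules above (comment lines and blank lines are ignored):

```shell
# Print the effective entries of a manifest: drop '#' comment lines and blank lines
manifest_entries() {
  grep -v '^[[:space:]]*#' "$1" | grep -v '^[[:space:]]*$'
}

# Usage: manifest_entries load-manifest.txt
```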

## Integrating with ML Training

A typical workflow: load data → verify → run training.

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
# 1. Submit load
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path s3://my-bucket/dataset/ --submit --verify

# 2. Poll until SUCCEEDED
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path s3://my-bucket/dataset/ --progress
# Repeat until "Job State: SUCCEEDED", then launch training pods
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
# 1. Submit load
bin/alluxio job load --path s3://my-bucket/dataset/ --submit --verify

# 2. Wait until SUCCEEDED
bin/alluxio job load --path s3://my-bucket/dataset/ --progress
# Repeat until "Job State: SUCCEEDED"

# 3. Start training
python train.py --data /mnt/alluxio/fuse/dataset/
```

{% endtab %}
{% endtabs %}
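
The "repeat until `Job State: SUCCEEDED`" step can be scripted. A minimal sketch, assuming the bare-metal layout and that `--progress` output contains a `Job State:` line as in the example above (the function names are hypothetical):

```shell
# Extract the value of the "Job State:" line from progress output on stdin
parse_job_state() {
  sed -n 's/.*Job State: \([A-Z_]*\).*/\1/p'
}

# Poll every 10s until the load job for path $1 reaches a terminal state
wait_for_load() {
  while true; do
    state="$(bin/alluxio job load --path "$1" --progress | parse_job_state)"
    case "$state" in
      SUCCEEDED) return 0 ;;
      FAILED)    return 1 ;;
      *)         sleep 10 ;;
    esac
  done
}

# Usage: wait_for_load s3://my-bucket/dataset/ && python train.py --data /mnt/alluxio/fuse/dataset/
```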

## Failure Modes

**`Job State: FAILED` with `Files Failed > 0`**

Check the file-level failure list:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <path> --progress --file-status FAILURE
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job load --path <path> --progress --file-status FAILURE
```

{% endtab %}
{% endtabs %}

Common causes: UFS access errors, network timeouts, or missing credentials. Fix the underlying issue, then resubmit with `--skip-if-exists` to avoid re-loading already-cached files.

**`Job State: FAILED` immediately after submit**

Run `--progress --verbose` for detail:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <path> --progress --verbose
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job load --path <path> --progress --verbose
```

{% endtab %}
{% endtabs %}

Common causes: the path does not match any entry in the mount table (verify with `alluxio mount list`), or the cache quota is insufficient.

**Load succeeds but reads still go to UFS**

Verify that specific files are actually cached:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio fs check-cached <path>
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio fs check-cached <path>
```

{% endtab %}
{% endtabs %}

If files show as uncached after a successful load, data may have been evicted. Check cache capacity and eviction settings — see [Cache Eviction](https://documentation.alluxio.io/ee-ai-en/cache/removing-data-from-the-cache). For cluster-wide cache hit rate, see [Monitoring](https://documentation.alluxio.io/ee-ai-en/administration/monitoring-alluxio).

## Retention of Historical Jobs

Completed job records are retained for a configurable period (default: 7 days). To adjust:

```properties
# retain completed job records for 3 days (default: 7d)
alluxio.job.retention.time=3d
```

## Related

* [Cache Eviction](https://documentation.alluxio.io/ee-ai-en/cache/removing-data-from-the-cache) — manual `job free`, version update patterns, and automatic eviction policies
* [Job Service](https://documentation.alluxio.io/ee-ai-en/administration/managing-job-service) — `job list`, job states, coordinator HA, failure recovery, and configuration tuning
* [Multiple Replicas](https://documentation.alluxio.io/ee-ai-en/high-availability/multiple-replicas) — load multiple copies per file for fault tolerance
* [`job load` CLI Reference](https://documentation.alluxio.io/ee-ai-en/reference/user-cli#job-load)
