# Cache Loading

Alluxio populates its cache in two ways: **passively** on first read (automatic, no setup) and **actively** via the `job load` command (explicit preload before your job runs).

## Prerequisites

* A running Alluxio cluster with at least one worker
* At least one UFS mount configured (verify with `alluxio mount list`)

{% hint style="info" %}
Alluxio will automatically evict cached data to make room for new data according to the configured eviction policy. You do not need to pre-clear space before submitting a load job.
{% endhint %}

## Passive Caching

On every cache miss, Alluxio fetches the file from UFS and writes it into the worker cache while streaming it to the application. No configuration needed — subsequent reads are served from cache.

This is the default behavior. Use active preloading when you cannot afford the first-read latency.

## Active Preloading with `job load`

`job load` submits a distributed load job: the coordinator distributes work across all workers, each pulling its assigned files from UFS directly. For scheduling internals, HA, and advanced tuning, see [Job Service](/ee-ai-en/administration/managing-job-service.md).

### Submit and Monitor

`--path` accepts either a UFS path (e.g. `s3://my-bucket/dataset/`) or an Alluxio virtual path (e.g. `/mnt/dataset/`). See the [CLI reference](/ee-ai-en/reference/user-cli.md#job-load) for details.

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
# Submit (returns immediately)
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --submit

# Monitor progress
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --progress
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
# Submit (returns immediately)
bin/alluxio job load --path <ufs-or-alluxio-path> --submit

# Monitor progress
bin/alluxio job load --path <ufs-or-alluxio-path> --progress
```

{% endtab %}
{% endtabs %}

Example progress output:

```console
Progress for loading path 's3://my-bucket/dataset/':
        Settings:       replicas: ALL  batch-size: 200  verify: false  metadata-only: false  quota-check: false
        Time start: 2026-04-15T22:05:01  Time finished: 2026-04-15T22:05:08  Time Elapsed: 7s
        Job State: SUCCEEDED
        Inodes Scanned: 1000  Non Empty File Copies Loaded: 1000
        Bytes Scanned: 125.00MiB  Bytes Loaded: 125.00MiB  Throughput: 17.86MiB/s
        File Failure rate: 0.00%  Subtask Failure rate: 0.00%
        Files Failed: 0  Subtask Retry rate: 0.00%  Subtasks on Retry Dead Letter Queue: 0
```
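The progress output is line-oriented, which makes it easy to scrape in automation. A minimal sketch using `awk` (the variable names are ours; the sample text is a trimmed copy of the output above, standing in for a real `--progress` invocation):

```shell
# Scrape "Job State" and "Files Failed" from captured `job load --progress` output.
# In a real script, $progress would come from the CLI invocation shown above.
progress='Job State: SUCCEEDED
Files Failed: 0  Subtask Retry rate: 0.00%'

# Split on ": " so the value after "Job State:" lands in field 2.
job_state=$(printf '%s\n' "$progress" | awk -F': *' '/Job State:/ {print $2; exit}')
# With the default whitespace separator, the count after "Files Failed:" is field 3.
files_failed=$(printf '%s\n' "$progress" | awk '/Files Failed:/ {print $3; exit}')

echo "state=$job_state failed=$files_failed"   # state=SUCCEEDED failed=0
```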

### Stop a Running Job

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --stop
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job load --path <ufs-or-alluxio-path> --stop
```

{% endtab %}
{% endtabs %}

A stopped job can be resumed by submitting it again with `--submit`; include `--skip-if-exists` to skip files that were already cached before the stop.

### Key Flags

| Flag                      | Description                                                                                  |
| ------------------------- | -------------------------------------------------------------------------------------------- |
| `--submit`                | Submit the job asynchronously (returns immediately)                                          |
| `--progress`              | Show progress of a submitted job                                                             |
| `--stop`                  | Stop a running job                                                                           |
| `--verify`                | After the load completes, re-check every file and reload any that are not fully cached       |
| `--replicas <n>`          | Load `n` replicas per file (default: 1); useful for high-concurrency reads                   |
| `--skip-if-exists`        | Skip files that are already fully cached (safe to re-run a load job)                         |
| `--metadata-only`         | Load file metadata without caching file data                                                 |
| `--batch-size <n>`        | Number of files per batch per worker; tune for large directories                             |
| `--partial-listing`       | Start loading before the full directory listing completes; useful for very large directories |
| `--index-file <ufs-path>` | Load a specific list of files defined in a UFS index file (one path per line)                |

For the full flag reference, see [`job load` CLI documentation](/ee-ai-en/reference/user-cli.md#job-load).

### Loading from an Index File

For selective loading or when the directory tree is too large to traverse upfront:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --index-file s3://my-bucket/load-manifest.txt --submit
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job load --index-file s3://my-bucket/load-manifest.txt --submit
```

{% endtab %}
{% endtabs %}

Index file format — one UFS path per line, lines starting with `#` are comments:

```
s3://my-bucket/dataset/train/
s3://my-bucket/dataset/val/file.parquet
# s3://my-bucket/dataset/test/   <- skipped
```

Directories must end with `/` to be loaded recursively.
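A manifest can be sanity-checked locally before submitting. The sketch below is plain shell with no Alluxio involved (the file name is hypothetical); it applies the comment filtering described above, plus blank-line filtering as an extra local precaution, to show which entries the job would actually load:

```shell
# Write the example manifest from above to a local file.
manifest=load-manifest.txt
cat > "$manifest" <<'EOF'
s3://my-bucket/dataset/train/
s3://my-bucket/dataset/val/file.parquet
# s3://my-bucket/dataset/test/   <- skipped
EOF

# Drop comment lines and blank lines to see the effective entries.
entries=$(grep -v '^[[:space:]]*#' "$manifest" | grep -v '^[[:space:]]*$')
count=$(printf '%s\n' "$entries" | grep -c '^')

printf '%s\n' "$entries"
echo "will load $count entries"   # will load 2 entries
```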

## Integrating with ML Training

A typical workflow: load data → verify → run training.

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
# 1. Submit load
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path s3://my-bucket/dataset/ --submit --verify

# 2. Poll until SUCCEEDED
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path s3://my-bucket/dataset/ --progress
# Repeat until "Job State: SUCCEEDED", then launch training pods
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
# 1. Submit load
bin/alluxio job load --path s3://my-bucket/dataset/ --submit --verify

# 2. Wait until SUCCEEDED
bin/alluxio job load --path s3://my-bucket/dataset/ --progress
# Repeat until "Job State: SUCCEEDED"

# 3. Start training
python train.py --data /mnt/alluxio/fuse/dataset/
```

{% endtab %}
{% endtabs %}
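The "poll until SUCCEEDED" step can be wrapped in a small loop. A sketch under our own naming (`wait_for_load` is not an Alluxio command; it takes the `--progress` invocation to run as its arguments):

```shell
# Poll a `job load --progress` command until the job reaches a terminal state.
# Returns 0 on SUCCEEDED, 1 on FAILED.
wait_for_load() {
  while true; do
    state=$("$@" | awk -F': *' '/Job State:/ {print $2; exit}')
    case "$state" in
      SUCCEEDED) return 0 ;;
      FAILED)    return 1 ;;
    esac
    sleep 10   # back off between polls while the job is still running
  done
}

# Usage with the real CLI (not executed here):
#   wait_for_load bin/alluxio job load --path s3://my-bucket/dataset/ --progress &&
#     python train.py --data /mnt/alluxio/fuse/dataset/
```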

## Failure Modes

**`Job State: FAILED` with `Files Failed > 0`**

Check the file-level failure list:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <path> --progress --file-status FAILURE
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job load --path <path> --progress --file-status FAILURE
```

{% endtab %}
{% endtabs %}

Common causes: UFS access errors, network timeouts, or missing credentials. Fix the underlying issue, then resubmit with `--skip-if-exists` to avoid re-loading already-cached files.

**`Job State: FAILED` immediately after submit**

Run `--progress --verbose` for more detail:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <path> --progress --verbose
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job load --path <path> --progress --verbose
```

{% endtab %}
{% endtabs %}

Often caused by: path not found in mount table (verify with `alluxio mount list`), or insufficient cache quota.

**Load succeeds but reads still go to UFS**

Verify that specific files are actually cached:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio fs check-cached <path>
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio fs check-cached <path>
```

{% endtab %}
{% endtabs %}

If files show as uncached after a successful load, data may have been evicted. Check cache capacity and eviction settings — see [Cache Eviction](/ee-ai-en/cache/removing-data-from-the-cache.md). For cluster-wide cache hit rate, see [Monitoring](/ee-ai-en/administration/monitoring-alluxio.md).

## Retention of Historical Jobs

Completed job records are retained for a configurable period (default: 7 days). To adjust:

```properties
# retain completed job records for 3 days (default: 7d)
alluxio.job.retention.time=3d
```

## Related

* [Cache Eviction](/ee-ai-en/cache/removing-data-from-the-cache.md) — manual `job free`, version update patterns, and automatic eviction policies
* [Job Service](/ee-ai-en/administration/managing-job-service.md) — `job list`, job states, coordinator HA, failure recovery, and configuration tuning
* [Multiple Replicas](/ee-ai-en/high-availability/multiple-replicas.md) — load multiple copies per file for fault tolerance
* [`job load` CLI Reference](/ee-ai-en/reference/user-cli.md#job-load)

