# Job Service

If you've already used `job load` or `job free`, you've been interacting with the Job Service — it's the distributed scheduling engine that drives those operations. This page explains how it works internally, how to monitor jobs in flight, and how to configure it for high availability and production scale.

For operation-specific commands, see:

* [`job load`](/ee-ai-en/ai-3.8-15.1.x/cache/loading-data-into-the-cache.md) — preloading data into the cache
* [`job free`](/ee-ai-en/ai-3.8-15.1.x/cache/removing-data-from-the-cache.md) — releasing cached data

## 1. Architecture

A **job** is a named, persistent, distributed operation submitted to Alluxio — for example, loading a dataset into cache or freeing space across all workers. Every job follows the same lifecycle:

**WAITING** → **RUNNING** → **SUCCEEDED** / **FAILED** / **STOPPED**

Once submitted, a job enters the waiting queue and is written to the Job Store before any scheduling happens. Whether that state survives a coordinator loss depends on the Job Store backend — see [Section 3](#3-failure-recovery). The scheduler then claims the job, breaks it into per-worker tasks, and tracks completion. See [Section 2](#2-listing-and-monitoring-jobs) for the full state reference.

The Job Service has three components that drive this lifecycle:

* **Coordinator** — a stateless scheduler that polls the job queue, claims jobs atomically, partitions them into file-batch tasks, and dispatches those tasks to workers. Multiple coordinators can run simultaneously; any instance can handle any job.
* **Job Store** — durable storage for job state. By default this is a local RocksDB instance on the coordinator node. In HA mode, all coordinators share an etcd cluster.
* **Workers** — execute the tasks dispatched by the coordinator: reading data from UFS and writing it to local cache (`load`), or evicting cached blocks (`free`).

{% hint style="warning" %}
The Coordinator is **not** the same as the Master in open-source Alluxio. Alluxio Enterprise Edition has replaced the Master with the Coordinator — a purpose-built stateless scheduler. There is no Master component in the enterprise edition. See [Key Components](/ee-ai-en/ai-3.8-15.1.x/how-alluxio-works.md#key-components) for an overview.
{% endhint %}

## 2. Listing and Monitoring Jobs

### Job Identity

The default way to address a job is by its **(type, path)** pair — used throughout the CLI (`--path`, `--type`). Internally every job also has a UUID (visible in `job list` output), and some REST endpoints accept it directly.

### Listing Jobs

To see all jobs across the cluster:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job list --job-type ALL --job-state ALL
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job list --job-type ALL --job-state ALL
```

{% endtab %}
{% endtabs %}

Use `--job-state` to filter by state and `--job-type` to filter by type (`LOAD`, `FREE`, `COPY`, `MOVE`). In HA mode, `--coordinator <address>` lists only jobs owned by a specific coordinator. Each entry in the output shows the job's `(type, path)` — use those values to query progress or stop the job.

### Querying a Specific Job

Once you have the `(type, path)` of a job from `job list`, query or control it:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
# Query progress (load)
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <path> --progress

# Stop a running job
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <path> --stop
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
# Query progress (load)
bin/alluxio job load --path <path> --progress

# Stop a running job
bin/alluxio job load --path <path> --stop
```

{% endtab %}
{% endtabs %}

The same pattern applies to other job types: `alluxio job free --path <path> --progress`, etc.

### Job States

| State         | Description                                                                |
| ------------- | -------------------------------------------------------------------------- |
| **WAITING**   | Queued in the Job Store; not yet claimed by a coordinator.                 |
| **RUNNING**   | Coordinator is dispatching tasks to workers. Monitor with `--progress`.    |
| **VERIFYING** | All tasks done; coordinator is running a verification pass (`--verify`).   |
| **SUCCEEDED** | All tasks completed successfully.                                          |
| **FAILED**    | Error threshold exceeded. Inspect with `--progress --file-status FAILURE`. |
| **STOPPED**   | Manually stopped via `--stop`.                                             |

State transitions are atomic and at-most-once, so consistency is guaranteed even if multiple coordinators are active simultaneously.

## 3. Failure Recovery

Recovery behavior depends on the Job Store backend:

* **RocksDB (single coordinator, default):** Job state is written to RocksDB on the coordinator's local filesystem. If the RocksDB directory is on persistent storage (e.g., a PVC on Kubernetes — see [Production Setup](https://documentation.alluxio.io/ee-ai-en/ai-3.8-15.1.x/administration/pages/N5GnsW3SKizbYRzLVqUj#id-1.-production-setup)), the state survives process crashes and pod restarts attached to the same volume. If the coordinator node is permanently lost or the metastore is backed by non-persistent storage, the queue and in-flight job state are lost along with it.
* **etcd (HA mode):** Job state lives in etcd, independent of any individual coordinator. A coordinator can be lost permanently and the remaining instances continue processing from the same shared queue.

### Automatic Recovery (HA Mode)

In HA mode, jobs actively managed by a crashed coordinator are recovered automatically by its peers:

* **Detection:** Active coordinators detect "stale" jobs in the **RUNNING** state that belong to an instance no longer sending heartbeats.
* **Re-queueing:** These orphaned jobs are reverted to **WAITING** so another healthy coordinator can pick them up.

In single-coordinator mode (RocksDB), there is no peer to perform detection — recovery happens when the coordinator process itself restarts and re-reads its local RocksDB.

### Worker Failure

Tasks running in a worker's thread pool are lost if the worker restarts mid-execution. The coordinator detects the stale tasks and retries each failed task. If retries are exhausted, the task is marked as failed and may eventually cause the job to transition to **FAILED** if the error threshold is exceeded.

### Manual Recovery (`job rerun`)

{% hint style="warning" %}
`job rerun` requires **etcd-based HA mode** (`alluxio.coordinator.job.meta.store.type=ETCD`). It will fail with `UnavailableException: Job ETCD is not enabled` in single-coordinator RocksDB mode. For RocksDB deployments, recovery happens automatically when the coordinator process restarts and re-reads local RocksDB.
{% endhint %}

In scenarios where a job gets stuck due to a complex failure (e.g., network partition or complete cluster restart), administrators can manually re-queue it:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
# Rerun a specific load job that might be stuck
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job rerun --path /data/my-dataset --type load

# Rerun all jobs assigned to a specific (failed) coordinator
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job rerun --coordinator <failed_coordinator_host>
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
# Rerun a specific load job that might be stuck
bin/alluxio job rerun --path /data/my-dataset --type load

# Rerun all jobs assigned to a specific (failed) coordinator
bin/alluxio job rerun --coordinator <failed_coordinator_host>
```

{% endtab %}
{% endtabs %}

## 4. High Availability

Even though the coordinator is never in the I/O critical path, a single coordinator backed by local RocksDB can still become a single point of failure. During a coordinator node upgrade — hardware replacement or a software update — `load` and `free` jobs cannot be submitted or make progress until the node comes back online.

To eliminate this, deploy multiple coordinator instances. Multiple schedulers sharing the same job queue require a distributed state store: this is where **etcd** comes in. All coordinators point to the same etcd cluster, so any instance can pick up and process jobs from the shared queue.

* **Stateless Coordinators:** Since coordinators do not hold local state, any instance can handle any job — there is no "leader" election for scheduling.
* **Fault Tolerance:** If one coordinator fails, the others continue processing the shared queue without interruption.
* **Failover:** A failed coordinator's in-flight jobs are detected as stale and re-queued within seconds by the remaining instances.
* **Load Balancing:** All coordinators consume from the same shared global job queue — work is naturally distributed. Polling intervals include random jitter to prevent thundering-herd bursts.

### Configuring HA

The key requirement is setting `alluxio.coordinator.job.meta.store.type=ETCD` on every coordinator node to switch job state storage from local RocksDB to the shared cluster.

{% tabs %}
{% tab title="Kubernetes (Operator)" %}
Set `coordinator.count` to deploy multiple replicas. The Operator provisions and manages etcd automatically.

```yaml
spec:
  coordinator:
    count: 3
  properties:
    alluxio.coordinator.job.meta.store.type: "ETCD"
```

```shell
kubectl apply -f alluxio-cluster.yaml
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}
Add to `alluxio-site.properties` on **every** coordinator node, then start each coordinator process:

```properties
alluxio.coordinator.job.meta.store.type=ETCD
```

```shell
bin/alluxio process start coordinator
```

{% endtab %}
{% endtabs %}

#### Optional: Dedicated etcd cluster for job scheduling

By default the coordinator stores job metadata in the main Alluxio etcd cluster (`alluxio.etcd.endpoints`). For workloads with heavy job-submission traffic, you can route job metadata to a **separate** etcd cluster to isolate it from service-discovery traffic:

```properties
# Defaults to alluxio.etcd.endpoints if unset
alluxio.coordinator.scheduler.load.job.etcd.endpoints=http://job-etcd1:2379,http://job-etcd2:2379
```

This applies to both Kubernetes and Docker / Bare-Metal deployments. On Kubernetes, add the property under `spec.properties` in the AlluxioCluster CR; on Docker / Bare-Metal, add it to `alluxio-site.properties` on every coordinator node.

### Verifying HA

After the coordinators come online, confirm that they are actually sharing the etcd-backed job queue.

**1. All coordinator replicas are running.**

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl -n <NAMESPACE> get pods -l app.kubernetes.io/component=coordinator
```

Expected: `coordinator.count` pods, all `Running` with `READY 1/1`.
{% endtab %}

{% tab title="Docker / Bare-Metal" %}
On each coordinator host:

```shell
docker ps --filter name=alluxio-coordinator
```

Expected: the container is `Up` on every host.
{% endtab %}
{% endtabs %}

**2. Coordinators share the same job queue.** Submit a job through one coordinator and query it from a different one — in HA mode both answer from the same etcd-backed queue.

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
# Submit via coordinator-0
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path /<mount>/<path> --submit

# Query from coordinator-1
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-1 -- \
  alluxio job load --path /<mount>/<path> --progress
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
# From coordinator-1 host
bin/alluxio job load --path /<mount>/<path> --submit

# From coordinator-2 host
bin/alluxio job load --path /<mount>/<path> --progress
```

{% endtab %}
{% endtabs %}

Expected: both commands return the same job state. In single-coordinator (RocksDB) mode, the second query would report the job as not found.

**3. Failover test** (optional). Stop the coordinator currently driving a long-running job; a peer picks it up within `alluxio.coordinator.failure.detection.timeout` (default 15 s). Re-query `alluxio job load --path ... --progress` from a surviving coordinator and confirm the job advances past the failure point and eventually reaches `SUCCEEDED`.

## 5. Configuration and Tuning

You can fine-tune the Job Service behavior for your specific workload and etcd capacity.

| Property                                               | Default          | Description                                                                                                                                                                          |
| ------------------------------------------------------ | ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `alluxio.coordinator.failure.detection.timeout`        | `15s`            | Time before a silent coordinator is considered failed and its jobs reclaimed by peers.                                                                                               |
| `alluxio.coordinator.scheduler.etcd.max.retries`       | `2`              | Max retries for etcd operations before giving up.                                                                                                                                    |
| `alluxio.coordinator.scheduler.etcd.read.timeout`      | `10s`            | Timeout for reading from etcd.                                                                                                                                                       |
| `alluxio.coordinator.scheduler.etcd.write.timeout`     | `15s`            | Timeout for writing to etcd.                                                                                                                                                         |
| `alluxio.coordinator.scheduler.interval.time`          | `2s`             | Scheduling interval. A shorter interval allows faster scheduling of small jobs but increases CPU overhead.                                                                           |
| `alluxio.coordinator.scheduler.job.etcd.path`          | `/alluxio/jobs/` | The prefix directory in etcd used for storing job metadata.                                                                                                                          |
| `alluxio.coordinator.scheduler.job.etcd.poll.interval` | `1s`             | How often coordinators check etcd for new waiting jobs. Lower values mean faster response but higher load on etcd.                                                                   |
| `alluxio.coordinator.scheduler.max.tasks.per.worker`   | `3`              | Maximum number of tasks each coordinator can schedule simultaneously on the same worker. Increasing this raises concurrency but risks creating a backlog on the worker.              |
| `alluxio.coordinator.scheduler.running.job.capacity`   | `100`            | Maximum number of jobs the scheduler can run concurrently. Jobs beyond this limit remain in the waiting queue.                                                                       |
| `alluxio.coordinator.scheduler.waiting.job.capacity`   | `300000`         | Maximum number of jobs that can wait in the scheduler queue. Submissions beyond this limit are rejected.                                                                             |
| `alluxio.job.batch.size`                               | `200`            | Number of files per task dispatched to a worker. Larger batches increase worker concurrency but may exceed the worker's thread pool size. Override per job via CLI (`--batch-size`). |
| `alluxio.job.cleanup.interval`                         | `1H`             | Interval for periodic cleanup of finished jobs from the job store.                                                                                                                   |
| `alluxio.job.retention.time`                           | `7d`             | Job history retention time.                                                                                                                                                          |
| `alluxio.worker.job.load.executor.threads.max`         | `2048`           | Maximum number of threads in the worker's job executor pool for parallel UFS data loading.                                                                                           |
| `alluxio.worker.load.file.partition.size`              | `256MiB`         | Shard size for splitting large files during loading. Smaller shards increase concurrency but may create more overhead.                                                               |

### Small File Optimization

When loading datasets with many small files (< 1 MB each), the default batch size of 200 files per task often leaves worker threads idle — each task completes quickly and the coordinator must dispatch the next batch before workers can fully saturate their thread pools.

Increase the batch size to keep workers busy:

```properties
# For small-file datasets — increase from default 200 (set in alluxio-site.properties)
alluxio.job.batch.size=2000
```

Or override per job without restarting the cluster:

```shell
bin/alluxio job load --path s3://bucket/small-files/ --submit --batch-size 2000
```

In practice, small-file load throughput is typically limited by the **S3 request rate (IOPS)** rather than Alluxio's internal concurrency. Once the S3 IOPS ceiling is reached, increasing `alluxio.worker.job.load.executor.threads.max` beyond its default of 2048 or adding more workers will not improve throughput — the bottleneck has shifted to the object store.

For large files (> 100 MB), keep batch size at 200 or lower. Large files are already split into `alluxio.worker.load.file.partition.size` (256 MiB) shards internally, so each task generates multiple concurrent subtasks — a large batch size can create excessive memory pressure on the worker.

## 6. Metrics

{% hint style="info" %}
These metrics are low-level indicators of internal scheduler operations. For general cluster health and job progress, refer to the default Grafana dashboard.
{% endhint %}

| Metric                              | Description                                                                   |
| ----------------------------------- | ----------------------------------------------------------------------------- |
| `alluxio_scheduler_etcd_poll`       | Counter of poll operations to etcd.                                           |
| `alluxio_scheduler_etcd_claim`      | Counter of jobs successfully claimed from the queue.                          |
| `alluxio_scheduler_etcd_reclaim`    | Counter of jobs reclaimed/recovered from failed coordinators.                 |
| `alluxio_scheduler_etcd_latency_ms` | Latency distribution of etcd operations performed by the scheduler.           |
| `alluxio_scheduler_job_access_etcd` | Counter of various job-related access types (list, get, transaction) to etcd. |

## See Also

* [Cache Loading](/ee-ai-en/ai-3.8-15.1.x/cache/loading-data-into-the-cache.md) — job submission, monitoring, and free operations
* [Installing on Kubernetes](/ee-ai-en/ai-3.8-15.1.x/start/installing-on-kubernetes.md) — ETCD configuration and multi-coordinator deployment
* [Cluster Management](/ee-ai-en/ai-3.8-15.1.x/administration/managing-alluxio.md) — resource tuning, node pinning, persistent metastore
* [Monitoring](/ee-ai-en/ai-3.8-15.1.x/administration/monitoring-alluxio.md) — Grafana dashboards, alert rules, metrics reference


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://documentation.alluxio.io/ee-ai-en/ai-3.8-15.1.x/administration/managing-job-service.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
