Job Service

If you've already used job load or job free, you've been interacting with the Job Service — it's the distributed scheduling engine that drives those operations. This page explains how it works internally, how to monitor jobs in flight, and how to configure it for high availability and production scale.

For operation-specific commands, see:

1. Architecture

A job is a named, persistent, distributed operation submitted to Alluxio — for example, loading a dataset into cache or freeing space across all workers. Every job follows the same lifecycle:

WAITINGRUNNINGSUCCEEDED / FAILED / STOPPED

Once submitted, a job enters the waiting queue and is written to the Job Store before any scheduling happens. Whether that state survives a coordinator loss depends on the Job Store backend — see Section 3. The scheduler then claims the job, breaks it into per-worker tasks, and tracks completion. See Section 2 for the full state reference.

The Job Service has three components that drive this lifecycle:

  • Coordinator — a stateless scheduler that polls the job queue, claims jobs atomically, partitions them into file-batch tasks, and dispatches those tasks to workers. Multiple coordinators can run simultaneously; any instance can handle any job.

  • Job Store — durable storage for job state. By default this is a local RocksDB instance on the coordinator node. In HA mode, all coordinators share an etcd cluster.

  • Workers — execute the tasks dispatched by the coordinator: reading data from UFS and writing it to local cache (load), or evicting cached blocks (free).

circle-exclamation

2. Listing and Monitoring Jobs

Job Identity

The default way to address a job is by its (type, path) pair — used throughout the CLI (--path, --type). Internally every job also has a UUID (visible in job list output), and some REST endpoints accept it directly.

Listing Jobs

To see all jobs across the cluster:

Use --job-state to filter by state and --job-type to filter by type (LOAD, FREE, COPY, MOVE). In HA mode, --coordinator <address> lists only jobs owned by a specific coordinator. Each entry in the output shows the job's (type, path) — use those values to query progress or stop the job.

Querying a Specific Job

Once you have the (type, path) of a job from job list, query or control it:

The same pattern applies to other job types: alluxio job free --path <path> --progress, etc.

Job States

State
Description

WAITING

Queued in the Job Store; not yet claimed by a coordinator.

RUNNING

Coordinator is dispatching tasks to workers. Monitor with --progress.

VERIFYING

All tasks done; coordinator is running a verification pass (--verify).

SUCCEEDED

All tasks completed successfully.

FAILED

Error threshold exceeded. Inspect with --progress --file-status FAILURE.

STOPPED

Manually stopped via --stop.

State transitions are atomic and at-most-once, so consistency is guaranteed even if multiple coordinators are active simultaneously.

3. Failure Recovery

Recovery behavior depends on the Job Store backend:

  • RocksDB (single coordinator, default): Job state is written to RocksDB on the coordinator's local filesystem. If the RocksDB directory is on persistent storage (e.g., a PVC on Kubernetes — see Production Setup), the state survives process crashes and pod restarts attached to the same volume. If the coordinator node is permanently lost or the metastore is backed by non-persistent storage, the queue and in-flight job state are lost along with it.

  • etcd (HA mode): Job state lives in etcd, independent of any individual coordinator. A coordinator can be lost permanently and the remaining instances continue processing from the same shared queue.

Automatic Recovery (HA Mode)

In HA mode, jobs actively managed by a crashed coordinator are recovered automatically by its peers:

  • Detection: Active coordinators detect "stale" jobs in the RUNNING state that belong to an instance no longer sending heartbeats.

  • Re-queueing: These orphaned jobs are reverted to WAITING so another healthy coordinator can pick them up.

In single-coordinator mode (RocksDB), there is no peer to perform detection — recovery happens when the coordinator process itself restarts and re-reads its local RocksDB.

Worker Failure

Tasks running in a worker's thread pool are lost if the worker restarts mid-execution. The coordinator detects the stale tasks and retries each failed task. If retries are exhausted, the task is marked as failed and may eventually cause the job to transition to FAILED if the error threshold is exceeded.

Manual Recovery (job rerun)

circle-exclamation

In scenarios where a job gets stuck due to a complex failure (e.g., network partition or complete cluster restart), administrators can manually re-queue it:

4. High Availability

Even though the coordinator is never in the I/O critical path, a single coordinator backed by local RocksDB can still become a single point of failure. During a coordinator node upgrade — hardware replacement or a software update — load and free jobs cannot be submitted or make progress until the node comes back online.

To eliminate this, deploy multiple coordinator instances. Multiple schedulers sharing the same job queue require a distributed state store: this is where etcd comes in. All coordinators point to the same etcd cluster, so any instance can pick up and process jobs from the shared queue.

  • Stateless Coordinators: Since coordinators do not hold local state, any instance can handle any job — there is no "leader" election for scheduling.

  • Fault Tolerance: If one coordinator fails, the others continue processing the shared queue without interruption.

  • Failover: A failed coordinator's in-flight jobs are detected as stale and re-queued within seconds by the remaining instances.

  • Load Balancing: All coordinators consume from the same shared global job queue — work is naturally distributed. Polling intervals include random jitter to prevent thundering-herd bursts.

Configuring HA

The key requirement is setting alluxio.coordinator.job.meta.store.type=ETCD on every coordinator node to switch job state storage from local RocksDB to the shared cluster.

Set coordinator.count to deploy multiple replicas. The Operator provisions and manages etcd automatically.

Optional: Dedicated etcd cluster for job scheduling

By default the coordinator stores job metadata in the main Alluxio etcd cluster (alluxio.etcd.endpoints). For workloads with heavy job-submission traffic, you can route job metadata to a separate etcd cluster to isolate it from service-discovery traffic:

This applies to both Kubernetes and Docker / Bare-Metal deployments. On Kubernetes, add the property under spec.properties in the AlluxioCluster CR; on Docker / Bare-Metal, add it to alluxio-site.properties on every coordinator node.

Verifying HA

After the coordinators come online, confirm that they are actually sharing the etcd-backed job queue.

1. All coordinator replicas are running.

Expected: coordinator.count pods, all Running with READY 1/1.

2. Coordinators share the same job queue. Submit a job through one coordinator and query it from a different one — in HA mode both answer from the same etcd-backed queue.

Expected: both commands return the same job state. In single-coordinator (RocksDB) mode, the second query would report the job as not found.

3. Failover test (optional). Stop the coordinator currently driving a long-running job; a peer picks it up within alluxio.coordinator.failure.detection.timeout (default 15 s). Re-query alluxio job load --path ... --progress from a surviving coordinator and confirm the job advances past the failure point and eventually reaches SUCCEEDED.

5. Configuration and Tuning

You can fine-tune the Job Service behavior for your specific workload and etcd capacity.

Property
Default
Description

alluxio.coordinator.failure.detection.timeout

15s

Time before a silent coordinator is considered failed and its jobs reclaimed by peers.

alluxio.coordinator.scheduler.etcd.max.retries

2

Max retries for etcd operations before giving up.

alluxio.coordinator.scheduler.etcd.read.timeout

10s

Timeout for reading from etcd.

alluxio.coordinator.scheduler.etcd.write.timeout

15s

Timeout for writing to etcd.

alluxio.coordinator.scheduler.interval.time

2s

Scheduling interval. A shorter interval allows faster scheduling of small jobs but increases CPU overhead.

alluxio.coordinator.scheduler.job.etcd.path

/alluxio/jobs/

The prefix directory in etcd used for storing job metadata.

alluxio.coordinator.scheduler.job.etcd.poll.interval

1s

How often coordinators check etcd for new waiting jobs. Lower values mean faster response but higher load on etcd.

alluxio.coordinator.scheduler.max.tasks.per.worker

3

Maximum number of tasks each coordinator can schedule simultaneously on the same worker. Increasing this raises concurrency but risks creating a backlog on the worker.

alluxio.coordinator.scheduler.running.job.capacity

100

Maximum number of jobs the scheduler can run concurrently. Jobs beyond this limit remain in the waiting queue.

alluxio.coordinator.scheduler.waiting.job.capacity

300000

Maximum number of jobs that can wait in the scheduler queue. Submissions beyond this limit are rejected.

alluxio.job.batch.size

200

Number of files per task dispatched to a worker. Larger batches increase worker concurrency but may exceed the worker's thread pool size. Override per job via CLI (--batch-size).

alluxio.job.cleanup.interval

1H

Interval for periodic cleanup of finished jobs from the job store.

alluxio.job.retention.time

7d

Job history retention time.

alluxio.worker.job.load.executor.threads.max

2048

Maximum number of threads in the worker's job executor pool for parallel UFS data loading.

alluxio.worker.load.file.partition.size

256MiB

Shard size for splitting large files during loading. Smaller shards increase concurrency but may create more overhead.

6. Metrics

circle-info

These metrics are low-level indicators of internal scheduler operations. For general cluster health and job progress, refer to the default Grafana dashboard.

Metric
Description

alluxio_scheduler_etcd_poll

Counter of poll operations to etcd.

alluxio_scheduler_etcd_claim

Counter of jobs successfully claimed from the queue.

alluxio_scheduler_etcd_reclaim

Counter of jobs reclaimed/recovered from failed coordinators.

alluxio_scheduler_etcd_latency_ms

Latency distribution of etcd operations performed by the scheduler.

alluxio_scheduler_job_access_etcd

Counter of various job-related access types (list, get, transaction) to etcd.

See Also

Last updated