Job Service
If you've already used job load or job free, you've been interacting with the Job Service — it's the distributed scheduling engine that drives those operations. This page explains how it works internally, how to monitor jobs in flight, and how to configure it for high availability and production scale.
For operation-specific commands, see:
1. Architecture
A job is a named, persistent, distributed operation submitted to Alluxio — for example, loading a dataset into cache or freeing space across all workers. Every job follows the same lifecycle:
WAITING → RUNNING → SUCCEEDED / FAILED / STOPPED
Once submitted, a job enters the waiting queue and is written to the Job Store before any scheduling happens. Whether that state survives a coordinator loss depends on the Job Store backend — see Section 3. The scheduler then claims the job, breaks it into per-worker tasks, and tracks completion. See Section 2 for the full state reference.
The Job Service has three components that drive this lifecycle:
Coordinator — a stateless scheduler that polls the job queue, claims jobs atomically, partitions them into file-batch tasks, and dispatches those tasks to workers. Multiple coordinators can run simultaneously; any instance can handle any job.
Job Store — durable storage for job state. By default this is a local RocksDB instance on the coordinator node. In HA mode, all coordinators share an etcd cluster.
Workers — execute the tasks dispatched by the coordinator: reading data from UFS and writing it to local cache (load), or evicting cached blocks (free).
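The coordinator's partitioning of a job into file-batch tasks can be illustrated with a small sketch. This is not the scheduler's actual code: `partition_into_tasks` is a hypothetical helper, and the only value taken from this page is the default batch size of 200 (alluxio.job.batch.size).

```python
from itertools import islice
from typing import Iterable, Iterator, List

# Mirrors the documented default of alluxio.job.batch.size.
DEFAULT_BATCH_SIZE = 200


def partition_into_tasks(paths: Iterable[str],
                         batch_size: int = DEFAULT_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size file paths.

    Hypothetical helper: each yielded list stands for one task that a
    coordinator would dispatch to a worker.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    files = [f"/data/part-{i:05d}" for i in range(450)]
    tasks = list(partition_into_tasks(files))
    print([len(task) for task in tasks])
```

With 450 files and the default batch size, the sketch produces three tasks, the last one partially filled.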
The Coordinator is not the same as the Master in open-source Alluxio. Alluxio Enterprise Edition has replaced the Master with the Coordinator — a purpose-built stateless scheduler. There is no Master component in the enterprise edition. See Key Components for an overview.
2. Listing and Monitoring Jobs
Job Identity
The default way to address a job is by its (type, path) pair — used throughout the CLI (--path, --type). Internally every job also has a UUID (visible in job list output), and some REST endpoints accept it directly.
Listing Jobs
To see all jobs across the cluster, run alluxio job list.
Use --job-state to filter by state and --job-type to filter by type (LOAD, FREE, COPY, MOVE). In HA mode, --coordinator <address> lists only jobs owned by a specific coordinator. Each entry in the output shows the job's (type, path) — use those values to query progress or stop the job.
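When listing jobs from a script, it can help to build the argument list in one place. The sketch below assumes the listing subcommand is invoked as `alluxio job list`; the flags are the ones named above, and `build_job_list_command` is a hypothetical helper, not part of any Alluxio client library.

```python
from typing import List, Optional

# Job types named on this page for --job-type.
VALID_JOB_TYPES = {"LOAD", "FREE", "COPY", "MOVE"}


def build_job_list_command(job_state: Optional[str] = None,
                           job_type: Optional[str] = None,
                           coordinator: Optional[str] = None) -> List[str]:
    """Return an argv list for listing jobs (hypothetical helper).

    Assumes the subcommand is `alluxio job list`; only flags mentioned
    in this page are emitted.
    """
    command = ["alluxio", "job", "list"]
    if job_state is not None:
        command += ["--job-state", job_state]
    if job_type is not None:
        if job_type not in VALID_JOB_TYPES:
            raise ValueError(f"unknown job type: {job_type}")
        command += ["--job-type", job_type]
    if coordinator is not None:
        # Only meaningful in HA mode, where jobs have an owning coordinator.
        command += ["--coordinator", coordinator]
    return command


if __name__ == "__main__":
    print(" ".join(build_job_list_command(job_state="RUNNING", job_type="LOAD")))
```

The returned list can be passed to `subprocess.run(..., check=True)` on a host where the `alluxio` CLI is installed.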
Querying a Specific Job
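Once you have the (type, path) of a job from job list, query or control it with the matching job command. For a load job, alluxio job load --path <path> --progress reports its progress.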
The same pattern applies to other job types: alluxio job free --path <path> --progress, etc.
Job States
WAITING: Queued in the Job Store; not yet claimed by a coordinator.
RUNNING: Coordinator is dispatching tasks to workers. Monitor with --progress.
VERIFYING: All tasks done; coordinator is running a verification pass (--verify).
SUCCEEDED: All tasks completed successfully.
FAILED: Error threshold exceeded. Inspect with --progress --file-status FAILURE.
STOPPED: Manually stopped via --stop.
State transitions are atomic and at-most-once, so consistency is guaranteed even if multiple coordinators are active simultaneously.
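One common way to get atomic, at-most-once transitions is a compare-and-set on the stored state. The sketch below illustrates that idea with an in-memory store guarded by a lock; it is not the Job Service implementation. The class `InMemoryJobStore` and the exact set of allowed edges are assumptions made for illustration.

```python
import threading
from typing import Dict, Optional, Tuple

# Edges inferred from the lifecycle and state list on this page.
# The real scheduler's transition table is an assumption here.
ALLOWED = {
    "WAITING": {"RUNNING", "STOPPED"},
    "RUNNING": {"VERIFYING", "SUCCEEDED", "FAILED", "STOPPED", "WAITING"},
    "VERIFYING": {"SUCCEEDED", "FAILED", "STOPPED"},
}


class InMemoryJobStore:
    """Toy stand-in for a shared job store (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # job_id -> (state, owning coordinator)
        self._jobs: Dict[str, Tuple[str, Optional[str]]] = {}

    def submit(self, job_id: str) -> None:
        with self._lock:
            self._jobs[job_id] = ("WAITING", None)

    def compare_and_set(self, job_id: str, expected_state: str,
                        new_state: str, owner: Optional[str]) -> bool:
        """Apply the transition only if the stored state still matches."""
        with self._lock:
            state, _ = self._jobs[job_id]
            if state != expected_state:
                return False  # another coordinator got there first
            if new_state not in ALLOWED.get(state, set()):
                return False  # not a legal edge
            self._jobs[job_id] = (new_state, owner)
            return True

    def get(self, job_id: str) -> Tuple[str, Optional[str]]:
        with self._lock:
            return self._jobs[job_id]


if __name__ == "__main__":
    store = InMemoryJobStore()
    store.submit("job-1")
    print(store.compare_and_set("job-1", "WAITING", "RUNNING", "coordinator-a"))
    print(store.compare_and_set("job-1", "WAITING", "RUNNING", "coordinator-b"))
```

When two coordinators race to claim the same waiting job, only the first compare-and-set succeeds; the second sees a state that no longer matches and backs off.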
3. Failure Recovery
Recovery behavior depends on the Job Store backend:
RocksDB (single coordinator, default): Job state is written to RocksDB on the coordinator's local filesystem. If the RocksDB directory is on persistent storage (e.g., a PVC on Kubernetes; see Production Setup), the state survives process crashes and pod restarts, provided the restarted pod reattaches to the same volume. If the coordinator node is permanently lost or the metastore is backed by non-persistent storage, the queue and in-flight job state are lost along with it.
etcd (HA mode): Job state lives in etcd, independent of any individual coordinator. A coordinator can be lost permanently and the remaining instances continue processing from the same shared queue.
Automatic Recovery (HA Mode)
In HA mode, jobs actively managed by a crashed coordinator are recovered automatically by its peers:
Detection: Active coordinators detect "stale" jobs in the RUNNING state that belong to an instance no longer sending heartbeats.
Re-queueing: These orphaned jobs are reverted to WAITING so another healthy coordinator can pick them up.
In single-coordinator mode (RocksDB), there is no peer to perform detection — recovery happens when the coordinator process itself restarts and re-reads its local RocksDB.
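The detection and re-queueing steps for HA mode can be sketched as a single pass over the running jobs. This is an illustration only: the `Job` record, the heartbeat map, and `requeue_stale_jobs` are hypothetical, and the 15-second timeout mirrors the documented default of alluxio.coordinator.failure.detection.timeout.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Documented default of alluxio.coordinator.failure.detection.timeout.
FAILURE_DETECTION_TIMEOUT_S = 15.0


@dataclass
class Job:
    job_id: str
    state: str
    owner: Optional[str] = None


def requeue_stale_jobs(jobs: List[Job],
                       last_heartbeat: Dict[str, float],
                       now: float,
                       timeout_s: float = FAILURE_DETECTION_TIMEOUT_S) -> List[str]:
    """Revert RUNNING jobs owned by silent coordinators to WAITING.

    last_heartbeat maps coordinator id -> time of its last heartbeat (seconds).
    Returns the ids of the jobs that were re-queued.
    """
    requeued = []
    for job in jobs:
        if job.state != "RUNNING" or job.owner is None:
            continue
        last_seen = last_heartbeat.get(job.owner)
        if last_seen is None or now - last_seen > timeout_s:
            job.state = "WAITING"  # a healthy coordinator can claim it again
            job.owner = None
            requeued.append(job.job_id)
    return requeued


if __name__ == "__main__":
    jobs = [Job("a", "RUNNING", "c1"), Job("b", "RUNNING", "c2"), Job("c", "WAITING")]
    print(requeue_stale_jobs(jobs, {"c1": 100.0, "c2": 80.0}, now=100.0))
```

In a real deployment the re-queue itself would need to be an atomic update in the shared store so that two surviving coordinators do not both reclaim the same job.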
Worker Failure
Tasks running in a worker's thread pool are lost if the worker restarts mid-execution. The coordinator detects the stale tasks and retries each failed task. If retries are exhausted, the task is marked as failed and may eventually cause the job to transition to FAILED if the error threshold is exceeded.
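The retry-then-threshold behavior can be pictured as two small steps: retry each task a bounded number of times, then compare the share of failed tasks against a limit. The sketch below is an assumption-laden illustration: `max_retries` and `max_failure_ratio` are hypothetical parameters, and this page does not specify how the real error threshold is defined.

```python
from typing import Callable, Iterable


def run_with_retries(task: Callable[[], None], max_retries: int) -> bool:
    """Run task once plus up to max_retries retries; True if any attempt succeeds."""
    for _ in range(max_retries + 1):
        try:
            task()
            return True
        except Exception:
            continue  # e.g. the worker restarted mid-execution
    return False


def exceeds_error_threshold(task_results: Iterable[bool],
                            max_failure_ratio: float) -> bool:
    """True if the share of failed tasks is above max_failure_ratio.

    Treating the threshold as a ratio is an assumption of this sketch.
    """
    results = list(task_results)
    if not results:
        return False
    failed = sum(1 for ok in results if not ok)
    return failed / len(results) > max_failure_ratio


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_task() -> None:
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RuntimeError("worker restarted")

    print(run_with_retries(flaky_task, max_retries=2))
```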
Manual Recovery (job rerun)
job rerun requires etcd-based HA mode (alluxio.coordinator.job.meta.store.type=ETCD). It will fail with UnavailableException: Job ETCD is not enabled in single-coordinator RocksDB mode. For RocksDB deployments, recovery happens automatically when the coordinator process restarts and re-reads local RocksDB.
In scenarios where a job gets stuck due to a complex failure (e.g., network partition or complete cluster restart), administrators can manually re-queue it:
4. High Availability
Even though the coordinator is never in the I/O critical path, a single coordinator backed by local RocksDB can still become a single point of failure. During a coordinator node upgrade — hardware replacement or a software update — load and free jobs cannot be submitted or make progress until the node comes back online.
To eliminate this, deploy multiple coordinator instances. Multiple schedulers sharing the same job queue require a distributed state store: this is where etcd comes in. All coordinators point to the same etcd cluster, so any instance can pick up and process jobs from the shared queue.
Stateless Coordinators: Since coordinators do not hold local state, any instance can handle any job — there is no "leader" election for scheduling.
Fault Tolerance: If one coordinator fails, the others continue processing the shared queue without interruption.
Failover: A failed coordinator's in-flight jobs are detected as stale and re-queued by the remaining instances once alluxio.coordinator.failure.detection.timeout (default 15s) has elapsed.
Load Balancing: All coordinators consume from the same shared global job queue — work is naturally distributed. Polling intervals include random jitter to prevent thundering-herd bursts.
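Random jitter on a polling interval is a generic technique, sketched below. The 1-second base matches the documented default of alluxio.coordinator.scheduler.job.etcd.poll.interval, while the 20% jitter range and the function name `next_poll_delay` are assumptions; this page does not state how much jitter the coordinators apply.

```python
import random
from typing import Optional


def next_poll_delay(base_interval_s: float = 1.0,
                    jitter_fraction: float = 0.2,
                    rng: Optional[random.Random] = None) -> float:
    """Return the base interval shifted by a random amount within +/- jitter_fraction.

    Spreading delays this way keeps coordinators that started together
    from polling the shared queue at the same instant.
    """
    if rng is None:
        rng = random.Random()
    jitter = rng.uniform(-jitter_fraction, jitter_fraction) * base_interval_s
    return max(0.0, base_interval_s + jitter)


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed only to make the demo repeatable
    print([round(next_poll_delay(rng=rng), 3) for _ in range(5)])
```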
Configuring HA
The key requirement is setting alluxio.coordinator.job.meta.store.type=ETCD on every coordinator node to switch job state storage from local RocksDB to the shared cluster.
On Kubernetes, set coordinator.count to deploy multiple replicas. The Operator provisions and manages etcd automatically.
On Docker / Bare-Metal, add alluxio.coordinator.job.meta.store.type=ETCD to alluxio-site.properties on every coordinator node, then start each coordinator process.
Optional: Dedicated etcd cluster for job scheduling
By default the coordinator stores job metadata in the main Alluxio etcd cluster (alluxio.etcd.endpoints). For workloads with heavy job-submission traffic, you can route job metadata to a separate etcd cluster to isolate it from service-discovery traffic:
This applies to both Kubernetes and Docker / Bare-Metal deployments. On Kubernetes, add the property under spec.properties in the AlluxioCluster CR; on Docker / Bare-Metal, add it to alluxio-site.properties on every coordinator node.
Verifying HA
After the coordinators come online, confirm that they are actually sharing the etcd-backed job queue.
1. All coordinator replicas are running.
Expected: coordinator.count pods, all Running with READY 1/1.
On each coordinator host:
Expected: the container is Up on every host.
2. Coordinators share the same job queue. Submit a job through one coordinator and query it from a different one — in HA mode both answer from the same etcd-backed queue.
Expected: both commands return the same job state. In single-coordinator (RocksDB) mode, the second query would report the job as not found.
3. Failover test (optional). Stop the coordinator currently driving a long-running job; a peer picks it up within alluxio.coordinator.failure.detection.timeout (default 15 s). Re-query alluxio job load --path ... --progress from a surviving coordinator and confirm the job advances past the failure point and eventually reaches SUCCEEDED.
5. Configuration and Tuning
You can fine-tune the Job Service behavior for your specific workload and etcd capacity.
alluxio.coordinator.failure.detection.timeout (default: 15s): Time before a silent coordinator is considered failed and its jobs reclaimed by peers.
alluxio.coordinator.scheduler.etcd.max.retries (default: 2): Max retries for etcd operations before giving up.
alluxio.coordinator.scheduler.etcd.read.timeout (default: 10s): Timeout for reading from etcd.
alluxio.coordinator.scheduler.etcd.write.timeout (default: 15s): Timeout for writing to etcd.
alluxio.coordinator.scheduler.interval.time (default: 2s): Scheduling interval. A shorter interval allows faster scheduling of small jobs but increases CPU overhead.
alluxio.coordinator.scheduler.job.etcd.path (default: /alluxio/jobs/): The prefix directory in etcd used for storing job metadata.
alluxio.coordinator.scheduler.job.etcd.poll.interval (default: 1s): How often coordinators check etcd for new waiting jobs. Lower values mean faster response but higher load on etcd.
alluxio.coordinator.scheduler.max.tasks.per.worker (default: 3): Maximum number of tasks each coordinator can schedule simultaneously on the same worker. Increasing this raises concurrency but risks creating a backlog on the worker.
alluxio.coordinator.scheduler.running.job.capacity (default: 100): Maximum number of jobs the scheduler can run concurrently. Jobs beyond this limit remain in the waiting queue.
alluxio.coordinator.scheduler.waiting.job.capacity (default: 300000): Maximum number of jobs that can wait in the scheduler queue. Submissions beyond this limit are rejected.
alluxio.job.batch.size (default: 200): Number of files per task dispatched to a worker. Larger batches increase worker concurrency but may exceed the worker's thread pool size. Override per job via CLI (--batch-size).
alluxio.job.cleanup.interval (default: 1H): Interval for periodic cleanup of finished jobs from the job store.
alluxio.job.retention.time (default: 7d): Job history retention time.
alluxio.worker.job.load.executor.threads.max (default: 2048): Maximum number of threads in the worker's job executor pool for parallel UFS data loading.
alluxio.worker.load.file.partition.size (default: 256MiB): Shard size for splitting large files during loading. Smaller shards increase concurrency but may create more overhead.
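The interaction between the running and waiting capacities can be pictured with a toy queue: submissions are rejected once the waiting queue is full, and waiting jobs start only while there is room among the running jobs. The class below is a hypothetical model, not the scheduler's code; FIFO ordering is an assumption, and the defaults mirror alluxio.coordinator.scheduler.running.job.capacity and alluxio.coordinator.scheduler.waiting.job.capacity.

```python
from collections import deque
from typing import Deque, List, Set


class SchedulerQueue:
    """Toy model of bounded waiting and running sets (hypothetical)."""

    def __init__(self, running_capacity: int = 100,
                 waiting_capacity: int = 300_000) -> None:
        self.running_capacity = running_capacity
        self.waiting_capacity = waiting_capacity
        self.waiting: Deque[str] = deque()
        self.running: Set[str] = set()

    def submit(self, job_id: str) -> bool:
        """Queue a job; return False when the waiting queue is full."""
        if len(self.waiting) >= self.waiting_capacity:
            return False
        self.waiting.append(job_id)
        return True

    def schedule(self) -> List[str]:
        """Start waiting jobs (oldest first) until the running capacity is reached."""
        started = []
        while self.waiting and len(self.running) < self.running_capacity:
            job_id = self.waiting.popleft()
            self.running.add(job_id)
            started.append(job_id)
        return started

    def finish(self, job_id: str) -> None:
        """Free a running slot when a job reaches a terminal state."""
        self.running.discard(job_id)


if __name__ == "__main__":
    queue = SchedulerQueue(running_capacity=2, waiting_capacity=3)
    print([queue.submit(j) for j in ("a", "b", "c", "d")])
    print(queue.schedule())
```

Small capacities are used in the demo only to make the limits visible; with the defaults the same logic applies at 100 running and 300000 waiting jobs.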
6. Metrics
These metrics are low-level indicators of internal scheduler operations. For general cluster health and job progress, refer to the default Grafana dashboard.
alluxio_scheduler_etcd_poll: Counter of poll operations to etcd.
alluxio_scheduler_etcd_claim: Counter of jobs successfully claimed from the queue.
alluxio_scheduler_etcd_reclaim: Counter of jobs reclaimed/recovered from failed coordinators.
alluxio_scheduler_etcd_latency_ms: Latency distribution of etcd operations performed by the scheduler.
alluxio_scheduler_job_access_etcd: Counter of various job-related access types (list, get, transaction) to etcd.
See Also
Cache Loading — job submission, monitoring, and free operations
Installing on Kubernetes — ETCD configuration and multi-coordinator deployment
Cluster Management — resource tuning, node pinning, persistent metastore
Monitoring — Grafana dashboards, alert rules, metrics reference