Stale Cache Cleaning

In Alluxio, clients use consistent hashing to determine the appropriate worker to access or write a file. This ensures that each file is typically cached only on its designated worker. However, under certain conditions, a worker may end up caching data that no longer belongs to it. To reclaim memory and maintain optimal cache utilization, Alluxio provides a mechanism to clear stale cached data from workers.

This operation triggers each worker to scan its local cache, verify whether it still owns the cached data, and delete any data that no longer belongs to it.

When Stale Cache Occurs

Stale data may exist on a worker due to the following situations:

Replica Reduction: If the replication factor of a file is reduced (e.g., from 3 to 2), the worker holding the third replica still keeps a redundant copy that is no longer needed.
Dynamic Hash Ring Membership Change: When a dynamic hash ring is used, a worker may temporarily go offline, and its responsibilities are taken over by other workers. If the original worker later rejoins, the workers that were serving in its place may hold stale data.
Cluster Expansion: Adding new workers can change the ownership of cached files. Data previously cached on old workers may now be the responsibility of the newly added workers.

To clean up such stale data, the clear stale cache operation can be manually triggered.
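The cluster expansion case can be illustrated with a toy consistent-hash ring. This is a minimal sketch, not Alluxio's actual hashing scheme: the `HashRing` class, the SHA-256 based hash, the virtual node count, and the worker and path names are all assumptions made for illustration, and replication is ignored.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (hash choice is illustrative)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical, not Alluxio code)."""

    def __init__(self, workers, vnodes=100):
        # Each worker contributes `vnodes` points on the ring.
        self._points = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def owner(self, path: str) -> str:
        # The owner is the first ring point clockwise from the path's hash.
        idx = bisect.bisect_right(self._keys, _hash(path)) % len(self._keys)
        return self._points[idx][1]


def find_stale(cached_paths, worker, ring):
    """Return the cached paths that `worker` no longer owns under `ring`."""
    return [p for p in cached_paths if ring.owner(p) != worker]


# Before expansion: two workers, worker-1 caches everything it owns.
old_ring = HashRing(["worker-1", "worker-2"])
paths = [f"/data/file-{i}" for i in range(1000)]
cached_on_w1 = [p for p in paths if old_ring.owner(p) == "worker-1"]

# After expansion: some of worker-1's files now belong to worker-3.
new_ring = HashRing(["worker-1", "worker-2", "worker-3"])
stale_on_w1 = find_stale(cached_on_w1, "worker-1", new_ring)
print(f"{len(stale_on_w1)} of {len(cached_on_w1)} cached files are now stale")
```

In this sketch, every stale path on `worker-1` is owned by `worker-3` after the expansion, which is the situation the clear stale cache operation is meant to clean up.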

Differences between Clearing Stale Cache and Free Job

Clearing Stale Cache and Free Job are two mechanisms for cache cleanup in Alluxio, and they are often confused due to their similar purposes. The table below outlines their differences:

| Aspect | Clearing Stale Cache | Free Job |
| --- | --- | --- |
| Primary Use Case | Cleaning up stale or misplaced data after cluster changes | Releasing cache for data that is no longer needed by applications |
| Type of Cache Freed | Incorrect/invalid cache | Valid cache that is no longer needed |
| Target of Cleanup | Removes cached files that shouldn't reside on a worker | Removes files from workers based on the input specification |
| Interface | REST API | REST API & CLI |
| Input Parameters | None | Requires a directory path or an index file as input |
| Scheduling Mechanism | Immediately executes on all workers | Relies on the job system for scheduling |

Usage

This feature is currently accessible via REST API only. Please refer to the API reference page for more details.

Submit Task

The following API triggers the clear stale cache task to be asynchronously executed across all workers:

curl -X POST <coordinator-host>:<coordinator-api-port>/api/v1/cache -d '{"op":"clear-stale"}' -H "Content-Type: application/json"

This command submits a background job to all workers. Submitting the same request multiple times will not cause duplicate executions.

Example Response:

{
  "errors": {
    "worker1-host": "Connection refused",
    "worker2-host": "Timeout"
  }
}

An empty errors object indicates that the job was submitted to all workers successfully. Otherwise, the errors field maps the hostname of each worker where submission failed to the corresponding error message. Submission to a worker fails if the network connection fails or if a previously submitted job has not finished running.
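For scripted use, the submit request and the check on the errors field might look like the sketch below. It assumes the coordinator is reachable over plain HTTP (the default scheme for the curl command above) and that the response body is JSON in the shape of the example response. The function names `submit_clear_stale` and `failed_workers` are hypothetical helpers, not part of any Alluxio client library.

```python
import json
import urllib.request


def submit_clear_stale(coordinator: str, timeout: float = 10.0) -> dict:
    """POST the clear-stale op to `<host>:<port>` and return the decoded JSON body."""
    request = urllib.request.Request(
        f"http://{coordinator}/api/v1/cache",  # assumes plain HTTP
        data=json.dumps({"op": "clear-stale"}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def failed_workers(body: dict) -> dict:
    """Return the hostname -> error message mapping; empty means all workers accepted."""
    return dict(body.get("errors") or {})


# Interpreting the example response shown above (no network call is made here).
sample = {"errors": {"worker1-host": "Connection refused", "worker2-host": "Timeout"}}
for host, message in sorted(failed_workers(sample).items()):
    print(f"submission failed on {host}: {message}")
```

Because repeated submissions do not cause duplicate executions, a script can resubmit after fixing connectivity to the workers that reported errors.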

Stop Task

To cancel the task (if needed), send a DELETE request with the same op:

curl -X DELETE <coordinator-host>:<coordinator-api-port>/api/v1/cache -d '{"op":"clear-stale"}' -H "Content-Type: application/json"

This request will stop the background task on all workers. If no such task is running, the command will still succeed without error.

Monitoring Task Progress

There is currently no RPC to track the progress of the clear stale cache job. However, you can monitor its progress in the following ways:

Via Logs

When the task completes on a worker, the following log will appear:

2025-04-21T19:51:22,889 INFO  AsyncJobWorker - Clear stale cached files finished. 104857600 bytes released

This log message indicates the job completion and the amount of stale data removed.
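To collect completion events from worker logs, a parser along these lines could be used. The pattern is inferred from the single sample line above, so other versions or logging configurations may format it differently. `parse_completion` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Pattern inferred from the one sample log line; treat it as an assumption.
COMPLETION_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\s+INFO\s+AsyncJobWorker - "
    r"Clear stale cached files finished\. (?P<bytes>\d+) bytes released"
)


def parse_completion(line: str):
    """Return (finish time, bytes released) for a completion line, else None."""
    match = COMPLETION_RE.match(line)
    if match is None:
        return None
    finished_at = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S,%f")
    return finished_at, int(match.group("bytes"))


sample_line = (
    "2025-04-21T19:51:22,889 INFO  AsyncJobWorker - "
    "Clear stale cached files finished. 104857600 bytes released"
)
parsed = parse_completion(sample_line)
if parsed is not None:
    finished_at, released = parsed
    print(f"finished at {finished_at.isoformat()}, released {released / 1024**2:.0f} MiB")
```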

Via Prometheus Metrics

Alluxio exposes a metric to track stale cache clearance:

alluxio_cleared_stale_cached_data

This metric accumulates the total number of bytes cleared by the clear stale cache operation on a worker. At the completion of the job, the aggregated sum of this metric across all workers will plateau. You can use this metric to monitor and alert on cache cleanup trends across your cluster.
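The plateau check can be approximated by summing the metric from each worker's scraped output and comparing consecutive totals. The sketch below works on Prometheus text exposition format passed in as a string. How the text is scraped, the label names in the sample, and the helper names `sum_metric` and `has_plateaued` are assumptions for illustration.

```python
METRIC = "alluxio_cleared_stale_cached_data"


def sum_metric(exposition_text: str, name: str = METRIC) -> float:
    """Sum every sample of `name` found in Prometheus text exposition output."""
    total = 0.0
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line.startswith(name):
            continue  # comments (# HELP / # TYPE) and other metrics
        rest = line[len(name):]
        if rest.startswith("{"):
            rest = rest[rest.rfind("}") + 1:]  # drop the label set
        elif not rest.startswith((" ", "\t")):
            continue  # a different metric that merely shares the prefix
        fields = rest.split()
        if fields:
            total += float(fields[0])  # first field after the name is the value
    return total


def has_plateaued(totals, stable_samples: int = 3) -> bool:
    """True when the last `stable_samples` cluster-wide totals are identical."""
    if len(totals) < stable_samples:
        return False
    tail = totals[-stable_samples:]
    return all(value == tail[0] for value in tail)


# Hypothetical scrape output concatenated from two workers.
scrape = """# TYPE alluxio_cleared_stale_cached_data counter
alluxio_cleared_stale_cached_data{instance="worker-1"} 1024
alluxio_cleared_stale_cached_data{instance="worker-2"} 2048
"""
print(sum_metric(scrape))
print(has_plateaued([0.0, 512.0, 3072.0, 3072.0, 3072.0]))
```

A flat series also appears before the job starts or when no stale data exists, so a plateau is best read together with the completion log line on each worker.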
