Stale Cache Cleaning
In Alluxio, clients use consistent hashing to determine the appropriate worker to access or write a file. This ensures that each file is typically cached only on its designated worker. However, under certain conditions, a worker may end up caching data that no longer belongs to it. To reclaim memory and maintain optimal cache utilization, Alluxio provides a mechanism to clear stale cached data from workers.
This operation triggers each worker to scan its local cache, verify whether it still owns the cached data, and delete any data that no longer belongs to it.
When Stale Cache Occurs
Stale data may exist on a worker due to the following situations:
Replica Reduction: If the replication factor of a file is reduced (e.g., from 3 to 2), the third worker still holds a redundant replica that is no longer needed.
Dynamic Hash Ring Membership Change: When a dynamic hash ring is used, a worker may temporarily go offline, and its responsibilities are taken over by other workers. If the original worker later rejoins, the workers that were serving in its place may hold stale data.
Cluster Expansion: Adding new workers can change the ownership of cached files. Data previously cached on existing workers may now be the responsibility of the newly added workers.
To clean up such stale data, the clear stale cache operation can be manually triggered.
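The ownership check behind these scenarios can be illustrated with a toy consistent hash ring. This is a minimal sketch, not Alluxio's implementation: the HashRing class, the stale_files helper, the MD5-based hash, the virtual node count, and the worker names are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring using a stable hash (hypothetical choice)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Toy consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, workers, vnodes=100):
        # Each worker contributes `vnodes` points to spread load evenly.
        self._points = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def owners(self, path: str, replicas: int = 1):
        """Return the first `replicas` distinct workers clockwise from the file's hash."""
        start = bisect.bisect_right(self._keys, _hash(path))
        found = []
        for offset in range(len(self._points)):
            worker = self._points[(start + offset) % len(self._points)][1]
            if worker not in found:
                found.append(worker)
                if len(found) == replicas:
                    break
        return found


def stale_files(worker, cached_paths, ring, replicas=1):
    """Return the cached paths that `worker` no longer owns under `ring`."""
    return [p for p in cached_paths if worker not in ring.owners(p, replicas)]


# Cluster expansion: worker-1 cached files under a two-worker ring,
# then worker-3 joins and takes over part of the key space.
paths = [f"/data/file-{i}" for i in range(1000)]
old_ring = HashRing(["worker-1", "worker-2"])
cached_on_w1 = [p for p in paths if old_ring.owners(p) == ["worker-1"]]
new_ring = HashRing(["worker-1", "worker-2", "worker-3"])
stale_on_w1 = stale_files("worker-1", cached_on_w1, new_ring)
print(f"{len(stale_on_w1)} of {len(cached_on_w1)} cached files are now stale on worker-1")
```

In this sketch, replica reduction corresponds to calling stale_files with a smaller replicas value: the worker that held the last replica no longer appears in the owner list, so its copy is reported as stale.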
Differences between Clearing Stale Cache and Free Job
Clearing Stale Cache and Free Job are two mechanisms for cache cleanup in Alluxio, and they are often confused due to their similar purposes. The table below outlines their differences:
| Aspect | Clearing Stale Cache | Free Job |
| --- | --- | --- |
| Primary Use Case | Cleaning up stale or misplaced data after cluster changes | Releasing cache for data that is no longer needed by applications |
| Type of Cache Freed | Incorrect/invalid cache | Valid cache that is no longer needed |
| Target of Cleanup | Removes cached files that shouldn't reside on a worker | Removes files from workers based on the input specification |
| Interface | RESTful API | RESTful API & CLI |
| Input Parameters | None | Requires a directory path or an index file as input |
| Scheduling Mechanism | Immediately executes on all workers | Relies on the job system for scheduling |
Usage
This feature is currently accessible via RESTful API only.
Submit Task
The following API triggers the clear stale cache task to be asynchronously executed across all workers:
This command submits a background job to all workers. Submitting the same request multiple times will not cause duplicate executions.
Example Response:
An empty errors object indicates that the job was submitted successfully to all workers. Otherwise, the errors field maps the hostname of each worker where an error occurred to the corresponding error message. An error occurs if the job could not be submitted to a worker because of a network connection failure, or because a job submitted earlier has not finished running.
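A client can act on this response by checking whether the errors object is empty. The sketch below assumes only what is described above, namely a JSON body with an errors field that maps worker hostnames to messages. The function names and the sample hostname are hypothetical, and treating a missing errors field as success is an assumption of this sketch.

```python
import json


def failed_workers(response_body: str) -> dict:
    """Return {hostname: error message} for workers that rejected the task.

    An empty dict means every worker accepted the submission.
    """
    payload = json.loads(response_body)
    # Assumption: a missing or null `errors` field is treated as "no errors".
    return dict(payload.get("errors") or {})


def summarize(response_body: str) -> str:
    """Build a one-line or multi-line summary suitable for logs."""
    errors = failed_workers(response_body)
    if not errors:
        return "submitted to all workers"
    lines = [f"{host}: {message}" for host, message in sorted(errors.items())]
    return "submission failed on {} worker(s)\n{}".format(len(errors), "\n".join(lines))


# Hypothetical response bodies, for illustration only.
print(summarize('{"errors": {}}'))
print(summarize('{"errors": {"worker-2": "connection refused"}}'))
```

Because resubmitting the request does not cause duplicate executions, a caller could retry the submission when the failure list is non-empty.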
Stop Task
To cancel the task (if needed), send a DELETE request with the same op:
This request will stop the background task on all workers. If no such task is running, the command will still succeed without error.
Monitoring Task Progress
There is currently no RPC to track the progress of the clear stale cache job. However, you can monitor its progress in the following ways:
Via Logs
When the task completes on a worker, the following log will appear:
This log message indicates the job completion and the amount of stale data removed.
Via Prometheus Metrics
Alluxio exposes a metric to track stale cache clearance:
This metric accumulates the total number of bytes cleared by the clear stale cache operation on a worker. At the completion of the job, the aggregated sum of this metric across all workers will plateau. You can use this metric to monitor and alert on cache cleanup trends across your cluster.
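The plateau check can be automated once the metric has been scraped. The following is a minimal sketch that works on values already read from Prometheus; it does not query Prometheus itself. The function names, the worker names, and the choice of three identical samples as the plateau criterion are assumptions for illustration.

```python
from typing import Dict, List


def cluster_total(per_worker_bytes: Dict[str, float]) -> float:
    """Sum the per-worker cleared-bytes counters into one cluster-wide value."""
    return sum(per_worker_bytes.values())


def has_plateaued(totals: List[float], stable_samples: int = 3) -> bool:
    """Return True if the last `stable_samples` cluster totals are identical.

    `totals` holds the aggregated metric at successive scrape times, oldest first.
    """
    if stable_samples < 2 or len(totals) < stable_samples:
        return False
    recent = totals[-stable_samples:]
    return all(value == recent[0] for value in recent)


# Hypothetical scrape history: one dict of per-worker counters per scrape.
history = [
    {"worker-1": 0.0, "worker-2": 0.0},
    {"worker-1": 4096.0, "worker-2": 1024.0},
    {"worker-1": 8192.0, "worker-2": 2048.0},
    {"worker-1": 8192.0, "worker-2": 2048.0},
    {"worker-1": 8192.0, "worker-2": 2048.0},
]
totals = [cluster_total(sample) for sample in history]
print(has_plateaued(totals))
```

A flat series is also what the metric looks like before the task starts or when no stale data exists, so this check is best combined with the completion log on each worker.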