Managing Cache
Alluxio acts as an intelligent caching layer between your compute applications and persistent storage. Effectively managing this cache is key to maximizing performance, reducing storage costs, and ensuring data freshness.
This guide provides a comprehensive overview of the caching lifecycle in Alluxio, from how data enters the system to how you manage its behavior and eventual removal.
The Cache Lifecycle
The life of data in Alluxio typically follows three main phases:
Loading Data: How data gets into the cache in the first place.
Managing Data: How to control what is cached, how much space it uses, and for how long.
Removing Data: How data is evicted or deleted from the cache.
1. Loading Data
Data enters the Alluxio cache in two primary ways:
Passive Caching (Read-Through): The most common method. When an application reads a file that isn't in Alluxio, the worker fetches it from the UFS and caches it automatically.
Active Preloading: You can proactively load hot datasets into the cache using the job load command. This "warms up" the cache so performance is optimal from the very first query.
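The difference between the two loading paths can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Alluxio's implementation: the class name ReadThroughCache, the dictionary standing in for the UFS, and the preload method (a stand-in for what job load achieves) are all hypothetical.

```python
class ReadThroughCache:
    """Toy model of passive (read-through) caching; not Alluxio's actual code."""

    def __init__(self, ufs):
        self.ufs = ufs      # dict standing in for the under file system (UFS)
        self.cache = {}     # path -> bytes held by the hypothetical worker
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self.cache:
            self.hits += 1
            return self.cache[path]
        # Cache miss: fetch from the UFS, then keep a copy for later reads.
        self.misses += 1
        data = self.ufs[path]
        self.cache[path] = data
        return data

    def preload(self, paths):
        """Active preloading: populate the cache before any application read."""
        for path in paths:
            if path not in self.cache:
                self.cache[path] = self.ufs[path]


ufs = {"/data/a.parquet": b"aaa", "/data/b.parquet": b"bbb"}
cache = ReadThroughCache(ufs)
cache.read("/data/a.parquet")        # miss: fetched from the UFS and cached
cache.read("/data/a.parquet")        # hit: served from the cache
cache.preload(["/data/b.parquet"])   # warmed ahead of time
cache.read("/data/b.parquet")        # hit on the very first application read
print(cache.hits, cache.misses)      # 2 1
```

In this model, the preloaded file never incurs a miss, which is the effect that pre-warming aims for.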
Learn more about Loading Data →
2. Managing Data
Once data is cached, you need tools to control how it behaves. Alluxio offers several mechanisms to fine-tune this:
Cache Filters: Decide what gets cached. You can exclude temporary files or force revalidation for changing data.
Quota: Limit the storage space available to specific directories or tenants to prevent one user from monopolizing the cache.
Time-to-Live (TTL): Set expiration times so that stale data is removed from the cache automatically.
Cache Priority: Mark critical datasets as "High Priority" so they are the last to be removed when the cache fills up.
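To make the TTL mechanism concrete, here is a minimal sketch of expiry-on-read. It is an illustration only: the TtlCache class and its methods are hypothetical, the clock is passed in explicitly to keep the example deterministic, and Alluxio's actual TTL enforcement may work differently (for example, through background scans).

```python
class TtlCache:
    """Toy TTL cache: an entry expires once it has been cached for ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}   # path -> (data, time the entry was cached)

    def put(self, path, data, now):
        self.entries[path] = (data, now)

    def get(self, path, now):
        entry = self.entries.get(path)
        if entry is None:
            return None
        data, cached_at = entry
        if now - cached_at >= self.ttl:
            # Expired: drop the entry so stale data is never served.
            del self.entries[path]
            return None
        return data


cache = TtlCache(ttl_seconds=600)                 # a 10-minute TTL
cache.put("/tmp/query-1", b"rows", now=0)
print(cache.get("/tmp/query-1", now=599))         # b'rows'
print(cache.get("/tmp/query-1", now=600))         # None
```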
Learn more about Managing Data →
3. Removing Data
Data doesn't stay in the cache forever. Removal happens for three reasons:
Automatic Eviction: When the cache is full, Alluxio evicts the least useful data, chosen by an eviction policy such as Least Recently Used (LRU), to make room for new content.
Manual Removal: You can explicitly remove data using the job free command.
Stale Cleanup: Administrative tools can clean up data that becomes invalid due to cluster topology changes.
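The interaction between automatic eviction, cache priority, and manual removal can be sketched as a priority-aware LRU cache. This is a toy model under assumptions, not Alluxio's eviction code: the PriorityLruCache class, the NORMAL priority level, and counting capacity in entries rather than bytes are all hypothetical choices made for illustration.

```python
from collections import OrderedDict

# Eviction order: LOW entries go first, HIGH entries last.
# NORMAL is an assumed default level used only in this sketch.
PRIORITY_ORDER = ["LOW", "NORMAL", "HIGH"]


class PriorityLruCache:
    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # One LRU queue per priority; the least recently used entry is at the front.
        self.queues = {p: OrderedDict() for p in PRIORITY_ORDER}

    def __len__(self):
        return sum(len(q) for q in self.queues.values())

    def get(self, key):
        for queue in self.queues.values():
            if key in queue:
                queue.move_to_end(key)   # mark as most recently used
                return queue[key]
        return None

    def put(self, key, value, priority="NORMAL"):
        """Insert an entry and return the keys evicted to make room for it."""
        for queue in self.queues.values():
            queue.pop(key, None)         # replace any existing copy
        evicted = []
        while len(self) >= self.capacity:
            evicted.append(self._evict_one())
        self.queues[priority][key] = value
        return evicted

    def _evict_one(self):
        for priority in PRIORITY_ORDER:
            queue = self.queues[priority]
            if queue:
                key, _ = queue.popitem(last=False)   # oldest entry at this level
                return key
        raise RuntimeError("cache is empty")

    def free(self, key):
        """Manual removal, loosely analogous to freeing a path explicitly."""
        for queue in self.queues.values():
            if key in queue:
                del queue[key]
                return True
        return False


cache = PriorityLruCache(capacity=2)
cache.put("/ref/dim_table", "ref", priority="HIGH")
cache.put("/tmp/scratch", "tmp", priority="LOW")
print(cache.put("/reports/hourly", "report"))   # ['/tmp/scratch']
print(cache.free("/reports/hourly"))            # True
```

In this sketch the LOW-priority entry is evicted first even though it was written more recently than the HIGH-priority one, which mirrors the intent of marking critical datasets as high priority.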
Learn more about Removing Data →
Strategy Guide: Best Practices by Use Case
Different datasets require different caching strategies. Use this guide to choose the right configuration for your workload.
Immutable or Rarely Changed Data
Examples: Dimension tables, ML reference datasets, static assets.
Goal: Maximum performance (cache hit rate).
Strategy:
Filter: Use the default immutable policy.
Loading: Use job load to pre-warm the cache.
Priority (Optional): Set to HIGH to protect it from eviction.
Quota (Optional): Assign generous quotas to ensure it always fits.
Periodically Updated Data
Examples: Hourly ETL reports, daily model retraining data.
Goal: Balance performance with data freshness.
Strategy:
Filter: Use maxAge (e.g., 1h or 1d) so Alluxio automatically checks for updates after a set time.
Loading: Run a job load immediately after your upstream update process finishes to ensure the new version is hot.
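The idea behind a maxAge check can be sketched as follows. This is an illustration under assumptions rather than Alluxio's filter logic: parse_duration and needs_revalidation are hypothetical helpers, and the supported duration syntax (an integer followed by s, m, h, or d) is inferred only from the example values 1h, 1d, and 10m used in this guide.

```python
import re

# Seconds per unit for the assumed duration suffixes.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Convert a string such as '1h' or '10m' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def needs_revalidation(cached_at, now, max_age):
    """Return True once the cached copy is at least max_age old."""
    return now - cached_at >= parse_duration(max_age)


print(parse_duration("1h"))                                        # 3600
print(needs_revalidation(cached_at=0, now=1800, max_age="1h"))     # False
print(needs_revalidation(cached_at=0, now=3600, max_age="1h"))     # True
```

Under this model, a copy cached at the top of the hour is served without checking the UFS for the next hour, after which the next read triggers a freshness check.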
Temporary or Streaming Data
Examples: Checkpoints, temp query files, build artifacts.
Goal: Prevent "cache pollution" (filling the cache with useless data).
Strategy:
Filter: Use skipCache for write-heavy, read-once data.
TTL: Set a short TTL (e.g., 10m) to ensure any cached data is quickly removed.
Priority: Set to LOW so it's the first to go if space is needed.
Compliance-Sensitive Data
Examples: PII logs, GDPR requests.
Goal: Strict control over data lifetime.
Strategy:
TTL: Enforce hard limits (e.g., 90d) on sensitive directories.
Removal: Use job free immediately after data processing is complete.
Filter: Consider skipCache for highly sensitive files to prevent them from hitting the cache disk at all.