Managing Cache

Alluxio acts as an intelligent caching layer between your compute applications and persistent storage. Effectively managing this cache is key to maximizing performance, reducing storage costs, and ensuring data freshness.

This guide provides a comprehensive overview of the caching lifecycle in Alluxio, from how data enters the system to how you manage its behavior and eventual removal.

The Cache Lifecycle

The life of data in Alluxio typically follows three main phases:

  1. Loading Data: How data gets into the cache in the first place.

  2. Managing Data: How to control what is cached, how much space it uses, and for how long.

  3. Removing Data: How data is evicted or deleted from the cache.

1. Loading Data

Data enters the Alluxio cache in two primary ways:

  • Passive Caching (Read-Through): The most common method. When an application reads a file that isn't yet cached in Alluxio, the worker fetches it from the under file system (UFS) and caches it automatically.

  • Active Preloading: You can proactively load hot datasets into the cache using the job load command. This "warms up" the cache so performance is optimal from the very first query.
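
The difference between the two loading paths can be illustrated with a toy model. The sketch below is not Alluxio code: the `ReadThroughCache` class, its `read` and `preload` methods, and the in-memory dictionary standing in for the UFS are all hypothetical names chosen for illustration. It only shows why preloading moves the first UFS fetch ahead of the first application read.

```python
class ReadThroughCache:
    """Toy model of passive (read-through) caching plus active preloading."""

    def __init__(self, ufs):
        self.ufs = ufs        # hypothetical stand-in for the UFS: path -> bytes
        self.cache = {}       # hypothetical stand-in for worker cache storage
        self.ufs_reads = 0    # number of times data was fetched from the UFS

    def _fetch(self, path):
        self.ufs_reads += 1
        return self.ufs[path]

    def read(self, path):
        # Passive caching: on a miss, fetch from the UFS and keep a copy.
        if path not in self.cache:
            self.cache[path] = self._fetch(path)
        return self.cache[path]

    def preload(self, paths):
        # Active preloading: populate the cache before any application reads.
        for path in paths:
            if path not in self.cache:
                self.cache[path] = self._fetch(path)


ufs = {"/data/a": b"alpha", "/data/b": b"beta"}
cache = ReadThroughCache(ufs)
cache.preload(["/data/a"])   # warm-up: one UFS read
cache.read("/data/a")        # hit: served from cache
cache.read("/data/b")        # miss: second UFS read, then cached
cache.read("/data/b")        # hit
print(cache.ufs_reads)       # 2
```

In this model the preloaded path never causes a miss for the application, which is the effect that warming up the cache is meant to achieve.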

Learn more about Loading Data →

2. Managing Data

Once data is cached, you need tools to control how it behaves. Alluxio offers several mechanisms to fine-tune this:

  • Cache Filters: Decide what gets cached. You can exclude temporary files or force revalidation for changing data.

  • Quota: Limit the storage space available to specific directories or tenants to prevent one user from monopolizing the cache.

  • Time-to-Live (TTL): Set expiration times so that stale data does not linger in the cache.

  • Cache Priority: Mark critical datasets as "High Priority" so they are the last to be removed when the cache fills up.
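
As a rough illustration of how a TTL bounds the lifetime of cached data, here is a minimal sketch. It is a conceptual model only, not how Alluxio implements TTL: the `TtlCache` class and its injectable `clock` parameter are hypothetical, and the expiry check is done lazily on read as a simplifying assumption.

```python
import time


class TtlCache:
    """Toy cache whose entries expire a fixed number of seconds after being stored."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the example can control time
        self.entries = {}       # key -> (value, stored_at)

    def put(self, key, value):
        self.entries[key] = (value, self.clock())

    def get(self, key):
        item = self.entries.get(key)
        if item is None:
            return None
        value, stored_at = item
        if self.clock() - stored_at >= self.ttl:
            # Entry has outlived its TTL: drop it and report a miss.
            del self.entries[key]
            return None
        return value


now = [0.0]                                   # fake clock for the example
cache = TtlCache(ttl_seconds=600, clock=lambda: now[0])
cache.put("/tmp/q1", b"rows")
now[0] = 599.0
fresh = cache.get("/tmp/q1")                  # still within the 10-minute TTL
now[0] = 600.0
expired = cache.get("/tmp/q1")                # TTL reached, entry removed
print(fresh, expired)                         # b'rows' None
```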

Learn more about Managing Data →

3. Removing Data

Data doesn't stay in the cache forever. It is removed in one of three ways:

  • Automatic Eviction: When the cache is full, Alluxio evicts data according to an eviction policy, such as Least Recently Used (LRU), to make room for new content.

  • Manual Removal: You can explicitly remove data using the job free command.

  • Stale Cleanup: Administrative tools can clean up data that becomes invalid due to cluster topology changes.
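
To make the LRU idea concrete, the following sketch shows least-recently-used eviction alongside explicit removal. It is a generic textbook model, not Alluxio's eviction implementation: the `LruCache` class and its `free` method are hypothetical names, and capacity is counted in entries rather than bytes to keep the example short.

```python
from collections import OrderedDict


class LruCache:
    """Toy cache that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # ordered from least to most recently used

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        # Automatic eviction: drop least recently used entries until we fit.
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)

    def free(self, key):
        # Manual removal: drop an entry regardless of how recently it was used.
        self.entries.pop(key, None)


cache = LruCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")               # "a" becomes the most recently used entry
cache.put("c", 3)            # cache is full, so "b" is evicted
cache.free("a")              # explicit removal
print(list(cache.entries))   # ['c']
```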

Learn more about Removing Data →

Strategy Guide: Best Practices by Use Case

Different datasets require different caching strategies. Use this guide to choose the right configuration for your workload.

Immutable or Rarely Changed Data

Examples: Dimension tables, ML reference datasets, static assets.

Goal: Maximum performance (cache hit rate).

Strategy:

  • Filter: Use the default immutable policy.

  • Loading: Use job load to pre-warm the cache.

  • Priority (Optional): Set to HIGH to protect it from eviction.

  • Quota (Optional): Assign generous quotas to ensure it always fits.

Periodically Updated Data

Examples: Hourly ETL reports, daily model retraining data.

Goal: Balance performance with data freshness.

Strategy:

  • Filter: Use maxAge (e.g., 1h or 1d) so Alluxio automatically checks for updates after a set time.

  • Loading: Run a job load immediately after your upstream update process finishes to ensure the new version is hot.

Temporary or Streaming Data

Examples: Checkpoints, temp query files, build artifacts.

Goal: Prevent "cache pollution" (filling the cache with data that is unlikely to be read again).

Strategy:

  • Filter: Use skipCache for write-heavy, read-once data.

  • TTL: Set a short TTL (e.g., 10m) to ensure any cached data is quickly removed.

  • Priority: Set to LOW so it's the first to go if space is needed.

Compliance-Sensitive Data

Examples: PII logs, GDPR requests.

Goal: Strict control over data lifetime.

Strategy:

  • TTL: Enforce hard limits (e.g., 90d) on sensitive directories.

  • Removal: Use job free immediately after data processing is complete.

  • Filter: Consider skipCache for highly sensitive files to prevent them from hitting the cache disk at all.
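
The four strategies above can be summarized as a small lookup table. The sketch below is only a reading aid: the `STRATEGIES` dictionary, its keys, and the `duration_seconds` helper are hypothetical and are not an Alluxio configuration format. It also assumes that the shorthand used in this guide (`10m`, `1h`, `90d`) means minutes, hours, and days.

```python
# Hypothetical summary of the strategies in this guide (not Alluxio config syntax).
STRATEGIES = {
    "immutable": {"filter": "immutable", "preload": True, "priority": "HIGH"},
    "periodic": {"filter": "maxAge", "max_age": "1h", "preload": True},
    "temporary": {"filter": "skipCache", "ttl": "10m", "priority": "LOW"},
    "compliance": {"ttl": "90d", "free_after_use": True,
                   "optional_filter": "skipCache"},
}

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def duration_seconds(text):
    """Convert shorthand such as '10m', '1h', or '90d' to seconds."""
    if len(text) < 2:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = text[:-1], text[-1]
    if unit not in _UNIT_SECONDS or not value.isdigit():
        raise ValueError(f"unsupported duration: {text!r}")
    return int(value) * _UNIT_SECONDS[unit]


for name, plan in STRATEGIES.items():
    # Use whichever lifetime setting the strategy defines, if any.
    lifetime = plan.get("ttl") or plan.get("max_age")
    seconds = duration_seconds(lifetime) if lifetime else None
    print(name, plan.get("filter"), seconds)
```

Listing the lifetimes in seconds makes it easier to compare how aggressively each strategy expires or revalidates data.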
