Managing Data in the Cache
Once data is loaded into Alluxio, managing it effectively is crucial for optimizing performance, controlling resource consumption, and ensuring data freshness. Alluxio provides a rich set of tools to control what data is cached, how much space it can consume, how long it lives in the cache, and its importance relative to other data.
This guide covers four key tools for cache management:
Cache Filter Policies decide which files enter the cache and how their consistency with the UFS is treated. Available filter modes can enforce immutable behavior, skip caching entirely, or periodically revalidate data with a time window.
Cache Quotas govern how much cache space a directory tree may consume. They are enforced as data accumulates so a tenant or dataset cannot monopolize cluster storage.
Time-to-Live (TTL) rules bound how long cached pages may live after admission. Workers continuously scan for expired entries and evict them even if the cache is not full.
Cache Priority decides what gets evicted first whenever space is needed, based on priorities assigned by an administrator. LRU (or another base evictor) still orders items within the same priority tier.
Operational Guide
This section provides a detailed usage guide for each cache management tool.
Controlling What to Cache: Cache Filter Policies
The Cache Filter Policy feature is enabled by default and allows you to create rules that determine which files should or should not be cached based on their path. This is useful for excluding frequently changing files, temporary files, or data that provides little performance benefit from being cached.
By default, Alluxio uses an Immutable policy, meaning it will cache data upon first read and will not check the UFS for updates afterward. You can override this behavior by defining rules for specific paths.
Alluxio defines three filter modes for a file:
Immutable: The file's data and metadata will never change. Alluxio will cache it once and never check the UFS for updates. This is the default behavior and the most performant option.
Skip Cache: The file's data and metadata should not be cached in Alluxio. All requests for this file will be forwarded directly to the UFS. This is ideal for highly volatile files where cache consistency would be difficult to maintain.
Max Age: The file's data and metadata may change. You can specify a duration (e.g., 10m) after which the cached copy is considered stale. Alluxio will then re-check the UFS for a newer version on the next access.
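The three modes can be read as a small decision function evaluated on each read. The following Python sketch is illustrative only and is not Alluxio's implementation: the names FilterMode, FilterRule, and read_action are hypothetical, and the returned strings simply label the behavior described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FilterMode(Enum):
    IMMUTABLE = "immutable"
    SKIP_CACHE = "skipCache"
    MAX_AGE = "maxAge"


@dataclass
class FilterRule:
    # Hypothetical rule object; max_age_seconds is only used by MAX_AGE.
    mode: FilterMode
    max_age_seconds: Optional[float] = None


def read_action(rule: FilterRule, cached_at: Optional[float], now: float) -> str:
    """Label what a read does under a rule.

    cached_at is the time the cached copy was last loaded or validated,
    or None if the file is not in the cache.
    """
    if rule.mode is FilterMode.SKIP_CACHE:
        return "read_ufs"          # never cached, always forwarded to the UFS
    if cached_at is None:
        return "load_then_serve"   # first read populates the cache
    if rule.mode is FilterMode.IMMUTABLE:
        return "serve_cache"       # never re-checks the UFS
    # MAX_AGE: fresh within the window, re-checked against the UFS afterwards.
    if now - cached_at < rule.max_age_seconds:
        return "serve_cache"
    return "revalidate"


ten_minutes = FilterRule(FilterMode.MAX_AGE, max_age_seconds=600)
print(read_action(ten_minutes, cached_at=0.0, now=300.0))   # inside the window
print(read_action(ten_minutes, cached_at=0.0, now=900.0))   # past the window
```

Treating an item whose age exactly equals the max age as stale is an assumption of this sketch; the text above does not specify the boundary.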
Examples of Filter Modes
The following examples show the logic behind different filter rules. You would typically apply these rules using the Alluxio CLI or REST API when using ETCD. For detailed command usage, please refer to the CLI guide.
Immutable: For Data That Never Changes
This is the most performant option and should be the default for most of your data. If a file is marked as immutable, Alluxio caches it once and never checks the UFS for updates.
Consistency: This policy provides strong consistency only if the source file in the UFS truly never changes. If the source file is modified after being cached, Alluxio will continue to serve the old, stale version, leading to permanent inconsistency.
Skip Cache: For Volatile or Rarely Used Data
If you have data that changes frequently (e.g., temporary scripts) or is not worth caching, you can exclude it using a skipCache rule. All requests for these files will be served directly from the UFS.
Consistency: This policy provides strong consistency, as it bypasses the cache and reads directly from the source of truth (the UFS). The consistency guarantee is the same as that of the underlying storage system.
Max Age: For Data with Bounded Staleness
For data that is updated periodically, you can set a maxAge. This tells Alluxio to consider the cached data fresh for the specified duration. After the duration expires, Alluxio will check the UFS for a newer version on the next access.
Consistency: This policy provides bounded staleness. Clients may read a version of the data that is stale, but no older than the specified maxAge duration. It offers a balance between performance and freshness but does not guarantee strong consistency.
This configuration is useful for mutable datasets where you can tolerate a certain level of staleness in exchange for higher performance.
When defaultType is set to maxAge, specify the duration with defaultMaxAge.
Difference Between maxAge Cache Filter and TTL Rules
The maxAge cache filter and TTL rules look similar, which can make it hard to choose between them when the goal is bounded staleness of cached data.
The maxAge cache filter takes effect at data access time: if a cached item has reached its max age, Alluxio revalidates it against the UFS. An expired item is therefore not revalidated until it is next accessed.
TTL rules are enforced independently of data access, so cached data is evicted once its TTL expires, even if it is never accessed. Items are not revalidated before TTL expiration clears them from the cache, so they are evicted even if they have not changed in the UFS at all.
If you want to ensure that cached data is fresh when accessed, use the maxAge cache filter. If you want to ensure that cached data does not linger in the cache beyond a certain time period, use TTL rules. In some cases, you may want to use both features together for more comprehensive cache management.
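The contrast can be sketched as two independent clocks on one cache entry. This Python model is a simplification, not Alluxio code: Entry, ttl_scan, and access are hypothetical names, and it assumes the file is unchanged in the UFS so revalidation only refreshes the validation timestamp.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Entry:
    loaded_at: float     # TTL clock: when the data entered the cache
    validated_at: float  # maxAge clock: last check against the UFS


def ttl_scan(entry: Optional[Entry], now: float, ttl: float) -> Optional[Entry]:
    """Background scan: drop an expired entry without consulting the UFS."""
    if entry is not None and now - entry.loaded_at > ttl:
        return None
    return entry


def access(entry: Optional[Entry], now: float, max_age: float) -> Tuple[Entry, str]:
    """Read path: maxAge is only evaluated when the data is accessed."""
    if entry is None:
        return Entry(loaded_at=now, validated_at=now), "loaded"
    if now - entry.validated_at > max_age:
        entry.validated_at = now  # assumes the UFS copy was unchanged
        return entry, "revalidated"
    return entry, "hit"


# An entry that is never read stays unvalidated until the TTL scan removes it.
idle = Entry(loaded_at=0.0, validated_at=0.0)
print(ttl_scan(idle, now=3600.0, ttl=86400.0) is idle)   # still cached, nobody revalidated it
print(ttl_scan(idle, now=90000.0, ttl=86400.0) is None)  # evicted by TTL alone
```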
Controlling How Much to Cache: Cache Quotas
Cache Quotas allow administrators to limit the total amount of cache space that can be used by files within a specific directory tree. This is essential for multi-tenant environments or for ensuring that no single dataset or user can monopolize the cluster's cache resources.
When a directory's quota is exceeded, Alluxio will take action to enforce the limit. By default, it will stop caching new data for that directory (NO_CACHE) and trigger evictions to bring usage back under the limit.
Configuration
To use directory-based quotas, you must enable the feature in alluxio-site.properties and specify the coordinator address.
You can also control the behavior when a quota limit is exceeded using the alluxio.quota.limit.exceeded.action property. The available actions are:
NO_CACHE (default): Stops caching new data under the path but allows read requests to be served from the UFS.
REJECT: Rejects new write requests that would create cached data under the path, returning an exception to the client.
NOOP: Continues to allow new data to be cached, relying on eviction to manage the space. This may be insufficient if the write rate is higher than the eviction rate.
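As a rough illustration of how the three actions differ, the Python sketch below models an admission check for data arriving under a quota directory. It is not Alluxio's implementation: admit, ExceededAction, and QuotaExceededError are hypothetical names, and comparing used plus incoming bytes against the quota is an assumption about when a quota counts as exceeded.

```python
from enum import Enum


class ExceededAction(Enum):
    NO_CACHE = "NO_CACHE"
    REJECT = "REJECT"
    NOOP = "NOOP"


class QuotaExceededError(Exception):
    """Stands in for the exception returned to the client under REJECT."""


def admit(used_bytes: int, quota_bytes: int, incoming_bytes: int,
          action: ExceededAction) -> bool:
    """Return True if the incoming data should be cached."""
    if used_bytes + incoming_bytes <= quota_bytes:
        return True   # quota not exceeded, cache as usual
    if action is ExceededAction.NO_CACHE:
        return False  # the read is still served from the UFS, just not cached
    if action is ExceededAction.REJECT:
        raise QuotaExceededError("quota exceeded for this directory")
    return True       # NOOP: keep caching and rely on eviction


GB = 1024 ** 3
print(admit(10 * GB, 10 * GB, 1 * GB, ExceededAction.NO_CACHE))
print(admit(10 * GB, 10 * GB, 1 * GB, ExceededAction.NOOP))
```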
Basic Operations
You can manage quotas using the alluxio quota CLI. For a complete list of commands and flags, please refer to the CLI guide.
1. Add a Quota: To set a 10GB quota on the /s3/data directory:
2. List Quotas: To view all existing quotas and their current usage:
3. Remove a Quota: To remove the quota on /s3/data:
4. Update a Quota: To update an existing quota:
Quotas can also be nested. For example, you can assign a 10GB quota to a team's directory and then subdivide that with smaller quotas for individual projects within that team.
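One way to reason about nested quotas is that data under a project directory counts against both the project's quota and the enclosing team quota. The Python sketch below models that reading; it is an assumption about the semantics rather than a description of Alluxio internals, and applicable_quotas and within_all_quotas are hypothetical helpers.

```python
from typing import Dict, List


def applicable_quotas(path: str, quotas: Dict[str, int]) -> List[str]:
    """Return the quota directories that contain path, outermost first."""
    def contains(directory: str, p: str) -> bool:
        d = directory.rstrip("/")
        # Match whole path components so /team does not match /teammate.
        return p == d or p.startswith(d + "/")

    return sorted((d for d in quotas if contains(d, path)), key=len)


def within_all_quotas(path: str, incoming_bytes: int,
                      quotas: Dict[str, int], usage: Dict[str, int]) -> bool:
    """True if caching incoming_bytes keeps every enclosing quota satisfied."""
    return all(
        usage.get(d, 0) + incoming_bytes <= quotas[d]
        for d in applicable_quotas(path, quotas)
    )


GB = 1024 ** 3
quotas = {"/team": 10 * GB, "/team/proj_a": 4 * GB}
usage = {"/team": 6 * GB, "/team/proj_a": 3 * GB}
print(applicable_quotas("/team/proj_a/part-0", quotas))
print(within_all_quotas("/team/proj_a/part-0", 2 * GB, quotas, usage))
```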
Controlling How Long to Cache: Time-to-Live (TTL)
The Time-to-Live (TTL) feature allows you to define a maximum lifespan for cached data in specific directories. Once a file's cache duration exceeds its TTL, Alluxio automatically evicts it during its next periodic check. The TTL timer starts when the data is first loaded into the Alluxio cache.
This is useful for:
Automatically cleaning up temporary or time-sensitive data.
Ensuring that stale data does not remain in the cache indefinitely.
Satisfying compliance requirements by limiting the lifespan of sensitive data.
Configuration
To use TTL-based eviction, you must enable the feature in alluxio-site.properties:
It is also critical to configure the scan interval, which determines how frequently workers check for expired data.
A very short interval can create unnecessary system overhead. We recommend setting this to a value that balances cleanup timeliness with performance, such as 24h in many production scenarios.
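Because expired data is only removed at the next periodic check, the scan interval bounds how long data can outlive its TTL. The short Python sketch below works through that arithmetic. It assumes scans fire at fixed multiples of the interval starting from time 0, which is a simplification, and eviction_time is a hypothetical helper.

```python
def eviction_time(loaded_at: int, ttl: int, scan_interval: int) -> int:
    """Return the time of the first scan at or after the TTL expiry.

    All values use the same integer unit (hours here). Scans are assumed
    to run at 0, scan_interval, 2 * scan_interval, and so on.
    """
    expires_at = loaded_at + ttl
    ticks = -(-expires_at // scan_interval)  # ceiling division
    return ticks * scan_interval


# Data loaded at hour 1 with a 24h TTL expires at hour 25.
print(eviction_time(loaded_at=1, ttl=24, scan_interval=24) - 25)  # hours past expiry with a 24h scan
print(eviction_time(loaded_at=1, ttl=24, scan_interval=1) - 25)   # hours past expiry with a 1h scan
```

Under these assumptions, data can remain cached for up to one scan interval beyond its TTL, which is the cost of choosing a long interval.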
Basic Operations
TTL rules are managed using the alluxio ttl CLI. For a complete list of commands and flags, please refer to the CLI guide.
1. Add TTL Rules
To set a 24-hour TTL for all files under /s3/daily_reports/:
2. Remove TTL Rules
This command is used to remove TTL policies from ETCD. Here are some examples:
3. Update TTL Rules
You can update a TTL rule after it has been added:
4. List TTL Rules
To see all active TTL rules:
Like quotas, TTL rules are hierarchical. If multiple TTL rules apply to a file, the most specific path match takes precedence.
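The precedence rule can be pictured as a longest-prefix match over the configured directories. The Python sketch below is a model of that idea, not Alluxio's code: effective_ttl is a hypothetical helper, and reading "most specific" as the longest matching directory is an assumption.

```python
from typing import Dict, Optional


def effective_ttl(path: str, rules: Dict[str, int]) -> Optional[int]:
    """Return the TTL (in seconds) of the most specific rule covering path."""
    best_dir: Optional[str] = None
    best_ttl: Optional[int] = None
    for directory, ttl in rules.items():
        d = directory.rstrip("/")
        # Match whole path components only.
        if path == d or path.startswith(d + "/"):
            if best_dir is None or len(d) > len(best_dir):
                best_dir, best_ttl = d, ttl
    return best_ttl


DAY = 24 * 3600
rules = {"/s3": 7 * DAY, "/s3/daily_reports": 1 * DAY}
print(effective_ttl("/s3/daily_reports/2024/report.csv", rules) == DAY)  # narrower rule wins
print(effective_ttl("/s3/archive/old.parquet", rules) == 7 * DAY)        # falls back to /s3
```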
Controlling What to Evict First: Cache Priority
By default, Alluxio uses an LRU (Least Recently Used) policy to decide which data to evict when cache space is full. However, some data is more critical than other data, regardless of how recently it was accessed. The Cache Priority feature allows you to assign a cache priority to files.
When space is needed, Alluxio will always evict data with a lower priority before evicting any data with a higher priority. There are three priority levels: HIGH, MEDIUM, and LOW.
This is useful for protecting critical datasets (e.g., a dimension table for a series of queries) from being evicted by less important, ad-hoc jobs.
Configuration
To enable cache priority, you must set the following property in alluxio-site.properties on all client and worker nodes:
Since priority rules are persisted in etcd, you must also ensure that etcd connection details are properly configured.
Basic Operations
Priority rules are managed using the alluxio priority CLI. For a complete list of commands and flags, please refer to the CLI guide.
1. Add a Priority Rule: To assign HIGH priority to a critical dataset:
2. List Priority Rules: To view the current priority rules:
3. Remove a Priority Rule: To delete an existing priority rule:
4. Update a Priority Rule: To update an existing priority rule:
For files with the same priority level, the standard eviction policy (e.g., LRU) is used to determine the eviction order.
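The resulting eviction order can be sketched as a two-level sort: priority tier first, then least recently used within a tier. The Python example below is illustrative rather than Alluxio's evictor; CachedItem, PRIORITY_RANK, and pick_victims are hypothetical names, and LRU is assumed as the base policy.

```python
from dataclasses import dataclass
from typing import List

# Lower rank is evicted first.
PRIORITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


@dataclass
class CachedItem:
    path: str
    priority: str       # "LOW", "MEDIUM", or "HIGH"
    last_access: float  # larger means more recently used
    size: int           # bytes


def pick_victims(items: List[CachedItem], bytes_needed: int) -> List[str]:
    """Choose paths to evict until at least bytes_needed would be freed."""
    order = sorted(items, key=lambda i: (PRIORITY_RANK[i.priority], i.last_access))
    victims: List[str] = []
    freed = 0
    for item in order:
        if freed >= bytes_needed:
            break
        victims.append(item.path)
        freed += item.size
    return victims


items = [
    CachedItem("/dim/table", "HIGH", last_access=1.0, size=10),
    CachedItem("/adhoc/a", "LOW", last_access=50.0, size=10),
    CachedItem("/adhoc/b", "LOW", last_access=20.0, size=10),
    CachedItem("/etl/c", "MEDIUM", last_access=5.0, size=10),
]
print(pick_victims(items, bytes_needed=25))
```

In this example the HIGH-priority dimension table survives even though it is the least recently used item, because all LOW and MEDIUM items are considered first.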