Managing Data in the Cache

Once data is loaded into Alluxio, managing it effectively is crucial for optimizing performance, controlling resource consumption, and ensuring data freshness. Alluxio provides a rich set of tools to control what data is cached, how much space it can consume, how long it lives in the cache, and its importance relative to other data.

This guide covers four key aspects of cache management:

  1. Cache Filter Policies: Defining rules to include or exclude certain files from being cached.

  2. Cache Quotas: Setting storage limits on specific directories to manage space.

  3. Time-to-Live (TTL): Automatically expiring and evicting data after a set period.

  4. Eviction Priority: Influencing which data gets evicted first when space is needed.

Controlling What to Cache: Cache Filter Policies

In many cases, a dataset is larger than the available cache capacity in Alluxio. The Cache Filter Policy feature lets you create path-based rules that determine which files are cached and which are not. This is useful for excluding frequently changing files, temporary files, or data that gains little from being cached.

Alluxio defines three filter modes for a file:

  • Immutable: The file's data and metadata will never change. Alluxio will cache it once and never check the UFS for updates. This is the most performant option.

  • Skip Cache: The file's data and metadata should not be cached in Alluxio. All requests for this file will be forwarded directly to the UFS. This is ideal for highly volatile files where cache consistency would be difficult to maintain.

  • Max Age: The file's data and metadata may change. You can specify a duration (e.g., 10m) after which the cached copy is considered stale. Alluxio will then re-check the UFS for a newer version on the next access.

Note on Default Behavior: If the cache filter feature is not enabled (alluxio.user.client.cache.filter.enabled=false), Alluxio's behavior is equivalent to the Immutable policy. It will cache data upon first read and will not check the UFS for updates afterward.

Configuration

To enable and configure cache filter policies, you must first enable the feature in alluxio-site.properties. Alluxio uses ETCD to store and manage filter rules, allowing you to update them dynamically via the CLI or REST API without restarting services.

Set the following properties:

# Enable the cache filter feature
alluxio.user.client.cache.filter.enabled=true

# Set ETCD as the rule management backend
alluxio.user.client.cache.filter.type=ETCD

Once configured, Alluxio components will automatically sync with the rules stored in ETCD.

Examples of Filter Modes

The following examples show the logic behind different filter rules. You would typically apply these rules using the Alluxio CLI or REST API when using ETCD. For detailed command usage, please refer to the CLI guide.

Immutable: For Data That Never Changes

This is the most performant option and should be the default for most of your data. If a file is marked as immutable, Alluxio caches it once and never checks the UFS for updates.

Consistency: This policy provides strong consistency only if the source file in the UFS truly never changes. If the source file is modified after being cached, Alluxio will continue to serve the old, stale version, leading to permanent inconsistency.

{
  "apiVersion": "1",
  "metadata": {
    "defaultType": "immutable"
  },
  "data": {
    "defaultType": "immutable"
  }
}

Skip Cache: For Volatile or Rarely Used Data

If you have data that changes frequently (e.g., temporary scripts) or is not worth caching, you can exclude it using a skipCache rule. All requests for these files will be served directly from the UFS.

Consistency: This policy provides strong consistency, as it bypasses the cache and reads directly from the source of truth (the UFS). The consistency guarantee is the same as that of the underlying storage system.

{
  "apiVersion": "1",
  "metadata": {
    "skipCache": ["file://dev/scripts/.*"],
    "defaultType": "immutable"
  },
  "data": {
    "skipCache": ["file://dev/scripts/.*"],
    "defaultType": "immutable"
  }
}
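As a rough illustration of how a rule document like the one above could be evaluated, the following Python sketch resolves a path to a filter mode. The function name resolve_mode is hypothetical, and the matching semantics (each pattern is a regular expression that must match the whole path, with defaultType as the fallback) are assumptions made for this example, not a description of Alluxio's internals.

```python
import json
import re

# Rule document copied from the skipCache example above.
RULES = json.loads("""
{
  "apiVersion": "1",
  "metadata": {
    "skipCache": ["file://dev/scripts/.*"],
    "defaultType": "immutable"
  },
  "data": {
    "skipCache": ["file://dev/scripts/.*"],
    "defaultType": "immutable"
  }
}
""")


def resolve_mode(rules: dict, section: str, path: str) -> str:
    """Return the filter mode for `path` in the "metadata" or "data" section.

    Assumption: a pattern must match the whole path, and a path that
    matches no pattern falls back to the section's defaultType.
    """
    policy = rules[section]
    for pattern in policy.get("skipCache", []):
        if re.fullmatch(pattern, path):
            return "skipCache"
    return policy["defaultType"]


if __name__ == "__main__":
    for candidate in ("file://dev/scripts/run.sh", "s3://bucket/table/part-0.parquet"):
        print(candidate, "->", resolve_mode(RULES, "data", candidate))
```

Under these assumptions, anything below file://dev/scripts/ bypasses the cache and every other path is treated as immutable.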

Max Age: For Data with Bounded Staleness

For data that is updated periodically, you can set a maxAge. This tells Alluxio to consider the cached data fresh for the specified duration. After the duration expires, Alluxio will check the UFS for a newer version on the next access.

Consistency: This policy provides bounded staleness. Clients may read a version of the data that is stale, but no older than the specified maxAge duration. It offers a balance between performance and freshness but does not guarantee strong consistency.

{
  "apiVersion": "1",
  "metadata": {
    "maxAge": {"s3://datalake/tables/pipeline/sales/.*": "1h"},
    "defaultType": "immutable"
  },
  "data": {
    "defaultType": "immutable"
  }
}

This configuration is useful for mutable datasets where you can tolerate a certain level of staleness in exchange for higher performance.
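The bounded-staleness check can be pictured with a small Python sketch. The helper names parse_duration and needs_revalidation are hypothetical, the parser only handles simple values such as 10m or 1h, and treating a copy that is exactly maxAge old as still fresh is an assumption made for illustration.

```python
import re
from datetime import datetime, timedelta

# Map the unit suffix to the matching timedelta keyword argument.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse durations such as "10m" or "1h" (only s/m/h/d are handled here)."""
    match = re.fullmatch(r"(\d+)([smhd])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def needs_revalidation(cached_at: datetime, now: datetime, max_age: str) -> bool:
    """True once the cached copy is older than max_age, so the UFS should be re-checked."""
    return now - cached_at > parse_duration(max_age)


if __name__ == "__main__":
    cached_at = datetime(2024, 1, 1, 12, 0)
    print(needs_revalidation(cached_at, datetime(2024, 1, 1, 12, 30), "1h"))
    print(needs_revalidation(cached_at, datetime(2024, 1, 1, 13, 1), "1h"))
```

With a maxAge of 1h, a read 30 minutes after caching is served from the cache without a UFS check, while a read 61 minutes after caching triggers one.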

Controlling How Much to Cache: Cache Quotas

Cache Quotas allow administrators to limit the total amount of cache space that can be used by files within a specific directory tree. This is essential for multi-tenant environments or for ensuring that no single dataset or user can monopolize the cluster's cache resources.

When a directory's quota is exceeded, Alluxio will take action to enforce the limit. By default, it will stop caching new data for that directory (NO_CACHE) and trigger evictions to bring usage back under the limit.

Configuration

To use directory-based quotas, you must enable the feature in alluxio-site.properties and specify the coordinator address.

# Enable the directory-based cluster quota feature
alluxio.quota.enabled=true

# Configure the coordinator address
alluxio.coordinator.address=<host>:<port>

You can also control the behavior when a quota limit is exceeded using the alluxio.quota.limit.exceeded.action property. The available actions are:

  • NO_CACHE (default): Stops caching new data under the path but allows read requests to be served from the UFS.

  • REJECT: Rejects new write requests that would create cached data under the path, returning an exception to the client.

  • NOOP: Continues to allow new data to be cached, relying on eviction to manage the space. This may be insufficient if the write rate is higher than the eviction rate.
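The difference between the three actions can be sketched in Python as a single decision function. The names ExceededAction, QuotaExceededError, and should_cache are hypothetical, and checking the limit against current usage plus the incoming data is an assumption made for this sketch.

```python
from enum import Enum


class ExceededAction(Enum):
    NO_CACHE = "NO_CACHE"
    REJECT = "REJECT"
    NOOP = "NOOP"


class QuotaExceededError(Exception):
    """Hypothetical stand-in for the exception returned to the client."""


def should_cache(used_bytes: int, quota_bytes: int, incoming_bytes: int,
                 action: ExceededAction) -> bool:
    """Decide whether incoming data may be cached under a quota-limited path.

    Assumption: the quota counts as exceeded when current usage plus the
    incoming data would go over the limit.
    """
    if used_bytes + incoming_bytes <= quota_bytes:
        return True  # Within quota: cache as usual.
    if action is ExceededAction.NOOP:
        return True  # Keep caching and rely on eviction to free space.
    if action is ExceededAction.NO_CACHE:
        return False  # Do not cache; the request is served from the UFS.
    raise QuotaExceededError("quota exceeded; request rejected")


if __name__ == "__main__":
    gb = 1024 ** 3
    print(should_cache(10 * gb, 10 * gb, gb, ExceededAction.NO_CACHE))
    print(should_cache(10 * gb, 10 * gb, gb, ExceededAction.NOOP))
```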

Basic Operations

You can manage quotas using the alluxio quota CLI. For a complete list of commands and flags, please refer to the CLI guide.

1. Add a Quota: To set a 10GB quota on the /s3/data directory:

$ bin/alluxio quota add --directory /s3/data/ --quota-size 10GB
Successfully added quota definition for path /s3/data/ with size 10GB.

2. List Quotas: To view all existing quotas and their current usage:

$ bin/alluxio quota list
Alluxio path                    	        Capacity	            Used	State
/s3/data                    	             10.00GB	     Calculating	Available

3. Remove a Quota: To remove the quota on /s3/data:

$ bin/alluxio quota remove --directory /s3/data
Successfully removed quota definition for path /s3/data.

4. Update a Quota: To update an existing quota:

$ bin/alluxio quota update --directory /local/data/ --quota-size 100GB

Quotas can also be nested. For example, you can assign a 10GB quota to a team's directory and then subdivide that with smaller quotas for individual projects within that team.
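To make the nesting concrete, the Python sketch below lists every quota definition that covers a given file, outermost first. The function enclosing_quotas and the /teams/... directories are hypothetical, and how Alluxio accounts usage across nested quotas is not modeled here.

```python
from pathlib import PurePosixPath

GB = 1024 ** 3


def enclosing_quotas(quotas: dict[str, int], file_path: str) -> list[tuple[str, int]]:
    """Return (directory, limit) pairs whose directory contains file_path.

    Results are ordered from the outermost directory to the innermost.
    """
    target = PurePosixPath(file_path)
    hits = []
    for directory, limit in quotas.items():
        quota_dir = PurePosixPath(directory)  # Normalizes a trailing slash.
        if quota_dir in target.parents:
            hits.append((str(quota_dir), limit))
    return sorted(hits, key=lambda item: len(PurePosixPath(item[0]).parts))


if __name__ == "__main__":
    # Hypothetical layout: a 10GB team quota subdivided by project.
    quotas = {
        "/teams/analytics/": 10 * GB,
        "/teams/analytics/proj_a": 2 * GB,
        "/teams/ml": 5 * GB,
    }
    for directory, limit in enclosing_quotas(quotas, "/teams/analytics/proj_a/part-0.parquet"):
        print(directory, limit // GB, "GB")
```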

Controlling How Long to Cache: Time-to-Live (TTL)

The Time-to-Live (TTL) feature allows you to define a maximum lifespan for cached data in specific directories. Once a file has been cached for longer than its TTL, Alluxio automatically evicts it during the next periodic check. The TTL timer starts when the data is first loaded into the Alluxio cache.

This is useful for:

  • Automatically cleaning up temporary or time-sensitive data.

  • Ensuring that stale data does not remain in the cache indefinitely.

  • Satisfying compliance requirements by limiting the lifespan of sensitive data.

Configuration

To use TTL-based eviction, you must enable the feature in alluxio-site.properties:

alluxio.ttl.policy.enabled=true

It is also critical to configure the scan interval, which determines how frequently workers check for expired data.

# Default is 1 hour. Set a larger interval for production environments.
alluxio.ttl.eviction.check.interval=1h

A very short interval can create unnecessary system overhead. We recommend setting this to a value that balances cleanup timeliness with performance, such as 24h in many production scenarios.
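The interaction between a TTL and the scan interval can be estimated with a short Python sketch. It assumes scans run on a fixed schedule starting at first_scan and that data whose TTL has elapsed by scan time is evicted in that scan. The function eviction_time is hypothetical and only illustrates the timing.

```python
import math
from datetime import datetime, timedelta


def eviction_time(loaded_at: datetime, ttl: timedelta,
                  first_scan: datetime, check_interval: timedelta) -> datetime:
    """Return the time of the first scan at or after the moment the TTL elapses."""
    expires_at = loaded_at + ttl
    if expires_at <= first_scan:
        return first_scan
    # Number of whole intervals needed to reach or pass the expiry time.
    intervals = math.ceil((expires_at - first_scan) / check_interval)
    return first_scan + intervals * check_interval


if __name__ == "__main__":
    first_scan = datetime(2024, 1, 1, 0, 0)
    loaded_at = datetime(2024, 1, 1, 0, 10)
    # A 5s TTL with a 1h scan interval: data remains cached until the 01:00 scan.
    print(eviction_time(loaded_at, timedelta(seconds=5), first_scan, timedelta(hours=1)))
```

Under these assumptions, data can remain cached for up to one full scan interval beyond its TTL, which is why a TTL much shorter than the interval has little practical effect.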

Basic Operations

TTL rules are managed using the alluxio ttl CLI. For a complete list of commands and flags, please refer to the CLI guide.

1. Add TTL Rules

To set a 24-hour TTL for all files under /s3/daily_reports/:

$ bin/alluxio ttl add --path /s3/daily_reports/ --time 24h
Added alluxioPath=/s3/daily_reports/ and time=24h

2. Remove TTL Rules

To remove the TTL policy for a path from ETCD:

$ bin/alluxio ttl remove --path /s3/test_folder
Removed TTL policy for alluxioPath=/s3/test_folder/

3. Update TTL Rules

You can update a TTL rule after it has been added:

$ bin/alluxio ttl update --path /s3/test_folder/ --time 30min
Updated alluxioPath=/s3/test_folder/ and time=30min

$ bin/alluxio ttl update --path /s3/test_folder/ --time 5s
Warning: You are setting TTL policy to 5s. This TTL is too small. Note that expired cache are scanned and evicted every 1h. Please consider making this TTL larger with `bin/alluxio ttl update` command.
Updated alluxioPath=/s3/test_folder/ and time=5s

4. List TTL Rules

To see all active TTL rules:

$ bin/alluxio ttl list
Listing all TTL policies
/s3/daily_reports/     TTL: 24 hours

Like quotas, TTL rules are hierarchical. If multiple TTL rules apply to a file, the most specific path match takes precedence.
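The "most specific path match" rule can be sketched in Python as a longest-prefix lookup. The function effective_ttl and the /s3/ rule with a 7d TTL are hypothetical, and treating rule paths as directory prefixes is an assumption for this example.

```python
from typing import Optional


def effective_ttl(rules: dict[str, str], file_path: str) -> Optional[str]:
    """Return the TTL of the longest rule path that contains file_path, or None."""
    best = None  # (normalized prefix, ttl) of the most specific match so far.
    for rule_path, ttl in rules.items():
        prefix = rule_path.rstrip("/") + "/"  # Treat every rule as a directory.
        if file_path.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, ttl)
    return None if best is None else best[1]


if __name__ == "__main__":
    # The /s3/ rule is hypothetical; /s3/daily_reports/ mirrors the example above.
    rules = {"/s3/": "7d", "/s3/daily_reports/": "24h"}
    print(effective_ttl(rules, "/s3/daily_reports/2024-01-01.csv"))
    print(effective_ttl(rules, "/s3/other/file.bin"))
```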

Controlling What to Evict First: Eviction Priority

By default, Alluxio uses an LRU (Least Recently Used) policy to decide which data to evict when cache space is full. However, some data is more critical than other data, regardless of how recently it was accessed. The Priority Eviction feature allows you to assign an eviction priority to files.

When space is needed, Alluxio will always evict data with a lower priority before evicting any data with a higher priority. There are three priority levels: HIGH, MEDIUM, and LOW.

This is useful for protecting critical datasets (e.g., a dimension table for a series of queries) from being evicted by less important, ad-hoc jobs.

Configuration

To enable priority eviction, you must set the following property in alluxio-site.properties on all client and worker nodes:

alluxio.worker.page.store.evictor.priority.enabled=true

Since priority rules are persisted in ETCD, you must also ensure that the ETCD connection details are properly configured.

Basic Operations

Priority rules are managed using the alluxio priority CLI. For a complete list of commands and flags, please refer to the CLI guide.

1. Add a Priority Rule: To assign HIGH priority to a critical dataset:

$ bin/alluxio priority add --path s3://bucket/critical_data --priority high

2. List Priority Rules: To view the current priority rules:

$ bin/alluxio priority list

3. Remove a Priority Rule: To delete an existing priority rule:

$ bin/alluxio priority remove --path s3://bucket/data

4. Update a Priority Rule: To update an existing priority rule:

$ bin/alluxio priority update --path s3://bucket/data --priority medium

For files with the same priority level, the standard eviction policy (e.g., LRU) is used to determine the eviction order.
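The resulting eviction order can be sketched in Python as a two-level sort: priority first, then recency within each level. The CachedFile type and the sample paths are hypothetical. This illustrates the ordering described above and is not Alluxio's evictor implementation.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


@dataclass
class CachedFile:
    path: str
    priority: Priority
    last_access: float  # Seconds since the epoch; smaller means older.


def eviction_order(files: list[CachedFile]) -> list[CachedFile]:
    """Order candidates: lowest priority first, least recently used first within a level."""
    return sorted(files, key=lambda f: (f.priority, f.last_access))


if __name__ == "__main__":
    files = [
        CachedFile("s3://bucket/critical_data/dim.parquet", Priority.HIGH, 100.0),
        CachedFile("s3://bucket/adhoc/new.csv", Priority.LOW, 900.0),
        CachedFile("s3://bucket/adhoc/old.csv", Priority.LOW, 200.0),
        CachedFile("s3://bucket/data/fact.parquet", Priority.MEDIUM, 50.0),
    ]
    for f in eviction_order(files):
        print(f.priority.name, f.path)
```

Note that the HIGH priority file is evicted last even though it has not been accessed as recently as the LOW priority files.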
