Cache Filtering

Cache Filter Overview

The total size of a dataset can exceed the disk space allocated to the Alluxio cache. To address this, Alluxio provides a cache filter feature that restricts caching to hot data. By configuring the cache filter, users specify which files to cache based on their file paths.

Configuration

To enable the cache filter feature, add the following configurations to the alluxio-site.properties file:

alluxio.user.client.cache.filter.enabled=true
alluxio.user.client.cache.filter.class=alluxio.client.file.cache.filter.RuleSetBasedCacheFilter
alluxio.user.client.cache.filter.type=RULE_SET
alluxio.user.client.cache.filter.config.file=${ALLUXIO_HOME}/conf/cache_filter.json
alluxio.user.client.cache.filter.config.check.interval=5min

  • alluxio.user.client.cache.filter.config.file specifies the path of the JSON file that defines the filter rules

  • alluxio.user.client.cache.filter.config.check.interval specifies how often the JSON file is checked

Together, these two properties allow the filter rules to be updated dynamically on a running cluster.
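As an illustration of how an interval-based check can pick up rule changes, the following Python sketch reloads a JSON file whenever its modification time changes. It is not Alluxio's implementation: RuleFileWatcher and check are hypothetical names, and a caller is assumed to invoke check once per check interval.

```python
import json
import os


class RuleFileWatcher:
    """Hypothetical helper: reloads a JSON rule file when its mtime changes."""

    def __init__(self, path):
        self.path = path
        self.mtime = None   # modification time seen at the last reload
        self.rules = None   # parsed content of the rule file

    def check(self):
        """Reload the file if it changed; return True when a reload happened."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self.mtime:
            return False
        with open(self.path, encoding="utf-8") as f:
            self.rules = json.load(f)
        self.mtime = mtime
        return True
```

With a check interval of 5min, an edit to the rule file in this sketch would take effect at the next call to check, at most five minutes later.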

Setting filter rules in the JSON file

There are three different types of rules for filtering in Alluxio:

  • Immutable: assume that the file on UFS will never change; these files are cached and never evicted

  • Skip Cache: assume that the file on UFS changes frequently; these files are never cached and always read from the UFS

  • Max Age: assume that a file on the UFS will change within a specified time interval; these files are cached but set to expire after a certain time interval

The following example JSON file illustrates how to set the rules.

{
  "apiVersion": 1,
  "data": {
    "immutable": [".*/immutable_tables/.*"],
    "skipCache": [".*/skip_cache_tables/.*"],
    "maxAge": {".*/mutable_tables/.*":"10s"},
    "defaultType": "immutable"
  },
  "metadata": {
    "immutable": [".*/immutable_tables/.*"],
    "skipCache": [".*/skip_cache_tables/.*"],
    "maxAge": {".*/mutable_tables/.*":"10s"},
    "defaultType": "immutable"
  }
}

This JSON file has two parts: one for data and one for metadata. Take the data part as an example.

  • A singleton list with the regex pattern .*/immutable_tables/.* is set for the immutable key, so files matching this pattern are cached and never evicted.

  • A singleton list with the regex pattern .*/skip_cache_tables/.* is set for the skipCache key, so files matching this pattern are never cached.

  • A single-entry map {".*/mutable_tables/.*":"10s"} is set for the maxAge key, so files matching the .*/mutable_tables/.* regex pattern are cached and expire after 10 seconds.

In the example, the metadata part is set identically, so the metadata of a file has the same caching behavior as its data. The two parts are defined separately so that data and metadata can be given different behaviors.
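Because the data and metadata parts are maintained separately, it can help to sanity-check a rule file before deploying it. The following Python sketch is a hypothetical helper, not part of Alluxio: validate and EXAMPLE_CONFIG are made-up names, the check assumes both parts are expected to be present, and patterns are compiled with Python's re module, whose syntax may differ from the regex engine Alluxio uses.

```python
import re

# Trimmed copy of the example rule file, already parsed into a dict.
EXAMPLE_CONFIG = {
    "apiVersion": 1,
    "data": {
        "immutable": [".*/immutable_tables/.*"],
        "skipCache": [".*/skip_cache_tables/.*"],
        "maxAge": {".*/mutable_tables/.*": "10s"},
        "defaultType": "immutable",
    },
    "metadata": {
        "immutable": [".*/immutable_tables/.*"],
        "skipCache": [".*/skip_cache_tables/.*"],
        "maxAge": {".*/mutable_tables/.*": "10s"},
        "defaultType": "immutable",
    },
}


def validate(config):
    """Return a list of human-readable problems found in a parsed rule file."""
    problems = []
    for section in ("data", "metadata"):
        rules = config.get(section)
        if rules is None:
            # Assumption: both parts should be present.
            problems.append(f"missing section: {section}")
            continue
        # Collect every regex: list entries plus the keys of the maxAge map.
        patterns = (
            list(rules.get("immutable", []))
            + list(rules.get("skipCache", []))
            + list(rules.get("maxAge", {}))
        )
        for pattern in patterns:
            try:
                re.compile(pattern)
            except re.error as exc:
                problems.append(f"{section}: bad regex {pattern!r}: {exc}")
    return problems
```

Calling validate(EXAMPLE_CONFIG) returns an empty list, while a file with an unbalanced parenthesis in a pattern or a missing part produces one message per problem.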

Finally, defaultType sets the rule type for files that do not match any configured regex pattern. In the example, defaultType is immutable.
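The way a path resolves to a rule type can be sketched in a few lines of Python. This is an illustration only, not Alluxio's code: classify and RULES are hypothetical names, and two behaviors are assumed rather than taken from the text above, namely that a pattern must match the whole path and that rule types are checked in the order immutable, skipCache, maxAge.

```python
import re

# Rules copied from the "data" part of the example JSON file.
RULES = {
    "immutable": [".*/immutable_tables/.*"],
    "skipCache": [".*/skip_cache_tables/.*"],
    "maxAge": {".*/mutable_tables/.*": "10s"},
    "defaultType": "immutable",
}


def classify(path, rules):
    """Return (rule_type, max_age); max_age is None unless a maxAge pattern matched.

    Assumptions: full-path matching and a fixed check order
    (immutable, then skipCache, then maxAge).
    """
    for pattern in rules.get("immutable", []):
        if re.fullmatch(pattern, path):
            return ("immutable", None)
    for pattern in rules.get("skipCache", []):
        if re.fullmatch(pattern, path):
            return ("skipCache", None)
    for pattern, age in rules.get("maxAge", {}).items():
        if re.fullmatch(pattern, path):
            return ("maxAge", age)
    # No pattern matched: fall back to the configured default type.
    return (rules["defaultType"], None)
```

For example, classify("/warehouse/mutable_tables/t1/part-0", RULES) yields the maxAge type with an age of "10s", while a path such as "/tmp/other/file" matches nothing and falls back to the default type, immutable.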
