Cache Filtering

There can be cases where the total size of the dataset exceeds the total disk space allocated for the Alluxio cache. To address this, Alluxio provides the cache filter feature that allows the admin to specify rules to cache only certain files based on the file path. The admin can also specify rules for Alluxio to automatically reload the cache from UFS to observe the latest file contents.

Configuration

To enable the cache filter feature, add the following configurations to the alluxio-site.properties file:

# Cache Filter is not enabled by default
alluxio.user.client.cache.filter.enabled=true
alluxio.user.client.cache.filter.type=RULE_SET
alluxio.user.client.cache.filter.config.file=${ALLUXIO_HOME}/conf/cache_filter.json
alluxio.user.client.cache.filter.config.check.interval=5min

Alluxio comes with a template JSON file for the cache filter rules. The file is located at ${ALLUXIO_HOME}/conf/cache_filter.json.template. Alluxio components will dynamically reload the cache filter rule file every alluxio.user.client.cache.filter.config.check.interval, so changes to the file will be picked up automatically with a delay. If you want a faster update, you can set the check interval to a smaller value like 30s.

Cache Filter Rules

Alluxio defines 3 different modes for cache filtering, depending on a file's mutability.

  • Immutable: User claims that the metadata/data on the path never changes. Alluxio will cache the metadata/data of these files and never check the UFS to see if the files have changed.

  • Skip Cache: User claims that the metadata/data on the path should not be cached in Alluxio. This is mostly used for files/directories that change frequently and have little value in caching. It is also suitable for files that are read very rarely.

  • Max Age: User claims that the metadata/data on the path may change. For example, MaxAge=10min means that metadata/data cached within the last 10 minutes is acceptable (assumed fresh). If the existing cache is older than 10 minutes, Alluxio has to check with the UFS again in case the file has changed.
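The three modes above can be summarized as a single decision: does a request need to go to the UFS? The following is a minimal sketch of that decision, not Alluxio's implementation. The function name `needs_ufs_check` and its parameters are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_ufs_check(mode: str,
                    cached_at: Optional[datetime],
                    now: datetime,
                    max_age: Optional[timedelta] = None) -> bool:
    """Toy model: return True if the request must go to the UFS."""
    if mode == "skipCache":
        return True   # this path is never served from cache
    if cached_at is None:
        return True   # nothing cached yet, so the first access loads from UFS
    if mode == "immutable":
        return False  # the cached copy is trusted forever
    if mode == "maxAge":
        if max_age is None:
            raise ValueError("maxAge mode requires a max_age value")
        # Cache older than max_age is considered stale.
        return now - cached_at > max_age
    raise ValueError(f"unknown mode: {mode}")
```

With MaxAge=10min, an entry cached 5 minutes ago is served from cache, while an entry cached 11 minutes ago triggers a check with the UFS.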

Example

To illustrate how to set the rules, an example JSON file is shown below.

{
  "apiVersion": 1,
  "metadata": {
    "immutable": [".*/immutable_tables/.*"],
    "skipCache": [".*/skip_cache_tables/.*"],
    "maxAge": {".*/mutable_tables/.*":"10s"},
    "defaultType": "immutable"
  },
  "data": {
    "immutable": [".*/immutable_tables/.*"],
    "skipCache": [".*/skip_cache_tables/.*"],
    "maxAge": {".*/mutable_tables/.*":"10s"},
    "defaultType": "immutable"
  }
}

The configuration file allows the admin to specify different caching rules for metadata and data. Let's just focus on the metadata part for now. The metadata section contains 4 parts:

  1. immutable: A list of regex patterns for files that should be considered Immutable by Alluxio.

  2. skipCache: A list of regex patterns for files that should be considered Skip Cache by Alluxio.

  3. maxAge: A map of regex patterns with different Max Age values. For example, files matching ".*/mutable_tables/.*" will have a max cache age of 10 seconds.

  4. defaultType: If a file does not match any of the regex patterns, it will be considered as the default type specified.

Note: Canonically, the rule names are in camel case: immutable, skipCache, maxAge, but the rule names are case-insensitive so both Skipcache and skipCAche are parsed correctly to skipCache.
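To make the rule structure concrete, here is a small sketch that classifies a path against one section of the example file. It is an illustration under stated assumptions, not Alluxio's matching code: it assumes a pattern must match the whole path, and it checks immutable, then skipCache, then maxAge, which is simply the order used in this document (the actual precedence when several patterns match is not specified here). The names `normalize` and `classify` are hypothetical.

```python
import json
import re

# The metadata and data sections from the example above.
EXAMPLE_CONFIG = """
{
  "apiVersion": 1,
  "metadata": {
    "immutable": [".*/immutable_tables/.*"],
    "skipCache": [".*/skip_cache_tables/.*"],
    "maxAge": {".*/mutable_tables/.*": "10s"},
    "defaultType": "immutable"
  },
  "data": {
    "immutable": [".*/immutable_tables/.*"],
    "skipCache": [".*/skip_cache_tables/.*"],
    "maxAge": {".*/mutable_tables/.*": "10s"},
    "defaultType": "immutable"
  }
}
"""

# Rule names are case-insensitive, so map any spelling to the canonical one.
CANONICAL = {
    "immutable": "immutable",
    "skipcache": "skipCache",
    "maxage": "maxAge",
    "defaulttype": "defaultType",
}


def normalize(section):
    """Return the section with canonical camel-case rule names."""
    return {CANONICAL.get(key.lower(), key): value for key, value in section.items()}


def classify(path, section):
    """Return (rule_type, max_age) for a path; max_age is None unless maxAge."""
    rules = normalize(section)
    # ASSUMPTION: full-path match and this checking order are illustrative only.
    for pattern in rules.get("immutable", []):
        if re.fullmatch(pattern, path):
            return ("immutable", None)
    for pattern in rules.get("skipCache", []):
        if re.fullmatch(pattern, path):
            return ("skipCache", None)
    for pattern, age in rules.get("maxAge", {}).items():
        if re.fullmatch(pattern, path):
            return ("maxAge", age)
    return (rules["defaultType"], None)


config = json.loads(EXAMPLE_CONFIG)
print(classify("s3://bucket/mutable_tables/t1/part-0.parquet", config["metadata"]))
```

A path that matches none of the patterns, such as s3://bucket/other/file, falls through to defaultType.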

The following sections will further explain the best caching rule configurations for different use cases.

Managing Immutable Files

Immutable files are the best fit for caching. If a file is immutable, there will not be any consistency issue in a distributed cache system like Alluxio. If a file is guaranteed to not change, users will observe the same contents from Alluxio or UFS. The default cache_filter.json.template defines both metadata and data as immutable for all files in Alluxio.

{
  "apiVersion": "1",
  "metadata": {
    "defaultType": "immutable"
  },
  "data": {
    "defaultType": "immutable"
  }
}

It is also possible to use Alluxio to only cache metadata but not data.

{
  "apiVersion": "1",
  "metadata": {
    "defaultType": "immutable"
  },
  "data": {
    "defaultType": "skipCache"
  }
}

If you know certain files are not suitable for caching, they can be added to the skipCache list. In this way, Alluxio acts as a router that serves each request either from cache or directly from the UFS.

{
  "apiVersion": "1",
  "metadata": {
    "skipCache": [".*/never_cache/.*"],
    "defaultType": "immutable"
  },
  "data": {
    "skipCache": [".*/never_cache/.*"],
    "defaultType": "immutable"
  }
}

Managing Mutable Files

Mutable files are troublesome to caching systems, including Alluxio. For files whose content can change over time, Alluxio offers 3 options to handle them.

Option 1: Mutable files do NOT go through cache

The easiest way is to go directly to the UFS for those files. A UFS typically offers strong consistency and is considered the source of truth. If all metadata and data requests are directed to the UFS, then users of Alluxio do not see any consistency issues.

Typically, users identify such mutable directories and mark them skipCache. For example, users could have some directories on NAS used for storing scripts. They want strong consistency on those directories and it is fine to directly retrieve them from UFS instead of from cache. The following cache filter configuration shows this example:

{
  "apiVersion": "1",
  "metadata": {
    "skipCache": ["file://dev/scripts/.*", "file://data/dev/scripts/.*"],
    "defaultType": "immutable"
  },
  "data": {
    "skipCache": ["file://dev/scripts/.*", "file://data/dev/scripts/.*"],
    "defaultType": "immutable"
  }
}

Suitable use cases: This option fits use cases where performance can be traded for strong consistency. Also, if the file is updated frequently, the cache performance gain will soon be outweighed by the extra admin overhead or cache refresh cost.

Option 2: Mutable files are cached and the cache is refreshed manually

In some use cases, performance cannot be traded away, so skipCache is not an option. If the files change only occasionally and the admin knows when they change, the files can still be configured as immutable, and the admin manually refreshes or invalidates the cache in Alluxio after the files are updated in the UFS.

The cache filter configuration looks the same as before, as if those files are immutable.

{
  "apiVersion": "1",
  "metadata": {
    "defaultType": "immutable"
  },
  "data": {
    "defaultType": "immutable"
  }
}

Because those paths are configured immutable, Alluxio will never refresh the cache. This relies on the administrator to perform extra operations to refresh.

As an example workflow, consider the classic ELT use case below:

  1. Load raw data from OLTP database, table SALES. The filter is all trade entries from the past day.

  2. All entries go through the ELT pipeline and a materialized table is created under s3://datalake/tables/pipeline/sales/, like s3://datalake/tables/pipeline/sales/data_20240102/xxx.parquet.

  3. Alluxio mounts s3://datalake/tables/pipeline/ and serves the data to data analysts through its northbound S3/FUSE interface.

In the example scenario, the data engineer finds the data in s3://datalake/tables/pipeline/sales/data_20240102/ is incorrect and now needs to reload the table by triggering the ELT pipeline again. So files in that directory will be changed (mutable).

After the ELT pipeline has been triggered and has completed, the data engineer has to manually reload the metadata in Alluxio.

# The command is
$ alluxio job load --metadata-only --path <ufs path> --submit

# So in this case the data engineer needs to trigger
$ alluxio job load --metadata-only \
  --path s3://datalake/tables/pipeline/sales/data_20240102/ --submit

The command above relies on the Scheduler service in Alluxio master in 3.x. The command will force reload the metadata under the specified directory in Alluxio workers. If the cache contents have changed, the cache will be invalidated (so the next read will load the latest cache data from UFS).

If the data is best served from cache, this is also a good time to trigger data preloading.

$ alluxio job load --path s3://datalake/tables/pipeline/sales/data_20240102/ --submit

The example shows how to update and reload one directory. When an operation modifies the UFS contents directly without involving Alluxio, the corresponding metadata needs to be forcibly refreshed to reflect that change.

Suitable use cases: This is best suited to cases where file mutations are rare and triggered manually, and most files are immutable. It is also suitable when data is generated/updated by pipelines, and we can add one extra step to that pipeline to notify Alluxio.
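The extra pipeline step mentioned above could look like the sketch below. It only wraps the `alluxio job load` commands shown earlier in this section, and it assumes the `alluxio` CLI is on the PATH of the machine that runs the pipeline. The function names `build_load_command` and `notify_alluxio` are hypothetical.

```python
import subprocess
from typing import List


def build_load_command(ufs_path: str, metadata_only: bool = True) -> List[str]:
    """Build the `alluxio job load` command line used in this section."""
    cmd = ["alluxio", "job", "load"]
    if metadata_only:
        cmd.append("--metadata-only")
    cmd += ["--path", ufs_path, "--submit"]
    return cmd


def notify_alluxio(ufs_path: str, preload_data: bool = False) -> None:
    """Final pipeline step: reload metadata, then optionally preload data."""
    # ASSUMPTION: the `alluxio` binary is available on PATH.
    subprocess.run(build_load_command(ufs_path, metadata_only=True), check=True)
    if preload_data:
        subprocess.run(build_load_command(ufs_path, metadata_only=False), check=True)


# Example: print the command the pipeline would run after rewriting one partition.
print(" ".join(build_load_command("s3://datalake/tables/pipeline/sales/data_20240102/")))
```

Using `check=True` makes the pipeline step fail loudly if the command exits with an error, so a failed refresh does not go unnoticed.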

Option 3: Time based cache refresh

In some use cases, performance is a hard requirement and it is not known when to trigger updates on Alluxio to refresh mutated files. Under those circumstances, users may want Alluxio to periodically pick up the changes on those files and update automatically.

The MaxAge type serves this purpose. If data in s3://datalake/tables/pipeline/sales/ and s3://datalake/tables/pipeline/inventory/ changes every few hours, the cache filter can be configured as below:

{
  "apiVersion": "1",
  "metadata": {
    "maxAge": {"s3://datalake/tables/pipeline/sales/.*": "1h","s3://datalake/tables/pipeline/inventory/.*": "1h"},
    "defaultType": "immutable"
  },
  "data": {
    "defaultType": "immutable"
  }
}

Because maxAge is set to 1 hour, an Alluxio worker will perform a metadata sync with the UFS to observe the latest file metadata if the existing metadata cache is older than 1 hour.

Users may ask, why is the maxAge only set on metadata but not data? In Alluxio, metadata defines data status and controls data lifecycle. When the metadata is refreshed (reloaded from UFS with a different state), Alluxio will remove the data too (but not reload the data into cache). Therefore, it is enough to only set the maxAge rule on metadata to trigger metadata reloading. This configuration is both effective and performant.
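The interaction described above (a metadata refresh invalidates the cached data, and the data is loaded again only on the next read) can be modeled with a toy worker. This is a simplified sketch, not Alluxio's worker logic: the class `ToyWorker`, the integer `version` field, and the dict-based UFS are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyWorker:
    max_age_s: float
    meta: Dict[str, dict] = field(default_factory=dict)   # path -> cached metadata
    data: Dict[str, bytes] = field(default_factory=dict)  # path -> cached content

    def read(self, path: str, ufs: Dict[str, dict], now: float) -> bytes:
        entry = self.meta.get(path)
        # Sync metadata on first access or when the cached metadata is too old.
        if entry is None or now - entry["synced_at"] > self.max_age_s:
            fresh_version = ufs[path]["version"]
            if entry is not None and entry["version"] != fresh_version:
                # Metadata changed: drop the cached data, but do not reload it yet.
                self.data.pop(path, None)
            self.meta[path] = {"version": fresh_version, "synced_at": now}
        if path not in self.data:
            # Data is loaded from the UFS lazily, on the next read.
            self.data[path] = ufs[path]["content"]
        return self.data[path]


ufs = {"/sales/part-0": {"version": 1, "content": b"v1"}}
worker = ToyWorker(max_age_s=3600)
worker.read("/sales/part-0", ufs, now=0)
ufs["/sales/part-0"] = {"version": 2, "content": b"v2"}
print(worker.read("/sales/part-0", ufs, now=1800))  # within maxAge: still the old copy
print(worker.read("/sales/part-0", ufs, now=3601))  # older than maxAge: refreshed
```

Note that no data rule is needed in this model: the data cache is only ever invalidated as a side effect of the metadata sync.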

Comparison between 3 Options

The 3 options above compare as follows.

Caches strongly prefer immutability. If the underlying files never change, clients are guaranteed to observe the same contents from Alluxio client cache, cluster cache, or UFS directly. Alluxio can also replicate, evict, or reload the cache without consistency concerns. Therefore, we recommend Option 1 and Option 2 over Option 3. Options 1 and 2 better guarantee that the Alluxio cache is immutable, and consistency issues are best avoided.

If you have to choose Option 3 for automatic cache refreshing, maxAge should be set to balance performance against consistency. A load metadata operation with the UFS can be slow and costly. If the data is updated every few hours, a maxAge of 1h is certainly better than 10s. However, if the data changes so frequently that you need a maxAge below 1min, please use skipCache instead: at that update rate, caching becomes both worthless and confusing, because clients may still observe a stale version.
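The 1min guideline above can be turned into a quick check over a rule file section. The sketch below is a hypothetical helper, not an Alluxio tool. It assumes only the duration suffixes that appear in this document (s, min, h); Alluxio may accept other formats that this parser rejects.

```python
import re
from typing import List

# ASSUMPTION: only the suffixes used in this document are handled.
_UNIT_SECONDS = {"s": 1, "min": 60, "h": 3600}


def parse_duration_seconds(text: str) -> int:
    """Parse strings such as "10s", "5min", or "1h" into seconds."""
    match = re.fullmatch(r"(\d+)\s*(s|min|h)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def short_max_age_patterns(section: dict, threshold_s: int = 60) -> List[str]:
    """Return maxAge patterns whose age is below the threshold (default 1min)."""
    return [
        pattern
        for pattern, age in section.get("maxAge", {}).items()
        if parse_duration_seconds(age) < threshold_s
    ]


section = {
    "maxAge": {
        ".*/mutable_tables/.*": "10s",
        "s3://datalake/tables/pipeline/sales/.*": "1h",
    },
    "defaultType": "immutable",
}
# Patterns listed here are candidates for skipCache instead of maxAge.
print(short_max_age_patterns(section))
```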

If a file is configured immutable, Alluxio will never check the UFS to pick up changes on that file. In this case, when the file has changed in the UFS, you will observe a stale version from Alluxio cache. If 2 workers cache the file, they may also load it at different times and end up with different versions, so clients can observe two versions of the same file, and the older version will never be refreshed.

FAQ

What is the cache refresh logic when Cache Filter is not enabled?

Alluxio previously used alluxio.user.file.metadata.sync.interval configuration to control whether existing cache should be considered stale and refreshed.

The Cache Filter rule types map to the old alluxio.user.file.metadata.sync.interval semantics as follows:

  • Immutable: alluxio.user.file.metadata.sync.interval=-1

  • Skip Cache: There is no matching value. alluxio.user.file.metadata.sync.interval only specifies when existing cache should be refreshed, whereas skipCache mode does not go through cache at all.

  • Max Age(k): alluxio.user.file.metadata.sync.interval=k

  • Max Age(0): alluxio.user.file.metadata.sync.interval=0

If the Cache Filter feature is not enabled, Alluxio will fall back to resolving alluxio.user.file.metadata.sync.interval configuration to determine whether cache should be refreshed. The default value is -1, meaning Alluxio will never attempt a second metadata sync with the UFS and update cache contents, after the very first sync.

# Default configuration
alluxio.user.file.metadata.sync.interval=-1
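The mapping listed in this answer can be written down as a small lookup. This is only a restatement of the list above in code form; the function name `legacy_sync_interval` is hypothetical.

```python
from typing import Optional


def legacy_sync_interval(rule_type: str, max_age: Optional[str] = None) -> Optional[str]:
    """Return the alluxio.user.file.metadata.sync.interval value that matches
    a cache filter rule type, or None when there is no matching concept."""
    if rule_type == "immutable":
        return "-1"      # never sync again after the first load
    if rule_type == "maxAge":
        if max_age is None:
            raise ValueError("maxAge requires a value, for example '1h' or '0'")
        return max_age   # Max Age(k) maps to interval=k, including k=0
    if rule_type == "skipCache":
        return None      # skipCache bypasses the cache, so no interval applies
    raise ValueError(f"unknown rule type: {rule_type}")


print(legacy_sync_interval("maxAge", "1h"))
```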

What are some considerations when I configure different cache filter rules for metadata and data?

Typically, your cache filter rules for metadata and data should be the same, because reasoning about the behavior or troubleshooting issues can be tricky when you have different rules for metadata and data.

For example, suppose you cache metadata as immutable and skip all data caching with skipCache. A file s3://bucket/table/data.txt is then overwritten in the UFS, shrinking from length=1GB to length=512MB. If you scan the file through Alluxio from beginning to end, Alluxio relies on the stale cached metadata with length=1GB and tries to read the data from the UFS. You will then receive an exception from the UFS complaining that an offset beyond 512MB is not found. This example shows the risk of defining different cache filter rules for metadata and data: when the two are read from different sources (one from Alluxio cache and the other from the UFS), they can be inconsistent.
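The failure in this example can be reproduced with a scaled-down toy model, where 1024 bytes stand in for 1GB and 512 bytes for 512MB. The class `ToyUfs` and the function `scan` are hypothetical; the real exception type and message depend on the UFS.

```python
class ToyUfs:
    """Stand-in for a UFS object that rejects reads starting past the end."""

    def __init__(self, content: bytes):
        self.content = content

    def read(self, offset: int, length: int) -> bytes:
        if offset >= len(self.content):
            raise EOFError(f"offset {offset} is beyond file length {len(self.content)}")
        return self.content[offset:offset + length]


def scan(ufs: ToyUfs, cached_length: int, chunk: int) -> bytes:
    """Read the whole file chunk by chunk, trusting the cached length."""
    out = bytearray()
    offset = 0
    while offset < cached_length:
        out += ufs.read(offset, min(chunk, cached_length - offset))
        offset += chunk
    return bytes(out)


ufs = ToyUfs(b"x" * 512)   # the file was overwritten and is now shorter
stale_length = 1024        # immutable metadata still reports the old length
try:
    scan(ufs, stale_length, chunk=128)
except EOFError as err:
    print("scan failed:", err)
```

The scan succeeds for the first 512 bytes and fails on the first chunk that starts past the new end of the file, which mirrors the exception described above.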

Previous sections have listed some scenarios where you may want to use different cache filter rules for metadata and data.

Why does my application observe inconsistent data when using maxAge rule?

When maxAge is configured, Alluxio provides "Bounded Staleness" consistency. "Bounded Staleness" specifies that clients may observe a version that is older than the latest version in the UFS, but the staleness is bounded by a certain time. Effectively, the bound is the maxAge definition. This consistency level is stronger than eventual consistency, but weaker than strong consistency or sequential consistency. Within this bound, clients may observe any stale version of the file, hence the inconsistency.

Under the hood, Alluxio is a distributed system and can serve multiple levels of cache (client cache and cluster cache). If clients are served by different caches, the caches may load from the UFS at different times, and the clients consequently observe different versions of the file.

After the maxAge elapses, Alluxio components will consider existing cache as stale. Then they will attempt a metadata sync with the UFS to observe the latest version.
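The timeline described in this answer can be illustrated with two independent toy caches that share the same maxAge but load the file at different times. This is a simplified model with hypothetical names (`ToyCache`, a dict as the UFS, timestamps in seconds), not Alluxio's cache implementation.

```python
class ToyCache:
    """A single-entry cache that reloads from the UFS once maxAge has elapsed."""

    def __init__(self, max_age_s: float):
        self.max_age_s = max_age_s
        self.value = None
        self.loaded_at = None

    def get(self, ufs: dict, key: str, now: float):
        if self.loaded_at is None or now - self.loaded_at > self.max_age_s:
            # First access, or the cached copy is older than maxAge: reload.
            self.value, self.loaded_at = ufs[key], now
        return self.value


ufs = {"f": "v1"}
cache_a = ToyCache(max_age_s=60)
cache_b = ToyCache(max_age_s=60)

cache_a.get(ufs, "f", now=0)    # cache A loads v1
ufs["f"] = "v2"                 # the file changes in the UFS
cache_b.get(ufs, "f", now=20)   # cache B loads v2

# Within the bound, the two caches disagree.
print(cache_a.get(ufs, "f", now=30), cache_b.get(ufs, "f", now=30))
# After maxAge elapses for cache A, it reloads and the caches converge.
print(cache_a.get(ufs, "f", now=61), cache_b.get(ufs, "f", now=61))
```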

Why does my application observe inconsistent data even when I configured files to be immutable?

Similar to the question above, different Alluxio components may load from the UFS at different times. If the file in the UFS has changed, it is possible that different Alluxio components cache different versions of the file. Because the cache filter rule is set to immutable, Alluxio components will never attempt to check the file in the UFS again, so the existing cache may stay inconsistent forever.

When this becomes an issue, refer to Option 1 or Option 2 in Managing Mutable Files.
