Index Service
Note that the Index Service is currently an experimental feature. See limitations below.
The Index Service is a caching service for directory listing, designed to handle large directories containing hundreds of millions of files and subdirectories, while providing high performance and scalability.
Overview
The Index Service is a distributed cache over the directory listings, similar to the cache for the metadata and data of regular files. It utilizes cached content to provide faster listings than directly listing the UFS. On the client side, it enables parallel processing of the directory entries, to improve the performance of the listing operation.
The Index Service also integrates with Cache Filter to support setting fine-grained cache filter rules for different directories.
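As a rough illustration of the client-side parallelism mentioned above, the following Python sketch fans the entries of a listing out to a thread pool. It is not the Alluxio client implementation: `process_entry` and `process_listing` are hypothetical names, and defaulting the pool size to the number of CPU cores is an assumption made for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def process_entry(name: str) -> str:
    # Hypothetical per-entry work, standing in for whatever a client
    # does with each directory entry it receives.
    return name.upper()


def process_listing(entries, parallelism=None):
    """Process directory entries in parallel, preserving their order."""
    # Fall back to the CPU core count (assumption), and to 1 if unknown.
    workers = parallelism or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order of the inputs.
        return list(pool.map(process_entry, entries))


if __name__ == "__main__":
    print(process_listing(["part-0", "part-1", "part-2"], parallelism=2))
```

A thread pool suits this kind of work when the per-entry processing is I/O-bound or cheap; the right pool size depends on how many listing requests run concurrently.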
Enabling Index Service
To enable Index Service, add the following configurations to all Alluxio nodes, including clients:
Additionally, there are a few performance-related configurations available:
alluxio.client.index.service.parallelism
controls the number of threads a client uses for handling the directory listing requests. It defaults to the number of CPU cores available on the client node. Setting this to a greater value may help improve performance, if the workload involves a large number of concurrent listing requests.
alluxio.client.index.service.list.batch.size
controls the size of each batch the client receives from the listing cache on the worker nodes. A smaller batch helps reduce the peak memory usage on both the client and the worker, but may not fully utilize the bandwidth between them. A larger batch may reduce the number of round trips between the client and the worker, thus improving listing performance, but incurs a higher memory footprint on both sides.
Restart the nodes to apply the configurations.
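The batch size trade-off described above can be made concrete with a small Python sketch. It assumes, purely for illustration, that the batch size counts directory entries; `round_trips` and `batches` are hypothetical helpers, not part of any Alluxio API.

```python
import math


def round_trips(num_entries: int, batch_size: int) -> int:
    """Round trips needed to stream a listing in fixed-size batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(num_entries / batch_size)


def batches(entries, batch_size):
    """Yield the listing in consecutive batches of at most batch_size entries."""
    for start in range(0, len(entries), batch_size):
        yield entries[start:start + batch_size]


if __name__ == "__main__":
    entries = [f"file-{i}" for i in range(10_000)]
    for size in (100, 1_000, 5_000):
        # The largest batch approximates the peak number of entries
        # held in memory at once on either side.
        peak = max(len(b) for b in batches(entries, size))
        print(f"batch={size} round_trips={round_trips(len(entries), size)} peak={peak}")
```

Larger batches lower the round-trip count roughly in proportion, while the peak number of entries held at once grows with the batch size.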
Currently, the Index Service is available via the Alluxio Java API and the POSIX API.
Listing through the Alluxio CLI:
Listing through POSIX API after mounting Alluxio with FUSE:
Setting cache filter rules and Consistency Implications
When a client lists a directory through the Index Service, the Index Service may reuse the cached version of the target directory if one exists, or load it from the UFS if it does not. Even when a directory's content has been cached, it may not be eligible for reuse, because the cache can expire according to the user's preference. The Index Service supports fine-grained directory cache filter rules through a Cache Filter configuration, and consults the configured rules to decide whether a cached version of the target directory can be reused.
The following example shows a configuration where different cache filter rules are applied to different directories:
When a directory's content changes in the UFS, for example when new files are added or existing files are deleted, a listing through the Index Service reflects the changes according to the cache filter rule set on that directory.
There are 3 different types of rules that can be set on the directories:
Max Age
In the above example, the directory s3://tables/daily_partitions/ contains incremental updates to the tables that are generated on a daily basis. This directory is set to have a max age of 4 hours, which means a stale, cached version of the directory's listing will exist for at most 4 hours. In other words, after a new partition is generated, a user has to wait at most 4 hours before it is visible in the directory listing.
Skip Cache
The directory s3://tables/intermediate_temp_tables contains temporary tables that are created as the intermediate artifacts of a data processing pipeline, and they need to be visible in the listings as soon as they are created. This directory is set to skip cache, which means listing this directory will always invoke the UFS and return the latest contents.
Immutable
The default cache filter rule is set to be immutable, therefore any other directories not explicitly listed in the cache filter configuration are considered immutable. Immutable directories will be loaded once on the very first time they are listed through Alluxio, and the Index Service will never again check with UFS to see if their contents have changed. This immutable type is suitable for the majority of directories which represent a static dataset that never changes.
The Cache Filter documentation has more details on the different types of cache filter rules, and offers recommendations on suitable combinations for different use cases.
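The reuse decision for the three rule types can be sketched in Python as follows. This is a minimal model under stated assumptions, not the Index Service's actual logic: the `Rule` class, the `can_reuse` function, and the choice to treat a listing as expired exactly at its max age are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    kind: str  # "immutable", "skip_cache", or "max_age"
    max_age_seconds: Optional[float] = None  # only used by "max_age"


def can_reuse(rule: Rule, cached_at: Optional[float], now: float) -> bool:
    """Return True if a cached listing may be served without asking the UFS."""
    if cached_at is None:
        # Nothing cached yet: the listing must be loaded from the UFS.
        return False
    if rule.kind == "skip_cache":
        # Always go to the UFS to return the latest contents.
        return False
    if rule.kind == "immutable":
        # Loaded once, then never checked against the UFS again.
        return True
    if rule.kind == "max_age":
        # Reusable only while younger than the configured max age.
        return (now - cached_at) < rule.max_age_seconds
    raise ValueError(f"unknown rule kind: {rule.kind}")


if __name__ == "__main__":
    four_hours = Rule("max_age", max_age_seconds=4 * 3600)
    print(can_reuse(four_hours, cached_at=0.0, now=3600.0))
    print(can_reuse(four_hours, cached_at=0.0, now=5 * 3600.0))
```

With a max age of 4 hours, a listing cached one hour ago is reused, while one cached five hours ago is reloaded.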
Note that in the above example, the directory paths end with a $ to indicate that the specified cache filter rule applies only to the directory itself, not recursively to the files and subdirectories in it. The regular expressions also use /? to tolerate an optional trailing slash. You can check the effective cache filter rule on directories and files with bin/alluxio fs ls -c /path:
Note that cache filter rules can be set differently on a directory and on the files contained in it. In the above example, although /s3_tables/intermediate_temp_tables is set to skip cache, the files and subdirectories in it are immutable, according to the default type:
It's important to note that if a directory has immutable or max-age cacheability, changes to it are not reflected immediately, even when they are made directly through Alluxio. If the directory is immutable, use the bin/alluxio index invalidate command to invalidate the now outdated cache. If it has a max age, the stale cached listing is refreshed automatically after the specified age elapses. You can also use the invalidate command to manually invalidate the cache and force an immediate refresh afterward.
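To illustrate how a pattern ending in /?$ applies to a directory but not to its children, here is a small Python sketch of rule resolution. The patterns, the first-match-wins ordering, and the `effective_rule` helper are assumptions for illustration only; they are not the actual Cache Filter configuration format or matching algorithm.

```python
import re

# Hypothetical rules, checked in order; the first matching pattern wins.
RULES = [
    (re.compile(r"/s3_tables/daily_partitions/?$"), "max_age"),
    (re.compile(r"/s3_tables/intermediate_temp_tables/?$"), "skip_cache"),
]
DEFAULT_RULE = "immutable"  # applies to every path no pattern matches


def effective_rule(path: str) -> str:
    """Return the rule for a path, falling back to the default."""
    for pattern, rule in RULES:
        # match() anchors at the start; the trailing $ anchors at the end,
        # so children of the directory do not match.
        if pattern.match(path):
            return rule
    return DEFAULT_RULE


if __name__ == "__main__":
    for p in (
        "/s3_tables/intermediate_temp_tables",
        "/s3_tables/intermediate_temp_tables/",
        "/s3_tables/intermediate_temp_tables/part-0",
    ):
        print(p, "->", effective_rule(p))
```

In this model the directory, with or without a trailing slash, resolves to skip cache, while a file inside it falls through to the immutable default.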
Refreshing Stale Listings Manually
In rare cases where an unforeseen update is made to a directory that was deemed immutable, in order for Alluxio to know about this change, the admin can manually invalidate the cached directory listing. On the next access, the Index Service will reload the directory from UFS, thus picking up the change.
An admin can use the index invalidate command to invalidate a cached directory. For example:
This will invalidate the listing cache of /s3_tables/permanent_tables, and force an update from the UFS on the next access.
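The effect of invalidation on an immutable directory can be modeled with a toy cache in Python. This is a sketch of the behavior described above, not Alluxio code: `ListingCache`, its methods, and the in-memory `ufs` dictionary are hypothetical stand-ins.

```python
class ListingCache:
    """Toy model of an immutable listing cache with manual invalidation."""

    def __init__(self, ufs_list):
        self._ufs_list = ufs_list  # callable: path -> iterable of names
        self._cache = {}
        self.ufs_calls = 0  # counts how often the UFS was consulted

    def list_dir(self, path):
        if path not in self._cache:
            # Cache miss: load the listing from the UFS exactly once.
            self.ufs_calls += 1
            self._cache[path] = sorted(self._ufs_list(path))
        return list(self._cache[path])

    def invalidate(self, path):
        # Drop the cached listing; the next list_dir reloads from the UFS.
        self._cache.pop(path, None)


if __name__ == "__main__":
    path = "/s3_tables/permanent_tables"
    ufs = {path: {"t1", "t2"}}
    cache = ListingCache(lambda p: ufs[p])

    print(cache.list_dir(path))  # first access loads from the UFS
    ufs[path].add("t3")          # unforeseen update made in the UFS
    print(cache.list_dir(path))  # still the stale cached listing
    cache.invalidate(path)
    print(cache.list_dir(path))  # reloaded, now includes the new entry
```

The reload happens lazily on the next access after invalidation, matching the behavior described for the invalidate command.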
Limitations
The Index Service currently has the following known limitations:
1. Not available in Alluxio's S3 API.
2. If a file is created or deleted through Alluxio, the change to its parent directory is not immediately reflected in the directory's listing if the directory has immutable or max-age cacheability.
For limitation 2, a workaround is to manually invalidate the cached directory using the index invalidate command.