Directory-Based Cluster Quota
Alluxio allows admins to define a quota on a directory, which can limit the total amount of cache space used by the files in that directory in Alluxio, across all workers in the cluster.
ETCD only stores quota definitions. Current quota usages are neither stored nor updated in ETCD.
Workers stay updated on the latest quota definitions by querying ETCD. Based on those definitions, each worker tracks its current cache usage under every directory that has a quota definition.
The coordinator is responsible for periodically polling the latest quota usages from the workers, and forming a cluster-wide usage view by aggregating the usage from each worker. For a certain quota definition, if the total usage in all workers exceeds the defined limit, the coordinator will detect that. Then the coordinator will send cache-control commands to all workers to handle the violated quotas. More details about cache-control commands can be found here.
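The aggregation step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Alluxio's implementation: the function names, the worker IDs, and the shape of the per-worker report (a mapping from quota path to bytes used) are all assumptions made for the example.

```python
from collections import defaultdict

GB = 1024 ** 3

def aggregate_usage(worker_reports):
    """Sum the per-directory usage (in bytes) reported by every worker.

    worker_reports is a hypothetical structure: {worker_id: {quota_path: bytes}}.
    """
    totals = defaultdict(int)
    for report in worker_reports.values():
        for quota_path, used in report.items():
            totals[quota_path] += used
    return dict(totals)

def find_violations(totals, quota_limits):
    """Return the quota paths whose cluster-wide usage exceeds the defined limit."""
    return sorted(
        path for path, limit in quota_limits.items()
        if totals.get(path, 0) > limit
    )

# Hypothetical reports from two workers and two quota definitions.
reports = {
    "worker-a": {"/s3": 12 * GB},
    "worker-b": {"/s3": 4 * GB, "/local": 1 * GB},
}
limits = {"/s3": 10 * GB, "/local": 5 * GB}

totals = aggregate_usage(reports)
violated = find_violations(totals, limits)
print(violated)  # ['/s3']
```

Note that no single worker needs to know the cluster-wide total; only the aggregated view can tell whether a quota is violated.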
Please make sure the coordinator is running when you enable the directory-based cluster quota feature. If the coordinator is not running, commands that add, remove, update, or list quotas will fail. The workers will keep operating, but the quota definitions will not be enforced, so cache usage under a quota may grow past its limit without bound.
To enable the directory-based cluster quota feature, set the following properties in conf/alluxio-site.properties:
You can add a quota definition on a directory in Alluxio by specifying the directory path and the quota size. Note that the path is an Alluxio path without a scheme, not a UFS path (like s3://bucket).
When you add a quota definition, please make sure the directory is resolvable in Alluxio. In other words, the target directory must be under an existing Alluxio mount point and it must exist in the UFS. In the example below, there are two existing mount points in the Alluxio namespace, and you may only add quota definitions on directories under /s3 or /local.
You can set a quota definition on any directory under existing mount points. For example:
Nested quota definitions are not supported yet. You cannot add a quota definition if that will result in nested quota definitions. For example:
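As an illustration of the nesting rule only (this is neither Alluxio's implementation nor its CLI output), the check can be sketched in Python. The function names are hypothetical, and treating an identical path as a conflict is an assumption of this sketch. Comparing path components rather than raw string prefixes avoids mistaking /s3/ab for a child of /s3/a.

```python
from pathlib import PurePosixPath

def conflicts(path_a, path_b):
    """True if the two directories are identical or one contains the other."""
    a, b = PurePosixPath(path_a), PurePosixPath(path_b)
    return a == b or a in b.parents or b in a.parents

def can_add_quota(new_path, existing_quota_paths):
    """A new definition is acceptable only if it nests with no existing one."""
    return not any(conflicts(new_path, p) for p in existing_quota_paths)

existing = ["/s3/a"]
print(can_add_quota("/s3/a/b", existing))  # False: would sit inside /s3/a
print(can_add_quota("/s3", existing))      # False: would contain /s3/a
print(can_add_quota("/s3/b", existing))    # True: a sibling directory
```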
You can remove a quota definition on a directory in Alluxio by specifying the directory path.
Updating a quota definition takes the same arguments as adding a quota definition.
However, if you update the quota size to a value smaller than the current usage, the update will fail. In this case, we recommend freeing up some space before reducing the quota size.
Alternatively, you may use the --force option to force the update, but that is not the recommended approach. If you force-update the quota capacity to be less than the current usage, the coordinator will detect that the quota is violated. It will then trigger eviction on all workers to evict some cache under that quota. Depending on the configuration, it may also ask the workers to either stop caching or reject I/O requests that try to create new cache under that directory.
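The update rule can be summarized with a small decision function. This is a sketch under stated assumptions: the function name and the returned labels are made up for illustration, and the force parameter merely models the --force option described above.

```python
def validate_quota_update(new_limit, current_usage, force=False):
    """Decide the outcome of a quota size update (hypothetical helper).

    Returns one of three illustrative labels:
      "accepted"                 new limit is not below the current usage
      "accepted_quota_violated"  forced through; eviction is expected to follow
      "rejected"                 new limit is below current usage and not forced
    """
    if new_limit >= current_usage:
        return "accepted"
    if force:
        return "accepted_quota_violated"
    return "rejected"

# Shrinking a quota to 8 units while 10 units are in use:
print(validate_quota_update(8, 10))              # rejected
print(validate_quota_update(8, 10, force=True))  # accepted_quota_violated
```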
You can list all existing quota definitions and their usages in Alluxio by running the following command. Note that if you have just added a new quota definition, workers may need some time to scan their existing cache and calculate the current usage under this directory. In the meantime, the usage will be marked as Calculating.
Alluxio also provides a command to continuously poll the latest quota usages from the coordinator:
When the current quota usage exceeds the defined limit, the coordinator will command all workers in the cluster to evict some cache under that quota, so that the aggregated usage falls below the limit again. For example, suppose there is one existing quota definition of 10GB capacity on Alluxio path /s3/ and 2 workers in the cluster:
Worker A currently holds 12GB cache under /s3/.
Worker B currently holds 4GB cache under /s3/.
The coordinator will aggregate the total usage from both workers, and observe that the current total is 16GB, which exceeds the limit of 10GB. The coordinator will send a command to both worker A and B, asking them to each evict (16 - 10) / 16 = 0.375 of their current cache under /s3/. If the eviction is successful, after a short time, the cache usage of the two workers will become:
Worker A now holds 7.5GB cache under /s3/.
Worker B now holds 2.5GB cache under /s3/.
As can be observed, when the quota limit is exceeded, the coordinator will ask each worker to evict the same percentage of cache. This is the only eviction strategy supported.
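The arithmetic of the example above can be reproduced with a short Python sketch. The function names are illustrative, not part of Alluxio; the sketch only models the proportional rule (every worker evicts the same fraction) and assumes every eviction succeeds in full.

```python
def eviction_fraction(total_usage, limit):
    """Fraction of its cache each worker is asked to evict; 0 if within the limit."""
    if total_usage <= limit:
        return 0.0
    return (total_usage - limit) / total_usage

def usage_after_eviction(per_worker_usage, limit):
    """Apply the same eviction fraction to every worker's usage."""
    fraction = eviction_fraction(sum(per_worker_usage.values()), limit)
    return {w: used * (1 - fraction) for w, used in per_worker_usage.items()}

# Usage in GB, matching the example: 10GB quota, workers at 12GB and 4GB.
before = {"A": 12.0, "B": 4.0}
after = usage_after_eviction(before, 10.0)
print(eviction_fraction(16.0, 10.0))  # 0.375
print(after)                          # {'A': 7.5, 'B': 2.5}
```

Because the fraction is the same everywhere, a worker holding more cache under the quota evicts more in absolute terms (4.5GB on worker A versus 1.5GB on worker B), and the totals add up to exactly the 10GB limit.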
On top of eviction, alluxio.quota.limit.exceeded.action controls the behavior when a quota is violated. There are 3 supported modes:
NO_CACHE (default): When a quota is violated, all workers will avoid adding new cache under that path. If a read request is a cache-miss, the worker will serve it by reading the UFS but not caching. This mode stops new cache from being added to the cluster, while allowing requests to complete without errors.
REJECT: When a quota is violated, all workers will reject new cache requests under that path. If a request wants to create new cache under that path, the worker returns an exception. The behavior of Alluxio cluster cache becomes more similar to a disk, where you cannot write into a full disk.
NOOP: When a quota is violated, still allow new cache to be created under that path. Under this mode, new cache will keep being added to the workers, whereas the coordinator keeps commanding workers to evict cache to restore the quota limit. Note that if the incoming rate of cache is faster than eviction, the usage may never be restored to below the limit set by the quota definition.
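To make the difference between the three modes concrete, here is a minimal Python sketch of how a worker might handle a cache-miss read under each mode. Only the mode names come from the text above; the enum, the exception class, the function, and the returned labels are hypothetical and do not reflect Alluxio's internal code.

```python
from enum import Enum

class ExceededAction(Enum):
    NO_CACHE = "NO_CACHE"
    REJECT = "REJECT"
    NOOP = "NOOP"

class QuotaExceededError(Exception):
    """Hypothetical error returned to the client in REJECT mode."""

def handle_cache_miss(action, quota_violated):
    """Return an illustrative label for how a cache-miss read is served."""
    if not quota_violated or action is ExceededAction.NOOP:
        # Normal path: read from the UFS and cache the data.
        return "read_ufs_and_cache"
    if action is ExceededAction.NO_CACHE:
        # Serve the request, but do not add new cache under the quota.
        return "read_ufs_without_caching"
    # REJECT: creating new cache under a violated quota is an error.
    raise QuotaExceededError("quota exceeded; new cache rejected")

print(handle_cache_miss(ExceededAction.NO_CACHE, quota_violated=True))
# read_ufs_without_caching
```

In this sketch, all three modes behave identically while the quota is within its limit; they only diverge once a violation has been detected.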
alluxio.quota.worker.heartbeat.interval.ms controls how often the coordinator aggregates the current quota usage from the workers in the cluster. Because the coordinator polls the quota usages from workers periodically, the quota usage it observes (also reflected in the bin/alluxio quota list command) will lag behind the actual usage.
The Load command will load a directory or a large list of files specified in an index file into the cache. It is possible for a load operation to exceed a quota and cause undesired cache eviction. To avoid this scenario, a check can be enabled to reject the job if the load operation would exceed a quota.
This check will calculate and compare 3 pieces of information:
The total size of the files in the UFS
How much of those files is already cached in Alluxio workers
The current availability of the quota definition
For example, if the index file specifies 100 files with a total size of 100GB, 50GB of which is already cached in the workers, and the quota has only 10GB of availability left, Alluxio should reject this load job: the remaining 50GB to be loaded exceeds the available 10GB by 40GB. The admin should manually free up at least 40GB from that quota before trying to submit the load job again.
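The comparison of those three quantities can be written out as a short Python sketch. The function name and return shape are assumptions for illustration; the sketch only reproduces the arithmetic of the example, not Alluxio's actual check.

```python
GB = 1024 ** 3

def check_load_against_quota(total_ufs_bytes, already_cached_bytes, quota_available_bytes):
    """Return (allowed, shortfall_bytes) for a load job (hypothetical helper).

    The job needs room only for the data that is not cached yet.
    """
    needed = max(total_ufs_bytes - already_cached_bytes, 0)
    shortfall = max(needed - quota_available_bytes, 0)
    return shortfall == 0, shortfall

# 100GB of files, 50GB already cached, 10GB left in the quota.
allowed, shortfall = check_load_against_quota(100 * GB, 50 * GB, 10 * GB)
print(allowed, shortfall // GB)  # False 40
```

The shortfall is the minimum amount that has to be freed from the quota before the same job would pass the check.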
By default, the quota check step in a Load job is disabled and none of the following constraints will be enforced. To enable the quota check step, set the following property:
Adding the --skip-quota-check flag on the command line will skip the quota check even when it is enabled by the configuration property.
The quota check must be enabled for any of the following constraints to be enforced. If it is disabled by configuration or command line flag, no checks will occur.
The quota check will reject a load job if it involves files from multiple quota definitions. This applies to both loading a directory or loading a list of files specified in an index file. This constraint will always be enforced if quota check is enabled.
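One way to picture this constraint is to map each file in the load job to the quota definition that covers it and count the distinct definitions. The Python sketch below is illustrative only: the function names are hypothetical, and it relies on the fact stated earlier that quota definitions cannot be nested, so at most one definition covers any given path.

```python
from pathlib import PurePosixPath

def governing_quota(file_path, quota_dirs):
    """Return the quota directory that contains file_path, or None if uncovered."""
    path = PurePosixPath(file_path)
    for quota_dir in quota_dirs:
        q = PurePosixPath(quota_dir)
        if q == path or q in path.parents:
            return quota_dir
    return None

def spans_multiple_quotas(file_paths, quota_dirs):
    """True if the files fall under more than one quota definition."""
    touched = {governing_quota(f, quota_dirs) for f in file_paths} - {None}
    return len(touched) > 1

quotas = ["/s3/team-a", "/s3/team-b"]
index = ["/s3/team-a/f1", "/s3/team-b/f2"]
print(spans_multiple_quotas(index, quotas))              # True: job would be rejected
print(spans_multiple_quotas(["/s3/team-a/f1"], quotas))  # False
```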
The quota check can be configured to reject a load job if it tries to load a path that is not under a quota definition. This constraint can be used to ensure all load jobs are under the quota control. It can optionally be turned off from the set of quota checks by setting:
Since the quota is only checked at the start of a load job, it is possible to exceed the quota if running concurrent load jobs under the same quota. To avoid this situation, the quota check can be configured to only allow one load job at a time to run within a path set with a quota. To opt into enforcing this restriction, set the property: