Writing Temporary Files
Last updated
This feature is experimental.
In certain scenarios, the performance and bandwidth of the underlying file system (UFS) may not meet the needs of large-scale data writes. To address this, Alluxio offers an option to write data only to the Alluxio cluster. Because the write path does not interact with the UFS, write performance and bandwidth depend entirely on the performance and bandwidth of the Alluxio cluster. This feature is called CACHE_ONLY.
The recommended use cases for CACHE_ONLY include:
Temporarily saving checkpoint files during AI training
Shuffle files generated during big data computations
In these use cases, the files written are temporary in nature and are not meant to be persisted in storage for long-term use.
To use the CACHE_ONLY feature, the CACHE_ONLY storage component must be deployed separately. Note that the Alluxio client interfaces directly with the CACHE_ONLY storage and does not communicate with the Alluxio worker. The data and metadata in CACHE_ONLY storage are managed independently by the CACHE_ONLY storage itself. Because these files are managed separately, files in CACHE_ONLY storage cannot interact with the other files served by the Alluxio workers.
The deployment of CACHE_ONLY storage is integrated into the Alluxio operator. Enable it by populating the cacheOnly field in the Alluxio deployment file.
Note: The CACHE_ONLY Worker requires local disk storage for CACHE_ONLY data. This disk space is completely independent of the Alluxio Worker cache, so estimate the required capacity and reserve disk space accordingly.
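As a rough aid for the capacity estimate mentioned above, the sketch below multiplies the size of a typical temporary file by the number of files kept at once and by the replica count, then adds headroom. The function name, its parameters, and the headroom default are illustrative assumptions, not part of Alluxio; adjust them to match the actual workload.

```python
def estimate_capacity_gib(file_size_gib, files_retained, replicas=1, headroom_percent=20):
    """Return a rough total disk capacity (GiB) to reserve for CACHE_ONLY data.

    Assumptions (hypothetical, not an Alluxio formula):
    - every retained file is about file_size_gib in size,
    - each replica stores a full copy of the file,
    - headroom_percent extra space is kept free, since files are only
      removed when someone deletes them.
    """
    if file_size_gib < 0 or files_retained < 0 or headroom_percent < 0:
        raise ValueError("sizes, counts, and headroom must be non-negative")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")

    raw_gib = file_size_gib * files_retained * replicas
    # Integer ceiling division avoids floating-point rounding surprises.
    scaled = raw_gib * (100 + headroom_percent)
    return -(-scaled // 100)
```

For example, three retained 50 GiB checkpoints written with two replicas and 20% headroom come to 360 GiB in total, which would then be divided across the CACHE_ONLY workers.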
Configure cacheOnly.master.resources and cacheOnly.worker.resources in the same way as the coordinator and worker fields.
The recommended memory calculation is:
Once CACHE_ONLY storage is deployed, all requests to its mount point will be treated as CACHE_ONLY requests. You can access CACHE_ONLY data in various ways.
Access using the Alluxio CLI:
Access using the Alluxio FUSE interface:
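Because a FUSE mount exposes files through the regular file system interface, an application can write and read CACHE_ONLY data with ordinary file I/O. The following is a minimal sketch of that idea; the helper names and the example path /mnt/alluxio/cache-only are hypothetical and depend on where Alluxio FUSE is mounted and on the configured CACHE_ONLY mount point.

```python
import os


def write_temp_file(directory, name, payload):
    """Write payload sequentially to directory/name and return the full path.

    directory is assumed to be located under the CACHE_ONLY mount point as
    seen through Alluxio FUSE (hypothetical example: /mnt/alluxio/cache-only/job-1).
    """
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(payload)
    return path


def read_temp_file(path):
    """Read the whole file back as bytes."""
    with open(path, "rb") as f:
        return f.read()
```

A training job could, for instance, call write_temp_file("/mnt/alluxio/cache-only/job-1", "step-100.ckpt", data) after each checkpoint step, assuming that path lies inside the CACHE_ONLY mount point.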
CACHE_ONLY supports multi-replica writes. Enable this feature by adding the alluxio.gemini.user.file.replication configuration in the deployment file:
Alluxio supports temporarily storing data in memory and uploading it to the CACHE_ONLY cluster in the background using multipart uploads to improve write performance. To enable this feature, add the following configurations:
| Configuration | Default | Description |
| --- | --- | --- |
| alluxio.gemini.user.file.cache.only.multipart.upload.enabled | false | Enables the multipart upload feature |
| alluxio.gemini.user.file.cache.only.multipart.upload.threads | 16 | Maximum number of threads for multipart upload |
| alluxio.gemini.user.file.cache.only.multipart.upload.buffer.number | 16 | Number of memory buffers for multipart upload |
Note: Enabling multipart upload will significantly increase the memory usage of the Alluxio client. The memory usage is calculated as follows:
Files stored as CACHE_ONLY are not evicted automatically. When capacity is nearly full, delete files manually to free up space, either through Alluxio FUSE with rm ${file_path} or with the Alluxio CLI command bin/alluxio fs rm ${file_path}
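Since nothing is evicted automatically, a periodic cleanup job is one way to keep temporary files from filling the CACHE_ONLY storage. The sketch below removes files older than a given age through the FUSE mount using standard file operations; the function name and the age-based policy are assumptions made for illustration, not an Alluxio feature.

```python
import os
import time


def remove_older_than(root, max_age_seconds, now=None):
    """Delete files under root whose modification time is older than max_age_seconds.

    root is assumed to be a directory inside the CACHE_ONLY mount point as
    seen through Alluxio FUSE. Returns the sorted list of removed paths.
    """
    now = time.time() if now is None else now
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if now - os.path.getmtime(path) > max_age_seconds:
                os.remove(path)  # equivalent to running rm on the FUSE path
                removed.append(path)
    return sorted(removed)
```

Calling remove_older_than(mount_dir, 24 * 3600) from a scheduled job would drop files older than one day; whether an age-based policy is appropriate depends on how long the checkpoints or shuffle files are still needed.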