File Writing
This feature is experimental.
In certain scenarios, the performance and bandwidth of the underlying file system (UFS) may not meet the needs of large-scale data writes. To address this issue, Alluxio offers an option to write data to the Alluxio cluster only. Since the process does not interact with the UFS, the write performance and bandwidth depend entirely on the performance and bandwidth of the Alluxio cluster. This feature is called CACHE_ONLY.
The recommended use cases for CACHE_ONLY include:
Temporarily saving checkpoint files during AI training
Shuffle files generated during big data computations
In these use cases, the files written are temporary in nature and not meant to be persisted in storage for long-term use.
Additionally, for scenarios requiring eventual persistence of CACHE_ONLY files, Alluxio supports an optional async persistence feature, which can be configured as described in the async persistence section below.
To use the CACHE_ONLY feature, the CACHE_ONLY storage component must be deployed separately. Note that the Alluxio client interfaces directly with the CACHE_ONLY storage and does not communicate with the Alluxio worker. The data and metadata in CACHE_ONLY storage are managed independently by the CACHE_ONLY storage itself. Because these files are managed separately, files in CACHE_ONLY storage cannot interact with the other files served by the Alluxio workers.
The deployment of CACHE_ONLY storage is integrated into the Alluxio operator. Enable it by populating the cacheOnly field in the Alluxio deployment file.
Note: The CACHE_ONLY Worker requires local disk storage for CACHE_ONLY data. This disk space is completely independent of the Alluxio Worker cache, so estimate the required capacity and reserve disk space accordingly.
Configure cacheOnly.master.resources and cacheOnly.worker.resources in a similar fashion to the coordinator and worker fields.
The recommended memory calculation is:
Once CACHE_ONLY storage is deployed, all requests to its mount point will be treated as CACHE_ONLY requests. You can access CACHE_ONLY data in various ways.
Access using the Alluxio CLI:
Access using the Alluxio FUSE interface:
For scenarios where temporary data written to Alluxio requires eventual persistence, Alluxio offers an async persistence mechanism. This allows data written to a CACHE_ONLY mount point to be asynchronously uploaded to a corresponding UFS path as configured.
This is especially useful in environments where immediate persistence is not necessary, but eventual consistency is desired.
Limited metadata operations: Only basic file persistence is supported; operations like renaming are not reliably handled.
No UFS cleanup: Deleting files from Alluxio does not automatically remove the corresponding data from UFS.
Weak recovery semantics: If the UFS and Alluxio versions of a file diverge, Alluxio cannot currently reconcile them.
File modifications retrigger persistence: Modifying a file in Alluxio will schedule a new async persistence task, potentially creating inconsistent versions across Alluxio and UFS.
Cache isolation: Files written via CACHE_ONLY and later persisted are not removed from CACHE_ONLY even after being written to UFS. Reading through the original CACHE_ONLY path will hit the CACHE_ONLY cache, while reading through the UFS path will use the standard Alluxio Worker pipeline; the two caches do not share data.
To enable the feature, you need to:
Enable CACHE_ONLY, which is a prerequisite for async persistence.
Set alluxio.gemini.master.async.upload.local.file.path and provide the corresponding JSON file. Note that this property must be set on both the Alluxio and the Alluxio CACHE_ONLY machines (see the instructions below).
Enable the Alluxio coordinator (async persistence relies on the job service).
Make sure the CACHE_ONLY masters are able to connect to ETCD.
If you deploy with the operator, make sure the Alluxio properties are also set in the Alluxio CACHE_ONLY properties. Some operator-generated properties must be specified manually on the CACHE_ONLY components due to a current operator limitation.
| Property | Description | Default |
| --- | --- | --- |
| alluxio.gemini.master.async.upload.local.file.path | Path to the async upload path mapping JSON file | N/A |
| alluxio.gemini.master.persistence.checker.interval | Interval to check and update async persistence status | 1s |
| alluxio.gemini.master.persistence.scheduler.interval | Interval to schedule new async persistence tasks | 1s |
The file specified by alluxio.gemini.master.async.upload.local.file.path must be in JSON format, with the following fields:
| Field | Required | Description |
| --- | --- | --- |
| cacheOnlyMountPoint | Yes | The mount point path for CACHE_ONLY storage |
| asyncUploadPathMapping | Yes | Key is the CACHE_ONLY sub-path, value is the Alluxio path to persist to (resolved by mount table) |
| blackList | Optional | Simple filename pattern exclusion list (non-regex) |
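As an illustration of the fields described above, the following Python sketch builds a mapping file and runs a basic sanity check before it is written to disk. All paths and patterns are hypothetical placeholders, and the value types (absolute paths as mapping keys, blackList as a list of strings) are assumptions rather than a documented schema; `validate_mapping` is an illustrative helper, not part of Alluxio.

```python
import json

# All values below are hypothetical placeholders, not defaults.
mapping = {
    "cacheOnlyMountPoint": "/cache-only",
    "asyncUploadPathMapping": {
        # CACHE_ONLY sub-path -> Alluxio path to persist to
        "/cache-only/checkpoints": "/s3/checkpoints",
    },
    "blackList": [".tmp"],  # assumed to be a list of simple patterns
}


def validate_mapping(config: dict) -> None:
    """Raise ValueError if a required field is missing or has the wrong type."""
    if not isinstance(config.get("cacheOnlyMountPoint"), str):
        raise ValueError("cacheOnlyMountPoint must be a string")
    paths = config.get("asyncUploadPathMapping")
    if not isinstance(paths, dict) or not all(
        isinstance(k, str) and isinstance(v, str) for k, v in paths.items()
    ):
        raise ValueError("asyncUploadPathMapping must map strings to strings")
    # blackList is optional, so a missing key is accepted.
    if not isinstance(config.get("blackList", []), list):
        raise ValueError("blackList must be a list when present")


validate_mapping(mapping)
mapping_json = json.dumps(mapping, indent=2)
print(mapping_json)
```

The printed JSON can be saved to the location referenced by alluxio.gemini.master.async.upload.local.file.path on each machine that needs it.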
Worker failure: If a CACHE_ONLY worker goes offline, Alluxio can retrieve data from other CACHE_ONLY workers that have replicas (if replication is enabled).
Master failover: Metadata required for async persistence is stored in the Alluxio master journal. When a master fails, a standby master can recover the metadata by reading the journal.
Coordinator restart: Async persistence is managed by the Alluxio Coordinator, which stores job state in a local RocksDB. The coordinator can resume ongoing jobs after a restart by reading the RocksDB state.
Worker reassignment: If the worker responsible for uploading to UFS fails, the coordinator will reschedule the task to another worker.
When a file is lost from CACHE_ONLY, Alluxio supports restoring data from UFS under the following mechanisms:
File open with missing blocks: If a file opened through a CACHE_ONLY path has incomplete blocks, the Alluxio client will attempt to fetch the missing content from UFS and cache it.
File read errors: If Alluxio encounters an error reading from CACHE_ONLY, the client will fall back to reading the file directly from UFS, without caching it back into Alluxio.
The file was previously stored in UFS via async persistence.
The modification time in UFS is newer than the one in Alluxio.
The file length in UFS matches that in Alluxio metadata.
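The three conditions above can be read as a single predicate. The sketch below is only an illustration, assuming that all three conditions must hold at the same time; `FileInfo` and `can_restore_from_ufs` are hypothetical names and do not correspond to Alluxio APIs.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    """Hypothetical metadata record; not an Alluxio API type."""
    length: int    # file length in bytes
    mtime_ms: int  # last modification time in milliseconds


def can_restore_from_ufs(persisted: bool, alluxio: FileInfo, ufs: FileInfo) -> bool:
    """Return True only if every listed condition holds (an assumption)."""
    return (
        persisted                            # stored in UFS via async persistence
        and ufs.mtime_ms > alluxio.mtime_ms  # UFS modification time is newer
        and ufs.length == alluxio.length     # lengths match
    )
```

For example, a persisted file whose UFS copy has the same length and a later modification time passes the check, while a length mismatch fails it.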
CACHE_ONLY supports multi-replica writes. Enable this feature by adding the alluxio.gemini.user.file.replication configuration in the deployment file:
Alluxio supports temporarily storing data in memory and uploading it to the CACHE_ONLY cluster in the background using multipart uploads to improve write performance. To enable this feature, add the following configurations:
| Property | Default | Description |
| --- | --- | --- |
| alluxio.gemini.user.file.cache.only.multipart.upload.enabled | false | Enables the multipart upload feature |
| alluxio.gemini.user.file.cache.only.multipart.upload.threads | 16 | Maximum number of threads for multipart upload |
| alluxio.gemini.user.file.cache.only.multipart.upload.buffer.number | 16 | Number of memory buffers for multipart upload |
Note: Enabling multipart upload will significantly increase the memory usage of the Alluxio client. The memory usage is calculated as follows:
Files stored as CACHE_ONLY are not evicted automatically. When capacity is nearly full, delete files manually to free up space, either through Alluxio FUSE with rm ${file_path} or with the Alluxio CLI command bin/alluxio fs rm ${file_path}
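Because nothing is evicted automatically, cleanup is often scripted. The following is a minimal sketch, assuming the CACHE_ONLY mount point is reachable as a local directory through Alluxio FUSE; `free_space` is a hypothetical helper, and the oldest-first policy is an illustrative choice rather than an Alluxio behavior.

```python
import os
from pathlib import Path
from typing import List


def free_space(directory: str, bytes_to_free: int) -> List[str]:
    """Delete the oldest regular files under `directory` until at least
    `bytes_to_free` bytes have been released. Returns the deleted paths."""
    files = [p for p in Path(directory).rglob("*") if p.is_file()]
    # Oldest modification time first, so the most recent files survive.
    files.sort(key=lambda p: p.stat().st_mtime)
    deleted, freed = [], 0
    for path in files:
        if freed >= bytes_to_free:
            break
        size = path.stat().st_size
        os.remove(path)  # same effect as `rm ${file_path}` on the FUSE mount
        deleted.append(str(path))
        freed += size
    return deleted
```

A call such as `free_space("/mnt/alluxio/cache-only/checkpoints", 10 * 1024**3)` would remove the oldest files until roughly 10 GiB is released; the directory shown is a placeholder for wherever the FUSE mount exposes the CACHE_ONLY path.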