Writing Temporary Files

In certain scenarios, the performance and bandwidth of the underlying file system (UFS) may not meet the needs of large-scale data writes. To address this, Alluxio offers an option to write data to the Alluxio cluster only. Because the write path does not interact with the UFS, write performance and bandwidth depend entirely on the Alluxio cluster. This feature is called CACHE_ONLY.

The recommended use cases for CACHE_ONLY include:

  • Saving temporary checkpoint files during AI training

  • Storing shuffle files generated during big data computations

In these use cases, the files written are temporary and are not meant to be persisted in storage for long-term use.

Enabling CACHE_ONLY

To use the CACHE_ONLY feature, the CACHE_ONLY storage component must be deployed separately. Note that the Alluxio client interfaces directly with the CACHE_ONLY storage and does not communicate with the Alluxio worker. CACHE_ONLY storage manages its own data and metadata independently. Because these files are managed separately, files in CACHE_ONLY storage cannot interact with the other files served by the Alluxio workers.

Deploying CACHE_ONLY storage on Kubernetes

The deployment of CACHE_ONLY storage is integrated into the Alluxio operator. Enable it by populating the cacheOnly field in the Alluxio deployment file.

apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
metadata:
  name: alluxio
spec:
  image: <PRIVATE_REGISTRY>/alluxio-enterprise
  imageTag: <TAG>
  properties:

  worker:
    count: 2

  pagestore:
    size: 100Gi

  cacheOnly:
    enabled: true
    mountPath: "/cache-only"
    image: <PRIVATE_REGISTRY>/alluxio-cacheonly
    imageTag: <TAG>
    imagePullPolicy: IfNotPresent

    # Replace with the base64-encoded license, generated by:
    # cat /path/to/license.json | base64 | tr -d "\n"
    license:

    properties:

    journal:
      storageClass: "gp2"

    worker:
      count: 2
    tieredstore:
      levels:
        - level: 0
          alias: SSD
          mediumtype: SSD
          path: /data1/cacheonly/worker
          type: hostPath
          quota: 10Gi

Note: The CACHE_ONLY Worker requires local disk storage for CACHE_ONLY data. This disk space is completely independent of the Alluxio Worker cache, so estimate the required capacity and reserve disk space accordingly.
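As a rough aid for the capacity estimate mentioned in the note, the sketch below compares the data you expect to hold against the total quota configured for the CACHE_ONLY workers. It is only a back-of-the-envelope model: the function names are hypothetical, it assumes each replica (see the multi-replica setting later on this page) is a full copy of the data, and it ignores metadata overhead and uneven placement across workers.

```python
def required_capacity_gib(data_gib: float, replicas: int = 1) -> float:
    """Disk space needed, assuming every replica is a full copy of the data."""
    return data_gib * replicas


def total_quota_gib(quota_per_worker_gib: float, worker_count: int) -> float:
    """Total CACHE_ONLY capacity: per-worker quota times the number of workers."""
    return quota_per_worker_gib * worker_count


def fits_in_cache_only(data_gib: float, replicas: int,
                       quota_per_worker_gib: float, worker_count: int) -> bool:
    """True if the estimated footprint fits within the configured quota."""
    needed = required_capacity_gib(data_gib, replicas)
    return needed <= total_quota_gib(quota_per_worker_gib, worker_count)


# Values from the example deployment: quota 10Gi per worker, 2 workers.
print(total_quota_gib(10, 2))            # total capacity in GiB
print(fits_in_cache_only(8, 2, 10, 2))   # 8 GiB of data with 2 replicas
print(fits_in_cache_only(12, 2, 10, 2))  # 12 GiB of data with 2 replicas
```

Because CACHE_ONLY files are not evicted automatically, it is safer to leave a margin above whatever this estimate suggests.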

Configuring Resource Usage

Configure cacheOnly.master.resources and cacheOnly.worker.resources in the same way as the coordinator and worker fields.

apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
metadata:
  name: alluxio
spec:
  cacheOnly:
    enabled: true
    master:
      count: 1
      resources:
        limits:
          cpu: "8"
          memory: "40Gi"
        requests:
          cpu: "8"
          memory: "40Gi"
      jvmOptions:
        - "-Xmx24g"
        - "-Xms24g"
        - "-XX:MaxDirectMemorySize=8g"
    worker:
      count: 2
      resources:
        limits:
          cpu: "8"
          memory: "20Gi"
        requests:
          cpu: "8"
          memory: "20Gi"
      jvmOptions:
        - "-Xmx8g"
        - "-Xms8g"
        - "-XX:MaxDirectMemorySize=8g"

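The recommended memory calculation sets the memory request equal to the limit and leaves about 10% headroom above the JVM heap plus direct memory: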

(${Xmx} + ${MaxDirectMemorySize}) * 1.1 <= ${requests} = ${limit}
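To check a configuration against this rule, the formula can be applied to the values in the example above. The following is a minimal sketch; the helper names are hypothetical, and it treats the JVM `g` suffix and the Kubernetes `Gi` suffix as the same unit (GiB).

```python
def min_memory_request_gib(xmx_gib: float, max_direct_gib: float,
                           headroom: float = 1.1) -> float:
    """Smallest request satisfying (Xmx + MaxDirectMemorySize) * 1.1 <= requests."""
    return (xmx_gib + max_direct_gib) * headroom


def request_is_sufficient(xmx_gib: float, max_direct_gib: float,
                          request_gib: float) -> bool:
    """True if the memory request (equal to the limit) satisfies the rule."""
    return min_memory_request_gib(xmx_gib, max_direct_gib) <= request_gib


# CACHE_ONLY master from the example: -Xmx24g, MaxDirectMemorySize=8g, 40Gi
print(request_is_sufficient(24, 8, 40))  # (24 + 8) * 1.1 = 35.2 <= 40
# CACHE_ONLY worker from the example: -Xmx8g, MaxDirectMemorySize=8g, 20Gi
print(request_is_sufficient(8, 8, 20))   # (8 + 8) * 1.1 = 17.6 <= 20
```

Both components in the example satisfy the rule, with 4.8 GiB and 2.4 GiB to spare respectively.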

Accessing CACHE_ONLY

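Once CACHE_ONLY storage is deployed, all requests to its mount point are treated as CACHE_ONLY requests. CACHE_ONLY data can be accessed in several ways. The examples below use /cache_only as the mount point; substitute the mountPath configured for your deployment.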

Access using the Alluxio CLI:

bin/alluxio fs ls /cache_only

Access using the Alluxio FUSE interface:

cd ${fuse_mount_path}/cache_only
echo '123' > test.txt
cat test.txt
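Because the FUSE interface exposes CACHE_ONLY storage as an ordinary directory, an application such as a training job can write temporary checkpoints with standard file I/O. The sketch below illustrates this under stated assumptions: the function names and file name are hypothetical, the directory is whatever ${fuse_mount_path}/cache_only resolves to in your environment, and the file is written sequentially in one pass.

```python
from pathlib import Path


def write_temp_checkpoint(cache_only_dir: str, name: str, payload: bytes) -> Path:
    """Write a temporary file under the CACHE_ONLY directory with plain file I/O."""
    target = Path(cache_only_dir) / name
    # Create intermediate directories if the name contains sub-paths.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Sequential write from start to end, like `echo '123' > test.txt`.
    with open(target, "wb") as f:
        f.write(payload)
    return target


def read_checkpoint(path: Path) -> bytes:
    """Read the file back, like `cat test.txt`."""
    return Path(path).read_bytes()
```

For example, calling write_temp_checkpoint with the directory ${fuse_mount_path}/cache_only, the name step_100.ckpt, and the serialized model bytes stores the checkpoint in CACHE_ONLY storage only, without touching the UFS.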

Advanced Configurations

Enabling Multi-Replica

CACHE_ONLY supports multi-replica writes. Enable this feature by adding the alluxio.gemini.user.file.replication configuration in the deployment file:

apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
metadata:
  name: alluxio
spec:
  properties:
    "alluxio.gemini.user.file.replication": "2"

Enabling Multipart Upload

Alluxio supports temporarily storing data in memory and uploading it to the CACHE_ONLY cluster in the background using multipart uploads to improve write performance. To enable this feature, add the following configurations:

apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
metadata:
  name: alluxio
spec:
  properties:
    "alluxio.gemini.user.file.cache.only.multipart.upload.enabled": "true"
    "alluxio.gemini.user.file.cache.only.multipart.upload.threads": "16"
    "alluxio.gemini.user.file.cache.only.multipart.upload.buffer.number": "16"

Configuration Item | Default | Description
alluxio.gemini.user.file.cache.only.multipart.upload.enabled | false | Enables the multipart upload feature
alluxio.gemini.user.file.cache.only.multipart.upload.threads | 16 | Maximum number of threads for multipart upload
alluxio.gemini.user.file.cache.only.multipart.upload.buffer.number | 16 | Number of memory buffers for multipart upload

Note: Enabling multipart upload will significantly increase the memory usage of the Alluxio client. The memory usage is calculated as follows:

${alluxio.gemini.user.file.cache.only.multipart.upload.buffer.number} * 64MB
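As a worked example of this formula, the sketch below computes the additional client memory for a given buffer count. The function name is hypothetical; the 64 MB buffer size is taken from the formula above.

```python
BUFFER_SIZE_MB = 64  # per-buffer size, from the formula above


def multipart_upload_memory_mb(buffer_number: int) -> int:
    """Client memory used by multipart upload buffers, in MB."""
    return buffer_number * BUFFER_SIZE_MB


# Default buffer.number of 16: 16 * 64 MB = 1024 MB
print(multipart_upload_memory_mb(16))
# Doubling the buffer count doubles the memory: 32 * 64 MB = 2048 MB
print(multipart_upload_memory_mb(32))
```

With the default of 16 buffers, plan for roughly 1 GB of extra memory in the Alluxio client.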

Cache Eviction

Files stored as CACHE_ONLY are not evicted automatically. When capacity is nearly full, delete files manually to free up space, either through Alluxio FUSE with rm ${file_path} or with the Alluxio CLI command bin/alluxio fs rm ${file_path}.
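Since nothing is evicted automatically, a periodic cleanup job is one way to keep temporary files from filling the storage. The following is a minimal sketch rather than an Alluxio feature: the function name is hypothetical, it deletes through the FUSE mount with ordinary file operations (the equivalent of rm ${file_path}), and it assumes the modification times reported through the mount are meaningful for your workload.

```python
import os
import time
from typing import List, Optional


def delete_older_than(directory: str, max_age_seconds: float,
                      now: Optional[float] = None) -> List[str]:
    """Delete files under `directory` whose mtime is older than the threshold.

    Returns the sorted list of deleted paths.
    """
    now = time.time() if now is None else now
    removed = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if now - os.path.getmtime(path) > max_age_seconds:
                os.remove(path)  # same effect as `rm ${file_path}` via FUSE
                removed.append(path)
    return sorted(removed)
```

For instance, passing the CACHE_ONLY directory under the FUSE mount and a max_age_seconds of 24 * 3600 removes files that have not been modified for more than a day.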
