Data Processing
This document describes how to accelerate data processing using Alluxio.
In traditional data processing pipelines, engines such as Spark, Flink, or MapReduce typically read data from storage systems and write results back to distributed storage (such as S3 or HDFS) for persistence or downstream consumption. However, performance bottlenecks of these distributed storage systems, such as limited throughput or high latency, often lead to long processing times and decreased efficiency across the pipeline.
To address this challenge, Alluxio provides a high-performance, easily integrable data acceleration solution. This solution seamlessly integrates with existing compute engines and user applications without requiring any code modifications, significantly improving I/O performance and accelerating the entire data processing workflow.
Architecture Overview
The overall workflow of this acceleration solution is illustrated below:

Original data is stored in HDFS or S3. Alluxio acts as a caching proxy, allowing compute engines to read data directly from the cache and avoid repeated access to the underlying storage.
Intermediate or temporary files generated during processing are written directly to Alluxio's CACHE_ONLY layer to speed up writes and improve the efficiency of handling intermediate data.
Temporary files stored in Alluxio can be read directly by downstream jobs, enhancing read performance and speeding up pipeline handoffs.
Final result files are also written to Alluxio first. With asynchronous persistence enabled, these files are uploaded in the background to persistent storage systems like S3 or HDFS, ensuring durability.
Throughout this pipeline, Alluxio handles all data read/write I/O operations, significantly reducing the load on underlying storage systems (UFS) and enabling end-to-end acceleration.
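The end-to-end flow can be summarized as follows, using the example table from the sections below (an illustrative sketch, not output from any tool):
Input read:          s3a://test_bucket/.../employee_orc_s3   ->  served from the Alluxio cache (read-through)
Intermediate write:  _temporary / .spark-staging files       ->  written to the CACHE_ONLY layer only, never to S3
Final write:         table result files                      ->  CACHE_ONLY first, then asynchronously uploaded to S3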
Required Features
This solution depends on several core Alluxio features. It is recommended to become familiar with them before deployment:
CACHE_ONLY: Restricts file I/O to Alluxio only, avoiding UFS interaction and enabling high-performance caching.
Asynchronous Persistence: Enables asynchronous upload of final result files to persistent storage systems like S3 or HDFS.
Client Path Mapping: Transparently maps original file paths to Alluxio paths, allowing seamless integration without modifying application code.
Example: Accelerating a Hive Table
Assume we have the following Hive table:
CREATE TABLE `employee_orc_s3`(
`name` string,
`salary` int,
`deptno` int,
`doj` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3a://test_bucket/user/hive/warehouse/employee_orc_s3'
Our goal is to achieve the following:
Access and operate on the Hive table via Alluxio without modifying its LOCATION;
Ensure intermediate files reside only in Alluxio and are not persisted to S3;
Asynchronously persist final output files from Alluxio to S3 for durability.
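Because the LOCATION is never changed, you can confirm it at any time with standard HiveQL, for example (the exact output layout depends on your Hive/Spark version):
DESCRIBE FORMATTED employee_orc_s3;
-- the Location field should still show s3a://test_bucket/user/hive/warehouse/employee_orc_s3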
Alluxio Configuration
# Client path mapping
alluxio.user.virtual.path.mapping.enabled=true
alluxio.user.virtual.path.mapping.ufs.mapping.enabled=true
alluxio.user.virtual.path.mapping.rule.file.path=/opt/alluxio/conf/path_mapping.json
# Async upload configuration
alluxio.underfs.gemini.ufs.fallback.enabled=true
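The alluxio.user.* properties above are client-side settings. If you prefer not to place them in the client's alluxio-site.properties, one common alternative is to pass them to Spark as JVM system properties, sketched below (an assumption about your deployment, not a requirement of this solution; path_mapping.json must then be present at the same path on every driver and executor node):
--conf spark.driver.extraJavaOptions="-Dalluxio.user.virtual.path.mapping.enabled=true -Dalluxio.user.virtual.path.mapping.ufs.mapping.enabled=true -Dalluxio.user.virtual.path.mapping.rule.file.path=/opt/alluxio/conf/path_mapping.json" \
--conf spark.executor.extraJavaOptions="-Dalluxio.user.virtual.path.mapping.enabled=true -Dalluxio.user.virtual.path.mapping.ufs.mapping.enabled=true -Dalluxio.user.virtual.path.mapping.rule.file.path=/opt/alluxio/conf/path_mapping.json"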
Path mapping configuration, path_mapping.json:
{
"rules": [
{
"src": "^s3a://test_bucket/user/hive/warehouse/employee_orc_s3(.*)",
"dst": "gemini://${GEMINI_MASTER_ADDRESS}/user/hive/warehouse/employee_orc_s3{{ var1 }}"
}
]
}
This configuration indicates that all paths prefixed with s3a://test_bucket/user/hive/warehouse/employee_orc_s3 will be redirected to Alluxio's CACHE_ONLY path gemini://${GEMINI_MASTER_ADDRESS}/user/hive/warehouse/employee_orc_s3.
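For example, a data file under the table directory would be resolved as follows; the suffix captured by (.*) in the rule is substituted for {{ var1 }} (the part-00000 file name is illustrative):
Path requested by the application:   s3a://test_bucket/user/hive/warehouse/employee_orc_s3/part-00000
Path opened by the Alluxio client:   gemini://${GEMINI_MASTER_ADDRESS}/user/hive/warehouse/employee_orc_s3/part-00000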
CACHE_ONLY Async Persistence Configuration
Assume the following mount points:
/s3 ----------> s3a://test_bucket/
/cache_only ----------> gemini://${GEMINI_MASTER_ADDRESS}/
Enable async persistence on the CACHE_ONLY master node:
alluxio.gemini.master.async.upload.local.file.path=/opt/alluxio/conf/async_upload.json
Async upload configuration, async_upload.json:
{
"cacheOnlyMountPoint": "/cache_only",
"asyncUploadPathMapping": {
"/cache_only/user/hive/warehouse/employee_orc_s3": "/s3/user/hive/warehouse/employee_orc_s3"
},
"blackList": [
"_temporary",
".spark-staging"
]
}
This configuration indicates that files written to gemini://${GEMINI_MASTER_ADDRESS}/user/hive/warehouse/employee_orc_s3 will be asynchronously persisted to the corresponding S3 path s3://test_bucket/user/hive/warehouse/employee_orc_s3. Paths listed in the blacklist (such as _temporary and .spark-staging) are excluded from persistence, as they are intermediate files that will be automatically cleaned up after the job completes.
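For reference, these blacklist entries match the intermediate directories that Spark and the Hadoop output committer normally create under the table path, for example (illustrative names; the exact suffixes vary by engine and version):
/cache_only/user/hive/warehouse/employee_orc_s3/.spark-staging-<query-uuid>/...
/cache_only/user/hive/warehouse/employee_orc_s3/_temporary/0/_temporary/<task-attempt>/...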
Required Alluxio Jars in the Client Application
From the Alluxio release package, three jars are needed:
client/alluxio-AI-3.7-13.0.0-client.jar: Alluxio client jar, used to connect to the Alluxio cluster;
client/ufs/alluxio-underfs-gemini-shaded-AI-3.7-13.0.0.jar: Alluxio UFS jar, used to connect to the Alluxio CACHE_ONLY cluster;
client/ufs/alluxio-underfs-s3a-v2-shaded-AI-3.7-13.0.0.jar: Alluxio UFS jar, used to connect to AWS S3.
Submitting Spark SQL Jobs
Submit Spark SQL jobs as follows:
bin/spark-sql \
--master spark://localhost:7077 \
--deploy-mode client \
--conf spark.executor.memory=1g \
--conf spark.driver.memory=1g \
--conf spark.hadoop.fs.s3a.impl=alluxio.hadoop.FileSystem \
--conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 \
--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
--jars alluxio-AI-3.7-13.0.0-client.jar,alluxio-underfs-s3a-v2-shaded-AI-3.7-13.0.0.jar,alluxio-underfs-gemini-shaded-AI-3.7-13.0.0.jar
Key configurations:
spark.hadoop.fs.s3a.impl=alluxio.hadoop.FileSystem: Redirects s3a:// paths to Alluxio;
--jars: Specifies the required client and shaded UFS jars.
After starting Spark SQL, you can run:
insert into employee_orc_s3 values('jack', 15000, 222, date '2025-06-23');
select * from employee_orc_s3;
Verifying Cache and Async Persistence
After writing data, verify files in Alluxio's CACHE_ONLY layer:
bin/alluxio fs ls /cache_only/user/hive/warehouse/employee_orc_s3
After async persistence is completed, verify files in S3:
bin/alluxio fs ls /s3/user/hive/warehouse/employee_orc_s3
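You can also confirm the upload from the S3 side, for example with the AWS CLI (assuming it is installed and configured with credentials for test_bucket):
aws s3 ls s3://test_bucket/user/hive/warehouse/employee_orc_s3/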
Optional: Direct Read from S3
Once data is persisted, you can optionally access the Hive table directly from S3 without using Alluxio:
bin/spark-sql \
--master spark://localhost:7077 \
--deploy-mode client \
--conf spark.executor.memory=1g \
--conf spark.driver.memory=1g \
--conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 \
--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
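Note that this session does not override spark.hadoop.fs.s3a.impl and does not add the Alluxio jars, so s3a:// paths are handled by the standard S3A connector; having hadoop-aws and S3 credentials available in your environment is assumed here and is not covered by this guide. The same queries can then be run against the persisted data, for example:
select * from employee_orc_s3;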
Releasing CACHE_ONLY Cache Space
Once files are successfully persisted to UFS, you can release cache space in Alluxio's CACHE_ONLY layer using one of the following methods:
Remove File Using rm
This method removes both data and metadata. The file is completely deleted from CACHE_ONLY, and future access must go directly to the UFS.
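A sketch of the command, assuming the same CLI used in the free example below and that rm accepts the usual -R flag for recursive deletion:
$ALLUXIO_HOME/gemini/bin/alluxio fs rm -R $PATH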
Release Cache Using free
This method clears only the cached data but retains metadata:
$ALLUXIO_HOME/gemini/bin/alluxio fs free $PATH
The file remains accessible via the Alluxio path, but the next access will trigger a reload from UFS back into CACHE_ONLY.