# Data Processing

This document describes how to accelerate data processing using Alluxio.

In traditional data processing pipelines, engines such as Spark, Flink, or MapReduce typically read data from storage systems and write results back to distributed storage (such as S3 or HDFS) for persistence or downstream consumption. However, performance bottlenecks of these distributed storage systems, such as limited throughput or high latency, often lead to long processing times and decreased efficiency across the pipeline.

To address this challenge, Alluxio provides a high-performance, easily integrable data acceleration solution. This solution seamlessly integrates with existing compute engines and user applications without requiring any code modifications, significantly improving I/O performance and accelerating the entire data processing workflow.

## Architecture Overview

The overall workflow of this acceleration solution is illustrated below:

<figure><img src="/files/lkrSAkIlPApszR01gYNm" alt=""><figcaption></figcaption></figure>

* Original data is stored in HDFS or S3. Alluxio acts as a caching proxy, allowing compute engines to read data directly from the cache and avoid repeated access to the underlying storage.
* Intermediate or temporary files generated during processing are written directly to Alluxio's `CACHE_ONLY` layer, which speeds up writes and keeps short-lived data off the underlying storage.
* Temporary files stored in Alluxio can be read directly by downstream jobs, enhancing read performance and speeding up pipeline handoffs.
* Final result files are also written to Alluxio first. With asynchronous persistence enabled, these files are uploaded in the background to persistent storage systems like S3 or HDFS, ensuring durability.

Throughout this pipeline, Alluxio handles all data read/write I/O operations, significantly reducing the load on underlying storage systems (UFS) and enabling end-to-end acceleration.

## Required Features

This solution depends on several core Alluxio features. Familiarize yourself with them before deployment:

* CACHE\_ONLY: Restricts file I/O to Alluxio only, avoiding UFS interaction and enabling high-performance caching.
* Asynchronous Persistence: Enables asynchronous upload of final result files to persistent storage systems like S3 or HDFS.
* [Client Path Mapping](/ee-ai-en/ai-3.7/data-access/client-virtual-path-mapping.md): Transparently maps original file paths to Alluxio paths, allowing seamless integration without modifying application code.

## Example: Accelerating a Hive Table

Assume we have the following Hive table:

```sql
CREATE TABLE `employee_orc_s3`(
  `name` string,
  `salary` int,
  `deptno` int,
  `doj` date)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  's3a://test_bucket/user/hive/warehouse/employee_orc_s3'
```

Our goal is to achieve the following:

1. Access and operate on the Hive table via Alluxio without modifying its LOCATION;
2. Ensure intermediate files reside only in Alluxio and are not persisted to S3;
3. Asynchronously persist final output files from Alluxio to S3 for durability.

### Alluxio Configuration

```properties
# Client path mapping
alluxio.user.virtual.path.mapping.enabled=true
alluxio.user.virtual.path.mapping.ufs.mapping.enabled=true
alluxio.user.virtual.path.mapping.rule.file.path=/opt/alluxio/conf/path_mapping.json

# Async upload configuration
alluxio.underfs.gemini.ufs.fallback.enabled=true
```

Path mapping configuration, `path_mapping.json`:

```json
{
  "rules": [
    {
      "src": "^s3a://test_bucket/user/hive/warehouse/employee_orc_s3(.*)",
      "dst": "gemini://${GEMINI_MASTER_ADDRESS}/user/hive/warehouse/employee_orc_s3{{ var1 }}"
    }
  ]
}
```

This configuration indicates that all paths prefixed with `s3a://test_bucket/user/hive/warehouse/employee_orc_s3` will be redirected to Alluxio’s CACHE\_ONLY path `gemini://${GEMINI_MASTER_ADDRESS}/user/hive/warehouse/employee_orc_s3`.
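To see what the rule does to a concrete path, the rewrite can be approximated with an ordinary regex substitution. The sketch below is an illustration under stated assumptions, not Alluxio's implementation: it assumes `{{ var1 }}` stands for the first capture group of `src`, and it uses a hypothetical default, `gemini-master.example`, when `GEMINI_MASTER_ADDRESS` is not set.

```bash
#!/usr/bin/env bash
# Illustration only: mimics the path_mapping.json rule with sed.
# Assumption: "{{ var1 }}" is replaced by the first capture group of "src".
GEMINI_MASTER_ADDRESS="${GEMINI_MASTER_ADDRESS:-gemini-master.example}"  # hypothetical address

map_path() {
  local src='^s3a://test_bucket/user/hive/warehouse/employee_orc_s3(.*)'
  local dst="gemini://${GEMINI_MASTER_ADDRESS}/user/hive/warehouse/employee_orc_s3"
  # Paths that do not match the rule are printed unchanged.
  printf '%s\n' "$1" | sed -E "s#${src}#${dst}\\1#"
}

map_path 's3a://test_bucket/user/hive/warehouse/employee_orc_s3/part-00000.orc'
map_path 's3a://test_bucket/other/table/part-00000.orc'
```

Under these assumptions, the first call prints the `gemini://` form of the path, while the second path falls outside the rule and is printed as-is.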

### CACHE\_ONLY Async Persistence Configuration

Assume the following mount points:

```
/s3          ----------> s3a://test_bucket/
/cache_only  ----------> gemini://${GEMINI_MASTER_ADDRESS}/
```

Enable async persistence on the CACHE\_ONLY master node:

```properties
alluxio.gemini.master.async.upload.local.file.path=/opt/alluxio/conf/async_upload.json
```

Async upload configuration, `async_upload.json`:

```json
{
  "cacheOnlyMountPoint": "/cache_only",
  "asyncUploadPathMapping": {
    "/cache_only/user/hive/warehouse/employee_orc_s3": "/s3/user/hive/warehouse/employee_orc_s3"
  },
  "blackList": [
    "_temporary",
    ".spark-staging"
  ]
}
```

This configuration indicates that files written to `gemini://${GEMINI_MASTER_ADDRESS}/user/hive/warehouse/employee_orc_s3` will be asynchronously persisted to the corresponding S3 path `s3a://test_bucket/user/hive/warehouse/employee_orc_s3`. Paths listed in the blacklist (such as `_temporary` and `.spark-staging`) are excluded from persistence, as they are intermediate files that will be automatically cleaned up after the job completes.
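The effect of the mapping and the blacklist can be pictured with a small shell function. This is a sketch under assumptions, not the actual matching logic: it assumes the mapping is a plain prefix swap and that a file is skipped whenever its path contains a `blackList` entry as a substring. The function name `upload_target` and the sample file names are hypothetical.

```bash
#!/usr/bin/env bash
# Illustration only: approximates how async_upload.json could pick an upload target.
# Assumptions: prefix-swap mapping; substring match against blackList entries.
upload_target() {
  local path="$1"
  local src='/cache_only/user/hive/warehouse/employee_orc_s3'
  local dst='/s3/user/hive/warehouse/employee_orc_s3'
  local entry
  for entry in '_temporary' '.spark-staging'; do
    case "$path" in
      *"${entry}"*) return 1 ;;   # blacklisted: stays in CACHE_ONLY only
    esac
  done
  case "$path" in
    "$src"|"$src"/*) printf '%s\n' "${dst}${path#"$src"}" ;;
    *) return 1 ;;                # outside the mapped directory
  esac
}

upload_target /cache_only/user/hive/warehouse/employee_orc_s3/part-00000.orc
upload_target /cache_only/user/hive/warehouse/employee_orc_s3/_temporary/0/part-00000.orc \
  || echo "skipped (not persisted)"
```

With these assumptions, the first file maps to the same relative location under `/s3`, and the file under `_temporary` is skipped.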

### Required Alluxio Jars in the Client Application

The client application needs three jars from the Alluxio release package:

* `client/alluxio-AI-3.7-13.0.0-client.jar`: Alluxio client jar, used to connect to the Alluxio cluster.
* `client/ufs/alluxio-underfs-gemini-shaded-AI-3.7-13.0.0.jar`: Alluxio UFS jar, used to connect to the Alluxio CACHE\_ONLY cluster.
* `client/ufs/alluxio-underfs-s3a-v2-shaded-AI-3.7-13.0.0.jar`: Alluxio UFS jar, used to connect to AWS S3.
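When scripting job submission, it can be convenient to assemble these three paths into the comma-separated form that Spark's `--jars` option expects. The helper below is a minimal sketch: the function name `alluxio_jars` is hypothetical, and it assumes its argument is the directory where the release package was unpacked.

```bash
#!/usr/bin/env bash
# Illustration only: build the comma-separated jar list for a Spark --jars option.
# Assumption: $1 is the directory where the Alluxio release package was unpacked.
alluxio_jars() {
  local dir="${1%/}" version="AI-3.7-13.0.0"
  local jars=(
    "${dir}/client/alluxio-${version}-client.jar"
    "${dir}/client/ufs/alluxio-underfs-gemini-shaded-${version}.jar"
    "${dir}/client/ufs/alluxio-underfs-s3a-v2-shaded-${version}.jar"
  )
  local IFS=,                      # join array elements with commas
  printf '%s\n' "${jars[*]}"
}

alluxio_jars /opt/alluxio
```

The result can be passed as `--jars "$(alluxio_jars /opt/alluxio)"`; `/opt/alluxio` is only an example location.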

### Submitting Spark SQL Jobs

Submit Spark SQL jobs as follows:

```bash
bin/spark-sql \
  --master spark://localhost:7077 \
  --deploy-mode client \
  --conf spark.executor.memory=1g \
  --conf spark.driver.memory=1g \
  --conf spark.hadoop.fs.s3a.impl=alluxio.hadoop.FileSystem \
  --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 \
  --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
  --jars alluxio-AI-3.7-13.0.0-client.jar,alluxio-underfs-s3a-v2-shaded-AI-3.7-13.0.0.jar,alluxio-underfs-gemini-shaded-AI-3.7-13.0.0.jar
```

Key configurations:

* `spark.hadoop.fs.s3a.impl=alluxio.hadoop.FileSystem`: Redirects `s3a://` paths to Alluxio;
* `--jars`: Specifies required client and shaded UFS jars.

After starting Spark SQL, you can run:

```sql
insert into employee_orc_s3 values('jack', 15000, 222, date '2025-06-23');
select * from employee_orc_s3;
```

### Verifying Cache and Async Persistence

After writing data, verify files in Alluxio's CACHE\_ONLY layer:

```bash
bin/alluxio fs ls /cache_only/user/hive/warehouse/employee_orc_s3
```

After async persistence is completed, verify files in S3:

```bash
bin/alluxio fs ls /s3/user/hive/warehouse/employee_orc_s3
```
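Because persistence happens in the background, a script that depends on the S3 copy has to wait for it to appear. One way to do that is a generic polling helper, sketched below. The helper name `wait_until` is hypothetical, and the commented example assumes that `bin/alluxio fs ls` exits with a non-zero status while the file is still missing; the file name in it is also hypothetical.

```bash
#!/usr/bin/env bash
# Illustration only: retry a check until it succeeds or a timeout expires.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command...>
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1    # gave up: the check never succeeded
    fi
    sleep "$interval"
  done
}

# Example (requires a running cluster):
# wait_until 300 10 bin/alluxio fs ls /s3/user/hive/warehouse/employee_orc_s3/part-00000.orc
```

Polling for a specific file rather than the directory avoids a false positive when the directory already exists but the new file has not been uploaded yet.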

### Optional: Direct Read from S3

Once data is persisted, you can optionally access the Hive table directly from S3 without using Alluxio:

```bash
bin/spark-sql \
  --master spark://localhost:7077 \
  --deploy-mode client \
  --conf spark.executor.memory=1g \
  --conf spark.driver.memory=1g \
  --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 \
  --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
```

## Releasing CACHE\_ONLY Cache Space

Once files are successfully persisted to UFS, you can release cache space in Alluxio's CACHE\_ONLY layer using one of the following methods:

### Remove File Using `rm`

This method removes both data and metadata. The file is completely deleted from CACHE\_ONLY, and future access must go directly to the UFS.
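A possible invocation is sketched below. It assumes the CACHE\_ONLY CLI lives under `$ALLUXIO_HOME/gemini/bin`, as in the `free` example, and that it accepts `fs rm`; verify both against your installation. The wrapper name `remove_from_cache`, the `DRY_RUN` switch, and the `/opt/alluxio` default are hypothetical conveniences for previewing the command before running it.

```bash
#!/usr/bin/env bash
# Illustration only: delete a file from CACHE_ONLY (data and metadata).
# Assumption: the CACHE_ONLY CLI is at $ALLUXIO_HOME/gemini/bin/alluxio and supports "fs rm".
remove_from_cache() {
  local target="$1"
  local cli="${ALLUXIO_HOME:-/opt/alluxio}/gemini/bin/alluxio"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cli fs rm $target"      # preview only, nothing is deleted
  else
    "$cli" fs rm "$target"
  fi
}

# Preview the command for a hypothetical file:
DRY_RUN=1 remove_from_cache /user/hive/warehouse/employee_orc_s3/part-00000.orc
```

Confirm that the file has been persisted to the UFS before removing it, since `rm` leaves no copy in CACHE\_ONLY.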

### Release Cache Using `free`

This method clears only the cached data but retains metadata:

```bash
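# FILE_PATH is a placeholder for the file or directory to free (do not reuse the shell's own $PATH)
$ALLUXIO_HOME/gemini/bin/alluxio fs free "$FILE_PATH"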
```

The file remains accessible via the Alluxio path, but the next access will trigger a reload from UFS back into CACHE\_ONLY.

