Alluxio
ProductsLanguageHome
  • Introduction
  • Overview
    • Architecture
    • Job Service
    • Quick Start Guide
    • FAQ
    • Use Cases
  • Core Services
    • Caching
    • Unified Namespace
  • Install Alluxio
    • Local Machine
    • Cluster
    • Cluster with HA
    • Docker
    • Software Requirements
  • Kubernetes
    • Deploy
    • Spark on Kubernetes
    • Metrics
  • Cloud Native
    • Alibaba Cloud ACK
    • AWS EMR
    • Tencent EMR
    • Google Dataproc
  • Compute Integration
    • Apache Spark
    • Apache Hadoop MapReduce
    • Apache Flink
    • Apache Hive
    • Presto on Iceberg (Experimental)
    • Presto
    • Trino
    • Tensorflow
  • Storage Integrations
    • Amazon AWS S3
    • HDFS
    • Azure Blob Store
    • Azure Data Lake Storage Gen2
    • Azure Data Lake Storage
    • Google Cloud Storage
    • Qiniu Kodo
    • COSN
    • CephObjectStorage
    • MinIO
    • NFS
    • Aliyun Object Storage Service
    • Ozone
    • Swift
    • WEB
    • CephFS
  • Security
  • Operations
    • Configuration Settings
    • User CLI
    • Admin CLI
    • Web UI
    • Journal Management
    • Metastore Management
    • Metrics
  • Administration
    • Troubleshooting
    • Basic Logging
    • Remote Logging
    • Performance Tuning
    • Scalability Tuning
    • StressBench (Experimental)
    • Upgrading
  • Solutions
  • Client APIs
    • Java API
    • S3 API
    • REST API
    • POSIX API
  • Contributor Resources
    • Building Alluxio From Source
    • Contribution Guide
    • Code Conventions
    • Documentation Conventions
    • Contributor Tools
  • Reference
    • List Of Configuration Properties
    • List of Metrics
  • REST API
    • Master REST API
    • Worker REST API
    • Proxy REST API
    • Job REST API
  • Javadoc
Powered by GitBook
On this page
  • Prerequisites
  • Configuration
  • Set property in core-site.xml
  • Specify path to core-site.xml in conf/flink-conf.yaml
  • Distribute the Alluxio Client Jar
  • Translate additional Alluxio site properties to Flink
  • Using Alluxio with Flink
  • Wordcount Example
  1. Compute Integration

Apache Flink

Last updated 7 months ago

This guide describes how to get Alluxio running with , so that you can easily work with files stored in Alluxio.

Prerequisites

  • Setup Java for Java 8 Update 161 or higher (8u161+), 64-bit.

  • Alluxio has been set up and is running.

  • Flink has been installed and set up.

Configuration

Apache Flink allows to use Alluxio through a generic file system wrapper for the Hadoop file system. Therefore, the configuration of Alluxio is done mostly in Hadoop configuration files.

Set property in core-site.xml

If you have a Hadoop setup next to the Flink installation, add the following property to the core-site.xml configuration file:

<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
</property>

In case you don't have a Hadoop setup, you have to create a file called core-site.xml with the following contents:

<configuration>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
</configuration>

Specify path to core-site.xml in conf/flink-conf.yaml

Next, you have to specify the path to the Hadoop configuration in Flink. Open the conf/flink-conf.yaml file in the Flink root directory and set the fs.hdfs.hadoopconf configuration value to the directory containing the core-site.xml. (For newer Hadoop versions, the directory usually ends with etc/hadoop.)

Distribute the Alluxio Client Jar

We need to make the Alluxio jar file available to Flink, because it contains the configured alluxio.hadoop.FileSystem class.

There are different ways to achieve that:

  • Put the {{site.ALLUXIO_CLIENT_JAR_PATH}} file into the lib directory of Flink (for local and standalone cluster setups)

  • Put the {{site.ALLUXIO_CLIENT_JAR_PATH}} file into the ship directory for Flink on YARN.

  • Specify the location of the jar file in the HADOOP_CLASSPATH environment variable (make sure its available on all cluster nodes as well). For example like this:

$ export HADOOP_CLASSPATH={{site.ALLUXIO_CLIENT_JAR_PATH}}

Translate additional Alluxio site properties to Flink

In addition, if there are any client-related properties specified in conf/alluxio-site.properties, translate those to env.java.opts in {FLINK_HOME}/conf/flink-conf.yaml for Flink to pick up Alluxio configuration. For example, if you want to configure Alluxio client to use CACHE_THROUGH as the write type, you should add the following to {FLINK_HOME}/conf/flink-conf.yaml.

env.java.opts: -Dalluxio.user.file.writetype.default=CACHE_THROUGH

Note: If there are running flink clusters, stop the flink clusters and restart them to apply the changes to the configuration.

Using Alluxio with Flink

To use Alluxio with Flink, just specify paths with the alluxio:// scheme.

If Alluxio is installed locally, a valid path would look like this alluxio://localhost:19998/user/hduser/gutenberg.

Wordcount Example

This example assumes you have set up Alluxio and Flink as previously described.

Put the file LICENSE into Alluxio, assuming you are in the top level Alluxio project directory:

$ bin/alluxio fs copyFromLocal LICENSE alluxio://localhost:19998/LICENSE

Run the following command from the top level Flink project directory:

$ bin/flink run examples/batch/WordCount.jar \
  --input alluxio://localhost:19998/LICENSE \
  --output alluxio://localhost:19998/output

In order to communicate with Alluxio, we need to provide Flink programs with the Alluxio Core Client jar. We recommend you to download the tarball from Alluxio . Alternatively, advanced users can choose to compile this client jar from the source code by following the instructions . The Alluxio client jar can be found at {{site.ALLUXIO_CLIENT_JAR_PATH}}.

Open your browser and check . There should be an output file output which contains the word counts of the file LICENSE.

Apache Flink
download page
here
http://localhost:19999/browse