
Apache Hadoop MapReduce

Last updated 7 months ago

This guide describes how to configure Alluxio with Apache Hadoop MapReduce, so that your MapReduce programs can read and write data stored in Alluxio.

Prerequisites

  • Alluxio has been set up and is running.

  • Make sure that the Alluxio client jar is available on each machine. This Alluxio client jar file can be found at {{site.ALLUXIO_CLIENT_JAR_PATH}} in the tarball downloaded from the Alluxio download page. Alternatively, advanced users can compile the client jar from the source code by following the instructions.

  • In order to run MapReduce examples, we also recommend downloading the hadoop-mapreduce-examples jar based on your Hadoop version. For example, if you are using Hadoop 2.7, download this examples jar.

Basic Setup

Configuring Hadoop Core-site Properties

This step is only required for Hadoop 1.x and can be skipped by users of Hadoop 2.x or later.

Add the following property to the core-site.xml of your Hadoop installation:

<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
  <description>The Alluxio FileSystem</description>
</property>

This allows MapReduce jobs to recognize URIs with the Alluxio scheme alluxio:// in their input and output paths.
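As a quick sanity check, the following sketch tests whether a core-site.xml file declares the fs.alluxio.impl property. The has_property helper is hypothetical, and the file location is an assumption (HADOOP_CONF_DIR if set, otherwise the conf directory under HADOOP_HOME); adjust it for your installation.

```shell
# Succeed if the Hadoop XML config file $1 declares property name $2.
# This is a literal text match on the <name> element, not a full XML parse.
has_property() {
  local file="$1" name="$2"
  [ -f "$file" ] && grep -qF "<name>${name}</name>" "$file"
}

# Assumed location of core-site.xml; adjust to your installation.
CORE_SITE="${HADOOP_CONF_DIR:-${HADOOP_HOME:-.}/conf}/core-site.xml"

if has_property "$CORE_SITE" "fs.alluxio.impl"; then
  echo "fs.alluxio.impl is declared in ${CORE_SITE}"
else
  echo "fs.alluxio.impl is not declared in ${CORE_SITE}"
fi
```

Because the check is a plain text match, it does not detect a property that is commented out or overridden elsewhere.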

Distributing the Alluxio Client Jar

In order for the MapReduce applications to read and write files in Alluxio, the Alluxio client jar must be on the JVM classpath of all nodes of the application.

The Alluxio client jar should also be added to the HADOOP_CLASSPATH environment variable. This makes the Alluxio client available to the JVMs that are created when running the hadoop jar command:

$ export HADOOP_CLASSPATH={{site.ALLUXIO_CLIENT_JAR_PATH}}:${HADOOP_CLASSPATH}
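If the export above lives in a shell profile that is sourced repeatedly, the same jar can end up on HADOOP_CLASSPATH several times. A minimal sketch of an idempotent variant follows; both function names are hypothetical helpers, not part of Hadoop or Alluxio.

```shell
# Succeed if the colon-separated list $1 contains the exact entry $2.
classpath_contains() {
  local classpath="$1" entry="$2"
  case ":${classpath}:" in
    *":${entry}:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Prepend the jar path $1 to HADOOP_CLASSPATH unless it is already listed.
add_to_hadoop_classpath() {
  local jar="$1"
  if ! classpath_contains "${HADOOP_CLASSPATH:-}" "$jar"; then
    export HADOOP_CLASSPATH="${jar}${HADOOP_CLASSPATH:+:${HADOOP_CLASSPATH}}"
  fi
}
```

Call add_to_hadoop_classpath with the path of the Alluxio client jar on your machine.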

You can use the -libjars command line option when using hadoop jar ..., specifying {{site.ALLUXIO_CLIENT_JAR_PATH}} as the argument of -libjars. Hadoop will place the jar in the Hadoop DistributedCache, making it available to all the nodes. For example, the following command adds the Alluxio client jar to the -libjars option:

$ ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount \
  -libjars {{site.ALLUXIO_CLIENT_JAR_PATH}} <INPUT FILES> <OUTPUT DIRECTORY>

Example

For this example, we will use a pseudo-distributed Hadoop cluster, started by running:

$ cd $HADOOP_HOME
$ ./bin/stop-all.sh
$ ./bin/start-all.sh

Depending on the Hadoop version, you may need to replace ./bin with ./sbin.

Start Alluxio locally:

$ ./bin/alluxio-start.sh local SudoMount

You can add a sample file to Alluxio to run MapReduce wordcount on. From your Alluxio directory:

$ ./bin/alluxio fs mkdir /wordcount
$ ./bin/alluxio fs copyFromLocal LICENSE /wordcount/input.txt

This command will copy the LICENSE file into the Alluxio namespace with the path /wordcount/input.txt.

Now we can run a MapReduce job (using Hadoop 2.7.3 as an example) for wordcount.

$ ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount \
  -libjars {{site.ALLUXIO_CLIENT_JAR_PATH}} \
  alluxio://localhost:19998/wordcount/input.txt \
  alluxio://localhost:19998/wordcount/output

After this job completes, the result of the wordcount will be in the /wordcount/output directory in Alluxio. You can see the resulting files by running:

$ ./bin/alluxio fs ls /wordcount/output
$ ./bin/alluxio fs cat /wordcount/output/part-r-00000
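The input and output arguments in the wordcount command are ordinary Alluxio paths prefixed with the master address. The sketch below builds such URIs from a path; alluxio_uri is a hypothetical helper, and the defaults (localhost, port 19998) are simply the values used in this example.

```shell
# Print an alluxio:// URI for the Alluxio path $1.
# Optional $2 is the master host, optional $3 is the master RPC port.
alluxio_uri() {
  local path="$1" host="${2:-localhost}" port="${3:-19998}"
  # Strip one leading slash so the URI has exactly one after the port.
  printf 'alluxio://%s:%s/%s\n' "$host" "$port" "${path#/}"
}

alluxio_uri /wordcount/input.txt
alluxio_uri /wordcount/output
```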

Advanced Setup

Distributing the Alluxio Client Jar

You could place the client jar {{site.ALLUXIO_CLIENT_JAR_PATH}} in the $HADOOP_HOME/lib directory (it may be $HADOOP_HOME/share/hadoop/common/lib in other versions of Hadoop) of every MapReduce node, and then restart Hadoop. Alternatively, add this jar to the mapreduce.application.classpath property of your Hadoop deployment to ensure the Alluxio client jar is on the classpath.

Note that the jar must be installed again for each new release. On the other hand, when the jar is already on every node, the -libjars command line option is not needed.
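Manual distribution usually amounts to copying the jar to the same directory on every node. The following dry-run sketch only prints the scp commands it would run, so they can be reviewed first; print_copy_commands is a hypothetical helper, and the node names and destination directory you pass in are your own.

```shell
# Print one scp command per node for copying jar $1 into directory $2.
# Remaining arguments are the node hostnames. Nothing is copied.
print_copy_commands() {
  local jar="$1" dest_dir="$2"
  shift 2
  local node
  for node in "$@"; do
    printf 'scp %s %s:%s/\n' "$jar" "$node" "$dest_dir"
  done
}
```

Pipe the output to sh once the commands look right, and remember to restart Hadoop afterwards.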

Customize Alluxio User Properties for All MapReduce Jobs

Alluxio configuration parameters can be added to the Hadoop core-site.xml file to affect all MapReduce jobs.

For example, when Alluxio is running in HA mode, all MapReduce jobs need the Alluxio client configured to communicate with the masters in HA mode. To configure clients to communicate with an Alluxio cluster in HA mode using internal leader election, add the following section to your Hadoop installation's core-site.xml:

<configuration>
  <property>
    <name>alluxio.master.rpc.addresses</name>
    <value>master_hostname_1:19998,master_hostname_2:19998,master_hostname_3:19998</value>
  </property>
</configuration>
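The property value is a comma-separated list of host:port pairs, one per master. As an illustration, this sketch assembles that value from a list of hostnames; rpc_addresses is a hypothetical helper, and it assumes every master uses the same RPC port.

```shell
# Print a comma-separated host:port list.
# $1 is the RPC port; remaining arguments are master hostnames.
rpc_addresses() {
  local port="$1"
  shift
  local result="" host
  for host in "$@"; do
    result="${result:+${result},}${host}:${port}"
  done
  printf '%s\n' "$result"
}

# Example with the placeholder hostnames used above.
rpc_addresses 19998 master_hostname_1 master_hostname_2 master_hostname_3
```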

Customize Alluxio User Properties for Individual MapReduce Jobs

Hadoop MapReduce users can add "-Dproperty=value" after the hadoop jar or yarn jar command and the properties will be propagated to all the tasks of this job. For example, the following MapReduce wordcount job sets write type to CACHE_THROUGH when writing to Alluxio:

$ ./bin/hadoop jar libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount \
  -Dalluxio.user.file.writetype.default=CACHE_THROUGH \
  -libjars {{site.ALLUXIO_CLIENT_JAR_PATH}} \
  <INPUT FILES> <OUTPUT DIRECTORY>
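Generic options such as -D and -libjars are placed before the job's own arguments, as in the command above. The sketch below assembles such a command line from a list of property=value pairs and prints it without running it; build_wordcount_cmd is a hypothetical helper, and the jar paths are whatever you pass in.

```shell
# Print a wordcount command line with generic options ahead of job arguments.
# $1 examples jar, $2 Alluxio client jar, $3 input, $4 output,
# remaining arguments are property=value pairs to pass with -D.
build_wordcount_cmd() {
  local examples_jar="$1" client_jar="$2" input="$3" output="$4"
  shift 4
  local cmd="hadoop jar ${examples_jar} wordcount"
  local prop
  for prop in "$@"; do
    cmd="${cmd} -D${prop}"
  done
  printf '%s -libjars %s %s %s\n' "$cmd" "$client_jar" "$input" "$output"
}
```

Printing the command first makes it easy to confirm which properties will be propagated before submitting the job.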

Troubleshooting

Logging Configuration

Hadoop logging can be configured in several ways. If you wish to directly modify the log4j.properties file for Hadoop, you can add or modify appenders within ${HADOOP_HOME}/conf/log4j.properties on each of the nodes in your cluster.

Q: Why do I see exceptions like "No FileSystem for scheme: alluxio"?

A: This error message is seen when your MapReduce application tries to access Alluxio as an HDFS-compatible file system, but the alluxio:// scheme is not recognized by the application. Please make sure your Hadoop configuration file core-site.xml has the following property:

<configuration>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
</configuration>

Q: Why do I see exceptions like "java.lang.RuntimeException: java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found"?

A: This error message is seen when your MapReduce application tries to access Alluxio as an HDFS-compatible file system and the alluxio:// scheme has been configured correctly, but the Alluxio client jar is not found on the classpath of your application.

You can add the client jar to $HADOOP_CLASSPATH:

$ export HADOOP_CLASSPATH={{site.ALLUXIO_CLIENT_JAR_PATH}}:${HADOOP_CLASSPATH}

If the classpath has been set but the exception persists, check whether the path is valid by running:

$ ls {{site.ALLUXIO_CLIENT_JAR_PATH}}
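Beyond checking that the file exists, it can help to confirm that the jar actually contains the class named in the exception. This is a minimal sketch that assumes the unzip tool is installed; jar_has_alluxio_fs is a hypothetical helper.

```shell
# Succeed if the jar at path $1 exists and lists the class
# alluxio.hadoop.FileSystem among its entries.
jar_has_alluxio_fs() {
  local jar="$1"
  [ -f "$jar" ] || return 1
  unzip -l "$jar" 2>/dev/null | grep -qF 'alluxio/hadoop/FileSystem.class'
}
```

Run it with the path of your client jar, for example inside an if statement that prints a warning when the check fails.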

Alternative configurations are described in the Advanced Setup section.

This guide on how to include 3rd party libraries from Cloudera describes several ways to distribute the jars to nodes in the cluster. From that guide, the recommended way to distribute the Alluxio client jar is to use the distributed cache, via the -libjars command line option. Another way to distribute the client jar is to manually distribute it to all the Hadoop nodes.

Tip: The previous wordcount example is also applicable to Alluxio in HA mode. See the instructions on Using the HDFS API to connect to Alluxio with high availability.

For more details on configuring clients for HA mode, see HA mode client configuration parameters.

For logging, you may also modify the configuration values in mapred-site.xml in your installation. If you simply wish to modify log levels, you can change mapreduce.map.log.level or mapreduce.reduce.log.level.

If you are using YARN, you may also wish to modify some of the yarn.log.* properties, which can be found in yarn-site.xml.