HDFS
This guide describes how to configure HDFS as Alluxio's under storage system. HDFS, the Hadoop Distributed File System, is the primary distributed storage used by Hadoop applications, providing reliable and scalable storage for big data processing in Hadoop ecosystems. For more information about HDFS, please read its documentation.
If you haven't already, please see Prerequisites before you get started.
In preparation for using HDFS with Alluxio, use the mount table operations to add a new mount point, specifying the Alluxio path to create the mount on and the HDFS address as the UFS URI. Credentials and configuration options can also be specified as part of the mount command via the --option flag, as described in configuring mount points.
An example command to mount hdfs://<HDFS_NAMENODE>:<HDFS_PORT> to /hdfs:
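A minimal sketch of such a command, assuming the mount add subcommand with --path and --ufs-uri flags from the Alluxio mount table CLI (verify the exact syntax against your release's command-line reference):

```shell
$ bin/alluxio mount add \
    --path /hdfs \
    --ufs-uri hdfs://<HDFS_NAMENODE>:<HDFS_PORT> \
    --option alluxio.underfs.version=<HADOOP_VERSION>
```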
The UFS URI can be hdfs://localhost:8020 if you are running the HDFS namenode locally on the default port and mapping the HDFS root directory to Alluxio, or hdfs://localhost:8020/alluxio/data if only the HDFS directory /alluxio/data is mapped to Alluxio. To find out where HDFS is running, use hdfs getconf -confKey fs.defaultFS to get the default hostname and port HDFS is listening on.
Additionally, you should specify the property alluxio.underfs.version to be your HDFS version. See mounting HDFS with specific versions.
Note that configuration options can be specified as mount options or as configuration properties in conf/alluxio-site.properties. The following sections describe how to set configurations as properties, but they can also be set as mount options via --option <key>=<value>.
When HDFS has non-default configurations, you need to configure Alluxio servers to access HDFS with the proper configuration files. Note that once this is set, applications using the Alluxio client do not need any special configuration.
There are two possible approaches:
Copy or make symbolic links from hdfs-site.xml and core-site.xml in your Hadoop installation into ${ALLUXIO_HOME}/conf. Make sure this is set up on all servers running Alluxio.
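For example, a sketch of the symbolic-link approach, assuming the Hadoop configuration lives in /etc/hadoop/conf (adjust both paths to your installation):

```shell
$ ln -s /etc/hadoop/conf/hdfs-site.xml ${ALLUXIO_HOME}/conf/hdfs-site.xml
$ ln -s /etc/hadoop/conf/core-site.xml ${ALLUXIO_HOME}/conf/core-site.xml
```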
Alternatively, you can set the property alluxio.underfs.hdfs.configuration in conf/alluxio-site.properties to point to your hdfs-site.xml and core-site.xml. Make sure this configuration is set on all servers running Alluxio.
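For example, in conf/alluxio-site.properties, with /etc/hadoop/conf again as an assumed location (the colon-separated two-file form follows the Alluxio property reference; confirm it for your version):

```properties
alluxio.underfs.hdfs.configuration=/etc/hadoop/conf/core-site.xml:/etc/hadoop/conf/hdfs-site.xml
```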
To configure Alluxio to work with HDFS namenodes in HA mode, first configure Alluxio servers to access HDFS with the proper configuration files.
In addition, set the under storage address to hdfs://nameservice/ (nameservice is the HDFS nameservice already configured in hdfs-site.xml). To mount an HDFS subdirectory to Alluxio instead of the whole HDFS namespace, change the under storage address to something like hdfs://nameservice/alluxio/data.
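As a sketch, reusing the assumed mount syntax from the example above with an HA nameservice address:

```shell
$ bin/alluxio mount add \
    --path /hdfs \
    --ufs-uri hdfs://nameservice/alluxio/data
```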
To ensure that the permission information of files and directories in HDFS, including user, group, and mode, is consistent with Alluxio (e.g., a file created by user Foo in Alluxio is persisted to HDFS with owner Foo as well), the user that starts the Alluxio coordinator and worker processes is required to be either:
The HDFS superuser. Namely, use the same user that starts the HDFS namenode process to also start the Alluxio coordinator and worker processes.
A member of the HDFS superuser group. Edit the HDFS configuration file hdfs-site.xml and check the value of the configuration property dfs.permissions.superusergroup. If this property is set to a group (e.g., "hdfs"), add the user that starts the Alluxio processes (e.g., "alluxio") to this group ("hdfs"); if this property is not set, add a group to this property and make the user running Alluxio a member of it, as shown in the snippet below.
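For reference, the relevant hdfs-site.xml entry looks like the following (alluxiogroup is a hypothetical group name):

```xml
<property>
  <name>dfs.permissions.superusergroup</name>
  <value>alluxiogroup</value>
</property>
```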
The user set above is only the identity that starts the Alluxio coordinator and worker processes. Once the Alluxio servers are started, it is unnecessary to run your Alluxio client applications as this user.
If your HDFS cluster is Kerberized, first configure Alluxio servers to access HDFS with the proper configuration files.
In addition, security configuration is needed for Alluxio to be able to communicate with the HDFS cluster. Set the following Alluxio properties in alluxio-site.properties:
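Property names for this vary across Alluxio releases; the sketch below uses the keytab and principal properties from the classic Alluxio reference docs, so verify them against your version's property list:

```properties
alluxio.master.keytab.file=<YOUR_HDFS_KEYTAB_FILE_PATH>
alluxio.master.principal=hdfs/<_HOST>@<REALM>
alluxio.worker.keytab.file=<YOUR_HDFS_KEYTAB_FILE_PATH>
alluxio.worker.principal=hdfs/<_HOST>@<REALM>
```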
If connecting to secure HDFS, run kinit on all Alluxio nodes, using the principal hdfs and the keytab that you configured earlier in alluxio-site.properties. A known limitation is that the Kerberos TGT may expire after the max renewal lifetime. You can work around this by renewing the TGT periodically; otherwise you may see No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt) when starting Alluxio services. Another option is to set alluxio.hadoop.security.kerberos.keytab.login.autorenewal=true so that the TGT is automatically refreshed.
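A sketch of the kinit invocation, using the same placeholder keytab path and principal as above; for periodic renewal it can be run from a cron job:

```shell
$ kinit -kt <YOUR_HDFS_KEYTAB_FILE_PATH> hdfs/<_HOST>@<REALM>
```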
The user can also use alluxio.hadoop.security.krb5.conf to specify the krb5.conf file location and alluxio.hadoop.security.authentication to specify the authentication method.
By default, Alluxio will use machine-level Kerberos configuration to determine the Kerberos realm and KDC. You can override these defaults by setting the JVM properties java.security.krb5.realm and java.security.krb5.kdc.
To set these, set ALLUXIO_JAVA_OPTS in conf/alluxio-env.sh.
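For example, in conf/alluxio-env.sh (the realm and KDC values are placeholders):

```shell
ALLUXIO_JAVA_OPTS+=" -Djava.security.krb5.realm=<YOUR_KERBEROS_REALM> -Djava.security.krb5.kdc=<YOUR_KERBEROS_KDC>"
```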
There are multiple ways for a user to mount an HDFS cluster with a specified version as an under storage into the Alluxio namespace. Before mounting HDFS with a specific version, make sure you have built a client with that specific version of HDFS. You can check for the existence of this client in the lib directory under the Alluxio directory.
If you have built Alluxio from source, you can build additional client jar files by running the mvn command under the underfs directory in the Alluxio source tree. For example, issuing the following command would build the client jar for the 2.8.0 version.
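A sketch of that build, assuming the ufs-hadoop-2 Maven profile and the ufs.hadoop.version property used by the Alluxio build (check the underfs module's POM for the exact profile and property names):

```shell
$ mvn clean install -DskipTests \
    -Pufs-hadoop-2 -Dufs.hadoop.version=2.8.0
```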
When mounting the under storage of the Alluxio root directory with a specific HDFS version, one can add the following line to the site properties file (conf/alluxio-site.properties):
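A sketch of that line, using 2.8 as an example version (pair it with whichever property sets your root under storage address in your release):

```properties
alluxio.underfs.version=2.8
```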
Alluxio supports the following versions of HDFS as valid arguments of the mount option alluxio.underfs.version:
Apache Hadoop: 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 2.10, 3.0, 3.1, 3.2, 3.3
Note: Apache Hadoop 1.0 and 1.2 are still supported, but not included in the default download. To build this module yourself, build the shaded Hadoop client and then the UFS module.
Hadoop comes with a native library that provides better performance and additional features compared to its Java implementation. For example, when the native library is used, the HDFS client can use a native checksum function which is more efficient than the default Java implementation. To use the Hadoop native library with the Alluxio HDFS under filesystem, first install the native library on the Alluxio nodes by following the instructions on this page. Once the Hadoop native library is installed on the machine, update the Alluxio startup Java parameters in conf/alluxio-env.sh by adding the following line:
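A sketch of that line; replace the placeholder with the directory containing libhadoop.so on your machines:

```shell
ALLUXIO_JAVA_OPTS+=" -Djava.library.path=<local_path_containing_hadoop_native_library>"
```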
Make sure to restart Alluxio services for the change to take effect.
HDFS supports the trash function by default. When an HDFS file is deleted, it is moved to the trash directory and only later permanently deleted. Users can enable the HDFS trash feature in Alluxio by setting alluxio.underfs.hdfs.trash.enabled=true.
If the feature is enabled, when a user deletes a file that is stored in HDFS via Alluxio, Alluxio will try to move it to the HDFS trash directory instead of deleting it directly from HDFS. Note that HDFS must be configured to enable the trash functionality in core-site.xml. If trash is not enabled in HDFS, Alluxio will delete the file directly, regardless of this configuration.
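For reference, HDFS trash is enabled in core-site.xml via the standard fs.trash.interval setting, which gives the number of minutes deleted files are retained (1440, i.e., one day, is an example value):

```xml
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
```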
<HDFS_NAMENODE>
The hostname or IP address of the NameNode that processes client connections to the cluster. The NameNode is the master node in the Apache Hadoop HDFS architecture; it maintains and manages the blocks present on the DataNodes.
<HDFS_PORT>
The port at which the NameNode accepts client connections.
<HADOOP_VERSION>
The version of the HDFS cluster, used as the value of alluxio.underfs.version to select the matching under storage client (e.g., 2.8.0).