AWS EMR
This guide describes how to configure Alluxio to run on AWS EMR.
AWS EMR provides on-demand clusters for handling compute workloads. It manages the deployment of various Hadoop services and offers hooks into these services for customization. Alluxio can run on EMR to provide functionality beyond what EMRFS currently offers. Besides the performance benefits of caching, Alluxio lets users run compute workloads against on-premises storage or another cloud provider's storage, such as GCS or Azure Blob Storage.
Account with AWS
IAM Account with the default EMR Roles
Key Pair for EC2
An S3 Bucket
AWS CLI: configured with your AWS access key id and secret access key
Most of the prerequisites can be set up by following the AWS EMR getting started guide. An S3 bucket is needed as Alluxio's root Under File System (UFS) and as the location for the bootstrap script. If desired, the root UFS can instead be HDFS or any other supported under storage. The EC2 instance type to use for the Alluxio master and workers depends on the workload characteristics. Generally recommended instance types for the Alluxio master are r5.4xlarge and r5.8xlarge. The r5d.4xlarge and r5d.8xlarge instance types allow SSD to be used as the Alluxio worker storage tier.
The simplest way to start using EMR with Alluxio is to create a table on Alluxio and query it using Presto or Hive.
Alluxio properties can be tuned in a few different places. Depending on which service needs tuning, EMR offers different ways to modify service settings and environment variables.
The create-cluster command requires several flags to execute successfully:
instance-type: The instance type to provision with. Note that your account is limited in the number of instances you can launch in each region; check your instance limits. A good instance type to start with is r4.4xlarge.
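As a rough illustration of how these flags fit together, the sketch below assembles an `aws emr create-cluster` invocation as an argument list, without running it. The cluster name, the release label default, the key pair name, and the bootstrap script location are placeholders assumed for this example and are not taken from this guide; substitute the values for your own account and Alluxio release.

```python
import shlex


def build_create_cluster_command(instance_type, instance_count, key_name,
                                 bootstrap_script, root_ufs_uri,
                                 release_label="emr-5.25.0"):
    """Assemble an `aws emr create-cluster` call as a list of arguments.

    All values are supplied by the caller; the release label default is
    only an assumption made for this sketch.
    """
    return [
        "aws", "emr", "create-cluster",
        "--name", "alluxio-example",            # hypothetical cluster name
        "--release-label", release_label,
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--applications", "Name=Presto", "Name=Hive",
        "--use-default-roles",                  # relies on the default EMR roles
        "--ec2-attributes", f"KeyName={key_name}",
        # The bootstrap script receives the root UFS URI as its first argument.
        "--bootstrap-actions",
        f"Path={bootstrap_script},Args=[{root_ufs_uri}]",
    ]


def to_shell(command):
    """Render the argument list as a single shell-quoted string for review."""
    return " ".join(shlex.quote(part) for part in command)


if __name__ == "__main__":
    example = build_create_cluster_command(
        instance_type="r4.4xlarge",
        instance_count=3,
        key_name="my-key-pair",                              # placeholder
        bootstrap_script="s3://my-bucket/bootstrap.sh",      # placeholder
        root_ufs_uri="s3://my-bucket/mount-point",           # placeholder
    )
    print(to_shell(example))
```

Building the command as a list keeps each flag and value separate, so it can be inspected or passed to `subprocess.run` without shell quoting problems.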
The first argument, the root UFS URI, is required. This S3 URI designates the root mount of the Alluxio file system and should be of the form s3://bucket-name/mount-point. The mount point should be a folder; follow the Amazon S3 instructions for creating a folder in a bucket.
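A small check like the following can catch a malformed root UFS URI before a cluster is launched. It is only a sketch: the function name `parse_root_ufs_uri` is made up for this example, and rejecting a URI with no mount point is an assumption based on the `s3://bucket-name/mount-point` form described above.

```python
from urllib.parse import urlparse


def parse_root_ufs_uri(uri):
    """Split an s3://bucket-name/mount-point URI into (bucket, mount_point).

    Raises ValueError if the URI does not match the expected form.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got: {uri!r}")
    if not parsed.netloc:
        raise ValueError(f"missing bucket name in: {uri!r}")
    # Strip the leading and trailing slashes from the folder path.
    mount_point = parsed.path.strip("/")
    if not mount_point:
        # Assumption for this sketch: a folder inside the bucket is required.
        raise ValueError(f"missing mount point folder in: {uri!r}")
    return parsed.netloc, mount_point


if __name__ == "__main__":
    print(parse_root_ufs_uri("s3://bucket-name/mount-point"))
```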
You can also specify additional Alluxio properties as a delimited list of key-value pairs in the format key=value. For example, alluxio.user.file.writetype.default=CACHE_THROUGH instructs Alluxio to write files synchronously to the underlying storage system. See the Alluxio documentation for more about write type options.
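The key=value list can be generated from a dictionary, which avoids typos when several properties are set. This is a minimal sketch: the helper name is hypothetical, and the `;` delimiter is an assumption, since this guide does not state which delimiter the bootstrap script expects. Check the script you are using before relying on it.

```python
def format_alluxio_properties(properties, delimiter=";"):
    """Join Alluxio properties into a delimited list of key=value pairs.

    The default delimiter is an assumption made for this sketch.
    """
    pairs = []
    for key, value in properties.items():
        value = str(value)
        # Reject characters that would make the list ambiguous to parse.
        if "=" in key or delimiter in key or delimiter in value:
            raise ValueError(f"cannot encode property: {key!r}={value!r}")
        pairs.append(f"{key}={value}")
    return delimiter.join(pairs)


if __name__ == "__main__":
    print(format_alluxio_properties(
        {"alluxio.user.file.writetype.default": "CACHE_THROUGH"}
    ))
```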
Log into the .
Note that the example create-cluster command does not specify a security group, so one is created automatically for the new cluster. This security group is not configured to allow inbound SSH. For the above SSH command to work, edit the ElasticMapReduce-master security group in the EC2 console and add an inbound rule for port 22 with source 0.0.0.0/0. Note that a source of 0.0.0.0/0 accepts connections from any IPv4 address. See the AWS documentation on security groups for more details.
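The same inbound rule can also be added from the AWS CLI instead of the EC2 console. The sketch below only builds the `aws ec2 authorize-security-group-ingress` argument list and validates the source CIDR; the function name and the security group ID are placeholders, and you would need to look up the actual ID of the ElasticMapReduce-master group in your account.

```python
import ipaddress


def build_ssh_ingress_command(group_id, source_cidr):
    """Build the AWS CLI call that opens inbound SSH (TCP port 22).

    Raises ValueError if source_cidr is not a valid IPv4 network.
    """
    network = ipaddress.IPv4Network(source_cidr)
    return [
        "aws", "ec2", "authorize-security-group-ingress",
        "--group-id", group_id,
        "--protocol", "tcp",
        "--port", "22",
        "--cidr", str(network),
    ]


if __name__ == "__main__":
    # "sg-0123456789abcdef0" is a placeholder, not a real group ID.
    print(build_ssh_ingress_command("sg-0123456789abcdef0", "0.0.0.0/0"))
```

Passing a narrower range than 0.0.0.0/0, such as a single address with a /32 suffix, limits SSH access to that source.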
Create a database, then check in the to see if the database is created.