AWS EMR
This guide describes how to configure Alluxio to run on AWS EMR.
Overview
AWS EMR runs clusters on demand to handle compute workloads. It manages the deployment of various Hadoop services and provides hooks into these services for customization. Alluxio can run on EMR to provide functionality beyond what EMRFS currently provides. In addition to the performance benefits of caching, Alluxio enables users to run compute workloads against on-premises storage or a different cloud provider's storage, such as GCS and Azure Blob Store.
Prerequisites
Account with AWS
IAM Account with the default EMR Roles
Key Pair for EC2
An S3 Bucket
AWS CLI: configured with your AWS access key ID and secret access key
Most of the prerequisites are covered in the AWS EMR Getting Started guide. An S3 bucket is needed as Alluxio's root Under File System (UFS) and as the location for the bootstrap script. If desired, the root UFS can instead be HDFS or any other supported under storage. The EC2 instance type to use for the Alluxio master and workers depends on the workload characteristics. Generally recommended instance types for the Alluxio master are r5.4xlarge or r5.8xlarge. The r5d.4xlarge and r5d.8xlarge instance types allow SSD to be used as the Alluxio worker storage tier.
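As a rough sketch of how these prerequisites fit together, the snippet below assembles an `aws emr create-cluster` command line and prints it without running it. The bucket name, bootstrap script path, key pair name, release label, and instance count are placeholders assumed for illustration, not values from this guide. Passing the root UFS path as the bootstrap script's argument is also an assumption about how the script is invoked. For simplicity the sketch uses one instance type for every node; separate master and worker types would need instance groups instead.

```python
import shlex


def build_create_cluster_command(bucket, bootstrap_script, key_pair,
                                 instance_type="r5d.4xlarge",
                                 instance_count=3,
                                 release_label="emr-5.25.0"):
    """Return the argument list for `aws emr create-cluster` (not executed)."""
    # Hypothetical layout: root UFS and bootstrap script live in the same bucket.
    root_ufs = f"s3://{bucket}/alluxio-root/"
    script_uri = f"s3://{bucket}/{bootstrap_script}"
    return [
        "aws", "emr", "create-cluster",
        "--name", "Alluxio",
        "--release-label", release_label,
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--applications", "Name=Presto", "Name=Hive",
        "--use-default-roles",  # relies on the default EMR roles from the prerequisites
        "--ec2-attributes", f"KeyName={key_pair}",
        # Assumption: the bootstrap script takes the root UFS URI as its argument.
        "--bootstrap-actions", f"Path={script_uri},Args=[{root_ufs}]",
    ]


command = build_create_cluster_command(
    "my-bucket", "bootstrap/alluxio-emr.sh", "my-key-pair")
print(shlex.join(command))
```

Review the printed command and replace every placeholder before running it against a real account.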
Basic Setup
Creating a Table
The simplest way to start using EMR with Alluxio is to create a table on Alluxio and query it using Presto or Hive.
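A minimal sketch of what such a table definition might look like follows. It only builds the Hive DDL and a sample query as strings. The table name, columns, delimiter, and the `/testTable` path are hypothetical, and the `alluxio:///` form of the location assumes the cluster's Alluxio client is already configured with the master address.

```python
def build_create_table_ddl(table, columns, alluxio_path, delimiter="|"):
    """Build a Hive CREATE EXTERNAL TABLE statement backed by an Alluxio path."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}' "
        # The alluxio:// scheme routes reads and writes through Alluxio.
        f"LOCATION 'alluxio://{alluxio_path}'"
    )


# Hypothetical schema and path, for illustration only.
ddl = build_create_table_ddl(
    "users",
    [("userid", "INT"), ("age", "INT"), ("zipcode", "STRING")],
    "/testTable",
)
query = "SELECT * FROM users LIMIT 10"

print(ddl + ";")
print(query + ";")
```

The DDL would typically be run from the Hive CLI on the master node; the table can then be queried from either Hive or Presto.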
Customization
Alluxio properties can be tuned in a few different locations. Depending on which service needs tuning, EMR offers different ways to modify service settings and environment variables.
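One of these mechanisms is EMR's configuration JSON, which groups properties under a classification per service. The sketch below generates such a document for a single Alluxio client property. Placing the property under the `core-site` classification, and the property and value chosen, are assumptions for illustration; check which classification each service actually reads before relying on it.

```python
import json


def build_emr_configurations(alluxio_client_properties):
    """Wrap Alluxio client properties in EMR's configuration-list format."""
    return [
        {
            # Assumption: Hadoop-based services pick these up via core-site.
            "Classification": "core-site",
            "Properties": dict(alluxio_client_properties),
        },
    ]


# Example property, chosen only to illustrate the shape of the document.
configurations = build_emr_configurations(
    {"alluxio.user.file.writetype.default": "CACHE_THROUGH"})
print(json.dumps(configurations, indent=2))
```

The resulting JSON can be saved to a file and supplied through the `--configurations` option of `aws emr create-cluster`.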