Google Dataproc
This guide describes how to configure Alluxio to run on Google Cloud Dataproc.
Overview
Google Cloud Dataproc is a managed on-demand service to run Presto, Spark, and Hadoop compute workloads. It manages the deployment of various Hadoop services and allows for hooks into these services for customizations. Aside from the added performance benefits of caching, Alluxio enables users to run compute workloads against on-premises storage or a different cloud provider's storage, such as AWS S3 and Azure Blob Store.
Prerequisites
A project with Cloud Dataproc API and Compute Engine API enabled.
A GCS Bucket.
The gcloud CLI set up with necessary GCS interoperable storage access keys.
Note: GCS interoperability should be enabled in the Interoperability tab in GCS settings.
A GCS bucket is required if mounting the bucket to the root of the Alluxio namespace. Alternatively, the root UFS can be reconfigured to be HDFS or any other supported under storage. The type of VM instance to use for the Alluxio master and workers depends on the workload characteristics. Generally recommended VM instance types for the Alluxio master are n2-highmem-16 or n2-highmem-32. Worker VM instance types of n2-standard-16 or n2-standard-32 enable the use of SSDs as an Alluxio worker storage tier.
Basic Setup
When creating a Dataproc cluster, Alluxio can be installed using an initialization action.
Create a cluster
There are several properties set as metadata labels which control the Alluxio deployment.
A required argument is the root UFS address, configured using alluxio_root_ufs_uri. If set to LOCAL, the HDFS cluster residing within the same Dataproc cluster will be used as Alluxio's root UFS. Specify Alluxio properties using the metadata key alluxio_site_properties, delimiting multiple properties with a semicolon (;).
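Because the whole alluxio_site_properties value is a single semicolon-delimited string, it can help to assemble it in a shell variable first. A minimal sketch (the two property keys below are standard Alluxio client settings, chosen purely for illustration):

```shell
# Build a semicolon-delimited alluxio_site_properties value.
# Both keys are standard Alluxio settings; substitute your own.
SITE_PROPS="alluxio.user.file.writetype.default=CACHE_THROUGH"
SITE_PROPS="${SITE_PROPS};alluxio.user.file.readtype.default=CACHE"
echo "${SITE_PROPS}"
# alluxio.user.file.writetype.default=CACHE_THROUGH;alluxio.user.file.readtype.default=CACHE
```

The assembled value can then be passed to gcloud as alluxio_site_properties="${SITE_PROPS}".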
Example 1: use a Google Cloud Storage bucket as Alluxio root UFS
$ gcloud dataproc clusters create <cluster_name> \
--initialization-actions gs://alluxio-public/dataproc/{{site.ALLUXIO_VERSION_STRING}}/alluxio-dataproc.sh \
--metadata \
alluxio_root_ufs_uri=gs://<my_bucket>,\
alluxio_site_properties="fs.gcs.accessKeyId=<my_access_key>;fs.gcs.secretAccessKey=<my_secret_key>"
Example 2: use Dataproc internal HDFS as Alluxio root UFS
$ gcloud dataproc clusters create <cluster_name> \
--initialization-actions gs://alluxio-public/dataproc/{{site.ALLUXIO_VERSION_STRING}}/alluxio-dataproc.sh \
--metadata \
alluxio_root_ufs_uri="LOCAL",\
alluxio_hdfs_version="2.9",\
alluxio_site_properties="alluxio.master.mount.table.root.option.alluxio.underfs.hdfs.configuration=/etc/hadoop/conf/core-site.xml:/etc/hadoop/conf/hdfs-site.xml"
Customization
The Alluxio deployment on Google Dataproc can be customized for more complex scenarios by passing additional metadata labels to the gcloud clusters create command.
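As a sketch, assuming the init action supports an alluxio_ssd_capacity_usage metadata key (a label for dedicating a percentage of each worker's local SSD as an Alluxio storage tier; verify the supported keys against the init action's documentation for your Alluxio version), a customized deployment might look like:

```shell
# Create a cluster with local SSDs on the workers and ask the init
# action to use 60% of each SSD as Alluxio worker storage.
# alluxio_ssd_capacity_usage is an assumed metadata key; confirm it
# against the init action's documentation before relying on it.
$ gcloud dataproc clusters create <cluster_name> \
--worker-machine-type n2-standard-16 \
--num-worker-local-ssds 1 \
--initialization-actions gs://alluxio-public/dataproc/{{site.ALLUXIO_VERSION_STRING}}/alluxio-dataproc.sh \
--metadata \
alluxio_root_ufs_uri=gs://<my_bucket>,\
alluxio_ssd_capacity_usage="60"
```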
Next steps
The status of the cluster deployment can be monitored using the CLI.
$ gcloud dataproc clusters list
Identify the instance name and SSH into this instance to test the deployment.
$ gcloud compute ssh <cluster_name>-m
Test that Alluxio is running as expected
$ sudo runuser -l alluxio -c "alluxio runTests"
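Beyond runTests, the stock Alluxio CLI offers a quick view of cluster health; a sketch of two such checks run on the master instance:

```shell
# Summarize master, worker, and capacity status for the cluster.
$ sudo runuser -l alluxio -c "alluxio fsadmin report"
# List the root of the Alluxio namespace to confirm the root UFS mount.
$ sudo runuser -l alluxio -c "alluxio fs ls /"
```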
Alluxio is installed and configured in /opt/alluxio/. Alluxio services are started as the alluxio user.
Compute Applications
Spark, Hive, and Presto on Dataproc are pre-configured to connect to Alluxio.
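As a quick sanity check of the Spark integration, a file already stored in Alluxio can be read through the alluxio:// scheme from the master node. A sketch, where the hostname and path are placeholders and 19998 is Alluxio's default master RPC port:

```shell
# Launch an interactive Spark shell on the Dataproc master node.
$ spark-shell
# Inside the shell, count the lines of a file stored in Alluxio.
# <master_hostname> and /path/to/file are placeholders.
scala> spark.read.textFile("alluxio://<master_hostname>:19998/path/to/file").count()
```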