GCS

This guide describes how to configure Alluxio with Google Cloud Storage (GCS) as the under storage system.

Google Cloud Storage (GCS) is a scalable and durable object storage service offered by Google Cloud Platform (GCP). It allows users to store and retrieve various types of data, including unstructured and structured data.

For more information about GCS, please read its documentation.

Prerequisites

If you haven't already, please see Prerequisites before you get started.

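In preparation for using GCS with Alluxio, decide on the following values:

  • <GCS_BUCKET>: The bucket you want to use, either by creating a new bucket or using an existing one

  • <GCS_DIRECTORY>: The directory you want to use in the bucket, either by creating a new directory or using an existing one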

Alluxio provides two ways to access GCS. GCS version 1 is implemented on the jets3t library, which is designed for AWS S3. It therefore accepts only a Google Cloud Storage interoperability access/secret key pair, which grants full access to all Google Cloud Storage buckets inside a Google Cloud project. No permissions or access control can be defined when using the interoperability keys. Combining the Google interoperability API with the jets3t library also makes GCS version 1 slower than the default GCS UFS module.

The default GCS UFS module (GCS version 2) is implemented on the Google Cloud API, which accepts Google application credentials. Fine-grained permissions can be defined when creating the application credentials to limit access to specific buckets. Metadata and I/O performance are much better than with GCS version 1.

Basic Setup

Use the mount table operations to add a new mount point, specifying the Alluxio path to create the mount on and the GCS path as the UFS URI. Credentials and configuration options can also be passed to the mount command with the --option flag, as described in configuring mount points.

Choose your preferred GCS UFS version and provide the corresponding Google credentials.

GCS version 2

An example command to mount gs://<GCS_BUCKET>/<GCS_DIRECTORY> to /gs using GCS v2:

bin/alluxio mount add --path /gs/ --ufs-uri gs://<GCS_BUCKET>/<GCS_DIRECTORY> \
  --option fs.gcs.credential.path=/path/to/<google_application_credentials>.json

The property key fs.gcs.credential.path provides the path to the Google application credentials JSON file. The credentials file must be placed at the same path on every Alluxio node. If the nodes running the Alluxio processes already contain the GCS credentials, this property may not be needed, but setting it explicitly is always recommended.
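Because the credentials file has to exist at the same path on every node, a quick check before mounting can save a failed mount. The following is a minimal sketch, not part of Alluxio: the function name check_gcs_credentials is hypothetical, and the test for a "type" field is only a shallow sanity check that relies on Google credential files carrying that field.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not an Alluxio command): confirm that the
# credentials file is readable and looks like a Google credentials JSON file
# before passing its path to fs.gcs.credential.path.
check_gcs_credentials() {
  local cred_path="$1"
  if [ ! -r "$cred_path" ]; then
    echo "missing or unreadable: $cred_path" >&2
    return 1
  fi
  # Shallow check only: Google credential files contain a "type" field.
  if ! grep -q '"type"' "$cred_path"; then
    echo "no \"type\" field found in: $cred_path" >&2
    return 1
  fi
  echo "ok: $cred_path"
}
```

Running check_gcs_credentials with the credentials path on each Alluxio node before issuing the mount command would confirm that the path is consistent across the cluster.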

GCS version 1

An example command to mount gs://<GCS_BUCKET>/<GCS_DIRECTORY> to /gs using GCS v1:

bin/alluxio mount add --path /gs/ --ufs-uri gs://<GCS_BUCKET>/<GCS_DIRECTORY> \
  --option alluxio.underfs.gcs.version=1 --option fs.gcs.accessKeyId=<GCS_ACCESS_KEY_ID> \
  --option fs.gcs.secretAccessKey=<GCS_SECRET_ACCESS_KEY>

  • The first property key, alluxio.underfs.gcs.version=1, tells Alluxio to load the version 1 GCS UFS module, which uses the jets3t library.

  • Replace <GCS_ACCESS_KEY_ID> and <GCS_SECRET_ACCESS_KEY> with your actual GCS interoperable storage access keys, or with environment variables that contain your credentials.

Note: GCS interoperability is disabled by default. To enable it, open the Interoperability tab in the GCS settings and turn the feature on. Then click Create a new key to get the access key and secret pair.

Advanced Setup

Customize the Directory Suffix

Directories are represented in GCS as zero-byte objects named with a specified suffix. The directory suffix can be updated with the configuration parameter alluxio.underfs.gcs.directory.suffix.
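To make the naming concrete, the sketch below derives the object name that would mark a directory for a given suffix. It is an illustration of the sentence above, not Alluxio's implementation: the function name directory_marker_key and both example suffixes ("/" and "_dir") are assumptions chosen for the example.

```shell
#!/usr/bin/env bash
# Illustration only: compute the name of the zero-byte object that would
# represent a directory, by appending the configured suffix to its path.
directory_marker_key() {
  local dir_path="$1" suffix="$2"
  # Drop one trailing slash from the path so the suffix is not doubled.
  echo "${dir_path%/}${suffix}"
}

# Example suffix values; both are assumptions for illustration.
directory_marker_key "data/logs" "/"
directory_marker_key "data/logs" "_dir"
```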

GCS Access Control

If Alluxio security is enabled, Alluxio enforces the access control inherited from the underlying object storage.

The GCS credentials specified in the Alluxio configuration represent a GCS user. The GCS service backend checks that user's permissions on the bucket and the object for access control. If the given GCS user does not have access permissions to the specified bucket, a permission denied error is thrown. When Alluxio security is enabled, Alluxio loads the bucket ACL when the metadata is first loaded into the Alluxio namespace.

Mapping from GCS ACL to Alluxio permission

Alluxio checks the GCS bucket READ/WRITE ACL to determine the owner's permission mode for an Alluxio file. For example, if the GCS user has read-only access to the underlying bucket, the mounted directory and files have mode 0500. If the GCS user has full access to the underlying bucket, the mounted directory and files have mode 0700.
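The two cases above can be written as a small lookup. This is an illustration only, not Alluxio's code: the function name gcs_access_to_mode and the labels read-only and full are made up for the example, and any access level not described above is reported as unknown rather than guessed.

```shell
#!/usr/bin/env bash
# Illustration only: map the two bucket-access cases described above to the
# owner permission mode shown on the mounted files and directories.
gcs_access_to_mode() {
  case "$1" in
    read-only) echo "0500" ;;  # owner may read and traverse, not write
    full)      echo "0700" ;;  # owner may read, write, and traverse
    *)         echo "unknown access level: $1" >&2; return 1 ;;
  esac
}

gcs_access_to_mode read-only   # prints 0500
gcs_access_to_mode full        # prints 0700
```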

Mapping from GCS user to Alluxio file owner (GCS Version 1 only)

By default, Alluxio tries to extract the GCS user ID from the credentials. Optionally, alluxio.underfs.gcs.owner.id.to.username.mapping can be used to specify a static mapping from GCS owner IDs to Alluxio usernames, in the format id1=user1;id2=user2. The Google Cloud Storage IDs can be found at the console address. Please use the Owners one.
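A mapping value in the id1=user1;id2=user2 format can be assembled from individual pairs, as in the sketch below. The helper name build_owner_mapping is hypothetical, and the IDs and usernames are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: join "id=user" pairs with ";" to form a value for
# alluxio.underfs.gcs.owner.id.to.username.mapping.
build_owner_mapping() {
  # With IFS set to ";", "$*" joins all arguments using ";" as the separator.
  local IFS=';'
  echo "$*"
}

# Placeholder IDs and usernames; prints the full property assignment.
mapping="$(build_owner_mapping "id1=user1" "id2=user2")"
echo "alluxio.underfs.gcs.owner.id.to.username.mapping=${mapping}"
```

Because ";" is a command separator in most shells, the value needs to be quoted whenever it appears on a command line.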

Accessing GCS through Proxy (GCS Version 2 only)

If the Alluxio cluster is behind a corporate proxy or a firewall, the Alluxio GCS integration may not be able to access the internet with the default settings.

Add the following Java options to conf/alluxio-env.sh before starting the Alluxio coordinator and workers.

ALLUXIO_COORDINATOR_JAVA_OPTS+=" -Dhttps.proxyHost=<proxy_host> -Dhttps.proxyPort=<proxy_port> -Dhttp.proxyHost=<proxy_host> -Dhttp.proxyPort=<proxy_port> -Dhttp.nonProxyHosts=<non_proxy_host>"
ALLUXIO_WORKER_JAVA_OPTS+=" -Dhttps.proxyHost=<proxy_host> -Dhttps.proxyPort=<proxy_port> -Dhttp.proxyHost=<proxy_host> -Dhttp.proxyPort=<proxy_port> -Dhttp.nonProxyHosts=<non_proxy_host>"

An example value for http.nonProxyHosts is localhost|127.*|[::1]|192.168.0.0/16.

If the proxy requires a username and password, add the http.proxyUser, https.proxyUser, http.proxyPassword, and https.proxyPassword Java options.
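The sketch below shows one way to assemble the proxy options, including the credential options, without repeating the long string by hand. The helper name build_proxy_opts and the host, port, user, and password values are placeholders, not Alluxio defaults.

```shell
#!/usr/bin/env bash
# Sketch: build the proxy-related Java options for both the https and http
# schemes, including the username and password options.
build_proxy_opts() {
  local host="$1" port="$2" user="$3" password="$4"
  local opts="" scheme
  for scheme in https http; do
    opts+=" -D${scheme}.proxyHost=${host} -D${scheme}.proxyPort=${port}"
    opts+=" -D${scheme}.proxyUser=${user} -D${scheme}.proxyPassword=${password}"
  done
  echo "$opts"
}

# Placeholder values; replace them before using this in conf/alluxio-env.sh.
ALLUXIO_WORKER_JAVA_OPTS+="$(build_proxy_opts proxy.example.com 3128 alice secret)"
echo "$ALLUXIO_WORKER_JAVA_OPTS"
```

The same helper output can be appended to ALLUXIO_COORDINATOR_JAVA_OPTS. Since the password ends up in a configuration file in plain text, restricting read access to conf/alluxio-env.sh is worth considering.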
