Google GCS

This guide describes how to configure Alluxio with Google Cloud Storage (GCS) as the under storage system.

Google Cloud Storage (GCS) is a scalable and durable object storage service offered by Google Cloud Platform (GCP). It allows users to store and retrieve various types of data, including unstructured and structured data.

For more information about GCS, please read its documentation.

Prerequisites

Before you get started, please ensure you have the required information listed below:

<GCS_DIRECTORY>: The directory you want to use in the bucket, either by creating a new directory or using an existing one.

The default GCS UFS module (GCS version 2) is built on the Google Cloud API, which accepts Google application credentials. Fine-grained permissions can be defined when creating the application credentials to limit access to specific buckets.

Basic Setup

For the general mount mechanism and UnderFileSystem CR field reference, see Underlying Storage.

An example ufs.yaml to create a GCS mount point with the operator:

apiVersion: k8s-operator.alluxio.com/v1
kind: UnderFileSystem
metadata:
  name: alluxio-gs
  namespace: alx-ns
spec:
  alluxioCluster: alluxio-cluster
  path: gs://<GS_BUCKET>/<PATH>
  mountPath: /gs
  mountOptions:
    fs.gcs.credential.path: /path/to/<google_application_credentials>.json

The credentials file should be provided as a secret. See how to add a secret as a file.

Advanced Setup

Customize the Directory Suffix

Directories are represented in GCS as zero-byte objects named with a specified suffix. The directory suffix can be updated with the configuration parameter alluxio.underfs.gcs.directory.suffix.
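To make the representation concrete, here is a minimal sketch of how a directory could map to a marker object. The helper name `directory_marker_key` is hypothetical, the in-memory dict only stands in for a bucket, and the default suffix of `/` is an assumption; check the value of alluxio.underfs.gcs.directory.suffix in your deployment.

```python
def directory_marker_key(directory: str, suffix: str = "/") -> str:
    """Return the key of the zero-byte object that stands for `directory`.

    The default suffix "/" is an assumption for illustration only.
    """
    # Normalize leading/trailing slashes so the suffix is appended exactly once.
    trimmed = directory.strip("/")
    return trimmed + suffix


# A plain dict stands in for the bucket: key -> object bytes.
bucket = {}
bucket[directory_marker_key("data/2024")] = b""  # zero-byte directory marker
```

With a different suffix, the same directory is simply stored under a different marker key, which is why the suffix must match whatever other tools writing to the bucket expect.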

GCS multipart upload

Alluxio uploads large files in parallel parts to improve throughput. Because the GCS Java client library does not expose a native multipart upload API, parts are uploaded as temporary objects and merged into the final object using GCS object composition. The final object is then created atomically via a server-side rename, so concurrent readers never see a partially written key. Temporary objects are deleted after a successful upload or on abort.
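The flow above can be sketched as an in-memory simulation. This is not Alluxio's implementation: a dict stands in for the bucket, the function name and the temporary key layout are hypothetical (only the `__mpu_` marker is taken from this guide), and parts are written sequentially rather than in parallel. The batch size of 32 reflects the GCS limit on source objects per compose request.

```python
from typing import Dict, List

MAX_COMPOSE_SOURCES = 32  # GCS compose accepts at most 32 sources per request


def simulated_multipart_upload(bucket: Dict[str, bytes], key: str,
                               data: bytes, part_size: int) -> None:
    """Upload `data` to `key` via temporary parts, composition, and rename."""
    upload_id = "0001"  # hypothetical upload identifier
    temp_keys: List[str] = []

    # 1. Upload each part as a temporary object.
    current: List[str] = []
    for index, offset in enumerate(range(0, len(data), part_size)):
        part_key = f"{key}__mpu_{upload_id}_part{index:05d}"
        bucket[part_key] = data[offset:offset + part_size]
        current.append(part_key)
        temp_keys.append(part_key)

    # 2. Compose in rounds of at most 32 sources until one round suffices.
    round_no = 0
    while len(current) > MAX_COMPOSE_SOURCES:
        merged: List[str] = []
        for start in range(0, len(current), MAX_COMPOSE_SOURCES):
            batch = current[start:start + MAX_COMPOSE_SOURCES]
            inter_key = f"{key}__mpu_{upload_id}_r{round_no}_{len(merged):05d}"
            bucket[inter_key] = b"".join(bucket[k] for k in batch)
            merged.append(inter_key)
            temp_keys.append(inter_key)
        current = merged
        round_no += 1

    staged_key = f"{key}__mpu_{upload_id}_composed"
    bucket[staged_key] = b"".join(bucket[k] for k in current)

    # 3. Rename the staged object to the final key in a single step.
    bucket[key] = bucket.pop(staged_key)

    # 4. Delete the temporary objects.
    for temp_key in temp_keys:
        del bucket[temp_key]
```

Readers that look up `key` either find nothing or find the complete object, because the final key only appears in step 3.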

Multipart upload is enabled by default. Files smaller than one partition (minimum 5 MiB) are always uploaded as a single object and are not affected by this setting.
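The size rule can be illustrated with a small planning helper. This is a sketch under stated assumptions: `plan_upload` is a hypothetical name, and it reads "minimum 5 MiB" as the smallest allowed partition size; the actual property names and defaults are not shown here.

```python
from typing import List, Tuple

MIN_PARTITION_SIZE = 5 * 1024 * 1024  # 5 MiB, the minimum stated in this guide


def plan_upload(file_size: int,
                partition_size: int = MIN_PARTITION_SIZE
                ) -> Tuple[str, List[Tuple[int, int]]]:
    """Return the upload mode and the (offset, length) ranges to upload."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    if partition_size < MIN_PARTITION_SIZE:
        raise ValueError("partition_size must be at least 5 MiB")

    # Files smaller than one partition go up as a single object.
    if file_size < partition_size:
        return "single", [(0, file_size)]

    # Otherwise split into fixed-size parts; the last part may be shorter.
    parts = [(offset, min(partition_size, file_size - offset))
             for offset in range(0, file_size, partition_size)]
    return "multipart", parts
```

For example, a 12 MiB file with 5 MiB partitions yields three parts of 5 MiB, 5 MiB, and 2 MiB.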

To tune multipart upload behavior, set the following properties.

Add the following to your AlluxioCluster spec:

If an Alluxio worker crashes mid-upload before the abort and cleanup can run, temporary objects with __mpu_ in their key may remain in the bucket. These can be removed manually by key prefix or automatically with a GCS Object Lifecycle rule targeting keys that match *__mpu_*.
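For manual cleanup, a helper along these lines could select the leftover keys from an object listing. It is a sketch, not a supported tool: `stale_mpu_keys` is a hypothetical name, the input is assumed to be (key, last-updated time) pairs obtained from whatever listing tool you use, and the one-day minimum age is an assumed safety margin so that uploads still in progress are not touched.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple


def stale_mpu_keys(objects: Iterable[Tuple[str, datetime]],
                   now: datetime,
                   min_age: timedelta = timedelta(days=1)) -> List[str]:
    """Return keys containing "__mpu_" that are at least `min_age` old."""
    return [key for key, updated in objects
            if "__mpu_" in key and now - updated >= min_age]


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
listing = [
    ("data/file.bin", now - timedelta(days=30)),                # regular object
    ("data/big.bin__mpu_0001_part00000", now - timedelta(days=3)),   # leftover part
    ("data/new.bin__mpu_0002_part00000", now - timedelta(minutes=5)),  # possibly in progress
]
to_delete = stale_mpu_keys(listing, now)
```

Review the selected keys before deleting them, since any regular object whose name happens to contain `__mpu_` would also match.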

GCS Access Control

If Alluxio security is enabled, Alluxio enforces the access control inherited from underlying object storage.

The GCS credentials specified in the Alluxio configuration represent a GCS user. The GCS service backend checks the user's permissions on the bucket and the object for access control. If the given GCS user does not have access permissions to the specified bucket, a permission denied error is thrown. When Alluxio security is enabled, Alluxio loads the bucket ACL when the metadata is first loaded into the Alluxio namespace.

Mapping from GCS ACL to Alluxio permission

Alluxio checks the GCS bucket READ/WRITE ACL to determine the owner's permission mode for an Alluxio file. For example, if the GCS user has read-only access to the underlying bucket, the mounted directory and files would have 0500 mode. If the GCS user has full access to the underlying bucket, the mounted directory and files would have 0700 mode.
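The two cases above are consistent with a simple bitwise mapping, sketched here. The function name is hypothetical, and treating write access as adding the owner write bit (0200) is an assumption inferred from the 0500 and 0700 examples, not a statement about Alluxio's code.

```python
def owner_mode_from_bucket_acl(can_read: bool, can_write: bool) -> int:
    """Map bucket READ/WRITE access to an owner-only permission mode."""
    mode = 0
    if can_read:
        mode |= 0o500  # owner read + execute
    if can_write:
        mode |= 0o200  # owner write (assumed from the 0700 example)
    return mode


read_only = format(owner_mode_from_bucket_acl(True, False), "04o")
full_access = format(owner_mode_from_bucket_acl(True, True), "04o")
```

Group and other bits stay at zero in both cases, because the mapping only describes the GCS user that the configured credentials represent.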

Accessing GCS through Proxy

If the Alluxio cluster is behind a corporate proxy or a firewall, the Alluxio GCS integration may not be able to access the internet with the default settings.

Add the following Java options to conf/alluxio-env.sh before starting the Alluxio coordinator and workers.

An example value for http.nonProxyHosts is localhost|127.*|[::1]|192.168.0.0/16.

If a username and password are required for the proxy, add the http.proxyUser, https.proxyUser, http.proxyPassword, and https.proxyPassword Java options.
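As an illustration of how these options fit together, the sketch below assembles the JVM flags as one string. The helper name `proxy_java_opts` and the host name `proxy.example.com` are hypothetical; http.proxyHost, http.proxyPort, https.proxyHost, https.proxyPort, and http.nonProxyHosts are standard Java networking properties, and the user and password properties are the ones named in this guide.

```python
from typing import List, Optional


def proxy_java_opts(host: str, port: int,
                    non_proxy_hosts: Optional[List[str]] = None,
                    user: Optional[str] = None,
                    password: Optional[str] = None) -> str:
    """Build the -D flags for an HTTP/HTTPS proxy as a single string."""
    opts: List[str] = []
    for scheme in ("http", "https"):
        opts.append(f"-D{scheme}.proxyHost={host}")
        opts.append(f"-D{scheme}.proxyPort={port}")
        # Credentials are only added when both values are provided.
        if user is not None and password is not None:
            opts.append(f"-D{scheme}.proxyUser={user}")
            opts.append(f"-D{scheme}.proxyPassword={password}")
    if non_proxy_hosts:
        # Java reads http.nonProxyHosts for both HTTP and HTTPS traffic.
        opts.append("-Dhttp.nonProxyHosts=" + "|".join(non_proxy_hosts))
    return " ".join(opts)


example = proxy_java_opts("proxy.example.com", 8080, ["localhost", "127.*"])
```

The resulting string would be appended, inside double quotes because of the `|` separators, to the coordinator and worker JVM option variables in conf/alluxio-env.sh.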
