Amazon AWS S3
This guide describes how to configure Amazon AWS S3 as Alluxio's under storage system. Amazon S3 (Amazon Simple Storage Service) is an object storage service offering industry-leading scalability, data availability, security, and performance. For more information about Amazon S3, please read its documentation.
If you haven't already, please see Prerequisites before you get started.
In preparation for using Amazon AWS S3 with Alluxio:
Use the mount table operations to add a new mount point, specifying the Alluxio path to create the mount on and the S3 path as the UFS URI. Credentials and configuration options can also be specified as part of the mount command via the `--option` flag, as described in configuring mount points.
An example command to mount `s3://<S3_BUCKET>/<S3_DIRECTORY>` to `/s3`:
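A sketch of such a mount command (the exact CLI syntax varies across Alluxio versions, and the `--option` flags shown here are illustrative):

```shell
bin/alluxio mount add \
  --path /s3 \
  --ufs-uri s3://<S3_BUCKET>/<S3_DIRECTORY> \
  --option s3a.accessKeyId=<S3_ACCESS_KEY_ID> \
  --option s3a.secretKey=<S3_SECRET_KEY>
```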
Note that if you want to mount the root of the S3 bucket, add a trailing slash after the bucket name (e.g. `s3://S3_BUCKET/`).
For other methods of setting AWS credentials, see the credentials section in Advanced Setup.
Note that configuration options can be specified as mount options or as configuration properties in `conf/alluxio-site.properties`. The following sections describe how to set configurations as properties, but they can also be set as mount options via `--option <key>=<value>`.
Alluxio uses AWS SDK v1 by default when accessing S3 buckets. If you want to use v2, which has better memory management and higher throughput performance, add the corresponding configuration in `conf/alluxio-site.properties`.
Configure the S3 region when accessing S3 buckets to improve performance; otherwise, global S3 bucket access is enabled, which introduces extra requests. The S3 region can be set in `conf/alluxio-site.properties`:
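For example (the region value shown is illustrative):

```properties
alluxio.underfs.s3.region=us-west-1
```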
You can specify credentials in different ways, from highest to lowest priority:

1. `s3a.accessKeyId` and `s3a.secretKey` specified as mount options
2. `s3a.accessKeyId` and `s3a.secretKey` specified as Java system properties
3. `s3a.accessKeyId` and `s3a.secretKey` in `alluxio-site.properties`
4. Environment variables `AWS_ACCESS_KEY_ID` or `AWS_ACCESS_KEY` (either is acceptable) and `AWS_SECRET_ACCESS_KEY` or `AWS_SECRET_KEY` (either is acceptable) on the Alluxio servers
5. Profile file containing credentials at `~/.aws/credentials`
6. AWS Instance profile credentials, if you are using an EC2 instance
When using an AWS Instance profile as the credentials provider:

1. Create an IAM Role with access to the mounted bucket
2. Create an Instance profile as a container for the defined IAM Role
3. Launch an EC2 instance using the created profile
Note that the IAM role needs access to both the files in the bucket and the bucket itself in order to determine the bucket's owner. Automatically assigning an owner to the bucket can be avoided by setting the property `alluxio.underfs.s3.inherit.acl=false`.
See Amazon's documentation for more details.
To enable the HTTPS protocol for secure communication with S3, adding an additional layer of security for data transfers, configure the following setting in `conf/alluxio-site.properties`:
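A sketch of the corresponding property (assuming the standard Alluxio S3 UFS property names):

```properties
alluxio.underfs.s3.secure.http.enabled=true
```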
You may encrypt your data stored in S3. The encryption is only valid for data at rest in S3; data is transferred in decrypted form when read by clients. Note that enabling this also enables HTTPS to comply with requirements for reading/writing objects.
Enable this feature by configuring `conf/alluxio-site.properties`:
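A sketch of the server-side encryption property (assuming the standard Alluxio S3 UFS property names):

```properties
alluxio.underfs.s3.server.side.encryption.enabled=true
```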
By default, a request directed at the bucket named "mybucket" is sent to the host name "mybucket.s3.amazonaws.com". You can instead use path-style data access, for example "http://s3.amazonaws.com/mybucket", by disabling DNS-style buckets with the following configuration:
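A sketch of that setting (assuming the standard Alluxio S3 UFS property names):

```properties
alluxio.underfs.s3.disable.dns.buckets=true
```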
To communicate with S3 through a proxy, modify `conf/alluxio-site.properties` to include:
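A sketch of the proxy settings (assuming the standard Alluxio S3 UFS property names):

```properties
alluxio.underfs.s3.proxy.host=<PROXY_HOST>
alluxio.underfs.s3.proxy.port=<PROXY_PORT>
```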
`<PROXY_HOST>` and `<PROXY_PORT>` should be replaced by the host and port of your proxy.
If you want to access a specific region in the AWS service other than the default `us-east-1` region, modify `conf/alluxio-site.properties` to include:
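For example:

```properties
alluxio.underfs.s3.region=<S3_REGION>
```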
If you want to access a specific endpoint (like an AWS VPC endpoint) in a specific region in the AWS service, modify `conf/alluxio-site.properties` to include:
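A sketch of the endpoint settings (assuming the standard Alluxio S3 UFS property names):

```properties
alluxio.underfs.s3.endpoint=<S3_ENDPOINT>
alluxio.underfs.s3.endpoint.region=<S3_ENDPOINT_REGION>
```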
Both the endpoint and region values need to be updated to use the non-global region. After this setting, `alluxio.underfs.s3.region=<S3_REGION>` will no longer take effect.
To use an S3 service provider other than "s3.amazonaws.com", modify `conf/alluxio-site.properties` to include:
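A sketch of the settings (assuming the standard Alluxio S3 UFS property names):

```properties
alluxio.underfs.s3.endpoint=<S3_ENDPOINT>
alluxio.underfs.s3.endpoint.region=<S3_ENDPOINT_REGION>
```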
Replace `<S3_ENDPOINT>` with the hostname and port of your S3 service, e.g., `http://localhost:9000`. Only use this parameter if you are using a provider other than `s3.amazonaws.com`.
Both the endpoint and region values need to be updated to use a non-home region.
All OCI object storage regions need to use path-style access (`PathStyleAccess`).
Some S3 service providers only support v2 signatures. For these S3 providers, you can enforce v2 signatures by setting `alluxio.underfs.s3.signer.algorithm` to `S3SignerType`.
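For example:

```properties
alluxio.underfs.s3.signer.algorithm=S3SignerType
```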
Set the VPC endpoint and region for your S3 bucket configuration:
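A sketch of these settings (assuming the standard Alluxio S3 UFS property names):

```properties
alluxio.underfs.s3.endpoint=<S3_ENDPOINT>
alluxio.underfs.s3.endpoint.region=<S3_ENDPOINT_REGION>
```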
Because S3 is an object store, by default the whole file is sent from the client to the worker, stored in the local disk temporary directory, and uploaded in the `close()` method.
To enable S3 streaming upload, modify `conf/alluxio-site.properties` to include:
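The toggle is the streaming-upload property:

```properties
alluxio.underfs.s3.streaming.upload.enabled=true
```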
The default upload process is safer but has the following issues:

- Slow upload time. The file has to be sent to the Alluxio worker first, and then the worker uploads the file to S3. The two processes are sequential.
- The temporary directory must have the capacity to store the whole file.
- Slow `close()`. The execution time of the `close()` method is proportional to the file size and inversely proportional to the bandwidth, i.e. O(FILE_SIZE/BANDWIDTH).
The S3 streaming upload feature addresses the above issues and is based on the S3 low-level multipart upload.
The S3 streaming upload has the following advantages:
- Shorter upload time. The Alluxio worker uploads buffered data while receiving new data, so the total upload time is no longer than with the default method.
- Smaller capacity requirement. Data is buffered and uploaded in partitions (`alluxio.underfs.s3.streaming.upload.partition.size`, 64MB by default); once a partition is successfully uploaded, it is deleted.
- Faster `close()`. Uploading begins when buffered data reaches the partition size instead of uploading the whole file in `close()`.
If an S3 streaming upload is interrupted, there may be intermediate partitions uploaded to S3, and S3 will charge for the stored data. To reduce the charges, users can modify `conf/alluxio-site.properties` to include:
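A sketch of the cleanup settings (property names taken from the description below; the values shown are illustrative):

```properties
alluxio.underfs.s3.intermediate.upload.clean.age=3day
alluxio.underfs.cleanup.interval=1day
```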
Intermediate multipart uploads in all non-readonly S3 mount points older than the clean age (configured by `alluxio.underfs.s3.intermediate.upload.clean.age`) will be cleaned when the cleanup interval (configured by `alluxio.underfs.cleanup.interval`) is reached.
The default upload method uploads one file completely from start to end in one go. The multipart upload method instead uploads a file in multiple parts, each part uploaded in its own thread. It does not generate any temporary files while uploading; it consumes more memory but is faster than streaming upload mode.
To enable S3 multipart upload, modify `conf/alluxio-site.properties` to include:
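A sketch of the toggle (assuming the standard Alluxio S3 UFS property names):

```properties
alluxio.underfs.s3.multipart.upload.enabled=true
```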
There are other parameters you can specify in `conf/alluxio-site.properties` to tune the process.
If the S3 connection is slow, a larger timeout is useful:
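A sketch of larger timeout values (the property names and values here are assumptions and may differ by Alluxio version):

```properties
alluxio.underfs.s3.socket.timeout=500sec
alluxio.underfs.s3.request.timeout=5min
```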
When accessing S3 through Alluxio with a large number of clients per Alluxio server, it is important to increase the S3 connection pool size to avoid performance issues. If the connection pool size is too small, it may result in S3 request failures with errors such as "Unable to execute HTTP request: Timeout waiting for connection from pool" due to high contention for available connections. Increase the pool size to ensure smooth communication and optimal performance by setting:
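One commonly tuned property for this (assuming the standard Alluxio S3 UFS properties; the value shown is illustrative):

```properties
alluxio.underfs.s3.threads.max=80
```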
S3 identity and access management is very different from the traditional POSIX permission model. For instance, S3 ACLs do not support groups or directory-level settings. Alluxio makes a best effort to inherit permission information, including file owner, group, and permission mode, from S3 ACL information.
The S3 credentials set in Alluxio configuration correspond to an AWS user. If this user does not have the required permissions to access an S3 bucket or object, a 403 permission denied error will be returned.
If you see a 403 error in the Alluxio server log when accessing an S3 service, double-check that:
You are using the correct AWS credentials. See credentials setup.
Your AWS user has permissions to access the buckets and objects mounted to Alluxio.
Read AWS's troubleshooting guidance for more on 403 errors.
Alluxio file system sets the file owner based on the AWS account configured in Alluxio to connect to S3. Since there is no group in S3 ACL, the owner is reused as the group.
By default, Alluxio extracts the display name of this AWS account as the file owner. In case this display name is not available, this AWS user's canonical user ID will be used. This canonical user ID is typically a long string (like `79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be`), and thus often inconvenient to read and use in practice. Optionally, the property `alluxio.underfs.s3.owner.id.to.username.mapping` can be used to specify a preset mapping from canonical user IDs to Alluxio usernames, in the format "id1=user1;id2=user2". For example, edit `alluxio-site.properties` to include:
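Using the canonical user ID and username from this example:

```properties
alluxio.underfs.s3.owner.id.to.username.mapping=79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be=john
```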
This configuration helps Alluxio recognize all objects owned by this AWS account as owned by the user `john` in the Alluxio namespace. To find the AWS S3 canonical ID of your account, open the console at https://console.aws.amazon.com/iam/home?#/security_credentials, expand the "Account Identifiers" tab, and refer to "Canonical User ID".
`chown`, `chgrp`, and `chmod` of Alluxio directories and files do NOT propagate to the underlying S3 buckets or objects.
Alluxio supports authentication via the [AWS AssumeRole API](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) to connect to AWS S3. When AssumeRole is enabled, the AWS access key and secret key are only used to obtain temporary security credentials. All subsequent accesses use these temporary credentials, which are generated through AssumeRole.
To enable AssumeRole in Alluxio, the following properties are required on workers and coordinators:
Note: Ensure the specified role exists, and the user associated with the provided access key and secret key has permission to assume the role defined by the target role ARN.
In addition to the mandatory properties, you can also configure the following optional settings for greater control over session behavior and network configurations:
Note: If the proxy host and port are not set in the Alluxio configuration, the JVM/system environment variables `HTTP(S)_PROXY`, `http(s)_proxy`, `http(s).proxyHost`, and `http(s).proxyPort` will automatically be picked up by the AWS SDK.
Below is a sample configuration for setting up AssumeRole in Alluxio:
Summary:
Temporary Credentials: AWS access keys are only used to request temporary credentials; all future operations rely on those credentials.
Automatic Session Refresh: Sessions are automatically refreshed by the AWS SDK, requiring no manual intervention.
Customizable Configuration: You can modify session duration, proxy settings, and session prefixes to suit your security and environment needs.
By setting up these properties, Alluxio can effectively authenticate and manage access to AWS S3 using the temporary credentials obtained via AssumeRole.
If issues are encountered when running against your S3 backend, enable additional logging to track HTTP traffic. Modify `conf/log4j2.xml` to add the following loggers:
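A sketch of the logger entries, added inside the `<Loggers>` section (the logger names follow the AWS SDK v1 and Apache HttpClient packages; adjust levels as needed):

```xml
<Logger name="com.amazonaws" level="WARN"/>
<Logger name="com.amazonaws.request" level="DEBUG"/>
<Logger name="org.apache.http.wire" level="DEBUG"/>
```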
See Amazon's documentation for more details.
Alluxio may create zero-byte files in S3 as a performance optimization when listing the contents of the underlying storage. If a bucket is mounted with read-only access, zero-byte file creation via the S3 PUT operation is disallowed. To disable this optimization, set the following configuration.
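A sketch of the setting (assuming the standard Alluxio object store property name for these "breadcrumb" files):

```properties
alluxio.underfs.object.store.breadcrumbs.enabled=false
```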
| Placeholder | Description |
|---|---|
| `<S3_BUCKET>` | Create a new S3 bucket or use an existing bucket |
| `<S3_DIRECTORY>` | The directory you want to use in that bucket, either by creating a new directory or using an existing one |
| `<S3_ACCESS_KEY_ID>` | Used to sign programmatic requests made to AWS. See How to Obtain Access Key ID and Secret Access Key |
| `<S3_SECRET_KEY>` | Used to sign programmatic requests made to AWS. See How to Obtain Access Key ID and Secret Access Key |