Access via S3 API

Alluxio supports a RESTful API that is compatible with the basic operations of the Amazon S3 API.

The Alluxio S3 API should be used by applications that are designed to communicate with S3-like storage and that would benefit from the additional features provided by Alluxio, such as data caching, data sharing with file system based applications, and storage system abstraction (e.g., using Ceph instead of S3 as the backing store). For example, a simple application that downloads reports generated by analytic tasks can use the S3 API instead of the more complex file system API.

For a detailed description of each supported operation, refer to the S3 API Usage.

Prerequisites

To use the S3 API provided in the worker process, you need to modify conf/alluxio-site.properties to include:

alluxio.worker.s3.api.enabled=true

It is recommended to set up a load balancer to distribute the API calls among all the worker nodes. You can consider using different load balancing solutions such as DNS, Nginx, or LVS. The address of the load balancer will be used as the S3 endpoint address when configuring clients.

Limitations and Disclaimers

Access Ports

By default, once the S3 API is enabled, it can be accessed over HTTP on port 29998. After configuring HTTPS for the Alluxio S3 API, it can also be accessed over HTTPS on port 29996.

To enable only the HTTPS port, configure the following property in conf/alluxio-site.properties:

alluxio.worker.s3.only.https.access=true

Alluxio Filesystem Limitations

Only top-level Alluxio directories are treated as buckets by the S3 API; the root directory of the Alluxio filesystem is not itself treated as an S3 bucket. Any root-level objects (e.g. alluxio://file) will be inaccessible through the Alluxio S3 API.

Alluxio uses / as a reserved separator. Therefore, any S3 path containing an object or folder named / (e.g. s3://example-bucket//) will cause undefined behavior.

Also note that the Alluxio filesystem does not handle the following special characters and patterns:

  • Question mark ('?')

  • Patterns with period ('./' and '../')

  • Backslash ('\')

No Bucket Virtual Hosting

Virtual hosting of buckets is not supported. S3 clients must use path-style requests (e.g. http://s3.amazonaws.com/{bucket}/{object}) rather than virtual-hosted-style requests (http://{bucket}.s3.amazonaws.com/{object}).
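
For example, with boto3 you can force path-style addressing through the client configuration. This is a minimal sketch; the endpoint value is a placeholder for your load balancer or worker address:

import boto3
from botocore.config import Config

# Force path-style requests (http://<endpoint>/{bucket}/{object}) instead of
# virtual-hosted-style requests (http://{bucket}.<endpoint>/{object}).
s3 = boto3.client(
    "s3",
    endpoint_url="http://<LOAD_BALANCER_ADDRESS>",
    aws_access_key_id="placeholder",  # Alluxio does not validate credentials
    aws_secret_access_key="placeholder",
    config=Config(s3={"addressing_style": "path"}),
)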

S3 Writes Implicitly Overwrite

As described in the AWS S3 docs for PutObject:

Amazon S3 is a distributed system. If it receives multiple write requests for the same object simultaneously, it overwrites all but the last object written. Amazon S3 does not provide object locking; if you need this, make sure to build it into your application layer or use versioning instead.

  • Note that at the moment the Alluxio S3 API does not support object versioning

The Alluxio S3 API will overwrite an existing key, as well as the temporary directory used for multipart uploads.
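
As an illustration, the following boto3 sketch (with hypothetical bucket and key names) shows a second PutObject silently replacing the first:

# Assuming `s3` is a boto3 client pointed at the Alluxio S3 endpoint and
# "example-bucket" is a mounted top-level Alluxio directory.
s3.put_object(Bucket="example-bucket", Key="report.txt", Body=b"first version")
s3.put_object(Bucket="example-bucket", Key="report.txt", Body=b"second version")

obj = s3.get_object(Bucket="example-bucket", Key="report.txt")
print(obj["Body"].read())  # b'second version' -- the earlier write was overwritten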

Folders in ListObjects(V2)

All sub-directories in Alluxio are returned as 0-byte folders when using ListObjects(V2). This matches the behavior you would see if the AWS S3 console had been used to create every parent folder for each object.
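
For example, the following boto3 sketch (assuming a hypothetical bucket example-bucket containing objects under data/) shows sub-directories returned as 0-byte keys:

# Assuming `s3` is a boto3 client pointed at the Alluxio S3 endpoint.
resp = s3.list_objects_v2(Bucket="example-bucket", Prefix="data/")
for obj in resp.get("Contents", []):
    # Sub-directories such as "data/" are listed as keys with Size == 0.
    print(obj["Key"], obj["Size"])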

Tagging & Metadata Limits

To support the Tagging function in S3 API, you need to modify conf/alluxio-site.properties to include:

alluxio.underfs.xattr.change.enabled=true

User-defined tags on buckets & objects are limited to 10 and obey the S3 tag restrictions.

  • Set the property key alluxio.proxy.s3.tagging.restrictions.enabled=false to disable this behavior.

The maximum size for user-defined metadata in PUT-requests is 2KB by default in accordance with S3 object metadata restrictions.

  • Set the property key alluxio.proxy.s3.header.metadata.max.size to change this behavior.
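
For example, tags and user-defined metadata can be set with boto3. This is a sketch with hypothetical bucket, key, tag, and metadata values:

# Assuming `s3` is a boto3 client pointed at the Alluxio S3 endpoint.
# Attach a tag to an existing object (subject to the tag restrictions above).
s3.put_object_tagging(
    Bucket="example-bucket",
    Key="report.txt",
    Tagging={"TagSet": [{"Key": "team", "Value": "analytics"}]},
)

# User-defined metadata is sent as x-amz-meta-* headers and counts toward the
# metadata size limit above.
s3.put_object(
    Bucket="example-bucket",
    Key="report.txt",
    Body=b"data",
    Metadata={"origin": "nightly-job"},
)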

HTTP Persistent Connection

HTTP persistent connection (also called HTTP keep-alive) is the idea of using a single TCP connection to send and receive multiple HTTP requests/responses, as opposed to opening a new connection for every request/response pair.

The main advantages of persistent connections include:

  • Reduced Latency: Minimizes delay caused by frequent requests.

  • Resource Savings: Reduces server and client resource consumption through fewer connections and fewer repeated requests.

  • Real-time Capability: Enables quick transmission of the latest data.

However, long connections also have some drawbacks, such as:

  • Increased Server Pressure: Many open connections can increase the memory and CPU burden on the server.

  • Timeout Issues: Requires handling cases where connections are unresponsive for a long time to ensure the effectiveness of timeout mechanisms.

In summary, HTTP long connections are an effective technique for scenarios with high real-time requirements that also aim to save resources.

To enable HTTP long connection keep-alive for S3 API, you need to modify the conf/alluxio-site.properties file to include the following content:

alluxio.worker.s3.connection.keep.alive.enabled=true

# When the connection idle time exceeds this configuration, the connection will be closed. 0 means to turn off this function.
alluxio.worker.s3.connection.idle.max.time=0sec

Performance

Since the S3 API implementation adopts a redirection mechanism and zero-copy data transfer by default, the client is expected to support HTTP redirection. If using a client that does not support HTTP redirection, such as the Python boto3 library, configure the following property to disable redirects:

alluxio.worker.s3.redirect.enabled=false

A redirect response for a request via the S3 API interface is common because the worker serving the client's request is often not the worker that holds the data for that request. In the default case, the client receives a redirect response and directly establishes a connection with the relevant worker to fetch the data. When redirects are disabled, the worker that initially handles the request must fetch the data itself, causing the data to flow through an extra hop: from the worker holding the data to the worker handling the request, and finally back to the client.

Global request headers

Header: Authorization

Content: AWS4-HMAC-SHA256 Credential={user}/..., SignedHeaders=..., Signature=...

Description: There is currently no support for access & secret keys in the Alluxio S3 API. The only supported authentication scheme is the SIMPLE authentication type. By default, the user used to perform any operation is the user that launched the Alluxio process. Therefore this header is used exclusively to specify an Alluxio ACL username to perform an operation as. In order to remain compatible with other S3 clients, the header is still expected to follow the AWS Signature Version 4 format. When supplying an access key to an S3 client, put the intended Alluxio ACL username; the secret key is unused, so you may use any dummy value.
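
For example, to perform requests as the Alluxio ACL user alice (a hypothetical username), supply it as the access key and use any dummy value as the secret key:

import boto3

# "alice" is a hypothetical Alluxio ACL username; the secret key is ignored.
s3 = boto3.client(
    "s3",
    endpoint_url="http://<LOAD_BALANCER_ADDRESS>",
    aws_access_key_id="alice",
    aws_secret_access_key="unused",
)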

Supported S3 API Actions

The following table describes the support status for current S3 API Actions:

S3 API Action | Supported Headers | Supported Query Parameters
CopyObject | Content-Type, x-amz-copy-source, x-amz-metadata-directive, x-amz-tagging-directive, x-amz-tagging | N/A
GetObject | Range | N/A
ListObjects | N/A | delimiter, encoding-type, marker, max-keys, prefix
ListObjectsV2 | N/A | continuation-token, delimiter, encoding-type, max-keys, prefix, start-after
 | N/A | N/A
PutObject | Content-Length, Content-MD5, Content-Type, x-amz-tagging | N/A
UploadPart | Content-Length, Content-MD5 | N/A

Examples

boto3 client

The following example Python script shows how to initialize a boto3 client and test it with a list buckets request.

import boto3
from botocore.exceptions import ClientError

ALLUXIO_S3_ENDPOINT = "http://<LOAD_BALANCER_ADDRESS>"  # Alluxio's S3 API endpoint when using a load balancer to distribute requests to all workers
# ALLUXIO_S3_ENDPOINT = "http://<ALLUXIO_WORKER>:29998"  # an alternative to a load balancer is to directly connect to a worker
ACCESS_KEY = "placeholder"  # Alluxio does not validate credentials
SECRET_KEY = "placeholder"
REGION = "us-east-1"

def main():
    try:
        s3 = boto3.client(
            "s3",
            aws_access_key_id=ACCESS_KEY,
            aws_secret_access_key=SECRET_KEY,
            region_name=REGION,
            endpoint_url=ALLUXIO_S3_ENDPOINT
        )
        print("Client initialized successfully.")

        # Example: list all buckets (top-level Alluxio directories)
        response = s3.list_buckets()
        print("Buckets (Alluxio mount points):")
        for bucket in response.get("Buckets", []):
            print(f" - {bucket['Name']}")
    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    main()

This assumes boto3 is installed via pip install -r requirements.txt, with boto3 as the only entry inside requirements.txt.

pytorch

Since the pytorch client is unable to process redirect responses, explicitly disable redirects by configuring

alluxio.worker.s3.redirect.enabled=false

The following example Python script uses the S3 Connector for PyTorch to read data. It assumes a UFS has been mounted at the path /s3-mount.

# ref https://github.com/awslabs/s3-connector-for-pytorch/tree/main?tab=readme-ov-file#sample-examples

from s3torchconnector import S3MapDataset, S3IterableDataset, S3ClientConfig
import random

S3_ENDPOINT_URL = "http://<LOAD_BALANCER_ADDRESS>"  # Alluxio's S3 API endpoint when using a load balancer to distribute requests to all workers
# S3_ENDPOINT_URL = "http://<ALLUXIO_WORKER>:29998"  # an alternative to a load balancer is to directly connect to a worker
DATASET_URI="s3://s3-mount"
REGION = "us-east-1"

s3_client_config = S3ClientConfig(
  force_path_style=True,
)

iterable_dataset = S3IterableDataset.from_prefix(DATASET_URI,
  region=REGION,
  endpoint=S3_ENDPOINT_URL,
  s3client_config=s3_client_config,
)

for item in iterable_dataset:
  content = item.read()
  print(f"{item.key}:{len(content)}")

map_dataset = S3MapDataset.from_prefix(DATASET_URI,
  region=REGION,
  endpoint=S3_ENDPOINT_URL,
  s3client_config=s3_client_config,
)

# Randomly access an item in map_dataset.
item = random.choice(map_dataset)
# Learn about the bucket, key, and content of the object
bucket = item.bucket
key = item.key
content = item.read()
print(f"{bucket} {key} {len(content)}")

This assumes pytorch and related libraries are installed with pip.

$ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
$ pip install --upgrade pip
$ pip install s3torchconnector

Nvidia Triton inference server

The following steps show how to prepare a Triton model repository, server, and client. They assume the following preparation for Alluxio:

  • Alluxio is deployed in K8s

  • The Alluxio S3 endpoint is available at <LOAD_BALANCER_ADDRESS>

  • An S3 bucket named <MY_BUCKET> is mounted in Alluxio at the mount point /s3-mount

Prepare the model repository and upload it to the mounted S3 bucket.

$ kubectl run -it --rm debug-shell --image=ubuntu:22.04 --restart=Never -- sh
$ apt update -y
$ apt install -y awscli git python3 python3.10-venv wget
$ git clone -b r25.06 https://github.com/triton-inference-server/server.git
$ cd server/docs/examples
$ ./fetch_models.sh

# upload to S3. note that the "triton_model_repo" path will be used by the Triton server
$ aws s3 sync model_repository s3://<MY_BUCKET>/triton_model_repo

Create triton-server.yaml and deploy it with kubectl create -f triton-server.yaml.

apiVersion: v1
kind: Pod
metadata:
  name: triton-inference-server-s3
  labels:
    app: triton-s3
spec:
  hostNetwork: true
  containers:
    - name: triton-s3-server
      image: nvcr.io/nvidia/tritonserver:24.05-py3
      imagePullPolicy: IfNotPresent
      ports:
        - name: http
          containerPort: 8000
          protocol: TCP
        - name: grpc
          containerPort: 8001
          protocol: TCP
        - name: metrics
          containerPort: 8002
          protocol: TCP
      command: ["/opt/tritonserver/bin/tritonserver"]
      args:
        - "--model-repository=s3://<LOAD_BALANCER_ADDRESS>/s3-mount/triton_model_repo"
        - "--log-verbose=1"
        - "--log-info=true"
      readinessProbe:
        httpGet:
          path: /v2/health/ready
          port: 8000
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
      livenessProbe:
        httpGet:
          path: /v2/health/live
          port: 8000
        initialDelaySeconds: 60
        periodSeconds: 30
        timeoutSeconds: 5
        failureThreshold: 3

As part of starting the server, the model data will be read and therefore cached in Alluxio.

Create triton-client.yaml and deploy it with kubectl create -f triton-client.yaml.

apiVersion: v1
kind: Pod
metadata:
  name: triton-client
  labels:
    app: triton-s3
spec:
  hostNetwork: true
  containers:
    - image: nvcr.io/nvidia/tritonserver:24.05-py3-sdk
      imagePullPolicy: IfNotPresent
      name: tritonserver-client-test
      command: ["sleep", "infinity"]

Send a request from within the client pod:

$ kubectl exec -it triton-client -- /workspace/install/bin/image_client -u $(kubectl get pod triton-inference-server-s3 -o jsonpath='{.status.podIP}'):8000 -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
Request 0, batch size 1
Image '/workspace/images/mug.jpg':
    15.349564 (504) = COFFEE MUG
    13.227464 (968) = CUP
    10.424892 (505) = COFFEEPOT
