# Alluxio Python Filesystem API based on FSSpec

{% hint style="warning" %}
Experimental feature
{% endhint %}

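The Alluxio FSSpec Python API (`alluxiofs`) lets applications interact with a variety of storage backends through a unified Python filesystem interface. It uses a high-performance distributed caching layer, the Alluxio server, to significantly speed up data access and reduce latency. This is especially useful for data-intensive applications and workflows, in particular AI training workloads that need fast, repeated access to large datasets.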

## Prerequisites

* A running Alluxio server that uses ETCD as its membership manager
* Python 3.8, 3.9, or 3.10
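
The Python version requirement can be checked before installing anything. The snippet below is a minimal sketch, not part of `alluxiofs`; `SUPPORTED_VERSIONS` and `is_supported_python` are hypothetical names introduced here for illustration.

```python
import sys

# Versions listed in the prerequisites of this page.
SUPPORTED_VERSIONS = {(3, 8), (3, 9), (3, 10)}


def is_supported_python(version_info=None):
    """Return True if the (major, minor) version is one of the listed versions."""
    info = sys.version_info if version_info is None else version_info
    return (info[0], info[1]) in SUPPORTED_VERSIONS


# Report whether the running interpreter is in the listed range.
print("supported interpreter:", is_supported_python())
```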

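## Installation

### Install the storage backend

`alluxiofs` acts as a caching layer on top of an existing connection to the underlying data lake storage, so the fsspec implementation for that underlying storage must be installed.

Connecting to an existing underlying storage has three requirements:

1. The Alluxio server and the `alluxiofs` client both connect to the same underlying storage.
2. Underlying storages fall into two groups: fsspec [built-in implementations](https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations) and [third-party implementations](https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations). For a third-party storage, install the corresponding fsspec package. Built-in storages need no additional Python libraries.
3. Configure the underlying storage, in particular its connection settings.

Example: with S3 deployed as the underlying data lake storage, [install the third-party S3 storage](https://s3fs.readthedocs.io/en/latest/):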

```
pip install s3fs
```

### Install `alluxiofs`

```
pip install alluxiofs
```

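## Load data into the Alluxio server

Skip this step if the data is already loaded into the Alluxio server.

This simple example creates a client that connects to an Alluxio cluster that uses a local ETCD membership manager and S3 as the underlying storage. For more configuration settings, see [advanced parameters for connecting to the Alluxio server](#连接到Alluxio服务器的参数).

Submit a distributed load job to the Alluxio server: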

```
from alluxiofs.client import AlluxioClient

alluxio_client = AlluxioClient(etcd_hosts="localhost")
alluxio_client.submit_load("s3://bucket/path/to/dataset/")
```

This triggers the load job asynchronously. You can wait for the load to finish, or check its progress with the following call:

```
alluxio_client.load_progress("s3://bucket/path/to/dataset/")
```
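
Because the load runs asynchronously, a script that depends on the cached data may want to poll until the job is done. The helper below is a generic polling sketch, not part of `alluxiofs`: `wait_for` is a hypothetical name, and it deliberately takes a callable because the shape of the value returned by `load_progress` is not described on this page.

```python
import time


def wait_for(check_done, timeout_s=600.0, interval_s=5.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call check_done() until it returns True or timeout_s elapses.

    Returns True if check_done() reported completion, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if check_done():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Hypothetical usage: load_is_finished is a placeholder you would write
# yourself after inspecting what load_progress() returns in your version.
# wait_for(lambda: load_is_finished(alluxio_client.load_progress(path)))
```

Passing the check as a callable keeps the waiting logic independent of how progress is reported, so only the small predicate needs to change between versions.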

Cancel the distributed load job:

```
alluxio_client.stop_load("s3://bucket/path/to/dataset/")
```

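## Create the `alluxiofs` Alluxio Python API

This simple example creates a filesystem handler that connects to an Alluxio cluster that uses an ETCD membership service on localhost and S3 as the underlying storage.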

```
import fsspec
from alluxiofs import AlluxioFileSystem

# Register Alluxio to fsspec
fsspec.register_implementation("alluxiofs", AlluxioFileSystem, clobber=True)

# Create Alluxio filesystem
alluxio_fs = fsspec.filesystem("alluxiofs", etcd_hosts="localhost", target_protocol="s3")
```

For advanced parameters for connecting to the Alluxio server and/or the storage system, see [advanced initialization parameters](#高级初始化参数).

## Example: `alluxiofs` Hello World

```
# List files
contents = alluxio_fs.ls("s3://bucket/path/to/dataset/", detail=True)

# Read files
with alluxio_fs.open("s3://bucket/path/to/dataset/file1.parquet", "rb") as f:
    data = f.read()
```
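
Reading a whole file with `f.read()` keeps the entire object in memory. For large files, reading in fixed-size chunks is a common alternative. The sketch below assumes only that the opened file behaves like a standard binary file object; `iter_chunks` is a hypothetical helper, and `io.BytesIO` stands in for a file opened with `alluxio_fs.open(path, "rb")` so the example runs without a cluster.

```python
import io


def iter_chunks(fileobj, chunk_size=8 * 1024 * 1024):
    """Yield successive byte chunks from a binary file-like object."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Stand-in for: with alluxio_fs.open(path, "rb") as f: ...
demo = io.BytesIO(b"abcdefghij")
sizes = [len(chunk) for chunk in iter_chunks(demo, chunk_size=4)]
print(sizes)  # [4, 4, 2]
```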

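More Python filesystem operation examples are available in the [fsspec usage documentation](https://filesystem-spec.readthedocs.io/en/latest/usage.html#use-a-file-system).

## Example: Ray

Ray is a fast and simple framework for building and running distributed applications. PyTorch, TensorFlow, and XGBoost trainers running on top of Ray can take advantage of Ray's advanced features, such as creating heterogeneous clusters that use CPU machines for data loading and preprocessing and GPU machines for training. Ray Data can parallelize data loading, preprocessing, and training.

Trainers such as PyTorch read the same dataset repeatedly in every epoch. For PyTorch, fetching a large dataset in every epoch has become a training bottleneck. With Alluxio's high-performance distributed cache, trainers on Ray can reduce total training time, improve GPU utilization, and shorten the end-to-end model lifecycle.

Prerequisite: Ray version >= 2.8.2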

```
import ray

# Pass the initialized Alluxio filesystem to Ray and read the dataset
ds = ray.data.read_parquet("s3://bucket/path/to/dataset/file1.parquet", filesystem=alluxio_fs)

# Get a count of the number of records in the single file
ds.count()

# Display the schema of the dataset
ds.schema()

# Display the first record
ds.take(1)

# Display the first two records
ds.take(2)

# Read multiple files:
ds2 = ray.data.read_parquet("s3://bucket/path/to/dataset/", filesystem=alluxio_fs)

# Get a count of the number of records in the files
ds2.count()
```

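## Example: PyArrow

PyArrow provides a high-performance in-memory columnar format that connects applications and data seamlessly, enabling efficient data exchange between different data processing systems. By delegating its storage interface to fsspec, PyArrow can access a variety of storage backends through a unified interface. With `alluxiofs`, PyArrow can use Alluxio's distributed caching to speed up data access and reduce latency.

Example 1: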

```
# Pass the initialized Alluxio filesystem to PyArrow and read the dataset from the example Parquet file
import pyarrow.dataset as ds
dataset = ds.dataset("s3://bucket/path/to/dataset/file1.parquet", filesystem=alluxio_fs)

# Get a count of the number of records in the file
dataset.count_rows()

# Display the schema of the dataset
dataset.schema

# Display the first record (take() expects a list of row indices)
dataset.take([0])
```

Example 2:

```
from pyarrow.fs import PyFileSystem, FSSpecHandler

# Create a Python-based PyArrow filesystem using FSSpecHandler
py_fs = PyFileSystem(FSSpecHandler(alluxio_fs))

# Read the data by using the PyArrow filesystem interface
with py_fs.open_input_file("s3://bucket/path/to/dataset/file1.parquet") as f:
    f.read()
```

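## Advanced initialization parameters

### Parameters for connecting to the Alluxio server

* **etcd\_hosts** (str, required): A comma-separated list of ETCD server hosts in the format "host1:port1,host2:port2,...". ETCD is used to discover Alluxio workers dynamically.
* **etcd\_port** (int, optional): The port number used by each ETCD server. Defaults to 2379.
* **options** (dict, optional): A dictionary of Alluxio configuration options, where keys are property names and values are property values. These options configure the behavior of the Alluxio client.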
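
When the ETCD endpoints are held as structured data, the `etcd_hosts` string can be assembled programmatically. The following sketch only illustrates the "host1:port1,host2:port2,..." format described above; `format_etcd_hosts` is a hypothetical helper, and the host names are placeholders.

```python
def format_etcd_hosts(endpoints):
    """Join (host, port) pairs into a "host1:port1,host2:port2" string."""
    parts = []
    for host, port in endpoints:
        if not host or "," in host:
            raise ValueError(f"invalid host name: {host!r}")
        port = int(port)
        if not 0 < port < 65536:
            raise ValueError(f"invalid port for {host!r}: {port}")
        parts.append(f"{host}:{port}")
    if not parts:
        raise ValueError("at least one ETCD endpoint is required")
    return ",".join(parts)


# Placeholder host names; 2379 is the default port documented above.
etcd_hosts = format_etcd_hosts([("etcd-1", 2379), ("etcd-2", 2379)])
print(etcd_hosts)  # etcd-1:2379,etcd-2:2379
```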

Example: configure the Alluxio page size (note that **the Alluxio page size must be exactly the same as the one configured on the Alluxio server**):

```
alluxio_options = {}
alluxio_options["alluxio.worker.page.store.page.size"] = "20MB"
```
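
Because the client-side page size has to match the server's value exactly, it can help to compare the two settings before creating the filesystem. The following is a minimal sketch under two assumptions: sizes are written as an integer plus an optional unit, and units are binary multiples (1MB = 1024 * 1024 bytes). `size_to_bytes` and `page_sizes_match` are hypothetical helpers, not part of `alluxiofs`, and the server value would have to come from your own cluster configuration.

```python
import re

# Assumed unit table: binary multiples.
_UNITS = {"": 1, "B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def size_to_bytes(text):
    """Convert a size string such as "20MB" into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    unit = unit.upper()
    if unit not in _UNITS:
        raise ValueError(f"unrecognized unit in {text!r}")
    return int(number) * _UNITS[unit]


def page_sizes_match(client_value, server_value):
    """Return True if both size strings denote the same number of bytes."""
    return size_to_bytes(client_value) == size_to_bytes(server_value)


print(page_sizes_match("20MB", "20MB"))
```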

(Optional) Initialize `alluxio_client` for distributed load jobs:

```
alluxio_client = AlluxioClient(etcd_hosts="host1,host2,host3", etcd_port=8888)
```

Initialize `alluxio_fs` for fsspec filesystem operations:

```
alluxio_fs = fsspec.filesystem("alluxiofs", etcd_hosts="localhost", target_protocol="s3", options=alluxio_options)
```

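### Parameters for the storage backend

Parameters:

* **target\_protocol** (str, optional): The protocol of the underlying storage for which to create the underlying filesystem object. Common examples include `s3` for Amazon S3 and `hdfs` for the Hadoop Distributed File System.
* **target\_options** (dict, optional): A set of configuration options related to `target_protocol`. These may include credentials, endpoint URLs, and other protocol-specific settings needed to interact with the underlying storage system.
* **fs** (object, optional): A filesystem object instance, provided directly, that is used to access the underlying storage of Alluxio.

#### Example: Connect to S3

To connect to S3, follow these steps: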

1. Review the S3 fsspec documentation: refer to the [s3fs documentation](https://s3fs.readthedocs.io/en/latest/) to find the supported arguments for connecting to S3. Typical arguments include:

   * **anon** bool (False): Whether to use an anonymous connection (public buckets only). If False, uses the key/secret given, or boto's credential resolver (client\_kwargs, environment variables, config files, EC2 IAM server, in that order).
   * **endpoint\_url** string (None): Use this endpoint\_url, if specified. Needed for connecting to non-AWS S3 buckets. Takes precedence over `endpoint_url` in client\_kwargs.
   * **key** string (None): If not anonymous, use this access key ID, if specified. Takes precedence over `aws_access_key_id` in client\_kwargs.
   * **secret** string (None): If not anonymous, use this secret access key, if specified. Takes precedence over `aws_secret_access_key` in client\_kwargs.
   * **token** string (None): If not anonymous, use this security token, if specified.

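2. Pass the supported arguments to Alluxio as `target_options`: you can then use these arguments to create the Alluxio filesystem object with fsspec.

The following example shows how to use fsspec to create an Alluxio filesystem object connected to S3: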

```
import fsspec

# Example configuration for connecting to S3
s3_options = {
    "key": "your-aws-access-key",
    "secret": "your-aws-secret-key",
    "endpoint_url": "https://s3.your-endpoint.com"
}

# Creating the Alluxio file system object
alluxio_fs = fsspec.filesystem(
    "alluxiofs",
    etcd_hosts="localhost",
    target_protocol="s3",
    target_options=s3_options
)

# Now you can use alluxio_fs to interact with the Alluxio file system backed by S3
```

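In this example:

* Replace `your-aws-access-key` and `your-aws-secret-key` with your actual AWS credentials.
* If needed, replace `https://s3.your-endpoint.com` with the appropriate endpoint URL for your S3-compatible service.

By following these steps, you can use fsspec to connect to Alluxio with an S3 backend.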
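
To avoid hardcoding credentials in source files, the `target_options` dictionary can be built from environment variables. This is a sketch under stated assumptions: `s3_options_from_env` is a hypothetical helper, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` are the conventional AWS variable names, and `S3_ENDPOINT_URL` is a variable name invented here for illustration.

```python
import os


def s3_options_from_env(environ=None):
    """Build an s3fs-style options dict from environment variables."""
    env = os.environ if environ is None else environ
    options = {}
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    # Only set key/secret when both are present; otherwise leave them out
    # so the credential resolver mentioned above can take over.
    if key and secret:
        options["key"] = key
        options["secret"] = secret
    token = env.get("AWS_SESSION_TOKEN")
    if token:
        options["token"] = token
    endpoint = env.get("S3_ENDPOINT_URL")  # hypothetical variable name
    if endpoint:
        options["endpoint_url"] = endpoint
    return options


# The result would be passed as target_options=s3_options_from_env()
# when creating the "alluxiofs" filesystem with fsspec.
print(sorted(s3_options_from_env({"AWS_ACCESS_KEY_ID": "k",
                                  "AWS_SECRET_ACCESS_KEY": "s"})))
```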
