Access via PythonSDK/FSSpec
Last updated
This feature is experimental.
Alluxio Python SDK (alluxiofs
) is based on fsspec, which allows applications to seamlessly interact with various storage backends through a unified Python filesystem interface. It leverages a high-performance distributed caching layer, the Alluxio cluster, to significantly speed up data access and reduce latency. This is particularly beneficial for data-intensive applications and workflows, especially AI training workloads, where large datasets need to be read quickly and repeatedly.
A running Alluxio cluster
Python version >= 3.8
Example: Deploy S3 as the underlying data lake storage
Start an Alluxio cluster with at least 1 coordinator and 1 worker
If you want to start the cluster on bare metal, you can use the ./bin/alluxio process start
CLI.
Configure Alluxio with the UFS's Credentials
Interacting with alluxio using alluxiofs
If the data has already been loaded into the Alluxio cluster, skip this step.
Submit distributed load job to Alluxio cluster:
This triggers a load job asynchronously. You can wait until the load finishes, or check the progress of the loading process with the following command:
To cancel the distributed load job:
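The same submit / check-progress / cancel flow is also available from Python. The sketch below is a non-authoritative example based on the alluxiofs client API; the bucket path and ETCD host are placeholders, and a reachable Alluxio cluster is assumed. Imports are deferred into the function so the sketch can be loaded even where alluxiofs is not installed.

```python
def run_distributed_load(path="s3://example-bucket/dataset/"):
    """Sketch: submit a distributed load, poll its progress, and cancel it.

    Assumes a running Alluxio cluster reachable via ETCD on localhost;
    `path` is a placeholder dataset location in the under storage.
    """
    from alluxiofs import AlluxioClient  # deferred: requires alluxiofs

    client = AlluxioClient(etcd_hosts="localhost")
    client.submit_load(path)            # triggers the load job asynchronously
    print(client.load_progress(path))   # poll this until the job completes
    client.stop_load(path)              # cancel the job if it is no longer needed
```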
This simple example creates a filesystem handler that connects to an Alluxio cluster using ETCD membership in Kubernetes, with an ETCD service called alluxio-etcd.alluxio-ai, cluster_name alluxio, namespace alluxio-ai, and S3 as the under storage.
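A minimal sketch of creating such a handler, assuming the alluxiofs package is installed and the ETCD service above is reachable from this process (cluster-name and namespace-specific options are elided here; consult the alluxiofs configuration reference for your version):

```python
def create_alluxio_handler():
    """Sketch: register alluxiofs with fsspec and build a filesystem handler."""
    # Imports deferred so this sketch can be loaded without alluxiofs installed.
    import fsspec
    from alluxiofs import AlluxioFileSystem

    fsspec.register_implementation("alluxiofs", AlluxioFileSystem, clobber=True)
    return fsspec.filesystem(
        "alluxiofs",
        etcd_hosts="alluxio-etcd.alluxio-ai",  # ETCD service from the example
        etcd_port=2379,                        # default ETCD port
        target_protocol="s3",                  # S3 as the under storage
    )
```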
Below are common operations supported by alluxio_fs
for interacting with files and directories:
ls
Lists the contents of a directory.
Parameters
path (str)
: The path of the directory to list.
detail (bool, optional)
: If True, returns detailed information about each entry. Defaults to False.
Returns
list[dict]
: A list of dicts with JSON-like structure.
Example
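Because alluxiofs implements the standard fsspec interface, the call shape of ls is identical across backends. The demo below uses fsspec's in-memory filesystem so it runs without an Alluxio cluster; substitute your alluxiofs handler in practice.

```python
import fsspec

# In-memory stand-in for an alluxiofs handler (same fsspec interface).
fs = fsspec.filesystem("memory")
fs.pipe_file("/demo/a.bin", b"hello")
fs.pipe_file("/demo/b.bin", b"world")

names = fs.ls("/demo", detail=False)    # plain list of paths
entries = fs.ls("/demo", detail=True)   # list of dicts with name/size/type
```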
info
Retrieves information about a file or directory.
Parameters
path (str)
: The path of the file or directory.
Returns
dict
: JSON-like structure with file or directory info.
Example
isdir
Checks if a path is a directory.
Parameters
path (str)
: The path to check.
Returns
bool
: Whether the path is a directory.
Example
_open
Opens a file for reading or writing.
Parameters
path (str)
: The path of the file to open.
mode (str, optional)
: The mode in which to open the file. Defaults to "rb".
block_size (int, optional)
: The block size for reading.
autocommit (bool, optional)
: If True, commits changes automatically.
cache_options (dict, optional)
: Cache options.
**kwargs
: Additional keyword arguments.
Returns
AlluxioFile
: Object supporting read()
, write()
, etc.
Example
cat_file
Reads a range of bytes from a file (intended for small files under 10MB).
Parameters
path (str)
: The path of the file to read.
start (int, optional)
: Starting byte. Defaults to 0.
end (int, optional)
: Ending byte. Defaults to None.
Returns
bytes
: The read bytes.
Example
mkdir
Creates a directory.
Parameters
path (str)
: The path of the new directory.
Returns
bool
: Whether the directory was created successfully.
Example
rm
Removes a file or directory.
Parameters: the path to remove, plus multiple optional flags controlling the deletion.
Returns
bool
: Whether the operation succeeded.
Example
touch
Creates an empty file.
Parameters
path (str)
: File path.
Returns
bool
: Whether the file was created.
Example
head
Reads the first few bytes of a file.
Parameters
path (str)
: File path.
num_of_bytes (int)
: Number of bytes.
Returns
bytes
: The read bytes.
Example
tail
Reads the last few bytes of a file.
Parameters
path (str)
: File path.
num_of_bytes (int)
: Number of bytes.
Returns
bytes
: The read bytes.
Example
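head and tail follow the standard fsspec signatures, so they can be exercised against any fsspec backend. This sketch uses the in-memory filesystem for illustration; with alluxiofs the calls are the same.

```python
import fsspec

fs = fsspec.filesystem("memory")
fs.pipe_file("/logs/app.log", b"first-line...last-line")

first = fs.head("/logs/app.log", 10)   # first 10 bytes of the file
last = fs.tail("/logs/app.log", 9)     # last 9 bytes of the file
```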
mv
Moves or renames a file or directory.
Parameters
path1 (str)
: Source path.
path2 (str)
: Destination path.
Returns
bool
: Whether the move/rename succeeded.
Example
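A small illustration of mv on an fsspec filesystem (shown here on the in-memory backend; the same call applies to an alluxiofs handler):

```python
import fsspec

fs = fsspec.filesystem("memory")
fs.pipe_file("/src/data.csv", b"a,b\n1,2\n")

fs.mv("/src/data.csv", "/dst/data.csv")  # move/rename the file
```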
copy
/ cp_file
Copies a file or directory.
Parameters: Various options for recursion, force, threading, etc.
Returns
bool
: Whether the copy succeeded.
Example
read
Reads an entire file (intended for small files under 10MB).
Parameters
path (str)
: File path.
Returns
bytes
: The read content.
Example
rename
/ move
Alias for mv. Renames or moves a file or directory.
Parameters
path1 (str)
: Source path.
path2 (str)
: Destination path.
Returns
bool
: Whether the operation succeeded.
Example
cp_file
Alias for copy. Copies a file or directory.
Parameters
path1 (str)
: Source path.
path2 (str)
: Destination path.
recursive (bool)
: Whether to copy recursively for directories.
Returns
bool
: Whether the copy succeeded.
Example
upload
Uploads a large file from the local OS file system to Alluxio (and the UFS, depending on the WriteType).
Parameters
lpath (str)
: The source path of the file in the local OS.
rpath (str)
: The destination path of the file in Alluxio/UFS.
Returns
bool
: Whether the upload task completed successfully.
Example
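A hedged sketch of an upload call, assuming a reachable Alluxio cluster and that the handler exposes upload() as described above; the paths and ETCD host are placeholders. Imports are deferred so the sketch loads without alluxiofs installed.

```python
def upload_to_alluxio(local_path="/tmp/model.ckpt",
                      alluxio_path="s3://example-bucket/ckpt/model.ckpt"):
    """Sketch: push a local file into Alluxio/UFS via the fsspec handler."""
    import fsspec
    from alluxiofs import AlluxioFileSystem

    fsspec.register_implementation("alluxiofs", AlluxioFileSystem, clobber=True)
    fs = fsspec.filesystem("alluxiofs", etcd_hosts="localhost",
                           target_protocol="s3")
    return fs.upload(local_path, alluxio_path)  # True on success
```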
upload_data
Uploads a large file to Alluxio as a byte stream. Unlike upload, which reads from a file on disk, upload_data accepts bytes directly.
Parameters
lpath (str)
: The destination path in Alluxio/UFS.
data (bytes)
: The byte content to upload.
Returns
bool
: Whether the upload task completed successfully.
Example
download
Downloads a large file from Alluxio to the local OS file system.
Parameters
lpath (str)
: The destination path in the local file system.
rpath (str)
: The source path in Alluxio.
Returns
bool
: Whether the download task completed successfully.
Example
download_data
Downloads a file from Alluxio and returns it as an in-memory byte stream.
Parameters
lpath (str)
: The path of the file in Alluxio.
Returns
io.BytesIO
: A byte stream containing the file content.
Example
write
Writes byte data to a file in Alluxio. Equivalent to upload_data.
Parameters
path (str)
: Path to write the data to.
value (bytes)
: The byte content.
Returns
bool
: Whether the write succeeded.
Example
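The write/upload_data and download_data methods above pair naturally into an in-memory round trip. The following is a sketch under the same assumptions as before (reachable cluster, placeholder path, deferred imports):

```python
def roundtrip_bytes(alluxio_path="s3://example-bucket/tmp/blob.bin"):
    """Sketch: write bytes with write() and read them back with download_data()."""
    import fsspec
    from alluxiofs import AlluxioFileSystem

    fsspec.register_implementation("alluxiofs", AlluxioFileSystem, clobber=True)
    fs = fsspec.filesystem("alluxiofs", etcd_hosts="localhost",
                           target_protocol="s3")

    payload = b"model-checkpoint-bytes"
    fs.write(alluxio_path, payload)        # equivalent to upload_data
    buf = fs.download_data(alluxio_path)   # io.BytesIO with the file content
    return buf.read() == payload
```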
Trainers like PyTorch read the same dataset again and again, once per epoch. Fetching a large dataset in every epoch can become the training bottleneck. By leveraging Alluxio's high-performance distributed caching, trainers on Ray can reduce total training time, improve GPU utilization, and speed up the end-to-end model lifecycle.
Prerequisites: Ray version >= 2.8.2
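As a sketch of the Ray integration: Ray Data readers accept a filesystem object, so an alluxiofs handler can route every epoch's reads through the Alluxio cache. The function below is illustrative; `images_uri` is a placeholder and the handler is assumed to be created as shown earlier.

```python
def build_ray_dataset(alluxio_fs, images_uri="s3://example-bucket/train-images/"):
    """Sketch: read a training dataset through an alluxiofs handler with Ray Data.

    `alluxio_fs` is an alluxiofs (fsspec) handler; `images_uri` is a placeholder.
    """
    import ray  # deferred: requires ray >= 2.8.2 per the prerequisites above

    # Reads are served from the Alluxio cache instead of the remote UFS.
    return ray.data.read_images(images_uri, filesystem=alluxio_fs)
```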
PyArrow allows applications and data to seamlessly connect with each other by providing a high-performance, in-memory columnar storage format. It enables efficient data interchange between different data processing systems. By delegating its storage interface to fsspec, PyArrow can access various storage backends through a unified interface. By using alluxiofs, PyArrow can leverage Alluxio's distributed caching capabilities to enhance data access speeds and reduce latency.
Example 1:
Example 2:
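One common way to hand an fsspec handler to PyArrow is the standard FSSpecHandler bridge in pyarrow.fs. The sketch below reads a Parquet dataset through Alluxio under that assumption; `parquet_path` is a placeholder.

```python
def read_parquet_via_alluxio(alluxio_fs, parquet_path="example-bucket/data/"):
    """Sketch: wrap an alluxiofs handler for PyArrow and read a Parquet dataset."""
    import pyarrow.dataset as ds                      # deferred: requires pyarrow
    from pyarrow.fs import PyFileSystem, FSSpecHandler

    arrow_fs = PyFileSystem(FSSpecHandler(alluxio_fs))  # fsspec -> pyarrow bridge
    return ds.dataset(parquet_path, format="parquet",
                      filesystem=arrow_fs).to_table()
```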
etcd_hosts (str, required): A comma-separated list of ETCD server hosts in the format "host1:port1,host2:port2,...". ETCD is used for dynamic discovery of Alluxio workers.
etcd_port (int, optional): The port number used by each ETCD server. Defaults to 2379.
options (dict, optional): A dictionary of Alluxio configuration options where keys are property names and values are property values. These options configure the Alluxio client behavior.
Example: Configure Alluxio fsspec. Note that the following options must be set to the same values on both alluxiofs and the Alluxio cluster:
alluxio.worker.page.store.page.size
(default 1MB
): Size of each page in the worker paged block store. Recommended to set to 20MB
for large Parquet files.
alluxio.user.worker.selection.policy.consistent.hash.virtual.node.count.per.worker
(default 2000
): The number of virtual nodes per worker in the consistent hashing algorithm. On membership changes, a consistent hashing algorithm redistributes some virtual nodes instead of rebuilding the whole hash table, which guarantees the hash table changes only minimally. To achieve this, the number of virtual nodes should be X times the number of physical nodes in the cluster, where X balances redistribution granularity against hash table size. Recommended to set to 5.
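For reference, a minimal alluxio-site.properties fragment with the recommended values from the list above (the same values must also be passed via the options dict on the alluxiofs side):

```properties
alluxio.worker.page.store.page.size=20MB
alluxio.user.worker.selection.policy.consistent.hash.virtual.node.count.per.worker=5
```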
(Optional) Initialize alluxio_client
for distributed load operations:
Initialize alluxio_fs
for fsspec filesystem operations:
Configure the logger of alluxio_fs
Arguments:
target_protocol (str, optional): Specifies the under storage protocol to create the under storage file system object. Common examples include s3
for Amazon S3, hdfs
for Hadoop Distributed File System, and others.
target_options (dict, optional): Provides a set of configuration options relevant to the target_protocol
. These options might include credentials, endpoint URLs, and other protocol-specific settings required to successfully interact with the under storage system.
fs (object, optional): Directly supplies an instance of a file system object for accessing the underlying storage of Alluxio
logger (object, optional): Configures the path where log files are stored and the logger level. The path defaults to the current directory, and the level defaults to logging.INFO.
To connect to S3, you can follow these steps:
anon bool (False): Whether to use an anonymous connection (public buckets only). If False, uses the key/secret given, or boto's credential resolver (client_kwargs, environment variables, config files, EC2 IAM server, in that order).
endpoint_url string (None): Use this endpoint_url, if specified. Needed for connecting to non-AWS S3 buckets. Takes precedence over endpoint_url
in client_kwargs.
key string (None): If not anonymous, use this access key ID, if specified. Takes precedence over aws_access_key_id
in client_kwargs.
secret string (None): If not anonymous, use this secret access key, if specified. Takes precedence over aws_secret_access_key
in client_kwargs.
token string (None): If not anonymous, use this security token, if specified.
Pass the supported arguments as target_options to Alluxio. You can then use these arguments to create an Alluxio file system object using fsspec.
Here's how to create an Alluxio file system object connected to S3 using fsspec:
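A hedged sketch of such a connection, using the same placeholders as the notes below (real credentials and endpoint must be substituted; imports are deferred so the sketch loads without alluxiofs installed):

```python
def connect_alluxio_s3():
    """Sketch: create an alluxiofs handler backed by S3 via target_options."""
    import fsspec
    from alluxiofs import AlluxioFileSystem

    fsspec.register_implementation("alluxiofs", AlluxioFileSystem, clobber=True)
    return fsspec.filesystem(
        "alluxiofs",
        etcd_hosts="localhost",       # placeholder ETCD host
        target_protocol="s3",
        target_options={
            "key": "your-aws-access-key",
            "secret": "your-aws-secret-key",
            "endpoint_url": "https://s3.your-endpoint.com",
        },
    )
```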
In this example:
Replace your-aws-access-key
and your-aws-secret-key
with your actual AWS credentials.
Replace https://s3.your-endpoint.com
with the appropriate endpoint URL for your S3-compatible service if needed.
By following these steps, you can effectively connect to Alluxio with an S3 backend using fsspec.
Prometheus Installation and Configuration
Edit the prometheus.yml
configuration file and then start Prometheus:
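As a sketch, a minimal prometheus.yml scrape job for an Alluxio metrics endpoint; the target host, port, and metrics path depend on your deployment and are placeholders here:

```yaml
scrape_configs:
  - job_name: "alluxio"
    # Placeholder target: point this at the HTTP metrics endpoint exposed
    # by your Alluxio coordinator/workers.
    static_configs:
      - targets: ["<alluxio-host>:<metrics-port>"]
```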
Grafana Installation and Configuration
Start Grafana:
| Metric Name | Description | Labels | Unit | Implementation Code |
| --- | --- | --- | --- | --- |
| alluxio_http_server_call_latency_ms | Histogram of HTTP service call latency (bucket boundaries: [10, 40, 160, 640] ms) | method, success | Milliseconds (ms) | HistogramWrapper |
| alluxio_http_server_result_total | Total count of HTTP service results | method, state | Count | CounterWrapper |
| alluxio_http_server_call_latency_ms_sum | Total latency of HTTP service calls | method, success | Milliseconds (ms) | HistogramWrapper |
| alluxio_http_server_call_latency_ms_count | Count of HTTP service calls | method, success | Count | HistogramWrapper |
If you want to start the cluster in Kubernetes, please refer to .
For details, you can go to and .
This simple example creates a client that connects to an Alluxio cluster using ETCD membership in Kubernetes, with an ETCD service called alluxio-etcd.alluxio-ai, cluster_name alluxio, namespace alluxio-ai, and S3 as the under storage.
See for more configuration settings.
See to set advanced arguments for Alluxio cluster and/or storage system connections.
More examples of Python filesystem operations can be found .
Ray is a fast and simple framework for building and running distributed applications. PyTorch, TensorFlow, and XGBoost trainers, running on top of Ray, can leverage Ray's advanced functionalities like creating heterogeneous clusters consisting of CPU machines for data loading and preprocessing, and GPU machines for training. Data loading, preprocessing, and training can be parallelized using Ray Data.
Review the S3 fsspec documentation: refer to the to find the supported arguments for connecting to S3. Typical arguments include:
For more detailed instructions on setting up the monitoring system, refer to the .