Python API via FSSpec
This feature is experimental.
The Alluxio Python SDK (alluxiofs) is based on FSSpec, which allows applications to interact with various storage backends through a unified Python filesystem interface. It uses a high-performance distributed caching layer, the Alluxio cluster, to speed up data access and reduce latency. This is particularly beneficial for data-intensive applications and workflows, especially AI training workloads, where large datasets need to be accessed quickly and repeatedly.
Prerequisites
A running Alluxio cluster
Python version >= 3.8
Installation
Install Dependencies
Install alluxiofs
pip install alluxiofs
(Optional) Install the FSSpec-based client for your UFS (such as S3 or OSS)
Example: If your UFS is S3 and you want to be able to fall back to the UFS when Alluxio fails, install s3fs
pip install s3fs
Environment Setup
Start an Alluxio cluster with at least 1 coordinator and 1 worker
If you want to start the cluster on K8s, please refer to the Alluxio Official Documentation.
If you want to start the cluster on bare metal, you can use the following CLI command:
./bin/alluxio process start
Interacting with Alluxio using alluxiofs
For details, see Create alluxiofs's Instance and alluxiofs's Basic File Operations.
Create alluxiofs's Instance
This simple example creates a filesystem handler that connects to an Alluxio cluster, with "localhost" as the load balance domain and S3 as the under storage.
See Advanced Initialization Parameters to configure additional arguments for the Alluxio cluster and/or storage system connections.
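A minimal sketch of creating such a handler follows. It assumes the alluxiofs package exposes an AlluxioFileSystem class and accepts load_balance_domain, target_protocol, and options as keyword arguments; the two helper functions are hypothetical names used only for illustration. The imports are deferred so that the argument-building step can be inspected without a running cluster.

```python
def build_alluxio_kwargs(load_balance_domain="localhost",
                         target_protocol="s3",
                         options=None):
    """Collect keyword arguments for the filesystem constructor."""
    kwargs = {
        "load_balance_domain": load_balance_domain,  # assumed argument name
        "target_protocol": target_protocol,
    }
    if options:
        # Alluxio client properties, e.g. page size (see the advanced section).
        kwargs["options"] = dict(options)
    return kwargs


def create_alluxio_fs(**kwargs):
    """Register alluxiofs with fsspec and return a filesystem handler.

    Requires `pip install alluxiofs` and a reachable Alluxio cluster.
    """
    import fsspec
    from alluxiofs import AlluxioFileSystem  # assumed import path

    fsspec.register_implementation("alluxiofs", AlluxioFileSystem, clobber=True)
    return fsspec.filesystem("alluxiofs", **kwargs)


# Typical use (needs a running cluster, so it is left commented out):
# alluxio_fs = create_alluxio_fs(**build_alluxio_kwargs())
```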
alluxiofs's Basic File Operations
Below are common operations supported by alluxio_fs for interacting with files and directories:
1. ls
Lists the contents of a directory.
Parameters
path (str): The path of the directory to list.
detail (bool, optional): If True, returns detailed information about each entry. Defaults to False.
Returns
list[dict]: A list of dicts with JSON-like structure.
Example
2. info
Retrieves information about a file or directory.
Parameters
path (str): The path of the file or directory.
Returns
dict: JSON-like structure with file or directory info.
Example
3. isdir
Checks if a path is a directory.
Parameters
path (str): The path to check.
Returns
bool: Whether the path is a directory.
Example
4. open
Opens a file for reading or writing.
Parameters
path (str): The path of the file to open.
mode (str, optional): The mode in which to open the file. Defaults to "rb".
block_size (int, optional): The block size for reading.
autocommit (bool, optional): If True, commits changes automatically.
cache_options (dict, optional): Cache options.
**kwargs: Additional keyword arguments.
Returns
AlluxioFile: A file object supporting read(), write(), etc.
Example
5. mkdir
Creates a directory.
Parameters
path (str): The path of the new directory.
Returns
bool: Whether the directory was created successfully.
Example
6. rm
Removes a file or directory.
Parameters: Multiple optional flags for control.
Returns
bool: Whether the operation succeeded.
Example
7. touch
Creates an empty file.
Parameters
path (str): File path.
Returns
bool: Whether the file was created.
Example
8. head
Reads the first few bytes of a file.
Parameters
path (str): File path.
num_of_bytes (int): Number of bytes.
Returns
bytes: The read bytes.
Example
9. tail
Reads the last few bytes of a file.
Parameters
path (str): File path.
num_of_bytes (int): Number of bytes.
Returns
bytes: The read bytes.
Example
10. mv
Moves or renames a file or directory.
Parameters
path1 (str): Source path.
path2 (str): Destination path.
Returns
bool: Whether the move/rename succeeded.
Example
11. cp
Copies a file or directory.
Parameters: Various options for recursion, force, threading, etc.
Returns
bool: Whether the copy succeeded.
Example
12. read_bytes
Reads an entire file. Intended for small files (under 10 MB).
Parameters
path (str): File path.
Returns
bytes: The read content.
Example
13. write_bytes
Writes byte data to a file in Alluxio.
Parameters
path (str): Path to write the data to.
value (bytes): The byte content.
Returns
bool: Whether the write succeeded.
Example
14. upload
Uploads a large file from the local OS file system to Alluxio (and to the UFS, depending on the WriteType).
Parameters
lpath (str): The source path of the file in the local OS.
rpath (str): The destination path of the file in Alluxio/UFS.
Returns
bool: Whether the upload task completed successfully.
Example
15. download
Downloads a large file from Alluxio to the local OS file system.
Parameters
lpath (str): The destination path in the local file system.
rpath (str): The source path in Alluxio.
Returns
bool: Whether the download task completed successfully.
Example
More Python filesystem operations examples can be found here.
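Several of the operations listed above can be combined into a short write, read, rename, and delete cycle. The sketch below is illustrative only: exercise_basic_operations is a hypothetical helper, the file names are made up, and the positional arguments follow the parameter lists given on this page. It accepts any object that exposes these methods, such as an alluxio_fs instance.

```python
def exercise_basic_operations(fs, base_dir):
    """Run a small write/read/rename/delete cycle under `base_dir`.

    `fs` is any filesystem object exposing mkdir, write_bytes, isdir, ls,
    head, tail, mv, read_bytes and rm as documented above.
    """
    src = f"{base_dir}/hello.bin"      # hypothetical file name
    dst = f"{base_dir}/renamed.bin"    # hypothetical file name
    payload = b"hello alluxio"

    fs.mkdir(base_dir)
    fs.write_bytes(src, payload)

    result = {
        "is_dir": fs.isdir(base_dir),
        "listing": fs.ls(base_dir),
        "head": fs.head(src, 5),   # first 5 bytes
        "tail": fs.tail(src, 7),   # last 7 bytes
    }

    fs.mv(src, dst)                          # rename the file
    result["content"] = fs.read_bytes(dst)   # read the whole (small) file
    fs.rm(dst)                               # clean up
    return result
```

Called as exercise_basic_operations(alluxio_fs, "s3://your-bucket/demo") against a real cluster, the returned dictionary should contain the first and last bytes of the payload and the full content read back after the rename; the bucket path here is a placeholder.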
Integration with Other Frameworks
Example: Ray
Ray is a fast and simple framework for building and running distributed applications. PyTorch, TensorFlow, and XGBoost trainers, running on top of Ray, can leverage Ray's advanced functionalities like creating heterogeneous clusters consisting of CPU machines for data loading and preprocessing, and GPU machines for training. Data loading, preprocessing, and training can be parallelized using Ray Data.
Trainers such as PyTorch read the same dataset repeatedly, once per epoch, so fetching a large dataset in every epoch becomes the training bottleneck. By leveraging Alluxio's high-performance distributed caching, trainers on Ray can reduce total training time, improve GPU utilization, and speed up the end-to-end model lifecycle.
Prerequisites: Ray version >= 2.8.2
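One way this integration might look is sketched below. It assumes an already created alluxio_fs handler and a Parquet dataset; the helper name and the path argument are hypothetical. The fsspec filesystem is wrapped in PyArrow's PyFileSystem explicitly before being handed to Ray Data, since the filesystem argument of ray.data.read_parquet is documented as a PyArrow filesystem.

```python
def read_parquet_with_ray(alluxio_fs, path):
    """Read a Parquet dataset into Ray Data through an fsspec filesystem.

    Requires `ray` and `pyarrow` to be installed.
    """
    import ray.data
    from pyarrow.fs import FSSpecHandler, PyFileSystem

    # Wrap the fsspec filesystem so it satisfies PyArrow's filesystem interface.
    arrow_fs = PyFileSystem(FSSpecHandler(alluxio_fs))
    return ray.data.read_parquet(path, filesystem=arrow_fs)


# Typical use (needs Ray and a running Alluxio cluster):
# ds = read_parquet_with_ray(alluxio_fs, "s3://your-bucket/your-dataset/")
```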
Example: PyArrow
PyArrow allows applications and data to seamlessly connect with each other by providing a high-performance, in-memory columnar storage format. It enables efficient data interchange between different data processing systems. By delegating its storage interface to fsspec, PyArrow can access various storage backends through a unified interface. By using alluxiofs, PyArrow can leverage Alluxio's distributed caching capabilities to enhance data access speeds and reduce latency.
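The delegation to fsspec described above can be sketched as follows, assuming an existing alluxio_fs handler (or any other fsspec filesystem). The helper name and the dataset path are hypothetical; PyFileSystem and FSSpecHandler are the PyArrow classes that adapt an fsspec filesystem to PyArrow's filesystem interface.

```python
def open_parquet_dataset(fsspec_fs, path):
    """Open a Parquet dataset with PyArrow on top of an fsspec filesystem.

    Requires `pyarrow` to be installed.
    """
    import pyarrow.dataset as ds
    from pyarrow.fs import FSSpecHandler, PyFileSystem

    # Adapt the fsspec filesystem to the interface PyArrow expects.
    arrow_fs = PyFileSystem(FSSpecHandler(fsspec_fs))
    return ds.dataset(path, filesystem=arrow_fs, format="parquet")


# Typical use (needs a running Alluxio cluster):
# dataset = open_parquet_dataset(alluxio_fs, "s3://your-bucket/your-dataset/")
# table = dataset.to_table()
```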
Example 1:
Example 2:
Advanced Initialization Parameters
Arguments to Connect to Alluxio Cluster
options (dict, optional): A dictionary of Alluxio configuration options where keys are property names and values are property values. These options configure the Alluxio client behavior.
Example: Configure Alluxio fsspec. Note that the following options must have the same values in alluxiofs and in the Alluxio cluster:
alluxio.worker.page.store.page.size (default 1MB): Size of each page in the worker paged block store. Recommended to set to 20MB for large parquet files.
alluxio.user.worker.selection.policy.consistent.hash.virtual.node.count.per.worker (default 2000): The number of virtual nodes for one worker in the consistent hashing algorithm. In a consistent hashing algorithm, on membership changes, some virtual nodes are redistributed instead of rebuilding the whole hash table. This guarantees that the hash table changes only minimally. To achieve that, the number of virtual nodes should be X times the number of physical nodes in the cluster, where X is a balance between redistribution granularity and size. Recommended to set to 5.
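To illustrate why virtual nodes limit redistribution, here is a generic consistent hashing sketch. It is not Alluxio's implementation: the class name, the hash function, and the "worker#index" naming of virtual nodes are assumptions made for the example. When a worker joins, only keys that now fall on one of its virtual nodes change owner, and every moved key moves to the new worker.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent hash ring with a fixed number of virtual nodes per worker."""

    def __init__(self, workers, virtual_nodes_per_worker):
        self._ring = sorted(
            (_hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(virtual_nodes_per_worker)
        )
        self._points = [point for point, _ in self._ring]

    def worker_for(self, key: str) -> str:
        # The first virtual node clockwise from the key owns it (wrapping around).
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved_keys(keys, before: HashRing, after: HashRing):
    """Return {key: (old_worker, new_worker)} for keys whose owner changed."""
    moved = {}
    for key in keys:
        old, new = before.worker_for(key), after.worker_for(key)
        if old != new:
            moved[key] = (old, new)
    return moved


keys = [f"file-{i}" for i in range(1000)]
before = HashRing(["w1", "w2", "w3"], virtual_nodes_per_worker=100)
after = HashRing(["w1", "w2", "w3", "w4"], virtual_nodes_per_worker=100)
moved = moved_keys(keys, before, after)
print(f"{len(moved)} of {len(keys)} keys moved, all to the new worker")
```

More virtual nodes per worker spread each worker's share of the ring more evenly, at the cost of a larger ring to build and search.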
(Optional) Init alluxio_client for distributed load operations:
Init alluxio_fs for fsspec filesystem operations:
Configure the logger of alluxio_fs:
Arguments for storage backend
Arguments:
target_protocol (str, optional): Specifies the under storage protocol used to create the under storage file system object. Common examples include s3 for Amazon S3, hdfs for Hadoop Distributed File System, and others.
target_options (dict, optional): Provides a set of configuration options relevant to the target_protocol. These options might include credentials, endpoint URLs, and other protocol-specific settings required to interact with the under storage system.
fs (object, optional): Directly supplies an instance of a file system object for accessing the underlying storage of Alluxio.
logger (object, optional): Configures the path where log files are stored and the logger level. The path defaults to the current directory and the level defaults to logging.INFO.
Example: connect to S3
To connect to S3, you can follow these steps:
Review S3 fsspec documentation: Refer to the s3fs documentation to find out the supported arguments for connecting to S3. Typical arguments include:
anon bool (False): Whether to use an anonymous connection (public buckets only). If False, uses the key/secret given, or boto's credential resolver (client_kwargs, environment, variables, config files, EC2 IAM server, in that order).
endpoint_url string (None): Use this endpoint_url, if specified. Needed for connecting to non-AWS S3 buckets. Takes precedence over endpoint_url in client_kwargs.
key string (None): If not anonymous, use this access key ID, if specified. Takes precedence over aws_access_key_id in client_kwargs.
secret string (None): If not anonymous, use this secret access key, if specified. Takes precedence over aws_secret_access_key in client_kwargs.
token string (None): If not anonymous, use this security token, if specified.
Pass the supported arguments as target_options to Alluxio: You can then use these arguments to create an Alluxio file system object using fsspec.
Here's how to create an Alluxio file system object connected to S3 using fsspec:
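A sketch of this step is shown below, using the s3fs arguments key, secret, and endpoint_url described above as target_options. The helper function names are hypothetical, and load_balance_domain is an assumed argument name for the cluster address; the imports are deferred so the options can be built without a running cluster.

```python
def build_s3_target_options(key, secret, endpoint_url=None):
    """Build the s3fs arguments that are passed through as target_options."""
    options = {"key": key, "secret": secret}
    if endpoint_url:
        # Only needed for non-AWS, S3-compatible services.
        options["endpoint_url"] = endpoint_url
    return options


def create_alluxio_fs_for_s3(target_options, load_balance_domain="localhost"):
    """Create an Alluxio filesystem object backed by S3.

    Requires `pip install alluxiofs s3fs` and a reachable Alluxio cluster.
    """
    import fsspec
    from alluxiofs import AlluxioFileSystem  # assumed import path

    fsspec.register_implementation("alluxiofs", AlluxioFileSystem, clobber=True)
    return fsspec.filesystem(
        "alluxiofs",
        load_balance_domain=load_balance_domain,  # assumed argument name
        target_protocol="s3",
        target_options=target_options,
    )


target_options = build_s3_target_options(
    key="your-aws-access-key",
    secret="your-aws-secret-key",
    endpoint_url="https://s3.your-endpoint.com",
)
# alluxio_fs = create_alluxio_fs_for_s3(target_options)
```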
In this example:
Replace your-aws-access-key and your-aws-secret-key with your actual AWS credentials.
Replace https://s3.your-endpoint.com with the appropriate endpoint URL for your S3-compatible service if needed.
By following these steps, you can effectively connect to Alluxio with an S3 backend using fsspec.
Monitoring Metrics
Monitoring System Setup
Prometheus Installation and Configuration
Edit the prometheus.yml configuration file and then start Prometheus:
Grafana Installation and Configuration
Start Grafana:
For more detailed instructions on setting up the monitoring system, refer to the Alluxio Official Documentation.
Explanation of Monitoring Metrics
| Metric Name | Description | Labels | Unit | Implementation Code |
| --- | --- | --- | --- | --- |
| alluxio_http_server_call_latency_ms | Histogram of HTTP service call latency (bucket boundaries: [10, 40, 160, 640] ms) | method, success | Milliseconds (ms) | HistogramWrapper |
| alluxio_http_server_result_total | Total count of HTTP service results | method, state | Count | CounterWrapper |
| alluxio_http_server_call_latency_ms_sum | Total latency of HTTP service calls | method, success | Milliseconds (ms) | HistogramWrapper |
| alluxio_http_server_call_latency_ms_count | Count of HTTP service calls | method, success | Count | HistogramWrapper |
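To make the relationship between the histogram buckets and the _sum and _count series concrete, the sketch below records a few latencies against the bucket boundaries listed above. It assumes Prometheus-style cumulative buckets, where each bucket counts observations less than or equal to its upper bound; the LatencyHistogram class is a hypothetical illustration and not the HistogramWrapper implementation.

```python
import bisect

BUCKET_BOUNDS_MS = [10, 40, 160, 640]  # boundaries from the metrics table


class LatencyHistogram:
    """Toy histogram that tracks per-bucket counts, a running sum and a count."""

    def __init__(self, bounds=BUCKET_BOUNDS_MS):
        self.bounds = list(bounds)
        # One slot per boundary plus a final overflow slot (+Inf).
        self.bucket_counts = [0] * (len(self.bounds) + 1)
        self.sum = 0.0
        self.count = 0

    def observe(self, latency_ms):
        # Index of the first boundary >= latency ("less than or equal" buckets).
        idx = bisect.bisect_left(self.bounds, latency_ms)
        self.bucket_counts[idx] += 1
        self.sum += latency_ms
        self.count += 1

    def cumulative(self):
        """Return cumulative counts keyed by upper bound, as Prometheus exposes them."""
        labels = [str(bound) for bound in self.bounds] + ["+Inf"]
        running, result = 0, {}
        for label, bucket_count in zip(labels, self.bucket_counts):
            running += bucket_count
            result[label] = running
        return result

    def mean(self):
        return self.sum / self.count if self.count else 0.0


hist = LatencyHistogram()
for latency in [5, 10, 35, 200, 1000]:
    hist.observe(latency)

print(hist.cumulative())  # {'10': 2, '40': 3, '160': 3, '640': 4, '+Inf': 5}
print(hist.sum, hist.count, hist.mean())  # 1250.0 5 250.0
```

Dividing the _sum series by the _count series gives the average call latency, which is how the last two rows of the table are typically used together.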