POSIX API
Last updated
The Alluxio POSIX API is a feature that allows mounting an Alluxio File System as a standard file system on most flavors of Unix. With this feature, standard tools (for example, ls, cat, or mkdir) have basic access to the Alluxio namespace. More importantly, with the POSIX API integration, applications can interact with Alluxio regardless of the language they are written in (C, C++, Python, Ruby, Perl, or Java) without any Alluxio library integration.
Note that Alluxio-FUSE is different from projects like s3fs and mountableHdfs, which mount specific storage services such as S3 or HDFS to the local filesystem. The Alluxio POSIX API is a generic solution for the many storage systems supported by Alluxio. Data orchestration and caching features from Alluxio speed up I/O access to frequently used data.
Currently, the Alluxio POSIX API mainly targets ML/AI workloads (especially read-heavy workloads).
The Alluxio POSIX API is based on the Filesystem in Userspace (FUSE) project. Most basic file system operations are supported. However, given the intrinsic characteristics of Alluxio, like its write-once/read-many-times file data model, the mounted file system does not have full POSIX semantics and contains some limitations. Please read the functionalities and limitations for details.
For additional limitations on file path names in Alluxio, please check: Alluxio limitations
This example shows how to mount the whole Alluxio cluster to a local directory and run operations against the directory.
The following are the basic requirements for running the Alluxio POSIX API. Installing the Alluxio POSIX API using Docker and Kubernetes can further simplify the setup.
Have a running Alluxio cluster
On one of the following supported operating systems
MacOS 10.10 or later
CentOS - 6.8 or 7
RHEL - 7.x
Ubuntu - 16.04
Install JDK 11 or newer
JDK 8 has been reported to have bugs that may crash FUSE applications; see the issue for more details.
Install libfuse
On Linux, both libfuse version 2 and version 3 are supported
To use libfuse2, install libfuse 2.9.3 or newer (2.8.3 has been reported to also work, with some warnings). For example, on Red Hat, run yum install fuse fuse-devel
To use libfuse3, install libfuse 3.2.6 or newer (we are currently testing against 3.2.6). For example, on Red Hat, run yum install fuse3 fuse3-devel
See Select which libfuse version to use to learn more about the libfuse version used by Alluxio
On MacOS, install osxfuse 3.7.1 or newer. For example, run brew install osxfuse
After properly configuring and starting an Alluxio cluster, run the following command on the node where you want to create the mount point:
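As a sketch, using the script path and placeholders mentioned on this page (the argument order is an assumption; check the script's usage output):

```shell
# Sketch: mount the Alluxio path at the local mount point; argument order assumed
$ integration/fuse/bin/alluxio-fuse mount <mount_point> <alluxio_path>
```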
This will spawn a background user-space Java process (AlluxioFuse) that mounts the Alluxio path specified at <alluxio_path> to the local file system at the specified <mount_point>.
For example, running the following commands from the ${ALLUXIO_HOME} directory will mount the Alluxio path /people to the directory /mnt/people on the local file system.
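A hedged sketch of such commands (the mkdir/chown/chmod steps prepare an empty mount point owned by the current user; exact flags may vary by platform, and the mount argument order is assumed):

```shell
# Prepare an empty mount point owned by the current user (assumed steps)
$ sudo mkdir -p /mnt/people
$ sudo chown $(whoami) /mnt/people
$ chmod 755 /mnt/people
# Mount the Alluxio path /people at /mnt/people
$ integration/fuse/bin/alluxio-fuse mount /mnt/people /people
```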
Note that the <mount_point> must be an existing, empty path in your local file system hierarchy, and that the user running the integration/fuse/bin/alluxio-fuse script must own the mount point and have read and write permissions on it.
Multiple Alluxio FUSE mount points can be created on the same node. All AlluxioFuse processes share the same log output at ${ALLUXIO_HOME}/logs/fuse.log, which is useful for troubleshooting when errors happen on operations under the mounted filesystem.
See the configuration section for how to improve Alluxio POSIX API performance, especially for training workloads.
FUSE mount points can be checked via the mount command:
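For example (the filter pattern is an assumption; the exact mount entry format varies by platform):

```shell
# List mounted filesystems and filter for the Alluxio FUSE mount
$ mount | grep alluxio-fuse
```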
FUSE processes can be found via the jps or ps commands.
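For example, assuming the process name is AlluxioFuse as noted above:

```shell
# Look for the AlluxioFuse JVM process
$ jps | grep AlluxioFuse
$ ps aux | grep AlluxioFuse
```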
Mounted Alluxio path information can be found via Alluxio FUSE script:
After mounting, one can run operations (e.g. shell commands, training) against the local directory:
The operations will be translated and executed by the Alluxio system and may be executed on the under storage based on configuration.
Note that unlike Alluxio CLIs, which show detailed error messages, user operations via an Alluxio FUSE mount point only receive error messages predefined by FUSE, which may not be informative. For example, once an error happens, it is common to see:
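A typical (hypothetical) illustration of such an uninformative error, using the example mount point from above:

```shell
$ ls /mnt/people
ls: cannot access '/mnt/people': Input/output error
```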
In this case, check the Alluxio FUSE logs (located at ${ALLUXIO_HOME}/logs/fuse.log) for the actual error message. For example, the command may have failed because it was unable to connect to the Alluxio master:
Umount a mounted FUSE mount point:
For example,
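A sketch of the umount invocation (the mount point path is assumed from the earlier example):

```shell
$ integration/fuse/bin/alluxio-fuse umount /mnt/people
```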
See umount options for more advanced umount settings.
Most basic file system operations are supported. However, due to Alluxio's intrinsic characteristics, some operations are not fully supported.
| Category | Supported Operations | Not Supported Operations |
| --- | --- | --- |
| Metadata Write | Create file, delete file, create directory, delete directory, rename, change owner, change group, change mode | Symlink, link, change access/modification time (utimens), change special file attributes (chattr), sticky bit |
| Metadata Read | Get file status, get directory status, list directory status | |
| Data Write | Sequential write | Append write, random write, overwrite, truncate, concurrently writing the same file from multiple threads/clients |
| Data Read | Sequential read, random read, multiple threads/clients concurrently reading the same file | |
| Combinations | | FIFO special file type, renaming the source file while writing it, reading and writing the same file concurrently |
Note that all file/dir permissions are checked against the user launching the AlluxioFuse process instead of the end user running the operations. See Security section for more details about the configuration and limitation of Alluxio POSIX API security.
Alluxio FUSE can be launched and run without extra configuration for basic workloads. This section lists configuration suggestions to improve the performance and stability of training workloads, which typically involve many small files and much higher concurrency.
The following configurations are validated in training production workloads to help improve the training performance and/or system efficiency. Add the configuration before starting the corresponding services (Master/Worker/Fuse process).
<ALLUXIO_HOME>/conf/alluxio-env.sh:
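A hedged sketch of what such an entry might look like (the ALLUXIO_FUSE_JAVA_OPTS variable name and all values are illustrative assumptions, not validated recommendations):

```shell
# Illustrative only: give the FUSE JVM a fixed heap and a direct memory budget
ALLUXIO_FUSE_JAVA_OPTS="-Xms16G -Xmx16G -XX:MaxDirectMemorySize=8G"
```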
<ALLUXIO_HOME>/conf/alluxio-site.properties:
When using the POSIX API with a large number of small files, it is recommended to set the following extra properties:
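A hedged sketch of such properties (the property names relate to the FUSE metadata cache discussed below; the values are illustrative assumptions to tune for your workload):

```properties
# Enable the FUSE userspace metadata cache to avoid repeated small-file metadata RPCs
alluxio.user.metadata.cache.enabled=true
# Illustrative sizing/expiration values -- tune for your workload
alluxio.user.metadata.cache.max.size=2000000
alluxio.user.metadata.cache.expiration.time=10min
```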
When an application runs an operation against the local FUSE mount point, the request is processed by the FUSE kernel, the FUSE process, and the Alluxio system sequentially. If cache is enabled at any layer and there is a hit, the cached metadata/data is returned to the application without going through the remaining layers, improving overall read performance.
While Alluxio system (master and worker) provides remote distributed metadata/data cache to speed up the metadata/data access of Alluxio under storage files/directories, Alluxio FUSE provides another layer of local metadata/data cache on the application nodes to further speed up the metadata/data access.
Alluxio FUSE can provide two kinds of metadata/data cache, the kernel cache and the userspace cache.
Kernel cache is executed by Linux kernel with metadata/data stored in operating system kernel cache.
Userspace cache is controlled and managed by Alluxio FUSE process with metadata/data stored in user configured location (process memory for metadata, ramdisk/disk for data).
The following illustration shows the layers of cache: FUSE kernel cache, FUSE userspace cache, and the Alluxio system cache.
Although the FUSE kernel cache and the userspace cache can be enabled at the same time, both provide caching capability, so it is recommended to choose only one of them to avoid double memory consumption. Here is a guideline on how to choose between the two cache types based on your environment and needs.
Kernel Cache (Recommended): the kernel cache provides significantly better performance, scalability, and resource consumption than the userspace cache. However, the kernel cache is managed by the underlying operating system instead of Alluxio or end users. High kernel memory usage may affect the stability of the Alluxio FUSE pod in a Kubernetes environment. This is something to watch out for when using the kernel cache.
Userspace Cache: the userspace cache, in contrast, is relatively worse in performance, scalability, and resource consumption. It also requires pre-calculated and pre-allocated cache resources when launching the process. Despite these disadvantages, users have more fine-grained control over the cache (e.g. maximum cache size, eviction policy), and the cache will not unexpectedly affect other applications in a containerized environment.
Alluxio FUSE cache (userspace cache or kernel cache) is a single-node cache solution, which means modifications to the underlying Alluxio cluster through other Alluxio clients or other Alluxio FUSE mount points may not be immediately visible to the current Alluxio FUSE cache. This can cause cached data to become stale. Some examples are listed below:
metadata cache: file or directory metadata such as size or modification timestamp cached on Node A might be stale if the file is being modified concurrently by an application on Node B.
data cache: Node A may read a cached file without knowing that Node B has already deleted or overwritten the file in the underlying Alluxio cluster. When this happens, the content read by Node A is stale.
Metadata cache may significantly improve read performance in training, especially when loading a large number of small files repeatedly. The FUSE kernel issues extra metadata read operations (sometimes 3 to 7 times more than the Alluxio Java API) when applications perform metadata operations or even data operations. Even a 1-minute temporary metadata cache may double the metadata read throughput or small-file data loading throughput.
The security of the Alluxio POSIX API does not exactly follow the POSIX standard. This is a known limitation and we are working to improve it.
All file/dir permissions in Alluxio POSIX API are checked against the user launching the AlluxioFuse process instead of the end user running the operations.
User group policies decide the user/group of the created file/dir and the user/group shown in the get file/dir path status operations.
Three user group policies can be chosen from:

| | Launch User Group Policy | System User Group Policy | Custom User Group Policy |
| --- | --- | --- | --- |
| Security Guard | Weak | Strong | Weak |
| Performance Overhead | Low | High. Each create/list file/dir operation needs to do user/group translation | Low |
| The user/group of the file/dir created through Alluxio POSIX API | The user/group that launches the Alluxio FUSE application | The user/group that runs the file/dir creation operation | The configured custom user/group |
| The user/group of the file/dir listed through Alluxio POSIX API | The user/group that launches the Alluxio FUSE application | The actual file/dir user/group, or -1 if the user/group is not found in the local system | The configured custom user/group |
The detailed configuration and example usage are listed below:
Alluxio now supports both libfuse2 and libfuse3. Alluxio FUSE on libfuse2 is more stable and has been tested in production. Alluxio FUSE on libfuse3 is currently experimental but under active development. Alluxio will focus more on libfuse3 and utilize the new features it provides.
If only one version of libfuse is installed, that version is used. In most distros, libfuse2 and libfuse3 can coexist. If both versions are installed, libfuse2 will be used by default (for backward compatibility).
To set the version explicitly, add the following configuration in ${ALLUXIO_HOME}/conf/alluxio-site.properties:
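A sketch of the property (the property name alluxio.fuse.jnifuse.libfuse.version is an assumption; verify it against your Alluxio version's configuration reference):

```properties
# Assumed property name; 2 = libfuse2 only, 3 = libfuse3 only
alluxio.fuse.jnifuse.libfuse.version=2
```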
Valid values are 2 (use libfuse2 only), 3 (use libfuse3 only), or any other integer value (load libfuse2 first; if that fails, load libfuse3).
See logs/fuse.out to check which version is in use.
You can use alluxio-fuse mount -o [comma separated mount options] to set mount options when launching the standalone FUSE process. If no mount option is provided, the value of the Alluxio configuration alluxio.fuse.mount.options (default: direct_io) will be used.
Different versions of libfuse and osxfuse may support different mount options. The available Linux mount options are listed here. The mount options of MacOS with osxfuse are listed here. Some mount options (e.g. allow_other and allow_root) need additional set-up, and the set-up process may differ depending on the platform.
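For example, a hedged sketch of passing mount options (the option names shown are standard libfuse options; note that allow_other typically requires user_allow_other to be enabled in /etc/fuse.conf on Linux, and the argument order here is assumed):

```shell
# Sketch: pass comma-separated mount options to the standalone FUSE process
$ integration/fuse/bin/alluxio-fuse mount -o allow_other,ro /mnt/people /people
```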
These are the configuration parameters for Alluxio POSIX API.
Alluxio FUSE has two kinds of unmount operation: soft unmount and hard unmount.
The unmount operation is a soft unmount by default.
You can use -w [unmount_wait_timeout_in_seconds] to set the unmount wait time in seconds. The unmount operation will kill the FUSE process and wait up to [unmount_wait_timeout_in_seconds] for it to be killed. However, if the FUSE process is still alive after the wait timeout, the unmount operation will error out.
In the Alluxio FUSE implementation, alluxio.fuse.umount.timeout (default value: 0) defines the maximum time to wait for all in-progress read/write operations to finish. If there are still in-progress read/write operations left after the timeout, the alluxio-fuse umount <mount_point> operation is a no-op: the Alluxio FUSE process keeps running, and the FUSE mount point keeps functioning. Note that when alluxio.fuse.umount.timeout=0 (the default), umount operations do not wait for in-progress read/write operations.
It is recommended to set -w [unmount_wait_timeout_in_seconds] to a value slightly larger than alluxio.fuse.umount.timeout.
Hard unmount always kills the FUSE process and unmounts the FUSE mount point immediately.
This section discusses how to troubleshoot issues related to the Alluxio POSIX API. Note that errors or problems with the Alluxio POSIX API may come from the underlying Alluxio system. For general troubleshooting guidelines, please refer to the troubleshooting documentation.
When encountering an out-of-direct-memory issue, add the following JVM opts to ${ALLUXIO_HOME}/conf/alluxio-env.sh to increase the maximum amount of direct memory.
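A sketch of such an entry (the 8G value is an illustrative assumption; size it to your workload):

```shell
# Append to ${ALLUXIO_HOME}/conf/alluxio-env.sh; the value is illustrative
ALLUXIO_FUSE_JAVA_OPTS="$ALLUXIO_FUSE_JAVA_OPTS -XX:MaxDirectMemorySize=8G"
```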
Depending on the Fuse deployment type, Fuse metrics can be exposed as worker metrics (Fuse on worker process) or client metrics (Standalone FUSE process). Check out the metrics introduction doc for how to get Fuse metrics.
Fuse metrics include Fuse specific metrics and general client metrics. Check out the Fuse metrics list about more details of what metrics are recorded and how to use those metrics.
Each user I/O operation can be translated into a sequence of FUSE operations. Operations taking longer than alluxio.user.logging.threshold (default 10s) are logged as warnings to users.
Sometimes Fuse error comes from unexpected Fuse operation combinations. In this case, enabling debug logging in FUSE operations helps understand the sequence and shows time elapsed of each Fuse operation.
For example, a typical flow to write a file, as seen by FUSE, is an initial Fuse.create which creates the file, followed by a sequence of Fuse.write calls to write data to that file, and lastly a Fuse.release to close the file and commit it to the Alluxio file system.
One can set alluxio.fuse.debug.enabled=true in ${ALLUXIO_HOME}/conf/alluxio-site.properties before mounting Alluxio FUSE to enable debug logging.
For more information about logging, please check out this page.
The following diagram shows the stack when using Alluxio POSIX API:
Essentially, the Alluxio POSIX API is implemented as a FUSE integration, which is simply a long-running Alluxio client. In this stack, performance overhead can be introduced in one or more of the following components:
Application
Fuse library
Alluxio related components
It is very helpful to understand the following questions with respect to how the applications interact with Alluxio POSIX API:
How are the applications accessing the Alluxio POSIX API? Is it mostly reads, writes, or a mixed workload?
Is the access heavy in data or metadata?
Is the concurrency level sufficient to sustain high throughput?
Is there any lock contention?
Fuse, especially the libfuse and FUSE kernel code, may also introduce performance overhead. Based on our investigation and mdtest benchmarking, libfuse with local filesystem implementation does not scale well in terms of metadata read/write operations. For example, create file operation throughput of libfuse with local filesystem implementation peaks at 2 processes and get file status operation throughput peaks around 4 to 12 processes. Higher concurrency may lead to worse performance.
libfuse worker threads
The concurrency on the Alluxio POSIX API is the joint effort of:
The concurrency of application operations interacting with the FUSE kernel code and libfuse
The concurrency of libfuse worker threads interacting with the Alluxio POSIX API, limited by the MAX_IDLE_THREADS libfuse configuration
Enlarge MAX_IDLE_THREADS to make sure it is not the performance bottleneck. One can use jstack or visualvm to see how many libfuse threads exist and whether they keep being created/destroyed.
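A hedged sketch of inspecting the live threads with jstack (the process lookup and the thread-name pattern are assumptions; inspect the full jstack output to identify the actual libfuse worker thread names):

```shell
# Find the AlluxioFuse pid, then dump its threads for inspection
$ FUSE_PID=$(jps | grep AlluxioFuse | awk '{print $1}')
$ jstack $FUSE_PID | grep 'Thread'
```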
Alluxio general performance tuning provides more information about how to investigate and tune the performance of Alluxio Java client and servers.
Clock time tracing
Tracing is a good method to understand which operation consumes most of the clock time.
From the Fuse.<FUSE_OPERATION_NAME> metrics documented in the Fuse metrics doc, we can know how long each operation takes and which operation(s) dominate the time spent in Alluxio. For example, if the application is metadata-heavy, Fuse.getattr or Fuse.readdir may have a much longer total duration than other operations. If the application is data-heavy, Fuse.read or Fuse.write may consume most of the clock time. Fuse metrics help narrow down the performance investigation target.
If Fuse.read consumes most of the clock time, enable the Alluxio property alluxio.user.block.read.metrics.enabled=true, and the Alluxio metric Client.BlockReadChunkRemote will be recorded. This metric shows the duration statistics of reading data from remote workers via gRPC.
If the application spends relatively long time in RPC calls, try enlarging the client pool sizes Alluxio properties based on the workload.
If thread pool size is not the limitation, try enlarging the CPU/memory resources. GRPC threads consume CPU resources.
One can follow the Alluxio opentelemetry doc to trace the gRPC calls. If some gRPC calls take an extremely long time and only a small fraction of that time is spent doing actual work, there may be too many concurrent gRPC calls or high resource contention. If a long time is spent fulfilling the gRPC requests, we can jump to the server side to see where the slowness comes from.
CPU/memory/lock tracing
Async Profiler can trace the following kinds of events:
CPU cycles
Allocations in Java Heap
Contended lock attempts, including both Java object monitors and ReentrantLocks
Install async-profiler and run the following commands to profile the target Alluxio process:
-d defines the duration; try to cover the whole POSIX API testing duration
-e defines the profiling target
-f defines the file name to dump the profile information to
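Putting the flags together, a hedged sketch of invoking async-profiler against the AlluxioFuse process (the duration, output path, and pid lookup are assumptions):

```shell
# Profile CPU for the whole test window (300s assumed) and dump a flame graph
$ ./profiler.sh -d 300 -e cpu -f /tmp/fuse_cpu.html $(jps | grep AlluxioFuse | awk '{print $1}')
# Other events: -e alloc (heap allocations), -e lock (lock contention)
```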