Use Cases
Many leading companies around the world run Alluxio in production to extract value from their data. Some of them are listed on our Powered-By page. This section introduces some of the most common Alluxio use cases.
Many organizations run analytics and machine learning workloads (Spark, Presto, Hive, TensorFlow, etc.) on object storage in the public cloud (AWS S3, Google Cloud, or Microsoft Azure).
Though cloud object stores are often more cost-effective, easier to use, and easier to scale, there are some challenges:
Performance is variable, and consistent SLAs are hard to achieve
Metadata operations are expensive and slow down workloads
Embedded caching is ineffective for ephemeral clusters
Alluxio addresses these challenges by providing intelligent multi-tiered caching and metadata management. Deploying Alluxio on the compute cluster helps:
Achieve consistent performance for analytics engines
Reduce AI training time and cost
Eliminate repeated storage access costs
Achieve off-cluster caching for ephemeral workloads
See this example use case from Electronic Arts.
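The effect of caching on repeated storage access can be illustrated with a small simulation. The following is a minimal sketch of a generic read-through LRU cache, not Alluxio's implementation or API. The class names `CountingObjectStore` and `ReadThroughCache` and the object keys are hypothetical and exist only for this illustration.

```python
from collections import OrderedDict


class CountingObjectStore:
    """Hypothetical stand-in for a remote object store; counts GET requests."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.get_requests = 0

    def get(self, key):
        self.get_requests += 1
        return self._objects[key]


class ReadThroughCache:
    """Hypothetical LRU read-through cache between compute and the store."""

    def __init__(self, store, capacity):
        self._store = store
        self._capacity = capacity
        self._entries = OrderedDict()

    def read(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        data = self._store.get(key)  # cache miss: one remote request
        self._entries[key] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return data


store = CountingObjectStore({"part-0": b"a", "part-1": b"b"})
cache = ReadThroughCache(store, capacity=2)

# Three passes over the same two objects, as in repeated training epochs.
for _ in range(3):
    for key in ("part-0", "part-1"):
        cache.read(key)

print(store.get_requests)  # 2: only the first pass reaches the object store
```

In this sketch the working set fits in the cache, so every pass after the first is served locally. When the working set exceeds the cache capacity, some reads still go to the object store.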
Running data-driven applications on top of an object store deployed on-premises brings the following challenges:
Poor performance for analytics and AI workloads
Insufficient native support for popular frameworks
Expensive and slow metadata operations
Alluxio solves these problems by providing caching and API translation. Deploying Alluxio on the application side brings:
Improved performance for analytics and AI workloads
Flexibility of segregated storage
Support for multiple APIs with no changes to the end-user experience
Reduced overall storage cost
See this example use case from DBS.
As more organizations migrate to the cloud, a common intermediate step is to use compute resources in the cloud while retrieving data from on-premises data sources. However, this hybrid architecture brings the following problems:
Data access across the network is slow and inconsistent
Copying data to cloud storage is time-consuming, error-prone, and complex
Compliance and data sovereignty requirements may prohibit copying data into the cloud
Alluxio provides "zero-copy" cloud bursting, which enables compute engines in the cloud to access on-premises data directly. No persistent copy of the data has to be kept in the cloud and periodically synchronized with the original data on-premises. This brings the following benefits:
Performance as if the data were local to the cloud compute cluster
No changes to end-user experience and security model
Common data access layer with access-based or policy-based data movement
Utilization of elastic cloud compute resources and cost savings
See this example use case from Walmart.
Another hybrid cloud architecture is to access cloud storage from a private datacenter. This architecture usually causes the following problems:
No unified view for cloud and on-premises storage
Prohibitively high network egress costs
Inability to utilize compute on-premises for data in the cloud
Inadequate performance for analytics and AI
Alluxio solves these problems by acting as a hybrid cloud storage gateway that lets on-premises compute work on data in the cloud. When deployed alongside the on-premises compute, Alluxio manages the compute cluster’s storage and provides data locality to applications, achieving:
High performance for reads and writes using intelligent distributed caching
Network cost savings by eliminating replication
No changes to the end-user experience with flexible APIs and security model on cloud storage
See this example use case from Comcast.
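One way to picture a unified view over cloud and on-premises storage is a mount table that maps paths in a single namespace to locations in different backing stores. The following is a minimal sketch of that idea using longest-prefix matching. It is an illustration only, not Alluxio's API. The `resolve` function, the mount points, the bucket name, and the HDFS address are all hypothetical.

```python
# Hypothetical mount table: namespace path -> backing store URI.
MOUNT_TABLE = {
    "/": "hdfs://onprem-namenode:8020/",
    "/cloud": "s3://example-bucket/data",
}


def resolve(mount_table, path):
    """Map a path in the unified namespace to a URI in a backing store.

    The longest matching mount point wins, so a nested mount such as
    "/cloud" overrides the root mount for paths beneath it.
    """
    matches = [
        mount_point
        for mount_point in mount_table
        if path == mount_point or path.startswith(mount_point.rstrip("/") + "/")
    ]
    if not matches:
        raise KeyError(f"no mount point covers {path}")
    mount_point = max(matches, key=len)
    relative = path[len(mount_point):].lstrip("/")
    base = mount_table[mount_point].rstrip("/")
    return f"{base}/{relative}" if relative else base


print(resolve(MOUNT_TABLE, "/cloud/logs/a.parquet"))
print(resolve(MOUNT_TABLE, "/warehouse/t1"))
```

Applications address both stores through one path scheme, and only the mount table knows which store actually holds the data.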
Many organizations maintain satellite compute clusters that are independent of their main data cluster for the purposes of performance, security, or resource isolation. These satellite clusters need to access data remotely from the main cluster, which is challenging because:
Cross-datacenter copies are manual and time-consuming
Unnecessary network traffic for replication is expensive
Replication jobs on an overloaded storage cluster dramatically impact the performance of existing workloads
Alluxio can be deployed on the compute nodes in the satellite cluster and configured to connect to the main data cluster, serving as one logical copy of the data. This results in:
No redundant data copies across datacenters
Elimination of complex data synchronization
Improved performance compared to remote region data access
Self-service data infrastructure across business units