MLPerf Storage Benchmark
MLPerf Storage Benchmark Overview
MLPerf Storage is a benchmark suite designed to characterize the performance of storage systems supporting machine learning workloads. This document describes how to conduct end-to-end testing of Alluxio using MLPerf Storage.
Results Summary
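| Model | Accelerators | Dataset size | Accelerator utilization (AU) | Throughput (MB/s) | Throughput (samples/s) |
| --- | --- | --- | --- | --- | --- |
| bert | 1 | 1.3 TB | 99% | 0.1 | 49.3 |
| unet3d | 1 | 719 GB | 99% | 409.5 | 2.9 |
| bert | 128 | 2.4 TB | 98% | 14.8 | 6217 |
| unet3d | 20 | 3.8 TB | 97%-99% | 7911.4 | 56.59 |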
The test results are based on an Alluxio cluster configured as follows, with all server instances available on AWS:
Alluxio Cluster: One Alluxio Fuse node and two Alluxio Worker nodes.
Alluxio Worker Instance: i3en.metal (96 vCPUs, 768 GB memory, 100 Gbps network, 8 NVMe SSDs)
Alluxio Fuse Instance: c6in.metal (128 vCPUs, 256 GB memory, 200 Gbps network)
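The figures in the Results Summary can be cross-checked with a little arithmetic. The sketch below assumes that the last two values of each row are throughput in MB/s and in samples/s; that reading is an interpretation, and the helper names are made up for illustration. It estimates how close the multi-accelerator runs come to linear scaling relative to the single-accelerator runs.

```python
# Figures copied from the Results Summary. Reading the last two values of each
# row as MB/s and samples/s is an assumption of this sketch.
RESULTS = [
    # (model, accelerators, throughput_mb_s, throughput_samples_s)
    ("bert", 1, 0.1, 49.3),
    ("unet3d", 1, 409.5, 2.9),
    ("bert", 128, 14.8, 6217.0),
    ("unet3d", 20, 7911.4, 56.59),
]


def scaling_efficiency(model):
    """Samples/s speedup over the 1-accelerator run, divided by the accelerator count."""
    runs = {acc: sps for name, acc, _, sps in RESULTS if name == model}
    base = runs[1]
    return {acc: (sps / base) / acc for acc, sps in runs.items() if acc != 1}


def mb_per_sample(model):
    """Implied average sample size in MB for each run of the given model."""
    return {acc: mb / sps for name, acc, mb, sps in RESULTS if name == model}


for model in ("bert", "unet3d"):
    for acc, eff in scaling_efficiency(model).items():
        print(f"{model}: {acc} accelerators, scaling efficiency {eff:.1%}")
```

A scaling efficiency close to 100% would be consistent with storage keeping up as the number of simulated accelerators grows.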
Preparing the Test Environment
Operating System Image: Ubuntu 22.04
Preparing MLPerf Storage Test Tools
Generating the Dataset
We recommend generating the dataset locally and then uploading it to remote storage. Determine the data size to generate:
workload: Options are unet3d and bert.
num-accelerators: The number of simulated GPUs. A larger value runs more processes on a single machine, which shortens training time for a dataset of a given size but increases the demand on storage I/O.
host-memory-in-gb: The simulated host memory size in GB. It can be set freely, even to a value larger than the machine's physical memory. Larger values produce larger datasets and longer training times.
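As a minimal sketch, the three parameters above can be assembled into a sizing command from a script. The `./benchmark.sh` entry point and its `datasize` subcommand are assumptions based on the MLPerf Storage v0.5 tooling, so check them against the version you installed. The values 20 and 256 are placeholders, and `datasize_command` is a hypothetical helper.

```python
import shlex

# Workloads named in this document.
WORKLOADS = ("unet3d", "bert")


def datasize_command(workload, num_accelerators, host_memory_in_gb,
                     script="./benchmark.sh"):
    """Build the argv list for the dataset sizing step.

    `script` and the `datasize` subcommand are assumptions of this sketch.
    """
    if workload not in WORKLOADS:
        raise ValueError(f"workload must be one of {WORKLOADS}")
    if num_accelerators < 1 or host_memory_in_gb < 1:
        raise ValueError("num_accelerators and host_memory_in_gb must be positive")
    return [
        script, "datasize",
        "--workload", workload,
        "--num-accelerators", str(num_accelerators),
        "--host-memory-in-gb", str(host_memory_in_gb),
    ]


cmd = datasize_command("unet3d", 20, 256)  # placeholder values
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the benchmark tools.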
After this command, you will get a result like:
Next, you can generate the corresponding dataset with the following command:
After generating the dataset locally, upload it to UFS.
Configuring Alluxio
We recommend using Alluxio version 3.1 or above for MLPerf testing. Additionally, we recommend setting the following configurations in alluxio-site.properties for optimal read performance:
For other Alluxio-related configurations, refer to the Fio Tests section.
You can configure one or more Alluxio Workers as a cache cluster.
Additionally, each MLPerf test node needs to start the Alluxio Fuse process to read data.
Ensure that the dataset has been completely loaded into the Alluxio cache from UFS.
Running the Test
After completing the test, you can find the summary.json file in the results-dir, similar to:
The train_au_percentage attribute reports accelerator utilization (AU): the percentage of time the simulated GPUs spend computing rather than waiting for data.
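To read this value programmatically, one option is a small script along these lines. It is a sketch that assumes only that summary.json is valid JSON containing a `train_au_percentage` key somewhere in its structure, holding either a number or a list of numbers. The helper names and the sample data are hypothetical, not real benchmark output.

```python
import json
from pathlib import Path
from statistics import mean


def find_key(node, key):
    """Depth-first search for `key` in nested dicts and lists; None if absent."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def mean_au(summary):
    """Average train_au_percentage, whether it is a single number or a list."""
    value = find_key(summary, "train_au_percentage")
    if value is None:
        raise KeyError("train_au_percentage not found")
    values = value if isinstance(value, list) else [value]
    return mean(float(v) for v in values)


def mean_au_from_file(path):
    """Load a summary.json file and return its mean AU."""
    return mean_au(json.loads(Path(path).read_text()))


# Hypothetical sample for illustration only.
sample = {"metric": {"train_au_percentage": [99.1, 98.7, 99.0]}}
print(f"mean AU: {mean_au(sample):.2f}%")
```

Searching for the key instead of hard-coding its location keeps the sketch independent of the exact layout of summary.json, which may differ between benchmark versions.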
Additionally, you can run the test multiple times and save the results in the following format:
Then, use the following command to aggregate the results of multiple tests:
The final aggregated result will look like this: