MLPerf Storage is a benchmark suite designed to characterize the performance of storage systems supporting machine learning workloads. This document describes how to conduct end-to-end testing of Alluxio using MLPerf Storage.
Results Summary
The following results are based on MLPerf Storage v0.5, using A100 GPUs as the accelerator.
| Model | Accelerators (GPUs) | Dataset | AU | Throughput (MB/sec) | Throughput (samples/sec) |
|--------|---------------------|---------|---------|---------------------|--------------------------|
| bert   | 1                   | 1.3 TB  | 99%     | 0.1                 | 49.3                     |
| unet3d | 1                   | 719 GB  | 99%     | 409.5               | 2.9                      |
| bert   | 128                 | 2.4 TB  | 98%     | 14.8                | 6217                     |
| unet3d | 20                  | 3.8 TB  | 97%-99% | 7911.4              | 56.59                    |
The test results were obtained on an Alluxio cluster with the following configuration, with all server instances running on AWS:
Alluxio Cluster: one Alluxio FUSE node and two Alluxio worker nodes.
We recommend generating the dataset locally and then uploading it to remote storage. First, determine the size of the dataset to generate:
```shell
# Don't forget to replace the parameters with your own.
./benchmark.sh datasize --workload unet3d --num-accelerators 4 --host-memory-in-gb 32
```
- `workload`: The benchmark workload to run; options are `unet3d` and `bert`.
- `num-accelerators`: The number of simulated GPUs. A larger number runs more processes on a single machine, which shortens training time for a dataset of a given size but increases the demand on storage I/O.
- `host-memory-in-gb`: The simulated host memory size. It can be set freely, even exceeding the actual memory of your machine; larger memory sizes produce larger datasets and require longer training times.
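Before launching generation, it can help to print the exact commands that will run for each workload. The loop below is a hypothetical dry-run helper, not part of MLPerf Storage itself; the flag values (4 accelerators, 32 GB host memory) are illustrative examples:

```shell
#!/bin/sh
# Dry run: print the datasize command for each supported workload.
# Assumes benchmark.sh sits in the current directory, as in this guide.
# Flag values below are examples -- replace them with your own.
for workload in unet3d bert; do
  echo "./benchmark.sh datasize --workload ${workload} --num-accelerators 4 --host-memory-in-gb 32"
done
```

Once the printed commands look right, remove the `echo` to execute them for real.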
After generating the dataset locally, upload it to the under file system (UFS).
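As an illustration, if the UFS is S3, the AWS CLI's `sync` command can perform the upload. The bucket name and local path below are placeholders, not values from this guide; `--dryrun` previews the transfers without moving any data:

```shell
# Preview which files would be uploaded (no data is transferred).
aws s3 sync ./unet3d_data s3://your-ufs-bucket/datasets/unet3d --dryrun

# Perform the actual upload.
aws s3 sync ./unet3d_data s3://your-ufs-bucket/datasets/unet3d
```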
Configuring Alluxio
We recommend using Alluxio version 3.1 or later for MLPerf testing. Additionally, we recommend setting the following configurations in alluxio-site.properties for optimal read performance: