Benchmarking ML Training Performance with MLPerf
MLPerf™ Storage is a benchmark suite that characterizes the performance of storage systems supporting AI workloads. The suite currently covers the AI Training workload category: it simulates the I/O patterns of models such as 3D U-Net, ResNet-50, and CosmoFlow to evaluate storage throughput and I/O efficiency.
This guide explains how to use the MLPerf Storage v2.0 benchmark to test the performance of an Alluxio cluster.
Benchmark Highlights
The following results were achieved using the MLPerf Storage v2.0 benchmark (https://mlcommons.org/benchmarks/storage/), with data fully cached in Alluxio and H100 GPUs as the training accelerators. The "Accelerator Utilization (AU)" metric indicates how effectively the storage system kept the GPUs busy.
| Workload | # Accelerators (H100) | Dataset Size | # Client Nodes | # Alluxio Workers | Accelerator Utilization (AU) | Total Throughput (GB/s) | Throughput per Accelerator (GB/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 | 16 | 500 GB | 1 | 2 | 99.57% | 3.03075 | 0.1894 |
| ResNet-50 | 32 | 1.0 TB | 2 | 2 | 99.57% | 6.05086 | 0.1891 |
| ResNet-50 | 128 | 4.0 TB | 8 | 8 | 99.57% | 24.1364 | 0.1886 |
| U-Net3D | 1 | 500 GB | 1 | 2 | 99.02% | 2.92255 | 2.9225 |
| U-Net3D | 2 | 1.0 TB | 2 | 2 | 99.02% | 5.80375 | 2.8946 |
| U-Net3D | 8 | 4.0 TB | 8 | 8 | 99.02% | 23.1569 | 2.8946 |
| CosmoFlow | 5 | 500 GB | 1 | 2 | 74.97% | 2.69356 | 0.5387 |
| CosmoFlow | 8 | 1.0 TB | 8 | 8 | 74.97% | 4.31414 | 0.5393 |
Test Environment
The benchmark results were generated using the following environment, with all instances deployed in the same AWS availability zone.
Alluxio Cluster:
Worker Nodes: i3en.12xlarge (48 cores, 384 GiB RAM, 3000 GB NVMe SSDs)
FUSE Client Node: c5n.9xlarge (36 cores, 96 GiB RAM, 50 GB EBS)
Operating System: Ubuntu 24.04
Setup and Configuration
1. Install MLPerf Storage Tools
On the client node where you will run the benchmark:
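The exact install commands were not captured here. As a sketch, installing the v2.0 tooling from the MLCommons storage repository might look like the following; the branch/tag name and the editable pip install are assumptions, so check the repository's README for the authoritative steps:

```shell
# Clone the MLPerf Storage repository (the v2.0 tag name is an assumption;
# check https://github.com/mlcommons/storage for the actual release tag)
git clone -b v2.0 https://github.com/mlcommons/storage.git mlperf-storage
cd mlperf-storage

# Install the benchmark and its dependencies (including dlio_benchmark)
pip install -e .
```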
The working directory structure is as follows:
The benchmark simulation is performed by the dlio_benchmark code, a benchmark suite for emulating the I/O patterns of deep learning workloads. dlio_benchmark is listed as a prerequisite pinned to a specific git branch; a future release will update the installer to pull DLIO from PyPI. The DLIO configuration of each workload is specified through a YAML file, and the configs of all MLPerf Storage workloads can be found in the configs folder.
2. Configure Alluxio
For optimal read performance during ML training, we recommend setting the following properties in your conf/alluxio-site.properties file on your Alluxio cluster nodes.
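The specific property values were not captured in this section. As an illustrative sketch only, read-oriented settings for an Alluxio 2.x deployment might look like the following; verify each property name and default against the configuration reference for your Alluxio version before applying:

```shell
# conf/alluxio-site.properties -- illustrative examples only

# Cache file metadata on the client/FUSE side to reduce master RPCs
alluxio.user.metadata.cache.enabled=true
alluxio.user.metadata.cache.expiration.time=2day

# Cache data read from the UFS into Alluxio storage
alluxio.user.file.readtype.default=CACHE
```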
Before running the benchmark, ensure that:
The Alluxio FUSE process is running on the client node.
The training dataset has been fully loaded into the Alluxio cache.
Running the Benchmark
The benchmark process involves generating a synthetic dataset and then running the training simulation against it.
Step 1: Generate the Dataset
Note: The steps described in this section must be run on only one client host (the launcher client). The datasize command's result depends on the accelerator being emulated, the maximum number of accelerators to support, the system memory of the benchmark clients, and the number of benchmark clients. Two rules generally dictate the dataset size:
1. The dataset capacity must be at least 5x the cumulative system memory of the benchmark clients.
2. The benchmark must run for 500 iterations of the given batch size for all GPUs.
If a list of clients is passed to this command, the amount of memory is determined programmatically. Otherwise, the user can provide the number of clients and the amount of memory per client for the calculation.
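The two sizing rules can be sketched as a small calculation. This is a simplified model, not the official tool's logic; the function name and the batch/sample parameters are illustrative, and the `datasize` command remains the authoritative calculator:

```python
def min_dataset_size_gb(num_clients: int, mem_per_client_gb: float,
                        num_accelerators: int, batch_size: int,
                        sample_size_mb: float) -> float:
    """Estimate the minimum dataset size implied by the two rules above.

    Illustrative only: the official `datasize` command performs the
    authoritative calculation for each workload.
    """
    # Rule 1: dataset must be at least 5x the cumulative client memory.
    rule1_gb = 5 * num_clients * mem_per_client_gb
    # Rule 2: enough samples for 500 iterations of the batch size on all GPUs.
    samples_needed = 500 * batch_size * num_accelerators
    rule2_gb = samples_needed * sample_size_mb / 1024
    # The dataset must satisfy both rules, so take the larger requirement.
    return max(rule1_gb, rule2_gb)

# 2 clients with 128 GB each, 8 accelerators, batch 4, ~140 MB samples:
# rule 1 requires 1280 GB, rule 2 requires 2187.5 GB, so rule 2 dominates.
print(min_dataset_size_gb(2, 128, 8, 4, 140))
```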
Example: to calculate the minimum dataset size for the unet3d model running on 2 client machines with 128 GB of memory each and 8 simulated a100 accelerators overall:
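Assuming the v2.0 release ships the `mlpstorage` command-line tool, the invocation might look like the following; the flag names are assumptions and should be checked against `mlpstorage training datasize --help`:

```shell
# Estimate the minimum dataset size for the scenario described above
# (flag names are assumptions -- verify with --help)
mlpstorage training datasize \
  --model unet3d \
  --num-client-hosts 2 \
  --client-host-memory-in-gb 128 \
  --max-accelerators 8 \
  --accelerator-type a100 \
  --results-dir ./results
```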
Synthetic data is generated based on the workload requested by the user.
Example: to generate training data of 56,000 files for the unet3d workload into the unet3d_data directory, using 8 parallel jobs distributed across 2 nodes:
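A sketch of the corresponding command, again assuming the v2.0 `mlpstorage` CLI; the flag names and the host IPs (reused from the run example later in this guide) are assumptions to be verified against `mlpstorage training datagen --help`:

```shell
# Generate the synthetic unet3d dataset across 2 nodes with 8 parallel jobs
# (flag names are assumptions -- verify with --help)
mlpstorage training datagen \
  --hosts 10.117.61.121,10.117.61.165 \
  --model unet3d \
  --num-processes 8 \
  --data-dir ./unet3d_data \
  --results-dir ./results
```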
After generating the dataset, upload it to your UFS and ensure it is loaded into Alluxio.
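The load step depends on your Alluxio version and UFS; as an illustrative sketch for an Alluxio 3.x (DORA) deployment, with `<bucket>` as a placeholder for your actual UFS bucket:

```shell
# Submit a distributed load job to pull the dataset into the Alluxio cache
# (illustrative; <bucket> is a placeholder, and the command syntax should be
# verified against your Alluxio version's documentation)
bin/alluxio job load --path s3://<bucket>/unet3d_data --submit

# Check load progress until the dataset is fully cached
bin/alluxio job load --path s3://<bucket>/unet3d_data --progress
```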
Step 2: Run a Training Benchmark
Example:
To run the benchmark for the unet3d workload, with data located in the unet3d_data directory, using 2 h100 accelerators spread across 2 client hosts (with IPs 10.117.61.121 and 10.117.61.165), and writing results to the unet3d_results directory:
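A sketch of this run command, assuming the v2.0 `mlpstorage` CLI; the flag names are assumptions to be verified against `mlpstorage training run --help`:

```shell
# Run the unet3d training simulation across the two client hosts
# (flag names are assumptions -- verify with --help)
mlpstorage training run \
  --hosts 10.117.61.121,10.117.61.165 \
  --model unet3d \
  --accelerator-type h100 \
  --num-accelerators 2 \
  --data-dir ./unet3d_data \
  --results-dir ./unet3d_results
```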
Step 3: Generate the Benchmark Report
The benchmark submission report is generated by aggregating the individual run results. The reporting command provides the functions to generate a report for a given results directory.
Note: The reportgen script must be run on the launcher client host.
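A sketch of the report generation step, assuming the v2.0 `mlpstorage` CLI; the subcommand and flag names are assumptions to be verified with `--help`:

```shell
# Aggregate the individual run results into a submission report
# (subcommand and flag names are assumptions -- verify with --help)
mlpstorage reports reportgen --results-dir ./unet3d_results
```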