Benchmarking ML Training Performance with MLPerf
The MLPerf™ Storage benchmark is a suite designed by MLCommons® to measure how well a storage system performs for realistic machine learning (ML) training workloads. It simulates the I/O patterns of models such as BERT and U-Net3D to evaluate storage throughput and I/O efficiency.
This guide explains how to use the MLPerf Storage benchmark to test the performance of an Alluxio cluster.
Benchmark Highlights
The following results were achieved using the MLPerf Storage v0.5 benchmark, with data fully cached in Alluxio and A100 GPUs as the training accelerators. The "Accelerator Utilization (AU)" metric indicates how effectively the storage system kept the GPUs busy.
| Model | # of Accelerators | Dataset Size | Accelerator Utilization (AU) | Throughput (MB/s) | Throughput (samples/s) |
| --- | --- | --- | --- | --- | --- |
| BERT | 1 | 1.3 TB | 99% | 0.1 | 49.3 |
| BERT | 128 | 2.4 TB | 98% | 14.8 | 6,217 |
| U-Net3D | 1 | 719 GB | 99% | 409.5 | 2.9 |
| U-Net3D | 20 | 3.8 TB | 97%-99% | 7,911.4 | 56.59 |
Test Environment
The benchmark results were generated using the following environment, with all instances deployed in the same AWS availability zone.
Alluxio Cluster:
- 2 Worker Nodes (i3en.metal: 96 cores, 768 GB RAM, 8 NVMe SSDs)
- 1 FUSE Client Node (c6in.metal: 128 cores, 256 GB RAM)
Operating System: Ubuntu 22.04
Setup and Configuration
1. Install MLPerf Storage Tools
On the client node where you will run the benchmark:
# Install MPI dependency
sudo apt-get install mpich
# Clone the benchmark repository and install Python dependencies
git clone -b v0.5 --recurse-submodules https://github.com/mlcommons/storage.git
cd storage
pip3 install -r dlio_benchmark/requirements.txt
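A quick sanity check to confirm the dependencies are in place, assuming mpirun and pip3 are on your PATH:
# Confirm the MPI runtime is available
mpirun --version
# Confirm the installed Python packages have no dependency conflicts
pip3 check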
2. Configure Alluxio
For optimal read performance during ML training, we recommend setting the following properties in your conf/alluxio-site.properties file on your Alluxio cluster nodes.
alluxio.user.position.reader.streaming.async.prefetch.enable=true
alluxio.user.position.reader.streaming.async.prefetch.thread=256
alluxio.user.position.reader.streaming.async.prefetch.part.length=4MB
alluxio.user.position.reader.streaming.async.prefetch.max.part.number=4
Before running the benchmark, ensure that:
The Alluxio FUSE process is running on the client node.
The training dataset has been fully loaded into the Alluxio cache.
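As a reference, here is a minimal sketch of how these prerequisites might be checked, assuming an Alluxio 2.x-style CLI, a hypothetical FUSE mount point of /mnt/alluxio-fuse, and a hypothetical dataset path of /mlperf/unet3d; the exact commands depend on your Alluxio version and deployment.
# Mount the Alluxio namespace on the client node (mount point is hypothetical)
sudo integration/fuse/bin/alluxio-fuse mount /mnt/alluxio-fuse /
# Confirm the FUSE process is running
ps aux | grep alluxio-fuse
# Check that the dataset directory reports as fully cached in Alluxio (path is hypothetical)
bin/alluxio fs ls /mlperf/unet3d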
Running the Benchmark
The benchmark process involves generating a synthetic dataset and then running the training simulation against it.
Step 1: Generate the Dataset
First, determine the required dataset size based on the number of simulated accelerators and the client's host memory; the dataset must be large enough that reads cannot simply be served from the client's page cache.
# Example for U-Net3D with 4 simulated accelerators
./benchmark.sh datasize --workload unet3d --num-accelerators 4 --host-memory-in-gb 32
This command will output the number of files needed. Use this value to generate the actual data files.
# Example data generation command
./benchmark.sh datagen --workload unet3d --num-parallel 8 --param dataset.num_files_train=1600 --param dataset.data_folder=${dataset.data_folder}
After generating the dataset, upload it to your UFS and ensure it is loaded into Alluxio.
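For example, a sketch of this step assuming the UFS is an S3 bucket (the bucket name and paths below are hypothetical) and an Alluxio 2.x-style CLI; on other Alluxio versions the load command may differ.
# Copy the generated files to the under storage (bucket and paths are hypothetical)
aws s3 cp --recursive ./unet3d_data s3://example-bucket/mlperf/unet3d/
# Warm the Alluxio cache across workers before benchmarking
bin/alluxio fs distributedLoad /mlperf/unet3d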
Step 2: Run the Benchmark
Execute the benchmark using the run command. The data_folder parameter should point to the dataset within the Alluxio FUSE mount.
./benchmark.sh run --workload unet3d --num-accelerators 4 --results-dir ${results-dir} --param dataset.data_folder=${dataset.data_folder} --param dataset.num_files_train=${dataset.num_files_train}
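For instance, with a hypothetical FUSE mount at /mnt/alluxio-fuse and 1,600 training files, the invocation might look like this:
# Example run with hypothetical paths and values filled in
./benchmark.sh run --workload unet3d --num-accelerators 4 \
  --results-dir ./results/unet3d-run1 \
  --param dataset.data_folder=/mnt/alluxio-fuse/mlperf/unet3d \
  --param dataset.num_files_train=1600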
Step 3: Review and Aggregate Results
After a run completes, a summary.json file is created in your results directory. This file contains detailed metrics, including GPU utilization (train_au_percentage) and throughput.
Example summary.json:
{
  "model": "unet3d",
  "start": "2024-05-27T14:46:24.458325",
  "num_accelerators": 20,
  "metric": {
    "train_au_percentage": [
      99.18125818824699,
      99.01649117920554,
      ...
    ],
    "train_au_mean_percentage": 98.74588296364462,
    "train_throughput_mean_samples_per_second": 56.90265822935148,
    "train_io_mean_MB_per_second": 7955.518180172248
  },
  ...
}
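To pull the headline metrics from a single run without opening the file, a small sketch using jq works (assuming jq is installed; the exact location of summary.json under your results directory may vary):
# Extract mean accelerator utilization and throughput from one run's summary
jq '.metric.train_au_mean_percentage' ./results/unet3d-run1/summary.json
jq '.metric.train_throughput_mean_samples_per_second' ./results/unet3d-run1/summary.json
jq '.metric.train_io_mean_MB_per_second' ./results/unet3d-run1/summary.json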
To get a final result, the benchmark should be run multiple times (e.g., 5 times). Organize the output directories from each run and use the reportgen command to produce an aggregated summary.
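For example, if each run wrote to its own directory, the runs could be collected under one folder before aggregation (the directory names and layout below are assumptions):
# Gather the per-run result directories into one folder (names are hypothetical)
mkdir -p sample-results
cp -r ./results/unet3d-run{1..5} sample-results/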
# Aggregate results from multiple runs
./benchmark.sh reportgen --results-dir sample-results
This will generate a final JSON output with the overall mean and standard deviation for throughput and other key metrics across all runs.