Benchmarking ML Training Performance with MLPerf

The MLPerf™ Storage benchmark is a suite designed by MLCommons® to measure how well a storage system performs for realistic machine learning (ML) training workloads. It simulates the I/O patterns of models like BERT and U-Net3D to evaluate storage throughput and I/O efficiency.

This guide explains how to use the MLPerf Storage benchmark to test the performance of an Alluxio cluster.

Benchmark Highlights

The following results were achieved using the MLPerf Storage v0.5 benchmark, with the dataset fully cached in Alluxio and A100 GPUs as the simulated training accelerators. The "Accelerator Utilization (AU)" metric indicates how effectively the storage system kept the GPUs busy.

| Model | # of Accelerators (GPUs) | Dataset Size | Accelerator Utilization (AU) | Throughput (MB/sec) | Throughput (samples/sec) |
| --- | --- | --- | --- | --- | --- |
| BERT | 1 | 1.3 TB | 99% | 0.1 | 49.3 |
| BERT | 128 | 2.4 TB | 98% | 14.8 | 6,217 |
| U-Net3D | 1 | 719 GB | 99% | 409.5 | 2.9 |
| U-Net3D | 20 | 3.8 TB | 97%-99% | 7,911.4 | 56.59 |

Test Environment

The benchmark results were generated using the following environment, with all instances deployed in the same AWS availability zone.

  • Alluxio Cluster:

    • 2 Worker Nodes (i3en.metal: 96 cores, 768GB RAM, 8 NVMe SSDs)

    • 1 FUSE Client Node (c6in.metal: 128 cores, 256GB RAM)

  • Operating System: Ubuntu 22.04

Setup and Configuration

1. Install MLPerf Storage Tools

On the client node where you will run the benchmark:

# Install MPI dependency
sudo apt-get install mpich

# Clone the benchmark repository and install Python dependencies
git clone -b v0.5 --recurse-submodules https://github.com/mlcommons/storage.git
cd storage
pip3 install -r dlio_benchmark/requirements.txt
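
Optionally, sanity-check the installation before moving on. The commands below only verify that MPI is on the PATH and that the dlio_benchmark submodule was checked out:

# Verify that MPI is available (mpich provides mpirun)
mpirun --version

# Verify that the dlio_benchmark submodule was checked out
ls dlio_benchmark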

2. Configure Alluxio

For optimal read performance during ML training, we recommend setting the following properties in your conf/alluxio-site.properties file on your Alluxio cluster nodes.

alluxio.user.position.reader.streaming.async.prefetch.enable=true
alluxio.user.position.reader.streaming.async.prefetch.thread=256
alluxio.user.position.reader.streaming.async.prefetch.part.length=4MB
alluxio.user.position.reader.streaming.async.prefetch.max.part.number=4

Before running the benchmark, ensure that:

  • The Alluxio FUSE process is running on the client node.

  • The training dataset has been fully loaded into the Alluxio cache (a quick check for both points is shown below).
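
A quick way to check both from the client node is to look for the FUSE mount and list the dataset through it. The mount point and dataset path below are examples; substitute the paths from your own deployment.

# Confirm the Alluxio FUSE mount is present
mount | grep -i alluxio

# List the training dataset through the FUSE mount (example path)
ls /mnt/alluxio-fuse/unet3d_data | head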

Running the Benchmark

The benchmark process involves generating a synthetic dataset and then running the training simulation against it.

Step 1: Generate the Dataset

First, determine the required dataset size based on the number of simulated accelerators and the amount of host memory on the client node.

# Example for U-Net3D with 4 simulated accelerators
./benchmark.sh datasize --workload unet3d --num-accelerators 4 --host-memory-in-gb 32

This command will output the number of files needed. Use this value to generate the actual data files.

# Example data generation command
./benchmark.sh datagen --workload unet3d --num-parallel 8 --param dataset.num_files_train=1600 --param dataset.data_folder=${dataset.data_folder}

After generating the dataset, upload it to your under file system (UFS) and ensure it is loaded into the Alluxio cache.
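
For example, with an S3 bucket as the UFS, the generated files can be uploaded with the AWS CLI. The bucket name and paths below are placeholders, and the command used to load the data into the Alluxio cache depends on your Alluxio version, so consult the load documentation for the release you are running.

# Upload the generated dataset (the dataset.data_folder used above) to the UFS
aws s3 sync <dataset.data_folder> s3://<your-ufs-bucket>/unet3d_data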

Step 2: Run the Benchmark

Execute the benchmark using the run command. The data_folder parameter should point to the dataset within the Alluxio FUSE mount.

./benchmark.sh run --workload unet3d --num-accelerators 4 --results-dir ${results-dir} --param dataset.data_folder=${dataset.data_folder} --param dataset.num_files_train=${dataset.num_files_train}
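
For illustration, a run with 4 simulated accelerators against a dataset exposed through a hypothetical FUSE mount at /mnt/alluxio-fuse might look like the following; the paths and file count are examples, so reuse the values from your own datasize and datagen steps.

./benchmark.sh run --workload unet3d --num-accelerators 4 \
  --results-dir unet3d-results/run1 \
  --param dataset.data_folder=/mnt/alluxio-fuse/unet3d_data \
  --param dataset.num_files_train=1600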

Step 3: Review and Aggregate Results

After a run completes, a summary.json file is created in your results directory. This file contains detailed metrics, including accelerator utilization (train_au_percentage) and throughput.

Example summary.json

{
  "model": "unet3d",
  "start": "2024-05-27T14:46:24.458325",
  "num_accelerators": 20,
  "metric": {
    "train_au_percentage": [
      99.18125818824699,
      99.01649117920554,
      ...
    ],
    "train_au_mean_percentage": 98.74588296364462,
    "train_throughput_mean_samples_per_second": 56.90265822935148,
    "train_io_mean_MB_per_second": 7955.518180172248
  },
  ...
}

To get a final result, the benchmark should be run multiple times (e.g., 5 times). Organize the output directories from each run and use the reportgen command to produce an aggregated summary.
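
One way to organize this, assuming a layout of one subdirectory per run under sample-results (adjust to whatever layout your version of reportgen expects), is a simple loop; the paths and parameters are the same illustrative values used above.

# Run the benchmark 5 times, writing each run to its own subdirectory
for i in 1 2 3 4 5; do
  ./benchmark.sh run --workload unet3d --num-accelerators 4 \
    --results-dir sample-results/run${i} \
    --param dataset.data_folder=/mnt/alluxio-fuse/unet3d_data \
    --param dataset.num_files_train=1600
done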

# Aggregate results from multiple runs
./benchmark.sh reportgen --results-dir sample-results

This will generate a final JSON output with the overall mean and standard deviation for throughput and other key metrics across all runs.
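
To spot-check the per-run numbers, and assuming jq is installed, you can also pull the mean metrics straight out of each run's summary.json:

# Print the mean AU and throughput from every summary.json under the results tree
find sample-results -name summary.json \
  -exec jq '{au: .metric.train_au_mean_percentage, samples_per_sec: .metric.train_throughput_mean_samples_per_second}' {} +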
