MLPerf Storage Benchmark Overview
MLPerf Storage is a benchmark suite designed specifically to measure storage system performance under machine learning workloads.
This document describes how to run an end-to-end test of Alluxio with MLPerf Storage.
Summary of Test Results
| Model | Accelerators (GPUs) | Dataset | Accelerator Utilization | Throughput (MB/s) | Throughput (samples/s) |
|---|---|---|---|---|---|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
The test results were produced on an Alluxio cluster with the following configuration; all server instance types are available on AWS:

- Alluxio cluster: one Alluxio FUSE node and two Alluxio Worker nodes.
- Alluxio Worker instances: i3en.metal (96 vCPUs, 768 GB memory, 100 Gbps network, 8 NVMe SSDs)
- Alluxio FUSE instance: c6in.metal (128 vCPUs, 256 GB memory, 200 Gbps network)
Prepare the Test Environment
Operating system image: Ubuntu 22.04
Prepare the MLPerf Storage Benchmark Tool
```shell
sudo apt-get install mpich
git clone -b v0.5 --recurse-submodules https://github.com/mlcommons/storage.git
cd storage
pip3 install -r dlio_benchmark/requirements.txt
```
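Before generating any data, it can be worth confirming that the MPI launcher installed by mpich is on the PATH, since the benchmark spawns one worker process per simulated accelerator through MPI. This is only a quick sanity check, not part of the official steps:

```shell
# Confirm the MPI launcher from mpich is installed and reachable.
mpirun --version
```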
Generate the Dataset
We recommend generating the dataset locally and then uploading it to remote storage. First, determine how much data needs to be generated:
```shell
./benchmark.sh datasize --workload unet3d --num-accelerators 4 --host-memory-in-gb 32
```
- **workload**: the options are unet3d and bert.
- **num-accelerators**: the number of simulated GPUs. The more accelerators, the more processes run on a single machine; training on a dataset of the same size finishes faster, but the demand on storage I/O increases.
- **host-memory-in-gb**: the simulated host memory size. It can be set freely and may even exceed the machine's actual memory; the larger the memory, the larger the generated dataset and the longer training takes.
After running this command, you will see output like the following:
```shell
./benchmark.sh datasize --workload unet3d --num-accelerators 4 --host-memory-in-gb 32
The benchmark will run for approx 11 minutes (best case)
Minimum 1600 files are required, which will consume 218 GB of storage
----------------------------------------------
Set --param dataset.num_files_train=1600 with ./benchmark.sh datagen/run commands
```
Next, generate the corresponding dataset with the following command:
```shell
./benchmark.sh datagen --workload unet3d --num-parallel ${num-parallel} --param dataset.num_files_train=1600 --param dataset.data_folder=${dataset.data_folder}
```
After generating the dataset locally, upload it to the UFS.
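For example, if the UFS is an S3 bucket, the generated files can be uploaded with the AWS CLI; the local path, bucket name, and prefix below are placeholders:

```shell
# Upload the locally generated unet3d dataset to the S3 bucket serving as the UFS.
aws s3 cp --recursive ./data/unet3d/ s3://<your-bucket>/mlperf/unet3d/
```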
Configure Alluxio
We recommend Alluxio 3.1 or later for MLPerf testing. In addition, the following settings in `alluxio-site.properties` are recommended for the best read performance:
```properties
alluxio.user.position.reader.streaming.async.prefetch.enable=true
alluxio.user.position.reader.streaming.async.prefetch.thread=256
alluxio.user.position.reader.streaming.async.prefetch.part.length=4MB
alluxio.user.position.reader.streaming.async.prefetch.max.part.number=4
```
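A minimal way to apply these settings, assuming a standard installation layout where `${ALLUXIO_HOME}` points at the Alluxio installation directory, is to append them to the properties file on the node that runs the FUSE process:

```shell
# Append the recommended client-side read settings (assumed path; adjust to your installation).
cat <<'EOF' >> ${ALLUXIO_HOME}/conf/alluxio-site.properties
alluxio.user.position.reader.streaming.async.prefetch.enable=true
alluxio.user.position.reader.streaming.async.prefetch.thread=256
alluxio.user.position.reader.streaming.async.prefetch.part.length=4MB
alluxio.user.position.reader.streaming.async.prefetch.max.part.number=4
EOF
```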
For other Alluxio-related configuration, see the Fio Tests section.
One or more Alluxio Workers can be configured as the caching cluster.
In addition, an Alluxio FUSE process needs to be started on every MLPerf test node to read the data.
Make sure the dataset has been fully loaded from the UFS into the Alluxio cache.
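One simple way to warm the cache is to read every dataset file once through the FUSE mount before starting the benchmark; the mount point, dataset path, and degree of parallelism below are placeholder values for illustration:

```shell
# Read all dataset files through the Alluxio FUSE mount so the workers cache them from the UFS.
find /mnt/alluxio-fuse/mlperf/unet3d -type f | xargs -P 16 -n 1 cat > /dev/null
```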
Run the Test
```shell
./benchmark.sh run --workload ${workload} --num-accelerators ${num-accelerators} --results-dir ${results-dir} --param dataset.data_folder=${dataset.data_folder} --param dataset.num_files_train=${dataset.num_files_train}
```
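As a concrete example, an invocation matching the summary shown below (20 simulated accelerators and 28125 training files; the results directory and data folder are placeholders) could look like this:

```shell
./benchmark.sh run --workload unet3d --num-accelerators 20 \
  --results-dir ./results/unet3d \
  --param dataset.data_folder=/mnt/alluxio-fuse/mlperf/unet3d \
  --param dataset.num_files_train=28125
```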
After the test completes, you will find a `summary.json` file like the following in the results-dir:
```json
{
  "model": "unet3d",
  "start": "2024-05-27T14:46:24.458325",
  "num_accelerators": 20,
  "hostname": "ip-172-31-24-47",
  "metric": {
    "train_au_percentage": [
      99.18125818824699,
      99.01649117920554,
      98.95473494676878,
      98.31108303926722,
      98.2658474647346
    ],
    "train_au_mean_percentage": 98.74588296364462,
    "train_au_meet_expectation": "success",
    "train_au_stdev_percentage": 0.38102089124716115,
    "train_throughput_samples_per_second": [
      57.07382805038776,
      57.1334916113455,
      56.93601336110315,
      56.72469392071424,
      56.64526420320678
    ],
    "train_throughput_mean_samples_per_second": 56.90265822935148,
    "train_throughput_stdev_samples_per_second": 0.19058788132211907,
    "train_io_mean_MB_per_second": 7955.518180172248,
    "train_io_stdev_MB_per_second": 26.64594945050442
  },
  "num_files_train": 28125,
  "num_files_eval": 0,
  "num_samples_per_file": 1,
  "epochs": 5,
  "end": "2024-05-27T15:27:39.203932"
}
```
The `train_au_percentage` field represents the GPU (accelerator) utilization.
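To pull the key numbers out of a single run's `summary.json` quickly, a tool such as jq (not part of the benchmark itself) can be used:

```shell
# Print mean accelerator utilization and mean throughput from one run's summary.json.
jq '.metric | {train_au_mean_percentage, train_throughput_mean_samples_per_second, train_io_mean_MB_per_second}' summary.json
```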
You can also run the test multiple times and save the results of each run in the following layout:
```
sample-results
|---run-1
    |---host-1
        |---summary.json
    |---host-2
        |---summary.json
    ....
    |---host-n
        |---summary.json
|---run-2
    |---host-1
        |---summary.json
    |---host-2
        |---summary.json
    ....
    |---host-n
        |---summary.json
.....
|---run-5
    |---host-1
        |---summary.json
    |---host-2
        |---summary.json
    ....
    |---host-n
        |---summary.json
```
Then aggregate the results of the multiple runs with the following command:
```shell
./benchmark.sh reportgen --results-dir sample-results
```
The final aggregated result looks like this:
```json
{
  "overall": {
    "model": "unet3d",
    "num_client_hosts": 1,
    "num_benchmark_runs": 5,
    "train_num_accelerators": "20",
    "num_files_train": 28125,
    "num_samples_per_file": 1,
    "train_throughput_mean_samples_per_second": 56.587322998616344,
    "train_throughput_stdev_samples_per_second": 0.3842685544298719,
    "train_throughput_mean_MB_per_second": 7911.431396900177,
    "train_throughput_stdev_MB_per_second": 53.72429981238494
  },
  "runs": {
    "run-5": {
      "train_throughput_samples_per_second": 57.06105089062497,
      "train_throughput_MB_per_second": 7977.662939935283,
      "train_num_accelerators": "20",
      "model": "unet3d",
      "num_files_train": 28125,
      "num_samples_per_file": 1
    },
    "run-2": {
      "train_throughput_samples_per_second": 56.18386238258097,
      "train_throughput_MB_per_second": 7855.023869277903,
      "train_num_accelerators": "20",
      "model": "unet3d",
      "num_files_train": 28125,
      "num_samples_per_file": 1
    },
    "run-1": {
      "train_throughput_samples_per_second": 56.90265822935148,
      "train_throughput_MB_per_second": 7955.518180172248,
      "train_num_accelerators": "20",
      "model": "unet3d",
      "num_files_train": 28125,
      "num_samples_per_file": 1
    },
    "run-3": {
      "train_throughput_samples_per_second": 56.69229017116294,
      "train_throughput_MB_per_second": 7926.10677895614,
      "train_num_accelerators": "20",
      "model": "unet3d",
      "num_files_train": 28125,
      "num_samples_per_file": 1
    },
    "run-4": {
      "train_throughput_samples_per_second": 56.09675331936137,
      "train_throughput_MB_per_second": 7842.845216159307,
      "train_num_accelerators": "20",
      "model": "unet3d",
      "num_files_train": 28125,
      "num_samples_per_file": 1
    }
  }
}
```