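# MLPerf Storage Benchmark

## MLPerf Storage Benchmark Overview

[MLPerf Storage](https://github.com/mlcommons/storage) is a benchmark suite that measures storage system performance for machine learning workloads.

This document describes how to use MLPerf Storage to run end-to-end tests against Alluxio.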

## Summary of Test Results

The following results were obtained by running MLPerf Storage v0.5 with A100 GPUs.

| Model  | Accelerators (GPUs) | Dataset size | Accelerator utilization (AU) | Throughput (MB/s) | Throughput (samples/s) |
| ------ | ------------------- | ------------ | ---------------------------- | ----------------- | ---------------------- |
| bert   | 1                   | 1.3 TB       | 99%                          | 0.1               | 49.3                   |
| unet3d | 1                   | 719 GB       | 99%                          | 409.5             | 2.9                    |
| bert   | 128                 | 2.4 TB       | 98%                          | 14.8              | 6217                   |
| unet3d | 20                  | 3.8 TB       | 97%-99%                      | 7911.4            | 56.59                  |

The results are based on an Alluxio cluster with the following configuration. All server instance types are available on AWS:

* **Alluxio cluster:** one Alluxio FUSE node and two Alluxio Worker nodes.
* **Alluxio Worker instance:** [i3en.metal](https://aws.amazon.com/cn/ec2/instance-types/i3en/): 96 cores + 768 GB memory + 100 Gb network + 8 NVMe SSDs
* **Alluxio FUSE instance:** [c6in.metal](https://aws.amazon.com/ec2/instance-types/c6i/): 128 cores + 256 GB memory + 200 Gb network

## Preparing the Test Environment

Operating system image: Ubuntu 22.04

### Preparing the MLPerf Storage Test Tool

```bash
sudo apt-get install mpich
git clone -b v0.5 --recurse-submodules https://github.com/mlcommons/storage.git
cd storage
pip3 install -r dlio_benchmark/requirements.txt
```

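### Generating the Dataset

We recommend generating the dataset locally and then uploading it to remote storage. First, determine how much data to generate: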

```bash
./benchmark.sh datasize --workload unet3d --num-accelerators 4 --host-memory-in-gb 32
```

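* **workload:** either `unet3d` or `bert`.
* **num-accelerators:** the number of simulated GPUs. The larger the number, the more processes run on a single machine. For a dataset of the same size, training finishes sooner, but the demand on storage I/O increases.
* **host-memory-in-gb:** the simulated host memory size. It can be set freely and may even exceed the machine's actual memory. The larger the memory, the larger the generated dataset and the longer the training time.

Running this command produces output like the following: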

```bash
./benchmark.sh datasize --workload unet3d --num-accelerators 4 --host-memory-in-gb 32
The benchmark will run for approx 11 minutes (best case)
Minimum 1600 files are required, which will consume 218 GB of storage
----------------------------------------------
Set --param dataset.num_files_train=1600 with ./benchmark.sh datagen/run commands
```
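
The recommended file count can also be picked out of the `datasize` output by a script. The following is a minimal sketch that assumes the output format shown above stays the same. The variable names `datasize_output` and `num_files_train` are illustrative, and the sample text is pasted in rather than produced by a live run.

```bash
# Sample `datasize` output copied from the run above (not produced live here).
datasize_output='The benchmark will run for approx 11 minutes (best case)
Minimum 1600 files are required, which will consume 218 GB of storage'

# Pull the number that follows "Minimum" on the matching line.
num_files_train=$(printf '%s\n' "$datasize_output" |
  sed -n 's/^Minimum \([0-9][0-9]*\) files are required.*/\1/p')

echo "dataset.num_files_train=${num_files_train}"
```

In a real workflow, `datasize_output` would be captured from `./benchmark.sh datasize ...` and the extracted value passed to the `datagen` and `run` commands.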

Next, generate the corresponding dataset with the following command:

```bash
# DATA_FOLDER is a placeholder: set it to the directory where the dataset should be written.
./benchmark.sh datagen --workload unet3d --num-parallel 8 --param dataset.num_files_train=1600 --param dataset.data_folder="${DATA_FOLDER}"
```

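Here, **num-parallel** sets the degree of parallelism used to generate the dataset.

After generating the dataset locally, upload it to the under file system (UFS).

### Configuring Alluxio

We recommend Alluxio 3.1 or later for MLPerf testing. In addition, the following settings in `alluxio-site.properties` are recommended for the best read performance: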

```properties
alluxio.user.position.reader.streaming.async.prefetch.enable=true
alluxio.user.position.reader.streaming.async.prefetch.thread=256
alluxio.user.position.reader.streaming.async.prefetch.part.length=4MB
alluxio.user.position.reader.streaming.async.prefetch.max.part.number=4
```

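For other Alluxio-related configuration, see the [Fio Tests](https://documentation.alluxio.io/ee-ai-cn/ai-3.6/benchmark/fio) section.

* One or more Alluxio Workers can be configured as the cache cluster.
* An Alluxio FUSE process must be started on every MLPerf test node to read the data.
* Make sure the dataset has been fully loaded from the UFS into the Alluxio cache.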
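
Before starting a run, it can help to confirm that the dataset visible through the FUSE mount contains the expected number of files. The following is a minimal sketch using only standard tools. The function names `count_dataset_files` and `dataset_is_complete`, and the mount path in the usage note, are hypothetical. It checks that the files are visible through the mount, not that they are already cached.

```bash
# Count regular files under a dataset directory
# (for example, a path below the Alluxio FUSE mount point).
count_dataset_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Succeed only when the directory holds at least the expected number of files.
dataset_is_complete() {
  # $1 = dataset directory, $2 = expected file count (dataset.num_files_train)
  [ "$(count_dataset_files "$1")" -ge "$2" ]
}
```

For example, `dataset_is_complete /mnt/alluxio/fuse/unet3d 1600 && echo "dataset visible"`, where the path is a placeholder for your own mount point.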

### Running the Test

```bash
# RESULTS_DIR, DATA_FOLDER and NUM_FILES_TRAIN are placeholders: set them to your own values first.
./benchmark.sh run --workload unet3d --num-accelerators 4 --results-dir "${RESULTS_DIR}" --param dataset.data_folder="${DATA_FOLDER}" --param dataset.num_files_train="${NUM_FILES_TRAIN}"
```

When the test finishes, the results directory contains a `summary.json` file like the following:

```json
{
  "model": "unet3d",
  "start": "2024-05-27T14:46:24.458325",
  "num_accelerators": 20,
  "hostname": "ip-172-31-24-47",
  "metric": {
    "train_au_percentage": [
      99.18125818824699,
      99.01649117920554,
      98.95473494676878,
      98.31108303926722,
      98.2658474647346
    ],
    "train_au_mean_percentage": 98.74588296364462,
    "train_au_meet_expectation": "success",
    "train_au_stdev_percentage": 0.38102089124716115,
    "train_throughput_samples_per_second": [
      57.07382805038776,
      57.1334916113455,
      56.93601336110315,
      56.72469392071424,
      56.64526420320678
    ],
    "train_throughput_mean_samples_per_second": 56.90265822935148,
    "train_throughput_stdev_samples_per_second": 0.19058788132211907,
    "train_io_mean_MB_per_second": 7955.518180172248,
    "train_io_stdev_MB_per_second": 26.64594945050442
  },
  "num_files_train": 28125,
  "num_files_eval": 0,
  "num_samples_per_file": 1,
  "epochs": 5,
  "end": "2024-05-27T15:27:39.203932"
}
```

The `train_au_percentage` field reports the accelerator (GPU) utilization, with one value per epoch.
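
The `train_au_meet_expectation` field records whether the utilization target was met. The sketch below performs a similar check by hand: it reads `train_au_mean_percentage` from a `summary.json` and compares it with 90, the threshold assumed here for v0.5. The function names are illustrative, and the `sed` pattern relies on the one-key-per-line layout shown above rather than on a real JSON parser.

```bash
# Print the numeric value stored under a key in a pretty-printed JSON file.
json_number() {
  # $1 = key, $2 = file
  sed -n "s/^ *\"$1\": *\([0-9.][0-9.]*\),\{0,1\} *$/\1/p" "$2"
}

# Succeed when the mean accelerator utilization reaches the threshold.
# The default of 90 is an assumption; pass another value as $2 if needed.
au_meets_threshold() {
  # $1 = path to summary.json, $2 = threshold in percent (optional)
  au=$(json_number train_au_mean_percentage "$1")
  [ -n "$au" ] && awk -v v="$au" -v t="${2:-90}" 'BEGIN { exit !(v >= t) }'
}
```

For example, `au_meets_threshold "$RESULTS_DIR/summary.json" && echo "AU target met"`, where `RESULTS_DIR` is a placeholder for your results directory.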

You can also run the test several times and save the results of each run in the following layout:

```
sample-results
    |---run-1
           |---host-1
                    |---summary.json
           |---host-2
                    |---summary.json
              ....
           |---host-n
                    |---summary.json
    |---run-2
           |---host-1
                    |---summary.json
           |---host-2
                    |---summary.json
              ....
           |---host-n
                    |---summary.json
        .....
    |---run-5
           |---host-1
                    |---summary.json
           |---host-2
                    |---summary.json
              ....
           |---host-n
                    |---summary.json
```
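
One way to build this layout is to copy each host's `summary.json` into place after every run. The following is a minimal sketch. The helper name `store_summary` and its arguments are hypothetical and are not part of `benchmark.sh`.

```bash
# Copy one summary.json into <root>/run-<N>/<host>/summary.json.
store_summary() {
  # $1 = results root, $2 = run number, $3 = host name, $4 = source summary.json
  dest="$1/run-$2/$3"
  mkdir -p "$dest" && cp "$4" "$dest/summary.json"
}
```

For example, `store_summary sample-results 1 host-1 /path/to/summary.json` places the first run's summary for `host-1`, where the source path is a placeholder.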

Then aggregate the results of the runs with the following command:

```bash
./benchmark.sh reportgen --results-dir sample-results
```

The final aggregated report looks like this:

```json
{
    "overall": {
        "model": "unet3d",
        "num_client_hosts": 1,
        "num_benchmark_runs": 5,
        "train_num_accelerators": "20",
        "num_files_train": 28125,
        "num_samples_per_file": 1,
        "train_throughput_mean_samples_per_second": 56.587322998616344,
        "train_throughput_stdev_samples_per_second": 0.3842685544298719,
        "train_throughput_mean_MB_per_second": 7911.431396900177,
        "train_throughput_stdev_MB_per_second": 53.72429981238494
    },
    "runs": {
        "run-5": {
            "train_throughput_samples_per_second": 57.06105089062497,
            "train_throughput_MB_per_second": 7977.662939935283,
            "train_num_accelerators": "20",
            "model": "unet3d",
            "num_files_train": 28125,
            "num_samples_per_file": 1
        },
        "run-2": {
            "train_throughput_samples_per_second": 56.18386238258097,
            "train_throughput_MB_per_second": 7855.023869277903,
            "train_num_accelerators": "20",
            "model": "unet3d",
            "num_files_train": 28125,
            "num_samples_per_file": 1
        },
        "run-1": {
            "train_throughput_samples_per_second": 56.90265822935148,
            "train_throughput_MB_per_second": 7955.518180172248,
            "train_num_accelerators": "20",
            "model": "unet3d",
            "num_files_train": 28125,
            "num_samples_per_file": 1
        },
        "run-3": {
            "train_throughput_samples_per_second": 56.69229017116294,
            "train_throughput_MB_per_second": 7926.10677895614,
            "train_num_accelerators": "20",
            "model": "unet3d",
            "num_files_train": 28125,
            "num_samples_per_file": 1
        },
        "run-4": {
            "train_throughput_samples_per_second": 56.09675331936137,
            "train_throughput_MB_per_second": 7842.845216159307,
            "train_num_accelerators": "20",
            "model": "unet3d",
            "num_files_train": 28125,
            "num_samples_per_file": 1
        }
    }
}
```
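
The `overall` throughput figures can be cross-checked against the per-run values. The sketch below recomputes the mean and standard deviation of the five `train_throughput_samples_per_second` values with `awk`. Dividing by the number of runs (a population standard deviation) reproduces the reported value of about 0.384. That `reportgen` computes it this way is inferred from the numbers above, not taken from the tool's source.

```bash
# Per-run train_throughput_samples_per_second values, copied from the report above.
runs='57.06105089062497
56.18386238258097
56.90265822935148
56.69229017116294
56.09675331936137'

# Two-pass mean and population standard deviation.
stats=$(printf '%s\n' "$runs" | awk '
  { x[NR] = $1; sum += $1 }
  END {
    mean = sum / NR
    for (i = 1; i <= NR; i++) ss += (x[i] - mean) * (x[i] - mean)
    printf "%.4f %.4f", mean, sqrt(ss / NR)
  }')

echo "mean stdev: $stats"
```

Rounded to four decimals, the result agrees with `train_throughput_mean_samples_per_second` and `train_throughput_stdev_samples_per_second` in the `overall` section.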