Cache Preloading

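Distributed load lets users efficiently load data from the UFS (under file system) into an Alluxio cluster. It can be used to warm up an Alluxio cluster so that cached data is available immediately when workloads run on Alluxio. For example, distributed load can prefetch data for a machine learning job to speed up training.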

Usage

There are two recommended ways to trigger a distributed load:

Job load CLI

The job load command loads data from the UFS into the Alluxio cluster. The CLI sends a load request to the Alluxio coordinator, which then distributes the load operation to all worker nodes.

bin/alluxio job load [flags] <path>

# Example output
Progress for loading path '/path':
        Settings:       bandwidth: unlimited    verify: false
        Job State: SUCCEEDED
        Files Processed: 1000
        Bytes Loaded: 125.00MB
        Throughput: 2509.80KB/s
        Block load failure rate: 0.00%
        Files Failed: 0

For detailed CLI usage, see the job load documentation.
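
When scripting around the CLI, it can help to pull a single field out of the progress text shown above. The following is a minimal sketch, assuming the output keeps the "Label: value" layout of the example; load_field is a hypothetical helper name, not part of Alluxio.

```shell
# Hypothetical helper: print the value of one "Label: value" line
# from load-progress text read on stdin.
# $1: the field label exactly as it appears, e.g. "Job State"
# Assumes the label contains no regular-expression metacharacters.
load_field() {
  sed -n "s/^[[:space:]]*$1:[[:space:]]*//p" | head -n 1
}
```

For example, piping the output of the job load command into load_field "Job State" would print SUCCEEDED for the sample output above.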

REST API

Like the CLI, the REST API can also be used to load data. A request can be sent to any worker node; the worker forwards it to the Alluxio coordinator, which distributes it to all other worker nodes.

Submit a job by sending a POST request to the submit_job/load endpoint, with the directory path in the JSON request body.


curl -v -H 'Content-Type: application/json' \
  http://coordinator_host:19999/api/v1/master/submit_job/load \
  -d '{ "path": "s3://bucket/ufs/",
        "options": {
          "replicas": "1",
          "batchSize": "200",
          "partialListing": "true",
          "loadMetadataOnly": "false",
          "skipIfExists": "false"
        }
      }'
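
To avoid hand-editing the JSON body for each path, the request body can be generated by a small function. This is only a sketch: load_payload is a hypothetical helper, the option names are copied from the request above, and it assumes the path contains no characters that need JSON escaping (such as double quotes or backslashes).

```shell
# Hypothetical helper: print the JSON body for a load submission.
# $1: UFS path (assumed to need no JSON escaping)
# $2: replicas   (defaults to 1, as in the example request)
# $3: batch size (defaults to 200, as in the example request)
load_payload() {
  printf '{"path":"%s","options":{"replicas":"%s","batchSize":"%s","partialListing":"true","loadMetadataOnly":"false","skipIfExists":"false"}}\n' \
    "$1" "${2:-1}" "${3:-200}"
}
```

The output can then be passed to curl on standard input, for example by piping load_payload s3://bucket/ufs/ into the submit command with -d @- in place of the inline body.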

Example request and response:

curl -v -H 'Content-Type: application/json' \
  http://coordinator_host:19999/api/v1/master/submit_job/load \
  -d '{ "path": "s3://bucket/ufs/",
        "options": {
          "replicas":"1",
          "batchSize": "200",
          "partialListing": "true",
          "loadMetadataOnly": "false",
          "skipIfExists": "false"
        }
      }'
* Host coordinator_host:19999 was resolved.
* IPv6: (none)
* IPv4: 52.26.153.198, 35.162.171.204, 44.236.69.116, 35.80.181.135
*   Trying 52.26.153.198:19999...
* Connected to coordinator_host (52.26.153.198) port 19999
> POST /api/v1/master/submit_job/load HTTP/1.1
> Host: coordinator_host:19999
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 229
> 
* upload completely sent off: 229 bytes
< HTTP/1.1 200 OK
< Date: Mon, 27 Jan 2025 08:08:04 GMT
< Content-Type: application/json
< Content-Length: 4
< Server: Jetty(9.4.53.v20231009)
<
* Connection #0 to host coordinator_host left intact
true

To check progress, send a GET request with the same path to the progress_job/load endpoint.

curl -v -H 'Content-Type: application/json' -X GET http://coordinator_host:19999/api/v1/master/progress_job/load \
  -d '{ "path": "s3://bucket/ufs/", "verbose": "true", "format": "TEXT" }'
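
A progress check like the one above is often repeated until the job finishes. The sketch below shows one way to do that. It takes the name of a command that prints the progress text, so the curl call stays outside the loop. wait_for_load is a hypothetical helper, and FAILED and STOPPED are assumed names for the other terminal states; only SUCCEEDED appears in this document.

```shell
# Hypothetical helper: poll a progress command until the job ends.
# $1: name of a command or function that prints the progress text
# $2: seconds to sleep between polls (default 5)
# $3: maximum number of polls before giving up (default 120)
# Returns 0 on SUCCEEDED, 1 on FAILED/STOPPED (assumed state names),
# and 2 if no terminal state was seen within the poll limit.
wait_for_load() {
  wfl_fetch=$1
  wfl_interval=${2:-5}
  wfl_left=${3:-120}
  while [ "$wfl_left" -gt 0 ]; do
    # Matches both the raw JSON string and already-decoded text.
    wfl_state=$("$wfl_fetch" | sed -n 's/.*Job State: \([A-Z_]*\).*/\1/p' | head -n 1)
    case $wfl_state in
      SUCCEEDED) return 0 ;;
      FAILED|STOPPED) return 1 ;;
    esac
    wfl_left=$((wfl_left - 1))
    sleep "$wfl_interval"
  done
  return 2
}
```

A caller would define a function that runs the progress curl command (without -v, so only the response body is printed) and pass its name as the first argument.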

Example request and response:

curl -v -H 'Content-Type: application/json' -X GET http://coordinator_host:19999/api/v1/master/progress_job/load \
  -d '{ "path": "s3://bucket/ufs/", "verbose": "true", "format": "TEXT" }'
 
* Host coordinator_host:19999 was resolved.
* IPv6: (none)
* IPv4: 52.26.153.198, 44.236.69.116, 35.162.171.204, 35.80.181.135
*   Trying 52.26.153.198:19999...
* Connected to coordinator_host (52.26.153.198) port 19999
> GET /api/v1/master/progress_job/load HTTP/1.1
> Host: coordinator_host:19999
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 73
> 
* upload completely sent off: 73 bytes
< HTTP/1.1 200 OK
< Date: Mon, 27 Jan 2025 18:20:38 GMT
< Content-Type: application/json
< Content-Length: 462
< Server: Jetty(9.4.53.v20231009)
<
* Connection #0 to host coordinator_host left intact
"\tSettings:\tbandwidth: unlimited\tverify: false\tmetadata-only: false \tquota-check:false\n\tTime Elapsed: 00:00:08\n\tJob State: SUCCEEDED\n\tInodes Scanned: 62\n\tNon Empty File Copies Loaded: 54\n\tBytes Loaded: 2168.14MB\n\tThroughput: 271.02MB/s\n\tFile Failure rate: 0.00%\n\tSubtask Failure rate: 0.00%\n\tFiles Failed: 0\n\tRecent failed subtasks: \n\tRecent retrying subtasks: \n\tSubtask Retry rate: 0.00%\n\tSubtasks on Retry Dead Letter Queue: 0\

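With the TEXT format, the response body is a single JSON string in which tabs and newlines appear as the escape sequences \t and \n. The sketch below turns such a string back into readable multi-line text. decode_progress is a hypothetical helper, and it is not a general JSON decoder: it only strips the surrounding quotes and expands backslash escapes.

```shell
# Hypothetical helper: read a JSON string such as "\tJob State: ...\n"
# on stdin and print it with \t and \n expanded to real whitespace.
decode_progress() {
  dp_raw=$(cat)
  dp_raw=${dp_raw#\"}   # drop the leading double quote, if present
  dp_raw=${dp_raw%\"}   # drop the trailing double quote, if present
  printf '%b\n' "$dp_raw"
}
```

Piping the response body through decode_progress yields one field per line, in the same layout as the CLI output.
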
To stop a load operation, send a POST request with the same path to the stop_job/load endpoint.

curl -v -H 'Content-Type: application/json' http://coordinator_host:19999/api/v1/master/stop_job/load \
  -d '{ "path": "s3://bucket/ufs/" }'

Example request and response:

curl -v -H 'Content-Type: application/json' http://coordinator_host:19999/api/v1/master/stop_job/load \
  -d '{ "path": "s3://bucket/ufs/" }'
* Host coordinator_host:19999 was resolved.
* IPv6: (none)
* IPv4: 35.80.181.135, 44.236.69.116, 52.26.153.198, 35.162.171.204
*   Trying 35.80.181.135:19999...
* Connected to coordinator_host (35.80.181.135) port 19999
> POST /api/v1/master/stop_job/load HTTP/1.1
> Host: coordinator_host:19999
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 35
> 
* upload completely sent off: 35 bytes
< HTTP/1.1 200 OK
< Date: Mon, 27 Jan 2025 18:45:02 GMT
< Content-Type: application/json
< Content-Length: 5
< Server: Jetty(9.4.53.v20231009)
<
* Connection #0 to host coordinator_host left intact
true
