Cache Preloading
Distributed load allows users to load data from UFS to Alluxio cluster efficiently. This can be used to initialize the Alluxio cluster to be able to immediately serve cached data when running workloads on top of Alluxio. For example, distributed load can be used to prefetch data for machine learning jobs, speeding up the training process. Distributed load can utilize file segmentation and multi-replication to enhance file distribution in scenarios with highly concurrent data access.
Usage
There are two recommended ways to trigger distributed load:
job load CLI
The job load
command can be used to load data from UFS (Under File System) to the Alluxio cluster. The CLI sends a load request to the Alluxio coordinator, which subsequently distributes the load operation to all worker nodes.
bin/alluxio job load [flags] <path>
# Example output
Progress for loading path '/path':
Settings: bandwidth: unlimited verify: false
Job State: SUCCEEDED
Files Processed: 1000
Bytes Loaded: 125.00MB
Throughput: 2509.80KB/s
Block load failure rate: 0.00%
Files Failed: 0
For detailed usage of CLI, please refer to the job load documentation.
REST API
Similar to the CLI, the REST API can also be used to load data. Requests are sent directly to the coordinator.
curl -H "Content-Type: application/json" -v -X POST http://coordinator_host:19999/api/v1/master/submit_job/load -d '{
"path": "s3://alluxiow/testm/dir-1/",
"options": {
"replicas":"2",
"batchSize": "300",
"partialListing": "true",
"loadMetadataOnly": "true",
"skipIfExists": "true"
}
}'
Progress can be checked by sending a GET request with the same path.
curl -H "Content-Type: application/json" -v -X GET http://coordinator_host:19999/api/v1/master/progress_job/load -d '{
"path or indexFile": "s3://bucket/dir-1/",
"format": "TEXT[default] | JSON",
"verbose": "true"
}'
The load operation can be terminated by sending a POST request.
curl -H "Content-Type: application/json" -v -X POST http://coordinator_host:19999/api/v1/master/stop_job/load -d '{
"path || indexFile": "s3://alluxiow/testm/dir-1/"
}'
The load jobs can be list by sending a POST request.
curl http://ip:1999/api/v1/master/list_job?[job-type=LOAD[&job-state=[RUNNING|VERIFYING|STOPPED|SUCCEEDED|FAILED|ALL]]
Last updated