Distributed load allows users to load data from the UFS into the Alluxio cluster efficiently. It can be used to warm up the Alluxio cluster so that it can immediately serve cached data when workloads run on top of Alluxio. For example, distributed load can prefetch data for machine learning jobs, speeding up the training process. Distributed load can also utilize file segmentation and multi-replication to improve file distribution in scenarios with highly concurrent data access.
Usage
There are two recommended ways to trigger distributed load:
job load CLI
The job load command can be used to load data from UFS (Under File System) to the Alluxio cluster. The CLI sends a load request to the Alluxio master, which subsequently distributes the load operation to all worker nodes.
```shell
bin/alluxio job load [flags] <path>

# Example output
Progress for loading path '/path':
  Settings: bandwidth: unlimited verify: false
  Job State: SUCCEEDED
  Files Processed: 1000
  Bytes Loaded: 125.00MB
  Throughput: 2509.80KB/s
  Block load failure rate: 0.00%
  Files Failed: 0
```
For detailed usage of CLI, please refer to the job load documentation.
REST API
Similar to the CLI, the REST API can also be used to load data. Requests can be sent to any worker node, which forwards them to the Alluxio master; the master then distributes the load operation to all worker nodes.
Submit the job by sending a POST request with the path to the directory as the `path` query parameter and `submit` as the `opType` query parameter.
```shell
curl -v -X POST "http://172.30.16.110:19999/v1/load?path=s3://test&opType=submit"

# Example output
* About to connect() to 172.30.16.110 port 19999 (#0)
*   Trying 172.30.16.110...
* Connected to 172.30.16.110 (172.30.16.110) port 19999 (#0)
> POST /v1/load?path=s3://test&opType=submit HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 172.30.16.110:19999
> Accept: */*
> Content-Length: 183
>
* upload completely sent off: 183 out of 183 bytes
< HTTP/1.1 200 OK
< Date: Wed, 10 Apr 2024 05:47:17 GMT
< Content-Length: 4
< Server: Jetty(9.4.53.v20231009)
<
* Connection #0 to host 172.30.16.110 left intact
true
```
Progress can be checked by sending a GET request with the same path and progress as the opType query parameter.
```shell
curl -v -X GET "http://host:19999/v1/load?path=s3://test&opType=progress"

# Example output
* About to connect() to 172.30.16.110 port 19999 (#0)
*   Trying 172.30.16.110...
* Connected to 172.30.16.110 (172.30.16.110) port 19999 (#0)
> GET /v1/load?path=s3://test&opType=progress
> User-Agent: curl/7.29.0
> Host: 172.30.16.110:19999
> Accept: */*
> Content-Length: 81
>
* upload completely sent off: 81 out of 81 bytes
< HTTP/1.1 200 OK
< Date: Wed, 10 Apr 2024 05:48:49 GMT
< Content-Length: 572
< Server: Jetty(9.4.53.v20231009)
<
* Connection #0 to host 172.30.16.110 left intact
"{\"mVerbose\":true,\"mJobState\":\"RUNNING\",\"mVerificationEnabled\":false,\"mSkippedByteCount\":0,\"mLoadedByteCount\":0,\"mScannedInodesCount\":18450,\"mLoadedNonEmptyFileCopiesCount\":0,\"mThroughput\":0,\"mFailureFilesPercentage\":0.0,\"mFailureSubTasksPercentage\":0.0,\"mRetrySubTasksPercentage\":0.0,\"mFailedFileCount\":0,\"mRecentFailedSubtasksWithReasons\":[],\"mRecentRetryingSubtasksWithReasons\":[],\"mSkipIfExists\":true,\"mMetadataOnly\":true,\"mRunningStage\":\"LOADING\",\"mRetryDeadLetterQueueSize\":0,\"mTimeElapsed\":87237,\"mSegmentEnabled\":false}"
```
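Note that the progress endpoint returns the job status as a JSON-encoded string whose contents are themselves JSON, so the response body must be decoded twice. A minimal Python sketch of a client-side helper (the helper itself is hypothetical, not part of Alluxio; the field names come from the example output above):

```python
import json

def parse_load_progress(response_body: str) -> dict:
    """Decode the body of GET /v1/load?path=...&opType=progress.

    The endpoint returns a JSON string containing escaped JSON,
    so two rounds of json.loads are needed to get a dict.
    """
    inner = json.loads(response_body)  # first pass: unwrap the quoted string
    return json.loads(inner)           # second pass: parse the actual object

# Example using an abbreviated form of the response shown above:
body = '"{\\"mJobState\\":\\"RUNNING\\",\\"mScannedInodesCount\\":18450}"'
progress = parse_load_progress(body)
print(progress["mJobState"])  # RUNNING
```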
The load operation can be terminated by sending a POST request with the same path and stop as the opType query parameter.
```shell
curl -v -X POST "http://host:19999/v1/load?path=s3://test&opType=stop"

# Example output
* About to connect() to 172.30.16.110 port 19999 (#0)
*   Trying 172.30.16.110...
* Connected to 172.30.16.110 (172.30.16.110) port 19999 (#0)
> POST /v1/load?path=s3://test&opType=stop
> User-Agent: curl/7.29.0
> Host: 172.30.16.110:19999
> Accept: */*
> Content-Length: 42
>
* upload completely sent off: 42 out of 42 bytes
< HTTP/1.1 200 OK
< Date: Wed, 10 Apr 2024 05:51:56 GMT
< Content-Length: 5
< Server: Jetty(9.4.53.v20231009)
<
* Connection #0 to host 172.30.16.110 left intact
true
```
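Putting the three calls together, a client typically submits the job, polls progress until the job state leaves RUNNING, and stops the job if it takes too long. The following Python sketch shows only the polling logic; the HTTP call is abstracted behind a callable so it can be exercised without a live cluster, and the function names are hypothetical, not part of any Alluxio client library:

```python
import json
import time
from typing import Callable

def wait_for_load(get_progress: Callable[[], str],
                  poll_interval_s: float = 1.0,
                  max_polls: int = 60) -> str:
    """Poll a load job until it leaves the RUNNING state.

    get_progress returns the raw response body of
    GET /v1/load?path=...&opType=progress (a JSON-encoded string).
    Returns the final job state, e.g. "SUCCEEDED".
    """
    for _ in range(max_polls):
        # The progress body is double-encoded JSON, so decode twice.
        progress = json.loads(json.loads(get_progress()))
        state = progress["mJobState"]
        if state != "RUNNING":
            return state
        time.sleep(poll_interval_s)
    # Caller may then send opType=stop to terminate the job.
    raise TimeoutError("load job is still RUNNING after polling limit")

# In real use, get_progress would wrap an HTTP GET against a worker, e.g.:
#   lambda: requests.get("http://host:19999/v1/load",
#                        params={"path": "s3://test", "opType": "progress"}).text
```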