Cache Preloading

Last updated 1 month ago

Distributed load lets users efficiently load data from the UFS (Under File System) into the Alluxio cluster. It can be used to warm up the cluster so that it can serve cached data immediately when workloads start running on top of Alluxio.

Usage

There are two recommended ways to trigger distributed load:

job load CLI

The job load command can be used to load data from UFS (Under File System) to the Alluxio cluster. The CLI sends a load request to the Alluxio coordinator, which subsequently distributes the load operation to all worker nodes.

bin/alluxio job load [flags] <path>

# Example output
Progress for loading path '/path':
        Settings:       bandwidth: unlimited    verify: false
        Job State: SUCCEEDED
        Files Processed: 1000
        Bytes Loaded: 125.00MB
        Throughput: 2509.80KB/s
        Block load failure rate: 0.00%
        Files Failed: 0

For detailed CLI usage, refer to the User CLI reference documentation.
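
The progress report printed by job load is a series of `Key: value` lines, so a script can check the outcome of a load without a person reading it. The following is a minimal Python sketch, assuming the report keeps the layout of the example output above. The function names are illustrative and not part of Alluxio.

```python
# Sample report copied from the example output in this page.
EXAMPLE_OUTPUT = """Progress for loading path '/path':
        Settings:       bandwidth: unlimited    verify: false
        Job State: SUCCEEDED
        Files Processed: 1000
        Bytes Loaded: 125.00MB
        Throughput: 2509.80KB/s
        Block load failure rate: 0.00%
        Files Failed: 0
"""


def parse_load_report(text):
    """Collect the 'Key: value' lines of a job load report into a dict.

    Lines with nothing after the first colon (such as the header line)
    are skipped. Values are kept as strings exactly as printed.
    """
    report = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            report[key.strip()] = value.strip()
    return report


def load_succeeded(report):
    """Return True when the job finished and reported no failed files."""
    return (
        report.get("Job State") == "SUCCEEDED"
        and report.get("Files Failed") == "0"
    )


if __name__ == "__main__":
    parsed = parse_load_report(EXAMPLE_OUTPUT)
    print(parsed["Job State"], parsed["Bytes Loaded"], load_succeeded(parsed))
```

In practice the text would come from the captured standard output of the command rather than from a constant.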

REST API

Similar to the CLI, the REST API can also be used to load data. Requests can be sent to any worker node, which forwards them to the Alluxio coordinator for distribution to all other worker nodes.

Submit the job by sending a POST request to the submit_job/load endpoint, with the path to the directory in the JSON request body.


curl -v -H "Content-Type: application/json" \
  http://coordinator_host:19999/api/v1/master/submit_job/load \
  -d '{ "path": "s3://bucket/ufs/",
        "options": {
          "replicas":"1",
          "batchSize": "200",
          "partialListing": "true",
          "loadMetadataOnly": "false",
          "skipIfExists": "false"
      }
    }'

Example request and response:


curl -v -H 'Content-Type: application/json' \
  http://coordinator_host:19999/api/v1/master/submit_job/load \
  -d '{ "path": "s3://bucket/ufs/",
        "options": {
          "replicas":"1",
          "batchSize": "200",
          "partialListing": "true",
          "loadMetadataOnly": "false",
          "skipIfExists": "false"
       }
     }'
* Host coordinator_host:19999 was resolved.
* IPv6: (none)
* IPv4: 52.26.153.198, 35.162.171.204, 44.236.69.116, 35.80.181.135
*   Trying 52.26.153.198:19999...
* Connected to coordinator_host (52.26.153.198) port 19999
> POST /api/v1/master/submit_job/load HTTP/1.1
> Host: coordinator_host:19999
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 229
> 
* upload completely sent off: 229 bytes
< HTTP/1.1 200 OK
< Date: Mon, 27 Jan 2025 08:08:04 GMT
< Content-Type: application/json
< Content-Length: 4
< Server: Jetty(9.4.53.v20231009)
< 
* Connection #0 to host coordinator_host left intact
true
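
The same submit request can be issued from a script. The following is a minimal Python sketch using only the standard library, assuming the endpoint layout and JSON body shown in the curl example above. The helper names `build_job_request` and `submit_load` are illustrative, and `coordinator_host:19999` is the placeholder address used throughout this page.

```python
import json
import urllib.request


def build_job_request(coordinator, operation, path, options=None):
    """Build a POST request for /api/v1/master/<operation>_job/load.

    `operation` is "submit" or "stop", matching the endpoints on this page.
    Option values are passed as strings, as in the curl example.
    """
    body = {"path": path}
    if options:
        body["options"] = options
    return urllib.request.Request(
        url=f"http://{coordinator}/api/v1/master/{operation}_job/load",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_load(coordinator, path, options=None, timeout=30.0):
    """Send the submit request; the example response body is the JSON literal true."""
    request = build_job_request(coordinator, "submit", path, options)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8")) is True


if __name__ == "__main__":
    accepted = submit_load(
        "coordinator_host:19999",
        "s3://bucket/ufs/",
        {"replicas": "1", "batchSize": "200", "skipIfExists": "false"},
    )
    print("accepted:", accepted)
```

Keeping request construction separate from sending makes it possible to inspect the URL and body without a running cluster.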

Progress can be checked by sending a GET request to the progress_job/load endpoint with the same path in the request body.

curl -v -H 'Content-Type: application/json' -X GET http://coordinator_host:19999/api/v1/master/progress_job/load \
  -d '{ "path": "s3://bucket/ufs/", "verbose": "true", "format": "TEXT" }'

Example request and response:


curl -v -H 'Content-Type: application/json' -X GET http://coordinator_host:19999/api/v1/master/progress_job/load \
  -d '{ "path": "s3://bucket/ufs/", "verbose": "true", "format": "TEXT" }'
* Host coordinator_host:19999 was resolved.
* IPv6: (none)
* IPv4: 52.26.153.198, 44.236.69.116, 35.162.171.204, 35.80.181.135
*   Trying 52.26.153.198:19999...
* Connected to coordinator_host (52.26.153.198) port 19999
> GET /api/v1/master/progress_job/load HTTP/1.1
> Host: coordinator_host:19999
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 73
> 
* upload completely sent off: 73 bytes
< HTTP/1.1 200 OK
< Date: Mon, 27 Jan 2025 18:20:38 GMT
< Content-Type: application/json
< Content-Length: 462
< Server: Jetty(9.4.53.v20231009)
< 
* Connection #0 to host coordinator_host left intact
"\tSettings:\tbandwidth: unlimited\tverify: false\tmetadata-only: false \tquota-check:false\n\tTime Elapsed: 00:00:08\n\tJob State: SUCCEEDED\n\tInodes Scanned: 62\n\tNon Empty File Copies Loaded: 54\n\tBytes Loaded: 2168.14MB\n\tThroughput: 271.02MB/s\n\tFile Failure rate: 0.00%\n\tSubtask Failure rate: 0.00%\n\tFiles Failed: 0\n\tRecent failed subtasks: \n\tRecent retrying subtasks: \n\tSubtask Retry rate: 0.00%\n\tSubtasks on Retry Dead Letter Queue: 0\
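
Because the progress response is plain text containing a `Job State:` field, a client can poll it until the job finishes. The following is a minimal Python sketch, assuming the request and response shapes shown above. Only the `SUCCEEDED` state appears on this page; the other terminal state names, the polling defaults, and all function names are assumptions made for illustration.

```python
import json
import re
import time
import urllib.request

# Only SUCCEEDED is confirmed by the example response; FAILED and STOPPED
# are assumed names for the other terminal states.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def fetch_progress_text(coordinator, path, timeout=30.0):
    """GET the TEXT progress report; the body is a JSON-encoded string."""
    body = json.dumps({"path": path, "verbose": "true", "format": "TEXT"})
    request = urllib.request.Request(
        url=f"http://{coordinator}/api/v1/master/progress_job/load",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="GET",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def job_state(progress_text):
    """Extract the value of the 'Job State:' field, or None if it is absent."""
    match = re.search(r"Job State:\s*(\w+)", progress_text)
    return match.group(1) if match else None


def wait_for_load(fetch_progress, interval_s=5.0, max_polls=120, sleep=time.sleep):
    """Call fetch_progress() until the job reaches a terminal state.

    Raises TimeoutError if no terminal state is seen within max_polls calls.
    """
    for _ in range(max_polls):
        state = job_state(fetch_progress())
        if state in TERMINAL_STATES:
            return state
        sleep(interval_s)
    raise TimeoutError("load job did not reach a terminal state")
```

A caller would pass something like `lambda: fetch_progress_text("coordinator_host:19999", "s3://bucket/ufs/")` as `fetch_progress`. Injecting the fetch and sleep functions keeps the polling logic testable without a cluster.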

The load operation can be terminated by sending a POST request to the stop_job/load endpoint with the same path in the request body.


curl -v -H 'Content-Type: application/json' http://coordinator_host:19999/api/v1/master/stop_job/load \
  -d '{ "path": "s3://bucket/ufs/" }'

Example request and response:


curl -v -H 'Content-Type: application/json' http://coordinator_host:19999/api/v1/master/stop_job/load \
  -d '{ "path": "s3://bucket/ufs/" }'
* Host coordinator_host:19999 was resolved.
* IPv6: (none)
* IPv4: 35.80.181.135, 44.236.69.116, 52.26.153.198, 35.162.171.204
*   Trying 35.80.181.135:19999...
* Connected to coordinator_host (35.80.181.135) port 19999
> POST /api/v1/master/stop_job/load HTTP/1.1
> Host: coordinator_host:19999
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 35
> 
* upload completely sent off: 35 bytes
< HTTP/1.1 200 OK
< Date: Mon, 27 Jan 2025 18:45:02 GMT
< Content-Type: application/json
< Content-Length: 5
< Server: Jetty(9.4.53.v20231009)
< 
* Connection #0 to host coordinator_host left intact
true