Cache Preloading

Distributed load allows users to efficiently load data from the UFS into the Alluxio cluster. This can be used to pre-populate the cluster so that it can immediately serve cached data when workloads run on top of Alluxio. Distributed load can utilize file segmentation and multi-replication to enhance file distribution in scenarios with highly concurrent data access.

Usage

There are two recommended ways to trigger distributed load:

job load CLI

The job load command can be used to load data from the UFS (Under File System) into the Alluxio cluster. The CLI sends a load request to the Alluxio coordinator, which then distributes the load operation across all worker nodes.

bin/alluxio job load [flags] <path>

# Example output
Progress for loading path '/path':
        Settings:       bandwidth: unlimited    verify: false
        Job State: SUCCEEDED
        Files Processed: 1000
        Bytes Loaded: 125.00MB
        Throughput: 2509.80KB/s
        Block load failure rate: 0.00%
        Files Failed: 0
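
As a sketch, a load with an explicit bandwidth cap and post-load verification might look like the following. The --bandwidth and --verify flag names are assumptions inferred from the Settings line in the sample output above, not confirmed flags; check the job load reference below for the authoritative list.

bin/alluxio job load --bandwidth 100MB --verify s3://bucket/dir-1/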

For detailed usage of the CLI, refer to the job load documentation.

REST API

Similar to the CLI, the REST API can also be used to load data. Requests are sent directly to the coordinator.

curl -H "Content-Type: application/json"  -v -X POST http://coordinator_host:19999/api/v1/master/submit_job/load -d '{
    "path": "s3://alluxiow/testm/dir-1/",
    "options": {
         "replicas":"2",
         "batchSize": "300",
         "partialListing": "true",
         "loadMetadataOnly": "true",
         "skipIfExists": "true"
    }
}'
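
In this example, replicas sets the number of cached copies to create across workers (see multi-replication above), batchSize controls the number of files loaded per batch, partialListing lists the UFS directory incrementally rather than all upfront, loadMetadataOnly loads only file metadata without the file data, and skipIfExists skips files that are already cached. These descriptions follow from the option names; consult the job load reference for the authoritative semantics.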

Progress can be checked by sending a GET request with the same path.

curl -H "Content-Type: application/json"  -v -X GET http://coordinator_host:19999/api/v1/master/progress_job/load -d '{
  "path or indexFile": "s3://bucket/dir-1/",
  "format": "TEXT[default] | JSON",
  "verbose": "true"
}'
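
As a minimal sketch, the progress endpoint can be polled from a shell loop until the job reaches a terminal state. The body below uses the path key (the "path or indexFile" placeholder above means either key may be supplied), and the SUCCEEDED/FAILED markers are assumed to appear in the TEXT-format response, mirroring the job states accepted by the list_job endpoint below.

# Poll load progress every 10 seconds until the job succeeds or fails
while true; do
  out=$(curl -s -H "Content-Type: application/json" -X GET \
    http://coordinator_host:19999/api/v1/master/progress_job/load \
    -d '{"path": "s3://bucket/dir-1/"}')
  echo "$out"
  case "$out" in
    *SUCCEEDED*|*FAILED*) break ;;   # stop on a terminal job state
  esac
  sleep 10
done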

The load operation can be terminated by sending a POST request.

curl -H "Content-Type: application/json"  -v -X POST http://coordinator_host:19999/api/v1/master/stop_job/load -d '{
  "path or indexFile": "s3://alluxiow/testm/dir-1/"
}'

Load jobs can be listed by sending a GET request. The results only include load jobs from the past seven days; the retention time of historical jobs can be configured through alluxio.job.retention.time.

curl http://coordinator_host:19999/api/v1/master/list_job?[job-type=LOAD[&job-state=RUNNING|VERIFYING|STOPPED|SUCCEEDED|FAILED|ALL]]
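
For example, to list only the running load jobs (the brackets above mark optional query parameters):

curl "http://coordinator_host:19999/api/v1/master/list_job?job-type=LOAD&job-state=RUNNING"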