Data Lake Connectors
Data lake connectors enable compute engines such as Trino and Spark to query data as structured tables.
The supported connectors include Iceberg and Delta Lake. Instructions for configuring each connector are described in the documentation for the respective compute engine.
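As an illustration only, the sketch below shows how an Iceberg catalog might be pointed at a warehouse served through Alluxio when launching spark-sql. The catalog name, database, table, warehouse path, and Alluxio master address are all placeholders, and the Iceberg Spark runtime jar is assumed to be on the classpath; consult the compute engine documentation for the authoritative settings.

# Hypothetical sketch: an Iceberg Hadoop catalog whose warehouse path goes
# through Alluxio. Catalog name, host, and paths are placeholders.
spark-sql \
  --conf spark.sql.catalog.lake=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.lake.type=hadoop \
  --conf spark.sql.catalog.lake.warehouse=alluxio://alluxio-master:19998/warehouse \
  -e "SELECT * FROM lake.db.sales LIMIT 10"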
Known limitations
Iceberg
Because Iceberg tracks table state through metadata files that change with every commit, it is strongly recommended not to cache them. If metadata files are persisted to the cache, stale copies can cause errors or warnings when those files are subsequently accessed.
After determining the locations of the metadata files, set those paths to skipCache via the cache filter feature. A cache filter on Iceberg metadata paths can be configured as follows:
./bin/alluxio cache-filter add --rule skipCache --type metadata --pattern ".*/metadata/.*"
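To verify what the pattern above matches, you can list a table's metadata directory directly in the UFS. The bucket and table paths below are hypothetical:

# List an Iceberg table's metadata directory (hypothetical path); the
# ".*/metadata/.*" pattern covers the files stored there.
hadoop fs -ls s3a://my-bucket/warehouse/db.db/sales/metadata/
# Version files (v1.metadata.json, v2.metadata.json, ...), snapshot manifest
# lists (snap-*.avro), and manifest files (*-m0.avro) would all skip the cache.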
Delta Lake
Delta Lake transaction logs (_delta_log) are updated frequently. If they are cached in Alluxio, stale metadata may cause queries to reference files that no longer exist. To avoid this, skip caching Delta Lake metadata files by applying a cache filter on the _delta_log path, as follows:
./bin/alluxio cache-filter add --rule skipCache --type metadata --pattern ".*/_delta_log/.*"
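As a quick sanity check of the pattern, grep can confirm that transaction-log paths match while data files do not; the paths below are hypothetical:

# Matches: the commit file lives under _delta_log, so it skips the cache.
echo "s3a://my-bucket/events/_delta_log/00000000000000000010.json" | grep -E ".*/_delta_log/.*"
# No match (grep exits 1): Parquet data files are still cached normally.
echo "s3a://my-bucket/events/part-00000-abc.snappy.parquet" | grep -E ".*/_delta_log/.*"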
Caching data when writing to HDFS
When writing data with HDFS as the UFS, data is not cached at write time, even when the write type is configured to persist data to the cache. The newly written data only enters the Alluxio cache when it is first read (a cold read). Note that this behavior was observed with Trino connecting to HDFS, but not with Trino connecting to S3.
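In practice this means a query must touch the newly written data once before it is served from the cache. A minimal sketch using the Trino CLI, with a hypothetical catalog, schema, and table name:

# The first read after the write is a cold read against HDFS and populates
# the cache; repeated reads are then served from Alluxio. Names are placeholders.
trino --execute "SELECT count(*) FROM hive.analytics.new_table"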