Trino on K8s
Trino is an open source distributed SQL query engine for running interactive analytic queries on data at large scale. This guide describes how to run queries against Trino with Alluxio as a distributed caching layer, for any data storage system that Alluxio supports, such as AWS S3 and HDFS. Alluxio allows Trino to access data regardless of the data source and transparently caches frequently accessed data (e.g., commonly used tables) in Alluxio distributed storage.
Prerequisites
This guide assumes that the Alluxio cluster is deployed on Kubernetes. docker is also required to build the custom Trino image.
Prepare image
To integrate with Trino, Alluxio jars and configuration files must be added to the Trino image. Trino containers need to be launched with this modified image in order to connect to the Alluxio cluster.
Among the files listed in the Alluxio installation instructions to download, locate the tarball named alluxio-enterprise-DA-3.2-8.0.0-release.tar.gz. Extract the following Alluxio jars from the tarball:
- client/alluxio-DA-3.2-8.0.0-client.jar
- client/ufs/alluxio-underfs-s3a-shaded-DA-3.2-8.0.0.jar if using an S3 bucket as a UFS
- client/ufs/alluxio-underfs-hadoop-3.3-shaded-DA-3.2-8.0.0.jar if using HDFS as a UFS
Prepare an empty directory as the working directory to build an image from. Within this directory, create the directory files/alluxio/ and copy the aforementioned jar files into it. Download the commons-lang3 jar into files/ so that it can also be included in the image.
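For example, the working directory can be prepared along these lines; the trino-image directory name, the name of the directory extracted from the tarball, and the Maven Central download URL are assumptions to adjust as needed:

```shell
# Extract the Alluxio jars from the release tarball
tar -xzf alluxio-enterprise-DA-3.2-8.0.0-release.tar.gz

# Create the image build context with a files/alluxio/ directory for the jars
mkdir -p trino-image/files/alluxio
cp alluxio-enterprise-DA-3.2-8.0.0/client/alluxio-DA-3.2-8.0.0-client.jar trino-image/files/alluxio/
cp alluxio-enterprise-DA-3.2-8.0.0/client/ufs/alluxio-underfs-s3a-shaded-DA-3.2-8.0.0.jar trino-image/files/alluxio/

# Download the commons-lang3 jar into files/
curl -L -o trino-image/files/commons-lang3-3.14.0.jar \
  https://repo1.maven.org/maven2/org/apache/commons/commons-lang3/3.14.0/commons-lang3-3.14.0.jar
```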
Create a Dockerfile with the operations to modify the base Trino image. The following example defines arguments for:
- TRINO_VERSION=449 as the Trino version
- UFS_JAR=files/alluxio/alluxio-underfs-s3a-shaded-DA-3.2-8.0.0.jar as the path to the UFS jar copied into files/alluxio/
- CLIENT_JAR=files/alluxio/alluxio-DA-3.2-8.0.0-client.jar as the path to the Alluxio client jar copied into files/alluxio/
- COMMONS_LANG_JAR=files/commons-lang3-3.14.0.jar as the path to the commons-lang3 jar downloaded to files/
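A minimal sketch of such a Dockerfile is shown below; the /opt/alluxio/lib destination inside the image is an assumption, and only the plugins you actually use need to be linked:

```dockerfile
ARG TRINO_VERSION=449
FROM trinodb/trino:${TRINO_VERSION}

ARG UFS_JAR=files/alluxio/alluxio-underfs-s3a-shaded-DA-3.2-8.0.0.jar
ARG CLIENT_JAR=files/alluxio/alluxio-DA-3.2-8.0.0-client.jar
ARG COMMONS_LANG_JAR=files/commons-lang3-3.14.0.jar

USER root

# Copy the jars into the image (destination directory is an assumption)
COPY ${UFS_JAR} ${CLIENT_JAR} ${COMMONS_LANG_JAR} /opt/alluxio/lib/

# Symlink the UFS jar into the Trino lib/ directory
RUN ln -s /opt/alluxio/lib/$(basename ${UFS_JAR}) /usr/lib/trino/lib/

# Symlink the Alluxio client and commons-lang3 jars into each plugin directory.
# For Trino versions earlier than 434, drop the trailing hdfs/ from the destination.
RUN for plugin in hive delta-lake iceberg; do \
      ln -s /opt/alluxio/lib/$(basename ${CLIENT_JAR}) /usr/lib/trino/plugin/${plugin}/hdfs/ && \
      ln -s /opt/alluxio/lib/$(basename ${COMMONS_LANG_JAR}) /usr/lib/trino/plugin/${plugin}/hdfs/; \
    done

# Switch back to the default image user
USER trino
```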
This will copy the necessary jars into the image. For the UFS jar, a symlink is created in the Trino lib/ directory pointing to it. For the Alluxio client and commons-lang3 jars, the build iterates through each of the possible plugin directories and creates a symlink to the jars in each one. In the above example, the jars are linked into the plugin directories for Hive, Delta Lake, and Iceberg, but only a subset may be needed depending on which connector(s) are being utilized.
Note that for Trino versions earlier than 434, the destination to copy the CLIENT_JAR and COMMONS_LANG_JAR should be /usr/lib/trino/plugin/<PLUGIN_NAME>/ instead of /usr/lib/trino/plugin/<PLUGIN_NAME>/hdfs/.
Build the image by running the following command, replacing <PRIVATE_REGISTRY> with the URL of your private container registry and <TRINO_VERSION> with the corresponding Trino version:
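For example, assuming the image is tagged as trino-alluxio:

```shell
docker build -t <PRIVATE_REGISTRY>/trino-alluxio:<TRINO_VERSION> .
```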
Push the image by running:
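Using the same tag as above:

```shell
docker push <PRIVATE_REGISTRY>/trino-alluxio:<TRINO_VERSION>
```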
Deploy Hive Metastore
To complete the Trino end-to-end example, a Hive Metastore will be launched. If there is already one available, this step can be skipped. The Hive Metastore URI will be needed to configure the Trino catalogs.
Use helm to create a stand-alone Hive metastore.
Add the helm repository by running:
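The chart repository is not pinned by this guide; substitute the name and URL of the Hive metastore chart you are using:

```shell
helm repo add <HIVE_CHART_REPO> <HIVE_CHART_REPO_URL>
helm repo update
```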
Install the Hive cluster by running:
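For example, assuming the release is named hive-metastore so that the resulting service matches the thrift://hive-metastore.default:9083 URI used later; the chart name is a placeholder:

```shell
helm install hive-metastore <HIVE_CHART_REPO>/<HIVE_CHART_NAME>
```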
Check the status of the Hive pods:
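For example:

```shell
kubectl get pods
```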
It is known that the hive-metastore-db-init-schema-* pods result in an Error status, but this does not impact the rest of the workflow.
When completed, delete the Hive cluster by running:
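Assuming the release name used above:

```shell
helm uninstall hive-metastore
```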
Configure AWS credentials for S3
If using S3, provide AWS credentials by editing the configmap:
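The configmap name depends on the chart that was installed; substitute it below:

```shell
kubectl edit configmap <HIVE_METASTORE_CONFIGMAP_NAME>
```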
In the editor, search for the hive-site.xml properties fs.s3a.access.key and fs.s3a.secret.key and populate them with AWS credentials to access your S3 bucket.
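The properties follow the standard Hadoop configuration format; the values below are placeholders for your own credentials:

```xml
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>
```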
Restart the Hive Metastore pod by deleting it:
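For example, substituting the actual pod name:

```shell
kubectl delete pod <HIVE_METASTORE_POD_NAME>
```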
And it will restart automatically.
Configure Trino
Create a trino-alluxio.yaml configuration file in the installation directory to define the Alluxio specific configurations. We'll break down this configuration file into several parts that can be joined into a single yaml file.
image and server
Specify the location of the custom Trino image that was previously built and pushed. Customize the Trino server specifications as needed.
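For example, assuming the image was pushed as trino-alluxio; the server values are illustrative only:

```yaml
image:
  repository: <PRIVATE_REGISTRY>/trino-alluxio
  tag: "449"

server:
  workers: 2
```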
additionalCatalogs
In addition to memory-related configurations, the catalog configurations are added under this section. Depending on which connector will be used, add the corresponding catalog configuration under additionalCatalogs. If following the previous section to launch the Hive Metastore, the value of hive.metastore.uri will be thrift://hive-metastore.default:9083.
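A minimal sketch of a Hive catalog pointing at that metastore; Delta Lake or Iceberg catalogs can be added analogously:

```yaml
additionalCatalogs:
  hive: |
    connector.name=hive
    hive.metastore.uri=thrift://hive-metastore.default:9083
```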
coordinator and worker
The coordinator and worker configurations both require similar modifications to their additionalJVMConfig and additionalConfigFiles sections.
For the coordinator:
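A sketch of the coordinator section; the alluxio.home value and the /etc/trino mount path for additionalConfigFiles are assumptions, and the S3 variant of alluxio-core-site.xml is shown:

```yaml
coordinator:
  additionalJVMConfig:
    - "-Dalluxio.home=/opt/alluxio"
    - "-Dalluxio.conf.dir=/etc/trino"
  additionalConfigFiles:
    alluxio-core-site.xml: |
      <configuration>
        <property>
          <name>fs.s3a.impl</name>
          <value>alluxio.hadoop.FileSystem</value>
        </property>
      </configuration>
    alluxio-site.properties: |
      # Populate with the properties described in the next section
    metrics.properties: |
      # Add the single line that enables Alluxio metrics to be scraped from Trino
```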
Note the 2 additional JVM configurations that define the alluxio.home and alluxio.conf.dir properties. Under additionalConfigFiles, 3 Alluxio configuration files are defined:
- alluxio-core-site.xml: This xml file defines which underlying Filesystem class should be used for specific schemes of file URIs. Setting the value to alluxio.hadoop.FileSystem ensures that file access goes through the Alluxio filesystem. Depending on the UFS, set the corresponding configuration:
  - For S3, set fs.s3a.impl to alluxio.hadoop.FileSystem
  - For HDFS, set fs.hdfs.impl to alluxio.hadoop.FileSystem
- alluxio-site.properties: This properties file defines the Alluxio specific configuration properties for the Alluxio client. The next section describes how to populate this file.
- metrics.properties: Adding this one line enables Alluxio specific metrics to be scraped from Trino.
For the worker, include the same 2 additional JVM configurations and copy the same 3 Alluxio configuration files under additionalConfigFiles.
alluxio-site.properties
For the Trino cluster to properly communicate with the Alluxio cluster, certain properties must be aligned between the Alluxio client and Alluxio server.
To show alluxio-site.properties from the Alluxio cluster configmap, run:
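For example, using the default configmap name:

```shell
kubectl get configmap alluxio-alluxio-conf -o yaml
```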
where alluxio-alluxio-conf is the name of the configmap for the Alluxio cluster.
In this output, search for the following properties under alluxio-site.properties and set them for the coordinator's and worker's alluxio-site.properties:
- alluxio.etcd.endpoints: The ETCD hosts of the Alluxio cluster. Assuming the hostnames are accessible from the Trino cluster, the same value can be used for the Trino cluster's alluxio-site.properties.
- alluxio.cluster.name: This must be set to the exact same value for the Trino cluster.
- alluxio.k8s.env.deployment=true
- alluxio.mount.table.source=ETCD
- alluxio.worker.membership.manager.type=ETCD
If following the Install Alluxio on Kubernetes instructions, the values should be:
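For example; the ETCD endpoint and cluster name below are placeholders, and the actual values should be copied from the configmap output:

```properties
alluxio.etcd.endpoints=http://alluxio-etcd:2379
alluxio.cluster.name=default-alluxio
alluxio.k8s.env.deployment=true
alluxio.mount.table.source=ETCD
alluxio.worker.membership.manager.type=ETCD
```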
Launch Trino
Add the helm chart to the local repository by running:
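This assumes the community Trino chart:

```shell
helm repo add trino https://trinodb.github.io/charts
helm repo update
```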
To deploy Trino, run:
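For example, naming the release trino-cluster and passing in the configuration file created above:

```shell
helm install -f trino-alluxio.yaml trino-cluster trino/trino
```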
Check the status of the Trino pods and identify the pod name of the Trino coordinator:
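For example:

```shell
kubectl get pods | grep trino-cluster
```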
When completed, delete the Trino cluster by running:
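Assuming the release name used above:

```shell
helm uninstall trino-cluster
```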
Executing Queries
To enter the Trino coordinator pod, run:
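For example:

```shell
kubectl exec -it $POD_NAME -- /bin/bash
```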
where $POD_NAME is the full pod name of the Trino coordinator, starting with trino-cluster-trino-coordinator-.
Inside the Trino pod, run:
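This assumes the Trino CLI is available on the PATH inside the image:

```shell
trino
```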
To create a simple schema in your UFS mount, run:
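For example, using the hive catalog defined earlier and a hypothetical schema name:

```sql
CREATE SCHEMA hive.test_schema WITH (location = 'UFS_URI');
```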
where UFS_URI is a path within one of the defined Alluxio mounts. Using an S3 bucket as an example, this could be s3a://myBucket/myPrefix/.