Trino on K8s
Trino is an open source distributed SQL query engine for running interactive analytic queries on data at large scale. This guide describes how to run queries against Trino with Alluxio as a distributed caching layer, for any data storage system that Alluxio supports, such as AWS S3 and HDFS. Alluxio allows Trino to access data regardless of the data source and transparently caches frequently accessed data (e.g., commonly used tables) in Alluxio distributed storage.
Trino can connect to Alluxio through Alluxio's S3 API interface or its HDFS interface. The S3 API interface is recommended because it is compatible with newer Trino versions and is simpler to configure and deploy.
This guide assumes that the Alluxio cluster is deployed on Kubernetes.
S3 API Interface
Using the S3 API interface requires enabling the S3 API in Alluxio's configuration.
Set the following properties to enable the S3 API and avoid redirect responses from workers:
spec:
  properties:
    alluxio.worker.s3.api.enabled: "true"
    alluxio.worker.s3.redirect.enabled: "false"
If a metastore is not deployed, follow the steps to deploy Hive Metastore.
Deploy Trino
Create a trino-alluxio.yaml configuration file in the installation directory. We'll break down this configuration file into several parts that can be joined into a single yaml file.
Setting image, server, coordinator, and worker
image:
  repository: trinodb/trino
  tag: "476"
server:
  config:
    query.maxMemory: 8GB
coordinator:
  config:
    query.maxMemory: 2GB
  jvm:
    maxHeapSize: 8G
    gcMethod:
      type: UseG1GC
      g1:
        heapRegionSize: 32M
worker:
  replicas: 2
  config:
    query.maxMemory: 10GB
  jvm:
    maxHeapSize: 10G
    gcMethod:
      type: UseG1GC
      g1:
        heapRegionSize: 32M
Setting additionalCatalogs
additionalCatalogs:
  memory: |
    connector.name=memory
    memory.max-data-per-node=128MB
In addition to the memory catalog shown above, catalog configurations for the connectors you plan to use are added under this section. Depending on which connector will be used, add the corresponding catalog configuration under additionalCatalogs.
If following the provided instructions to launch the Alluxio cluster and Hive Metastore:
- The value of hive.metastore.uri is thrift://hive-metastore.default:9083
- The value of s3.endpoint is http://alluxio-worker.alx-ns.svc.cluster.local:29998
A load balancer could also be set up to serve as a single endpoint for all workers, in which case the load balancer's address is the value for s3.endpoint.
hive: |-
  connector.name=hive
  fs.native-s3.enabled=true
  s3.endpoint=<ALLUXIO_WORKER_SERVICE_ENDPOINT>
  s3.path-style-access=true
  hive.metastore.uri=thrift://<HIVE_METASTORE_URI>
Launch Trino
Add the helm chart to the local repository by running:
$ helm repo add trino https://trinodb.github.io/charts
To deploy Trino, run:
$ helm install -f trino-alluxio.yaml trino-cluster trino/trino
Check the status of the Trino pods and identify the pod name of the Trino coordinator:
$ kubectl get pods
Test the functionality by executing a query, as described in Executing Queries in the appendix.
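As a quick sanity check that the catalogs are registered, a query can also be run directly through the coordinator pod. The following is a sketch assuming the coordinator pod name identified above and the trino CLI bundled in the Trino image:
$ kubectl exec -it <COORDINATOR_POD_NAME> -- trino --execute "SHOW CATALOGS"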
When completed, delete the Trino cluster by running:
$ helm uninstall trino-cluster
HDFS Interface
Please note that integrating through the HDFS interface is not compatible with Trino version 476 or newer.
If a metastore is not deployed, follow the steps to deploy Hive Metastore.
Prepare Image
To integrate with Trino, Alluxio jars and configuration files must be added to the Trino image. Trino containers need to be launched with this modified image in order to connect to the Alluxio cluster.
docker is required to build the custom Trino image.
Among the files listed for download in the Alluxio installation instructions, locate the tarball named alluxio-enterprise-DA-3.7-13.0.1-release.tar.gz. Extract the following Alluxio jars from the tarball:
- client/alluxio-DA-3.7-13.0.1-client.jar
- client/ufs/alluxio-underfs-s3a-shaded-DA-3.7-13.0.1.jar if using an S3 bucket as the UFS
- client/ufs/alluxio-underfs-hadoop-3.3-shaded-DA-3.7-13.0.1.jar if using HDFS as the UFS
Prepare an empty directory as the working directory to build an image from. Within this directory, create the directory files/alluxio/ and copy the aforementioned jar files into it.
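For example, assuming the tarball unpacks into a top-level directory named alluxio-enterprise-DA-3.7-13.0.1 (an assumption; adjust the paths to the actual extracted directory), the jars could be staged as follows:
# extract the tarball and stage the jars in the image build context (sketch)
$ tar -xzf alluxio-enterprise-DA-3.7-13.0.1-release.tar.gz
$ mkdir -p files/alluxio
$ cp alluxio-enterprise-DA-3.7-13.0.1/client/alluxio-DA-3.7-13.0.1-client.jar files/alluxio/
$ cp alluxio-enterprise-DA-3.7-13.0.1/client/ufs/alluxio-underfs-s3a-shaded-DA-3.7-13.0.1.jar files/alluxio/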
Download the commons-lang3 jar into files/ so that it can also be included in the image.
$ wget https://repo1.maven.org/maven2/org/apache/commons/commons-lang3/3.14.0/commons-lang3-3.14.0.jar -O files/commons-lang3-3.14.0.jar
Create a Dockerfile with the operations to modify the base Trino image. The following example defines arguments for:
- TRINO_VERSION=449 as the Trino version
- UFS_JAR=files/alluxio/alluxio-underfs-s3a-shaded-DA-3.7-13.0.1.jar as the path to the UFS jar copied into files/alluxio/
- CLIENT_JAR=files/alluxio/alluxio-DA-3.7-13.0.1-client.jar as the path to the Alluxio client jar copied into files/alluxio/
- COMMONS_LANG_JAR=files/commons-lang3-3.14.0.jar as the path to the commons-lang3 jar downloaded to files/
ARG TRINO_VERSION=449
ARG IMAGE=trinodb/trino:${TRINO_VERSION}
FROM $IMAGE
ARG UFS_JAR=files/alluxio/alluxio-underfs-s3a-shaded-DA-3.7-13.0.1.jar
ARG CLIENT_JAR=files/alluxio/alluxio-DA-3.7-13.0.1-client.jar
ARG COMMONS_LANG_JAR=files/commons-lang3-3.14.0.jar
# remove existing alluxio jars
RUN rm /usr/lib/trino/plugin/hive/alluxio*.jar \
&& rm /usr/lib/trino/plugin/delta-lake/alluxio*.jar \
&& rm /usr/lib/trino/plugin/iceberg/alluxio*.jar
# copy jars into image
ENV ALLUXIO_HOME=/usr/lib/trino/alluxio
RUN mkdir -p $ALLUXIO_HOME/lib && mkdir -p $ALLUXIO_HOME/conf
COPY --chown=trino:trino $UFS_JAR /usr/lib/trino/alluxio/lib/
COPY --chown=trino:trino $CLIENT_JAR /usr/lib/trino/alluxio/lib/
COPY --chown=trino:trino $COMMONS_LANG_JAR /usr/lib/trino/alluxio/lib/
# symlink UFS jar to lib/ and client+commons-lang3 jar to each plugin directory
RUN ln -s $ALLUXIO_HOME/lib/alluxio-underfs-*.jar /usr/lib/trino/lib/
RUN ln -s $ALLUXIO_HOME/lib/alluxio-*-client.jar /usr/lib/trino/plugin/hive/hdfs/ \
&& ln -s $ALLUXIO_HOME/lib/alluxio-*-client.jar /usr/lib/trino/plugin/delta-lake/hdfs/ \
&& ln -s $ALLUXIO_HOME/lib/alluxio-*-client.jar /usr/lib/trino/plugin/iceberg/hdfs/
RUN ln -s $ALLUXIO_HOME/lib/commons-lang3-3.14.0.jar /usr/lib/trino/plugin/hive/hdfs/ \
&& ln -s $ALLUXIO_HOME/lib/commons-lang3-3.14.0.jar /usr/lib/trino/plugin/delta-lake/hdfs/ \
&& ln -s $ALLUXIO_HOME/lib/commons-lang3-3.14.0.jar /usr/lib/trino/plugin/iceberg/hdfs/
USER trino:trino
This will copy the necessary jars into the image. For the UFS jar, a symlink pointing to it is created in the Trino lib/ directory. For the Alluxio client and commons-lang3 jars, a symlink pointing to each jar is created in each of the relevant plugin directories.
In the above example, the jars are copied into the plugin directories for Hive, Delta Lake, and Iceberg, but only a subset may be needed depending on which connector(s) are being utilized.
Note that for Trino versions earlier than 434, the destination for the CLIENT_JAR and COMMONS_LANG_JAR should be /usr/lib/trino/plugin/<PLUGIN_NAME>/ instead of /usr/lib/trino/plugin/<PLUGIN_NAME>/hdfs/.
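For such older versions, the symlink steps of the Dockerfile could look like the following sketch (shown for the Hive plugin only; repeat for Delta Lake and Iceberg as needed, and verify against the plugin layout of your Trino version):
# Trino versions earlier than 434: link directly into the plugin directory (sketch)
RUN ln -s $ALLUXIO_HOME/lib/alluxio-*-client.jar /usr/lib/trino/plugin/hive/ \
 && ln -s $ALLUXIO_HOME/lib/commons-lang3-3.14.0.jar /usr/lib/trino/plugin/hive/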
Build the image by running the following command, replacing <PRIVATE_REGISTRY> with the URL of your private container registry and <TRINO_VERSION> with the corresponding Trino version:
$ docker build --platform linux/amd64 -t <PRIVATE_REGISTRY>/alluxio/trino:<TRINO_VERSION> .
Push the image by running:
$ docker push <PRIVATE_REGISTRY>/alluxio/trino:<TRINO_VERSION>
Configure Trino
Create a trino-alluxio.yaml configuration file in the installation directory to define the Alluxio specific configurations. We'll break down this configuration file into several parts that can be joined into a single yaml file.
Setting image and server
image:
  registry: <PRIVATE_REGISTRY>
  repository: <PRIVATE_REPOSITORY>
  tag: <TRINO_VERSION>
server:
  workers: 2
  log:
    trino:
      level: INFO
  config:
    query:
      maxMemory: "4GB"
Specify the location of the custom Trino image that was previously built and pushed. Customize the Trino server specifications as needed.
Setting additionalCatalogs
additionalCatalogs:
  memory: |
    connector.name=memory
    memory.max-data-per-node=128MB
In addition to the memory catalog shown above, catalog configurations for the connectors you plan to use are added under this section. Depending on which connector will be used, add the corresponding catalog configuration under additionalCatalogs. If following the previous section to launch the Hive Metastore, the value of hive.metastore.uri will be thrift://hive-metastore.default:9083.
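For instance, a minimal hive catalog entry under additionalCatalogs could look like the sketch below, assuming the Hive Metastore deployed in the appendix; depending on the Trino version and UFS, additional file-system related properties may be required:
  hive: |-
    connector.name=hive
    hive.metastore.uri=thrift://hive-metastore.default:9083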
Setting coordinator and worker
The coordinator and worker configurations both require similar modifications to their additionalJVMConfig and additionalConfigFiles sections.
For the coordinator:
coordinator:
  jvm:
    maxHeapSize: "8G"
    gcMethod:
      type: "UseG1GC"
      g1:
        heapRegionSize: "32M"
  config:
    memory:
      heapHeadroomPerNode: ""
    query:
      maxMemoryPerNode: "1GB"
  additionalJVMConfig:
    - "-Dalluxio.home=/usr/lib/trino/alluxio"
    - "-Dalluxio.conf.dir=/etc/trino/"
    - "-XX:+UnlockDiagnosticVMOptions"
    - "-XX:G1NumCollectionsKeepPinned=10000000"
    - "-Djava.security.manager=allow"
  additionalConfigFiles:
    alluxio-core-site.xml: |
      <configuration>
        <property>
          <name>fs.s3a.impl</name>
          <value>alluxio.hadoop.FileSystem</value>
        </property>
        <property>
          <name>fs.hdfs.impl</name>
          <value>alluxio.hadoop.FileSystem</value>
        </property>
      </configuration>
    alluxio-site.properties: |
      alluxio.etcd.endpoints: http://<ETCD_POD_NAME>:2379
      alluxio.cluster.name: default-alluxio
      alluxio.k8s.env.deployment: true
      alluxio.mount.table.source: ETCD
      alluxio.worker.membership.manager.type: ETCD
    metrics.properties: |
      # Enable the Alluxio Jmx sink
      sink.jmx.class=alluxio.metrics.sink.JmxSink
If using Trino version 464 or later, "-Djava.security.manager=allow" is required to be compatible with Java 23.
Note the 2 additional JVM configurations to define the alluxio.home and alluxio.conf.dir properties.
Under additionalConfigFiles, 3 Alluxio configuration files are defined.
- alluxio-core-site.xml: This xml file defines which underlying FileSystem class should be used for specific URI schemes. Setting the value to alluxio.hadoop.FileSystem ensures that file operations go through the Alluxio filesystem. Depending on the UFS, set the corresponding configurations:
  - For S3, set fs.s3a.impl to alluxio.hadoop.FileSystem
  - For HDFS, set fs.hdfs.impl to alluxio.hadoop.FileSystem
- alluxio-site.properties: This properties file defines the Alluxio specific configuration properties for the Alluxio client. The alluxio-site.properties section below describes how to populate this file.
- metrics.properties: Adding this one line enables Alluxio specific metrics to be scraped from Trino.
For the worker, include the same 2 additional JVM configurations and copy the same 3 Alluxio configuration files under additionalConfigFiles.
worker:
  jvm:
    maxHeapSize: "8G"
    gcMethod:
      type: "UseG1GC"
      g1:
        heapRegionSize: "32M"
  config:
    memory:
      heapHeadroomPerNode: ""
    query:
      maxMemoryPerNode: "1GB"
  additionalJVMConfig:
    - "-Dalluxio.home=/usr/lib/trino/alluxio"
    - "-Dalluxio.conf.dir=/etc/trino/"
    - "-XX:+UnlockDiagnosticVMOptions"
    - "-XX:G1NumCollectionsKeepPinned=10000000"
    - "-Djava.security.manager=allow"
  additionalConfigFiles:
    alluxio-core-site.xml: |
      <configuration>
        <property>
          <name>fs.s3a.impl</name>
          <value>alluxio.hadoop.FileSystem</value>
        </property>
        <property>
          <name>fs.hdfs.impl</name>
          <value>alluxio.hadoop.FileSystem</value>
        </property>
      </configuration>
    alluxio-site.properties: |
      alluxio.etcd.endpoints: http://<ETCD_POD_NAME>:2379
      alluxio.cluster.name: default-alluxio
      alluxio.k8s.env.deployment: true
      alluxio.mount.table.source: ETCD
      alluxio.worker.membership.manager.type: ETCD
    metrics.properties: |
      sink.jmx.class=alluxio.metrics.sink.JmxSink
If using Trino version 464 or later, "-Djava.security.manager=allow" is required to be compatible with Java 23.
alluxio-site.properties
For the Trino cluster to properly communicate with the Alluxio cluster, certain properties must be aligned between the Alluxio client and Alluxio server.
To show alluxio-site.properties from the Alluxio cluster config map, run:
$ kubectl get configmap alluxio-alluxio-conf -o yaml
where alluxio-alluxio-conf is the name of the configmap for the Alluxio cluster.
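To print only that file from the configmap, a jsonpath query such as the following sketch can be used (the backslash escapes the dot in the data key name):
$ kubectl get configmap alluxio-alluxio-conf -o jsonpath='{.data.alluxio-site\.properties}'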
In this output, search for the following properties under alluxio-site.properties and set them for the coordinator's and worker's alluxio-site.properties:
- alluxio.etcd.endpoints: The ETCD hosts of the Alluxio cluster. Assuming the hostnames are accessible from the Trino cluster, the same value can be used for the Trino cluster's alluxio-site.properties.
- alluxio.cluster.name: This must be set to the exact same value for the Trino cluster.
- alluxio.k8s.env.deployment=true
- alluxio.mount.table.source=ETCD
- alluxio.worker.membership.manager.type=ETCD
If following the Install Alluxio on Kubernetes instructions, the values should be:
alluxio-site.properties: |
  alluxio.etcd.endpoints: http://alluxio-etcd.default:2379
  alluxio.cluster.name: default-alluxio
  alluxio.k8s.env.deployment: true
  alluxio.mount.table.source: ETCD
  alluxio.worker.membership.manager.type: ETCD
Launch Trino
Add the helm chart to the local repository by running:
$ helm repo add trino https://trinodb.github.io/charts
To deploy Trino, run:
$ helm install -f trino-alluxio.yaml trino-cluster trino/trino
Check the status of the Trino pods and identify the pod name of the Trino coordinator:
$ kubectl get pods
Test the functionality by executing a query, as described in Executing Queries in the appendix.
When completed, delete the Trino cluster by running:
$ helm uninstall trino-cluster
Appendix
Deploy Hive Metastore
To complete the Trino end-to-end example, a Hive Metastore will be launched. If there is already one available, this step can be skipped. The Hive Metastore URI will be needed to configure the Trino catalogs.
Use helm to create a stand-alone Hive metastore.
Add the helm repository by running:
$ helm repo add heva-helm-charts https://hevaweb.github.io/heva-helm-charts/
Install the Hive cluster by running:
$ helm install hive-metastore heva-helm-charts/hive-metastore
Check the status of the Hive pods:
$ kubectl get pods
It is known that the hive-metastore-db-init-schema-* pods result in an Error status, but this does not impact the rest of the workflow.
When completed, delete the Hive cluster by running:
$ helm uninstall hive-metastore
Configure AWS credentials for S3
If using S3, provide AWS credentials by editing the configmap:
$ kubectl edit configmap hive-metastore
In the editor, search for the hive-site.xml properties fs.s3a.access.key and fs.s3a.secret.key and populate them with AWS credentials that can access your S3 bucket.
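In hive-site.xml, the populated properties would look roughly like this sketch (the values are placeholders for your own credentials):
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>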
Restart the Hive Metastore pod by deleting it:
$ kubectl delete pod hive-metastore-0
The pod will restart automatically.
Executing Queries
To enter the Trino coordinator pod, run:
$ kubectl exec -it $POD_NAME -- /bin/bash
where $POD_NAME is the full pod name of the Trino coordinator, starting with trino-cluster-trino-coordinator-
Inside the Trino pod, run:
$ trino
To create a simple schema in your UFS mount, run:
trino> CREATE SCHEMA hive.example WITH (location = '<UFS_URI>');
where UFS_URI is a path within one of the defined Alluxio mounts. Using an S3 bucket as an example, this could be s3a://myBucket/myPrefix/.
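To further verify reads and writes through Alluxio, a short follow-up can be run against the new schema; this is an illustrative sketch where the table and column names are arbitrary:
trino> CREATE TABLE hive.example.greetings (id BIGINT, message VARCHAR);
trino> INSERT INTO hive.example.greetings VALUES (1, 'hello from trino');
trino> SELECT * FROM hive.example.greetings;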