OCI OKE

This page covers OCI-specific differences when deploying Alluxio on an existing Oracle Cloud Infrastructure Kubernetes Engine (OKE) cluster. For the generic operator and cluster install steps, see Kubernetes Installation.

Overview

Deploying Alluxio on OKE differs from the generic Kubernetes install in three areas:

  • Image registry. Images push to OCI Container Registry (OCIR). OCIR authenticates with an Auth Token, which is separate from the API key that the OCI CLI uses.

  • Worker storage. General-purpose OCI shapes (e.g., VM.Standard.E5.Flex) have only ~36 GiB free on the boot disk after the OS and container runtime, so on a default node the page store uses emptyDir (or the default hostPath) capped at ~30 GiB. DenseIO shapes with local NVMe can point hostPath at the NVMe mount for a larger, persistent cache. PVC mode with the oci-bv StorageClass does not work for multi-worker deployments, because every worker replica would try to attach the same ReadWriteOnce (RWO) volume.

  • Load balancer annotations. OCI service LBs are public by default; OCI-specific annotations on the Kubernetes Service switch them to internal.

OCI Object Storage as a UFS is covered in S3 Compatible Storages → Oracle Cloud Infrastructure (OCI) object storage and is not duplicated here. Alluxio EE does not support the oci:// scheme — mount OCI Object Storage via its S3-compatible endpoint only.

All examples below use us-phoenix-1 (OCIR alias phx.ocir.io). Adapt for your subscribed region.

Before You Start

These checks are in addition to the generic Kubernetes prerequisites. Provisioning the OKE cluster itself is out of scope; see the OKE documentation for cluster setup.

Environment Variables
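
The variable list for this section is not shown on this page. As a rough sketch, the later steps need values along these lines. Apart from OCIR_AUTH_TOKEN, which the Troubleshooting section mentions, every variable name and the namespace value below are hypothetical placeholders.

```shell
# Hypothetical variable names; only OCIR_AUTH_TOKEN is referenced elsewhere on this page.
export OCIR_REGISTRY="phx.ocir.io"       # OCIR endpoint for us-phoenix-1
export OCI_NAMESPACE="<namespace>"       # tenancy namespace (one way to look it up: oci os ns get)
export OCIR_USER="<user-email>"          # OCI user that owns the Auth Token
export OCIR_AUTH_TOKEN="<auth-token>"    # Auth Token from the OCI Console, not an API key
export ALLUXIO_NAMESPACE="alx-ns"        # assumed Kubernetes namespace for the Alluxio cluster
```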

Installation Steps

1. Push Alluxio Images to OCIR

Resolve the OCIR login string and push with skopeo (no local Docker daemon required):
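
A minimal sketch of that flow, assuming the Alluxio images were delivered as docker-archive tarballs and that the Auth Token is exported as OCIR_AUTH_TOKEN. The function names, file name, and repository path are illustrative, not part of the Alluxio tooling.

```shell
# Build the OCIR login username. Pass "federated" as the third argument for
# Identity Domain tenancies (format taken from the note on this page).
ocir_username() {
  if [ "${3:-}" = "federated" ]; then
    printf '%s/oracleidentitycloudservice/%s\n' "$1" "$2"
  else
    printf '%s/%s\n' "$1" "$2"
  fi
}

# Log in to OCIR and copy one image archive into the registry.
# Arguments: <namespace> <user-email> <image-tar> <repo:tag>
# The Auth Token is piped on stdin so it does not appear in the process list.
push_to_ocir() {
  registry="phx.ocir.io"
  printf '%s' "$OCIR_AUTH_TOKEN" |
    skopeo login --username "$(ocir_username "$1" "$2")" --password-stdin "$registry" &&
    skopeo copy "docker-archive:$3" "docker://$registry/$1/$4"
}

# Example (hypothetical file and repository names):
# push_to_ocir mytenancy user@example.com alluxio-enterprise.tar alluxio/alluxio-enterprise:latest
```

Run push_to_ocir once per image archive, so that both the operator image and the cluster image end up in OCIR.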

If your OCI tenancy uses an Identity Domain (federated identity), the login username is <namespace>/oracleidentitycloudservice/<user-email>.

✅ Success: Both images appear in OCI Console → Developer Services → Container Registry.

2. Create the Image-Pull Secret

Every pod that pulls from OCIR needs a docker-registry secret. Create one per namespace that will pull the images:
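
One way to do this with kubectl is sketched below. The secret name and the namespace names in the example are assumptions; use whatever your operator and cluster manifests reference.

```shell
# Create the image-pull secret in one namespace.
# Arguments: <k8s-namespace> <secret-name> <ocir-username>
# The password is the OCIR Auth Token, read from OCIR_AUTH_TOKEN.
create_ocir_pull_secret() {
  kubectl create secret docker-registry "$2" \
    --namespace "$1" \
    --docker-server="phx.ocir.io" \
    --docker-username="$3" \
    --docker-password="$OCIR_AUTH_TOKEN"
}

# Example: operator and cluster in separate namespaces (names assumed).
# for ns in alluxio-operator alx-ns; do
#   create_ocir_pull_secret "$ns" ocir-secret "mytenancy/user@example.com"
# done
```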

3. Deploy Alluxio with OCI-Specific Values

From here, follow the generic Kubernetes Installation guide, starting at Step 1 — Prepare Helm Chart. Step 1 above (pushing images to OCIR) replaces the generic Step 0 (pushing images to a private registry).

Apply the OCI-specific values below when you create alluxio-operator.yaml and alluxio-cluster.yaml.

Operator values (alluxio-operator.yaml):

Cluster image and pull secret (alluxio-cluster.yaml):

Worker page store — size the node-local storage, then point hostPath at it.

Alluxio worker cache capacity is bounded by the node-local storage available to each worker pod. On OKE, choose one of the following storage options before setting pagestore.size:

  • DenseIO shapes (BM.DenseIO.E5, VM.DenseIO2, etc.) expose local NVMe at high bandwidth. This is the recommended shape for production ML workloads. Mount the NVMe at /mnt/alluxio/pagestore on the node and use the default hostPath page store.

  • General-purpose shapes with a provisioned boot volume (VM.Standard.E5.Flex with bootVolumeSizeInGBs raised at node-pool creation, e.g., 500–2000 GiB). The bigger boot volume gives hostPath room to hold a real working-set cache.

  • General-purpose shapes with an attached block volume per worker node — provision an extra OCI Block Volume on each node (Terraform, cloud-init, or a DaemonSet), format and mount it at /mnt/alluxio/pagestore, then use hostPath. Each node gets its own volume, so there is no RWO sharing issue.

Then size the page store to the provisioned capacity:
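
As a rough aid for the arithmetic, the helper below splits a node's usable capacity into a page store size and about 10% headroom. It assumes that size plus reservedSize should fit inside the provisioned capacity; the function name is illustrative, and the exact semantics of reservedSize should be checked against the generic installation guide.

```shell
# Split a node's usable capacity (whole GiB) into a page store size and
# roughly 10% headroom. Assumes size + reservedSize must fit in the capacity.
pagestore_sizing() {
  capacity_gib=$1
  size_gib=$(( capacity_gib * 90 / 100 ))
  reserved_gib=$(( capacity_gib - size_gib ))
  printf 'size: %sGi\nreservedSize: %sGi\n' "$size_gib" "$reserved_gib"
}

# Example: a 500 GiB volume mounted at /mnt/alluxio/pagestore.
pagestore_sizing 500
```

The printed values go under the worker pagestore settings in alluxio-cluster.yaml.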

Evaluation-only shortcut. The default OKE E5.Flex boot disk is ~46 GiB (~36 GiB usable). If you accept a small, ephemeral cache purely for evaluation, pagestore.type: emptyDir with size: 30Gi works without any extra storage provisioning. This is not suitable for production — every pod restart reloads the cache from UFS.

Internal load balancer for the S3 gateway. OCI service LBs default to public; keep service traffic inside the VCN:

After helm upgrade --install, OCI provisions the LB in ~60–90 seconds. Retrieve its internal IP with:
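
A sketch of the lookup using kubectl's jsonpath output. The Service name is a placeholder argument because it depends on your cluster name. The second helper reads back the OCI annotation service.beta.kubernetes.io/oci-load-balancer-internal so you can confirm the LB was requested as internal.

```shell
# Print the load balancer IP that OCI assigned to a Service.
# Arguments: <k8s-namespace> <service-name>
lb_ip() {
  kubectl get service "$2" --namespace "$1" \
    --output jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Print the value of the OCI "internal" annotation on the same Service.
# "true" means the LB was requested as internal to the VCN.
lb_internal_flag() {
  kubectl get service "$2" --namespace "$1" \
    --output jsonpath='{.metadata.annotations.service\.beta\.kubernetes\.io/oci-load-balancer-internal}'
}

# Example (service name is a placeholder):
# lb_ip alx-ns <s3-gateway-service>
```

An empty result from lb_ip usually means the LB is still provisioning.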

4. Access the Cluster

Service LBs provisioned in Step 3 are internal to the VCN. Two common access patterns:

Option A — kubectl port-forward from your admin workstation:
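
A minimal sketch of Option A. The Service name and both ports are placeholders for your deployment.

```shell
# Forward a local port to a Service port inside the cluster.
# Arguments: <k8s-namespace> <service-name> <local-port> <service-port>
# Runs in the foreground; stop it with Ctrl-C.
forward_service() {
  kubectl port-forward --namespace "$1" "service/$2" "$3:$4"
}

# Example (placeholders):
# forward_service alx-ns <s3-gateway-service> 8080 <gateway-port>
```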

Option B — SSH tunnel through a bastion VM inside the VCN:
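
A minimal sketch of Option B, assuming the bastion accepts SSH from your workstation and can reach the internal LB IP. The user, host, and ports are placeholders.

```shell
# Open a tunnel: localhost:<local-port> -> <lb-ip>:<lb-port> through the bastion.
# Arguments: <user@bastion-host> <lb-internal-ip> <lb-port> <local-port>
# -N keeps the session open without running a remote command.
tunnel_via_bastion() {
  ssh -N -L "$4:$2:$3" "$1"
}

# Example (placeholders):
# tunnel_via_bastion opc@<bastion-public-ip> <lb-internal-ip> <gateway-port> 8080
```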

Option A requires no additional VM; Option B gives you reusable URLs for a team.

Uninstall

Follow the generic Kubernetes Installation → Uninstall procedure as-is. After it completes, remove the OCIR image-pull secrets:
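
A sketch of the cleanup. The secret and namespace names are whatever you used when creating the secrets; the ones in the example are assumptions.

```shell
# Delete the image-pull secret from one namespace.
# Arguments: <k8s-namespace> <secret-name>
# --ignore-not-found makes the call safe to repeat.
delete_ocir_pull_secret() {
  kubectl delete secret "$2" --namespace "$1" --ignore-not-found
}

# Example (names assumed):
# for ns in alluxio-operator alx-ns; do
#   delete_ocir_pull_secret "$ns" ocir-secret
# done
```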

OCI infrastructure (VCN, OKE cluster, node pools) is managed separately; see the OKE documentation for teardown. The Alluxio images pushed to OCIR can stay in place for reuse on the next deployment.

Troubleshooting

Symptom: OCIR skopeo login or push returns unauthorized

Likely cause: Authenticating with API key credentials instead of an Auth Token.

Fix: Generate an Auth Token in OCI Console → User → Auth Tokens, export it as OCIR_AUTH_TOKEN, and rerun skopeo login. API keys authenticate oci CLI calls but never OCIR image endpoints.

Symptom: OCIR login succeeds, but skopeo copy returns denied: requested access to the resource is denied

Likely cause: The user account lacks manage repos on the target compartment, or the repository path is wrong.

How to diagnose:

Confirm the target compartment, then check the IAM policy for your group includes Allow group <g> to manage repos in compartment <c>.

Fix: Grant the policy, or push to a compartment where the user has permission. Repository names are case-sensitive in OCIR.

Symptom: OCIR username rejected

Likely cause: The username format does not match the tenancy's federation setup. Identity-Domain tenancies need an extra path segment.

Fix: Non-federated users: <namespace>/<user-email>. Federated (Identity Domain enabled): <namespace>/oracleidentitycloudservice/<user-email>.

Symptom: Worker pods Pending, PVCs stay Pending

Likely cause: No oci-bv StorageClass, or the OKE cluster is too old to include the OCI Block Volume CSI driver.

How to diagnose:
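
A possible set of checks, assuming the workers run in the namespace you pass in. The function name is illustrative.

```shell
# List StorageClasses (look for oci-bv), then show PVC details and the
# most recent events in the namespace.
# Arguments: <k8s-namespace>
diagnose_worker_pvc() {
  kubectl get storageclass
  kubectl describe pvc --namespace "$1"
  kubectl get events --namespace "$1" --sort-by=.lastTimestamp
}

# Example:
# diagnose_worker_pvc alx-ns
```

If oci-bv is missing from the first listing, the CSI driver is not installed or not enabled.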

Fix: Install or enable the OCI Block Volume CSI add-on for your OKE cluster. See OKE Block Volume CSI docs.

Symptom: Worker pods CrashLoopBackOff with "quota exceeds total disk space"

Likely cause: pagestore.size exceeds the node-local storage available to the worker pod. On the default OKE boot disk (~46 GiB, ~36 GiB usable) any pagestore.size above ~36 GiB fails.

Fix: Provision larger node-local storage per §3 — raise bootVolumeSizeInGBs on the node pool, attach a secondary OCI Block Volume, or move to a DenseIO shape — and match pagestore.size to that capacity (leave ~10% headroom in reservedSize). For evaluation only, cap pagestore.size at 30 GiB on the default boot disk.

Symptom: Only one Worker starts; the others stay Pending with Multi-Attach error for volume

Likely cause: pagestore.type: persistentVolumeClaim with an RWO StorageClass like oci-bv. The Worker replicas share a single volume, and OCI only allows one node to attach it.

Fix: Switch to the default hostPath page store on node-local storage, or to pagestore.type: emptyDir for evaluation (see §3). RWX-capable storage is not typically available on OKE for block devices.

Symptom: Built-in etcd pod stuck in ImagePullBackOff

Likely cause: The bundled etcd image is not reachable. The docker.io/bitnami/etcd repository has reportedly been delisted, which breaks the default etcd image path in the Alluxio chart.

How to diagnose:
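
One way to surface the pull error, sketched with kubectl describe. The etcd pod name is a placeholder; list the pods in the namespace first to find it.

```shell
# Show only the image-pull related lines from a pod's description.
# Arguments: <k8s-namespace> <etcd-pod-name>
etcd_pull_events() {
  kubectl describe pod "$2" --namespace "$1" | grep -iE 'pull|manifest|repository'
}

# Example (pod name is a placeholder):
# kubectl get pods --namespace alx-ns
# etcd_pull_events alx-ns <etcd-pod-name>
```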

Look for manifest unknown or repository does not exist.

Fix: Deploy an external etcd cluster and point Alluxio at it. Disable the built-in etcd and set the endpoint explicitly — see Kubernetes Installation — Appendix B.3: Use External ETCD.

Symptom: UFS writes to OCI Object Storage return HTTP 501 "AWS chunked encoding not supported"

Likely cause: OCI's S3-compatibility API does not accept AWS SDK v2 chunked transfer encoding, which Alluxio uses by default.

Fix: Set alluxio.underfs.s3.sdk.version=1 on the UFS mount. Full property set is documented in S3 Compatible Storages → Oracle Cloud Infrastructure (OCI) object storage.

Symptom: S3 gateway Service has EXTERNAL-IP = <pending> indefinitely

Likely cause: The OKE cluster was created without a service-lb-subnet, or the subnet is exhausted.

How to diagnose:
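
A sketch of the check. The Events section at the end of the describe output is where load balancer provisioning errors normally appear. The Service name is a placeholder.

```shell
# Describe the Service; provisioning errors show up under Events.
# Arguments: <k8s-namespace> <service-name>
describe_lb_service() {
  kubectl describe service "$2" --namespace "$1"
}

# Example (service name is a placeholder):
# describe_lb_service alx-ns <s3-gateway-service>
```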

Look for OCI LB provisioning errors referencing subnets or NSGs.

Fix: Ensure the cluster's service LB subnet has free IPs and allows ingress on the gateway port. To change service LB subnets, recreate the OKE cluster with --service-lb-subnet-ids.
