OCI OKE
This page covers OCI-specific differences when deploying Alluxio on an existing Oracle Cloud Infrastructure Kubernetes Engine (OKE) cluster. For the generic operator and cluster install steps, see Kubernetes Installation.
Overview
Deploying Alluxio on OKE differs from the generic Kubernetes install in three areas:
Image registry. Images push to OCI Container Registry (OCIR). OCIR uses an Auth Token for authentication — distinct from the API key used for every other OCI CLI call.
Worker storage. General-purpose OCI shapes (e.g., VM.Standard.E5.Flex) have only ~36 GiB free on the boot disk after the OS and container runtime, so the page store uses emptyDir (or the default hostPath) capped at ~30 GiB. DenseIO shapes with local NVMe can point hostPath at the NVMe mount for a larger, persistent cache. oci-bv PVC mode does not work for multi-worker deployments because the RWO volume would be shared across replicas.
Load balancer annotations. OCI service LBs are public by default; OCI-specific annotations on the Kubernetes Service switch them to internal.
OCI Object Storage as a UFS is covered in S3 Compatible Storages → Oracle Cloud Infrastructure (OCI) object storage and is not duplicated here. Alluxio EE does not support the oci:// scheme — mount OCI Object Storage via its S3-compatible endpoint only.
All examples below use us-phoenix-1 (OCIR alias phx.ocir.io). Adapt for your subscribed region.
Before You Start
These checks are in addition to the generic Kubernetes prerequisites. Provisioning the OKE cluster itself is out of scope; see the OKE documentation for cluster setup.
Decrypt a passphrase-protected API key. If oci setup config encrypts the private key, the CLI prompts interactively on every call and blocks automation:
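One way to remove the passphrase is to write an unencrypted copy of the key with openssl and point key_file in ~/.oci/config at that copy. The sketch below assumes an RSA key in PEM format; the function name and the example paths are illustrative and do not come from the OCI documentation.

```shell
# decrypt_oci_key ENCRYPTED_PEM OUTPUT_PEM
# Writes an unencrypted copy of a passphrase-protected RSA private key.
# openssl prompts once for the passphrase; the original file is left untouched.
decrypt_oci_key() {
  enc_key="$1"
  out_key="$2"
  openssl rsa -in "$enc_key" -out "$out_key" || return 1
  # Restrict permissions: the copy is no longer protected by a passphrase.
  chmod 600 "$out_key"
}

# Example (paths are assumptions; match them to key_file in ~/.oci/config):
#   decrypt_oci_key ~/.oci/oci_api_key.pem ~/.oci/oci_api_key_nopass.pem
```

After decrypting, update key_file in ~/.oci/config to the new path so non-interactive calls stop prompting.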
Environment Variables
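A possible set of shell variables for the steps that follow is sketched here. Only OCIR_AUTH_TOKEN is a name used elsewhere on this page; the other variable names are assumptions chosen for illustration, and the values in angle brackets are placeholders to replace.

```shell
# Region and OCIR alias used by the examples on this page.
export OCI_REGION="us-phoenix-1"
export OCIR_HOST="phx.ocir.io"

# Tenancy Object Storage namespace. One way to look it up:
#   oci os ns get --query data --raw-output
export OCIR_NAMESPACE="${OCIR_NAMESPACE:-<namespace>}"

# OCI user that owns the Auth Token.
export OCIR_USER="${OCIR_USER:-<user-email>}"

# Auth Token generated in the OCI Console (not the API key).
export OCIR_AUTH_TOKEN="${OCIR_AUTH_TOKEN:-}"

# Convenience prefix for image references (assumed layout: host/namespace).
export OCIR_REPO_PREFIX="${OCIR_HOST}/${OCIR_NAMESPACE}"
```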
Installation Steps
1. Push Alluxio Images to OCIR
OCIR does not accept API keys for image push; it requires an Auth Token. The API key in ~/.oci/config authenticates every other OCI CLI call but not OCIR.
Resolve the OCIR login string and push with skopeo (no local Docker daemon required):
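A minimal sketch of that flow, assuming the Alluxio images are available as local tarballs and the Auth Token is exported as OCIR_AUTH_TOKEN. The function names, tarball names, and repository names are hypothetical; the username formats are the ones described on this page.

```shell
# Build the OCIR login username.
#   ocir_username NAMESPACE USER_EMAIL [federated]
ocir_username() {
  if [ "${3:-}" = "federated" ]; then
    # Identity Domain (federated) tenancies need the extra path segment.
    printf '%s/oracleidentitycloudservice/%s\n' "$1" "$2"
  else
    printf '%s/%s\n' "$1" "$2"
  fi
}

# Log in to OCIR, passing the Auth Token on stdin.
#   ocir_login REGISTRY USERNAME
ocir_login() {
  printf '%s' "$OCIR_AUTH_TOKEN" |
    skopeo login --username "$2" --password-stdin "$1"
}

# Push a saved image tarball to OCIR without a Docker daemon.
#   ocir_push TARBALL REGISTRY/NAMESPACE/REPO:TAG
ocir_push() {
  skopeo copy "docker-archive:$1" "docker://$2"
}

# Example (namespace, tarball, and tag are placeholders):
#   user=$(ocir_username "<namespace>" "<user-email>")
#   ocir_login phx.ocir.io "$user"
#   ocir_push alluxio-operator.tar "phx.ocir.io/<namespace>/alluxio-operator:<tag>"
```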
If your OCI tenancy uses an Identity Domain (federated identity), the login username is <namespace>/oracleidentitycloudservice/<user-email>.
✅ Success: Both images appear in OCI Console → Developer Services → Container Registry.
2. Create the Image-Pull Secret
Every pod that pulls from OCIR needs a docker-registry secret. Create one per namespace that will pull the images:
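As a sketch, the secret can be created with kubectl create secret docker-registry in each namespace. The secret name ocir-secret and the namespaces in the example are assumptions; use the names your operator and cluster manifests reference.

```shell
# create_ocir_secret NAMESPACE SECRET_NAME REGISTRY USERNAME
# The Auth Token is read from OCIR_AUTH_TOKEN.
create_ocir_secret() {
  kubectl create secret docker-registry "$2" \
    --namespace "$1" \
    --docker-server="$3" \
    --docker-username="$4" \
    --docker-password="$OCIR_AUTH_TOKEN"
}

# Example: one secret per namespace that pulls from OCIR
# (namespace names here are assumptions):
#   for ns in alluxio-operator alx-ns; do
#     create_ocir_secret "$ns" ocir-secret phx.ocir.io "<namespace>/<user-email>"
#   done
```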
3. Deploy Alluxio with OCI-Specific Values
From here, follow the generic Kubernetes Installation guide, starting at Step 1 — Prepare Helm Chart. Section 1 above replaces the generic Step 0 (pushing images to a private registry).
Apply the OCI-specific values below when you create alluxio-operator.yaml and alluxio-cluster.yaml.
Operator values (alluxio-operator.yaml):
Cluster image and pull secret (alluxio-cluster.yaml):
Worker page store — size the node-local storage, then point hostPath at it.
Alluxio worker cache capacity is bounded by the node-local storage available to each worker pod. On OKE, choose one of the following storage strategies before setting pagestore.size:
DenseIO shapes (BM.DenseIO.E5, VM.DenseIO2, etc.) expose local NVMe at high bandwidth. This is the recommended shape for production ML workloads. Mount the NVMe at /mnt/alluxio/pagestore on the node and use the default hostPath page store.
General-purpose shapes with a provisioned boot volume (VM.Standard.E5.Flex with bootVolumeSizeInGBs raised at node-pool creation, e.g., 500–2000 GiB). The bigger boot volume gives hostPath room to hold a real working-set cache.
General-purpose shapes with an attached block volume per worker node: provision an extra OCI Block Volume on each node (Terraform, cloud-init, or a DaemonSet), format and mount it at /mnt/alluxio/pagestore, then use hostPath. Each node gets its own volume, so there is no RWO sharing issue.
Then size the page store to the provisioned capacity:
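Before picking numbers, it can help to check what is actually free at the mount point on a worker node. The sketch below reports free space in whole GiB and prints a pagestore fragment using the size and reservedSize keys named on this page. The nesting of reservedSize under pagestore and the placement of the fragment inside alluxio-cluster.yaml are assumptions to verify against the chart; the example numbers are placeholders.

```shell
# Print free space (whole GiB) on the filesystem that holds PATH.
#   avail_gib PATH
avail_gib() {
  # df -P gives one POSIX-format line per filesystem; column 4 is free KiB.
  df -Pk "$1" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

# Print a pagestore values fragment.
#   pagestore_fragment SIZE_GIB RESERVED_GIB
pagestore_fragment() {
  printf 'pagestore:\n  size: %sGi\n  reservedSize: %sGi\n' "$1" "$2"
}

# Example, run on a worker node (values are placeholders):
#   avail_gib /mnt/alluxio/pagestore
#   pagestore_fragment 450 50
```

Keeping the configured values below the reported free space avoids the "quota exceeds total disk space" failure described under Troubleshooting.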
Do not use pagestore.type: persistentVolumeClaim with oci-bv. oci-bv is a ReadWriteOnce StorageClass; the Alluxio chart provisions a single PVC shared across all worker replicas, so only one replica can mount it and the rest stay Pending. Use node-local storage (hostPath or emptyDir) instead.
Evaluation-only shortcut. The default OKE E5.Flex boot disk is ~46 GiB (~36 GiB usable). If you accept a small, ephemeral cache purely for evaluation, pagestore.type: emptyDir with size: 30Gi works without any extra storage provisioning. This is not suitable for production — every pod restart reloads the cache from UFS.
Internal load balancer for the S3 gateway. OCI service LBs default to public; keep service traffic inside the VCN:
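The annotation that requests an internal load balancer from the OCI cloud controller manager is service.beta.kubernetes.io/oci-load-balancer-internal. The helper below only prints the annotations block; which values key in the Alluxio chart carries Service annotations is not covered by this sketch and should be checked against the chart reference.

```shell
# Annotation key understood by the OCI cloud controller manager.
OCI_INTERNAL_LB_ANNOTATION='service.beta.kubernetes.io/oci-load-balancer-internal'

# Print an annotations block that marks a Service LB as internal.
oci_internal_lb_annotations() {
  printf 'annotations:\n  %s: "true"\n' "$OCI_INTERNAL_LB_ANNOTATION"
}

# Example:
#   oci_internal_lb_annotations
```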
After helm upgrade --install, OCI provisions the LB in ~60–90 seconds. Retrieve its internal IP with:
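A sketch of that lookup with a kubectl JSONPath query. The namespace and Service name are arguments because the actual Service name depends on your cluster name; the function name is hypothetical.

```shell
# Print the first ingress IP assigned to a LoadBalancer Service.
#   lb_ip NAMESPACE SERVICE
lb_ip() {
  kubectl get service "$2" --namespace "$1" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Example (names are placeholders):
#   lb_ip <namespace> <s3-gateway-service>
```

An empty result usually means OCI is still provisioning the load balancer.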
4. Access the Cluster
Service LBs provisioned in Step 3 are internal to the VCN. Two common access patterns:
Option A — kubectl port-forward from your admin workstation:
Option B — SSH tunnel through a bastion VM inside the VCN:
Option A requires no additional VM; Option B gives you reusable URLs for a team.
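Both options can be sketched as small wrappers around kubectl port-forward and ssh -L. Service names, ports, and the bastion address are passed as arguments because they are specific to your deployment; the function names are hypothetical.

```shell
# Option A: forward a local port to a Service through the Kubernetes API server.
#   forward_service NAMESPACE SERVICE LOCAL_PORT REMOTE_PORT
forward_service() {
  kubectl port-forward --namespace "$1" "service/$2" "$3:$4"
}

# Option B: tunnel a local port to the internal LB through a bastion VM.
#   tunnel_via_bastion USER@BASTION LB_IP LOCAL_PORT REMOTE_PORT
tunnel_via_bastion() {
  # -N: run no remote command; -L: forward LOCAL_PORT to LB_IP:REMOTE_PORT.
  ssh -N -L "$3:$2:$4" "$1"
}

# Examples (all values are placeholders):
#   forward_service <namespace> <service> 8080 <service-port>
#   tunnel_via_bastion <user>@<bastion-host> <lb-ip> 8080 <service-port>
```

Both commands stay in the foreground until interrupted.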
Uninstall
Follow the generic Kubernetes Installation → Uninstall procedure as-is. After it completes, remove the OCIR image-pull secrets:
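A sketch of that cleanup, assuming the same secret name was used in every namespace. The secret and namespace names in the example are assumptions.

```shell
# delete_ocir_secret SECRET_NAME NAMESPACE...
# Deletes the image-pull secret from each namespace listed.
delete_ocir_secret() {
  name="$1"
  shift
  for ns in "$@"; do
    # --ignore-not-found keeps the loop going if a secret is already gone.
    kubectl delete secret "$name" --namespace "$ns" --ignore-not-found
  done
}

# Example (names are assumptions):
#   delete_ocir_secret ocir-secret alluxio-operator alx-ns
```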
OCI infrastructure (VCN, OKE cluster, node pools) is managed separately — see the OKE documentation for teardown. The Alluxio images pushed to OCIR can stay — reuse them on the next deployment.
Troubleshooting
Symptom: OCIR skopeo login or push returns unauthorized
Likely cause: Authenticating with the API key password instead of an Auth Token.
Fix: Generate an Auth Token in OCI Console → User → Auth Tokens, export it as OCIR_AUTH_TOKEN, and rerun skopeo login. API keys authenticate to oci CLI but never to OCIR image endpoints.
Symptom: OCIR login succeeds, but skopeo copy returns denied: requested access to the resource is denied
Likely cause: The user account lacks manage repos on the target compartment, or the repository path is wrong.
How to diagnose:
Confirm the target compartment, then check the IAM policy for your group includes Allow group <g> to manage repos in compartment <c>.
Fix: Grant the policy, or push to a compartment where the user has permission. Repository names are case-sensitive in OCIR.
Symptom: OCIR username rejected
Likely cause: Federation formatting. Identity-Domain tenancies need an extra path segment.
Fix: Non-federated users: <namespace>/<user-email>. Federated (Identity Domain enabled): <namespace>/oracleidentitycloudservice/<user-email>.
Symptom: Worker pods Pending, PVCs stay Pending
Likely cause: No oci-bv StorageClass, or the OKE cluster is too old to include the OCI Block Volume CSI driver.
How to diagnose:
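One possible check, sketched with standard kubectl queries: confirm that the oci-bv StorageClass exists, then list the PVCs that are not Bound. The function names are hypothetical.

```shell
# Succeeds if the oci-bv StorageClass exists in the current cluster.
has_oci_bv() {
  kubectl get storageclass oci-bv >/dev/null 2>&1
}

# Print the names of PVCs in NAMESPACE whose STATUS is not Bound.
#   pending_pvcs NAMESPACE
pending_pvcs() {
  kubectl get pvc --namespace "$1" --no-headers |
    awk '$2 != "Bound" { print $1 }'
}

# Example (namespace is a placeholder):
#   has_oci_bv || echo "oci-bv StorageClass is missing"
#   pending_pvcs <namespace>
```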
Fix: Install or enable the OCI Block Volume CSI add-on for your OKE cluster. See OKE Block Volume CSI docs.
Symptom: Worker pods CrashLoopBackOff with "quota exceeds total disk space"
Likely cause: pagestore.size exceeds the node-local storage available to the worker pod. On the default OKE boot disk (~46 GiB, ~36 GiB usable) any pagestore.size above ~36 GiB fails.
Fix: Provision larger node-local storage per §3 — raise bootVolumeSizeInGBs on the node pool, attach a secondary OCI Block Volume, or move to a DenseIO shape — and match pagestore.size to that capacity (leave ~10% headroom in reservedSize). For evaluation only, cap pagestore.size at 30 GiB on the default boot disk.
Symptom: Only one Worker starts; the others stay Pending with Multi-Attach error for volume
Likely cause: pagestore.type: persistentVolumeClaim with an RWO StorageClass like oci-bv. The Worker replicas share a single volume, and OCI only allows one node to attach it.
Fix: Switch to node-local storage: the default hostPath page store, or pagestore.type: emptyDir for evaluation (see §3). RWX-capable storage is not typically available on OKE for block devices.
Symptom: Built-in etcd pod stuck in ImagePullBackOff
Likely cause: The bundled etcd image is not reachable. The docker.io/bitnami/etcd image has reportedly been delisted, which breaks the default Alluxio chart path.
How to diagnose:
Look for manifest unknown or repository does not exist.
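A small filter can pull those two messages out of the pod description. This is a sketch; the pod and namespace names in the example are placeholders.

```shell
# Keep only the lines on stdin that contain the image-pull errors named above.
image_pull_errors() {
  grep -E 'manifest unknown|repository does not exist'
}

# Example:
#   kubectl describe pod <etcd-pod> --namespace <namespace> | image_pull_errors
```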
Fix: Deploy an external etcd cluster and point Alluxio at it. Disable the built-in etcd and set the endpoint explicitly — see Kubernetes Installation — Appendix B.3: Use External ETCD.
Symptom: UFS writes to OCI Object Storage return HTTP 501 "AWS chunked encoding not supported"
Likely cause: OCI's S3-compatibility API does not accept AWS SDK v2 chunked transfer encoding, which Alluxio uses by default.
Fix: Set alluxio.underfs.s3.sdk.version=1 on the UFS mount. Full property set is documented in S3 Compatible Storages → Oracle Cloud Infrastructure (OCI) object storage.
Symptom: S3 gateway Service has EXTERNAL-IP = <pending> indefinitely
Likely cause: The OKE cluster was created without a service-lb-subnet, or the subnet is exhausted.
How to diagnose:
Look for OCI LB provisioning errors referencing subnets or NSGs.
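One way to surface those errors is to list the events recorded against the Service. The sketch below uses the event field selectors involvedObject.kind and involvedObject.name; the function name is hypothetical.

```shell
# List events recorded against a Service.
#   service_events NAMESPACE SERVICE
service_events() {
  kubectl get events --namespace "$1" \
    --field-selector "involvedObject.kind=Service,involvedObject.name=$2"
}

# Example (names are placeholders):
#   service_events <namespace> <s3-gateway-service>
```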
Fix: Ensure the cluster's service LB subnet has free IPs and allows ingress on the gateway port. To change service LB subnets, recreate the OKE cluster with --service-lb-subnet-ids.
Related Documentation
Kubernetes Installation — Generic operator and cluster install steps
S3 Compatible Storages — Mounting OCI Object Storage via its S3-compatible API
Prerequisites — Hardware, networking ports, resource sizing, and etcd requirements
Worker Configuration — Page-store sizing, JVM tuning, and storage layout