# OCI OKE

This page covers OCI-specific differences when deploying Alluxio on an existing Oracle Cloud Infrastructure Kubernetes Engine (OKE) cluster. For the generic operator and cluster install steps, see [Kubernetes Installation](/ee-ai-en/ai-3.8-15.1.x/start/installing-on-kubernetes.md).

## Overview

Deploying Alluxio on OKE differs from the generic Kubernetes install in three areas:

* **Image registry.** Images push to OCI Container Registry (OCIR). OCIR uses an **Auth Token** for authentication — distinct from the API key used for every other OCI CLI call.
* **Worker storage.** General-purpose OCI shapes (e.g., VM.Standard.E5.Flex) have only \~36 GiB free on the default boot disk after the OS and container runtime, so a node-local page store (`emptyDir` or the default `hostPath`) must be capped at \~30 GiB unless you provision more storage. DenseIO shapes with local NVMe can point `hostPath` at the NVMe mount for a larger, persistent cache. `oci-bv` PVC mode does not work for multi-worker deployments: `oci-bv` is `ReadWriteOnce`, and the chart's single PVC would be shared across worker replicas.
* **Load balancer annotations.** OCI service LBs are public by default; OCI-specific annotations on the Kubernetes `Service` switch them to internal.

OCI Object Storage as a UFS is covered in [S3 Compatible Storages → Oracle Cloud Infrastructure (OCI) object storage](/ee-ai-en/ai-3.8-15.1.x/ufs/s3-compatible.md#oracle-cloud-infrastructure-oci-object-storage) and is not duplicated here. Alluxio EE does not support the `oci://` scheme — mount OCI Object Storage via its S3-compatible endpoint only.

All examples below use `us-phoenix-1` (OCIR alias `phx.ocir.io`). Adapt for your subscribed region.

## Before You Start

These checks are **in addition to** the [generic Kubernetes prerequisites](/ee-ai-en/ai-3.8-15.1.x/start/installing-on-kubernetes.md#before-you-start). Provisioning the OKE cluster itself is out of scope; see the [OKE documentation](https://docs.oracle.com/en-us/iaas/Content/ContEng/home.htm) for cluster setup.

* [ ] **OKE cluster is running and reachable:**

  ```shell
  kubectl cluster-info
  kubectl get nodes
  ```
* [ ] **`oci-bv` StorageClass exists** (preinstalled on modern OKE clusters):

  ```shell
  kubectl get storageclass oci-bv
  ```
* [ ] **CLI tools installed:**

  ```shell
  brew install oci-cli kubectl helm skopeo jq          # macOS; adapt for Linux
  ```
* [ ] **OCI API key configured:**

  ```shell
  oci setup config
  oci iam user get --user-id "$OCI_USER_OCID"          # must succeed
  ```
* [ ] **OCIR Auth Token obtained** (OCI Console → User → Auth Tokens — this is a separate credential from the API key)
* [ ] **Alluxio artifacts downloaded:** operator image `.tar`, Alluxio image `.tar`, Helm chart `.tgz`, license

{% hint style="warning" %}
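**Decrypt a passphrase-protected API key.** If you set a passphrase during `oci setup config`, the private key is stored encrypted, and the CLI prompts for the passphrase on every call, which blocks automation. Replace it with a decrypted copy:

```shell
# Prompts once for the passphrase and writes an unencrypted copy.
openssl rsa -in ~/.oci/oci_api_key.pem -out ~/.oci/oci_api_key.pem.dec
# Restrict permissions before swapping the key into place.
chmod 600 ~/.oci/oci_api_key.pem.dec
mv ~/.oci/oci_api_key.pem.dec ~/.oci/oci_api_key.pem
```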

{% endhint %}
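Before decrypting, it can help to confirm that the key is actually passphrase-protected. The following is a minimal sketch, assuming a PEM-encoded key; `key_is_encrypted` is a hypothetical helper name, not part of the OCI CLI. Encrypted PEM keys carry the word `ENCRYPTED` either in a `BEGIN ENCRYPTED PRIVATE KEY` header or in a `Proc-Type: 4,ENCRYPTED` line.

```shell
# Hypothetical helper: exits 0 when the PEM file is passphrase-protected.
key_is_encrypted() {
  grep -q 'ENCRYPTED' "$1"
}

# Example (not run automatically):
#   key_is_encrypted ~/.oci/oci_api_key.pem && echo "decrypt this key before automating"
```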

## Environment Variables

```shell
export OCI_REGION=us-phoenix-1
export OCIR_HOST=phx.ocir.io                          # region alias; see OCIR docs
export OCI_TENANCY_OCID=ocid1.tenancy.oc1..xxxx
export OCI_USER_OCID=ocid1.user.oc1..xxxx
export OCI_OCIR_NAMESPACE=$(oci os ns get --region "$OCI_REGION" --query 'data' --raw-output)

export ALLUXIO_IMAGE_REPO="${OCIR_HOST}/${OCI_OCIR_NAMESPACE}/alluxio-ee"
export ALLUXIO_IMAGE_TAG=AI-3.8-15.1.2

export OCIR_AUTH_TOKEN='<auth-token-from-console>'
```
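A quick sanity check on these variables can catch typos before any image push. The sketch below is illustrative only: `ocir_host_for_region` and `require_vars` are hypothetical helper names, and the region table lists just a few regions, so extend it for yours.

```shell
# Hypothetical helper: map a region identifier to its OCIR alias.
# Only a few regions are listed; add a case branch for your own.
ocir_host_for_region() {
  case "$1" in
    us-phoenix-1)   echo "phx.ocir.io" ;;
    us-ashburn-1)   echo "iad.ocir.io" ;;
    eu-frankfurt-1) echo "fra.ocir.io" ;;
    uk-london-1)    echo "lhr.ocir.io" ;;
    *) echo "unknown region: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: fail if any named variable is unset or empty.
require_vars() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (not run automatically):
#   require_vars OCI_REGION OCIR_HOST OCI_OCIR_NAMESPACE OCIR_AUTH_TOKEN
#   [ "$(ocir_host_for_region "$OCI_REGION")" = "$OCIR_HOST" ] || echo "OCIR_HOST does not match OCI_REGION"
```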

## Installation Steps

### 1. Push Alluxio Images to OCIR

{% hint style="warning" %}
**OCIR does not accept API keys for push.** Authentication requires an Auth Token. The API key in `~/.oci/config` is used for every other OCI CLI call but not for OCIR image push.
{% endhint %}

Resolve the OCIR login string and push with `skopeo` (no local Docker daemon required):

```shell
OCIR_USER=$(oci iam user get --user-id "$OCI_USER_OCID" --query 'data.name' --raw-output)
OCIR_LOGIN="${OCI_OCIR_NAMESPACE}/${OCIR_USER}"

echo "$OCIR_AUTH_TOKEN" | skopeo login "$OCIR_HOST" -u "$OCIR_LOGIN" --password-stdin

# Alluxio image
skopeo copy \
  --dest-tls-verify=true \
  "docker-archive:alluxio-enterprise-AI-3.8-15.1.2-linux-amd64-docker.tar" \
  "docker://${ALLUXIO_IMAGE_REPO}:${ALLUXIO_IMAGE_TAG}"

# Operator image
skopeo copy \
  --dest-tls-verify=true \
  "docker-archive:alluxio-operator-3.5.2-linux-amd64-docker.tar" \
  "docker://${OCIR_HOST}/${OCI_OCIR_NAMESPACE}/alluxio-operator:3.5.2"
```

{% hint style="info" %}
If your OCI tenancy uses an Identity Domain (federated identity), the login username is `<namespace>/oracleidentitycloudservice/<user-email>`.
{% endhint %}
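The two username formats can be captured in a small helper so that scripts build the login string consistently. This is a sketch only; `ocir_login_for` is a hypothetical name, and the `federated` flag is an illustrative convention rather than anything OCI defines.

```shell
# Hypothetical helper: build the OCIR login string.
#   $1 = tenancy namespace, $2 = user name, $3 = "federated" (optional)
ocir_login_for() {
  if [ "${3:-}" = "federated" ]; then
    echo "$1/oracleidentitycloudservice/$2"
  else
    echo "$1/$2"
  fi
}

# Example (not run automatically):
#   OCIR_LOGIN=$(ocir_login_for "$OCI_OCIR_NAMESPACE" "$OCIR_USER" federated)
```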

**✅ Success:** Both images appear in OCI Console → Developer Services → Container Registry.
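The push can also be checked from the command line. A minimal sketch, assuming the earlier `skopeo login` succeeded and the session is still valid; `image_pushed` is a hypothetical helper that relies on `skopeo inspect` returning a non-zero exit code when the registry has no manifest for the reference.

```shell
# Hypothetical helper: exits 0 when the registry returns a manifest for the reference.
image_pushed() {
  skopeo inspect "docker://$1" >/dev/null 2>&1
}

# Example (not run automatically):
#   image_pushed "${ALLUXIO_IMAGE_REPO}:${ALLUXIO_IMAGE_TAG}" && echo "Alluxio image OK"
#   image_pushed "${OCIR_HOST}/${OCI_OCIR_NAMESPACE}/alluxio-operator:3.5.2" && echo "Operator image OK"
```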

### 2. Create the Image-Pull Secret

Every pod that pulls from OCIR needs a `docker-registry` secret. Create one per namespace that will pull the images:

```shell
kubectl create namespace alx-ns
kubectl -n alx-ns create secret docker-registry ocir-pull \
  --docker-server="$OCIR_HOST" \
  --docker-username="$OCIR_LOGIN" \
  --docker-password="$OCIR_AUTH_TOKEN"

kubectl create namespace alluxio-operator
kubectl -n alluxio-operator create secret docker-registry ocir-pull \
  --docker-server="$OCIR_HOST" \
  --docker-username="$OCIR_LOGIN" \
  --docker-password="$OCIR_AUTH_TOKEN"
```
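`kubectl create` fails if the namespace or secret already exists, which gets in the way when you rotate the Auth Token or rerun a setup script. One possible approach, sketched here with a hypothetical `ensure_pull_secret` helper, is to render the objects with `--dry-run=client -o yaml` and pipe them to `kubectl apply`, which creates or updates as needed. It assumes `OCIR_HOST`, `OCIR_LOGIN`, and `OCIR_AUTH_TOKEN` are set as above.

```shell
# Hypothetical helper: create or update the ocir-pull secret in one namespace.
# Rendering with --dry-run=client and piping to `kubectl apply` makes reruns safe.
ensure_pull_secret() {
  kubectl create namespace "$1" --dry-run=client -o yaml | kubectl apply -f -
  kubectl -n "$1" create secret docker-registry ocir-pull \
    --docker-server="$OCIR_HOST" \
    --docker-username="$OCIR_LOGIN" \
    --docker-password="$OCIR_AUTH_TOKEN" \
    --dry-run=client -o yaml | kubectl -n "$1" apply -f -
}

# Example (not run automatically):
#   for ns in alx-ns alluxio-operator; do ensure_pull_secret "$ns"; done
```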

### 3. Deploy Alluxio with OCI-Specific Values

From here, follow the generic [Kubernetes Installation](/ee-ai-en/ai-3.8-15.1.x/start/installing-on-kubernetes.md) guide, **starting at Step 1 — Prepare Helm Chart**. Section 1 above replaces the generic Step 0 (pushing images to a private registry).

Apply the OCI-specific values below when you create `alluxio-operator.yaml` and `alluxio-cluster.yaml`.

**Operator values (`alluxio-operator.yaml`):**

```yaml
global:
  image: phx.ocir.io/<OCIR_NAMESPACE>/alluxio-operator
  imageTag: 3.5.2
  imagePullSecrets:
    - ocir-pull
```

**Cluster image and pull secret (`alluxio-cluster.yaml`):**

```yaml
spec:
  image: phx.ocir.io/<OCIR_NAMESPACE>/alluxio-ee
  imageTag: AI-3.8-15.1.2
  imagePullSecrets:
    - ocir-pull
```

**Worker page store — size the node-local storage, then point `hostPath` at it.**

Alluxio worker cache capacity is bounded by the node-local storage available to each worker pod. On OKE, choose **one** of the following storage options before setting `pagestore.size`:

* **DenseIO shapes** (BM.DenseIO.E5, VM.DenseIO2, etc.) expose local NVMe at high bandwidth. This is the recommended shape for production ML workloads. Mount the NVMe at `/mnt/alluxio/pagestore` on the node and use the default `hostPath` page store.
* **General-purpose shapes with a provisioned boot volume** (VM.Standard.E5.Flex with `bootVolumeSizeInGBs` raised at node-pool creation, e.g., 500–2000 GiB). The bigger boot volume gives `hostPath` room to hold a real working-set cache.
* **General-purpose shapes with an attached block volume** per worker node — provision an extra OCI Block Volume on each node (Terraform, cloud-init, or a DaemonSet), format and mount it at `/mnt/alluxio/pagestore`, then use `hostPath`. Each node gets its own volume, so there is no RWO sharing issue.

Then size the page store to the provisioned capacity:

```yaml
spec:
  worker:
    count: 3
    pagestore:
      # hostPath is the default; maps to /mnt/alluxio/pagestore on the node.
      # Set size to match the storage you provisioned above.
      size: 500Gi
      reservedSize: 50Gi
```
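A rough capacity check can be done before applying these values. The sketch below assumes that `size` plus `reservedSize` must fit within the usable node-local capacity; that relationship is an assumption here, so confirm it against the Worker Configuration page. `pagestore_fits` is a hypothetical helper, and all arguments are whole GiB.

```shell
# Hypothetical helper: exits 0 when the page store fits on the node.
#   $1 = usable node-local capacity, $2 = pagestore.size, $3 = pagestore.reservedSize
# Assumption: size + reservedSize must not exceed the usable capacity.
pagestore_fits() {
  [ $(( $2 + $3 )) -le "$1" ]
}

# On a node with GNU df, usable capacity can be read with:
#   df -BG --output=avail /mnt/alluxio/pagestore
# Example (not run automatically):
#   pagestore_fits 600 500 50 && echo "500Gi + 50Gi fits on a 600 GiB volume"
```

Under that assumption, the `500Gi` and `50Gi` values above need at least 550 GiB usable on each worker node.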

{% hint style="warning" %}
**Do not use `pagestore.type: persistentVolumeClaim` with `oci-bv`.** `oci-bv` is a `ReadWriteOnce` StorageClass; the Alluxio chart provisions a single PVC shared across all worker replicas, so only one replica can mount it and the rest stay `Pending`. Use node-local storage (`hostPath` or `emptyDir`) instead.
{% endhint %}

{% hint style="info" %}
**Evaluation-only shortcut.** The default OKE E5.Flex boot disk is \~46 GiB (\~36 GiB usable). If you accept a small, ephemeral cache purely for evaluation, `pagestore.type: emptyDir` with `size: 30Gi` works without any extra storage provisioning. This is not suitable for production: the cache is lost on every pod restart and must be reloaded from the UFS.
{% endhint %}

**Internal load balancer for the S3 gateway.** OCI service LBs default to public; keep service traffic inside the VCN:

```yaml
spec:
  gateway:
    service:
      type: LoadBalancer
      annotations:
        oci.oraclecloud.com/load-balancer-type: "lb"
        service.beta.kubernetes.io/oci-load-balancer-internal: "true"
```

After `helm upgrade --install`, OCI provisions the LB in \~60–90 seconds. Retrieve its internal IP with:

```shell
kubectl -n alx-ns get svc alluxio-cluster-s3gateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```
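Because provisioning takes a minute or more, the command above returns an empty string at first. A minimal polling sketch follows; `wait_for_lb_ip` is a hypothetical helper, and its defaults (30 attempts, 5 seconds apart) are illustrative values chosen to cover the provisioning window with some margin.

```shell
# Hypothetical helper: poll until the Service reports a load-balancer IP.
#   $1 = namespace, $2 = service name,
#   $3 = max attempts (default 30), $4 = delay in seconds (default 5)
wait_for_lb_ip() {
  attempts="${3:-30}"
  delay="${4:-5}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    ip=$(kubectl -n "$1" get svc "$2" \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null || true)
    if [ -n "$ip" ]; then
      echo "$ip"
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  echo "timed out waiting for $2" >&2
  return 1
}

# Example (not run automatically):
#   S3GW_IP=$(wait_for_lb_ip alx-ns alluxio-cluster-s3gateway)
```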

### 4. Access the Cluster

Service LBs provisioned in Step 3 are internal to the VCN. Two common access patterns:

**Option A — `kubectl port-forward` from your admin workstation:**

```shell
kubectl -n alx-ns port-forward svc/alluxio-cluster-s3gateway 39999:39999
aws s3 --endpoint-url http://localhost:39999 ls s3://<bucket>/

kubectl -n alx-ns port-forward svc/alluxio-cluster-coordinator 19999:19999
# open http://localhost:19999 for the Alluxio Web UI
```

**Option B — SSH tunnel through a bastion VM inside the VCN:**

```shell
ssh -f -N -L 39999:<s3gw-internal-lb-ip>:39999 ubuntu@<bastion-ip>
ssh -f -N -L 3000:<grafana-internal-lb-ip>:80 ubuntu@<bastion-ip>
```

Option A requires no additional VM; Option B gives you reusable URLs for a team.

## Uninstall

Follow the generic [Kubernetes Installation → Uninstall](/ee-ai-en/ai-3.8-15.1.x/start/installing-on-kubernetes.md#uninstall) procedure as-is. After it completes, remove the OCIR image-pull secrets:

```shell
kubectl -n alx-ns delete secret ocir-pull
kubectl -n alluxio-operator delete secret ocir-pull
```

OCI infrastructure (VCN, OKE cluster, node pools) is managed separately — see the [OKE documentation](https://docs.oracle.com/en-us/iaas/Content/ContEng/home.htm) for teardown. The Alluxio images pushed to OCIR can stay — reuse them on the next deployment.

## Troubleshooting

#### Symptom: OCIR `skopeo login` or push returns `unauthorized`

**Likely cause:** Authenticating with the console password or the API key instead of an Auth Token.

**Fix:** Generate an Auth Token in OCI Console → User → Auth Tokens, export it as `OCIR_AUTH_TOKEN`, and rerun `skopeo login`. API keys authenticate `oci` CLI calls but never OCIR image endpoints.

#### Symptom: OCIR login succeeds, but `skopeo copy` returns `denied: requested access to the resource is denied`

**Likely cause:** The user account lacks `manage repos` on the target compartment, or the repository path is wrong.

**How to diagnose:**

```shell
oci iam compartment list --compartment-id "$OCI_TENANCY_OCID" --query 'data[].name'
```

Confirm the target compartment, then check the IAM policy for your group includes `Allow group <g> to manage repos in compartment <c>`.

**Fix:** Grant the policy, or push to a compartment where the user has permission. Repository names are case-sensitive in OCIR.

#### Symptom: OCIR username rejected

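**Likely cause:** The username format does not match the tenancy's identity setup. Identity-Domain tenancies need an extra path segment.

**Fix:** Non-federated users: `<namespace>/<username>`, where `<username>` is the `data.name` value returned by `oci iam user get` (often an email address). Federated (Identity Domain enabled): `<namespace>/oracleidentitycloudservice/<user-email>`.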

#### Symptom: Worker pods `Pending`, PVCs stay `Pending`

**Likely cause:** No `oci-bv` StorageClass, or the OKE cluster is too old to include the OCI Block Volume CSI driver.

**How to diagnose:**

```shell
kubectl get storageclass
kubectl get pods -n kube-system -l app.kubernetes.io/name=oci-csi-node
```

**Fix:** Install or enable the OCI Block Volume CSI add-on for your OKE cluster. See [OKE Block Volume CSI docs](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengcreatingpvcondifferentstorageoptions.htm).

#### Symptom: Worker pods CrashLoopBackOff with "quota exceeds total disk space"

**Likely cause:** `pagestore.size` exceeds the node-local storage available to the worker pod. On the default OKE boot disk (\~46 GiB, \~36 GiB usable) any `pagestore.size` above \~36 GiB fails.

**Fix:** Provision larger node-local storage per §3 — raise `bootVolumeSizeInGBs` on the node pool, attach a secondary OCI Block Volume, or move to a DenseIO shape — and match `pagestore.size` to that capacity (leave \~10% headroom in `reservedSize`). For evaluation only, cap `pagestore.size` at 30 GiB on the default boot disk.

#### Symptom: Only one Worker starts; the others stay `Pending` with `Multi-Attach error for volume`

**Likely cause:** `pagestore.type: persistentVolumeClaim` with an RWO StorageClass like `oci-bv`. The Worker replicas share a single volume, and OCI only allows one node to attach it.

**Fix:** Switch to node-local storage: the default `hostPath` or `pagestore.type: emptyDir` (see §3). RWX-capable storage is not typically available on OKE for block devices.

#### Symptom: Built-in etcd pod stuck in `ImagePullBackOff`

**Likely cause:** The bundled etcd image is not reachable. The `docker.io/bitnami/etcd` repository has reportedly been delisted, which breaks the chart's default etcd image reference.

**How to diagnose:**

```shell
kubectl -n alx-ns describe pod alluxio-cluster-etcd-0 | grep -A3 Events
```

Look for `manifest unknown` or `repository does not exist`.

**Fix:** Deploy an external etcd cluster and point Alluxio at it. Disable the built-in etcd and set the endpoint explicitly — see [Kubernetes Installation — Appendix B.3: Use External ETCD](https://documentation.alluxio.io/ee-ai-en/ai-3.8-15.1.x/start/installing-on-kubernetes/pages/XAZ4TaGn1TN9XC9Splud#b.3.-use-external-etcd).

#### Symptom: UFS writes to OCI Object Storage return HTTP 501 "AWS chunked encoding not supported"

**Likely cause:** OCI's S3-compatibility API does not accept AWS SDK v2 chunked transfer encoding, which Alluxio uses by default.

**Fix:** Set `alluxio.underfs.s3.sdk.version=1` on the UFS mount. Full property set is documented in [S3 Compatible Storages → Oracle Cloud Infrastructure (OCI) object storage](/ee-ai-en/ai-3.8-15.1.x/ufs/s3-compatible.md#oracle-cloud-infrastructure-oci-object-storage).

#### Symptom: S3 gateway `Service` has `EXTERNAL-IP = <pending>` indefinitely

**Likely cause:** The OKE cluster was created without a `service-lb-subnet`, or the subnet is exhausted.

**How to diagnose:**

```shell
kubectl -n alx-ns describe svc alluxio-cluster-s3gateway | grep -A2 Events
```

Look for OCI LB provisioning errors referencing subnets or NSGs.

**Fix:** Ensure the cluster's service LB subnet has free IPs and allows ingress on the gateway port. To change service LB subnets, recreate the OKE cluster with `--service-lb-subnet-ids`.

## Related Documentation

* [Kubernetes Installation](/ee-ai-en/ai-3.8-15.1.x/start/installing-on-kubernetes.md) — Generic operator and cluster install steps
* [S3 Compatible Storages](/ee-ai-en/ai-3.8-15.1.x/ufs/s3-compatible.md#oracle-cloud-infrastructure-oci-object-storage) — Mounting OCI Object Storage via its S3-compatible API
* [Prerequisites](/ee-ai-en/ai-3.8-15.1.x/start/prerequisites.md) — Hardware, networking ports, resource sizing, and etcd requirements
* [Worker Configuration](/ee-ai-en/ai-3.8-15.1.x/administration/managing-worker.md) — Page-store sizing, JVM tuning, and storage layout

