April 5, 2022

K8s in Oracle Cloud Always Free tier (with Terraform)

Upd. March 2022: I've been banned from Oracle Cloud because of my country of origin. All attempts to restore access were rejected without explanation. I still keep this article as a nice exercise, but now I have to warn readers about the possible consequences of using Oracle Cloud.

Upd. 2023: the free .ga domain zone was taken over by the Gabonese government. It seems all access to previously registered domains is lost.

Upd. 2024: the free registrar Freenom stopped operations. All of its top-level domains ceased to exist.


Oracle Cloud offers really good terms in its Always Free tier. As of January 2022, it includes 4 OCPUs and 24 GB of memory for ARM-based VMs.

There are 2 options: we can use free resources while staying on the Always Free tier, or we can upgrade to the Pay-as-You-Go subscription. There are some differences in the available resource limits, which lead to differences in cluster architecture. This post describes an approach for provisioning a cluster in the Always Free tier.

N.B.: one won't get charged in the Always Free tier, even after the trial is over. One may get charged after upgrading to the Pay-as-You-Go subscription.

Oracle Cloud has a managed K8s cluster offering, but unfortunately it is not available for Always Free tenancies (the limit is set to 0; all limits in this article are valid as of January 2022).

That means that for the Always Free tier we need to stick to a completely manual process of compute resource provisioning and K8s cluster deployment. So this post presents a way of provisioning a manually managed K8s cluster on Oracle Cloud ARM VMs. It describes architecture considerations, as well as workarounds for issues that appeared along the way.

TL;DR: reproducible Terraform scripts and steps to get started can be found here.

Compute resource provisioning

We will be consuming all available compute resources in our cluster. We need a designated node for the K8s control plane (the leader node) and multiple worker nodes. Each node will have 1 OCPU, which means we can provision 1 leader node and 3 worker nodes. In our cluster the leader node will have 3 GB of RAM, and each worker node will have 7 GB of RAM (for a total of 24 GB of memory used).

N.B.: K8s issues a warning when a node has less than 2 CPUs. Our cluster is not a production one, and the goal is to maximize the number of nodes in the cluster. That is why we silence this check during K8s deployment.
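For illustration, here is roughly what a worker node could look like in Terraform with the oracle/oci provider. The resource and variable names are mine, not necessarily the ones used in the linked repo:

    # A hypothetical worker node: 1 OCPU and 7 GB of RAM on the flexible ARM shape.
    # The leader node would be the same resource with memory_in_gbs = 3.
    resource "oci_core_instance" "worker" {
      count               = 3
      compartment_id      = var.compartment_ocid
      availability_domain = var.availability_domain
      display_name        = "k8s-worker-${count.index}"
      shape               = "VM.Standard.A1.Flex"

      shape_config {
        ocpus         = 1
        memory_in_gbs = 7
      }

      source_details {
        source_type = "image"
        source_id   = var.ubuntu_arm_image_ocid # OCID of an aarch64 Ubuntu image
      }

      create_vnic_details {
        subnet_id        = oci_core_subnet.public.id # the public subnet from the network section below
        assign_public_ip = true
      }

      metadata = {
        ssh_authorized_keys = file(var.ssh_public_key_path)
      }
    }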

The most frequent issue that Always Free tenancies face when provisioning VMs is the Out of host capacity error.

The thing is that free compute capacity is limited, and this error means that it has run out. Oracle says it is constantly adding capacity to its data centers, so we should just try again in a couple of days.

Sometimes it helps to switch to another availability domain, if the region has more than one. Always Free tenancies can provision resources in their home region only, which is why it should be selected carefully. A list of regions with their availability domains can be found here.

At some point I also noticed that if there are two accounts in the same region, one with a Pay-as-You-Go subscription and one without, the first one gets some priority in provisioning: it was able to provision ARM VMs while the other one could not (it took another week for capacity to become available to the second account).

Network architecture

We need to provision a Virtual Cloud Network (VCN) to allow instances to connect to the internet, as well as to be accessible from it. VCNs have subnets, which can be public or private. To open incoming and outgoing connectivity for resources in a public subnet, a) the VCN must have an Internet gateway, and b) each resource must have a public IP assigned. To open outgoing connectivity for resources in a private subnet, the VCN must have a NAT gateway; incoming connectivity from the internet is not available there.

So the desired network architecture would be as follows: a VCN with 2 subnets, public and private, with compute resources assigned to the private subnet. The trouble with this approach is that it works for Pay-as-You-Go tenancies only: as of January 2022, Oracle does not allow provisioning of NAT gateways in the Always Free tier, which leaves nodes in private subnets with no outgoing connectivity.

To overcome this limitation, our VCN will have just a single public subnet, and in order to open outgoing connectivity, each compute resource will be assigned a public IP. Luckily, Oracle does not limit the number of ephemeral public IPs.
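Sketched in Terraform (again with illustrative names), this network layout boils down to a VCN, an internet gateway, a route table and a single public subnet:

    resource "oci_core_vcn" "k8s" {
      compartment_id = var.compartment_ocid
      cidr_block     = "10.0.0.0/16"
      display_name   = "k8s-vcn"
    }

    resource "oci_core_internet_gateway" "igw" {
      compartment_id = var.compartment_ocid
      vcn_id         = oci_core_vcn.k8s.id
    }

    resource "oci_core_route_table" "public" {
      compartment_id = var.compartment_ocid
      vcn_id         = oci_core_vcn.k8s.id

      # send all outgoing traffic through the internet gateway
      route_rules {
        destination       = "0.0.0.0/0"
        destination_type  = "CIDR_BLOCK"
        network_entity_id = oci_core_internet_gateway.igw.id
      }
    }

    # A single public subnet; instances get ephemeral public IPs on their VNICs.
    resource "oci_core_subnet" "public" {
      compartment_id             = var.compartment_ocid
      vcn_id                     = oci_core_vcn.k8s.id
      cidr_block                 = "10.0.0.0/24"
      route_table_id             = oci_core_route_table.public.id
      prohibit_public_ip_on_vnic = false
    }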

Now each node is independently accessible from the internet (e.g. we can SSH to all of them). But we want a single efficient entry point for the apps deployed to the cluster (as pods). We will achieve this by using a load balancer.

Load balancing

Oracle Cloud provides 2 types of load balancers. The first one works at OSI level 7, which basically makes it a reverse proxy; e.g., it can handle SSL termination. But when creating a load balancer of this type, we need to select its shape. Load balancer shapes specify the available bandwidth, and Always Free tenancies are eligible for a single 10 Mbps load balancer.

The other load balancer type is called the Network Load Balancer (NLB). It works at OSI levels 3 and 4, and it can balance requests by IP-port pairs only. But this type does not have any bandwidth limit, which is why we'll use it in our cluster. We will put the NLB into the public subnet and assign it a reserved public IP, so it becomes reachable from the internet.

To enable load balancing, we also need to specify the following:

  • Listeners, which represent the ports that are available from the internet.
  • Backend sets, which represent the sets of resources that requests are balanced across.
  • For each backend set, appropriate backends, which are the links to the target compute resources.

In our case we'll make the NLB listen on the following TCP ports (a Terraform sketch follows the list):

  • 80 — for HTTP traffic forwarded to worker nodes.
  • 443 — for HTTPS traffic to worker nodes.
  • 6443 — kubectl traffic to the leader node (for remote K8s management and apps deployment).
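As a rough Terraform sketch, the HTTP listener/backend-set pair could be described like this (443 and 6443 are wired up the same way; 30080 is the ingress controller NodePort discussed in the K8s deployment section, and the reserved public IP is assumed to be created elsewhere):

    resource "oci_network_load_balancer_network_load_balancer" "nlb" {
      compartment_id = var.compartment_ocid
      display_name   = "k8s-nlb"
      subnet_id      = oci_core_subnet.public.id
      is_private     = false

      reserved_ips {
        id = oci_core_public_ip.nlb.id # a reserved public IP defined elsewhere
      }
    }

    resource "oci_network_load_balancer_backend_set" "http" {
      network_load_balancer_id = oci_network_load_balancer_network_load_balancer.nlb.id
      name                     = "http"
      policy                   = "FIVE_TUPLE"

      health_checker {
        protocol = "TCP"
        port     = 30080
      }
    }

    resource "oci_network_load_balancer_listener" "http" {
      network_load_balancer_id = oci_network_load_balancer_network_load_balancer.nlb.id
      name                     = "http"
      default_backend_set_name = oci_network_load_balancer_backend_set.http.name
      protocol                 = "TCP"
      port                     = 80
    }

    # One backend per worker node.
    resource "oci_network_load_balancer_backend" "http" {
      count                    = 3
      network_load_balancer_id = oci_network_load_balancer_network_load_balancer.nlb.id
      backend_set_name         = oci_network_load_balancer_backend_set.http.name
      target_id                = oci_core_instance.worker[count.index].id
      port                     = 30080
    }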

We also configure appropriate VCN ingress rules to allow this traffic to reach the corresponding nodes.
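One plausible shape for these rules is a security list attached to the public subnet (via its security_list_ids); the ports to open toward the nodes are SSH, the K8s API server and the ingress controller NodePorts that the NLB forwards to:

    resource "oci_core_security_list" "public" {
      compartment_id = var.compartment_ocid
      vcn_id         = oci_core_vcn.k8s.id

      # allow all outgoing traffic
      egress_security_rules {
        protocol    = "all"
        destination = "0.0.0.0/0"
      }

      # SSH, K8s API server, and the ingress controller NodePorts
      dynamic "ingress_security_rules" {
        for_each = [22, 6443, 30080, 30443]
        content {
          protocol = "6" # TCP
          source   = "0.0.0.0/0"
          tcp_options {
            min = ingress_security_rules.value
            max = ingress_security_rules.value
          }
        }
      }
    }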

K8s deployment

We use kubeadm to perform a completely unattended installation of the K8s components. First we need to deploy the control plane. To support automatic joining of worker nodes, a) each node gets a private in-cluster DNS name, and b) we generate a discovery token (kubeadm token generate), which is copied from the leader node to all worker nodes. After that we invoke kubeadm init. Once the control plane is up, we can set up worker nodes with kubeadm join. We also need to allow TCP port 10250 for intra-cluster communication, because that is the management port of kubelet (the K8s agent running on each node).
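The exact automation lives in the linked repo; purely as a sketch of the idea, this bootstrap can be driven from Terraform with remote-exec provisioners (the instance names, SSH user and variables below are my assumptions):

    # Illustrative only: initialize the control plane on the leader node over SSH.
    resource "null_resource" "leader_init" {
      connection {
        type        = "ssh"
        host        = oci_core_instance.leader.public_ip
        user        = "ubuntu"
        private_key = file(var.ssh_private_key_path)
      }

      provisioner "remote-exec" {
        inline = [
          # var.k8s_token is pre-generated with `kubeadm token generate`;
          # NumCPU silences the preflight check for nodes with a single CPU;
          # the extra SAN lets kubectl reach the API server through the NLB public IP.
          "sudo kubeadm init --token ${var.k8s_token} --ignore-preflight-errors=NumCPU --pod-network-cidr=10.244.0.0/16 --apiserver-cert-extra-sans=${var.cluster_public_ip}",
        ]
      }
    }

    resource "null_resource" "worker_join" {
      count      = 3
      depends_on = [null_resource.leader_init]

      connection {
        type        = "ssh"
        host        = oci_core_instance.worker[count.index].public_ip
        user        = "ubuntu"
        private_key = file(var.ssh_private_key_path)
      }

      provisioner "remote-exec" {
        inline = [
          "sudo kubeadm join ${oci_core_instance.leader.private_ip}:6443 --token ${var.k8s_token} --discovery-token-unsafe-skip-ca-verification",
        ]
      }
    }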

K8s requires an overlay network plugin for pod-to-pod communication, and we'll use Flannel for it. It uses a designated port on each node in the cluster (UDP 8472), which is why we need to open this port in the VCN rules as well.
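Continuing the same illustrative setup, the intra-VCN rules for the kubelet port and Flannel's VXLAN port could be expressed like this (in practice they could just as well live in the same security list as the public rules above):

    # Node-to-node traffic required inside the cluster.
    resource "oci_core_security_list" "intra_cluster" {
      compartment_id = var.compartment_ocid
      vcn_id         = oci_core_vcn.k8s.id

      # kubelet management port
      ingress_security_rules {
        protocol = "6" # TCP
        source   = "10.0.0.0/16"
        tcp_options {
          min = 10250
          max = 10250
        }
      }

      # Flannel VXLAN overlay
      ingress_security_rules {
        protocol = "17" # UDP
        source   = "10.0.0.0/16"
        udp_options {
          min = 8472
          max = 8472
        }
      }
    }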

We will also deploy some useful infrastructure into the cluster. First, we need an ingress controller (we'll use the nginx-based one), which will be used for exposing web apps using a route-based approach. We'll expose it via a NodePort Service listening on ports 30080 and 30443 for HTTP and HTTPS respectively (these are the ports registered as backend targets in the NLB). With that, the network architecture of our cluster is complete.
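The post does not prescribe how exactly the controller is installed; assuming Helm and the Terraform Helm provider (with its 2.x-style set blocks) pointed at the new cluster, pinning those NodePorts could look roughly like this:

    resource "helm_release" "ingress_nginx" {
      name             = "ingress-nginx"
      repository       = "https://kubernetes.github.io/ingress-nginx"
      chart            = "ingress-nginx"
      namespace        = "ingress-nginx"
      create_namespace = true

      # Expose the controller as a NodePort Service on the ports the NLB targets.
      set {
        name  = "controller.service.type"
        value = "NodePort"
      }
      set {
        name  = "controller.service.nodePorts.http"
        value = "30080"
      }
      set {
        name  = "controller.service.nodePorts.https"
        value = "30443"
      }
    }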

Using this ingress controller, we'll deploy a dashboard. Once it is available, we can open it in a browser: https://{cluster-public-ip}/dashboard.

We'll also deploy cert-manager, which helps with issuing Let's Encrypt HTTPS certificates. After its deployment is complete, we will create a ClusterIssuer for Let's Encrypt. There is a small peculiarity: it takes some time for cert-manager to become available, and until then attempts to create a ClusterIssuer fail with a cryptic error, while there is no simple K8s API call to check cert-manager readiness. That is why we retry the creation of the ClusterIssuer until it succeeds (it usually takes a minute or so).
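The retry itself is straightforward. As a sketch, it could be another remote-exec loop on the leader node, assuming kubectl there is already configured and the ClusterIssuer manifest has been copied over (the file path and resource names are hypothetical):

    resource "null_resource" "letsencrypt_issuer" {
      depends_on = [helm_release.cert_manager] # a hypothetical cert-manager release

      connection {
        type        = "ssh"
        host        = oci_core_instance.leader.public_ip
        user        = "ubuntu"
        private_key = file(var.ssh_private_key_path)
      }

      provisioner "remote-exec" {
        inline = [
          # cert-manager's webhook needs a while to come up; keep retrying until the apply succeeds
          "until kubectl apply -f /tmp/letsencrypt-cluster-issuer.yaml; do echo 'cert-manager is not ready yet, retrying...'; sleep 10; done",
        ]
      }
    }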

It works in conjunction with the ingress controller: to enable certificate issuance, an Ingress resource must be set up with the appropriate public DNS name as its host.
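For illustration, an Ingress for a hypothetical app written with the Terraform kubernetes provider could look like this; the cert-manager.io/cluster-issuer annotation ties it to the ClusterIssuer created above, and the host, service and issuer names are placeholders:

    resource "kubernetes_ingress_v1" "app" {
      metadata {
        name = "my-app"
        annotations = {
          "cert-manager.io/cluster-issuer" = "letsencrypt"
        }
      }

      spec {
        ingress_class_name = "nginx"

        # cert-manager stores the issued certificate in this secret
        tls {
          hosts       = ["app.example.com"]
          secret_name = "my-app-tls"
        }

        rule {
          host = "app.example.com"
          http {
            path {
              path      = "/"
              path_type = "Prefix"
              backend {
                service {
                  name = "my-app"
                  port {
                    number = 80
                  }
                }
              }
            }
          }
        }
      }
    }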

Bonus: free public domain name

We can register a free domain name with the Freenom registrar. It is reserved for a year (after that period elapses, we have to renew it manually, which can also be done for free).

Once we have it, we can configure the domain to point to the reserved public IP of the NLB: go to Services - My Domains - Manage Domain - Manage Freenom DNS. We can add multiple third-level domains and point them to the same public IP.

The intention is to make cluster-specific apps available under a cluster subdomain, while regular apps become available under the domain itself. Now that we have a public domain, we can issue a proper Let's Encrypt HTTPS certificate for an app. These rules are set up using Ingress resources in the cluster; that is how it could be done for the dashboard.

N.B.: as it is free, we get almost no guarantees. I registered a domain in the .ga zone, and to my surprise found out that it was not available in some locations (specifically, it could not be resolved from the US West Coast, New Zealand, or Singapore). I wrote to the .ga zone support, and after a couple of days the issue was resolved (though I never got a response).

A domain in another zone did not have such issues.
(Screenshot caption: a.ns.ga, that's how it SHOULD NOT be.)

So that's how we can provision compute and network resources in Oracle Cloud to deploy a K8s cluster with a public IP and load balancing, while staying in the Always Free tier.