Tips for saving AWS EKS Cost

Jeewan Sooriyaarachchi
5 min read · Jul 31, 2022

Save cost without compromising high availability

Source: https://images.app.goo.gl/TYqouXD1PXCRUnJNA

Create Budget Alerts

If you want to reduce cost, you need to keep monitoring it and stay on top of your AWS spending. Make sure you tag all your resources while creating the K8s cluster and all related resources around it, such as load balancers, CloudFront, etc. (see the tagging sketch after the list below). Then create AWS Budget alerts with those tags and keep monitoring the cost. If you see sudden increases in cost, you can dig in and find where/why it happened. These are a few examples I encountered:

  • Some expensive GPU instances kept running during off-peak hours
  • A sudden increase in AWS S3 GetObject and ListObject operations caused extra $$. This was an application issue, though
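As a minimal sketch of the tagging piece, assuming the cluster is created with eksctl, cost allocation tags can be declared in the cluster config (the tag keys and values below are hypothetical):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster            # hypothetical cluster name
  region: us-east-1
  tags:                       # applied to the cluster's AWS resources for cost allocation
    project: my-product
    environment: production
    team: platform

Remember to activate these keys as cost allocation tags in the AWS Billing console; only then can they be used to filter AWS Budgets and Cost Explorer.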

Horizontal Pod Autoscaling

One of the major benefits of using K8s is the capability of horizontal pod autoscaling, which means the number of pods automatically scales up and down based on the load. In order to achieve this,

  • You must first configure the observability of the cluster and measure the current utilization of each of the applications / pods.
  • Then find out the minimum CPU, memory (and GPU) required to run the application.
  • Configure pod resource requests and limits based on those observations. This is a continuous process where you have to keep monitoring and optimizing them.
  • Configure the Horizontal Pod Autoscaler (HPA) with a minimum of at least 2 pods for production systems and a higher value for the maximum (see the sketch after this list).
  • In an ideal scenario, the number of pods should increase based on the workload and decrease when demand goes down. When that happens, cluster-autoscaler will scale the number of instances up and down behind the scenes. This makes sure you are not keeping your infra idle and being charged by AWS for it.
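Here is a minimal sketch of what the resource requests/limits and HPA configuration look like in practice; the application name, namespace and all of the numbers are hypothetical placeholders you would replace with values from your own measurements:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # hypothetical application
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:latest
        resources:
          requests:            # based on observed utilization
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2               # keep at least 2 pods for availability
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70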

Use Spot Instances

The majority of the EKS cost is spent on EC2 instances running behind node groups. Spot Instances are way cheaper than On-Demand but have their own drawbacks. You can find more about Spot Instances in the AWS documentation. For example, a t3.xlarge On-Demand instance costs around $0.2112 per hour while the same Spot instance costs around $0.0634, roughly 3x cheaper than the former.

Your cluster will have several node groups segregated based on use cases, as I explained here. For example, you can create two node groups for production workloads with the same node label: one node group with On-Demand instances and the other with Spot instances. Then you can configure cluster-autoscaler to first try to scale up using Spot instances and, if that fails, fall back to On-Demand instances (see the sketch below).
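One way to implement that ordering, assuming you run cluster-autoscaler with its priority expander (--expander=priority), is the priority-expander ConfigMap; the node group name patterns below are hypothetical:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander   # name expected by the priority expander
  namespace: kube-system
data:
  priorities: |-
    # higher number = tried first
    20:
      - .*spot.*         # prefer node groups with "spot" in the ASG name
    10:
      - .*on-demand.*    # fall back to on-demand node groups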

I have been running non-production environments entirely on Spot instances without compromising high availability or performance. If there are no Spot instances available, the system will automatically fall back to On-Demand instances, so you don’t have to worry about the availability of the system.

Factor in Non-AWS Costs

We always keep an eye on AWS costs, but we can miss out on non-AWS costs such as Datadog, Splunk or any other systems connected to our environment. Initially I used a large pool of small Spot instances (t3.medium) and later found out that Datadog charges per node, which was more expensive than the AWS Spot instance cost. I then switched to larger instances and brought down the number of instances in the cluster so that the Datadog cost would not be so high. This is just one example, so keep an eye on the other systems you are using.

Shut down compute during off hours

In most places, development or test environments are used by internal developers, and 90% of the time they use them during office hours. Hence, there is no point in running those environments during off hours and burning cash.

Let me explain how I configured the automated off-hours shutdown and bring-up of non-production environments. This environment consists of a set of tools such as ArgoCD, as I explained in this article.

  • Make sure all your Deployments and StatefulSets are configured with resource requests and limits. This is absolutely necessary for the cluster to scale up and down properly.
  • All autoscaling groups running behind node groups are configured with a minimum of 0 instances and a maximum of x. You have to decide x based on your workloads (see the sketch after this list).
  • Configure the Horizontal Pod Autoscaler with the minimum and maximum pods required for the cluster.
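As a minimal sketch of such a node group, assuming the cluster is managed with eksctl (the names, instance types and sizes are hypothetical):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: dev-cluster            # hypothetical non-production cluster
  region: us-east-1
managedNodeGroups:
- name: dev-spot
  instanceTypes: ["t3.large", "t3a.large"]   # spot pool with multiple types
  spot: true
  minSize: 0                   # allow the ASG to scale all the way down
  maxSize: 10                  # "x" — choose based on your workloads
  desiredCapacity: 1
  labels:
    workload: dev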

Once you configure the above and deploy the cluster, the ASGs will automatically scale up and create the number of nodes / instances needed to cater to the pods running in your environment. Next, the interesting part: shutting down and bringing up the workloads.

Shutting down

  • Run a K8s CronJob to set the replica count of all your Deployments and StatefulSets to 0 (a CronJob sketch is included after this list).
kubectl scale --replicas=0 deployment <deployment-name> -n <namespace>
  • The CronJob should run in a pod with credentials to log in to EKS and run the above command.
  • Create an AWS IAM role (assume role) with these permissions: ‘eks:UpdateClusterConfig’, ‘eks:ListClusters’, ‘eks:DescribeCluster’.
  • Configure the trust relationship of the assume role to ensure that only the service account from the EKS OIDC provider is allowed to assume the role.
  • Then create K8s roles with the below permissions
- apiGroups: ["apps"]
  resources: ["deployments/scale","deployments/status","statefulsets/scale","statefulsets/status"]
  verbs: ["update", "get", "watch", "list"]
- apiGroups: ["apps"]
  resources: ["deployments","statefulsets"]
  verbs: ["update", "get", "watch", "list"]
  • Next, create RoleBindings and attach them to a Kubernetes user (see the sketch after this list).
  • Now allow the IAM role to access the AWS EKS cluster as the K8s user specified in the role bindings (via the aws-auth mapRoles entry). You can find detailed instructions here for configuring EKS access.
- rolearn: arn:aws:iam::111122223333:role/my-eks-access-role
  username: k8s-username-configured-in-role-binding
  • All set for enabling the scheduled shutdown except for one thing. If you are using ArgoCD, the above replica changes will be overridden automatically within a few seconds. To avoid this, configure your ArgoCD Applications to ignore differences in replica counts. You can find details here for the ArgoCD changes.
spec:
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/replicas
  - group: apps
    kind: StatefulSet
    jsonPointers:
    - /spec/replicas
  • Now you have all the components required. Discuss with the team, decide the time you want to shut down the system, and configure the K8s CronJob accordingly.
  • Once all your replica counts are set to 0, there will be no pods running in that particular environment. After a while, cluster-autoscaler will shut down all the nodes / instances.
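To tie the pieces together, here is a minimal sketch of the RoleBinding, the ServiceAccount (annotated for IAM Roles for Service Accounts) and the shutdown CronJob. The names, namespace, schedule and image are hypothetical placeholders, and the container image is assumed to have both the AWS CLI and kubectl available:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: scale-workloads            # hypothetical name
  namespace: <namespace>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: scale-workloads            # the Role carrying the rules shown earlier
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: k8s-username-configured-in-role-binding
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: off-hours-scaler
  namespace: ops
  annotations:
    # IRSA: pods using this ServiceAccount can assume the IAM role created earlier
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/my-eks-access-role
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-off-hours
  namespace: ops
spec:
  schedule: "0 19 * * 1-5"         # e.g. 19:00 on weekdays; CronJob schedules default to UTC
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: off-hours-scaler
          restartPolicy: OnFailure
          containers:
          - name: scaler
            image: my-registry/aws-kubectl:latest   # hypothetical image with AWS CLI and kubectl
            command:
            - /bin/sh
            - -c
            - |
              aws eks update-kubeconfig --name <cluster-name> --region <region>
              kubectl scale --replicas=0 deployment --all -n <namespace>
              kubectl scale --replicas=0 statefulset --all -n <namespace>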

Bring up

Bringing up the workloads is no different. Just run another K8s CronJob with all of the above features; instead of setting the replica count to 0, set it to 1.

kubectl scale --replicas=1 deployment <deployment-name> -n <namespace>
  • Once you set the replica counts to 1, EKS will start creating pods and trigger cluster-autoscaler to create the required number of nodes / instances.

That’s all I have to share at the moment. I would like to hear if you have any other ways to save cost in an EKS environment.
