WellSky R&D Website

6 Steps to Save on Your Cloud Costs

Long gone are the days when development teams could develop an application, package it up, and ship it without much thought to how it would run. Modern development teams must attend to all kinds of operational aspects, which require at least as much focus as the core business problem being solved. For teams hosted in a public cloud, a significant operational aspect is the cost of the services consumed from the cloud provider. Teams that manage this well can thrive, balancing the monetary costs against the productivity gained by relying on pre-built, composable building blocks of capabilities and infrastructure. Teams that do not, however, drown under the weight of the financial pressure, unable to leverage new capabilities where appropriate.

Context

WellSky’s Enterprise Platform group is responsible for three main areas:

Developer frameworks and utilities
Common / shared business services
Multi-end market solutions

We are largely cloud-native and relatively mature on the spectrum of modern software delivery practices. Originally the software was hosted in AWS, but near the end of 2021 we started migrating everything to GCP, completing the process in February 2023. We hit the height of our cloud spend in December 2022, when we had a significant presence in both AWS and GCP. In October we had released a massive set of services that brought with it a 3x increase in total cost. We knew that without aggressive efforts to actively manage and lower our costs, they would soon impact our ability to develop new capabilities and even hire new team members.

So, how did we do it?

First Things First: Gain Visibility

Cloud providers do include billing modules that let you track your costs by various dimensions, but they generally lack some features needed to make cost tracking truly effective:

They’re specific to that cloud provider
They have limited grouping options
They do not support custom report-only labels
They do not easily provide multi-widget dashboards

To solve these problems, we invested in Ternary (https://ternary.app). In addition to fixing all the above problems, it also:

Offers cost saving recommendations across multiple services
Allows for easy budget creation and automated alerting when crossing thresholds within that budget
Can alert on cost anomalies

Below is a screenshot of a portion of the dashboard we built for our group, which gives us a high-level picture of our cost trends across various dimensions. One can drill into any of these widgets for deeper analysis.

[Screenshot: portion of the group's cost-trend dashboard]

Fix Resource Labels

Effective reporting is only as good as how consistently you apply your resource tags (AWS) or labels (GCP). These labels are the primary dimensions that you filter and group by and are therefore critically important in tracing your costs back to an owner, application, or environment.

We’ve standardized on a set of label keys and broadly use Terraform for all infrastructure build-out in the cloud, so enforcing these labels on all resources was easy to accomplish:

variable "labels" {
  description = "The global labels to add to all resources."
  type = object({
    application   = string
    business-unit = string
    environment   = string
    owner         = string
    service       = string
  })
}
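As a minimal sketch of how such a variable might be consumed, the block below passes the whole label set to a single resource. The bucket resource, its name, and its location are hypothetical placeholders chosen for illustration, not part of our actual configuration.

```hcl
# Hypothetical example resource; the bucket name and location are placeholders.
resource "google_storage_bucket" "reports" {
  name     = "example-reports-bucket"
  location = "US"

  # The object of strings converts to the map(string) that "labels" expects,
  # so every resource receives the same standardized label keys.
  labels = var.labels
}
```

Because the variable has no default, a module written this way cannot be planned without a value for every label key.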

What was much more difficult was achieving consistency in the label values. Here’s a screenshot showing an example cost breakdown by the “application” label:

[Screenshot: cost breakdown by the “application” label, before cleanup]

Some inconsistencies include:

Incorporation of the “environment” label value into the “application” label value when a separate environment label already exists
Inconsistent use of punctuation (hyphen in some places, underscore in others)
Inconsistent label values for the project vs. the resources within the project
A large bucket of costs going to no defined application, meaning that no application label was applied at all
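Some of these inconsistencies could be caught before they reach a billing report by adding validation rules to the `labels` variable. The following is only a sketch under assumptions: the naming convention (lowercase letters, digits, and hyphens) and the rule that an application value must not end with the environment name are illustrative choices, not rules stated in this article.

```hcl
variable "labels" {
  description = "The global labels to add to all resources."
  type = object({
    application   = string
    business-unit = string
    environment   = string
    owner         = string
    service       = string
  })

  # Assumed convention: lowercase letters, digits, and hyphens only.
  # This rejects empty values and underscores.
  validation {
    condition = alltrue([
      for value in var.labels : can(regex("^[a-z][a-z0-9-]*$", value))
    ])
    error_message = "Label values must start with a lowercase letter and contain only lowercase letters, digits, and hyphens."
  }

  # Assumed rule: the environment belongs in its own label,
  # not as a suffix of the application label.
  validation {
    condition     = !can(regex("-${var.labels.environment}$", var.labels.application))
    error_message = "The application label must not end with the environment name."
  }
}
```

Validation of this kind only covers resources created through Terraform; labels set on projects or resources by other means would still need a separate review.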

After fixing these issues, it looks like this instead:

[Table: cost breakdown by the “application” label, after cleanup]

It’s now much easier to see how much each application costs.

Cost Management Tactics

This next section covers the specific things we did to manage our costs once we had good visibility into what they were across the various applications and teams. Mileage may vary for your environment based on the services you use. These items are specific to GCP services and pricing but may apply to other cloud providers as well.

Item 1 – Destroy Resources in the “Old Cloud” ASAP

As mentioned, we were migrating services from AWS to GCP, so there was a period of time when resources existed in both cloud environments. Obviously, paying for unused infrastructure is not good, but it was surprising how much the unused infrastructure cost just to sit idle. Additionally, we learned some lessons that guided us to set things up more efficiently in GCP than we originally had in AWS. For example, in AWS we had multiple single-team EKS clusters, while in GCP we set up a single GKE cluster shared by all teams in the organization.

Destroying infrastructure in a controlled manner can be easier said than done, however. We first attempted to remove things by commenting out various sections of Terraform code, but we were battling frustrating circular dependencies that were difficult to resolve without further investment in the Terraform design. Investing in a redesign of Terraform for an environment that we just wanted to destroy was a waste of time. In the end, we abandoned any effort to keep the Terraform clean, elevated our IAM permissions, and destroyed things via the console.

The lesson here was: when it comes to tearing down infrastructure from an old cloud provider, don’t be unnecessarily surgical. The savings to be had are too great.

Monthly spend reduced: 27%

Item 2 – Consider Replacing Cloud Functions Gen1 w/ Gen2 or Cloud Run

Using Cloud Functions as the always-on backend for a web service is generally not a good idea for the following reasons:

You are charged for each invocation
Gen1 does not support processing multiple transactions concurrently in the same function instance

For a web service that has a steady state of 100 requests per second, you are charged for 8,640,000 service calls / day. That alone is $105.12 / month just in overhead, not counting the CPU and memory costs of each of those invocations. Additionally, as transaction volume waxes and wanes throughout the day, the lack of concurrency means function instances are continually created and destroyed. Each creation incurs a significant cold start cost, not only in throughput and latency but also financially.
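The $105.12 figure can be reproduced with simple arithmetic. The sketch below expresses it as Terraform locals, assuming a price of $0.40 per million invocations and an average month of 365/12 days, with no free-tier allowance. Neither assumption is stated in this article, so check current pricing before relying on the numbers.

```hcl
locals {
  requests_per_second = 100

  # 100 requests/second * 86,400 seconds/day = 8,640,000 invocations/day
  invocations_per_day = local.requests_per_second * 86400

  # Average month of 365/12 days (assumption): 262,800,000 invocations/month
  invocations_per_month = local.invocations_per_day * 365 / 12

  # Assumed list price per million invocations, ignoring any free tier.
  price_per_million_usd = 0.40

  monthly_invocation_fee_usd = local.invocations_per_month / 1000000 * local.price_per_million_usd
}

output "monthly_invocation_fee_usd" {
  value = local.monthly_invocation_fee_usd
}
```

Under those assumptions the output works out to 105.12, before any CPU or memory charges are added.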

Cloud Functions are an ideal fit for ad hoc processing that has a relatively low invocation count. For high-volume web services, particularly ASP.NET Core controller-based web APIs like the ones we were running, Cloud Run is a much better choice.

By replacing Cloud Functions Gen1 with Cloud Run in our heaviest data pipelines, we reduced our monthly spending by an additional 32%.

Item 3 – Reduce Unnecessary Logging

Logging costs can add up fast at $0.50 / GB, especially when you are charged not just for the logging your application does, but any logging done by the cloud infrastructure itself. As an example, every Cloud Function invocation logs messages like “Starting function” and “Ending function” that cannot be disabled.

Look for redundant logging across different log files. For example, maybe your application logs a message every time you get an incoming request, but the cloud infrastructure may already log this information for you automatically. API audit logging can be a particularly costly option.

After analyzing our logs, disabling unnecessary logging, and removing redundant logging, we lowered our monthly spending by another 13%.
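One way to drop logs you have decided you do not need is a project-level exclusion, which keeps matching entries from being ingested. The sketch below is illustrative only: the exclusion name and the filter (DEBUG-level container logs) are assumptions, not the filters we actually used, and any real filter should be reviewed carefully because excluded entries are not retained.

```hcl
# Hypothetical exclusion; the name and filter are placeholders for illustration.
resource "google_logging_project_exclusion" "debug_container_logs" {
  name        = "exclude-debug-container-logs"
  description = "Drop DEBUG container logs before they are ingested."

  # Cloud Logging query: only entries matching this filter are excluded.
  filter = "resource.type=\"k8s_container\" AND severity=DEBUG"
}
```

Managing exclusions in Terraform keeps the decision about what is not logged reviewable alongside the rest of the infrastructure.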

Item 4 – Consider Memorystore over Firestore

We were using Firestore, GCP’s NoSQL document database, to store pessimistic locks across a distributed system. Other workloads might similarly use Firestore as a store for high-volume transient runtime data. A better service for such purposes is Memorystore (either Redis or Memcached). The main difference in pricing is that Firestore charges for each read and write operation, while Memorystore charges based on the size of the Memorystore cluster. In our case, our storage needs were very small, but we performed millions of read and write operations per day against that store. Memorystore was not only cheaper but also had lower latency.

This switch reduced our total monthly costs by 4%.
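For a small, operation-heavy workload like this, the Memorystore side can be very little configuration. The following is a minimal sketch, assuming a 1 GB basic-tier Redis instance is enough; the instance name, region, tier, and size are hypothetical and should be sized to your own data and availability needs.

```hcl
# Hypothetical instance; name, region, tier, and size are placeholders.
resource "google_redis_instance" "locks" {
  name   = "distributed-locks"
  region = "us-central1"

  # BASIC has no replica; choose STANDARD_HA if lock data must survive a node failure.
  tier = "BASIC"

  # Cost follows provisioned capacity, not the number of reads and writes.
  memory_size_gb = 1
}
```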

Item 5 – Downsize VMs Where Possible

Both Ternary Insights and GCP Insights will notify you of cost saving opportunities. The vCPU : GB ratio for GCP’s “standard” VMs is 1:4, which is a good default but may not suit your workloads. Consider GCP’s “custom” VM sizing feature, which allows much more flexibility in these ratios so that you can create a VM tailor-made for the workload you run on it.

If you are running a standard GKE cluster, look for excess capacity and see if it’s possible to downsize the nodes in your node pool. If you have small bastion VMs that are used to get access to resources in your VPC, consider using a shared core machine type for these. Assign a resource policy to VMs in your non-prod environments to automatically turn them off during non-working hours and start them back up before the next workday begins. Don’t create boot disks larger than the boot image unless needed.
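Several of these tips can be combined in one place. The sketch below shows a non-prod bastion VM on a shared-core machine type with an instance schedule attached; the names, region, zone, image, time zone, and cron schedules are all hypothetical, and the Compute Engine service agent typically needs permission to start and stop instances for the schedule to take effect.

```hcl
# Hypothetical schedule: start at 07:00 and stop at 19:00 on weekdays.
resource "google_compute_resource_policy" "workday_schedule" {
  name   = "nonprod-workday-schedule"
  region = "us-central1"

  instance_schedule_policy {
    time_zone = "America/Chicago"

    vm_start_schedule {
      schedule = "0 7 * * 1-5"
    }

    vm_stop_schedule {
      schedule = "0 19 * * 1-5"
    }
  }
}

# Hypothetical bastion VM on a shared-core machine type.
resource "google_compute_instance" "bastion" {
  name         = "nonprod-bastion"
  machine_type = "e2-micro"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      # No explicit size, so the disk is no larger than the boot image requires.
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
  }

  # Attach the start/stop schedule to this VM.
  resource_policies = [google_compute_resource_policy.workday_schedule.self_link]
}
```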

One thing to keep in mind is that while turning off a VM will stop its cost accrual, it will not stop cost accrual for any persistent disks attached to that VM. We had VMs that were shut down for over a year, but the attached PDs were still accruing costs that no one noticed.

Through various efforts in this space, we lowered total monthly costs by another 3%.

Item 6 – Use Spot Instances Where Possible

Spot instances can give big discounts, but they don’t make sense for many workloads. Target your non-prod environments first, and be careful in prod to ensure your workloads are resilient to failure.

An area where spot instances can be particularly useful is a GKE node pool. When auto scaling your cluster to add more nodes, GKE will automatically prefer the cheapest node within your set of node pools. Therefore, you can configure a default node pool with non-spot instances and another “spot” node pool with spot instances of the same size. This setup allows for use of non-spot instances if the spot instances get reclaimed, but otherwise your workloads will run on the spot instances for much cheaper rates.
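A spot node pool alongside an existing default pool might look like the sketch below. The pool name, machine type, autoscaling limits, and the two input variables are assumptions for illustration; the machine type should match the one used by your non-spot pool.

```hcl
# Hypothetical inputs identifying an existing GKE cluster.
variable "cluster_name" {
  type = string
}

variable "cluster_location" {
  type = string
}

resource "google_container_node_pool" "spot" {
  name     = "spot-pool"
  cluster  = var.cluster_name
  location = var.cluster_location

  # Allow the pool to scale to zero when spot capacity is not needed.
  autoscaling {
    min_node_count = 0
    max_node_count = 5
  }

  node_config {
    # Same size as the non-spot pool (assumed here to be e2-standard-4).
    machine_type = "e2-standard-4"

    # Provision nodes as Spot VMs, which can be reclaimed at any time.
    spot = true
  }
}
```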

One word of caution if you’re setting up your spot node pool in an existing cluster: existing workloads will continue to run on their current node pool, and your spot node pool will remain at 0 nodes. To get the cost savings, you need to force the migration of your pods over to the spot node pool so that your standard node pool can scale down to 0. This pod migration will cause downtime, however, as GKE is not smart enough to ensure the new pods on the other node pool have finished starting before stopping the current pods.

Switching our non-prod GKE clusters to use spot instances saved us 1%.

Total Savings

Through the various efforts described above, even with the addition of new workloads, by July 2023 our monthly costs were 72% lower than they were at that height in December 2022. This optimization allows us to put our resources into building new experiences for our clients instead of just operating our cloud infrastructure.

Final Recommendations

Some final words and guidance for managing your costs on an ongoing basis:

Assign a single point person for your organization to own your FinOps process and be the liaison between your organization and others. However, this doesn’t mean this single person needs to handle all cost-saving actions across your teams; that work should be distributed.

Make each department leader responsible for the costs incurred by their group.
Regularly review your costs, at least monthly – more often when actively combatting costs.
Don’t ignore the small things – they add up.
Create and manage to department-specific budgets, even if those are ultimately rolled up and managed at a higher corporate level.
Generally, ignore Reserved Instance (RI) and Committed Use Discounts (CUDs) when looking at cost savings within your specific area. Any committed usage you free up remains available to other groups in your account, thereby lowering their costs.
