Small mistakes in the cloud can have big consequences.
John Purcell, chief product officer at custom developer DoiT International Ltd., tells of one customer who made a keystroke error that caused the company to spin up an Amazon Web Services Inc. instance much larger than what was needed. A job that was supposed to finish on Friday was never turned off and ran all weekend, resulting in $300,000 in unnecessary charges. “There is a small single-digit percentage of companies that manage cloud costs well,” he said.
Fifteen years after Amazon.com Inc. launched the first modern cloud infrastructure service, customers are still coming to grips with how to plan for and manage in an environment with dozens of variables that don’t exist in the data center — including a nearly limitless capacity to waste money.
Not that this is slowing cloud adoption. The recently released 2022 State of IT Report from Spiceworks Inc. and Ziff Davis Inc. reported that 50% of business workloads are expected to run in the cloud by 2023, up from 40% in 2021. But information technology executives express frustration at the difficulty of getting the visibility they need to plan accurately for cloud infrastructure costs.
A recent survey of 350 IT and cloud decision-makers by cloud observability platform maker Virtana Inc. found that 82% said they had incurred unnecessary cloud costs, 56% lack tools to manage their spending programmatically and 86% can’t easily get a global view of all their costs when they need it. Gartner Inc. predicts that 60% of infrastructure and operations leaders will encounter public cloud cost overruns. And Flexera Software LLC’s 2020 State of the Cloud Report estimated that 30% of enterprise spending on cloud infrastructure is wasted.
“Nearly 50% of cloud infrastructure spend is unaccounted for,” estimated Asim Razzaq, chief executive of Yotascale Inc., which makes dynamic cost management software targeted at engineers.
The issue leapt into public view earlier this year in a post titled “The Cost of Cloud, a Trillion Dollar Paradox.” Martin Casado and Sarah Wang at the venture capital firm Andreessen Horowitz concluded that for software companies operating at large scale, the cost of cloud could double a firm’s infrastructure bill — resulting in a collective loss of $100 billion in market value based on the impact of cloud costs on margins.
Although not everyone agreed with the analysis, it’s clear that cloud costs can rise quickly and unexpectedly, and it’s something even staunch cloud advocates say needs to be addressed head-on. The topic is likely to be discussed this coming week in the exhibit halls at AWS’ re:Invent conference in Las Vegas, since AWS remains far and away the largest cloud services provider.
No one is laying the blame for the situation squarely at the door of infrastructure-as-a-service companies. “Every single cloud service provider wants ‘good revenue,’” said Eugene Khvostov, vice president of product and engineering at Apptio Inc., a maker of IT cost management products. “They don’t want to make money on resources that aren’t used.”
But the sheer complexity of options users have for deploying workloads, compounded by multiple discount plans, weak governance policies and epic bills, can frustrate anyone trying to get a coordinated picture of how much they’re spending. “You want granularity, but the costs are coming in every day, every hour, every second,” Khvostov said.
“Cloud bills can be 50 pages long,” agreed Randy Randhawa, Virtana’s senior vice president of research and development and engineering. “Figuring out where to optimize is difficult.”
Not the data center
Much of the reason for cloud cost overruns comes down to organizations failing to understand and accommodate the fundamental differences between the data center capital-expense cost model and the operating-expense nature of the cloud. Simply stated, the costs of running a data center are front-end loaded into equipment procurement, but the marginal cost of operating that equipment once it’s up and running is relatively trivial.
In the cloud, there are no capital expenses. Rather, costs accrue over time based on the size, duration and other characteristics of the workload. That means budgeting for and managing cloud resources is a constant ongoing process that requires unique tools, oversight and governance.
To get a sense of where economies can be achieved, SiliconANGLE contacted numerous experts who specialize in cloud economics. Their approaches to helping clients rein in costs range from automated tools that look for cost-saving opportunities to consulting services centered on budgeting and organizational discipline. They identified roughly three major areas where money is most often wasted – provisioning, storage and foregone discounts — as well as an assortment of opportunities for what Yotascale’s Razzaq called “micro-wastage,” or small money drips that add up over time.
Provisioning infrastructure in the cloud is pretty much the same as it is in the data center. The application owner or developer specifies what hardware and software resources are needed and a dedicated virtual server – or “instance” – is allocated that matches those requirements.
If needs change, though, the time and cost dynamics of the data center and the cloud diverge. Getting access to additional on-premises memory or storage can take hours or days in the case of a virtual machine and weeks if new hardware must be procured.
In contrast, cloud providers allow additional resources and machines to be quickly allocated either on a temporary or a permanent basis. “There is no longer a need to provision a workload with all the capacity it will need over its lifespan,” said Karl Adriaenssens, chief technology officer at GCSIT Inc., an infrastructure engineering firm.
Old habits die hard, though. To accommodate the relative inflexibility of data center infrastructure, developers and application owners tend to overestimate the resources that will be needed for a given workload. “Developers are concerned about making sure their apps perform well and they tend to overprovision to be on the safe side,” said Harish Doddala, senior director of product management at cloud cost management firm Harness Inc.
Incentives reward this approach. “You don’t get into trouble if you overspend a little but you do get into trouble if the application doesn’t perform,” said Razzaq.
All cloud platform providers offer autoscaling capabilities that make allocating additional capacity automatic, with the only charges being for the additional capacity used. However, users often don’t think to deploy them.
As a result, “40% of cloud-based instances are at least one size too big,” estimates DoiT’s Purcell. “You’d be surprised how often workloads run at 5% to 10% utilization.”
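The utilization gap Purcell describes is straightforward to spot once metrics are in hand. A minimal sketch, assuming hypothetical CPU samples and an illustrative threshold (in practice the data would come from a monitoring service such as CloudWatch):

```python
# Flag instances whose average CPU utilization suggests they are at
# least one size too big. All samples and thresholds are illustrative.
from statistics import mean

# Hypothetical CPU utilization samples (percent) per instance.
cpu_samples = {
    "web-1":   [4, 6, 5, 8, 7, 5],
    "etl-1":   [62, 71, 58, 66, 70, 64],
    "batch-3": [9, 11, 8, 10, 7, 12],
}

RIGHTSIZE_THRESHOLD = 20.0  # below this average, consider a smaller size

def rightsizing_candidates(samples, threshold=RIGHTSIZE_THRESHOLD):
    """Return instance names whose mean CPU sits below the threshold."""
    return sorted(
        name for name, cpu in samples.items() if mean(cpu) < threshold
    )

print(rightsizing_candidates(cpu_samples))  # ['batch-3', 'web-1']
```

Here the hypothetical “web-1” and “batch-3” idle in the 5% to 10% range Purcell describes, while “etl-1” is genuinely busy and left alone.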
As attractive as autoscaling sounds, experts advise using it with caution. On-demand instances, which are the most commonly used but also the most expensive type of cloud VM, can run up large bills if capacity expands too much. Autoscaling can become addictive, prompting users to create multiple database copies and log files that eat up storage and money.
“Autoscaling will allow you to meet just about any workload requirement, but running almost unlimited scaling with on-demand instances can get out of hand,” said Adriaenssens.
The tactic also doesn’t always work as easily in reverse. If database sizes grow during a period of high activity, they may exceed a threshold that makes it difficult to scale the instance back down again. “If you scale down [Amazon’s] RDS, it’s painful. It may change your license type,” said Travis Rehl, vice president of product at CloudCheckr Inc., a maker of cloud visibility and management software that’s being acquired by NetApp Inc. “It’s possible but the effort can be very high.”
The second type of overprovisioning occurs when cloud instances are left running after they’re no longer needed. In an on-premises environment, this isn’t a big problem, but the clock is always running in the cloud.
Usage policies that give users too much latitude to control their own instances are a common culprit. Someone may spin up an instance for a short-term project and then forget to shut it down. It may be months before anyone notices – if it’s noticed at all.
“Companies may have a policy of creating new accounts for new workloads, and after hundreds have been created, it becomes a bear to manage,” said Razzaq. IT administrators may fear shutting down instances because they don’t know what’s running in them and the person who could tell them has left the company, he said.
Developers, who are more motivated by creating software than managing costs, are often culprits, particularly when working on tight deadlines. “Typically, the budget is managed by finance, but the ones who actually cause the overruns are the developers themselves,” said Harness’ Doddala.
When Cognizant Technology Solutions Corp. was called in to help one financial services customer rein in its costs in the Microsoft Corp. Azure cloud, it found numerous unnecessary copies of databases, some of which exceeded a terabyte in size. Virtual machines were running round-the-clock whether needed or not.
“The company was prioritizing deadlines over efficiency,” said Ryan Lockard, Cognizant’s global chief technology officer. Cognizant cut the customer’s cloud costs in half, mainly by imposing operational discipline.
A wide variety of automated tools from the cloud providers and their marketplace partners can help tame runaway instances, but customers often don’t have time to learn how to use them. Simple tactics can yield big savings, though, such as tagging instances so that administrators can view and manage them as a group. “You can specify policies for apps by tags and within those policy constructs define what you want to track and take actions,” said Virtana’s Randhawa.
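Randhawa’s tag-based approach can be sketched in a few lines. The inventory, tag names and business-hours window below are all illustrative assumptions; a real implementation would read inventory from the provider’s API and stop instances through it:

```python
from datetime import datetime

# Illustrative inventory; a real list would come from the cloud
# provider's API, and "stopping" would be an API call as well.
instances = [
    {"id": "i-01", "tags": {"env": "dev",  "team": "payments"}},
    {"id": "i-02", "tags": {"env": "prod", "team": "payments"}},
    {"id": "i-03", "tags": {"env": "dev",  "team": "search"}},
]

def select_by_tag(instances, key, value):
    """Group instances by tag so one policy can act on all of them."""
    return [i["id"] for i in instances if i["tags"].get(key) == value]

def outside_business_hours(now):
    """Policy: dev boxes are off on weekends and outside 8:00-18:00."""
    return now.weekday() >= 5 or not (8 <= now.hour < 18)

dev_ids = select_by_tag(instances, "env", "dev")
if outside_business_hours(datetime(2021, 11, 27, 22, 0)):  # a Saturday night
    print("would stop:", dev_ids)  # would stop: ['i-01', 'i-03']
```

The point of the tag is that the policy never needs to know individual instance IDs; anything labeled “env=dev” is covered automatically, including instances created after the policy was written.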
All cloud providers offer automated tools to manage instances in bulk. For example, Amazon’s Systems Manager Automation can start and shut down instances on a pre-defined schedule and the company’s CloudWatch observability platform has drawn high praise for its ability to spot and stop overages. Microsoft’s Azure Cost Management and Billing does the same on the Azure platform and Google LLC’s Active Assist uses machine learning to automate a wide variety of administrative functions, including sizing instances appropriately and identifying discount opportunities.
Numerous well-funded startups are also active in this market, including NetApp Inc.’s Spot for optimizing the use of Spot Instances, ParkMyCloud for resource optimization and CloudZero for cost visibility. IBM Corp., VMware Inc., Nutanix Inc. and HashiCorp all have footholds in the market. Zesty Tech Ltd. just this week announced a $35 million Series A funding round for an approach that uses artificial intelligence to automatically adjust instances that allocate storage.
It’s cheap to move data into the cloud but expensive to take it out. That means data volumes and costs tend to grow over time, with charges accruing month by month.
This so-called “data gravity” is core to keeping customers in the fold, said Corey Quinn, chief cloud economist at The Duckbill Group. The more data the customer commits to a provider, the more applications tend to follow and the greater the risk of abandoned instances because “no one wants to delete data,” he said. As a result, “cloud providers will continue to grow even without new customers.”
The costs are attractive — AWS charges a little over a penny per gigabyte for infrequently accessed data – but that creates a temptation to shortcut discipline.
“Studies show that up to 46% of data is just trash,” said Gary Lyng, chief marketing officer at Aparavi Software Corp., a distributed data management software provider. “Get rid of that first before you back it up or move it to the cloud.”
Time-based pricing can also be insidious in the long term. The two cents per month per gigabyte that AWS charges for S3 storage becomes a dollar over a four-year period, making it far more expensive than local disk storage.
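The arithmetic behind that claim is simple compounding. A quick sketch, using $0.023 per gigabyte-month (roughly the two cents cited above; actual rates vary by region and storage class):

```python
# Compound storage cost per gigabyte. The rate is an assumption based
# on published S3 standard pricing; real rates vary by region and tier.
S3_STANDARD_PER_GB_MONTH = 0.023

def storage_cost(gb, months, rate=S3_STANDARD_PER_GB_MONTH):
    """Total spend on keeping `gb` gigabytes stored for `months` months."""
    return gb * months * rate

# One gigabyte kept for four years already exceeds a dollar, far more
# than the one-time cost of equivalent local disk capacity.
print(f"${storage_cost(1, 48):.2f} per GB over 4 years")  # $1.10 per GB over 4 years
```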
And getting it out adds to that cost. A customer that downloads 10 terabytes of data per month can expect to pay about $90 for the privilege. Extracting 150 terabytes costs $7,500. “If you want to leave, it can be massively expensive,” said David Friend, CEO of cloud storage service provider Wasabi Technologies Inc.
Cloud infrastructure customers may know how much storage they have but not how often they use it, Friend said. That can lead to overpaying for high-availability access to data that is rarely touched. “And the more data they have, the more expensive it is for you to leave,” he said.
Data and compute instances are functionally separate in cloud infrastructure, meaning that shutting down a virtual machine doesn’t affect the data. “You pay for everything, whether you use it or not,” Randhawa said.
Apptio has found “tens of thousands” of storage instances in the Azure cloud “that are orphaned, not because operations have bad intentions but because they forget to hit the switch to terminate them or move them to cold storage,” Khvostov said.
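Finding those orphans programmatically is straightforward once an inventory export exists. A minimal sketch over illustrative records (a real version would query the provider’s volume and instance APIs rather than hard-coded lists):

```python
# Find storage volumes that no running instance references --
# "orphaned" in Khvostov's sense. All inventory data is illustrative.
volumes = [
    {"id": "vol-a", "attached_to": "i-01"},
    {"id": "vol-b", "attached_to": None},
    {"id": "vol-c", "attached_to": "i-09"},
    {"id": "vol-d", "attached_to": None},
]
running_instances = {"i-01"}  # i-09 was terminated long ago

def orphaned_volumes(volumes, running):
    """Volumes that are unattached or attached to a dead instance."""
    return [
        v["id"] for v in volumes
        if v["attached_to"] is None or v["attached_to"] not in running
    ]

print(orphaned_volumes(volumes, running_instances))
# ['vol-b', 'vol-c', 'vol-d'] -- candidates for cold storage or deletion
```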
Cloud providers also tie high-performance packages, priced by input/output operations per second, to defined database sizes, meaning that buyers seeking the fastest speeds can inadvertently pay for too much storage. “Overprovisioning can get very expensive on the storage side very fast,” said GCSIT’s Adriaenssens.
As in the case of infrastructure, automated tools can move little-used storage to archive automatically, but customers need to know of their existence and take the time to configure them. In the meantime, cloud providers have little incentive to make it easy for customers to take data out, since it makes switching to other platforms that much more difficult.
Cloud infrastructure providers can deliver bills at nearly any level of granularity a customer desires, but the tradeoff for that specificity is nearly unmanageable complexity. “Cloud providers make all that data available to you but you have to be looking for it,” said DoiT’s Purcell.
Numerous discount plans are available, but it’s generally up to the customer to ask for them.
“The vendors are happy to teach people how to use the cloud as opposed to understanding the different modalities of working in the cloud,” said Aran Khanna, CEO of Archera.ai Inc., a maker of cloud cost management software. Cloud providers say they’re more than happy to help customers look for cost savings, and they provide calculators that weigh various options.
Amazon Spot and Reserved Instances (Microsoft calls them Spot VM and Reserved VM instances) offer customers deep discounts for committing to using capacity over an extended period of time in the case of Reserved Instances, or for buying surplus time temporarily as available in the case of Spot Instances. There are also discount plans for customers that are willing to exercise some flexibility in shifting workloads across regions.
However, DoiT’s Purcell estimates that fewer than 25% of customers take advantage of cost-savings plans such as reserved instances and spot instances. “It’s like going to the grocery store; I have a pocket full of coupons, but I have to make sure they’re the right ones,” he said.
They also tend to be reluctant to accept terms that limit their flexibility. “Where customers leave money on the table is where they buy the least risky thing and don’t negotiate,” said Archera’s Khanna. “It’s easier to buy the most expensive option.”
Fear of overcommitting can deter users from seeking the most aggressive long-term discounts, but the savings from three-year reserved instance plans, for example, can more than compensate for underutilization, experts say.
A prepaid three-year reserved instance on AWS provides a discount of more than 60%, while the one-year version saves a little over 40%. A customer that is confident in needing an instance for two years would be better off buying the three-year option and letting one year go unused than opting for the smaller discount. AWS operates a marketplace for buying and selling unused reserved instances, and Microsoft will buy back unused time up to a limit.
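The break-even reasoning behind that advice is plain arithmetic. A sketch with assumed numbers: a $100-per-month on-demand rate, with 62% and 41% as illustrative stand-ins for the “more than 60%” and “a little over 40%” discounts above:

```python
# Compare two ways to cover a 2-year need: a 3-year reservation with
# one year wasted, versus two consecutive 1-year reservations.
# All rates and discounts below are illustrative assumptions.
ON_DEMAND_MONTHLY = 100.0   # assumed on-demand cost for one instance
THREE_YEAR_DISCOUNT = 0.62  # stand-in for "more than 60%"
ONE_YEAR_DISCOUNT = 0.41    # stand-in for "a little over 40%"

def reserved_total(months, discount, rate=ON_DEMAND_MONTHLY):
    """Total prepaid cost of a reservation at a flat discount."""
    return rate * months * (1 - discount)

option_a = reserved_total(36, THREE_YEAR_DISCOUNT)      # about $1,368 for 3 years
option_b = 2 * reserved_total(12, ONE_YEAR_DISCOUNT)    # about $1,416 for 2 years
print(option_a < option_b)  # True: the bigger commitment still costs less
```

Even with a full third of the commitment thrown away, the deeper discount wins; the margin only grows if the workload ends up running longer than two years.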
“Having a negotiated global rate discount plan yields the first base of a strong discounted pricing portfolio,” said Cognizant’s Lockard. “Combining that with pay-as-you-go-style Reserved Instances allows for credits to be applied for planned future consumption.”
GCSIT’s Adriaenssens advises users to budget for a balance of reserved, on-demand and even spot instances so that the most economical options are available for a given workload. He also recommends creating a Cloud Center of Excellence team that’s responsible for measuring, planning and tuning deployment parameters so that workloads align with a cloud provider’s savings plans.
Taking control of costs
If you’re willing to pay someone else to get your cloud costs in order, there are plenty of businesses ready to take your money. Many say they typically save their customers 30% or more, making their fees easy to justify.
However, many of the savings can be achieved by simply applying more organizational discipline to the task. That starts with making informed decisions about which applications to put in the cloud in the first place. The perception that cloud platforms are cheaper is “baloney,” said CloudCheckr’s Rehl. “Cloud is more expensive but you are intentionally buying speed.”
That means leaving legacy applications where they are is often also a better strategy than moving them to the cloud, experts advise. Workloads built for a data center environment — one in which resources, once deployed, are always available — waste money by design in an opex spending model.
Legacy applications running in lifted-and-shifted virtual machines are “the most expensive way to operate in the cloud,” said Cognizant’s Lockard. “You are paying for the services and storage 24 hours a day, seven days a week, whether you use them or not.”
Legacy applications can also be opaque, with little documentation and no one around who built them in the first place. As a result, said Rehl, “We have seen customers who lift and shift bring over all sorts of things they don’t need. They may import data sets they think are necessary even if they haven’t been touched in a very long time.”
Everyone agrees that the best way to optimize costs is to use applications built for the cloud. These take advantage of dynamic scaling, ephemeral servers, storage tiering and other cloud-native features by design. “Cloud management needs to be automated using the many tools the cloud providers have to offer,” said Chetan Mathur, CEO of cloud migration specialist Next Pathway Inc.
FinOps is a relatively new discipline that addresses the new reality that “a lot of things that finance and procurement would have taken care of is now the domain of engineers,” said Archera’s Khanna. FinOps brings together engineers and financial professionals to understand each other better and to set guidelines that enable more cost-efficient decision-making. A recent survey by the FinOps Foundation found that the discipline is now becoming mainstream across large enterprises in particular and that FinOps team sizes grew 75% in the last 12 months.
Major platform shifts always bring disruption, and the move to the cloud is the biggest one most IT managers will see in their careers. Despite the adjustments the new operating model demands, most appear willing to trade higher costs for business agility and speed to market. As FinOps Foundation Executive Director J.R. Storment recently told ZDNet: “The dirty little secret of cloud spend is that the bill never really goes down.”