In an exclusive interview with CXOToday, Nandini Ramani, VP of Monitoring and Observability at AWS, shares insights on observability and the role it plays in major digital transformations across industries.
- What is Observability and how does it help mitigate operational risks?
More and more customers are moving to the cloud to drive greater agility, increase flexibility, and take advantage of new technologies like serverless and Internet of Things (IoT) devices. While the cloud offers customers significant benefits, the sheer size and complexity of modern infrastructure and application architectures can make it challenging for customers to know exactly what’s going on inside their applications. Observability describes how well a customer can understand what happens in a system, by collecting and analysing metrics, logs, and traces, or using tools like profilers.
At AWS, we believe data should drive decision making. The data that observability solutions provide across an application is essential to helping customers make decisions that mitigate operational risks. The process of collecting data, monitoring performance, analysing patterns, and acting on the resulting insights allows customers to better understand their applications’ health and performance, improve developer productivity, and increase operational effectiveness and efficiency. A good observability solution makes it easier for customers to accomplish these tasks, so they can answer a wide range of operational and business questions at a moment’s notice, spot problems before they arise, respond quickly when they do happen, and reduce the time it takes to resolve issues.
- How crucial is Observability for any major digital transformation initiative?
Customers regularly tell us they want to use AWS to drive digital transformation across their business. For many businesses, that means moving from the legacy approach of hosting everything on-premises, managing their own servers, and building custom tools to a more modern, cloud-based approach that takes advantage of technologies like containers, serverless, and fully managed services from AWS.
My biggest piece of advice for any customer that wants to truly transform their business and take full advantage of the cloud is to be proactive and thoughtful about their observability practices from the start of their journey. Unfortunately, some customers only get serious about observability after they have faced a significant operational event that has caused issues for their end users. Observability is not just about managing known issues and troubleshooting when things go wrong; it also allows customers to proactively detect, investigate, and remediate issues by alerting them to operational problems before they affect end users. With good observability practices in place, customers can reduce the impact of these events on their business and further optimize their application’s performance to deliver the best experience to their end users. Good observability also gives organizations more insight into resource utilization and cloud spend, which can help drive further cost savings and optimizations to help them get the most out of the cloud.
- What is Amazon CloudWatch? What are the strengths and benefits?
Amazon CloudWatch is our native monitoring and observability solution that provides data and actionable insights for AWS, hybrid, and on-premises applications and infrastructure. Many customers choose CloudWatch because of its native integration with over 100 AWS services, robust feature set, and proven ability to operate at scale. In fact, Amazon.com uses CloudWatch to support its observability needs, so customers know it can handle some of today’s most demanding applications. We continue to enhance CloudWatch with new features and capabilities to satisfy a diverse range of customers. Recent updates have expanded our Application Performance Monitoring (APM) capabilities: CloudWatch Real User Monitoring (RUM) lets customers collect and view client-side data about their web applications, and CloudWatch Evidently makes it easier for developers to introduce experiments and feature management in their application code.
While Amazon CloudWatch is the tool of choice for many major AWS customers like Tech Mahindra, Arctic Wolf, and Alexa Labs, we also offer a wide range of other tools to support a diverse range of customers. These include AWS-native solutions ranging from AWS X-Ray for distributed tracing to Amazon DevOps Guru, which uses machine learning models informed by years of Amazon.com and AWS operational excellence to identify anomalous application behaviour and surface critical issues before they cause problems for customers. We also have a broad range of open source solutions, including AWS Distro for OpenTelemetry, Amazon Managed Service for Prometheus, and Amazon Managed Grafana, so customers deeply invested in open source can use the tools they’re familiar with while managed services take on the undifferentiated heavy lifting of securing and operating them at scale.
- Can you tell us about the challenges you have noticed for companies adopting Observability in their technology operations?
One of the most fundamental challenges when implementing good observability practices today is that customers do not take a proactive approach to observability. If a customer treats observability as an afterthought, they may not design their system in such a way that they can get the right information when they need it. This can lead to them constantly adding new tools in hopes of “fixing” different issues, but it often creates more complexity. I often recommend customers be deliberate about the tools they choose to simplify their operations. By using a few strongly interconnected tools, customers can greatly simplify their experience and focus on the things that matter most to their business.
Another issue we often see is that customers do not plan for what happens when their applications scale. As applications grow in size and complexity, the sheer volume of data they need to deal with can quickly get out of hand. Many issues occur when an application pushes the boundaries or limits of its performance, and the system wasn’t designed to adequately handle those issues. By preparing for scale in advance, customers can better understand how systems will perform during critical moments, which can lead to smoother operations and better outcomes.
- How would you differentiate Observability from Monitoring?
The concept of Observability describes how well you can understand what is happening in a system, often (but not only) by instrumenting it to collect metrics, logs, or traces. By gathering this information, observability tools enable customers to efficiently detect, investigate, and remediate problems.
People will often see metrics, logs, and traces described as the “three pillars of observability”; however, there are many other tools that can help customers meet their observability needs, such as profilers or AI-powered operational tools like Amazon CodeGuru. Though the term “monitoring” is sometimes defined as different from (or as a precursor to) observability, monitoring is simply one of several types of tools and activities that make a system observable.
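To make the three signal types concrete, here is a minimal sketch in Python using only the standard library. It is not how CloudWatch, X-Ray, or any AWS SDK works; the `observe` helper and the in-memory `METRICS`, `LOGS`, and `SPANS` lists are hypothetical names introduced purely for illustration. The idea it shows is that a single instrumented operation can emit a metric (a number), a log (an event record), and a trace span (a timed unit of work tied to a shared trace ID).

```python
import json
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory sinks; a real system would ship these to a backend.
METRICS = []  # numeric measurements
LOGS = []     # structured event records
SPANS = []    # timed units of work that share a trace id


@contextmanager
def observe(operation, trace_id=None):
    """Record a span, a latency metric, and a log line for one operation."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()
    status = "ok"
    try:
        yield trace_id
    except Exception:
        status = "error"
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        SPANS.append({"trace_id": trace_id, "name": operation, "duration_ms": elapsed_ms})
        METRICS.append({"name": f"{operation}.latency_ms", "value": elapsed_ms})
        LOGS.append(json.dumps({"trace_id": trace_id, "op": operation, "status": status}))


def handle_request():
    """Simulate a request made of an outer operation and one nested step."""
    with observe("handle_request") as trace_id:
        # Passing the trace id down links the nested span to the same request.
        with observe("load_profile", trace_id):
            time.sleep(0.01)  # stand-in for real work
    return trace_id
```

Calling `handle_request()` leaves two spans, two metrics, and two log lines behind, all carrying the same trace ID, which is what lets someone investigating an issue move from a slow metric to the matching log lines and spans.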
- In an evolving tech landscape, how do you envision the future of Observability and Monitoring?
Technology is constantly changing, which means our observability tools need to adapt and evolve with it. In the future, observability will be something that is intrinsically built into a system. While we’ve already started to see the first wave of these solutions, we expect to see even more machine learning and automation woven into the fabric of observability tools so systems can respond autonomously to new issues and self-heal as needed.
While this is the north star for much of the work we’re doing around monitoring and observability at AWS and across the industry, we don’t expect to get there overnight. We see this as a four-part journey that starts with transitioning from reactive to proactive observability. As observability becomes more embedded in every system, the next phase is focused on transitioning from tools that just provide insights about data to tools that proactively point customers in the direction of potential issues. After that, we see a larger focus on tools that can easily detect and remediate issues automatically using machine learning before they become a problem for customers. The final phase of this journey is systems that are smart enough to detect the gradual degradation of a system over time and take measures to self-heal with zero intervention.
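As a rough illustration of what detecting gradual degradation could mean in practice, the sketch below compares the average of the most recent latency samples against an earlier baseline. It deliberately uses a simple fixed-ratio rule rather than machine learning, and it does not represent any AWS service; the function name, the `window` size, and the `tolerance` ratio are all assumptions chosen for the example.

```python
from statistics import mean


def detect_gradual_degradation(latencies_ms, window=5, tolerance=1.2):
    """Return True when the latest window is markedly slower than the first.

    latencies_ms: latency samples in chronological order.
    window: number of samples in the baseline and in the recent window.
    tolerance: how much slower (as a ratio) the recent window may be.
    """
    if len(latencies_ms) < 2 * window:
        return False  # not enough history to compare two windows
    baseline = mean(latencies_ms[:window])
    recent = mean(latencies_ms[-window:])
    return recent > baseline * tolerance


# Illustrative sample data, not real measurements.
steady = [100, 102, 99, 101, 100, 101, 100, 98, 102, 100]
drifting = [100, 102, 99, 101, 100, 110, 118, 125, 131, 140]

print(detect_gradual_degradation(steady))    # False
print(detect_gradual_degradation(drifting))  # True
```

No single sample in the drifting series looks like a sudden outage, yet the recent average sits well above the baseline. That kind of slow drift is what a self-healing system would need to notice before acting on it.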
Ultimately, this is a long journey for customers, but as we move toward building smarter, more capable observability tools, we can greatly optimize the performance of future applications and minimize downtime for end users.