Improve application reliability with effective SLOs

Arun Chandapillai
3 min read · Jun 18, 2024

At AWS, we view reliability as the capability of a service to withstand major disruptions within acceptable degradation parameters and to recover within an acceptable timeframe. Service reliability goes beyond traditional disciplines such as availability and performance. Components of a system or application will eventually fail. As our CTO Werner Vogels says, “everything fails, all the time.” The question is whether your system or application can sustain failures without impacting end users, and how resilient it is when failures occur. Our customers constantly ask us to help them reduce the blast radius of incidents and meet the reliability, performance, and scalability expectations of their businesses.

In this post, you will learn reliability best practices that set your teams up for success: measuring performance objectively and reporting reliability accurately, so that you can respond quickly when incidents happen. You will also learn how to create, monitor, and alert on Service Level Objectives (SLOs) natively in AWS using any Amazon CloudWatch metric with Amazon CloudWatch Application Signals.

Service Level Management (SLM)

Service Level Management (SLM) provides a framework for defining, negotiating, and managing the IT services and service levels delivered to customers. The framework covers several critical elements, such as service availability, quality, data security, and throughput, and it aims to protect the objectives of both the customer and you, the service provider. Now, let’s get familiar with the terminology that describes the assurances you make to your customers and the trackable measurements that tell you how healthy your services are.

  • SLI (Service Level Indicator) is a carefully defined quantitative measure of some aspect of the level of service provided.
  • SLO (Service Level Objective) is a target value or range of values for a service level measured by an SLI over a period of time.
  • SLA (Service Level Agreement) is an agreement with your customer that outlines the level of service you promise to deliver. An SLA also details the course of action when requirements are not met, such as additional support or pricing discounts.
  • Error Budget is the amount of unreliability that an SLO permits. It is the difference between 100% reliability and the SLO target value; for example, a 99.9% SLO target leaves an error budget of 0.1%. Simply put, an error budget is an SLO for meeting other SLOs!
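
To make the relationship between these terms concrete, here is a minimal Python sketch of the arithmetic. It assumes a request-based SLI (good events divided by total events) and a fixed evaluation window; the `Slo` class, the function names, and the sample numbers are hypothetical illustrations, not part of any AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO: a target fraction of good events over a window."""
    target: float      # e.g. 0.999 means "99.9% of requests succeed"
    window_days: int   # evaluation window, e.g. 30 days

    def __post_init__(self) -> None:
        # A target of exactly 100% would leave no error budget at all.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")
        if self.window_days <= 0:
            raise ValueError("window_days must be positive")


def sli(good_events: int, total_events: int) -> float:
    """SLI as the fraction of good events (assumed 1.0 when there is no traffic)."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget(slo: Slo) -> float:
    """Difference between 100% reliability and the SLO target value."""
    return 1.0 - slo.target


def allowed_bad_minutes(slo: Slo) -> float:
    """Error budget expressed as minutes of full outage within the window."""
    return error_budget(slo) * slo.window_days * 24 * 60


def budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is missed)."""
    consumed = (1.0 - sli(good_events, total_events)) / error_budget(slo)
    return 1.0 - consumed


if __name__ == "__main__":
    slo = Slo(target=0.999, window_days=30)
    print(f"error budget: {error_budget(slo):.4%}")
    print(f"allowed outage minutes: {allowed_bad_minutes(slo):.1f}")
    print(f"budget remaining: {budget_remaining(slo, 999_500, 1_000_000):.0%}")
```

Under these assumptions, a 99.9% target over a 30-day window allows roughly 43.2 minutes of full outage, and a service that has failed 500 of 1,000,000 requests has spent about half of its budget.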

The following diagram illustrates how SLAs, SLOs, and SLIs interact. The customer (service consumer) is external to the team that owns the service. Within the service team, there are sales functions such as business owners and customer success engineers, the product owner (who owns the roadmap), and the engineering team, which creates and operates the service. The engineering team owns the SLIs that measure the service and drive the SLOs. Product and engineering typically own the SLOs jointly, and the SLOs in turn inform the SLAs. To close the loop: as a customer, you have visibility into the SLAs and can see how the service is performing. However, SLOs and SLIs are usually not shared outside the service team boundary.

Read the full blog @ https://aws.amazon.com/blogs/mt/improve-application-reliability-with-effective-slos/

Written by Arun Chandapillai

Senior Engineering Architect who is a diversity and inclusion champion. He is an automotive enthusiast, an avid speaker, and a philanthropist.
