Design a failure prediction strategy
In this post, I share the study notes I took while preparing for the “Design a failure prediction strategy” topic of the Azure AZ-400 exam.
Analyze behavior of system with regards to load and failure conditions
The first thing we need is an SLI (Service Level Indicator):
SLIs are typically success vs. failure measures (does the service successfully complete an operation some percentage of the time), measures of timing (did we return an answer within a certain threshold of time), measures of throughput (did we process a certain amount of data) or combinations of all of these. For a simple example, we might say an SLI for our service is how often it returned success, indicated via an HTTP 200 code (vs. a 500 or some other code).
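To make the HTTP 200 example concrete, here is a minimal Python sketch of a success-ratio SLI. The function name `availability_sli` and the sample status codes are my own assumptions for illustration; they do not come from any Azure API or from real traffic.

```python
from typing import Iterable


def availability_sli(status_codes: Iterable[int]) -> float:
    """Return the fraction of requests that answered with HTTP 200.

    Hypothetical helper: only 200 counts as success, mirroring the
    simple example in the notes. A real service may count other codes.
    """
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests observed; the SLI is undefined")
    successes = sum(1 for code in codes if code == 200)
    return successes / len(codes)


# Made-up sample: 8 successful requests and 2 failures.
sample = [200, 200, 500, 200, 200, 200, 503, 200, 200, 200]
print(f"{availability_sli(sample):.0%}")  # prints 80%
```

A timing or throughput SLI would follow the same shape: count the requests that met the threshold and divide by the total.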
The second is an SLO (Service Level Objective):
We’re going to want to decide what level of reliability we expect or desire from it. For example, do we expect over a period of a day that we’ll see a failure rate of 20% from the service?
(…)
That expectation, created in collaboration with the service’s developer, is a Service Level Objective (SLO).
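Checking an SLO then comes down to comparing the measured indicator with the agreed expectation. The sketch below is only an illustration: `meets_slo` is a hypothetical helper, the request counts are invented, and the 20% threshold simply reuses the figure from the example question above rather than a recommended objective.

```python
def meets_slo(failed: int, total: int, max_failure_rate: float) -> bool:
    """Return True if the observed failure rate is within the objective.

    Hypothetical helper: `failed` and `total` are request counts for the
    evaluation window (for example, one day).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and total")
    return failed / total <= max_failure_rate


# Invented one-day totals, checked against an assumed 20% objective.
print(meets_slo(failed=150, total=1000, max_failure_rate=0.20))  # prints True
print(meets_slo(failed=250, total=1000, max_failure_rate=0.20))  # prints False
```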