
Design a failure prediction strategy

9 min read · Sep 13, 2021

In this post, I share the study notes I took while preparing for the “Design a failure prediction strategy” topic of the Azure AZ-400 exam.

Analyze behavior of system with regards to load and failure conditions

The first thing we need is an SLI (Service Level Indicator):

Typically success vs. failure measures (does the service successfully complete an operation some percentage of the time), measures of timing (did we return an answer within a certain threshold of time), measures of throughput (did we process a certain amount of data) or combinations of all of these. For a simple example, we might say an SLI for our service is how often it returned success, indicated via an HTTP 200 code (vs. a 500 or some other code).

Source: Key SRE principles and practices: virtuous cycles - Learn (docs.microsoft.com)
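
To make the success-vs-failure SLI from the quote concrete, here is a minimal Python sketch. It assumes we already have a list of HTTP status codes for a time window; the function name `availability_sli` and the sample data are hypothetical, and only a 200 counts as success, following the simple example in the quote.

```python
from typing import Iterable


def availability_sli(status_codes: Iterable[int]) -> float:
    """Return the fraction of requests that answered with HTTP 200.

    Assumption for this sketch: only 200 counts as success, as in the
    simple example above. A real service may count other codes too.
    """
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests observed in this window")
    successes = sum(1 for code in codes if code == 200)
    return successes / len(codes)


# Hypothetical sample: 10 requests, 8 of them successful.
sample_codes = [200, 200, 500, 200, 503, 200, 200, 200, 200, 200]
sli = availability_sli(sample_codes)
print(f"Availability SLI: {sli:.0%}")
```

The same pattern would apply to timing or throughput SLIs: count the requests that met the threshold and divide by the total.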

The second is an SLO (Service Level Objective):

We’re going to want to decide what level of reliability we expect or desire from it. For example, do we expect over a period of a day that we’ll see a failure rate of 20% from the service?
(…)
That expectation, created in collaboration with the service’s developer, is a Service Level Objective (SLO).
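
As a rough illustration, an SLO can be treated as a threshold that the measured failure rate is compared against. The sketch below is an assumption-laden example, not an official formula: the function name `meets_slo`, the request counts, and the reading of the 20% figure from the quote as a maximum daily failure rate are all hypothetical.

```python
def meets_slo(successes: int, total: int, max_failure_rate: float) -> bool:
    """Check whether the observed failure rate stays within the SLO.

    max_failure_rate is the agreed objective, e.g. 0.20 for "at most
    20% of requests may fail over the window" (an assumption here).
    """
    if total <= 0:
        raise ValueError("total must be a positive number of requests")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    failure_rate = (total - successes) / total
    return failure_rate <= max_failure_rate


# Hypothetical daily counts checked against a 20% failure-rate objective.
print(meets_slo(successes=850, total=1000, max_failure_rate=0.20))  # 15% failures
print(meets_slo(successes=700, total=1000, max_failure_rate=0.20))  # 30% failures
```

Comparing the SLI against the SLO on a regular schedule is what turns the two definitions into something that can flag a service drifting toward failure.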


Written by Alberto De Natale
