Design a failure prediction strategy

Alberto De Natale
9 min read · Sep 13, 2021

In this post, I share the study notes I took while preparing for the “Design a failure prediction strategy” topic of the Azure AZ-400 exam.

Analyze behavior of system with regards to load and failure conditions

The first thing we need is a Service Level Indicator (SLI):

Typically success vs. failure measures (does the service successfully complete an operation some percentage of the time), measures of timing (did we return an answer within a certain threshold of time), measures of throughput (did we process a certain amount of data) or combinations of all of these. For a simple example, we might say an SLI for our service is how often it returned success, indicated via an HTTP 200 code (vs. a 500 or some other code).
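
To make the HTTP 200 example concrete, here is a minimal sketch of a success-rate SLI. The function name `availability_sli` and the sample status codes are my own illustrative assumptions, not part of any Azure API; in practice the codes would come from your request logs or monitoring data.

```python
from typing import Iterable


def availability_sli(status_codes: Iterable[int]) -> float:
    """Return the fraction of requests that completed with HTTP 200.

    Hypothetical helper: anything other than 200 counts as a failure,
    matching the simple example in the text above.
    """
    codes = list(status_codes)
    if not codes:
        # With no traffic there is nothing to measure.
        raise ValueError("no requests observed; the SLI is undefined")
    good = sum(1 for code in codes if code == 200)
    return good / len(codes)


# Made-up sample of ten responses: seven succeed, three fail.
sample_codes = [200, 200, 500, 200, 404, 200, 200, 200, 503, 200]
sli = availability_sli(sample_codes)
print(f"Availability SLI: {sli:.0%}")
```

A real service would probably refine what counts as "good" (for example, treating other 2xx codes as success), but the shape of the calculation stays the same: good events divided by total events.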

The second is a Service Level Objective (SLO):

We’re going to want to decide what level of reliability we expect or desire from it. For example, do we expect over a period of a day that we’ll see a failure rate of 20% from the service?

(…)

That expectation, created in collaboration with the service’s developer, is a Service Level Objective (SLO).
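
As a rough illustration, an SLO can be checked by comparing a measured failure rate against the agreed limit. The sketch below reuses the 20% daily failure rate from the quote; the function name `meets_slo` and the request counts are hypothetical values I chose for the example.

```python
def meets_slo(failed: int, total: int, max_failure_rate: float) -> bool:
    """Return True if the observed failure rate stays within the objective.

    Hypothetical helper: `failed` and `total` are request counts over the
    SLO window (one day in the quoted example).
    """
    if total <= 0:
        raise ValueError("total must be positive to compute a failure rate")
    return failed / total <= max_failure_rate


# Assumed one-day windows with 1000 requests each and a 20% objective.
print(meets_slo(failed=150, total=1000, max_failure_rate=0.20))  # True
print(meets_slo(failed=250, total=1000, max_failure_rate=0.20))  # False
```

The first window has a 15% failure rate, which is inside the objective; the second has 25%, which is not.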
