Reliability Engineering - Chaos, Performance & Automation

Scott Griffiths / March 15, 2021

Question#

How do we define what constitutes a reliable system?

Can we use reliability engineering ideals, combined with chaos, performance and automation execution strategies, to build what we believe defines a reliable system?

We take a look at how we can get easy reliability wins by applying different DevOps and SRE methodologies in combination with chaos engineering (CE), performance engineering (PE) and automation strategies.

An Intro to the World of S.R.E#

Reliability Engineering (RE) is more commonly known as Site Reliability Engineering (SRE) within the likes of Google, Facebook, Netflix and LinkedIn (and others).

  • It's about balancing risk against its effect on teams and business velocity, and providing the right automation strategies to give the business the observability, confidence and insights it needs to make critical decisions.
  • Within SRE this is achieved (for the most part) through observability metrics (SLIs), internal and external promises (SLOs and SLAs) and team- and business-based error budgets.
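
To make the SLI / SLO / error budget relationship concrete, here is a minimal sketch. It assumes an availability SLI based on request counts; the names `SloWindow`, `availability_sli` and `error_budget_remaining`, as well as the 99.9% target and the traffic numbers, are illustrative assumptions rather than anything prescribed by SRE practice.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one SLO window (for example, 30 days)."""
    total_requests: int
    failed_requests: int


def availability_sli(window: SloWindow) -> float:
    """SLI: the fraction of requests that succeeded in the window."""
    if window.total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return 1 - window.failed_requests / window.total_requests


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1 - slo_target) * window.total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% target (or no traffic) leaves no budget to spend
    return 1 - window.failed_requests / allowed_failures


# Hypothetical numbers: one million requests, 400 failures, 99.9% SLO.
window = SloWindow(total_requests=1_000_000, failed_requests=400)
print(f"SLI: {availability_sli(window):.4%}")
print(f"Budget left: {error_budget_remaining(window, slo_target=0.999):.0%}")
```

A team could use the remaining budget as the signal for how much risk to take: plenty of budget left favours shipping, an overspent budget favours reliability work.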

The Engineering Efficiency / Effectiveness Verticals (Alignment)#

Efficiency is doing things right, effectiveness is doing the right things, and adaptability is responding quickly to changes in business circumstances.

The idea is to allow engineers to focus on what's important by automating or eliminating the items that slow down their ability to develop product.

This incorporates people, process and tech, with the goal of reducing barriers, providing better value and increasing velocity, while still promoting a culture of empathy, accountability and transparency.

Using DevOps Practices and Methodologies#

These are measured, enforced and verified by reliability, chaos and performance engineering principles.

DevOps emerged as a culture and a set of practices that aim to reduce the gap between software development and operations.

RE defines the overall behavior of the system; how this is implemented is left up to the engineer.

The DevOps / Reliability Relationship#

Performance Engineering#

By adopting a cloud-first performance automation approach we can benefit from a shorter feedback cycle (velocity increase) and from bottlenecks and bugs being caught early (reliability increase).

To get these benefits we need a multi-pronged approach that uses components and methodologies from both performance testing and performance engineering.
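
One way to picture an automated performance check that runs early in the feedback cycle is a latency gate. The sketch below is an assumption-laden illustration, not a prescribed tool: `percentile` and `within_latency_budget` are hypothetical helpers, the nearest-rank method is just one of several percentile definitions, and the 95th percentile and the budget value are placeholder choices.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def within_latency_budget(samples_ms: list[float], budget_ms: float, pct: float = 95.0) -> bool:
    """Return True when the chosen latency percentile stays within the budget."""
    return percentile(samples_ms, pct) <= budget_ms


# Hypothetical response times (in milliseconds) collected by an automated run.
latencies_ms = [120.0, 135.0, 128.0, 410.0, 131.0, 125.0, 140.0, 133.0]
if not within_latency_budget(latencies_ms, budget_ms=300.0):
    print("p95 latency is over budget, flag the change before it ships")
```

Running a check like this on every change, rather than only in a late test phase, is what shortens the feedback cycle described above.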

Traditional Performance Testing, Done in the 'Test' Phase#

Performance Engineering, Utilising a Shift-Left and 'Measure Everything' Approach#

Releases#

Release frequently with small changes, supporting these releases with the right amount of automated checks and other automation configuration to provide some level of understanding of environment and application behaviour.

Forget self-managed teams. Aim for a self-managed business, with an approach to development that enables devs to deliver more efficiently and effectively, in an environment that fosters ownership and accountability.
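
As a rough illustration of backing a small release with automated checks, the sketch below runs a set of named checks and only allows the release when all of them pass. Everything here is hypothetical: `run_release_checks`, `can_release` and the check names stand in for whatever checks a team actually automates.

```python
from typing import Callable, NamedTuple


class CheckResult(NamedTuple):
    name: str
    passed: bool


def run_release_checks(checks: dict[str, Callable[[], bool]]) -> list[CheckResult]:
    """Run every check; an exception counts as a failure so one broken check cannot hide the rest."""
    results = []
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception:
            passed = False
        results.append(CheckResult(name, passed))
    return results


def can_release(results: list[CheckResult]) -> bool:
    """A release goes ahead only when every automated check passed."""
    return all(result.passed for result in results)


# Placeholder checks; real ones would call test runners, probes or monitoring.
checks = {
    "unit_tests": lambda: True,
    "smoke_test": lambda: True,
    "latency_budget": lambda: False,
}
results = run_release_checks(checks)
for result in results:
    print(f"{result.name}: {'pass' if result.passed else 'fail'}")
print("release allowed" if can_release(results) else "release blocked")
```

Reporting every result, instead of stopping at the first failure, gives the team the fuller picture of environment and application behaviour that the checks exist to provide.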

Evangelism and Advocacy#

Observability, Reporting and Metrics#

References#

LinkedIn SRE School

Google's SRE Books

Netflix SRE Practice

Chaos Principles
