Book Summary: Increment Reliability
Issue 16 - February 2021
Building a reliable system starts with the right team culture. Team culture is the collective behavior of team members, and it reflects the team's values. To sustain a culture of reliability, a team needs three things:
A collective, centralized knowledge base
Mature tools, technologies, and processes
A psychologically safe environment
The book then goes into detail on pseudo-tested methods: methods that the test suite executes but does not meaningfully verify, because the test cases cover too little variety to notice when the method's behavior changes. Pseudo-tested methods are a reliability risk for teams. One way to find them is a tool called Descartes, which has been tested on a public repository, Apache Commons Collections.
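To make the idea concrete, here is a minimal Python sketch of what "pseudo-tested" means. It is not Descartes itself (Descartes targets Java projects) and it does not reproduce its API. The function names, the discount example, and the two tests are all hypothetical. The sketch replaces a method body with a trivial one and checks whether a test still passes.

```python
def apply_discount(price, percent):
    """Original implementation (hypothetical example method)."""
    return price - price * percent / 100


def stripped_apply_discount(price, percent):
    """Same method with its body replaced by a constant return value."""
    return 0.0


def weak_test(fn):
    # Executes the method but only checks the result type, not its value.
    return isinstance(fn(200.0, 10), float)


def strong_test(fn):
    # Checks the actual computed value.
    return fn(200.0, 10) == 180.0


def is_pseudo_tested(original, stripped, test):
    """A method is pseudo-tested by `test` if the test passes on the
    original AND still passes after the body has been stripped."""
    return test(original) and test(stripped)


print(is_pseudo_tested(apply_discount, stripped_apply_discount, weak_test))    # True
print(is_pseudo_tested(apply_discount, stripped_apply_discount, strong_test))  # False
```

The weak test covers the method in the coverage report yet cannot tell the real implementation from an empty one, which is the kind of gap the summary describes.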
Reliability is a responsibility shared among engineers, not only managers, so trust is required to achieve a reliable system at scale. Peer review and trust in the reliability of the tools are among the things to build. Once that is achieved, the team releases more often and the system fails less. There are four metrics each team should calculate on a regular basis:
Lead time
Deployment frequency
Change failure rate
Recovery time
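As a rough illustration, the four metrics can be computed from a log of deployments. The sketch below assumes a hypothetical `Deployment` record and made-up sample data. The field names, the use of a simple mean, and the reporting units (hours, deploys per day) are assumptions for the example, not definitions taken from the book.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did the change cause a failure?
    restored_at: Optional[datetime] = None  # when service was restored


def delivery_metrics(deployments, period_days):
    """Return the four metrics for a non-empty list of deployments."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    n = len(deployments)
    lead_secs = [(d.deployed_at - d.committed_at).total_seconds() for d in deployments]
    failures = [d for d in deployments if d.failed]
    recovery_secs = [
        (d.restored_at - d.deployed_at).total_seconds()
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "lead_time_hours": sum(lead_secs) / n / 3600,
        "deploys_per_day": n / period_days,
        "change_failure_rate": len(failures) / n,
        # None when no failure has been recovered from in the period.
        "recovery_time_hours": (
            sum(recovery_secs) / len(recovery_secs) / 3600 if recovery_secs else None
        ),
    }


# Made-up sample data covering a four-day period.
sample = [
    Deployment(datetime(2021, 2, 1, 9), datetime(2021, 2, 1, 13)),
    Deployment(datetime(2021, 2, 2, 10), datetime(2021, 2, 2, 12),
               failed=True, restored_at=datetime(2021, 2, 2, 13)),
    Deployment(datetime(2021, 2, 3, 8), datetime(2021, 2, 3, 14)),
    Deployment(datetime(2021, 2, 4, 9), datetime(2021, 2, 4, 13)),
]

print(delivery_metrics(sample, period_days=4))
```

For this sample, the mean lead time is 4 hours, the team deploys once per day, one in four changes fails, and recovery takes 1 hour.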
What separates good teams from great teams is that great teams have great numbers on these four metrics. It is impossible to eliminate risk in a large system; what a team should do instead is build mechanisms for handling failure. The difference between robustness and resilience is that robustness is measured against known cases, while resilience is measured against the unexpected. One way to improve a system's resilience is chaos engineering: introducing a planned, calculated amount of failure into the system to check for bottlenecks.
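A very small-scale sketch of the chaos engineering idea follows. It wraps a call so that it fails at a controlled rate, then checks that the caller copes. Every name here (`with_chaos`, `call_with_fallback`, `fetch_price`) is hypothetical. Real chaos experiments usually inject failures at the infrastructure level rather than in a function wrapper.

```python
import random


class InjectedFailure(Exception):
    """Raised deliberately by the chaos wrapper."""


def with_chaos(fn, failure_rate, rng):
    """Wrap `fn` so that a planned fraction of calls fails."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("chaos experiment: injected failure")
        return fn(*args, **kwargs)
    return wrapper


def call_with_fallback(fn, fallback, retries=3):
    """Failure-handling mechanism under test: retry, then degrade gracefully."""
    for _ in range(retries):
        try:
            return fn()
        except InjectedFailure:
            continue
    return fallback


def fetch_price():
    # Stand-in for a call to a dependency.
    return 42


rng = random.Random(0)  # seeded so the experiment is repeatable
flaky_fetch = with_chaos(fetch_price, failure_rate=0.3, rng=rng)
results = [call_with_fallback(flaky_fetch, fallback=-1) for _ in range(100)]
print("calls that fell back:", results.count(-1))
```

Raising `failure_rate` step by step shows how much injected failure the retry-and-fallback path absorbs before callers start seeing the degraded value.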