3 Themes from RubyConf 2017: Failure
In November I attended my first RubyConf and spent a whirlwind three days trying to learn all the things. Now that the dust has settled, here is the second of three themes that have crystallized, themes I know I will be mulling over for weeks to come.
TL;DR — Speakers and Talks
- Jess Rudder: The Good Bad Bug
- Heidi Waterhouse: Y2K and Other Disappointing Disasters: How To Create Fizzle
- Chad Fowler: Keynote: Growing Old
More specifically, what can be done about failure, given that it is as inevitable in software as strong opinions about text editors? More than one talk advised this: stare into the darkness of failure and own it. OK, not in those exact words.
Jess Rudder’s talk emphasized the need to treat failure as data. She stressed that in fact there is much more data in failure than there is in success, and as such failure shouldn’t be seen as a detriment, but as an opportunity. (She also gets bonus points for having a presentation made up entirely of hand-illustrated slides.)
Layer onto this Chad Fowler and Heidi Waterhouse’s parallel points that the foolish avoid failure while the wise plan for and embrace it. They both alluded to Netflix’s Chaos Monkey, a bot that randomly toys with systems and tries to break them. Waterhouse made a powerful point about this kind of purposeful chaos: If someone notices your system breaking, it is not resilient enough. Strong systems can respond to failures without the need for human intervention. And a bot that systematizes failure allows a human to get involved and to gather data when intervention is necessary.
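To make the idea concrete, here is a minimal Ruby sketch of chaos-style failure injection. It is a toy, not how Chaos Monkey is actually implemented: the Worker, Supervisor, and ChaosBot classes and the run_experiment helper are all hypothetical names invented for illustration. The bot kills a random worker each round, the supervisor restarts it without a human stepping in, and every failure is recorded as data.

```ruby
# Toy chaos experiment. All class and method names are hypothetical,
# made up for this sketch; none come from Chaos Monkey itself.

class Worker
  attr_reader :name

  def initialize(name)
    @name = name
    @alive = true
  end

  def alive?
    @alive
  end

  def kill!
    @alive = false
  end
end

class Supervisor
  attr_reader :workers, :failure_log

  def initialize(names)
    @workers = names.map { |n| Worker.new(n) }
    @failure_log = []
  end

  # Replace every dead worker with a fresh one and log the failure as data.
  def heal!(round)
    @workers.map! do |w|
      next w if w.alive?

      @failure_log << { round: round, worker: w.name }
      Worker.new(w.name)
    end
  end
end

class ChaosBot
  def initialize(rng: Random.new)
    @rng = rng
  end

  # Kill one randomly chosen worker and return it.
  def strike(workers)
    victim = workers.sample(random: @rng)
    victim.kill!
    victim
  end
end

# Run a fixed number of rounds: the bot breaks something, the supervisor heals it.
def run_experiment(rounds:, seed: 42)
  supervisor = Supervisor.new(%w[web queue cache])
  bot = ChaosBot.new(rng: Random.new(seed))
  rounds.times do |round|
    bot.strike(supervisor.workers)
    supervisor.heal!(round)
  end
  supervisor
end

supervisor = run_experiment(rounds: 5)
puts "all healthy: #{supervisor.workers.all?(&:alive?)}"
puts "failures recorded: #{supervisor.failure_log.size}"
```

The seeded random number generator is a deliberate choice in this sketch: it keeps the "random" failures repeatable, so a surprising run can be replayed and studied rather than just survived.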
Fowler said the same thing in slightly different words. He said that he used to be proud of long streaks without system downtime. Now he understands that a lack of failure is an indicator of stale and brittle systems, and he advocates killing off parts of the system all the time, to prove that it can be done.
All these talks illustrated the paradox that the only way to ensure your system is failure-proof is to ensure that its components are failing constantly. In this way we can constantly gather data about the system. That being said, setting up controlled failure is not a trivial task. It’s on my to-do list to find more examples like Chaos Monkey and learn about their implementation. I know the systems I work on could certainly benefit from failing more often.
This is one part of a three-part series. Read about the other themes: What If Ruby?, with the final theme still to come!