Gamification of Kubernetes Chaos Testing

Paul Klinker
6 min read · Aug 19, 2020
[Image: Spaceship shooting aliens.]

Introduction

Ever since I read about Netflix’s Chaos Monkey, I’ve been intrigued by the idea of chaos testing Kubernetes-based systems. The best way I’ve seen chaos testing explained is to imagine a monkey running through a data center, pulling out wires and pushing buttons — essentially causing chaos.

Chaos testing seeks to emulate this mayhem in a running system to see whether it can recover and, if it cannot, to figure out why and refactor the system to handle the errors. Without chaos testing, it’s difficult to determine whether your system can recover from random errors and outages.
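To make the idea concrete, here is a minimal sketch of one chaos round: kill a randomly chosen pod, let the system react, and check whether it returned to its desired state. Everything in it is an assumption for illustration. `ToyCluster`, `kill`, `reconcile`, and `chaos_round` are hypothetical names, and the cluster is an in-memory stand-in rather than a real Kubernetes API.

```python
import random


class ToyCluster:
    """Hypothetical in-memory stand-in for a cluster; not a real Kubernetes API."""

    def __init__(self, desired_replicas):
        self.desired_replicas = desired_replicas
        self.pods = [f"pod-{i}" for i in range(desired_replicas)]
        self._next_id = desired_replicas

    def kill(self, pod):
        # Simulates the "monkey" pulling out a wire: the pod disappears.
        self.pods.remove(pod)

    def reconcile(self):
        # Mimics a controller restoring the desired replica count.
        while len(self.pods) < self.desired_replicas:
            self.pods.append(f"pod-{self._next_id}")
            self._next_id += 1


def chaos_round(cluster, rng):
    """Kill one randomly chosen pod, let the cluster reconcile, report recovery."""
    victim = rng.choice(cluster.pods)
    cluster.kill(victim)
    cluster.reconcile()
    recovered = len(cluster.pods) == cluster.desired_replicas
    return victim, recovered


if __name__ == "__main__":
    victim, recovered = chaos_round(ToyCluster(3), random.Random())
    print(f"killed {victim}; recovered: {recovered}")
```

In a real system the interesting cases are the ones where `recovered` comes back false, because those point at the software that needs refactoring.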

I work at ManTech, a company that provides secure mission-critical solutions for the U.S. Government. Uptime, fault tolerance, and resiliency are crucial for many of our customers’ systems. Moving them to modern Cloud and Kubernetes-based architectures has made it much easier to build reliable infrastructure, with built-in capabilities for load balancing and auto-scaling.

Many of these systems, however, were migrated from infrastructures with fixed resources and were built before the widespread availability and adoption of Cloud and Kubernetes. The software sitting on top of this reliable infrastructure needs to be designed and built, or refactored, to handle infrastructure and application failure. Verifying this is where…

Paul Klinker

Paul is a Principal Engineer at ManTech specializing in DevOps and enterprise software development.