Niagara is a massively parallel build system, jokingly referred to as denial-as-a-service.
It was the first system I worked on that needed a big red button.
What is a Big Red Button?
Google recently posted Lessons Learned from Twenty Years of Site Reliability Engineering.
Lesson #4 is to have a big red button.
A “Big Red Button” is a unique but highly practical safety feature: it should kick off a simple, easy-to-trigger action that reverts whatever triggered the undesirable state to (ideally) shut down whatever’s happening.
The Stage
At the time, Git repositories were served by a single node [1].
Traffic included users, Jenkins builds, and now a new massively parallel build system.
As Niagara’s workload ramped up, the source control system could be overwhelmed. This could mean anything from long response times to not responding at all.
We took this seriously, because an overwhelmed source control system blocks users from committing and builds from running. Each incident led to changes to the system:
- Observability improved
- Layers of throttling emerged
- Large runs could be cancelled in batches
Solid improvements, but they still required someone who knew the system well.
Turning the system off needed to be easy.
Simple Shutdowns
By request, I created a button that could be used by anyone at the first sign of trouble.
Our big red button came in two flavors.
- Hold: stop launching new containers, but let anything already running complete
- Cancel: stop everything that is running
The simplest way to hand over control was to wire it up to a Jenkins job. The job would notify whenever it ran, so we could all be on the same page.
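To make the two flavors concrete, here is a minimal sketch in Python. It is not Niagara's actual code: the `BigRedButton` class, the `Mode` names, and the `notify` callback are all hypothetical, and it assumes the scheduler asks the button for permission before each launch. The callback stands in for whatever notification the Jenkins job sent.

```python
import enum
import threading
from typing import Callable, Dict, List


class Mode(enum.Enum):
    RUNNING = "running"  # normal operation
    HOLD = "hold"        # stop launching; running work may finish
    CANCEL = "cancel"    # stop launching and cancel running work


class BigRedButton:
    """Hypothetical in-memory model of a two-flavor kill switch."""

    def __init__(self, notify: Callable[[str], None] = print) -> None:
        self._lock = threading.Lock()
        self._mode = Mode.RUNNING
        # container id -> callable that cancels that container
        self._running: Dict[str, Callable[[], None]] = {}
        self._notify = notify

    def try_launch(self, container_id: str, cancel: Callable[[], None]) -> bool:
        """Register a container; refuse if the button has been pressed."""
        with self._lock:
            if self._mode is not Mode.RUNNING:
                return False
            self._running[container_id] = cancel
            return True

    def finished(self, container_id: str) -> None:
        """Forget a container that completed on its own."""
        with self._lock:
            self._running.pop(container_id, None)

    def press(self, mode: Mode, who: str) -> List[str]:
        """Switch mode and return the ids of any cancelled containers."""
        with self._lock:
            self._mode = mode
            to_cancel: Dict[str, Callable[[], None]] = {}
            if mode is Mode.CANCEL:
                to_cancel = dict(self._running)
                self._running.clear()
        # Run cancel callbacks outside the lock so they cannot deadlock us.
        for cancel in to_cancel.values():
            cancel()
        # Announce every press so everyone stays on the same page.
        self._notify(f"big red button set to {mode.value} by {who}")
        return sorted(to_cancel)
```

In this sketch, pressing `Mode.HOLD` only flips a flag, which is why it is safe to reach for at the first sign of trouble. `Mode.CANCEL` is the heavier option, and pressing `Mode.RUNNING` resumes launches.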
Internally we referred to the big red button as a panic button or Andon cord.
1. This setup has since evolved to multi-node.