Various components in the system need to report their status. This
standardizes on a set of conventions for reporting it. Each
Status has an integer value that can
be compared; higher values represent higher levels of functionality.
In addition, each Status has a signal,
which is either GREEN, YELLOW, or RED.
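As a rough illustration of the two properties described above, here is a minimal Python sketch of a status type that carries a comparable integer and a signal. The attribute names `level` and `signal` and the specific integers are assumptions made for this example; the text only requires that higher values mean more functionality and that the signal is GREEN, YELLOW, or RED.

```python
from enum import Enum


class Status(Enum):
    # (level, signal). The integers are illustrative assumptions;
    # only their relative order matters for comparison.
    DOWN = (0, "RED")
    DEGRADED = (1, "YELLOW")
    UP = (2, "GREEN")

    def __init__(self, level, signal):
        self.level = level    # higher means more functionality
        self.signal = signal  # "GREEN", "YELLOW", or "RED"

    def __lt__(self, other):
        # Defining __lt__ is enough for <, >, min(), max(), and sorted().
        if not isinstance(other, Status):
            return NotImplemented
        return self.level < other.level
```

With this in place, `min(statuses)` picks out the least functional status in a collection, which is a convenient building block for rolling several reports into one.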
The general notion is that the whole system (or a subsystem) should
only report GREEN status (UP) if everything is working as designed.
When subsystems start to go down, or the current system stops working,
status should drop to either YELLOW (DEGRADED) or RED (DOWN), depending
on whether service is still available in some form. For example, if
a cache subsystem goes down, we might report a YELLOW status because
we can still attempt to serve without a cache. However, if a required database
goes down, we probably need to report a RED status, because we are unable to serve.
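The cache and database example can be sketched as a small roll-up function. This is one possible policy, not a prescribed one: the function name `overall_status`, the `(status, required)` pair shape, and the rule that any non-UP subsystem degrades the whole are all assumptions made for illustration.

```python
from enum import IntEnum


class Status(IntEnum):
    # Illustrative values; higher means more functionality.
    DOWN = 0      # RED
    DEGRADED = 1  # YELLOW
    UP = 2        # GREEN


def overall_status(subsystems):
    """Roll up an iterable of (status, required) pairs into one Status.

    Assumed policy: a required subsystem that is DOWN takes the whole
    system DOWN; any other non-UP subsystem makes it DEGRADED.
    """
    overall = Status.UP
    for status, required in subsystems:
        if status is Status.UP:
            continue
        if required and status is Status.DOWN:
            return Status.DOWN
        overall = Status.DEGRADED
    return overall


# Cache is optional, database is required.
cache_down = [(Status.DOWN, False), (Status.UP, True)]
database_down = [(Status.UP, False), (Status.DOWN, True)]
```

Here `overall_status(cache_down)` evaluates to `Status.DEGRADED`, while `overall_status(database_down)` evaluates to `Status.DOWN`.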
A "rugged" system should be able to accurately (and responsively)
report on its status even if it is unable to perform its main functions.
This will assist operators in diagnosing the problem; a hung process
tells no tales.
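One way to keep status reporting responsive when the main work hangs is to separate the two: the worker records a heartbeat each time it makes progress, and the status check only reads the timestamp. The sketch below is a hypothetical illustration of that idea; the `Heartbeat` class, its method names, and the `max_silence_s` threshold are assumptions, not part of the conventions described here.

```python
import time


class Heartbeat:
    """Tracks when the main loop last made progress."""

    def __init__(self, max_silence_s, clock=time.monotonic):
        self._clock = clock                  # injectable for testing
        self._max_silence_s = max_silence_s  # assumed staleness threshold
        self._last_beat = clock()

    def beat(self):
        # Called by the worker each time it completes a unit of work.
        self._last_beat = self._clock()

    def signal(self):
        # Called by the status reporter. It never waits on the worker,
        # so it still answers when the worker is hung.
        silent_for = self._clock() - self._last_beat
        return "GREEN" if silent_for <= self._max_silence_s else "RED"
```

The status reporter would typically run on its own thread or endpoint so that a stuck worker cannot block it; it then reports RED on the worker's behalf once the heartbeat goes stale.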