
The golden rule of monitoring: don't check for failures. This seems obvious once you're used to it: always define what you consider to be the working state of a service, and make sure any other state is treated as a failure, never the other way around.

A typical example is parsing a log file for lines containing "ERROR:" when all other lines contain either "NOTICE:" or "WARNING:". Suppose new, unexpected lines containing "CRITICAL:" start appearing:

  • If you trigger an alert only on lines containing "ERROR:", nothing will be reported.
  • If you suppress the alert only for lines containing "NOTICE:" or "WARNING:", the problem will be reported.
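
The two approaches in the log example can be compared side by side. The following is a minimal sketch, not a definitive implementation: the function names, the `EXPECTED_MARKERS` tuple and the sample log lines are all hypothetical and chosen only for illustration.

```python
# Hypothetical markers that define the "working state" of the log.
EXPECTED_MARKERS = ("NOTICE:", "WARNING:")


def alerts_by_failure_match(lines):
    """Fragile check: alert only on lines already known to be bad."""
    return [line for line in lines if "ERROR:" in line]


def alerts_by_expected_state(lines):
    """Robust check: alert on every line not recognised as normal."""
    return [
        line
        for line in lines
        if not any(marker in line for marker in EXPECTED_MARKERS)
    ]


# Illustrative log content, including an unexpected "CRITICAL:" line.
sample_log = [
    "NOTICE: service started",
    "WARNING: disk usage is high",
    "CRITICAL: database unreachable",
]

print(alerts_by_failure_match(sample_log))
print(alerts_by_expected_state(sample_log))
```

With this sample, the first check returns nothing, while the second returns the "CRITICAL:" line. The second check also still reports "ERROR:" lines, since they are not part of the expected state either.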