We’ve just fixed a bug in Nagios that an Opsview user reported to us. A change made to Nagios in version 3.2.2 meant that a service alert was written to the nagios.log file for every result that came back from a host that was down. These extra alerts were overwhelming Opsview’s event views.

To reproduce the problem in Nagios 3.2.3:

  1. Create a host with 2 service checks
  2. Let this run normally
  3. Shut down the host
  4. The first service check notices the state change and sets the host to be checked. The service goes into a SOFT state, moves on to check attempt 2, and continues into a HARD state correctly
  5. The second service check sees that the host is DOWN and forces a HARD state failure at check attempt 1 of a maximum of 4. However, this hard state change does not set the last_hard_state flag correctly, so every subsequent check is treated as a new hard state failure and a SERVICE ALERT is raised in nagios.log every time
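To make step 5 concrete, here is a minimal sketch in Python of the behaviour described above. It is a toy model, not the Nagios source: the `Service` class, `process_result_host_down` and `simulate` are hypothetical names invented for illustration, and only `last_hard_state` comes from the description above. The assumption is that a hard state change is detected by comparing the new state with `last_hard_state`.

```python
from dataclasses import dataclass

# Simplified state codes for this sketch.
OK, CRITICAL = 0, 2


@dataclass
class Service:
    name: str
    last_hard_state: int = OK
    alerts: int = 0  # stands in for SERVICE ALERT lines in nagios.log


def process_result_host_down(svc, state, update_last_hard_state):
    """Handle one check result for a service whose host is already DOWN.

    Because the host is down, the result is forced straight to a HARD state.
    """
    hard_state_change = state != svc.last_hard_state
    if hard_state_change:
        svc.alerts += 1
    if update_last_hard_state:
        # The step that was effectively missing in the buggy behaviour.
        svc.last_hard_state = state
    return hard_state_change


def simulate(update_last_hard_state, results=5):
    """Feed the same CRITICAL result several times and count the alerts."""
    svc = Service("second service check")
    for _ in range(results):
        process_result_host_down(svc, CRITICAL, update_last_hard_state)
    return svc.alerts


buggy_alerts = simulate(update_last_hard_state=False)
fixed_alerts = simulate(update_last_hard_state=True)
print("alerts without the update:", buggy_alerts)
print("alerts with the update:", fixed_alerts)
```

In this model, leaving `last_hard_state` stale makes every result look like a new hard state change, which is the flood of alerts the user saw. Recording the state once means only the first result raises an alert.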

This took a long time to track down, but we’ve found the problem and fixed it. Our fix is pushed to Nagios already.

While this bug is annoying, we’re upset that this had an impact on a customer system. We make it our principle to keep as up to date with Nagios as possible because Opsview is a shallow fork of Nagios - we make only the changes that are necessary to support our customers and we push our changes back upstream where we can.

We’ve developed a lot of trust with our users - we make the upgrade process for Opsview as easy as possible because we want all our users to get to the latest version (in fact, we’ve just had one user upgrade an Opsview installation from 4 years ago right up to the latest version, going through over a hundred database changes!).

One thing we do to make sure our systems work as expected is to continuously test our latest versions of Opsview. We use Hudson to test Opsview on every change - currently this runs 5269 individual tests, taking 1 hour 46 minutes.

We want to bring this level of quality assurance to Nagios - included in our fix is a test case that checks exactly this issue. Running the tests on Nagios will now show that this problem stays fixed, and our nightly builds of Opsview include these tests too.
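If you want to check your own nagios.log for this symptom, a rough sketch along the following lines may help. This is not the test case included in the fix. It assumes the usual `[timestamp] SERVICE ALERT: host;service;state;state type;attempt;output` log line layout, and the function name `repeated_hard_alerts` and the sample log entries are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed layout: [timestamp] SERVICE ALERT: host;service;STATE;HARD|SOFT;attempt;output
ALERT_RE = re.compile(
    r"^\[(\d+)\] SERVICE ALERT: ([^;]*);([^;]*);([^;]*);(HARD|SOFT);(\d+);(.*)$"
)


def repeated_hard_alerts(lines):
    """Count HARD alerts that repeat the previous HARD state of the same service."""
    last_state = {}
    repeats = defaultdict(int)
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if not match:
            continue  # not a SERVICE ALERT line
        _, host, service, state, state_type, _, _ = match.groups()
        if state_type != "HARD":
            continue
        key = (host, service)
        if last_state.get(key) == state:
            # Same hard state logged again: the symptom described in this post.
            repeats[key] += 1
        last_state[key] = state
    return dict(repeats)


# Made-up sample entries, for illustration only.
sample_log = [
    "[1286000000] SERVICE ALERT: host1;svc1;CRITICAL;SOFT;2;Connection refused",
    "[1286000120] SERVICE ALERT: host1;svc1;CRITICAL;HARD;4;Connection refused",
    "[1286000130] SERVICE ALERT: host1;svc2;CRITICAL;HARD;1;Connection refused",
    "[1286000430] SERVICE ALERT: host1;svc2;CRITICAL;HARD;1;Connection refused",
    "[1286000730] SERVICE ALERT: host1;svc2;CRITICAL;HARD;1;Connection refused",
]

print(repeated_hard_alerts(sample_log))
```

A service that appears in the result with a non-zero count has logged the same hard state more than once in a row, which is what the duplicate alerts looked like in the event views.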

So now everyone can sleep easier knowing that this problem is never going to happen again.