In a standard Nagios plus database implementation, you use NDOutils to store information in a database. While we think NDOutils is fantastic, it has some major limitations as you monitor more hosts. With Opsview, we want to scale. We've already done lots of work with NDOutils, including adding view-like helper tables, updating the database asynchronously, improving indices and speeding up configuration loading at a Nagios reload. Now we want to share an amazing improvement we've discovered.
We know that the nagios_servicechecks table is the most heavily used table. This records every result that flows into Nagios, whether it is actively or passively checked. The statement to add a row in that table is an INSERT ... ON DUPLICATE KEY UPDATE ....
However, this has problems. In our experience with the Opsview Data Warehouse - where we took best practice advice from data warehouse experts - fact tables should not have unique keys unless the rows really are unique. There need to be suitable indices to help the queries, but a uniqueness constraint means that an existing record may be updated when you expect a new record instead.
This made us wonder why the statement needed the UPDATE part at all. Further investigation showed that Nagios was sending extra messages to the database for processing.
The flow was:
1. A service check is initiated and a NEBTYPE_SERVICECHECK_INITIATE event is fired. NDOutils adds a new row to the table with the start times but no result.
2. A NEBTYPE_SERVICECHECK_ASYNC_PRECHECK event is fired - this allows other broker modules to intercept a service check execution. It was sent to NDOutils, but not processed.
3. Finally, a NEBTYPE_SERVICECHECK_PROCESSED event is fired - this updates the earlier row with the results of the check.
To work out the "earlier row", NDOutils used the unique index, which consists of the instance_id, object ID, start time and start time usec (microseconds). However, with passive check results, the start time usec is always set to 0. This means it is possible to lose results if you have checks with the same start time for the same object.
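The lost-result problem can be reproduced in a few lines. This is only a sketch: it uses an in-memory SQLite database rather than MySQL, the table is a hypothetical, cut-down stand-in for nagios_servicechecks, and SQLite's INSERT OR REPLACE stands in for MySQL's INSERT ... ON DUPLICATE KEY UPDATE.

```python
import sqlite3

# Hypothetical, simplified stand-in for nagios_servicechecks (not the real schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE servicechecks (
        instance_id INTEGER,
        service_object_id INTEGER,
        start_time INTEGER,
        start_time_usec INTEGER,
        output TEXT,
        UNIQUE (instance_id, service_object_id, start_time, start_time_usec)
    )
""")

def store_with_unique_key(row):
    # Assumption: INSERT OR REPLACE approximates the upsert behaviour of
    # MySQL's INSERT ... ON DUPLICATE KEY UPDATE for this demonstration.
    conn.execute(
        "INSERT OR REPLACE INTO servicechecks VALUES (?, ?, ?, ?, ?)", row
    )

# Two passive results for the same object in the same second.
# Passive results always carry start_time_usec = 0, so the keys collide.
store_with_unique_key((1, 42, 1262304000, 0, "WARNING - first result"))
store_with_unique_key((1, 42, 1262304000, 0, "CRITICAL - second result"))

rows = conn.execute("SELECT output FROM servicechecks").fetchall()
print(rows)  # [('CRITICAL - second result',)] - the first result is gone
```

Two results went in, but only one row survives, because the second write matched the unique key of the first.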
We took the view that (1) and (2) were not necessary. That meant (3) was the only event that needed to be processed by NDOutils. So our change was to stop (1) and (2) from being sent to NDOutils, and to update the command for (3) to do a straight INSERT, rather than an INSERT ... ON DUPLICATE KEY UPDATE .... This saved an index lookup.
We also changed the database index to reflect this, making it much smaller at the same time. The index used to consist of (start_time, instance_id, service_object_id, start_time_usec) - this meant that the index added another 36 bytes for each row. We changed it to (start_time) - only 8 bytes. Opsview only has one instance_id, so it is not necessary to include it in the index.
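As a rough sketch of the new behaviour, the handler below ignores everything except the processed event and writes each result with a plain INSERT. The event names are the ones above, but their numeric values, the `handle_event` function and the table layout are hypothetical, and the filtering is done in one place here for brevity, whereas the real change stops the events from being sent at all.

```python
import sqlite3

# Event names are from the article; the numeric values are arbitrary placeholders.
NEBTYPE_SERVICECHECK_INITIATE = 1
NEBTYPE_SERVICECHECK_ASYNC_PRECHECK = 2
NEBTYPE_SERVICECHECK_PROCESSED = 3

# Hypothetical, simplified table with a non-unique index on start_time only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE servicechecks (
        service_object_id INTEGER,
        start_time INTEGER,
        start_time_usec INTEGER,
        output TEXT
    )
""")
conn.execute("CREATE INDEX idx_start_time ON servicechecks (start_time)")

def handle_event(event_type, check):
    """Store a check result; return True if a row was written."""
    if event_type != NEBTYPE_SERVICECHECK_PROCESSED:
        return False  # INITIATE and ASYNC_PRECHECK no longer reach the database
    # Straight INSERT: no unique key to look up, no earlier row to update.
    conn.execute(
        "INSERT INTO servicechecks VALUES (?, ?, ?, ?)",
        (check["object_id"], check["start_time"], check["usec"], check["output"]),
    )
    return True

active = {"object_id": 42, "start_time": 1262304000, "usec": 123456, "output": "OK"}
passive = {"object_id": 42, "start_time": 1262304001, "usec": 0, "output": "WARNING"}

events = [
    (NEBTYPE_SERVICECHECK_INITIATE, active),
    (NEBTYPE_SERVICECHECK_ASYNC_PRECHECK, active),
    (NEBTYPE_SERVICECHECK_PROCESSED, active),
    # Two passive results in the same second for the same object.
    (NEBTYPE_SERVICECHECK_PROCESSED, passive),
    (NEBTYPE_SERVICECHECK_PROCESSED, dict(passive, output="CRITICAL")),
]

statements = sum(handle_event(event_type, check) for event_type, check in events)
row_count = conn.execute("SELECT COUNT(*) FROM servicechecks").fetchone()[0]
print(statements, row_count)  # 3 3
```

Both passive results are kept, since nothing forces them to share a row.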
If you are keeping score, here are the improvements:
- Reduced number of events sent to NDOutils by 66%
- Reduced number of SQL statements by 50%
- Changed 1 SQL statement, making it a smaller statement and saving an index lookup
- Reduced the size of one index by 77%
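The percentages follow from the counts given earlier. The short calculation below assumes three events per check before and one after, two SQL statements before (the initial insert and the later update) and one after, and 36 versus 8 bytes of index per row; `percent_reduction` is just a helper for this illustration.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before

events = percent_reduction(3, 1)        # three events per check down to one
statements = percent_reduction(2, 1)    # two SQL statements down to one
index_bytes = percent_reduction(36, 8)  # index bytes per row, 36 down to 8

print(f"events: {events:.1f}%")            # events: 66.7%
print(f"statements: {statements:.1f}%")    # statements: 50.0%
print(f"index size: {index_bytes:.1f}%")   # index size: 77.8%
```

The figures in the list are these values with the decimals dropped.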
Testing this was easy. As Opsview uses an asynchronous method of updating the database, you can change a debug file and Opsview will automatically start copying the data that would be pushed to the database. This gave us an NDO data packet. We then edited this data packet to contain 10000 events for the same object, and pushed it to our database instance.
Results? 10000 records were taking 23 seconds to update the database. With our changes, this dropped to 6 seconds! We're thrilled that this has sped up one of the most common database operations.
As Opsview is open source, we publish our source code. And the requirements of the GPL mean we have to publish the changes we make to NDOutils. Our complete patch list (for all our third-party software) is listed here.
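To put those timings in perspective, here is the same result expressed as throughput and speedup. It uses only the numbers above (10000 records, 23 seconds, 6 seconds); the variable names are ours.

```python
records = 10000
seconds_before = 23
seconds_after = 6

rate_before = records / seconds_before   # records per second before the change
rate_after = records / seconds_after     # records per second after the change
speedup = seconds_before / seconds_after

print(f"before: {rate_before:.0f} records/s")  # before: 435 records/s
print(f"after:  {rate_after:.0f} records/s")   # after:  1667 records/s
print(f"speedup: {speedup:.1f}x")              # speedup: 3.8x
```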
The specific patch for this change is here.
This improvement ships with Opsview Enterprise 3.8.0. Keep an eye out for more performance tuning enhancements and new features that we will be adding to Opsview in the next few months!