One of Opsview’s great features is distributed monitoring, which we’ve had for over 5 years now. From the web user interface, you can assign hosts to a slave system and Opsview will take care of all the configuration work for you: from the slave configuration files, to the slave results sent to the master, to the master configuration with freshness checking.
We do all the system integration work, so you don’t have to.
However, there are some limitations in our chosen technologies. We use NSCA, which is the most common method in the Nagios world, and while we’ve made improvements to it that have gone back upstream, there are some baked-in limitations:
- Only the first 511 bytes of plugin output were returned to the master, limiting the usefulness of the information you could display
- Only the first line of output was returned, forcing you to cram everything together
- NSCA communication used fixed-size packets, which was inefficient
- Nagios would block while results were being sent, introducing a bottleneck
- If there was a communication problem with the master, results were dropped
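To see why the first three limitations are baked in, it helps to look at the shape of an NSCA data packet. The sketch below (Python, purely illustrative; field sizes follow the definitions in the NSCA source, where the plugin output buffer is 512 bytes) shows how every result occupies the same fixed-size packet and anything past 511 bytes of output is silently lost:

```python
import struct

# Illustrative layout of an NSCA v3 data packet, based on the field sizes
# in the NSCA source (treat the exact sizes as an assumption of this sketch):
#   int16  packet_version
#   uint32 crc32_value
#   uint32 timestamp
#   int16  return_code
#   char   host_name[64]
#   char   svc_description[128]
#   char   plugin_output[512]
PACKET_FORMAT = "!hIIh64s128s512s"

def pack_result(host, service, return_code, output, timestamp=0):
    """Pack a passive result the way an NSCA client does: every packet is
    the same fixed size, and plugin output beyond 511 bytes is discarded."""
    return struct.pack(
        PACKET_FORMAT,
        3,                      # packet_version
        0,                      # crc32 (zeroed here; the real client fills it in)
        timestamp,
        return_code,
        host.encode()[:63],     # 64-byte buffer, NUL-terminated
        service.encode()[:127],
        output.encode()[:511],  # 512-byte buffer, NUL-terminated -> 511 usable
    )

packet = pack_result("slave1", "Disk", 0, "x" * 2000)
print(len(packet))  # same fixed size no matter how short or long the output
```

A 2000-byte plugin output and a 10-byte one produce identical packets, which is exactly the inefficiency and truncation described above.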
Sometimes to move forward, you have to leave the past behind.
So we did that - we ripped out NSCA from Opsview’s slave communications and replaced it with NRD, from CAPSiDE. We’ve addressed every one of these limitations - and added a few nice extras too! We chose NRD because:
- It is based on perl, which is our language of choice
- Its test suite builds on and enhances the one we developed for NSCA, demonstrating a mature approach to code development
- The client and server code is a thin shim over the libraries, which means you can easily create your own clients
- We have a good relationship with CAPSiDE and they have given us access to their code repository
We’ve spent some time understanding the core NRD code, enhancing it, fixing some issues and adding in some great new features. CAPSiDE have also released it on CPAN for wider consumption.
So Opsview’s new process for sending results from a slave is: the slave collects its check results, the NRD client sends them over a tunnelled SSH session to the nrd daemon on the master, and the daemon hands them straight to Nagios for processing.
A couple of other amazing features we’ve squeezed in:
- A known Nagios bottleneck is the named pipe used to submit passive results. We’ve bypassed it by writing directly to the checkresults spool directory, which saves a Nagios processing cycle on the Opsview master
- We’ve implemented transactions in the results, so if the client fails to communicate with the server, it will back off and retry after 5 seconds. This guarantees you do not get duplicated results
- The nrd daemon on the master will dynamically spawn more server processes as workload increases, thanks to the features of Net::Server
- As all communication between master and slaves is over a tunnelled SSH session, we’ve updated our Opsview check scripts to restart these tunnels if the slave is exhibiting communication errors
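The spool-directory trick in the first bullet is worth a closer look. Instead of squeezing results one at a time through Nagios’s named command pipe, a result can be dropped into the checkresults spool as a file, with a companion `.ok` marker telling Nagios the file is complete. This Python sketch is illustrative only - the minimal field set shown here is an assumption, and the exact format should be checked against the Nagios source:

```python
import os
import tempfile
import time

def write_checkresult(spool_dir, host, service, return_code, output):
    """Write a passive result straight into the Nagios checkresults spool
    (check_result_path), bypassing the named command pipe. Field names
    follow the Nagios checkresult file convention; the companion ".ok"
    file signals that the result file is complete and safe to reap."""
    now = time.time()
    fd, path = tempfile.mkstemp(prefix="c", dir=spool_dir)
    with os.fdopen(fd, "w") as f:
        f.write(f"host_name={host}\n")
        f.write(f"service_description={service}\n")
        f.write("check_type=1\n")            # 1 = passive result
        f.write(f"start_time={now:.6f}\n")
        f.write(f"finish_time={now:.6f}\n")
        f.write("early_timeout=0\n")
        f.write("exited_ok=1\n")
        f.write(f"return_code={return_code}\n")
        f.write(f"output={output}\n")
    open(path + ".ok", "w").close()          # marker: file is ready for Nagios
    return path

spool = tempfile.mkdtemp()
p = write_checkresult(spool, "slave1", "Disk", 0, "DISK OK - 21% used")
```

Because each result is an ordinary file, many can be written in parallel without contending for a single pipe.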
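The transaction behaviour in the second bullet boils down to: treat a batch of results as all-or-nothing, and on failure back off and resend the whole batch. Here is a minimal sketch of that retry loop (not Opsview’s actual client code; `send` is a stand-in callable, and the delay is parameterised with a 5-second default to match the behaviour described above):

```python
import socket
import time

RETRY_DELAY = 5  # seconds, matching the client back-off described above

def send_with_retry(send, payload, attempts=3, delay=RETRY_DELAY):
    """Send one batch of results as a transaction: either the server
    acknowledges the whole batch, or the client backs off and resends it.
    `send` is a stand-in for the real client call over the SSH tunnel."""
    for attempt in range(1, attempts + 1):
        try:
            return send(payload)
        except (socket.error, OSError):
            if attempt == attempts:
                raise  # out of attempts: surface the failure to the caller
            time.sleep(delay)
```

Because a failed batch is resent in its entirety and only acknowledged batches are committed, results are neither lost nor duplicated on a flaky link.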
With all these extra capabilities, you might expect a cost in performance. In fact, our testing shows that performance has improved!
(Based on sending 2016 results in a single transaction over an SSH tunnel from a slave to a master. Times measured on the client.)
This shows that we are getting an average 62% improvement in all aspects of slave communication back to the master!
We are thrilled to have added this major new functionality to Opsview, taking our distributed monitoring another huge step ahead of our competitors.
But the best thing is: this is available immediately with our Opsview Community 3.11 release. Install the VM, add a slave, and this new architecture is set up as part of the process. And if you are an existing Opsview user, you get a silky smooth switch-over. We’ve done a lot of testing to ensure that, as part of the upgrade, Opsview will automatically switch any slaves to this architecture and start sending results in the new NRD way.