Opsview Development Team
Alerts happen. They are the reason why monitoring applications were created: to alert us when servers need attention. The difference between an effective network monitoring system and an annoying one is a fine line between information and noise. Alerts should be descriptive and prompt an administrative action, not elicit a huff of frustration. Here are a few ways to keep your Opsview installation (and you) effective and relevant in your company.
Use a Smartphone
A smartphone should be a tool on every system administrator’s bat belt. The more mobile you are, the more time you spend away from the Operations Center. Why not take Opsview with you? With Opsview Mobile for Android, you can do just that. The mobile app has an ambitious roadmap, including support for other devices, but if you have an Android device there is no reason to wait to install it.
The app handles basic needs very well, including a real-time overview of all hosts and services and alert acknowledgement. If you are away from the office (like at the beach!) and get an alert on your phone, acknowledge it and then make a call to your backup (hopefully you have one!) who can begin corrective action. If you don’t have a backup, you at least have a heads-up to an issue at work and can go back to sipping a drink from a coconut.
Use a Real Email Address
Create an email address that can be dedicated to your mobile. For example, create a Gmail address and configure your smartphone for an audible notification on new messages to that address. Smartphone text messages don’t give you the entire story, only a few characters to let you know a host or service is having a problem. The Additional Information section of the alert may hold details that change how you respond. A disk utilization alert at 95% may be something that can wait until you get back to your office to debug, whereas 100% would prompt you to boot your laptop and resolve it as soon as possible. The only way to know is to have all the alert information in hand (literally).
Modify Alert Templates
The more information you put in the alert, the better the chance you can delegate action. Modify the default alert templates to include more information that can help other people, such as a help desk, route tickets more effectively. Since Opsview has Nagios under the hood, all Nagios macros are available. (A complete macro list can be found on Sourceforge.)
An example would be inserting comments using the Opsview UI on a host group or individual host, then changing the template to include the macro output for $HOSTGROUPNOTES$ or $HOSTNOTES$. Comments could include where to route tickets or links to documents that solve common problems first level support can handle. If the issue is too complex, level one support will know which direction to escalate the ticket. The default template is located in /usr/local/nagios/libexec/notifications/com.opsview.notificationmethods.email.tt.
It’s a good idea to keep a backup of any changes you make since the file will be overwritten with each Opsview upgrade.
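To make the macro idea concrete, here is a minimal Python sketch of how Nagios-style $MACRO$ placeholders get replaced with per-host values. It is an illustration only: it is not the syntax of the Opsview template file, and the function name, the sample alert body, and the sample values are all hypothetical.

```python
import re

# Matches Nagios-style macros such as $HOSTNAME$ or $HOSTNOTES$.
MACRO_PATTERN = re.compile(r"\$([A-Z0-9_]+)\$")


def expand_macros(template: str, values: dict) -> str:
    """Replace each $MACRO$ with its value; leave unknown macros untouched."""

    def substitute(match):
        name = match.group(1)
        # Fall back to the original text so a missing macro stays visible.
        return values.get(name, match.group(0))

    return MACRO_PATTERN.sub(substitute, template)


# Hypothetical alert body and values, for illustration only.
template = (
    "Host: $HOSTNAME$\n"
    "State: $HOSTSTATE$\n"
    "Host notes: $HOSTNOTES$\n"
    "Host group notes: $HOSTGROUPNOTES$\n"
)
values = {
    "HOSTNAME": "web01",
    "HOSTSTATE": "DOWN",
    "HOSTNOTES": "Route tickets to the Linux team queue.",
}

print(expand_macros(template, values))
```

In this sketch $HOSTGROUPNOTES$ has no value, so it is printed unchanged, which makes a forgotten note easy to spot when you review a test alert.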
Set up Layered Email Profiles with Time Periods
The rub with any server monitoring system is no one wants a critical alert at three in the morning, but proper administration can’t be done without notifications. Administrators should embrace alerts, specifically warning alerts since they allow for proactive work to be done preventing critical alerts. That being said, no one wants a warning alert at three in the morning. Fortunately, each Contact in Opsview can have multiple Profiles which can have different layers of alerts. For your work email, create profiles for warning and critical to be sent 24x7. For your Gmail that your phone accesses, create profiles for warning alerts 8x5 and another profile for critical alerts 24x7. Be sure to name your profiles logically for easier administration, such as EmailPhoneWarning or EmailWorkCritical.
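The layering described above can be modeled in a few lines of Python. This is a sketch of the idea, not Opsview's implementation: the Profile class, the example addresses, and the choice of 09:00 to 17:00 on weekdays for "8x5" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Profile:
    name: str
    address: str
    states: frozenset    # alert states this profile accepts
    weekdays: frozenset  # 0 = Monday ... 6 = Sunday
    start_hour: int      # inclusive
    end_hour: int        # exclusive; 0 to 24 covers the whole day

    def accepts(self, state: str, when: datetime) -> bool:
        """True if this profile should notify for the state at this time."""
        return (
            state in self.states
            and when.weekday() in self.weekdays
            and self.start_hour <= when.hour < self.end_hour
        )


ALL_DAYS = frozenset(range(7))
WORK_DAYS = frozenset(range(5))

# Hypothetical addresses; "8x5" is assumed to mean 09:00-17:00, Monday-Friday.
PROFILES = [
    Profile("EmailWorkWarning", "me@work.example", frozenset({"WARNING"}), ALL_DAYS, 0, 24),
    Profile("EmailWorkCritical", "me@work.example", frozenset({"CRITICAL"}), ALL_DAYS, 0, 24),
    Profile("EmailPhoneWarning", "me.alerts@mail.example", frozenset({"WARNING"}), WORK_DAYS, 9, 17),
    Profile("EmailPhoneCritical", "me.alerts@mail.example", frozenset({"CRITICAL"}), ALL_DAYS, 0, 24),
]


def matching_profiles(state: str, when: datetime) -> list:
    """Return the names of the profiles that would fire."""
    return [p.name for p in PROFILES if p.accepts(state, when)]


three_am = datetime(2024, 1, 3, 3, 0)  # a Wednesday
print(matching_profiles("WARNING", three_am))
print(matching_profiles("CRITICAL", three_am))
```

With these settings a warning at three in the morning reaches only the work mailbox, while a critical alert also reaches the phone.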
Send Alerts to Host and Service Administrators
An IT shop may have specific administrators, such as Windows or Linux admins. Windows administrators may not care to get alerts when Apache is down and Linux admins may not want to be woken up because a Windows server blue-screened.
Digging deeper into the Contact Profile, notifications can be set up for Host Groups and Service Groups. Configure each user to get alerts only for the services they are responsible for. Anytime someone gets an alert that they ignore because it is someone else’s responsibility, noise is created; alerts come to be assumed irrelevant and disregarded, lowering the value of the entire monitoring system. If you want people to feel your pain, correct the issue and then send an email to work addresses saying that you were up all night dealing with the problem. It’s not a bad idea to show people that the system is working as it should, notifying only the responsible parties of critical issues (plus it gets you off the hook for coming in late the next morning).
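As a rough illustration of routing by responsibility, the Python sketch below picks the contacts for an alert from their host group and service group subscriptions. The class names, group names, and the rule that both groups must match are assumptions for this example, not a description of Opsview internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    name: str
    host_groups: frozenset
    service_groups: frozenset


@dataclass(frozen=True)
class Alert:
    host: str
    host_group: str
    service: str
    service_group: str


def responsible_contacts(alert: Alert, contacts: list) -> list:
    """Return names of contacts subscribed to both groups of the alert."""
    return [
        c.name
        for c in contacts
        if alert.host_group in c.host_groups
        and alert.service_group in c.service_groups
    ]


# Hypothetical contacts and groups, for illustration only.
CONTACTS = [
    Contact("linux_admin", frozenset({"Linux Servers"}), frozenset({"Web", "OS"})),
    Contact("windows_admin", frozenset({"Windows Servers"}), frozenset({"OS"})),
]

apache_down = Alert("web01", "Linux Servers", "Apache", "Web")
print(responsible_contacts(apache_down, CONTACTS))
```

Here only linux_admin is returned for the Apache alert, so the Windows administrator never sees it.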
Test New Checks Before Enabling Notifications
New checks are rolled out constantly in a changing environment. But new checks put immediately in production may produce false alarms that annoy other administrators and the help desk. Since Opsview includes built-in features to help monitor and trend check results, every addition should go through a testing period with notifications disabled. After a week, check the Service Graph to find the highest and lowest values to appropriately tune warning and critical thresholds.
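One way to turn a week of observed values into starting thresholds is sketched below in Python. The heuristic (placing warning and critical a fixed margin above the observed peak) and the margin values are assumptions for illustration; the right margins depend on the check, and the samples are made up.

```python
import statistics


def suggest_thresholds(samples, warning_margin=1.10, critical_margin=1.25):
    """Summarize a trial period and suggest thresholds above the observed peak.

    The margins are illustrative defaults, not recommended values.
    """
    if not samples:
        raise ValueError("need at least one sample from the trial period")
    peak = max(samples)
    return {
        "min": min(samples),
        "max": peak,
        "mean": round(statistics.mean(samples), 2),
        "warning": round(peak * warning_margin, 2),
        "critical": round(peak * critical_margin, 2),
    }


# Made-up response times in milliseconds collected with notifications disabled.
week_of_samples = [42, 55, 48, 61, 80, 47, 52]
print(suggest_thresholds(week_of_samples))
```

Treat the result as a first guess to refine against the graph, since a single spike during the trial week will pull both thresholds upward.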
Using the Alert Summary, you can determine if time periods should be used for a check. For example, a service may become unavailable during a nightly backup. The check_interval must remain the same, but checks need to be suspended for two hours each night while the backup occurs. With a week of data you can tune the time period confidently rather than make an uninformed guess at a blackout window. Making accurate adjustments before a check “goes live” with notifications enabled greatly reduces unnecessary alerts and allows administrators to maintain faith in the system.
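The blackout logic itself is simple, but windows that cross midnight are easy to get wrong. The Python sketch below shows the comparison; the function names and the 23:00 to 01:00 backup window are hypothetical, and in practice the window would be configured as a time period in Opsview rather than in code.

```python
from datetime import time


def in_window(now: time, start: time, end: time) -> bool:
    """True if now falls in [start, end), including windows that wrap midnight.

    When start equals end the window is treated as empty.
    """
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, for example 23:00 to 01:00.
    return now >= start or now < end


def check_allowed(now: time, blackout_start: time, blackout_end: time) -> bool:
    """True if the check should run, that is, now is outside the blackout."""
    return not in_window(now, blackout_start, blackout_end)


# Hypothetical two-hour nightly backup window.
BACKUP_START = time(23, 0)
BACKUP_END = time(1, 0)

for moment in (time(22, 59), time(23, 30), time(0, 30), time(1, 0)):
    print(moment, check_allowed(moment, BACKUP_START, BACKUP_END))
```

The end of the window is exclusive, so checks resume at exactly 01:00 in this example.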
About the Author
Paul Fleetwood started as a Unix Administrator in 1999. He has used Nagios since 2003 and has rolled out Opsview at small and large companies including a distributed installation that monitored 600 hosts and 5000 services. Paul currently works for an award-winning custom content publisher in
North Carolina and spends all his free time with his wife and three very