All too often, network management is only used when you have a problem. When the network management station starts flashing red icons, you know you have an issue to deal with.

Fault management should let you detect problems before your users do, identify what they are and what effect they will have on your network, and help you pinpoint the causes. Before it can be of any use, though, you have some housekeeping to do.

Baseline
How does your network operate when all is well (see also the Performance Management feature)? You need to know, for example, at what level broadcasts stop being a normal occurrence and start to indicate that something is seriously wrong. If you don’t know what is right, how can you tell what is wrong?
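As a rough illustration of what a baseline buys you, the sketch below derives a mean and standard deviation from readings taken while the network was healthy, then flags any later reading that sits well above them. The sample figures, the function names and the three-standard-deviation cut-off are all assumptions made for the example, not recommended values.

```python
from statistics import mean, stdev

def baseline(samples):
    """Return (mean, standard deviation) of readings taken while all was well."""
    return mean(samples), stdev(samples)

def is_abnormal(value, avg, sd, k=3.0):
    """Flag a reading more than k standard deviations above the baseline mean."""
    return value > avg + k * sd

# Hypothetical broadcast rates (packets per second) from a healthy period.
healthy = [110, 95, 120, 105, 98, 115, 102]
avg, sd = baseline(healthy)

print(is_abnormal(118, avg, sd))  # False: within the normal spread
print(is_abnormal(900, avg, sd))  # True: far above anything seen when healthy
```

In practice you would probably keep separate baselines by time of day, since a level that is normal at mid-morning may be a warning sign overnight.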

You must - this cannot be overstressed - have up-to-date topology diagrams, configurations and addressing schemes. All too often problems take much longer to fault-find than they should because the documentation isn’t right.

Watching for trouble
You need an automated system to receive SNMP traps and proactively poll for error conditions. But you need to tune the information you get so that you don’t get swamped by non-issues, or receive too many duplicates. There’s no point getting a separate unresponsiveness alarm for every LAN interface on a remote site router when it’s the WAN connection to that site that has failed.
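The remote-site case can be sketched in a few lines. Assuming you can describe which element each other element depends on for reachability, an alarm is dropped when it is a duplicate or when something upstream of it is already reported down. The element names and the `parent_of` mapping are hypothetical, and real correlation engines do considerably more than this.

```python
def filter_alarms(alarms, parent_of):
    """Return alarms with duplicates and downstream symptoms removed.

    alarms: element names reported as unresponsive, in arrival order
    parent_of: maps an element to the element it is reached through
    """
    down = set(alarms)
    seen = set()
    kept = []
    for element in alarms:
        if element in seen:
            continue  # duplicate of an alarm already handled
        seen.add(element)
        # Walk up the dependency chain looking for a failed ancestor.
        visited = {element}
        parent = parent_of.get(element)
        suppressed = False
        while parent is not None and parent not in visited:
            if parent in down:
                suppressed = True
                break
            visited.add(parent)  # guards against loops in the mapping
            parent = parent_of.get(parent)
        if not suppressed:
            kept.append(element)
    return kept

# Hypothetical remote site: both LAN interfaces are reached via the WAN link.
parent_of = {"branch-lan0": "branch-wan", "branch-lan1": "branch-wan"}
alarms = ["branch-lan0", "branch-wan", "branch-lan1", "branch-lan0"]

print(filter_alarms(alarms, parent_of))  # ['branch-wan']
```

If only a LAN interface is reported down and the WAN link is healthy, the LAN alarm is kept, which is the behaviour you want.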

You’ll need some form of correlation engine within your management platform to do this, and it’s worth finding out just how good it is at the job before you spend a lot of money. At the higher end of the complexity (and monetary) scale, platforms map network elements to business workgroups and work out the effect on users, taking into account levels of redundancy, relative importance and even time of day. This gives you a better idea of how important this failure is to your business in the overall scheme of things.

If you can trust your network devices to send alerts, you won’t have to poll so often, which lets you reduce management traffic. There are two issues here: if a device fails completely, it probably won’t be able to send a trap to tell you, and how do you know you’re receiving all the traps anyway?

The first is solved by availability polling, which you can’t really get away from. The second can be dealt with by using SNMPv2c (or v3), which adds an operation known as an inform as an alternative to a trap. An inform requires the receiving station to send an acknowledgement, so the sending device can resend if none arrives, giving more reliable delivery.
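The difference between the two notification types comes down to whether the sender waits for an answer. The sketch below models only that retry logic: it is not an SNMP implementation, and `send` is a hypothetical callable standing in for "transmit one notification and report whether an acknowledgement came back before the timeout".

```python
def send_with_ack(send, max_attempts=3):
    """Inform-style delivery: resend until acknowledged or attempts run out.

    Returns the attempt number that succeeded, or None if none did.
    """
    for attempt in range(1, max_attempts + 1):
        if send():
            return attempt
    return None

# Simulated lossy path: the first two notifications go unacknowledged.
outcomes = iter([False, False, True])
result = send_with_ack(lambda: next(outcomes))

print(result)  # 3: delivered on the third attempt
```

A trap is the equivalent of calling `send` once and ignoring the result, so a notification lost on the first attempt would never reach the management station.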

You should have NTP enabled across your devices to make sure that alerts, alarms and debug information can be tied together - even a few seconds’ difference can make tracking down a problem across multiple switches or routers an impossible task.
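To see why a few seconds matter, consider merging event logs from two devices into one timeline. In this sketch the device names, messages and timestamps are invented for illustration. With accurate clocks the cause comes before its effect; with one clock running five seconds fast, the order is reversed.

```python
from datetime import datetime, timedelta

def merge_events(*logs):
    """Merge per-device (timestamp, device, message) lists into one timeline."""
    return sorted((event for log in logs for event in log), key=lambda e: e[0])

t0 = datetime(2024, 1, 1, 12, 0, 0)  # arbitrary example time
switch_a = [(t0, "switch-a", "uplink down")]
switch_b = [(t0 + timedelta(seconds=2), "switch-b", "topology change")]

# Synchronized clocks: the uplink failure is listed first.
print([e[1] for e in merge_events(switch_a, switch_b)])
# ['switch-a', 'switch-b']

# switch-a's clock runs 5 seconds fast: cause and effect appear swapped.
skewed_a = [(ts + timedelta(seconds=5), dev, msg) for ts, dev, msg in switch_a]
print([e[1] for e in merge_events(skewed_a, switch_b)])
# ['switch-b', 'switch-a']
```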

Security implications
Bearing in mind that SNMP is capable of setting parameters as well as getting them, it’s not a good idea to leave your network devices open to all SNMP traffic. Even if all a rogue station does is read SNMP settings, it can gain a wealth of useful hacking information by doing an SNMP walk through a router’s MIB tree. Obviously letting anything set parameters via SNMP could be a major disaster. You should, as a minimum, use access lists to ensure SNMP traffic is only allowed to and from your specified management stations. SNMPv3 adds security in terms of authentication, albeit pretty basic, but in fact the majority of network management applications and platforms still don’t support v3, so you may not have this option.
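The access-list idea is simple enough to express as a membership test. The sketch below is not device configuration, just the same logic in Python: a source address is accepted only if it belongs to a listed management station. The addresses come from a range reserved for documentation and are placeholders.

```python
from ipaddress import ip_address, ip_network

# Hypothetical management stations (192.0.2.0/24 is reserved for examples).
MANAGEMENT_STATIONS = [
    ip_network("192.0.2.10/32"),
    ip_network("192.0.2.11/32"),
]

def snmp_allowed(source):
    """Return True only if the source address is a listed management station."""
    addr = ip_address(source)
    return any(addr in net for net in MANAGEMENT_STATIONS)

print(snmp_allowed("192.0.2.10"))  # True: a known management station
print(snmp_allowed("192.0.2.99"))  # False: anything else is refused
```

On a real router or switch the same rule would be written in the vendor's own access-list syntax and applied to SNMP traffic in both directions.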

When you know what the problem is, and the user impact, you can start to put either a fix or a temporary workaround in place to get your users up and working again - or even better, avoid an outage before it happens. By now you should have all the information you need to tell you what has happened.

One point that should be obvious but somehow often gets ignored - if you have to put in a temporary workaround to get the service up and running quickly, remember to go back and do the proper fix as soon as you can. And, in the meantime, make sure the documentation is updated with any changes, no matter how temporary you think they will be.