Performance management is not something you do once a year to see if your WAN links are nearly full yet. You should know by now that the majority of complaints made by your users aren’t about total failures, but the much more common ‘the network’s slow’. If you’re not measuring things like response times, how do you reply to that?

And when a new application is added, how can you tell what impact it’s going to have on network bandwidth if you don’t know what the utilisation was like before? Performance monitoring, from both a fault-finding (or better still, fault-avoidance) and a capacity-planning perspective, should be an ongoing process.

Proactive fault finding
You probably do availability monitoring as a matter of course, checking that you can ping your routers and servers. This isn’t good enough on its own. The fact that a web server replies to a ping tells you nothing about its ability to produce HTML pages in a timely manner.
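
As a rough illustration, the sketch below (Python, with a placeholder URL and an illustrative threshold) times a complete page fetch rather than just a ping, so the check reflects the server’s ability to actually serve content:

import time
import urllib.request

URL = "http://intranet.example.com/"   # placeholder for one of your own web servers
THRESHOLD_SECONDS = 2.0                # illustrative target, not a recommendation

start = time.monotonic()
with urllib.request.urlopen(URL, timeout=10) as response:
    body = response.read()             # force the whole page to be transferred
elapsed = time.monotonic() - start

print(f"Fetched {len(body)} bytes in {elapsed:.2f}s")
if elapsed > THRESHOLD_SECONDS:
    print("WARNING: page is slow even though the server may still answer pings")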

Of course you should be measuring response times across your network, but make sure it’s a meaningful measurement. Don’t ping between your WAN routers and expect that to tell you accurately what your users will experience. You need to measure end to end, and ideally break that down into logical sections, so that if a response time measurement creeps up you can tell whether the WAN, the local segment or the remote segment is at fault.
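
One way to break the path down, sketched below, is to probe each logical segment separately. The addresses are placeholders and the script assumes a Unix-style ping command, but comparing the averages tells you which segment a slowdown belongs to:

import re
import subprocess

# Placeholder addresses for each logical segment of the path.
PROBES = {
    "local gateway": "192.0.2.1",
    "far-end WAN router": "198.51.100.1",
    "remote server": "203.0.113.10",
}

for name, address in PROBES.items():
    result = subprocess.run(
        ["ping", "-c", "5", address],
        capture_output=True, text=True
    )
    # The ping summary line looks like: rtt min/avg/max/mdev = 1.1/2.2/3.3/0.4 ms
    match = re.search(r"= [\d.]+/([\d.]+)/", result.stdout)
    avg = match.group(1) + " ms" if match else "no reply"
    print(f"{name:20s} {avg}")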

There are lots of steps involved when you access a web page from a client. If it’s slow, it could as easily be a DNS resolution delay as a problem on the actual web server. Make sure you constantly monitor all parts of the overall transaction.
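
A rough sketch of that idea, assuming a hypothetical internal web server, is to time the DNS lookup, the TCP connection and the page transfer as separate stages:

import socket
import time

HOST = "intranet.example.com"   # placeholder hostname
PORT = 80

t0 = time.monotonic()
addrinfo = socket.getaddrinfo(HOST, PORT, proto=socket.IPPROTO_TCP)
ip, port = addrinfo[0][4][:2]
t1 = time.monotonic()                      # DNS resolution done

sock = socket.create_connection((ip, port), timeout=10)
t2 = time.monotonic()                      # TCP connection established

request = f"GET / HTTP/1.1\r\nHost: {HOST}\r\nConnection: close\r\n\r\n"
sock.sendall(request.encode())
while sock.recv(4096):                     # drain the whole response
    pass
t3 = time.monotonic()                      # page fully transferred
sock.close()

print(f"DNS {t1 - t0:.3f}s  connect {t2 - t1:.3f}s  transfer {t3 - t2:.3f}s")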

If you’ve implemented Quality of Service, ensure that what you measure is representative of the marked traffic. If you are just sending pings, at least make sure they use the right packet size and IP Precedence/DSCP marking to simulate the real traffic on your network. If you’re running voice or interactive video, then not just response times but the variation in them (jitter) is vitally important, so make sure you have the tools in place to monitor for this before you implement VoIP.
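
As an illustration only, the sketch below sends UDP probes marked with the EF DSCP value, assuming the operating system allows the TOS byte to be set and that something at the far end (such as a UDP echo service) sends the packets back, and reports jitter as the variation between successive round-trip times:

import socket
import statistics
import time

TARGET = ("203.0.113.10", 7)      # placeholder address; assumes a UDP echo responder
DSCP_EF = 46                      # Expedited Forwarding, as used for voice

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_EF << 2)  # DSCP sits in the top 6 bits
sock.settimeout(2)

rtts = []
for i in range(20):
    start = time.monotonic()
    sock.sendto(b"probe", TARGET)
    try:
        sock.recvfrom(1500)
        rtts.append((time.monotonic() - start) * 1000)   # milliseconds
    except socket.timeout:
        pass
    time.sleep(0.02)                                      # 20 ms spacing, like a voice stream

if len(rtts) > 1:
    diffs = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
    print(f"avg RTT {statistics.mean(rtts):.1f} ms, jitter {statistics.mean(diffs):.1f} ms")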

Observed or Synthetic?
You have the option of monitoring performance either by spying on ‘real’ user traffic or by using a synthetic polling mechanism. There are advantages and disadvantages to both. The polling alternative obviously isn’t using real data, but it is much more easily controlled, more deterministic and better able to pinpoint problem areas. Watching user traffic gives you an exact imprint of the user experience, but it depends on users actually generating that traffic, so its usefulness can be variable.

Capacity Planning
In addition to providing an early warning system for problems, performance management should let you plan your network usage. You need to know how much traffic is on your network, what it is, and where it is going to and from. The incremental impact of every new application must then be monitored; with providers quoting lead times of months for bandwidth upgrades, you can’t afford to be caught out.

As a minimum you need to know the average and peak WAN utilisation figures, as WAN bandwidth is probably your most expensive asset and the least flexible to change. Historical trending is also important, as it lets you see how quickly utilisation is increasing.
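
As a simple illustration, given two readings of an interface byte counter such as SNMP’s ifInOctets (the figures below are made up), average utilisation over the interval can be worked out like this:

# Turn two successive readings of an interface byte counter (for example SNMP
# ifInOctets) into an average utilisation figure, allowing for 32-bit counter wrap.
COUNTER_MAX = 2**32               # ifInOctets is a 32-bit counter

def utilisation(old_octets, new_octets, interval_seconds, link_bps):
    delta = (new_octets - old_octets) % COUNTER_MAX   # handles a single wrap
    bits = delta * 8
    return 100.0 * bits / (interval_seconds * link_bps)

# Illustrative numbers: two readings five minutes apart on a 2 Mbit/s WAN link.
print(f"{utilisation(1_200_000_000, 1_238_000_000, 300, 2_000_000):.1f}% average utilisation")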

It’s not just bandwidth you need to watch — CPU and memory on routers, switches and servers also need to be monitored to ensure nothing is running out of steam.
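
One way to automate that, assuming the net-snmp command-line tools are installed and the device supports the standard HOST-RESOURCES-MIB (many routers expose CPU through vendor-specific MIBs instead, so treat the OID, address and community string as placeholders), is sketched below:

import subprocess

HOST = "192.0.2.10"                               # placeholder device address
COMMUNITY = "public"                              # placeholder community string
HR_PROCESSOR_LOAD = "1.3.6.1.2.1.25.3.3.1.2"      # hrProcessorLoad table

result = subprocess.run(
    ["snmpwalk", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, HR_PROCESSOR_LOAD],
    capture_output=True, text=True
)
loads = [int(value) for value in result.stdout.split()]
if loads:
    cpu_percent = sum(loads) / len(loads)         # average across processors
    print(f"{HOST} CPU load {cpu_percent:.0f}%")
    if cpu_percent > 80:
        print("WARNING: device running out of steam")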

With the amount of monitoring you have to do, you can’t afford for it to be a manual process. Once you set up a series of automated background polls, and pipe the information gathered directly into a spreadsheet for graphing, recording and alerting, there’s really minimal work required to keep on top of what’s happening on your network. Then, the next time your users report a perceived slowing down of ‘the network’, you’ll know exactly where the problem lies.
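
A minimal sketch of that kind of automation, with measure_response() standing in for whichever probe you actually run, appends each result to a CSV file a spreadsheet can graph and flags anything over a threshold:

import csv
import datetime

THRESHOLD_SECONDS = 2.0                 # illustrative alert threshold
LOGFILE = "wan_response_times.csv"      # placeholder filename

def measure_response():
    # Placeholder: return the latest response time in seconds from your probe
    # (an HTTP fetch, SNMP poll, ping, or whatever you have scheduled).
    return 1.3

response = measure_response()
timestamp = datetime.datetime.now().isoformat(timespec="seconds")

with open(LOGFILE, "a", newline="") as f:
    csv.writer(f).writerow([timestamp, f"{response:.3f}"])

if response > THRESHOLD_SECONDS:
    print(f"ALERT {timestamp}: response time {response:.2f}s exceeds {THRESHOLD_SECONDS}s")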