Troubleshooting Fibre Channel networks can be as much an art as it is a science, but there are some basic best practices you can follow to reduce the guessing and speed resolution. Here are ten tips to help you get to the bottom of pesky problems:
1. Generally, problems are reported by the application user. As a first step, the SAN admin will usually gather dumps, logs and traces. At the same time, he'll sometimes remove other users or less critical applications, or perhaps stop backups and remove other potential bottlenecks. While this may fix the immediate problem, it often prevents the underlying cause from being discovered. If you've only removed the symptom and you stop there, you're likely to see trouble later on.
2. Use real-time monitoring. Ask your vendors what they mean by "real time": a five-minute polling interval is not real time. If a fire starts in your kitchen, would you like to be alerted to it immediately or in five minutes?
Use the real-time alerting subsystem to get in front of issues before the application users feel the pain. We recently saw an example where we examined the I/O history leading up to an application outage and found plenty of obvious pointers four hours before the outage. If best-practice alerting had been set up, it's likely the outage could have been avoided.
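As a rough illustration of the alerting idea in tip 2, the sketch below checks every latency sample as it arrives instead of waiting for a polling summary. The class name, the 20 ms threshold and the three-sample window are all hypothetical, chosen only for the example; sensible values depend on your own arrays and applications.

```python
from collections import deque


class LatencyAlert:
    """Signal an alert when several consecutive samples exceed a threshold.

    threshold_ms and consecutive are illustrative assumptions, not
    recommended values.
    """

    def __init__(self, threshold_ms=20.0, consecutive=3):
        self.threshold_ms = threshold_ms
        # Remember only the most recent `consecutive` pass/fail results.
        self.recent = deque(maxlen=consecutive)

    def observe(self, latency_ms):
        """Record one sample; return True if the alert condition is met."""
        self.recent.append(latency_ms > self.threshold_ms)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


# Hypothetical per-second latency samples, in milliseconds.
samples = [4.1, 5.0, 22.3, 25.8, 31.0, 6.2]
alert = LatencyAlert()
fired_at = [i for i, s in enumerate(samples) if alert.observe(s)]
print(fired_at)  # indexes of the samples at which the alert fired
```

Requiring a few consecutive bad samples is one simple way to avoid alerting on a single blip; whether that trade-off suits you depends on how latency sensitive the application is.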
3. One of the first steps is to determine whether the user-reported problem correlates with what's happening on the SAN. But if you only investigate what the user is reporting, you may miss larger issues that affect other, slightly less latency-sensitive apps. It's useful to broaden the scope beyond just the immediate issue.
4. Having said that, you should customise existing canned reports so you can quickly focus on the suspected application or infrastructure and isolate the condition. We recently talked with a customer who quickly eliminated about 4,380 out of 4,400 SAN links, enabling them to focus on the remaining 20 links for in-depth trace analysis.
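To make the narrowing-down step in tip 4 concrete, here is a minimal sketch of filtering a link inventory to the handful worth tracing. The record layout (`id`, `applications`, `errors`) and the port names are invented for illustration; a real report would draw on whatever your monitoring tool exports.

```python
def suspect_links(links, application, min_errors=1):
    """Return links that serve `application` and show at least `min_errors` errors."""
    return [
        link for link in links
        if application in link["applications"] and link["errors"] >= min_errors
    ]


# Hypothetical inventory: each record is one SAN link.
links = [
    {"id": "sw1:p3", "applications": {"billing"}, "errors": 0},
    {"id": "sw1:p7", "applications": {"billing", "backup"}, "errors": 12},
    {"id": "sw2:p1", "applications": {"email"}, "errors": 40},
]

shortlist = suspect_links(links, "billing")
print([link["id"] for link in shortlist])
```

Note that `sw2:p1` drops off the shortlist despite its error count, because it doesn't serve the suspected application. That is exactly the kind of issue tip 3 says to come back to.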
5. Review environment inventories by device type and automatically discovered properties. Such things as manufacturer and link rate can be helpful in understanding special circumstances, such as the behaviour of a tape device or configuration settings that the admin might not be aware of, like links set to run at 1G instead of 4G. Enable users to provide their own context about devices, such as the applications they support, location, version, relationship to other equipment, etc.
6. As they isolate, correlate and analyse, our customers often report that most of the time they troubleshoot, they find that the SAN is not to blame. Tools that report on the effect of SAN latency alone on the application are very helpful in determining this. Tools that lump SAN and server latency together can't help with this.
7. Time correlation is critical to determine cause and effect. When you are looking at long time windows, you often can't tell which event preceded another, and that's when you get finger-pointing from one vendor to another. Try to find the finest granularity in your historical reporting. A one-minute interval is often not excessive.
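The point about granularity in tip 7 is easy to show with made-up numbers. In the sketch below, link errors begin one minute before latency rises; at one-minute granularity the order is clear, but once the same data is rolled up into five-minute buckets both events land in the same bucket. The data, thresholds and function names are all hypothetical.

```python
def bucket_max(samples, bucket_size):
    """Roll samples up into fixed-size buckets, keeping each bucket's maximum."""
    return [
        max(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]


def first_over(samples, threshold):
    """Index of the first sample above `threshold`, or None if there is none."""
    return next((i for i, v in enumerate(samples) if v > threshold), None)


# Hypothetical one-minute samples over ten minutes.
crc_errors = [0, 0, 0, 0, 0, 0, 9, 9, 9, 9]     # errors start at minute 6
latency_ms = [5, 5, 5, 5, 5, 5, 5, 40, 40, 40]  # latency rises at minute 7

# One-minute view: errors clearly precede the latency rise.
fine = (first_over(crc_errors, 0), first_over(latency_ms, 20))

# Five-minute view: both events fall in the same bucket, so order is lost.
coarse = (
    first_over(bucket_max(crc_errors, 5), 0),
    first_over(bucket_max(latency_ms, 5), 20),
)

print(fine)    # (6, 7)
print(coarse)  # (1, 1)
```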
8. Look at your historical I/O patterns, busy times of day, multipath configurations, queue depth settings, top talkers, etc. to build a profile of behaviour. Then compare it to your healthy baseline, and rule out things that haven't changed. You might find six things that appear to be going wrong, but if only one of those things seems to have occurred when the problem was reported, you can focus on that issue immediately. Later on, you can go back to look at the other issues.
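The baseline comparison in tip 8 can be sketched in a few lines. The metric names, the sample values and the 25 percent tolerance below are assumptions made for the example, not recommendations; the idea is simply to discard everything that hasn't moved so the remaining metrics stand out.

```python
def changed_metrics(baseline, current, tolerance=0.25):
    """Return metrics whose relative change from baseline exceeds `tolerance`."""
    changed = {}
    for name, base in baseline.items():
        now = current.get(name)
        if now is None or base == 0:
            continue  # cannot compute a relative change in this simple sketch
        change = (now - base) / base
        if abs(change) > tolerance:
            changed[name] = round(change, 2)
    return changed


# Hypothetical healthy baseline and current readings.
baseline = {"read_latency_ms": 5.0, "queue_depth": 32, "iops": 12000}
current = {"read_latency_ms": 14.0, "queue_depth": 32, "iops": 11500}

print(changed_metrics(baseline, current))
```

Here only read latency survives the filter; queue depth is unchanged and the small dip in IOPS is within tolerance, so both can be ruled out for now.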
9. When changes are made to fix the incident, you should get immediate feedback on whether they're having the desired effect. Sometimes a fix can make a problem worse, so it's good to know that as well. Without immediate feedback, you often have to delay or stagger fixes until you can determine the effect of each one. Or if you make all changes at the same time, you can be left wondering which change fixed the problem. Ongoing real-time monitoring can provide confidence that the problem was in fact solved.
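One simple way to get the feedback described in tip 9 is to compare latency samples taken just before and just after each change. This is only a sketch: the function name, the 1 ms minimum difference and the sample values are hypothetical, and a real check would want more samples and some allowance for normal variation.

```python
def fix_effect(before, after, min_delta_ms=1.0):
    """Classify a change by comparing mean latency before and after it."""
    delta = sum(after) / len(after) - sum(before) / len(before)
    if delta <= -min_delta_ms:
        return "improved"
    if delta >= min_delta_ms:
        return "worse"
    return "unchanged"


# Hypothetical latency samples (ms) around a single configuration change.
before_change = [30.0, 32.0, 31.0]
after_change = [8.0, 9.0, 7.0]

print(fix_effect(before_change, after_change))
```

Running a check like this after each individual change, rather than after a batch, is what lets you tell which change actually made the difference.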
10. Last, ask for help sooner rather than later. We've heard of problems dragging on for months, vendors kicked out of accounts and literally millions of dollars wasted on adding expensive hardware. There are things you can do yourself to speed troubleshooting and even prevent future problems, but look at the cost of waiting. Balance that against the cost of bringing in a performance pro: an expert consultant who spends all day, every day specialising in finding performance problems.