When it comes to debugging hard-to-diagnose software and operating system problems, there is no set recipe. Rather, debugging is all about "having the right tools and knowing how to use them," advised Microsoft technical fellow Mark Russinovich at the close of the Microsoft TechEd conference.
Among the highlights of each year's TechEd conference are the technology demonstrations. Smart Microsoft and partner engineers walk attendees through how to use some new technology in a step-by-step process, making it seem easy or even fun to deploy.
And one of the most popular demonstrations over the past few years has been Russinovich's "Cases of the Unexplained," in which he shows how he and others tracked down hard-to-pinpoint errors in Windows deployments.
This year, of course, was no exception. Before a packed auditorium, Russinovich debugged a number of tricky problems using only a handful of free tools, many created by Russinovich himself, including Process Explorer and Process Monitor. He borrowed many examples in his presentation from his blog, where he collects user stories of tough problems.
In the cases Russinovich demonstrated, the root causes of the misbehaving systems were not readily obvious. This was especially true of software that, he noted, offers little information about what went wrong when it crashes. "Programs do a bad job of telling what went wrong," he said. Yet he showed that it is possible to carefully track the symptom of a problem back to its cause.
One example Russinovich dubbed "the case of the slow website." This example was submitted to Russinovich by a system administrator from an unnamed company. The organisation's users were complaining of slow performance of some internal web pages. The admin tracked all the web pages to a single server, then ran Process Explorer, which shows all the processes on a server and how much memory and CPU resources each thread of a process is consuming.
The admin identified one thread that was hogging more than a quarter of the server's resources. Doing a web search, he found that the related process belonged to a Windows management driver that, in turn, communicated with the server chassis' management controller provided by the server manufacturer. The two components were having difficulty communicating, so the traffic between them spiked.
The difficulty turned out to be that the blade server was not slotted into the rack properly. The admin reseated the server chassis and the server quickly returned to delivering its web pages speedily.
Another problem came not from misbehaving equipment or software, but rather from user behaviour. "This case came into the Microsoft Exchange support team," Russinovich said.
Users complained that Microsoft Exchange would periodically delay responding for up to 30 seconds. Microsoft asked the customer to log the server's performance using Performance Monitor, which showed periodic spikes in CPU utilisation. Using ProcDump, a Microsoft engineer created a script that would capture all the process information whenever processor usage went above a certain threshold.
Looking through the results, the engineer found that an Exchange search function was consuming many of the cycles. The sluggishness arose because a number of users had enormous mailboxes, and searching through them would spike the server load. The admin instructed the users to reduce the size of their mailboxes, or organise them better. As a result of such tidying up on the users' part, server performance improved.
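The threshold-triggered capture used in the Exchange case can be illustrated with a minimal Python sketch. It is not the engineer's script and does not invoke ProcDump; it only shows the trigger logic, assuming CPU usage is sampled at regular intervals. The function name `find_capture_points`, the `sustain` parameter and the sample values are hypothetical, chosen for illustration.

```python
from typing import Iterable, List


def find_capture_points(samples: Iterable[float], threshold: float,
                        sustain: int = 1) -> List[int]:
    """Return the sample indices at which a capture would be triggered.

    A capture fires once CPU usage has stayed above `threshold` for
    `sustain` consecutive samples. The trigger then stays quiet until
    usage drops back to or below the threshold, so one long spike
    produces one capture rather than many.
    """
    triggers: List[int] = []
    streak = 0      # consecutive samples above the threshold
    armed = True    # False while the current spike has already been captured
    for index, cpu in enumerate(samples):
        if cpu > threshold:
            streak += 1
            if armed and streak >= sustain:
                triggers.append(index)
                armed = False
        else:
            streak = 0
            armed = True
    return triggers


# Hypothetical CPU utilisation log, in percent, one sample per interval.
cpu_log = [12.0, 15.0, 91.0, 95.0, 97.0, 20.0, 18.0, 93.0, 96.0, 14.0]
print(find_capture_points(cpu_log, threshold=90.0, sustain=2))  # [3, 8]
```

Requiring the spike to persist for more than one sample is a design choice that avoids capturing on momentary blips; a real tool would write a dump of the target process at each trigger point instead of recording an index.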
In a third case, Russinovich's wife had complained that the Windows Photo Gallery would hang after showing a movie. The bug was particularly annoying to her, as she was showing friends some home movies. A friend of hers even quipped, "This never happens with a Mac."
Russinovich reran the Photo Gallery software while capturing its activity in Process Monitor. "When in doubt run Process Monitor," he advised. He matched the time of the hang with the activity recorded at that time. While most of the operations were routine, he found an unusual system call, ironically enough, to an Apple QuickTime object, which was the source of the program hang. "Sure enough, it was from that company that doesn't know how to write Windows software," he joked.
Russinovich also showed the audience how to rid a machine of a bad case of malware. This example also came from a user submission, detailing how the infected computer had a particularly thorough piece of malware that blocked all attempts to run any sort of diagnostic, antivirus or system administration tools.
One way around the block, Russinovich advised, is to run another program he had written called Desktops, which lets the user set up four virtual desktops for the computer. The user can then switch among the desktops, each of which runs independently of the others. While not a diagnostic tool per se, Desktops could be used to repair the malware-riddled computer. The malware monitored any activity on the main desktop, but it was unaware of the other desktops, one of which Russinovich used to run antivirus tools.
Finally, no debugging session would be complete without diagnosing the infamous Windows Blue Screen of Death (BSOD) error. Despite the severity of the problem, "It is incredibly easy to do crash analysis" on a BSOD, he said. Russinovich explained that such a crash happens when something goes wrong within the operating system's kernel memory space, such as a device driver that tries to access memory allotted to another program. Because Windows' first priority is "to protect data," it will shut down as soon as a program acts outside allotted memory space, he said.
After a system crash, Microsoft will offer an analysis upon reboot of the machine, which can point to drivers that need to be updated or to other fixes. Even if this help message proves unhelpful, the administrator can check for the crash dump file that Windows produces when it crashes, Russinovich said. This is found either in the Windows directory or in a subdirectory called Minidump. A program called Windows Debugger can examine the file and provide more information about what possibly caused the crash.
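As a rough illustration of where to look, the following Python sketch lists candidate dump files under a Windows directory. It assumes the default settings, in which a full dump is written as `MEMORY.DMP` in the Windows directory and small dumps are written as `.dmp` files in the `Minidump` subdirectory; an administrator can change these locations, so treat the paths as assumptions. The function name `find_crash_dumps` is hypothetical.

```python
import os
from pathlib import Path
from typing import List


def find_crash_dumps(windows_dir: str) -> List[Path]:
    """List crash dump files under `windows_dir`, assuming default locations."""
    root = Path(windows_dir)
    dumps: List[Path] = []

    # Assumed default name of the full dump in the Windows directory.
    full_dump = root / "MEMORY.DMP"
    if full_dump.is_file():
        dumps.append(full_dump)

    # Small dumps, one file per crash, in the Minidump subdirectory.
    minidump_dir = root / "Minidump"
    if minidump_dir.is_dir():
        dumps.extend(sorted(
            p for p in minidump_dir.iterdir()
            if p.is_file() and p.suffix.lower() == ".dmp"
        ))
    return dumps


if __name__ == "__main__":
    # SystemRoot is normally set on Windows; the fallback is an assumption.
    for dump in find_crash_dumps(os.environ.get("SystemRoot", r"C:\Windows")):
        print(dump)
```

Any file the sketch finds can then be opened in Windows Debugger for analysis; the script itself only locates the files and does not interpret them.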
The presentation "makes you want to go home and crash your computer," one attendee said afterward.