Although most of us don't realise it, many of the storage devices in our computer systems use a standard called SMART (Self-Monitoring, Analysis and Reporting Technology) to store low-level information about their own behaviour. It's this low-level data that mission-critical systems use to predict impending failures, since a complete device failure is usually preceded by a number of transient errors that are recorded in the SMART data area.
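To make the prediction idea concrete, here is a minimal sketch of the threshold comparison used by ATA-style SMART attributes, where each attribute carries a normalised value and a vendor-set threshold, and the attribute counts as failed once the value drops to or below that threshold. The class name, function name and sample figures are all made up for illustration. They are not taken from SMARTMon, and SCSI devices report impending failure through a different mechanism.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    """One ATA-style SMART attribute (field names are illustrative)."""
    attr_id: int
    name: str
    value: int      # current normalised value; higher means healthier
    worst: int      # lowest normalised value seen so far
    threshold: int  # vendor-set failure threshold; 0 means "never fails"


def failing_attributes(attrs):
    """Return attributes whose normalised value has reached the threshold."""
    return [a for a in attrs if a.threshold > 0 and a.value <= a.threshold]


# Hypothetical readings, not taken from a real drive.
sample = [
    SmartAttribute(5, "reallocated-sectors", 100, 100, 36),
    SmartAttribute(197, "pending-sectors", 20, 20, 25),
]

for attr in failing_attributes(sample):
    print(f"attribute {attr.attr_id} ({attr.name}): "
          f"value {attr.value} <= threshold {attr.threshold}")
```

A monitoring tool would typically also watch for values that are falling towards the threshold, not just those that have already crossed it.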
SMARTMon and SMARTMon-UX are a pair of applications that system managers can use to monitor, and even change, the information that's held on the controllers of storage devices. SMARTMon is a GUI-based application for Windows only, while SMARTMon-UX is a more powerful command-line alternative that runs on Windows and a variety of other platforms, including Linux, Solaris, HP-UX, IRIX and (to a limited extent) Mac OS X. SMARTMon is available in a "personal" version, which allows read-only monitoring of up to two devices with no automated alerting of faults, and a "server" version, which has none of these restrictions. SMARTMon-UX comes only in the "server" version. We looked at the Linux version on our Red Hat 9 machine.
At the basic device level, SMARTMon supports SCSI, IDE, SSA and Fibre Channel storage devices. If you have multiple devices connected to a RAID card, such that actual disks are hidden from view by the computer, you won't be able to manage these devices individually (unless you're using a Mylex RAID adaptor, which is specifically supported by SMARTMon-UX). The package also understands how to communicate with multi-device enclosures that support SCSI Enclosure Services (SES) or SAF-TE. So, as well as flagging the condition of disks, it can tell you about anomalies with any other variables that SES is able to measure, such as the temperature inside the enclosure.
Installing both packages is very easy. The Windows GUI application has a traditional installer. On the Unix/Linux version you simply run an installer script and answer some questions (where you want to put the documentation, what email address to send alerts to, etc) and watch the installer do the rest.
For monitoring and alerting, you tell the package how frequently to check for issues and the email address to which it should send alerts, then leave it to watch the world. But although the main motivation for purchasing SMARTMon is likely to be the desire to be notified when your disks start to get a bit wobbly, there's a whole load more you can do with it, albeit with care and some prior investigation. You can interrogate and modify SES or SAF-TE settings on your enclosures, ping devices on your Fibre Channel SAN, read the diagnostics on your tape drives (to check, for example, whether the "clean me" light is on, or whether the current tape is write-protected) and even deduce the link speed of any connected device.
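The check-and-alert cycle described above can be sketched roughly as follows. This is a generic illustration, not how SMARTMon is implemented: `read_status` stands in for whatever queries a device, and the device paths and email address are placeholders. The sketch only builds the alert messages, and sending them is left out.

```python
from email.message import EmailMessage


def build_alert(recipient, device, problems):
    """Build (but do not send) an alert email for one device."""
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = f"Storage alert: {device}"
    msg.set_content("\n".join(problems))
    return msg


def check_once(read_status, devices, recipient):
    """Run one monitoring pass and return an alert per unhealthy device."""
    alerts = []
    for device in devices:
        problems = read_status(device)  # list of problem strings, or empty
        if problems:
            alerts.append(build_alert(recipient, device, problems))
    return alerts


def fake_read_status(device):
    """Hypothetical stand-in for a real device query."""
    return ["predicted failure reported"] if device == "/dev/sdb" else []


alerts = check_once(fake_read_status, ["/dev/sda", "/dev/sdb"],
                    "admin@example.com")
for alert in alerts:
    print(alert["Subject"])
```

In a real monitor, `check_once` would be called repeatedly with a pause of the configured interval between passes, and each message would be handed to a mail server.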
One of the most versatile, and dangerous, features of SMARTMon-UX is the ability to edit the "mode pages" on the drive controllers. The mode pages are where drives store all their fundamental parameters, such as power-saving or cache settings, and even the name of the disk that the BIOS sees. It's here, for instance, that a big-name system vendor can make a disk claim to be an Acme 1234Z when it's actually just a bog-standard, off-the-shelf Seagate or WD model with the write cache turned off and the name changed in the mode page.
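As an illustration of what a cache setting in a mode page looks like, the sketch below decodes the first bytes of a SCSI caching mode page (page code 08h), where the write cache enable (WCE) flag is bit 2 of byte 2 and the read cache disable (RCD) flag is bit 0. It works on a hand-made byte string only. It does not talk to a drive and says nothing about how SMARTMon-UX itself reads or writes the page, and the function names are our own.

```python
CACHING_PAGE_CODE = 0x08
WCE_BIT = 0x04  # byte 2, bit 2: write cache enable
RCD_BIT = 0x01  # byte 2, bit 0: read cache disable


def parse_caching_page(page: bytes) -> dict:
    """Decode the header and cache flags of a SCSI caching mode page."""
    if len(page) < 3:
        raise ValueError("caching mode page is too short")
    page_code = page[0] & 0x3F  # low six bits of byte 0
    if page_code != CACHING_PAGE_CODE:
        raise ValueError(f"not a caching page: code {page_code:#04x}")
    return {
        "page_code": page_code,
        "page_length": page[1],  # number of bytes following byte 1
        "write_cache_enabled": bool(page[2] & WCE_BIT),
        "read_cache_disabled": bool(page[2] & RCD_BIT),
    }


def set_write_cache(page: bytes, enabled: bool) -> bytes:
    """Return a copy of the page with the WCE bit set or cleared."""
    buf = bytearray(page)
    if enabled:
        buf[2] |= WCE_BIT
    else:
        buf[2] &= ~WCE_BIT & 0xFF
    return bytes(buf)


# Hand-made example page: code 08h, length 12h, WCE set, remainder zeroed.
sample_page = bytes([0x08, 0x12, 0x04]) + bytes(17)

print(parse_caching_page(sample_page))
print(parse_caching_page(set_write_cache(sample_page, False)))
```

Writing a modified page back to a real drive is a separate and riskier step, which is why the sketch stops at building the new byte string.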
Because all the commands seem to be run via a single command-line executable, you end up chanting some pretty weird incantations on the command line. However, because some command types are device- or even vendor-related, you'll probably find, like us, that you only use a subset of the available commands anyway, so it's not a great problem. Although there are loads of possible functions to perform, the enclosed HTML-based manual is extremely informative and easy to use.
SMARTMon is another of those tools we keep coming across that is so useful and so inexpensive that you'd be tempted to buy it just for the sake of having it to hand. The pre-failure monitoring and alerting facility is worth the price on its own, and the ability to get more adventurous (albeit to differing extents depending on what hardware you have) and poke about in the low-level settings of devices is a valuable extra.
This kind of software can do destructive things to your disks if you set something incorrectly. So, you should make sure that you understand how your devices work, preferably with the help of the vendor's low-level docs, before changing crucial settings.