SATA drive failures can be caused by firmware lockups and can be cured by a drive reset.
Jason Williams, the COO and CTO of DigiTar, a secure e-mail outsourcing supplier in the USA, says that 60 percent of his SATA drive failures are not really drive failures at all.
DigiTar stores the e-mails on a Sun StorageTek X4500 server/storage system running ZFS. ZFS declares a drive failed once enough I/Os to it have failed, and traditionally the drive then gets replaced.
But Williams found out that when a Pillar Data Systems array detects a SATA drive failure it resets the drive and tries the I/O again. Most of the time this works, because the reset clears the firmware lockup and the drive starts working properly again.
ZFS doesn't do this. So he had the Engenio controllers in the Sun StorageTek arrays he is using ignore the ZFS drive failure signal and reset the apparently failed drive instead. The drive then works okay 60 percent of the time.
What he would ideally like is for Sun to amend ZFS so that when it detects a SATA drive failure it first of all queues all writes. Then it issues a drive reset and tries the writes again, doing a checksum verification to ensure the I/O is correct. If the verification passes, ZFS could just carry on.
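The sequence Williams describes (queue the writes, reset the drive, retry, verify checksums) can be sketched roughly as follows. This is only an illustration in Python, not ZFS code: `SimulatedDrive`, `DriveError`, `checksum` and `flush_with_reset` are hypothetical names invented here, the drive is an in-memory stand-in, and the single-reset limit is an assumption rather than anything Williams specified.

```python
import hashlib
from collections import deque


class DriveError(Exception):
    """Raised when the simulated drive fails an I/O."""


class SimulatedDrive:
    """Hypothetical stand-in for a SATA drive whose firmware can lock up."""

    def __init__(self, locked_up=False, dead=False):
        self.locked_up = locked_up  # transient fault, cleared by reset()
        self.dead = dead            # permanent fault, reset() does not help
        self.blocks = {}

    def write(self, lba, data):
        if self.locked_up or self.dead:
            raise DriveError("write failed")
        self.blocks[lba] = data

    def read(self, lba):
        if self.locked_up or self.dead:
            raise DriveError("read failed")
        return self.blocks[lba]

    def reset(self):
        # A reset only clears the firmware lockup, not a real hardware fault.
        self.locked_up = False


def checksum(data):
    return hashlib.sha256(data).hexdigest()


def flush_with_reset(drive, writes):
    """Try each queued write; on the first failure reset the drive once and retry.

    Returns True if every write lands and verifies, or False if the drive
    still fails after the reset and should be treated as genuinely failed.
    """
    pending = deque(writes)  # writes stay queued until they are verified
    reset_done = False
    while pending:
        lba, data = pending[0]
        try:
            drive.write(lba, data)
            # Read back and compare checksums to confirm the I/O is correct.
            if checksum(drive.read(lba)) != checksum(data):
                raise DriveError("checksum mismatch")
        except DriveError:
            if reset_done:
                return False  # the reset did not help: a real failure
            drive.reset()
            reset_done = True
            continue          # retry the same write after the reset
        pending.popleft()
    return True
```

In this sketch a drive created with `locked_up=True` recovers after the reset and accepts all queued writes, while one created with `dead=True` still fails after the reset and would go on to be replaced as usual.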
Please, ZFS developers, he says, put SATA drive reset logic like this into ZFS. It will avoid needless SATA drive replacement.