One of my clients had an interesting problem yesterday. I got the call at 9:15am from their techie, who told me that the website was broken.
There were two bits of good news. First, when the site's broken it presents a very nice "I'm broken" page, so at least it didn't look crap to customers. Second, I'm not responsible for that server so I didn't get the blame - I just got a "can you fix it because nobody else is available" plea.
After a few minutes, I discovered that the problem was a full disk on the partition where the database keeps its files. This struck me as a bit odd since that partition is over 50GB and the database is around 5GB, so I thought I'd have a look about on that partition.
The first thing I found was a directory that seemed to be where the backup routines store files temporarily on their way to the backup media. It had over 20GB of stuff in it, including some big archive files that were clearly not required any more - so at least this let me free up a bit of space and start up the database. With the site back up, I could relax a bit and look around with a tad less urgency.
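For what it's worth, the "look around the partition" step can be scripted. What follows is a rough sketch rather than what I actually ran on the day: the function names `dir_size` and `largest_entries` are made up for illustration, and it reports apparent file sizes, not blocks actually allocated on disk.

```python
import os


def dir_size(path):
    """Total apparent size in bytes of the regular files under path."""
    total = 0
    # os.walk does not follow symlinked directories by default;
    # unreadable directories are silently skipped.
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.lstat(full).st_size
            except OSError:
                pass  # file vanished or is unreadable; ignore it
    return total


def largest_entries(top, count=10):
    """Return up to `count` (size, path) pairs for the biggest entries in top."""
    sizes = []
    with os.scandir(top) as entries:
        for entry in entries:
            try:
                if entry.is_dir(follow_symlinks=False):
                    size = dir_size(entry.path)
                else:
                    size = entry.stat(follow_symlinks=False).st_size
            except OSError:
                continue
            sizes.append((size, entry.path))
    sizes.sort(reverse=True)  # biggest first
    return sizes[:count]
```

Point `largest_entries` at the mount point of the full partition and the backup staging directory and the oversized inbox should float to the top. One caveat: the walk will happily cross into other filesystems mounted below that point, so treat the numbers as a guide.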
The next discovery was an email inbox file. A 6GB one, belonging to the user ID under which all the scheduled jobs on that server run. Pause the mail server, rename it to "inbox.orig", restart the mail server and zip up the old one - another 5GB saved.
We had lots of free space by this time, so I started to muse about why this problem had suddenly happened. What had we changed that could cause disk usage to multiply by 10?
Then the penny dropped.
In the week, the DBA had re-synchronised the databases. We run replication between that server and another, and the replicas had got a bit out of sync for reasons I won't bore you with. To re-sync stuff you basically stop the DBMS on both ends, copy the files from the master to the slave, then start up again. Both ends are usually down for about an hour, as this is how long the files take to copy (they're in different countries, so we're talking Internet link speeds, not LAN ones).
Now, whenever someone goes to the website and it hits a database error, it logs that fact and emails the sysadmin. There's also a scheduled job that runs every minute to check the database, and it does the same log-and-email routine if there's a problem. The job scheduler emails the sysadmin every time it completes a scheduled job. Then there's the backup job that archives the files on disk and copies them off to a temporary location for writing to backing store.
When everything is running fine, then, all you get is one small email in the inbox for each scheduled run, and there's no noticeable increase in disk space used. But when the database is down, you get not just the scheduler's email but also the alert the database checker sends out, plus one for every website visitor who hits the error. Once you've had a few thousand of those, the backup routine goes and duplicates the lot for you. So any problems get magnified and then duly copied! Thankfully we didn't also have the log files writing to that partition - it'd have filled up even faster if we did.
The moral of the story? There are several:
1. Did anyone ever read this email inbox that had grown to 6GB? Probably not - it was only used by programs on that server for routine notifications. We could probably turn off the notifications altogether.
2. If you have monitoring systems, consider how they'll behave when something goes wrong. For instance, do you really want 5,000 notifications when the database is down, or do you want it to stop bugging you after the fifth attempt?
3. If you do something unusual to your server, watch it like a hawk afterwards.
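
To make point 2 a bit more concrete, here's a minimal sketch of a checker that stops bugging you after the fifth alert and then sends a single note when things recover. It isn't the code on that server: the `AlertThrottle` class, the `send` callable and the default limit of five are all assumptions for illustration, and a real version would need to keep its counter somewhere that survives between scheduled runs.

```python
class AlertThrottle:
    """Send the first few alerts for an outage, then stay quiet until recovery."""

    def __init__(self, send, max_alerts=5):
        self.send = send              # hypothetical callable that delivers a message
        self.max_alerts = max_alerts  # alerts allowed per outage
        self.failures = 0             # consecutive failed checks so far

    def report(self, ok, message=""):
        if ok:
            # Only announce a recovery if there was an outage to recover from.
            if self.failures > 0:
                self.send(f"Recovered after {self.failures} failed checks")
            self.failures = 0
            return
        self.failures += 1
        if self.failures < self.max_alerts:
            self.send(message)
        elif self.failures == self.max_alerts:
            # Last alert of this outage: say so, then go silent.
            self.send(f"{message} (alert {self.max_alerts}; "
                      "further alerts suppressed until recovery)")


# Simulate 5,000 failed checks followed by one successful check.
sent = []
throttle = AlertThrottle(sent.append, max_alerts=5)
for _ in range(5000):
    throttle.report(False, "database is down")
throttle.report(True)
print(len(sent))  # 6: five alerts plus one recovery notice
```

The same outage that would have dropped 5,000 messages into the inbox leaves six, and the last alert tells whoever reads it that silence from then on doesn't mean the problem has gone away.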