So, you've all heard people talk about a computer crashing. I want to describe to you what happens when a server crashes at a medium-sized organization.
This particular crash was interesting because the server wasn't completely dead. Sure, we've all had power supplies die, hard drives crap out, BSODs and other kinds of issues happen. Those are usually complete either-or propositions. Either the machine works or it doesn't. Very seldom do we have the situation where the system works - sorta-kinda. Well, that's exactly what happened in this situation.
The server in question is four years old, with a three-year-old Dell PowerVault PV-220 RAID-5 enclosure. The array holds 1.5 TB of data, storing user files and research data. Of the 8 hard drives in the array, one is a hot spare. Because it is RAID-5, any one of the drives can fail; the hot spare then comes online automatically and the RAID rebuilds onto it. Well, that's what is supposed to happen.
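For anyone wondering how a RAID-5 array can lose a drive and still rebuild, here is a minimal sketch of the parity idea in Python. It is a toy illustration, not the PowerVault controller's actual logic: the block contents and the `xor_blocks` helper are made up for the example, and a real array rotates parity across drives and works on much larger blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy stripe: three data blocks plus one parity block (hypothetical contents).
data = [b"user", b"file", b"data"]
parity = xor_blocks(data)

# Simulate losing the second drive, then rebuild its block
# from the surviving data blocks plus the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

The rebuild only works when exactly one block per stripe is missing. Lose a second drive before the rebuild finishes and there is nothing left to XOR against.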
I did have a drive indicate possible failure, but it was never swapped out for the hot spare. The server started serving out a number of corrupted files from this array and from a normal (non-RAID) drive. Upon reboot, the drive system showed six of the eight drives as having completely failed. That's not supposed to happen either.
I was able to force the drives back into an online mode and bring the array back online, but the NTFS file structure was corrupted. The server needed about 72 hours to rebuild the NTFS structure. Unfortunately, we needed the server to be back online within 12 hours, so we forced it back up the day after the crash. The data on the RAID array looked like it had been completely rebuilt, with only some files lost to corruption. Halfway through the next day, though, we realized that the drive hadn't been rebuilt correctly and wasn't stable... users were losing files and directories throughout the day.
So, back to square 1 - we brought a new server online with the backup data. However, guess what - the backup wasn't completely up-to-date. Some files were 2-3 weeks old, while others were completely current. So, 2 days later, we were bringing up backups that were not current - yeah, people were just short of screaming at me.
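A stale backup like this is something you can catch before the crash. Below is a rough sketch, assuming the backup is a plain mirrored folder tree that you can walk alongside the live data; the `stale_backups` function name, the one-day threshold and the folder layout are all my own illustration, not what our backup software actually does.

```python
import os


def stale_backups(source_root, backup_root, max_lag_seconds=24 * 3600):
    """Return relative paths whose backup copy is missing, or older than
    the live file by more than max_lag_seconds (compared by modification time)."""
    stale = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(backup_root, rel)
            if not os.path.exists(dst):
                # The file was never backed up at all.
                stale.append(rel)
                continue
            lag = os.path.getmtime(src) - os.path.getmtime(dst)
            if lag > max_lag_seconds:
                # The backup copy lags the live file by more than the threshold.
                stale.append(rel)
    return sorted(stale)
```

Run something like this on a schedule and read the report it produces. A file that is 2-3 weeks behind shows up on the list long before anyone needs to restore it.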
That's all fine in my mind - because people should be making their own backups of their own data. That's what I tell them to do, but not everyone listens to what I tell them.
I was able to bring up the old server with the suspect drive the next week - I ran the rebuild over the weekend. While there was file corruption on individual files, I was able to bring back some files that people had lost. Others recovered their lost work by spending a day or two re-doing what they had done in the past couple of weeks.
The crash has taught me a couple of things:
Most users will not do their own backups. They rely on systems too much and make assumptions that nothing will ever go wrong. While this is a bad assumption on the user's part, it *IS* the base assumption that most users have.
IT Managers must live up to the user's expectation, regardless of how unrealistic that expectation is.
There is a middle-ground between a system working and a system failing. That middle ground sucks.
Disaster Recovery Planning needs more attention in small-to-medium-sized organizations. Something as simple as a single server crash can highlight faulty backup processes, required services and end-user expectations.
Oh, and Murphy's Law applies to IT. The server crashed 30 minutes before my IST511 class where I was supposed to do a class presentation and could not miss.
So, maybe you're laughing right now. Maybe you're thinking, "Gee, so what's new?" Maybe you're thinking, "I'm glad it's not me!" Whatever your reaction - I hope this blog post has made you think about how you back up your important data. Having a solid, reliable, reproducible, transparent and easy-to-use backup system for things that are important to you is a key ingredient in your ability to survive even the most complicated failure. Maybe I should post my philosophy on how to back up your data for your own protection... watch for that blog posting later.