Man, it's been a painful 36 hours or so, but everything is finally calming down. What happened? It's a bit of a long story, but here goes.
Around 3:00 AM Sunday morning, as I was working on posting news stories from Comic-Con International in San Diego, the CBR server went offline. It was completely unresponsive. OK, servers sometimes do that, so I remotely restarted the machine and it came back online within minutes. Phew, no problem. Except it did the same thing again 15 minutes later. I restarted the machine once again, but this time it crashed 10 minutes later. What was going on?
So I called the data center and that discussion led me to believe there was some sort of hardware failure going on, but what exactly I couldn't tell until I got down there. It was now past 4:00 AM and driving back to Los Angeles immediately was a very bad idea. With only five hours of sleep the night before, the two-hour drive home to LA would not be smart. Plus, even if I arrived by 6:00 AM, if there was hardware that needed to be purchased I couldn't get to a store until 9:00 AM. I decided to pack up and get a couple of hours of sleep before heading back north.
I woke up around 8:00, packed up all my stuff and gear, said goodbye to the CBR crew and was on the road by 9:00. I arrived at my data center around noon and discovered that the fan in the power supply on the CBR server wasn't spinning. OK, that made sense. The power supply was overheating and, like a good little power supply, when it got too hot it shut itself down to protect the machine. I quickly drove to the nearest electronics store, purchased a new power supply, headed back to the data center and by 2:00 the server was back online. Crisis solved.
Of course, as you well know, that wasn't the end of the story. About four hours later the server started to act very strangely. It would serve up web pages very slowly and the load average (roughly, the average number of processes running or waiting to run on the server) was going sky high. That's when I called one of my sysadmins, Mark Luntzel, who took a look around. We worked on things and by 6:30 it looked like things were A-OK once again. At that time I realized I desperately needed some sleep, as the two and a half hours I got previously just wasn't enough.
At 9:00 Sunday night I woke up to find the server doing very weird things once again. What was going on with this thing? AHHHHH! Mark said it appeared that there could very well be a hard disc on its way out, which meant we needed to act fast. This began my second visit to the data center. Mark and I worked on getting the data from the Web site off the server as quickly as we could. With about one gig's worth of data left to transfer, the hard drive started throwing out all sorts of errors. There was no way, given the tools we had, to access the data on the hard disc. It appeared the hard disc was dead.
This is when the panic set in. The last good backup we had was from Friday morning, so all the work we had done throughout the weekend would have to be reinserted and edited. Not the end of the world, but a lot of work nonetheless. Plus some of it was done on the fly. But it was now 3:00 AM and both of us were absolutely exhausted. As anyone who's worked on a server during a failure of this sort could tell you, it's not always smart to work when you're exhausted. Mistakes are easily made and things could have gotten worse. I decided we should stop working and get some sleep, then I'd take the hard disc to a data recovery specialist in the morning.
Monday morning I woke up and started calling around to data recovery specialists in the LA area. I was getting quotes from a $750 minimum up to $5000 to retrieve the data off the drive. It was going to depend on how big a job it ended up being. Obviously this wasn't something I was looking forward to. While I was calling around, Mark called me up and told me to hook that drive up to my home server and see if it could read the drive. I did just that and guess what? It worked just fine. In the end it wasn't a failing hard disc, but rather a failing motherboard. As best we can tell, when the power supply was overheating, it damaged the hard disc controller on the motherboard. This was why it was impossible to get the data off the drive.
Once we figured that out I immediately made a new backup of all the data on a home machine, just in case something else went wrong. Once that was done we headed back down to the data center to work on restoring CBR.
I had an extra server in my rack that wasn't being used, so the decision was made to take the data from the old hard disc and put it on that spare server. This migration went surprisingly smoothly, taking only three hours to move almost eight gigs of data and to configure the CBR scripts and whatnot to work properly on the new machine. Having done a number of migrations in my time, I can say this was actually one of the smoothest I've ever experienced.
The upside to all of this is that the main CBR Web site is now on a server that's almost three times as fast as the old one and it's running current software throughout (kernel, web server, database, etc.). The old CBR server was just that, old, and it was time to upgrade anyway. Of course, the downside was a lot of extended downtime. This didn't make me happy, but what can you do?
While most everything works fine with CBR now, there are a few things that are still busted. The CBR search engine is having some small issues, as is the CBR Polls software, but both can be fixed easily once I bring my programmer in to have a look, which should be sometime within the next 24 hours. Outside of those two issues, I'm not aware of any other problems. If you see anything, and I mean anything, that looks out of the ordinary, please e-mail me immediately at jonah *at* comicbookresources.com.
Now it's time to rest up a little bit and then get back to work. I kind of feel like I ran a marathon... for five days! Comic-Con International is a lot of hard work with very little sleep. Add to that the stress and hard work involved in recovering from a server failure like this and I can say I feel completely wiped out. That being said, we have some catching up to do as there are another handful of stories to post from this weekend, as well as a number of columns and new announcements to put online. I'll be working through the night and into the wee hours of the morning to get us caught up.
Thanks to all of you who wrote expressing your concerns and sharing your sympathies and ideas during this time. It really was appreciated. And thanks to all our site visitors and our advertisers for your patience and understanding during this crisis. We'll make it all up to you in the coming weeks. Take care and thanks for visiting CBR.
Executive Producer, CBR