EmailDiscussions.com  


FastMail Forum All posts relating to FastMail.FM should go here: suggestions, comments, requests for help, complaints, technical issues etc.

Old 11 Nov 2005, 11:18 PM   #1
EdGillett
Member
 
Join Date: Jul 2002
Location: Guildford, UK
Posts: 88
Post Status Blog - Update inline with discussions

OK, even though I've been hosing my day going around these boards today, I completely missed the post that Jeremy made until just now:

http://www.emaildiscussions.com/...=&pagenumber=7

(Direct link above, with Jeremy's reply quoted beneath.)

"The hard drive enclosure that failed is RAID 6. That means that it can survive 2 hard drive failures. Unfortunately, 2 drives failed within half an hour of each other, and then a 3rd drive started getting intermittant errors. As a result, we got the file system corruption. Drives fail regularly, but you almost never notice, because the RAID redundancy means that things normally continue without any user impact.

In 10 hours' time, the first user restores will occur. 6 hours after that (i.e. in 16 hours' time), 75% of server4 users will have their service restored. The final restores will be complete about 45 hours after they start (i.e. 55 hours from now).

We have learnt a lot from this failure - we haven't had a RAID array fail before so we haven't really known what it would look like. What we've learnt in particular is that we should have many small volumes, rather than one big one. That way, in the case of file system corruption, we can simply restore one small volume, which would not take so long.

We're also looking at a fully replicated environment, where a file system corruption would only impact one server in a replicated pair; the other server would take over without any user impact. Replication has its own problems (e.g. the complexity of replication can slow performance, and cause software reliability issues), but with the work that Cambridge University have done on IMAP replication, it is getting to the point where we may be able to use it."
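
For anyone not familiar with RAID 6, the arithmetic in Jeremy's post boils down to something like this. A rough sketch only; the drive count and size in the example are numbers I've made up, not FastMail's actual hardware:

[code]
# Rough sketch of the RAID 6 arithmetic from Jeremy's post: dual parity means
# the array survives any two drive failures, but a third fault during the
# rebuild window is what leads to the kind of corruption described above.
# (Drive count and size below are made-up example numbers, not FastMail's.)

def raid6_usable_capacity(num_drives, drive_size_gb):
    """RAID 6 reserves the equivalent of two drives for parity."""
    assert num_drives >= 4, "RAID 6 needs at least 4 drives"
    return (num_drives - 2) * drive_size_gb

def raid6_survives(failed_drives):
    """Data stays recoverable as long as no more than two drives are lost."""
    return failed_drives <= 2

print(raid6_usable_capacity(12, 400))        # e.g. 12 x 400GB drives -> 4000GB usable
print(raid6_survives(2), raid6_survives(3))  # True, False
[/code]

Two failures within half an hour plus a flaky third drive is about the worst hand a RAID 6 array can be dealt.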

Jeremy, I know you're busy, but can these status updates be put into the status blog at the same time?

I know it's not exactly status, but it's news about what's going on, and on days like today it's important that people don't miss a response from FastMail.

Either highlight responses in a new thread, or add the info to the status blog (for those of us who missed that one amidst the chaos and madness of lots of us venting and explaining our backup strategies).

Hopefully this new thread surfaces it for anyone else who missed it.

Old 11 Nov 2005, 11:21 PM   #2
EdGillett
Member
 
Join Date: Jul 2002
Location: Guildford, UK
Posts: 88
Whilst we're at it ..

RAID systems are only as good as the monitoring and alerting that goes along with them.

You guys are having a really horrible run of luck if a RAID 6 array got hosed, but RAID systems come with diagnostics, and being warned of an imminent failure before it takes the array down is exactly what they're for.
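
Something along these lines is the kind of alerting I mean. Just a sketch on my part, in Python, assuming smartmontools is installed; the device names are placeholders, nothing to do with FastMail's actual layout:

[code]
# Sketch only: poll SMART health on each drive and shout before the array
# loses its redundancy. Assumes smartmontools is installed; the device list
# is a placeholder.
import subprocess

DRIVES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]

def drive_healthy(device):
    """True if 'smartctl -H' reports the drive's overall health as PASSED."""
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    return "PASSED" in result.stdout

failing = [d for d in DRIVES if not drive_healthy(d)]
if failing:
    # In real life this would page someone, not just print.
    print("WARNING: drives reporting problems:", ", ".join(failing))
[/code]

Run something like that from cron every few minutes and you hear about a dying drive while the array still has parity to fall back on.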

And yes, more, smaller volumes rather than one big one.

Seems like IMAP redundancy is more of a minefield than we may have thought; personally I'm used to mirroring SQL servers, web servers etc. Is the issue with Cyrus? The MySQL DB backend?
Old 11 Nov 2005, 11:22 PM   #3
CML209
Essential Contributor
 
Join Date: Feb 2004
Posts: 328
More importantly - have they learned to heed the customers' requests to be put on a better server?

I would weather this outage in a heartbeat if I knew that the paid accounts (or mine) would be getting what was paid for in the future.
Old 11 Nov 2005, 11:32 PM   #4
Xenna
Junior Member
 
Join Date: Apr 2004
Posts: 23
Quote:
Originally posted by CML209
More importantly - have they learned to heed the customers' requests to be put on a better server?

I would weather this outage in a heartbeat if I knew that the paid accounts (or mine) would be getting what was paid for in the future.
This is probably the cause of the problem. They have heeded these requests and bought the latest, greatest, bleeding-edge RAID 6 server. We are now experiencing the pleasures of being at the forefront of technology.

X.

Last edited by Xenna : 11 Nov 2005 at 11:37 PM.
Old 11 Nov 2005, 11:32 PM   #5
EdGillett
Member
 
Join Date: Jul 2002
Location: Guildford, UK
Posts: 88
Again ...

... changing servers is not the issue. Any one server can fail.

It's knowing that the infrastructure can cope with the loss of hardware and carry on going. Which means load balancing, server redundancy, and data kept separate from the servers.

Sounds like they've had a really rough time of it with 3 hard drives failing in their RAID 6 array. If they had 13 servers all pointing to the same RAID array to pick up their data, they would still be screwed right now.

And with two 4TB volumes (that's 4000GB each, 8000GB in total) to back up, that needs some serious management and maintenance.
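
Back of the envelope on that restore, and the throughput figure here is purely my own guess, not anything FastMail have published:

[code]
# Back-of-the-envelope restore timing. The 50 MB/s sustained rate is my own
# assumption; the volume sizes are the figures above.
volume_size_gb = 4000
volumes = 2
restore_rate_mb_per_s = 50

hours_per_volume = (volume_size_gb * 1024) / restore_rate_mb_per_s / 3600
total_hours = hours_per_volume * volumes

print(round(hours_per_volume, 1), round(total_hours, 1))  # ~22.8 and ~45.5 hours
[/code]

Which lands in the same ballpark as the 45-hour restore window Jeremy quoted, and is exactly why many smaller volumes would hurt so much less.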

Hmm... the FastMail team need to lean on their hardware vendors to work this one through, methinks.
Old 11 Nov 2005, 11:34 PM   #6
CML209
Essential Contributor
 
Join Date: Feb 2004
Posts: 328
Re: Again ...

Quote:
Originally posted by EdGillett
... changing servers is not the issue. Any one server can fail.

It's knowing that the infrastructure can cope with the loss of hardware and carry on going. Which means load balancing, server redundancy, and data kept separate from the servers.

Hmm... the FastMail team need to lean on their hardware vendors to work this one through, methinks.
I know any server can fail, but it seems server4 has the worst track record.