EmailDiscussions.com  


FastMail Forum All posts relating to FastMail.FM should go here: suggestions, comments, requests for help, complaints, technical issues etc.

Old 11 Nov 2005, 11:18 PM   #1
EdGillett
Member
 
Join Date: Jul 2002
Location: Guildford, UK
Posts: 88
Post Status Blog - Update inline with discussions

OK, even though I've been hosing my day going around these boards today, I completely missed the post that Jeremy made until just now:

http://www.emaildiscussions.com/...=&pagenumber=7

(Direct link above, with Jeremy's reply quoted beneath.)

"The hard drive enclosure that failed is RAID 6. That means that it can survive 2 hard drive failures. Unfortunately, 2 drives failed within half an hour of each other, and then a 3rd drive started getting intermittant errors. As a result, we got the file system corruption. Drives fail regularly, but you almost never notice, because the RAID redundancy means that things normally continue without any user impact.

In 10 hours' time, the first user restores will occur. 6 hours after that (i.e. in 16 hours' time), 75% of server4 users will have their service restored. The final restores will be complete about 45 hours after they start (i.e. 55 hours from now).

We have learnt a lot from this failure - we haven't had a RAID array fail before so we haven't really known what it would look like. What we've learnt in particular is that we should have many small volumes, rather than one big one. That way, in the case of file system corruption, we can simply restore one small volume, which would not take so long.

We're also looking at a fully replicated environment, where a file system corruption would only impact one server in a replicated pair; the other server would take over without any user impact. Replication has its own problems (e.g. the complexity of replication can slow performance, and cause software reliability issues), but with the work that Cambridge University have done on IMAP replication, it is getting to the point where we may be able to use it."
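
For anyone not familiar with RAID 6, the arithmetic in Jeremy's post boils down to something like this. A rough sketch only; the drive count and size in the example are numbers I've made up, not FastMail's actual hardware:

[code]
# Rough sketch of the RAID 6 arithmetic from Jeremy's post: dual parity means
# the array survives any two drive failures, but a third fault during the
# rebuild window is what leads to the kind of corruption described above.
# (Drive count and size below are made-up example numbers, not FastMail's.)

def raid6_usable_capacity(num_drives, drive_size_gb):
    """RAID 6 reserves the equivalent of two drives for parity."""
    assert num_drives >= 4, "RAID 6 needs at least 4 drives"
    return (num_drives - 2) * drive_size_gb

def raid6_survives(failed_drives):
    """Data stays recoverable as long as no more than two drives are lost."""
    return failed_drives <= 2

print(raid6_usable_capacity(12, 400))        # e.g. 12 x 400GB drives -> 4000GB usable
print(raid6_survives(2), raid6_survives(3))  # True, False
[/code]

Two failures within half an hour plus a flaky third drive is about the worst hand a RAID 6 array can be dealt.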

Jeremy, I know you're busy, but can these status updates be put into the status blog at the same time?

I know it's not exactly status, but it's news about what's going on, and on days like today it's important that people don't miss a response from FastMail.

Either highlight responses in a new thread, or add the info to the status blog (for those of us who missed that one amidst the chaos and madness of lots of us venting and explaining our backup strategies).

Hopefully this new thread surfaces it for anyone else who missed it.

Old 11 Nov 2005, 11:21 PM   #2
EdGillett
Member
 
Join Date: Jul 2002
Location: Guildford, UK
Posts: 88
Whilst we're at it ..

RAID systems are only as good as the monitoring and alerting that goes along with them.

You guys are having a really horrible run of luck if a RAID 6 array got hosed, but RAID systems come with diagnostics, and being warned of an imminent failure before it takes the array down is exactly what they're for.
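
Something along these lines is the kind of alerting I mean. Just a sketch on my part, in Python, assuming smartmontools is installed; the device names are placeholders, nothing to do with FastMail's actual layout:

[code]
# Sketch only: poll SMART health on each drive and shout before the array
# loses its redundancy. Assumes smartmontools is installed; the device list
# is a placeholder.
import subprocess

DRIVES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]

def drive_healthy(device):
    """True if 'smartctl -H' reports the drive's overall health as PASSED."""
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    return "PASSED" in result.stdout

failing = [d for d in DRIVES if not drive_healthy(d)]
if failing:
    # In real life this would page someone, not just print.
    print("WARNING: drives reporting problems:", ", ".join(failing))
[/code]

Run something like that from cron every few minutes and you hear about a dying drive while the array still has parity to fall back on.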

And yes, more, smaller volumes rather than one big one.

Seems like IMAP redundancy is more of a minefield than we may have thought; personally I'm used to mirroring SQL servers, web servers etc. Is the issue with Cyrus? The MySQL DB backend?
Old 11 Nov 2005, 11:22 PM   #3
CML209
Essential Contributor
 
Join Date: Feb 2004
Posts: 328
More importantly - have they learned to heed the customers' requests to be put on a better server?

I would weather this outage in a heartbeat if I knew that the paid accounts (or mine) would be getting what was paid for in the future.
Old 11 Nov 2005, 11:32 PM   #4
Xenna
Junior Member
 
Join Date: Apr 2004
Posts: 23
Quote:
Originally posted by CML209
More importantly - have they learned to heed the customers' requests to be put on a better server?

I would weather this outage in a heartbeat if I knew that the paid accounts (or mine) would be getting what was paid for in the future.
This is probably the cause of the problem. They have heeded these requests and bought the latest, greatest, bleeding-edge RAID 6 server. We are now experiencing the pleasures of being at the forefront of technology.

X.

Last edited by Xenna : 11 Nov 2005 at 11:37 PM.
Old 11 Nov 2005, 11:32 PM   #5
EdGillett
Member
 
Join Date: Jul 2002
Location: Guildford, UK
Posts: 88
Again ...

... changing servers is not the issue. Any one server can fail.

It's knowing that the infrastructure can cope with the loss of hardware and carry on going. Which means load balancing, server redundancy, and data kept separate from the servers.

Sounds like they've had a really rough time of it with 3 hard drives failing in their RAID 6 array. If they had 13 servers all pointing to the same RAID array to pick up their data, they would still be screwed right now.

And with two 4TB volumes (that's 4000GB each, 8000GB in total) to back up, that needs some serious management and maintenance.
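
Back of the envelope on that restore, and the throughput figure here is purely my own guess, not anything FastMail have published:

[code]
# Back-of-the-envelope restore timing. The 50 MB/s sustained rate is my own
# assumption; the volume sizes are the figures above.
volume_size_gb = 4000
volumes = 2
restore_rate_mb_per_s = 50

hours_per_volume = (volume_size_gb * 1024) / restore_rate_mb_per_s / 3600
total_hours = hours_per_volume * volumes

print(round(hours_per_volume, 1), round(total_hours, 1))  # ~22.8 and ~45.5 hours
[/code]

Which lands in the same ballpark as the 45-hour restore window Jeremy quoted, and is exactly why many smaller volumes would hurt so much less.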

Hmm... the FastMail team need to lean on their hardware vendors to work this one through, methinks.
Old 11 Nov 2005, 11:34 PM   #6
CML209
Essential Contributor
 
Join Date: Feb 2004
Posts: 328
Re: Again ...

Quote:
Originally posted by EdGillett
... changing servers is not the issue. Any one server can fail.

It's knowing that the infrastructure can cope with the loss of hardware and carry on going. Which means load balancing, server redundancy, and data kept separate from the servers.

Hmm... the FastMail team need to lean on their hardware vendors to work this one through, methinks.
I know any server can fail, but it seems server4 has the worst track record.