EmailDiscussions.com  

Old 30 Oct 2006, 04:49 AM   #1
john7033
Junior Member
 
Join Date: Oct 2006
Posts: 2
Let's try to be reasonable and logical

First of all I want to make it clear that I'm not taking sides on this issue. I'm only going to say what I know to be true.

I work as a SysAdmin for a large private financial institution. We operate a redundant mail server (Postfix/Cyrus/OpenLDAP) for about 8050 users spread across the USA. I have worked in this field for about 11 years.

We process about three million messages per month, though seventy percent of that is spam (which never makes it past the helo). For me, managing the mail server is almost a non-issue. It hasn't been down in sixteen months. The reason for the last outage was to migrate accounts to a larger raid.

I first want to point out that the people who are angry about this outage have every right to be if they are paying customers (and to a lesser degree if they aren't). I have seen a number of comments which suggest that the low price of the service diminishes paying customers' right to be angry. It does not matter what the price is: if you pay forty bucks a year for email service, you are entitled to complain loudly about outages.

The price for service is set by the owners of Fastmail, and if they think they are entitled to be excused from outages because they aren't charging more, they have a little something to learn about the free market. To be clear, the low price for service set by the proprietor is not an automatic sanction for poor service. Even ten dollars per year per account is a fairly high price when you consider that there are no software licensing costs; the costs are in the management of the system.

While I'm at it, I'd like to talk about costs. This business about the raid failure is foreign to me. With the low cost of outstanding hardware, there is absolutely no excuse for the sort of outages Fastmail has experienced. I'm inclined to believe that human error caused the issue and someone is blaming it on the hardware. Here's why I believe this:

1) Bandwidth and hosting are both exceptionally cheap. You can get a full cage in a world class data center for about $3k US per month.

2) You can get a *redundant* DS3 of internet in a datacenter for about $3900 US per month (that's 45 megabits per second).

3) You can get enterprise storage in the seventy terabyte range for low six figures, and lease for just a few thousand dollars per month. We run a dual head NetApp 3020 with fiber, iscsi, and nfs and it supports eighty servers and runs at about twenty percent of throughput capacity. With snapshots, the chances of data loss are pretty remote. This is why I find the story about the raid eating itself to be questionable. The raid array itself should not crash a kernel. If it did, then there is something seriously wrong with the kernel. I'll stay away from the technical minutiae, but I think other Unix/Linux admins will see where I'm going with this.


I don't know what Fastmail's gross is, but even if they have only 100,000 users paying ten bucks a year, that's still a million dollars a year. More than enough to pay for first rate hosting and hardware. I know that quality IT people aren't cheap, but they are essential if you're going to charge money for your services. Even at the low end, outages that last for days (allegedly caused by hardware) are not something that should be considered acceptable for paying customers. In my case, the outage has lasted at least 55 hours now.
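
To put rough numbers on that, here is a back-of-the-envelope sketch using only the figures quoted in this post. The 100,000-user count is the hypothetical from the paragraph above, not a known Fastmail figure, and the $3,000 monthly storage lease is an assumed stand-in for "a few thousand dollars per month". Staff costs and one-off hardware purchases are left out entirely.

```python
# Back-of-the-envelope budget using the figures quoted in the post.
# Assumptions (not confirmed Fastmail numbers): 100,000 paying users,
# and $3,000/month as a stand-in for "a few thousand" on the storage lease.

USERS = 100_000
PRICE_PER_YEAR = 10  # USD per user per year

MONTHLY_COSTS = {
    "datacenter cage": 3_000,  # "about $3k US per month"
    "redundant DS3": 3_900,    # "about $3900 US per month"
    "storage lease": 3_000,    # assumed value for "a few thousand"
}


def annual_revenue(users: int, price: int) -> int:
    """Gross yearly income from subscriptions."""
    return users * price


def annual_infrastructure(monthly: dict) -> int:
    """Yearly total of the recurring monthly costs."""
    return 12 * sum(monthly.values())


revenue = annual_revenue(USERS, PRICE_PER_YEAR)
infra = annual_infrastructure(MONTHLY_COSTS)

print(f"revenue:        ${revenue:,}")
print(f"infrastructure: ${infra:,}")
print(f"share of gross: {infra / revenue:.1%}")
```

On those assumptions the recurring hosting, bandwidth and storage bill is a modest fraction of gross, which is the point being argued here. Change the dictionary values to test other scenarios.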

It does no good to complain without offering solutions, so here are my suggestions for Fastmail:

1) Start virtualizing your OSes on VMware or another virtual server platform. This will make it easy to add redundancy to your systems and gives you access to every runlevel over TCP. The server itself should only take ten or twenty gigs per instance.

2) Get an enterprise level Raid platform. I recommend NetApp. I don't know what your storage requirements are, but the 200 series with a single shelf starts out at about $40K. You can take instant snapshots and sync redundant arrays easily.

3) Get a hardware load balancer and learn to use it to your advantage. When you have a sick server, fire up a new virtual instance of your server, create a new array from the latest snapshot on your NetApp, mount the array, add the instance to your load balancer and restore service. I recommend a redundant Kemp. That will set you back another eight or ten grand or so.


I pay forty bucks a year and I'm still down at 01:20 GMT, so I'm just as annoyed as anyone else.


Finally, let's try to give Fastmail a break. They are probably under a huge amount of pressure right now. I bet everyone has a massive tension headache and some people might even feel like their jobs are threatened. Anyone can make a mistake or have a piece of hardware go bad. Let's give them a chance to learn from this. If it happens again, then yes, let's line up with torches and pitchforks. I know everyone is mad and they have the right to be, but let's give Fastmail a chance to make things right.

Just my $.02.


Good luck with the recovery, and greetings to everyone.
John7033

Old 30 Oct 2006, 05:16 AM   #2
MikeL
Essential Contributor
 
Join Date: Apr 2002
Posts: 326
Very well thought-out post, thanks. I'm also still down, though since I have my own domain thru 1&1, I've been able to point my MX records at 1&1's servers, which allows me to get email (though there are a couple of days or so worth of email waiting in the queue for whenever my account comes back up).

I am sympathetic to FM's issues, though my sympathy is getting rather strained at this point. This is the 2nd major outage we've experienced in a year, and there have been other outages of lesser impact over the 4 years or so I've been an FM user. I still want to use FM, because I can't live without the folder addressing that it offers, but I am certainly rethinking my setup and trying to figure out the best way to have seamless recovery when something goes wrong. Being unemployed at the moment (for about a year, though I did get decent severance and had unemployment insurance and savings to help me muddle through until now), I can't pay for an alternate service, even something like Pobox (FM & 1&1 are paid up for now, but 1&1 won't let me use wildcard addressing in my forwards).

In any case, this should probably be in the Help/Issues forum.

-Mike
Old 30 Oct 2006, 05:39 AM   #3
kastaway
= Permanently banned =
 
Join Date: Nov 2005
Posts: 128
Totally agree

I've also done the math in my head, and while I'm always trying to give Fastmail the "benefit of the doubt", I too just don't see why the money that must be available would not cover what it costs to solve these problems.

My account was down for days in the last outage (November). I was panicked at first, but set up a Gmail account, told all my acquaintances what happened, and it survived. I thought I should switch, but I liked the features of FM too much, so I didn't.

Now, I'm down for days, again. I feel even stupider than I did last time! I feel stupid having to tell everybody I know that the email service that I pay for is down AGAIN. (They already think I'm nuts for paying for email in the first place....) And I feel stupid for waiting, like a deer caught in headlights, for the next FM disaster....which we all know is coming, because adequate money and expertise are not being used!

Well, I'm tired of feeling stupid.

So I'm gone. I signed up for Tuffmail two days ago, I'm migrating all of those downloaded emails onto that account, and I'm not coming back.

I really, really want to keep wildcard subdomains. And I really liked the web interface. But I've set up all the subdomains I can remember as Tuffmail domains, and besides I've concluded that plus-addressing can work just as well in the future.

Horde/IMP is not as fast of an *interface* as FM, but SINCE WHEN does the FM interface ACTUALLY reload as quickly as it should? Maybe 3 days out of 30, on average.

I was a huge fan. I put my money, my email, and my hope, where my mouth was. But it would be stupid to stay with FM any longer.

Bye bye Miss American Pie. (We drove the Chevy to the levee, but the levee was dry....)
Old 30 Oct 2006, 07:30 AM   #4
Aimlink
Master of the @
 
Join Date: Oct 2005
Location: Here and Now...
Posts: 1,078
John7033,

I don't get the impression from their description of what they're doing that they're liberal with their hardware to ensure state of the art robustness and performance.

Interesting what you wrote though.

Several months ago, I paid up to five years subscription for my enhanced account.

I did so because I could afford to and, even more so, decided to support FastMail with a lump sum payment which they could combine with similar payments and do something towards improving the service.

I haven't been directly affected but still have an empty feeling since I think it could easily have been me too that was affected. I do feel insecure about the whole setup now.

It will take them a year of uninterrupted service to recover from this soiling of their reputation.

The quotes on their home page now seem like a farce.

The recent threads on these lists are now dominated by negative commentary and complaining.
Old 30 Oct 2006, 11:07 AM   #5
sflorack
The "e" in e-mail
 
Join Date: Feb 2002
Posts: 2,937
As was stated by another user here, their lack of participation in the forums suggests that they are losing interest in their own business ventures. Although Bron seems to be a competent data recovery administrator, I have serious doubts about his ability to single-handedly manage a worldwide email provider -- from what I recall, he's not even out of his twenties.

The only reason I'm continuing to settle with FM is that they have my money. There really isn't any other option for me at this point.
Old 30 Oct 2006, 11:30 AM   #6
darens
Member
 
Join Date: Jul 2006
Posts: 46
Quote:
Originally posted by sflorack
The only reason I'm continuing to settle with FM is that they have my money. There really isn't any other option for me at this point.
I'm in the same boat. The way I look at it is, I've got several months to investigate alternatives before having to make a decision.
Old 30 Oct 2006, 01:16 PM   #7
Shelded
 Moderator 
 
Join Date: Aug 2001
Location: USA Northwest
Posts: 3,849
John7033
Thanks for a post which says something constructive in the midst of frustration.
Old 30 Oct 2006, 09:01 PM   #8
brong
The "e" in e-mail
 
Join Date: Jul 2004
Location: Melbourne, Australia
Posts: 2,696

Representative of:
Fastmail.fm
Re: Let's try to be reasonable and logical

Quote:
Originally posted by john7033

We process about three million messages per month, though seventy percent of that is spam (which never makes it past the helo). For me, managing the mail server is almost a non-issue. It hasn't been down in sixteen months. The reason for the last outage was to migrate accounts to a larger raid.
Just a datapoint there - each one of our incoming mail servers processes more than that number of messages per day. Lots of things that are easy at lower rates become much more edgy when you get up to high rates, and a lot more kernel bugs get discovered. It's why we're trying to keep the load down by balancing things all the time.
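
For a sense of scale, the two figures can be turned into average rates. This is only a rough sketch: it treats "more than that number per day" as exactly three million per day per server, assumes a 30-day month, and ignores peak-hour bursts, which is where the edge cases tend to show up.

```python
# Average message rates implied by the two figures in this exchange.
# Assumptions: 30-day month, load spread evenly, and "more than three
# million per day" taken as exactly three million (a lower bound).

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumed

MESSAGES = 3_000_000


def per_second(messages: int, seconds: int) -> float:
    """Average messages per second over the given period."""
    return messages / seconds


monthly_rate = per_second(MESSAGES, DAYS_PER_MONTH * SECONDS_PER_DAY)
daily_rate = per_second(MESSAGES, SECONDS_PER_DAY)

print(f"3M per month: {monthly_rate:.2f} msg/s")
print(f"3M per day:   {daily_rate:.2f} msg/s")
print(f"ratio:        {daily_rate / monthly_rate:.0f}x")
```

Under these assumptions a single incoming server here sees at least thirty times the average rate of the whole installation described in the quote.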

Re: the RAID failure. I can assure you that the first one really was a kernel bug (not the fault of the RAID unit at all.. it was functioning perfectly well) caused by excessive numbers of deletes in a single transaction. My analysis (which hasn't been extensively tested admittedly) was that we were using an external journal on a battery-backed memory device inside that machine. The partition that had been set aside for the journal was exactly 32MB, which is the maximum journal size, however one 512 byte block was being used as a journal header, so the journal was slightly smaller than "standard". We managed to queue so many deletes in one transaction that it hit that size, and my guess is that something in the filesystem doesn't check that very rare edge case.

End result - a filesystem that couldn't be mounted read-write again.
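
The arithmetic behind that guess can be sketched in a few lines. This is an illustrative model only, not filesystem code: the `fits` helper and the variable names are hypothetical, and the real failure path was never confirmed beyond the analysis above. It simply shows how a size check made against the nominal maximum can pass while the journal that actually exists is one block too small.

```python
# Illustrative model only, not real filesystem code. It shows how a size
# check made against the nominal maximum can pass while the journal that
# actually exists is one block too small.

BLOCK_BYTES = 512
NOMINAL_MAX_JOURNAL = 32 * 1024 * 1024  # the "standard" 32MB maximum
PARTITION_BYTES = 32 * 1024 * 1024      # partition set aside for the journal
HEADER_BYTES = 1 * BLOCK_BYTES          # one block used as journal header

usable_journal = PARTITION_BYTES - HEADER_BYTES


def fits(transaction_bytes: int, capacity_bytes: int) -> bool:
    """Return True if the transaction fits in the given journal capacity."""
    return transaction_bytes <= capacity_bytes


# Hypothetical transaction that grows right up to the nominal maximum.
big_transaction = NOMINAL_MAX_JOURNAL

passes_nominal_check = fits(big_transaction, NOMINAL_MAX_JOURNAL)
fits_in_real_journal = fits(big_transaction, usable_journal)

print(f"usable journal bytes: {usable_journal:,}")
print(f"passes check against nominal maximum: {passes_nominal_check}")
print(f"actually fits in the journal:         {fits_in_real_journal}")
```

The gap is only 512 bytes out of 32MB, which is consistent with the description of this as a very rare edge case.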


This time however the RAID unit did die. We didn't realise how badly at first - the kernel returned "device offline" type errors, and I logged in to the web based management interface to notice that one drive was listed as "0GB" in size. I contacted the techs and had them change the drive. It then reported the correct size (400GB) and everything, but didn't say that the drive was a new one!!!

I was pretty worried at that stage, because you can't just plug a new drive into a RAID unit and consider the data on it to be valid! I had them pull it immediately and re-plug it. This time the change was detected and it started rebuilding to the new drive. Then another drive claimed to have failed, then another. We realised the drives weren't actually failing, the firmware on the RAID unit was confused.

This really was our worst nightmare situation: the one remaining 2TB, non-replicated server dying before we had all the users off it (you'll be pleased to hear, I'm sure, that everyone has now been moved off that machine). We were seriously about 1 week away from completion too. We had already moved over 2/3 of non-guest users from that partition, but the remaining users are of course the ones who noticed!

I'll write something in more detail again later, but it's hit midnight here. I'll just post something smaller to the technical blog and a separate thread here as well.
Old 31 Oct 2006, 05:09 AM   #9
ChinaLamb
The "e" in e-mail
 
Join Date: Dec 2004
Location: a virtually impossible but finitely improbable position
Posts: 2,320
Quote:
Originally posted by sflorack
As was stated by another user here, their lack of participation in the forums suggests that they are losing interest in their own business ventures. Although Bron seems to be a competent data recovery administrator, I have serious doubts about his ability to single-handedly manage a worldwide email provider -- from what I recall, he's not even out of his twenties.
Brong seems to be active to some extent in these forums. He also seems to be concerned about the customers. He also does not come off like an arrogant jerk as so many young Sys Admins do. I would lose interest in posting to these forums as well, with how negative they have been. (Yes, I have added to that though... :P)

Quite honestly, if your business lives and dies on having 100% uptime for email (no one can promise that), you cannot rely on just one vendor, you have to build in your own redundancy. Many good posts here on how to do that. If your email provider goes down twice and it damages your business - you have no one to blame but yourself.

Just some thoughts...

CL
Old 31 Oct 2006, 05:19 AM   #10
sflorack
The "e" in e-mail
 
Join Date: Feb 2002
Posts: 2,937
I know it was a general comment, but being that you began your post by quoting me, I felt obliged to respond..

I do not run a business, nor do I consider my FM account to handle vital emails. Regardless, FM sells subscriptions to their services and there is an expectation of their reliability. I could just as easily use a free Yahoo account, but like the feature set of FM and like supporting small businesses.

What's becoming increasingly frustrating is not necessarily the downtime in and of itself, it's the idea that I (and probably many others) am starting to think that FM is just not staffed/equipped to run a premier email service. As a consumer, I make daily choices as to how I'll spend my money, and if features are the only perks I receive and have to sacrifice reliability, then I'm starting to tip the balance to finding an alternative.
Old 31 Oct 2006, 05:23 AM   #11
darens
Member
 
Join Date: Jul 2006
Posts: 46
Very well put, sflorack.
Old 31 Oct 2006, 08:09 AM   #12
brong
The "e" in e-mail
 
Join Date: Jul 2004
Location: Melbourne, Australia
Posts: 2,696

Representative of:
Fastmail.fm
Quote:
Originally posted by sflorack
As a consumer, I make daily choices as to how I'll spend my money, and if features are the only perks I receive and have to sacrifice reliability, then I'm starting to tip the balance to finding an alternative.
edit: sorry, I managed to hit the submit button without pasting in my response! Oops.

We're very aware of that, and we know that every time something goes wrong it will push some people to run to other providers. That's unfortunately a fact of life - and the bigger you get the bigger the risks of something going wrong.

We made a couple of really bad calls over the past couple of years. The worst one by far was choosing to set up 2TB volumes to put Cyrus emails on. Cyrus stores every single email individually, and the average size of an email is in the 10KB range. Do the maths yourself to figure out just how many emails that is, and how long it takes to do a full filesystem check on that sort of data.

Also, look at the maximum speed of a SCSI bus or a SATA150 drive for that matter, and think about how long it takes to move that much data on and off the drive array if necessary. We didn't really consider the worst case there well enough, and it hurt us badly.

We've now standardised on 300GB across the board as our largest partition. Well, that's not strictly true, our backups are still on 2TB arrays, but they are one .tar.gz file per user, so they check a lot faster. The 300GB partitions check in about 8 hours, and they are replicated as well.
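
For anyone who wants to "do the maths", here is a rough sketch using the figures in this post. The assumptions are mine and are optimistic: decimal units, the 150 MB/s SATA150 interface ceiling as sustained throughput (real small-file workloads run far slower), and filesystem check time scaling linearly with partition size.

```python
# Rough worked numbers for a 2TB Cyrus volume, from figures in the post.
# Assumptions: decimal units, SATA150 interface ceiling (150 MB/s) as the
# sustained rate, and fsck time scaling linearly with partition size.

KB = 10**3
GB = 10**9
TB = 10**12

VOLUME_BYTES = 2 * TB
AVG_EMAIL_BYTES = 10 * KB
SATA150_BYTES_PER_SEC = 150 * 10**6

# Number of individual message files on a full volume.
emails = VOLUME_BYTES // AVG_EMAIL_BYTES

# Best-case time to stream the whole volume once over one SATA150 link.
copy_hours = VOLUME_BYTES / SATA150_BYTES_PER_SEC / 3600

# A 300GB partition checks in about 8 hours; extrapolate to 2TB.
fsck_hours_2tb = 8 * (VOLUME_BYTES / (300 * GB))

print(f"emails on a full 2TB volume: {emails:,}")
print(f"best-case full copy:         {copy_hours:.1f} hours")
print(f"extrapolated 2TB check:      {fsck_hours_2tb:.1f} hours")
```

Even on these generous assumptions a full check of a 2TB mail volume runs to more than two days, which is the worst case the smaller replicated partitions are meant to avoid.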

Other bad choices included: applying a security update to Cyrus without sufficient checks that they hadn't included some braindead half-baked feature that broke approximately everything along with the security update; buying hardware that we hadn't used before and that, while it was expected to be lower reliability, wasn't expected to be quite so low as all that; and more recently, trying to switch some systems to the ext3 filesystem, which has much worse throughput than reiserfs for our usage, meaning stores have been overloaded and I had to cut back the transfer rate more than I would otherwise have done.

That said, I think we've made a lot of good choices as well, you just don't see them - and we've made a lot of tradeoffs that, knowing everything we know now about just when our units would fail, we wouldn't have made - but at the time the smoother and safer approach seemed more sensible. If we were back a few months ago with the information we had then, I think we would still make the same choices (except ext3, I wouldn't do ext3 again in a hurry, man its delete performance sucks, and random write isn't too hot either).

Last edited by brong : 31 Oct 2006 at 08:21 AM.
Old 31 Oct 2006, 01:55 PM   #13
robmueller
Intergalactic Postmaster
 
Join Date: Oct 2001
Location: Melbourne, Australia
Posts: 6,102

Representative of:
Fastmail.FM
I'll second Bron. Unfortunately you make choices every day, and sometimes you get them wrong and the effects last longer than you'd ever hope. Even if you realise the mistake quickly, the effort to turn back from the original mistake can take way longer.

Now it's easy to say "why make the mistake in the first place", but life is never that easy.

The ironic thing is that (as Bron pointed out in the status blog post) this server3 failure forced us to rapidly restore people onto the new replicated servers so now everyone is on replicated servers. Had it happened 1 week later, no one would have been affected. Oh well, stuff happens. You fix it and move on.

Rob
Old 31 Oct 2006, 02:20 PM   #14
NJSS
Master of the @
 
Join Date: Jul 2002
Location: Hampshire, UK
Posts: 1,238
Brong & Rob

I have followed this, and the many other threads on this topic, in the last few days.

Firstly I'd like to thank you both, and Richard & Jeremy for the hard work which has undoubtedly been put in since the server 3 problems manifested themselves last week.

Secondly I am in the "Let's be reasonable & logical" camp: e-mail is essential for me so I have had my own personal redundancy & fall-back in place for a couple of years, save for replication of the FM filestore, which is only a couple of weeks old.

I am very happy with what FM is doing, I like the interface & features, and will be staying with FM.

I've said it elsewhere, and you now appear to be addressing these issues, but they bear repetition: better communication via the Status Blog is essential, as is an improvement of the direct support of users.

Well done guys, and thanks again.
All times are GMT +9.

 

Copyright EmailDiscussions.com 1998-2022. All Rights Reserved. Privacy Policy