EmailDiscussions.com  

FastMail Forum All posts relating to FastMail.FM should go here: suggestions, comments, requests for help, complaints, technical issues etc.

Old 11 Nov 2005, 11:07 PM   #1
jmlondon
Junior Member
 
Join Date: Nov 2005
Posts: 5
Crucial & Technical Server4 Inquiry

Hi,

esp for Jeremy Howard, when you have time to check in . . .

just how big is this array of yours on Server4???

Let me explain my line of questioning:

[background: i recently became a "Full Member" and more recently upgraded my storage, on the strength of faultless performance to date for my usage (about 18 months - 2 years). the irony . .


oh, and maybe someone will flame me later asking why i have such high end experience and still use Fastmail - er, we're small, not needing or wanting to run anything we don't have to, and *ahem* read again the bit i just wrote about "faultless performance". I was just about to consider chucking my whole company onto Fastmail, because 1) i road-tested the service long enough, 2) I was going to do due diligence anyhow, 3) it's been very very good, YMMV]

Firstly I *have* read all the other threads, so i am asking now about logic which doesn't stack up in my mind from what Mr Howard has said publicly so far.

Recently I've moved to using AIT 4 tape for my desktop.

these are the beasties: http://b2b.sony.com/Solutions/subcat...T_Drives/AIT-4

(I work with large (media) object databases, and have a 400GB RAID 5 array, backed up locally to spare the network whilst we upgrade the drops, and also considerable experience designing storage systems that were vast a few years ago and are nearly standard now . . . a related observation for later)

AIT-4 throughput is 24 MB/s (see the specs).

According to Mr Howard, "The restore will take around 45 hours"

and "The hard drive enclosure that failed is RAID 6. "

RAID 6 is pretty new, and unavailable / impossible on low end controllers, so i'm making an admittedly big assumption that Server4 was a forklift upgrade with the latest gear.

This is why i'm presuming a tape storage option was specified, like AIT-4 or the LTO Ultrium etc. equivalent. Capacity and backup windows alone demand it, even on my desktop workstation (admittedly i'm doing a *lot* more than use Office on my local machine).

If i add up the numbers for AIT-4 throughput, I get

24MB/s = 0.024 GB/s [*10^-3]

the projected 45hr restore is (3600 * 45) = 162000s

and multiplying those together [note: time converted to seconds so the units match and i don't post this wrongly]

we get a storage array of:

3888 GB

or ~3.9 TB (terabytes)

which is one hell of a big array.
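
(to show my working, here's the same arithmetic as a quick python sketch - the 24 MB/s native rate and zero compression are my assumptions, not Fastmail's figures:)

throughput_mb_s = 24                           # AIT-4 native throughput from the spec sheet
restore_s = 45 * 3600                          # the quoted 45-hour restore, in seconds
size_gb = throughput_mb_s * restore_s / 1000   # MB -> GB
print(size_gb)                                 # 3888.0 GB, i.e. ~3.9 TB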

** remember i am assuming a "forklift" upgrade to new gear associated with the shiny new RAID 6 controller we know about

- reinforcing this, bear in mind also that i know personally, from professional experience, that neither IBM nor HP nor Dell support these RAID cards on any but a few _new_ servers, else you're SOL if you don't know your way around. Maybe Intel will sell you a card, as will HP etc., but only after they disclaim the hell outta support. I tried, only three weeks ago. Being a small company, we insist by policy on NBD support at minimum **

Because of my familiarity with these RAID cards, I'm assuming we have installed on this server the latest capacity / density SCSI, Serial Attached SCSI (SAS) or SATA drives.

And i'm further assuming that because SATA 2 (as yet hardly implemented in the stack) has far fewer hardware safeguards compared with SCSI (serial or parallel), SCSI is what is being used for critical systems like Server4.

Now, using SCSI/SAS drives, the highest density is 300GB per drive for Seagate / Fujitsu 10KRPM SCSI.

So a NON-RAID capacity requirement would be: 3888GB/ 300GB = 12.96 disks _unformatted_. Let's round down and ignore the formatting loss and take 12 drives, a nice number . .

For RAID 6 the usable capacity from N drives of size S is C = (N - 2) x S

i.e. there are TWO drives' worth of parity for each array configured.

since the 3888GB estimated above is data that all has to fit in that usable capacity, the total is unchanged and so in fact there are more drives in the array than the count above,
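
(rough drive-count sketch, again with assumed figures - the 300GB drive size and the 3888GB estimate are mine, not Fastmail's actual config:)

import math

data_gb = 3888                                 # array size estimated from the restore window above
drive_gb = 300                                 # largest 10K SCSI/SAS drive of the day
data_drives = math.ceil(data_gb / drive_gb)    # 13 (round down to 12 if you ignore the fraction, as above)
raid6_drives = data_drives + 2                 # RAID 6 adds two parity drives on top
print(data_drives, raid6_drives)               # 13 15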

BUT LET ME GET TO MY POINT (and apologies for my assumptive math)

is this one vast array, and if so just how many paying users are sitting on a 4TB filesystem?

or is Server4's FS far smaller and utilising much slower tape drives? (N.B. here that 4yr old tech pulls 6MB/s, so for the same restore time, the capacity is still 1TB, which is a lot of user space)

and we get to the crux, assuming everything is the latest tech (and why introduce a demanding storage system without the processing power to handle the FS metadata at OS level???)

a) 4TB is way too much to manage without system level FS knowledge or high end object-FS overlays (i have tried this for a major project, thankfully when arrays this size were considered experimental _for a single disk image_)

b) if there's no autochanger, at max compression of 2.6:1 (for AIT-4, but similar for other current tech) i can get ~520GB on each tape, which is ~8 tapes for a full system restore (rough math in the sketch after this list) . . . point being, HOW OFTEN IS A FULL FS IMAGE TAKEN?

b.i) if incremental backup is used on a single non-robotic drive, how many tape changes does this really need?

c) REALLY CRITICAL FOR USERS (considering how email stores update so frequently): how frequent is the backup window, both for incremental and full image, and how does the incoming SMTP queue cover for this? i.e. do they overlap adequately so no incoming mail is lost, AND when the SMTP queue hands over to the FS, is this done close enough to a backup window that there's no chance for data to slip the net?
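
(the tape math behind point (b), assuming AIT-4's 200GB native cartridge capacity and best-case 2.6:1 compression - my assumptions again:)

import math

native_gb = 200                                # AIT-4 native capacity per cartridge (assumed)
per_tape_gb = native_gb * 2.6                  # ~520 GB at best-case compression
tapes = math.ceil(3888 / per_tape_gb)          # full image of the estimated array
print(per_tape_gb, tapes)                      # 520.0 8 -- eight cartridge changes without a robot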

without going on further, even if this is a much more modest FS than i have theorized: 1) there are high hardware requirements to meet the backup window problem alone, 2) i am aware of only a handful of commercial backup software systems that can cope, and no F/OSS packages (at least not without considerable user tweaking, which in an experienced sysadmin/network engineer's time would completely negate the licensing saving), 3) I'd dang well like to know, considering i've been more than happy to date, and am obviously concerned as hell about what's happening to my email. (and yes, i switched over my MX records, just waiting for them to propagate).

In summary, like everyone else i want to know why 45 HOURS is the (constantly moving) ETA for restore.

If i were running a huge array like this, I'd have online backup with disc mirroring and RAID'd tape drives in a robotic changer. I honestly don't know how you could recover otherwise.

kind regards to the hard pressed team, i mean you no disrespect - even at half the complexity i am guessing at (from experience, N.B.), you have your work cut out - but could you please address my questions, as i think a concise but sufficiently detailed reply here will assuage the woes of even the most technically competent customers you have.

best,

- john

p.s. sorry typos, must get back to work

Old 11 Nov 2005, 11:17 PM   #2
Xenna
Junior Member
 
Join Date: Apr 2004
Posts: 23
I think I read the 4TB number somewhere, so you're probably right. I also believe that multi TB arrays and huge filesystems are asking for trouble since if the worst case happens, restore times are ridiculous.

I also fear that with 3 disks failing there may be something wrong with the RAID controller or the rest of the system. I don't have too much experience with RAID arrays but how big is the risk that when the restore finally finishes FM will find out that the problem repeats and the new FS is corrupted as well?

In that case we'll all be waiting for a week!

X.

PS: I've always been sceptical when colleagues raved on about great new storage technologies. The bigger the storage, the bigger the problem if it fails.
Old 11 Nov 2005, 11:28 PM   #3
EdGillett
Member
 
Join Date: Jul 2002
Location: Guildford, UK
Posts: 88
That's about right ...

Quote from Bron in July:

"To improve stability and performance: Two new 4TB arrays in the past month. We're working towards installing a much more efficient POP and IMAP proxy, alleviating the load on our frontends"

So yeah ...

2 4TB Arrays.

Big old beasties

A backup of that boy is gonna take a while.

Good point about RAID6.

What hardware support do they have for this? For the drives to fail like this, I'm presuming they have some rights to onsite tech support from the hardware vendor?

http://wiki.fastmail.fm/index.php/InfraStructure

dunno how up to date the wiki is. Seems like it's info gleaned from the forums
Old 12 Nov 2005, 12:02 AM   #4
jmlondon
Junior Member
 
Join Date: Nov 2005
Posts: 5
Re: That's about right ...

@Xenna & @EdGillett

Hi there, thanks for your input!

in a bit more detail (my argument, that is) . .

The MTBF of my new Seagate Cheetah 15K7 146GB drives is nearly 7 years. That didn't stop a DOA between factory and powerup.

Point is, this is an awesome MTBF rating. But even a small per-drive failure rate multiplied by a large number of drives = a frequent overall failure rate. Just probability math & a fact of life.

Which asks the almost philosophical question: "Lots of small drives, with less data to fail and quicker rebuilds as a result but a higher cumulative failure rate, OR a few huge drives?"

Power, cooling and density always dictate the latter.
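
(to put a number on the cumulative-failure point - purely illustrative, using the ~7 year MTBF quoted above and the hypothetical 15-drive array from my earlier sketch:)

mtbf_years = 7                                 # single-drive MTBF quoted above
drives = 15                                    # drives in the hypothetical RAID 6 array
failures_per_year = drives / mtbf_years
print(round(failures_per_year, 1))             # ~2.1 -- a couple of drive swaps a year, on average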

If these are NOT NEW DRIVES, then sorry, that's incompetent. Add the expected failure rate to the distribution of failures across the drives' lifetime, and you're asking for trouble.

Yeah, huge FS's are truly asking for trouble.

ReiserFS ain't bad, but sorry, it is seriously not production-proven at this scale.

SGI XFS has been proven at this scale for a long time now. 4TB was quite small for their sites even a year or two ago.

But no FS will work well without efficient metadata handling, which is why there are so many specialized file systems available, excluding those aimed at SAN multi-OS interop.

That metadata is what drives the backup schedule and general maintenance. A serious FS will talk to the controller directly to accelerate routine management functions. Top-level backup software (and it need not cost so very much) understands this and optimizes for it.

But no buzzword saves the day. RAID may sound exotic to most, and RAID 6 is impossible to understand properly without reading up on the Galois field algebra behind it. Yet RAID can be found on even the lowest-end home PCs.

Remember what RAID stands for: Redundant Array of Inexpensive Discs.

That never meant the cheapest: when the acronym was coined, disc space was so expensive that any opportunity to reduce QA limitations (the 15% rule) and improve failure characteristics was critical, so the name was chosen to appeal to managers.

Assuming you can throw in budget drives on a critical system is suicidal.

The point of SAS / SATA is to have an interoperable bus, i.e. one controller can talk to high-reliability serial SCSI drives at the same time as cheap SATA drives, and migrate less-used data to the "nearline" SATA, *prior to backup* or as part of a backup mirror copy, to reduce load on a smaller quantity of higher-performing drives.

Equally, to think that old hardware / motherboards / CPUs can handle all this is a crock - the all-important FS metadata is not handled at kernel level, or shouldn't be. Under NT, good controller _drivers_ hand off this load efficiently. Linux drivers, for lack of deployment at this scale, are a generation behind, causing CPU load.

Oh, and LANL (and others) do have some of the biggest supercomputers running Linux with vast disc arrays. Only the whole storage array is proprietary hardware running a proprietary object FS, which is naturally heavily optimised. If you have to ask the price, er, you don't have clearance

Just because we can now do 4TB even in the SME / micro business, because discs got cheap (even the very best, comparatively), doesn't mean EMC is going out of business. It's that complicated.

Sorry this is a bit of a rant, but 4 years ago i worked on a project (library archiving) that needed nearly 5TB of near-line storage. Even if i visited the subject again now, with experience and with new tools available commercially, I would expect _installation planning_ alone to cost 10x the hardware cost . . . .

with that much data, knowing where you are and where you are going, and how to log, track and backtrack, is the "be all and end all", and is specific enough to what you're running that you have to write your own operations manual.

I forget his name, and this may be an apocryphal quote from the film, but the Flight Director of Apollo 13 kept yelling "GIVE ME THE PROCEDURES"

If it ain't written down, where do you start?

Is this why we have a moving ETA target?

Sheesh, sorry this ain't a pitch, but it sounds a bit like one i know


cheers,

- john
Old 12 Nov 2005, 12:08 AM   #5
EdGillett
Member
 
Join Date: Jul 2002
Location: Guildford, UK
Posts: 88
Hmm ...

I bow to John's superior knowledge on this one, but comments from Jeremy including "We have learnt a lot from this failure - we haven't had a RAID array fail before so we haven't really known what it would look like" are a bit concerning.

A lack of understanding of file systems this large may be at the root of the issues here as well.

FM staffers out of their depth when they've had to scale up?
Old 12 Nov 2005, 12:42 AM   #6
jmlondon
Junior Member
 
Join Date: Nov 2005
Posts: 5
Re: Hmm ...

Quote:
Originally posted by EdGillett
I bow to John's superior knowledge on this one, but comments from Jeremy including "We have learnt a lot from this failure - we haven't had a RAID array fail before so we haven't really known what it would look like" are a bit concerning.

A lack of understanding of file systems this large may be at the root of the issues here as well.

FM staffers out of their depth when they've had to scale up?
Hey Ed!

Thanks for your compliments, but let me climb quickly down from any high horse that happened to be passing (ugh, mixing metaphors, forgive me!)

Can i just say that 6 weeks ago I also had *my first* RAID failure, or rather a disc controller failure on a JBOD (Just a Bunch Of Discs).

Now this was an OLD 5-year machine, so it hardly mattered. Professional pride in the recovery took a knock, however (window missed, if only for non-essential data), since I had *never* lost data . . .

let me explain:

If this was RAID with redundancy the plan is simple:

If the RAID controller card is stuffed, unplug the battery-backed RAM, install it in another card, reconnect the drives, swap out the dead drive, rebuild.

Sadly, it was JBOD so no redundancy, and it was the disc controller that went, so i am just waiting for time to steal an unused identical disc and perform some surgery.

But for a modest mistake, i lost 100 hrs of secondary and tertiary impact. That's a HUGE cost for us.

Small businesses such as ourselves actually have far less tolerance for failure than, say, a major bank or big contractor with 1,000s of staff. We can't ask the Gov. to bail us out

** I have to say: HARDWARE RAID FAILURE IS SCARY ! **

but to think that a 4TB rebuild will be easy, even assuming a further extreme failure _does NOT occur_, is naive.

Someone on the other big thread complained that they don't want to pay (via subs) for Fastmail to learn.

Hmm, fair enough sentiment.

But i'd want the big guns in for this.


But again, still, we're constrained by the huge size of the storage array.

I mean, the very best available RAID has 12GB/s non-contended internal real-time rebuild. This is serious-budget kit, and even that is only being used in non-OLTP situations.

OLTP - and by my definition email is OLTP, because as communication it has the same non-repudiation and ACID requirements as a banking app - *DOES NOT DEPLOY* the same kind of arrays. Media, analysis ("data warehousing" is so 1992) and scientific work use that kind of throughput, because they don't have to be UP ALL THE TIME.

My humble and very quick solution: much, much smaller file systems as logical drives, each an independent RAID. A big Xeon machine will more than handle the IMAP etc. requests, but hardly has the capacity to deal with lots of disc management. $1000 servers can internally hot-swap enough SATA to make a man happy, when there are 5 of them or so.

Basically, reduce points of failure.
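
(a quick illustration of why smaller independent filesystems matter - same assumed 24 MB/s tape rate as before, rough numbers only:)

rate_gb_per_hr = 24 * 3600 / 1000              # ~86 GB/hour at AIT-4 native speed
for size_gb in (3888, 500):
    print(size_gb, "GB ->", round(size_gb / rate_gb_per_hr, 1), "hours to restore")
# 3888 GB -> ~45 hours; 500 GB -> ~5.8 hours: a failed small volume is back in an evening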

This is what customers are complaining about.

The minimum action requirement as i see it if you have one server like Server4 is as follows:

a) send the broken drives for "forensic" recovery, and rebuild on a separate machine (good RAID controllers allow copy of settings or even a hot swap, so this is easy enough)

b) run the backup from tape in parallel on another machine.

c) this only requires the most modest hardware, since you are creating 100% redundancy over and above the fact that you can try again

d) problems encountered on one system can be preempted on the other.

I've said it already, but i need to emphasise:

1) Hardware failures on RAID are very scary

2) I don't want to start on how much it takes to get a SLA data survival support contract even from EMC, even when you have a 7 figure budget.

3) I just want to know how bad this is. The Fastmail guys are actually dealing with problems here that are rarely faced outside research labs and huge Fortune 500 companies. This time around, they have a very steep learning curve. I think there are better ways for the customer to deal with this. e.g. I registered a backup and redirected mail from my DNS. Can Fastmail give me the features (such as SMS, which i use now) on an interim backup, on a working server? Fastmail need to communicate. I know only 0.005% of what's going on, and i understand the difficulties. There is a way to say "sorry guys, we're shafted, we just bit the a$$ off the Yeti" and say that it's "watch this space", not give confusing messages. (oh, and i worked in advertising very intimately, so i know you can cover your a$$ without lying, even if no-one seems to try that one . . . )

. . basically, make this an epic turnaround. But you can't educate your customers whilst educating yourself. Catch 22.

I think it would be very fair if Fastmail came out with the seriousness of the problem, and very fair if customers gave them credit for climbing Everest's North Face.

cow, i know how to spin things LOL!

thanks Ed again for your compliments, but honestly such nice words are not due to mortals

best,

- john
Old 12 Nov 2005, 07:02 AM   #7
elvey
The "e" in e-mail
 
Join Date: Jan 2002
Location: San Francisco
Posts: 2,458
Good analysis and questions.
As a reminder for posterity:
What we're talking about is this message on login attempt on Thursday <sic> night:
The account you are trying to login to is on a server which requires emergency maintenance. The maintenance will take around 2 days - it should be complete by the end of the weekend. Update will be posted to FastMail.FM's status page
Old 12 Nov 2005, 07:17 AM   #8
Jeremy Howard
Ultimate Contributor
 
Join Date: Sep 2001
Location: Australia
Posts: 11,501
We are reasonably familiar with file system internals, and have worked directly with Namesys (the file system vendor) on numerous patches.

However, next time we build a server we'll use fewer smaller arrays.

The reason we don't have experience with a full RAID failure is that we've managed the system well enough so far that we haven't had one! So this is the 1st time that it's happened. We've had many hardware failures of course, but they've never caused an outage before.

The backups we are using are on a 2nd hard disk enclosure. It's slower than pure disk speed because it has to write lots of small files, so the meta-data updates are the constraint.
Old 13 Nov 2005, 08:44 PM   #9
jmlondon
Junior Member
 
Join Date: Nov 2005
Posts: 5
devil's advocate

Quote:
Originally posted by Jeremy Howard
We are reasonably familiar with file system internals, and have worked directly with Namesys (the file system vendor) on numerous patches.

However, next time we build a server we'll use fewer smaller arrays.

The reason we don't have experience with a full RAID failure is that we've managed the system well enough so far that we haven't had one! So this is the 1st time that it's happened. We've had many hardware failures of course, but they've never caused an outage before.

The backups we are using are on a 2nd hard disk enclosure. It's slower than pure disk speed because it has to write lots of small files, so the meta-data updates are the constraint.
Hi,

i don't think you answered my original question (tho' others reported the array size), because i expanded on the all-important failure ratio for drives: a function of MTBF * number of drives * drive-age failure rate. It would be nice to have it confirmed that this was a new array with new drives, ideally a small number of drives (less cumulative statistical failure, and as you know therefore less chance of a two-drive failure downing the array), and whether these are SATA or SCSI (the considerable price bump of SCSI is in no small part attributable to extra QA)

But what disturbs me now is your mention of working on patches for Reiser, if only because you do not state in what capacity or under what conditions you were involved in that process.

At best, your assertion indicates a close vendor involvement.

At worst, it suggests you are a testbed site for untrialled features or scale.

I'm not familiar with Reiser in the least, but the following quote worries me: "Reiser4 is based on plugins, which means that it will attract many outside contributors, and you'll be able to upgrade to their innovations without reformatting your disk. If you like to code, you'll really like plugins...."

so the obvious question is: what plugins, if any, do you deploy? External plugins inevitably mean there is no single codebase check-in, testing, or formal bakeoff . . .

I may know diddly about Reiser internals, but insofar as NTFS is what i use, and use intensively (currently deep into the Sparse Files API because that fits nicely with our problem set), i am hard pressed to think of one critical NTFS hotfix that was not spurred by a rolling codebase in application and NTFS features. IOW the baseline has been rock solid for us since 3.51, and the problem areas are (in VMS speak) layered products - i.e. forcing feature use or performance-related optimisations from the application layer (which really ain't for the faint of heart, is very rare in user apps, and means it's usually only MSFT themselves screwing things up if you look through the patch lists).

My hypothetical extrapolation is that for a known-good FS, anything which needs patching is application- and feature-driven - so why would a service provider be doing such a thing?

I would appreciate some clarification, over and above a simple "you're wrong, John".

Good that you pretty much hit schedule on the restore. Much appreciated.

On a (possibly) positive note - can you upgrade the cache on your controller? More cache = faster rebuild times for hot spares (you do have hot spares, don't you?), and 512MB is bog minimum for an array of this scale. I think you'll find that any EMC / Hitachi array of 4TB has a minimum cache of 2GB (poss. even more), ignoring multiple controllers, and this is precisely so the rebuild can happen as close to wire speed as possible, without loading the server with requests or risking only partial rebuild before another drive goes down.
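
(rough sketch of why cache, and an otherwise unloaded controller, matter for the rebuild window - illustrative rates, not measured ones:)

size_gb = 300                                  # one failed 10K SCSI/SAS drive
for rate_mb_s in (50, 10):                     # near wire speed vs. a heavily loaded controller
    hours = size_gb * 1000 / rate_mb_s / 3600
    print(rate_mb_s, "MB/s ->", round(hours, 1), "hours exposed to a second drive failure")
# more cache keeps the rebuild nearer the faster figure while the array is still serving live requests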

Once again, good job on your time planning, but please drop in on my queries above.

kind regards,

- john
Old 13 Nov 2005, 09:45 PM   #10
fmfan
Master of the @
 
Join Date: Jul 2002
Location: TX US
Posts: 1,298
Quote:
Originally posted by jmlondon
... 512MB is bog minimum for an array of this scale. I think you'll find that any EMC / Hitachi array of 4TB has a minimum cache of 2GB (poss. even more), ignoring multiple controllers, and this is precisely so the rebuild can happen as close to wire speed as possible, without loading the server with requests or risking only partial rebuild before another drive goes down. ...
512MB?

I have 2GB on this home computer - not cache, but RAM.

40GB internal and 120GB external hard drives.
One HD has a 2MB cache, the other 8MB.

Seems like the 2GB of RAM might be similar to the 512MB, as it is effectively the system (memory) cache?
Old 14 Nov 2005, 05:08 AM   #11
Jeremy Howard
Ultimate Contributor
 
Join Date: Sep 2001
Location: Australia
Posts: 11,501
John, we use Reiser3, not Reiser4; Reiser3 is not plugin-based.

We have a close vendor involvement with them to ensure that we get the best performance and reliability we can. We are not a testbed for new features.

I have also used NTFS extensively and I have not found it more reliable than Reiser3.

We already have 2GB cache on our SCSI array.