Matthew Angelo | 7 Feb 03:45 2011

RAID Failure Calculator (for 8x 2TB RAIDZ)

I require a new high capacity 8 disk zpool.  The disks I will be
purchasing (Samsung or Hitachi) have an Error Rate (non-recoverable,
bits read) of 1 in 10^14 and will be 2TB.  I'm staying clear of WD
because they have the new 4096-byte sectors, which don't play nice with ZFS
at the moment.

My question is, how do I determine which of the following zpool and
vdev configurations I should run to maximize space whilst mitigating
rebuild failure risk?

1. 2x RAIDZ(3+1) vdev
2. 1x RAIDZ(7+1) vdev
3. 1x RAIDZ2(7+1) vdev

I just want to prove I shouldn't run a plain old RAID5 (RAIDZ) with 8x
2TB disks.
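
For concreteness, the standard back-of-the-envelope number behind that question is the chance of hitting at least one unrecoverable read error (URE) while resilvering, given the quoted rate of 1 in 10^14 bits read. A minimal sketch, assuming a full pool and independent bit errors (the function name is illustrative, not from this thread):

import math

def p_ure_during_rebuild(data_disks, disk_tb, ber=1e-14):
    """Probability of >= 1 URE while reading every surviving disk in full."""
    bits_read = data_disks * disk_tb * 1e12 * 8
    # 1 - (1 - ber)**bits_read, computed stably for a tiny per-bit rate
    return -math.expm1(bits_read * math.log1p(-ber))

print(p_ure_during_rebuild(7, 2.0))  # 1x RAIDZ(7+1): ~0.67
print(p_ure_during_rebuild(3, 2.0))  # one RAIDZ(3+1) vdev: ~0.38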

Cheers
Ian Collins | 7 Feb 05:18 2011

Re: RAID Failure Calculator (for 8x 2TB RAIDZ)

  On 02/ 7/11 03:45 PM, Matthew Angelo wrote:
> I require a new high capacity 8 disk zpool.  The disks I will be
> purchasing (Samsung or Hitachi) have an Error Rate (non-recoverable,
> bits read) of 1 in 10^14 and will be 2TB.  I'm staying clear of WD
> because they have the new 4096-byte sectors, which don't play nice with ZFS
> at the moment.
>
> My question is, how do I determine which of the following zpool and
> vdev configurations I should run to maximize space whilst mitigating
> rebuild failure risk?
>
> 1. 2x RAIDZ(3+1) vdev
> 2. 1x RAIDZ(7+1) vdev
> 3. 1x RAIDZ2(7+1) vdev
>
I assume 3 was 6+2.

A bigger issue than drive error rates is how long a new 2TB drive will 
take to resilver if one dies.  How long are you willing to run without 
redundancy in your pool?

-- 
Ian.
Edward Ned Harvey | 7 Feb 05:48 2011

Re: RAID Failure Calculator (for 8x 2TB RAIDZ)

> From: zfs-discuss-bounces <at> opensolaris.org [mailto:zfs-discuss-
> bounces <at> opensolaris.org] On Behalf Of Matthew Angelo
> 
> My question is, how do I determine which of the following zpool and
> vdev configurations I should run to maximize space whilst mitigating
> rebuild failure risk?
> 
> 1. 2x RAIDZ(3+1) vdev
> 2. 1x RAIDZ(7+1) vdev
> 3. 1x RAIDZ2(6+2) vdev
> 
> I just want to prove I shouldn't run a plain old RAID5 (RAIDZ) with 8x
> 2TB disks.

(Corrected typo: 6+2 for you.)
Sounds like you've made up your mind already.  Nothing wrong with that.  You
are apparently uncomfortable running with only one disk's worth of redundancy.
There is nothing fundamentally wrong with the raidz1 configuration, but the
probability of failure is obviously higher.

The question is, how do you calculate the probability?  Because if we're talking
about 5e-21 versus 3e-19, then you probably don't care about the difference...
They're both essentially zero probability...  Well...  There's no good
answer to that.  

With the cited bit error rate, you're just representing the
probability of a bit error.  You're not representing the probability of a
failed drive.  And you're not representing the probability of a drive
failure within a specified time window.  What you really care about is the
probability of two drives (or 3 drives) failing concurrently...  In which
(Continue reading)
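
A minimal sketch of the concurrent-failure window Edward is pointing at, assuming exponentially distributed drive lifetimes and a datasheet MTBF of 1,000,000 hours (both assumptions, not figures from this thread):

import math

def p_second_failure(surviving_disks, resilver_hours, mtbf_hours=1e6):
    # Chance that at least one surviving drive also dies before the
    # resilver completes, with a constant per-drive failure rate.
    return -math.expm1(-surviving_disks * resilver_hours / mtbf_hours)

print(p_second_failure(7, 48))  # 7+1 RAIDZ1, 2-day resilver: ~3.4e-4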

Matthew Angelo | 7 Feb 07:22 2011

Re: RAID Failure Calculator (for 8x 2TB RAIDZ)

Yes I did mean 6+2, Thank you for fixing the typo.

I'm actually leaning more towards running a simple 7+1 RAIDZ1.
Running this with 1TB disks is not a problem, but I just wanted to
investigate at what TB size the "scales would tip".  I understand
RAIDZ2 protects against failures during a rebuild process.  Currently,
my RAIDZ1 takes 24 hours to rebuild a failed disk, so with 2TB disks
and, worst case, assuming this is 2 days, this is my 'exposure' time.

For example, I would hazard a confident guess that 7+1 RAIDZ1 with 6TB
drives wouldn't be a smart idea.  I'm just trying to extrapolate down
from there.
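
Extrapolating that exposure with the URE arithmetic from earlier in the thread (assuming a full 7+1 vdev and the 1-in-10^14 rate; purely illustrative):

import math

for tb in (1, 2, 3, 4, 6):
    bits = 7 * tb * 1e12 * 8
    p = -math.expm1(bits * math.log1p(-1e-14))
    print(f"{tb} TB drives: P(URE during 7+1 rebuild) ~ {p:.2f}")

# Roughly 0.43 at 1 TB, 0.67 at 2 TB, 0.97 at 6 TB: consistent with the
# guess that 6 TB drives behind single parity would not be smart.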

I will be running a hot (or maybe cold) spare, so I don't need to
factor in the time it takes for a manufacturer to replace the drive.

On Mon, Feb 7, 2011 at 2:48 PM, Edward Ned Harvey
<opensolarisisdeadlongliveopensolaris <at> nedharvey.com> wrote:
>> From: zfs-discuss-bounces <at> opensolaris.org [mailto:zfs-discuss-
>> bounces <at> opensolaris.org] On Behalf Of Matthew Angelo
>>
>> My question is, how do I determine which of the following zpool and
>> vdev configurations I should run to maximize space whilst mitigating
>> rebuild failure risk?
>>
>> 1. 2x RAIDZ(3+1) vdev
>> 2. 1x RAIDZ(7+1) vdev
>> 3. 1x RAIDZ2(6+2) vdev
>>
>> I just want to prove I shouldn't run a plain old RAID5 (RAIDZ) with 8x
>> 2TB disks.
(Continue reading)

Peter Jeremy | 7 Feb 22:07 2011

Re: RAID Failure Calculator (for 8x 2TB RAIDZ)

On 2011-Feb-07 14:22:51 +0800, Matthew Angelo <bangers <at> gmail.com> wrote:
>I'm actually leaning more towards running a simple 7+1 RAIDZ1.
>Running this with 1TB disks is not a problem, but I just wanted to
>investigate at what TB size the "scales would tip".

It's not that simple.  Whilst resilver time is proportional to device
size, it's far more impacted by the degree of fragmentation of the
pool.  And there's no 'tipping point' - it's a gradual slope so it's
really up to you to decide where you want to sit on the probability
curve.

>   I understand
>RAIDZ2 protects against failures during a rebuild process.

This would be its current primary purpose.

>  Currently,
>my RAIDZ1 takes 24 hours to rebuild a failed disk, so with 2TB disks
>and, worst case, assuming this is 2 days, this is my 'exposure' time.

Unless this is a write-once pool, you can probably also assume that
your pool will get more fragmented over time, so by the time your
pool gets to twice its current capacity, it might well take 3 days
to rebuild due to the additional fragmentation.

One point I haven't seen mentioned elsewhere in this thread is that
all the calculations so far have assumed that drive failures were
independent.  In practice, this probably isn't true.  All HDD
manufacturers have their "off" days - where whole batches or models of
disks are cr*p and fail unexpectedly early.  The WD EARS is simply a
(Continue reading)

Richard Elling | 8 Feb 01:53 2011

Re: RAID Failure Calculator (for 8x 2TB RAIDZ)

On Feb 7, 2011, at 1:07 PM, Peter Jeremy wrote:

> On 2011-Feb-07 14:22:51 +0800, Matthew Angelo <bangers <at> gmail.com> wrote:
>> I'm actually leaning more towards running a simple 7+1 RAIDZ1.
>> Running this with 1TB disks is not a problem, but I just wanted to
>> investigate at what TB size the "scales would tip".
> 
> It's not that simple.  Whilst resilver time is proportional to device
> size, it's far more impacted by the degree of fragmentation of the
> pool.  And there's no 'tipping point' - it's a gradual slope so it's
> really up to you to decide where you want to sit on the probability
> curve.

The "tipping point" won't occur between similar configurations. The tip
occurs between quite different configurations. In particular, if the size of the
N+M parity scheme is very large and the resilver times become
very, very large (weeks) then a (M-1)-way mirror scheme can provide
better performance and dependability. But I consider these to be
extreme cases.

>>  I understand
>> RAIDZ2 protects against failures during a rebuild process.
> 
> This would be its current primary purpose.
> 
>> Currently,
>> my RAIDZ1 takes 24 hours to rebuild a failed disk, so with 2TB disks
>> and, worst case, assuming this is 2 days, this is my 'exposure' time.
> 
> Unless this is a write-once pool, you can probably also assume that
(Continue reading)

Paul Kraus | 14 Feb 13:55 2011

Re: RAID Failure Calculator (for 8x 2TB RAIDZ)

On Mon, Feb 7, 2011 at 7:53 PM, Richard Elling <richard.elling <at> gmail.com> wrote:
> On Feb 7, 2011, at 1:07 PM, Peter Jeremy wrote:
>
>> On 2011-Feb-07 14:22:51 +0800, Matthew Angelo <bangers <at> gmail.com> wrote:
>>> I'm actually leaning more towards running a simple 7+1 RAIDZ1.
>>> Running this with 1TB disks is not a problem, but I just wanted to
>>> investigate at what TB size the "scales would tip".
>>
>> It's not that simple.  Whilst resilver time is proportional to device
>> size, it's far more impacted by the degree of fragmentation of the
>> pool.  And there's no 'tipping point' - it's a gradual slope so it's
>> really up to you to decide where you want to sit on the probability
>> curve.
>
> The "tipping point" won't occur for similar configurations. The tip
> occurs for different configurations. In particular, if the size of the
> N+M parity scheme is very large and the resilver times become
> very, very large (weeks) then a (M-1)-way mirror scheme can provide
> better performance and dependability. But I consider these to be
> extreme cases.

    Empirically, it seems that resilver time is related to the number of
objects as much as (if not more than) the amount of data.  zpools (mirrors)
with similar amounts of data but radically different numbers of
objects take very different amounts of time to resilver.  I have NOT
(yet) started actually measuring and tracking this, but the above is
based on casual observation.

P.S. I am measuring number of objects via `zdb -d` as that is faster
than trying to count files and directories and I expect is a much
better measure of what the underlying zfs code is dealing with (a
particular dataset may have lots of snapshot data that does not
(easily) show up).
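
A sketch of that counting approach; the `zdb -d` output format assumed here is not guaranteed (it varies between releases), so treat the regex as illustrative only:

import re
import subprocess

def object_counts(pool):
    # Assumption: output contains lines of the form
    #   Dataset tank/fs [ZPL], ID 21, cr_txg 1, 1.50G, 2845 objects
    out = subprocess.run(["zdb", "-d", pool], capture_output=True,
                         text=True, check=True).stdout
    return {m.group(1): int(m.group(2))
            for m in re.finditer(r"Dataset (\S+) \[.*?\].*?, (\d+) objects", out)}

print(object_counts("tank"))  # hypothetical pool name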

Nico Williams | 14 Feb 14:12 2011

Re: RAID Failure Calculator (for 8x 2TB RAIDZ)

On Feb 14, 2011 6:56 AM, "Paul Kraus" <paul <at> kraus-haus.org> wrote:
> P.S. I am measuring number of objects via `zdb -d` as that is faster
> than trying to count files and directories and I expect is a much
> better measure of what the underlying zfs code is dealing with (a
> particular dataset may have lots of snapshot data that does not
> (easily) show up).

It's faster because: a) no atime updates, b) no ZPL overhead.

Nico
--

Richard Elling | 7 Feb 08:01 2011

Re: RAID Failure Calculator (for 8x 2TB RAIDZ)

On Feb 6, 2011, at 6:45 PM, Matthew Angelo wrote:

> I require a new high capacity 8 disk zpool.  The disks I will be
> purchasing (Samsung or Hitachi) have an Error Rate (non-recoverable,
> bits read) of 1 in 10^14 and will be 2TB.  I'm staying clear of WD
> because they have the new 4096-byte sectors, which don't play nice with ZFS
> at the moment.
> 
> My question is, how do I determine which of the following zpool and
> vdev configurations I should run to maximize space whilst mitigating
> rebuild failure risk?

The MTTDL[2] model will work.
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl
As described, this model doesn't scale well for N > 3 or 4, but it will get
you in the ballpark.

You will also need to know the MTBF from the data sheet, but if you
don't have that info, that is ok because you are asking the right question:
given a single drive type, what is the best configuration for preventing
data loss. Finally, to calculate the raidz2 result, you need to know the 
mean time to recovery (MTTR) which includes the logistical replacement
time and resilver time.

Basically, the model calculates the probability of a data loss event during
reconstruction. This is different for ZFS than for most other LVMs, because ZFS
only resilvers live data, and the total data resilvered is <= the disk size.
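
A rough sketch in the spirit of those two models, with assumed inputs (1,000,000-hour MTBF, 48-hour MTTR, 1-in-10^14 error rate); the exact formulas are in the linked post, so this is only meant to show the shape of the result:

import math

def mttdl1_raidz1(n, mtbf, mttr):
    # Whole-drive failures only: data loss = a second drive failure
    # during the resilver window.
    return mtbf**2 / (n * (n - 1) * mttr)

def mttdl2_raidz1(n, mtbf, disk_tb, ber=1e-14):
    # Adds unrecoverable reads: data loss = one drive failure followed
    # by a URE while reconstructing from the n-1 survivors.
    bits = (n - 1) * disk_tb * 1e12 * 8
    p_ure = -math.expm1(bits * math.log1p(-ber))
    return mtbf / (n * p_ure)

hours_per_year = 24 * 365
print(mttdl1_raidz1(8, 1e6, 48) / hours_per_year)   # ~4.2e4 years
print(mttdl2_raidz1(8, 1e6, 2.0) / hours_per_year)  # ~21 years

At 2TB per disk the unrecoverable-read term dominates, which is why the bit error rate matters more here than the whole-drive failure rate.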

> 
> 1. 2x RAIDZ(3+1) vdev
> 2. 1x RAIDZ(7+1) vdev
> 3. 1x RAIDZ2(7+1) vdev
> 
> 
> I just want to prove I shouldn't run a plain old RAID5 (RAIDZ) with 8x
> 2TB disks.

Double parity will win over single parity. Intuitively, when you add parity you
multiply the MTTDL by another factor of MTBF. When you add disks to a set, you only
change the denominator by a few digits. Obviously multiplication is a good thing,
division not so much. In short, raidz2 is the better choice.
 -- richard
Sandon Van Ness | 7 Feb 14:23 2011

Re: RAID Failure Calculator (for 8x 2TB RAIDZ)

I think the risk to data integrity, and of complete volume loss, runs in
the following order (most likely first):

1. 1x RAIDZ(7+1)
2. 2x RAIDZ(3+1)
3. 1x RAIDZ2(6+2)

Simple raidz certainly is an option with only 8 disks (8 is about the 
maximum I would go), but to be honest I would feel safer going raidz2. 
The 2x raidz (3+1) would probably perform the best, but I would prefer 
the 3rd option (raidz2) as it is better for redundancy. With raidz2 
any two disks can fail, and if you get some unrecoverable read errors 
during a scrub, the double parity over the same set of data gives you 
a much better chance of not having corruption.
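
Putting rough numbers on that ordering (same assumptions as the sketches above: 1-in-10^14 error rate, 48-hour resilver, 1,000,000-hour MTBF, full pool; any URE during a degraded rebuild is counted as a loss event, which overstates the damage under ZFS since a URE there typically costs individual blocks, not the pool):

import math

def p_ure(disks_read, disk_tb=2.0, ber=1e-14):
    bits = disks_read * disk_tb * 1e12 * 8
    return -math.expm1(bits * math.log1p(-ber))

def p_second(survivors, mttr=48.0, mtbf=1e6):
    return -math.expm1(-survivors * mttr / mtbf)

# Approximate chance that a single drive failure escalates to data loss:
p_71 = p_ure(7) + p_second(7)                  # 1x RAIDZ(7+1):  ~0.67
p_31 = p_ure(3) + p_second(3)                  # 2x RAIDZ(3+1):  ~0.38 per vdev
p_62 = p_second(7) * (p_ure(6) + p_second(6))  # 1x RAIDZ2(6+2): ~2e-4
print(p_71, p_31, p_62)

The sums are union-bound approximations, but the ordering, and the orders of magnitude separating double parity from single, is the point.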

On 02/06/2011 06:45 PM, Matthew Angelo wrote:
> I require a new high capacity 8 disk zpool.  The disks I will be
> purchasing (Samsung or Hitachi) have an Error Rate (non-recoverable,
> bits read) of 1 in 10^14 and will be 2TB.  I'm staying clear of WD
> because they have the new 4096-byte sectors, which don't play nice with ZFS
> at the moment.
>
> My question is, how do I determine which of the following zpool and
> vdev configurations I should run to maximize space whilst mitigating
> rebuild failure risk?
>
> 1. 2x RAIDZ(3+1) vdev
> 2. 1x RAIDZ(7+1) vdev
> 3. 1x RAIDZ2(7+1) vdev
>
>
> I just want to prove I shouldn't run a plain old RAID5 (RAIDZ) with 8x
> 2TB disks.
>
> Cheers
>
