Peter T. Breuer | 3 Jan 12:31

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Guy <bugzilla <at> watkins-home.com> wrote:
> "Also sprach Guy:"
> > "Well, you can make somewhere. You only require an 8MB (one cylinder)
> > partition."
> > 
> > So, it is ok for your system to fail when this disk fails?
> 
> You lose the journal, that's all.  You can react with a simple tune2fs
> -O ^journal or whatever is appropriate.  And a journal is ONLY there in
> order to protect you against crashes of the SYSTEM (not the disk), so
> what was the point of having the journal in the first place? 
> 
> ** When you lose the journal, does the system continue without it?
> ** Or does it require user intervention?

I don't recall. It certainly at least puts itself into read-only mode
(if that's the error mode specified via tune2fs). And the situation
probably changes from version to version.
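
For reference, the two tune2fs operations mentioned above can be written out explicitly. This is only a sketch: it builds the command lines and does not run them, the device name /dev/md0 is a placeholder, and the helper names are made up for illustration. In e2fsprogs the journal feature is spelled has_journal, and the on-error behaviour is chosen with -e. Removing the journal generally wants the filesystem unmounted or mounted read-only, and may need an e2fsck run first if the journal still needs recovery.

```python
from typing import List

# Error behaviours accepted by "tune2fs -e".
ERROR_BEHAVIOURS = {"continue", "remount-ro", "panic"}


def set_error_behaviour(device: str, behaviour: str = "remount-ro") -> List[str]:
    """Build the argv that tells ext2/ext3 what to do when it hits an error."""
    if behaviour not in ERROR_BEHAVIOURS:
        raise ValueError(f"unknown error behaviour: {behaviour!r}")
    return ["tune2fs", "-e", behaviour, device]


def drop_journal(device: str) -> List[str]:
    """Build the argv that removes the journal, leaving a plain ext2 filesystem."""
    return ["tune2fs", "-O", "^has_journal", device]


if __name__ == "__main__":
    # /dev/md0 is a hypothetical device name, used only for display here.
    for argv in (set_error_behaviour("/dev/md0"), drop_journal("/dev/md0")):
        print(" ".join(argv))
```

To actually apply one of these, the list could be handed to subprocess.run(argv, check=True) as root.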

On a side note, I don't know why you think user intervention is not
required when a raid system dies.  As a matter of likelihoods, I have
never seen a disk die while IN a working soft (or hard) raid system and
the system continue working afterwards.  Instead, the normal disaster
sequence as I have experienced it is:

   1) lightning strikes rails, or a/c goes out and room full of servers
      overheats. All lights go off.

   2) when sysadmin arrives to sort out the smoking wrecks, he finds
      that 1 in 3 random disks are fried - they're simply the points

maarten | 3 Jan 18:46

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Monday 03 January 2005 12:31, Peter T. Breuer wrote:
> Guy <bugzilla <at> watkins-home.com> wrote:
> > "Also sprach Guy:"

>    1) lightning strikes rails, or a/c goes out and room full of servers
>       overheats. All lights go off.
>
>    2) when sysadmin arrives to sort out the smoking wrecks, he finds
>       that 1 in 3 random disks are fried - they're simply the points
>       of failure that died first, and they took down the hardware with
>       them.
>
>    3) sysadmin buys or jury-rigs enough pieces of nonsmoking hardware
>       to piece together the raid arrays from the surviving disks, and
>       hastily does a copy to somewhere very safe and distant, while
>       an assistant holds off howling hordes outside the door with
>       a shotgun.
>
> In this scenario, a disk simply acts as the weakest link in a fuse
> chain, and the whole chain goes down.  But despite my dramatisation it
> is likely that a hardware failure will take out or damage your hardware!
> IDE disks live on an electric bus connected to other hardware.  Try a
> short circuit and see what happens.  You can't even yank them out while
> the bus is operating if you want to keep your insurance policy.

The chance of a PSU blowing up or lightning striking is, reasonably, much less 
than an isolated disk failure.  If this simple fact is not true for you 
personally, you really ought to reevaluate the quality of your PSU (et al) 
and / or the building's defenses against a lightning strike...


Guy | 3 Jan 22:36

RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Maarten said:
"Doing the math, the outcome is still (200% divided by four)= 50%.
Ergo: the same as with a single disk.  No change."

Guy said:
"I bet a non-mirror disk has similar risk as a RAID1."

Guy and Maarten agree, but Maarten does a better job of explaining it!  :)

I also agree with most of what Maarten said below, but not mirroring swap???

Guy

-----Original Message-----
From: linux-raid-owner <at> vger.kernel.org
[mailto:linux-raid-owner <at> vger.kernel.org] On Behalf Of maarten
Sent: Monday, January 03, 2005 12:47 PM
To: linux-raid <at> vger.kernel.org
Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10
crashing repeatedly and hard)

On Monday 03 January 2005 12:31, Peter T. Breuer wrote:
> Guy <bugzilla <at> watkins-home.com> wrote:
> > "Also sprach Guy:"

>    1) lightning strikes rails, or a/c goes out and room full of servers
>       overheats. All lights go off.
>
>    2) when sysadmin arrives to sort out the smoking wrecks, he finds
>       that 1 in 3 random disks are fried - they're simply the points

maarten | 4 Jan 01:15

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Monday 03 January 2005 22:36, Guy wrote:
> Maarten said:
> "Doing the math, the outcome is still (200% divided by four)= 50%.
> Ergo: the same as with a single disk.  No change."
>
> Guy said:
> "I bet a non-mirror disk has similar risk as a RAID1."
>
> Guy and Maarten agree, but Maarten does a better job of explaining it!  :)
>
> I also agree with most of what Maarten said below, but not mirroring
> swap???

Yeah... bad choice in hindsight.  
But, there once was a time, a long long time ago, that the software-raid howto 
explicitly stated that running swap on raid was a bad idea, and that by 
telling the kernel all swap partitions had the same priority, the kernel 
itself would already 'raid' the swap, ie. divide equally between the swap 
spaces. I'm sure you can read it back somewhere.

Now we know better, and we realize that that will indeed loadbalance between 
the various swap partitions, but it will not provide redundancy at all.  
Oh well, new insights huh ? ;-)

Maarten

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Michael Tokarev | 4 Jan 12:21

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

maarten wrote:
> On Monday 03 January 2005 22:36, Guy wrote:
> 
>>Maarten said:
>>"Doing the math, the outcome is still (200% divided by four)= 50%.
>>Ergo: the same as with a single disk.  No change."
>>
>>Guy said:
>>"I bet a non-mirror disk has similar risk as a RAID1."
>>
>>Guy and Maarten agree, but Maarten does a better job of explaining it!  :)
>>
>>I also agree with most of what Maarten said below, but not mirroring
>>swap???
> 
> 
> Yeah... bad choice in hindsight.  
> But, there once was a time, a long long time ago, that the software-raid howto 
> explicitly stated that running swap on raid was a bad idea, and that by 

In 2.2, and probably in early 2.4, there was indeed a problem with having
swap on a raid (md) array: "random" system lockups, especially during
array recovery.  That problem was fixed long ago.  But I
think the howto in question talks about something different...

> telling the kernel all swap partitions had the same priority, the kernel 
> itself would already 'raid' the swap, ie. divide equally between the swap 
> spaces. I'm sure you can read it back somewhere.

> Now we know better, and we realize that that will indeed loadbalance between 

Peter T. Breuer | 3 Jan 21:22

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

maarten <maarten <at> ultratux.net> wrote:
> The chance of a PSU blowing up or lightning striking is, reasonably, much less 
> than an isolated disk failure.  If this simple fact is not true for you 

Oh?  We have about 20 a year.  Maybe three of them are planned.  But
those are the worst ones!  - the electrical department's method of
"testing" the lines is to switch off the rails then pulse them up and
down.  Surge tests or something.  When we can we switch everything off
beforehand.  But then we also get to deal with the amateur contributions
from the city power people.

Yes, my PhD is in electrical engineering. Have I sent them sarcastic
letters explaining  how to test lines using a dummy load? Yes. Does the
physics department also want to place them in a vat of slowly reheating
liquid nitrogen? Yes. Does it make any difference? No.

I should have kept the letter I got back when I asked them exactly WHAT
it was they thought they had been doing when they sent round a pompous
letter explaining how they had been up all night "helping" the town
power people to get back on line, after an outage took out the
half-million or so people round here. Waiting for the phonecall saying
"you can turn it back on now", I think.

That letter was a riot.

I plug my stuff into the ordinary mains myself.  It fails less often
than the "secure circuit" plugs we have that are meant to be wired to
their smoking giant UPS that apparently takes half the city output to
power up.


I'm glad I don't live in Spain (was Re: ext3 journal on software raid)

I've been following this discussion with varying degrees of incredulity 
and finally thought to look up just _where_ Peter works that might have 
such incompetence involved.  His email address is at the University of 
Madrid in Spain.  I'm glad I don't have to deal with it (not that Japan 
is perfect - but the power usually works here).

Peter, you may be a math whiz, but your situation and experience are very 
different from that of many other people here.  Please stop trying to 
say that because you have power failures 12 times a year everyone else 
does.  We don't.

Most people have working power systems and experience disk failures 
because disks are funny spinny things and wear out.  Many of us work in 
environments where when the power people do funny things they get fired 
and replaced with people who do not do funny things.  You have my 
sympathies but please stop trying to tell everyone that they should 
expect 12 power failures a year.

Peter T. Breuer wrote:

>maarten <maarten <at> ultratux.net> wrote:
>  
>
>>The chance of a PSU blowing up or lightning striking is, reasonably, much less 
>>than an isolated disk failure.  If this simple fact is not true for you 
>>    
>>
>
>Oh?  We have about 20 a year.  Maybe three of them are planned.  But
>those are the worst ones!  - the electrical department's method of

maarten | 4 Jan 01:08

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Monday 03 January 2005 21:22, Peter T. Breuer wrote:
> maarten <maarten <at> ultratux.net> wrote:
> > The chance of a PSU blowing up or lightning striking is, reasonably, much
> > less than an isolated disk failure.  If this simple fact is not true for
> > you
>
> Oh?  We have about 20 a year.  Maybe three of them are planned.  But
> those are the worst ones!  - the electrical department's method of
> "testing" the lines is to switch off the rails then pulse them up and
> down.  Surge tests or something.  When we can we switch everything off
> beforehand.  But then we also get to deal with the amateur contributions
> from the city power people.

It goes on and on below, but this first paragraph of yours is already striking(!)
You actually say that the planned outages are worse than the others!
OMG.  Who taught you how to plan ?  Isn't planning the act of anticipating 
things, and acting accordingly so as to minimize the impact ?
So your planning is so bad that the planned maintenance is actually worse than 
the impromptu outages.   I...  I am speechless.  Really. You take the cake.

But from the rest of your post it also seems you define a "total system 
failure" as something entirely different from what the rest of us (presumably) do.
You count either planned or unplanned outages as failures, whereas most of us 
would call that downtime, not system failure, let alone "total".
If you have a problematic UPS system, or mentally challenged UPS engineers, 
that does not constitute a failure IN YOUR server.  Same for a broken 
network.  A total system failure is where the single computer system we're 
focussing on goes down or is unresponsive. You can't say "your server" is 
down when all that is happening is someone pulled the UTP from your remote 
console...! 

Guy | 4 Jan 00:05

RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Peter said:
"Except that it is not the case. With a single disk you are CERTAIN to
detect the problem (if it is detectable) when you run the fsck at reboot."

Guy says:
As a guess, fsck checks less than 1% of the disk.  No user data is checked.
So, virtually all errors would go un-detected.  But a RAID system could
detect the errors.  And yes, RAID6 could correct a single disk error.  Even
multi disk errors, as long as only 1 error per stripe/block.
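
To make the "RAID could detect the errors" point concrete, here is a rough sketch of a single-parity (XOR) consistency check over one stripe. It is an illustration under stated assumptions, not how md is implemented: the block contents and function names are invented, and whether and when a given RAID implementation actually runs such a check is not covered. Note that a single parity block can only say that a stripe is inconsistent, not which block is wrong; locating the bad block needs the second syndrome that RAID6 keeps, which is not shown here.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) walks the blocks column by column, one byte from each.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks: List[bytes], parity: bytes) -> bool:
    """True if the stored parity matches the parity recomputed from the data."""
    return xor_blocks(data_blocks) == parity


def rebuild_missing(surviving_blocks: List[bytes], parity: bytes) -> bytes:
    """Recover one missing data block whose position is already known."""
    return xor_blocks(surviving_blocks + [parity])


if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]            # made-up stripe contents
    parity = xor_blocks(data)
    silently_corrupted = [data[0], b"BxBB", data[2]]
    print("clean stripe consistent    :", stripe_is_consistent(data, parity))
    print("corrupted stripe consistent:", stripe_is_consistent(silently_corrupted, parity))
    print("rebuilt block              :", rebuild_missing([data[0], data[2]], parity))
```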

Your data center has problems well beyond the worst stories I have ever
heard!  My home systems tend to have much better uptime than any of your
systems.

  5:05pm  up 33 days, 16:31,  1 user,  load average: 0.12, 0.03, 0.01
17:05:20  up 28 days, 15:06,  1 user,  load average: 0.03, 0.04, 0.00

Both were re-booted by me, not some strange failures.  When I re-booted the
first one, it had over 7 months of uptime.  At work I had 2 systems with
over 2 years uptime, and one of them made it to 3 years (yoda).  The 3 year
system was connected to the Internet and was used for customer demos.  So,
very low usage, but the 2 year system (lager) was an internal email server,
test server, router, had Informix, ...  I must admit, I rebooted the 2 year
system by accident!  But it was a proper reboot, not a crash.  The Y2K
patches did not require a re-boot.

This is from an email I sent 10/26/2000:
"Subject: Happy birthday Yoda!
  6:13pm  up 730 days, 23:59,  1 user,  load average: 0.10, 0.12, 0.12
  6:14pm  up 731 days,  1 user,  load average: 0.08, 0.12, 0.12"

maarten | 3 Jan 20:52

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Monday 03 January 2005 18:46, maarten wrote:
> On Monday 03 January 2005 12:31, Peter T. Breuer wrote:
> > Guy <bugzilla <at> watkins-home.com> wrote:

>
> Doing the math, the outcome is still (200% divided by four)= 50%.
> Ergo: the same as with a single disk.  No change.

Just for laughs, I calculated this chance also for a three-way raid-1 setup 
using a lower 'failure possibility' percentage.  The outcome does not change.
The (statistically higher) chance of a disk failing is exactly offset by the 
greater likelihood that the raid system chooses one of the good drives to 
read from.
(Obviously this is only valid for raid level 1, not for level 5 or others)

Let us (randomly) assume there is a 10% chance of a disk failure.
We use three raid-1 disks, numbered 1 through 3.

We therefore have eight possible scenarios:

A
disk1 fail
disk2 good
disk3 good

B
disk1 good
disk2 fail
disk3 good
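
The enumeration started above can be carried through mechanically. The following is a minimal sketch of one reading of the argument, and it rests on assumptions that are not spelled out in the post: each disk independently holds an error with probability p, the raid-1 layer reads from one mirror chosen uniformly at random, and nothing detects the error before the read. Under those assumptions the chance of a read landing on a bad copy comes out equal to p for any number of mirrors, up to floating-point rounding. The function name is made up for illustration.

```python
from itertools import product


def bad_read_probability(p: float, n_disks: int = 3) -> float:
    """Chance that a read from a randomly chosen mirror hits a bad copy.

    Walks every good/bad combination of the mirrors (2**n_disks scenarios),
    weights each by its probability, and multiplies by the fraction of
    mirrors that are bad in that scenario.
    """
    total = 0.0
    for states in product((True, False), repeat=n_disks):  # True = bad disk
        scenario_probability = 1.0
        for is_bad in states:
            scenario_probability *= p if is_bad else (1.0 - p)
        fraction_bad = sum(states) / n_disks
        total += scenario_probability * fraction_bad
    return total


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "mirror(s):", bad_read_probability(0.10, n))
```

This says nothing about how often an error is noticed or repaired, which is the part the thread goes on to argue about.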


Peter T. Breuer | 3 Jan 21:41

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

maarten <maarten <at> ultratux.net> wrote:
> Just for laughs, I calculated this chance also for a three-way raid-1 setup 

There's no need for you to do this - your calculations are
unfortunately not meaningful.

> Let us (randomly) assume there is a 10% chance of a disk failure.

No, call it "p". That is the correct name. And I presume you mean "an
error", not "a failure".

> We therefore have eight possible scenarios:

Oh, puhleeeeze.  Infantile arithmetic instead of elementary probabilistic
algebra is not something I wish to suffer through ...

> A
> disk1 fail
> disk2 good
> disk3 good

 ...

> H
> disk1 good
> disk2 good
> disk3 good

Was that all? 8 was it? 1 all good, 3 with one good, 3 with two good, 1
with all fail? Have we got the binomial theorem now!
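
The 1-3-3-1 grouping is the binomial count of how many of the three disks are bad. As a small sketch, assuming independent disks that each carry an error with probability p (the 10% from the quoted post is used only as an example value, and the function name is invented), the eight scenarios collapse into four probabilities:

```python
from math import comb
from typing import List


def failure_count_distribution(p: float, n_disks: int = 3) -> List[float]:
    """P(exactly k of n_disks are bad) for k = 0 .. n_disks."""
    return [
        comb(n_disks, k) * p**k * (1.0 - p) ** (n_disks - k)
        for k in range(n_disks + 1)
    ]


if __name__ == "__main__":
    n = 3
    print("scenarios per count:", [comb(n, k) for k in range(n + 1)])
    for k, probability in enumerate(failure_count_distribution(0.10, n)):
        print(f"{k} bad disk(s): {probability:.3f}")
```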

maarten | 4 Jan 01:45

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Monday 03 January 2005 21:41, Peter T. Breuer wrote:
> maarten <maarten <at> ultratux.net> wrote:
> > Just for laughs, I calculated this chance also for a three-way raid-1
> > setup

> > Let us (randomly) assume there is a 10% chance of a disk failure.
>
> No, call it "p". That is the correct name. And I presume you mean "an
> error", not "a failure".

You presume correctly.

> > We therefore have eight possible scenarios:
>
> Oh, puhleeeeze.  Infantile arithmetic instead of elementary probabilistic
> algebra is not something I wish to suffer through ...

Maybe not.  Your way of explaining may make sense to a math expert; I tried to 
explain it in a form other humans might comprehend, and that was on purpose.

Your way may be correct, or it may not be, I'll leave that up to other people. 
To me, it looks like you complicate it and obfuscate it, like someone can 
code a one-liner in perl which is completely correct yet cannot be read by 
anyone but the author...  In other words, you try to impress me with your 
leet math skills but my explanation was both easier to read and potentially 
reached a far bigger audience.

Now excuse me if my omitting "p" in my calculation made you lose your 
concentration... or something.  Further comments to be found below.


Peter T. Breuer | 4 Jan 11:14

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

maarten <maarten <at> ultratux.net> wrote:
> On Monday 03 January 2005 21:41, Peter T. Breuer wrote:
> > maarten <maarten <at> ultratux.net> wrote:
> > > Just for laughs, I calculated this chance also for a three-way raid-1
> > > setup
> 
> > > Let us (randomly) assume there is a 10% chance of a disk failure.
> >
> > No, call it "p". That is the correct name. And I presume you mean "an
> > error", not "a failure".
> 
> You presume correctly.
> 
> > > We therefore have eight possible scenarios:
> >
> > Oh, puhleeeeze.  Infantile arithmetic instead of elementary probabilistic
> > algebra is not something I wish to suffer through ...
> 
> Maybe not.  Your way of explaining may make sense to a math expert, I tried to 

It would make sense to a 16 year old, since that's about where you get
to be certified as competent in differential calculus and probability
theory, if my memory of my high school math courses is correct.  This is
pre-university stuff by a looooooong way.

The problem is that I never have a 9-year old child available when I
need one ...

> explain it in a form other humans might comprehend, and that was on purpose.


Maarten | 4 Jan 14:24

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tuesday 04 January 2005 11:14, Peter T. Breuer wrote:
> maarten <maarten <at> ultratux.net> wrote:
> > On Monday 03 January 2005 21:41, Peter T. Breuer wrote:
> > > maarten <maarten <at> ultratux.net> wrote:

> It would make sense to a 16 year old, since that's about where you get
> to be certified as competent in differential calculus and probability
> theory, if my memory of my high school math courses is correct.  This is
> pre-university stuff by a looooooong way.

Oh wow.  So you deduced I did not study math at university ?
Well, that IS an eye-opener for me.  I was unaware studying math was a 
requirement to engage in conversation on the linux-raid mailinglist ?
Or is this not the list I think it is ?

> The problem is that I never have a 9-year old child available when I
> need one ...

Um, check again... he's sitting right there with you I think.

> You forget it because it is tiny.  As tiny as you or I could wish to
> make it.  Puhleeze.  This is just Poisson distributions.

>
> Therefore you forget it. All of differential calculus works like that.
> Forget the square term - it vanishes. All terms of the series beyond
> the first can be ignored as you go to the limiting situation.
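
How much is lost by dropping the square term depends on how small p is. The sketch below assumes the quantity in question is the chance that at least one of n independent disks has an error, which is 1 - (1 - p)^n exactly and n*p to first order; that reading is an assumption, since the full derivation is not shown in the quoted text. It simply compares the two for a few values of p so the size of the neglected terms is visible.

```python
def exact_any_error(p: float, n_disks: int) -> float:
    """Exact chance that at least one of n_disks has an error."""
    return 1.0 - (1.0 - p) ** n_disks


def first_order_any_error(p: float, n_disks: int) -> float:
    """Same quantity with every term in p**2 and higher thrown away."""
    return n_disks * p


if __name__ == "__main__":
    n = 3
    for p in (0.1, 0.01, 0.001):
        exact = exact_any_error(p, n)
        approx = first_order_any_error(p, n)
        relative_gap = (approx - exact) / exact
        print(f"p={p}: exact={exact:.6f} first-order={approx:.6f} gap={relative_gap:.2%}")
```

The gap shrinks roughly in proportion to p, so the approximation is tight for small per-disk probabilities and visibly loose at 10%.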

And that is precisely the false assumption you're making! We are NOT going 
to the limiting situation. We are discussing the probabilities in failures of 

Peter T. Breuer | 4 Jan 15:05

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Maarten <maarten <at> ultratux.net> wrote:
> On Tuesday 04 January 2005 11:14, Peter T. Breuer wrote:
> > maarten <maarten <at> ultratux.net> wrote:
> > > On Monday 03 January 2005 21:41, Peter T. Breuer wrote:
> > > > maarten <maarten <at> ultratux.net> wrote:
> 
> > It would make sense to a 16 year old, since that's about where you get
> > to be certified as competent in differential calculus and probability
> > theory, if my memory of my high school math courses is correct.  This is
> > pre-university stuff by a looooooong way.
> 
> Oh wow.  So you deduced I did not study math at university ?

Well, I deduced that you did not get to the level expected of a 16 year
old.

> Well, that IS an eye-opener for me.  I was unaware studying math was a 

One doesn't "study" math, one _does_ math, just as one _does_ walking
down the street, talking, and opening fridge doors. Your competency
at it gets certified in school and uni, that's all.

> requirement to engage in conversation on the linux-raid mailinglist ?

Looks like it, or else one gets bogged down in inane conversations at
about the level of "what is an editor".

Look - a certain level of mathematical competence is required in the
technical world.  You cannot get away without it.  Being able to do math
to the level expected of an ordinary 16 year old is certainly expected

Maarten | 4 Jan 16:31

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tuesday 04 January 2005 15:05, Peter T. Breuer wrote:
> Maarten <maarten <at> ultratux.net> wrote:
> > On Tuesday 04 January 2005 11:14, Peter T. Breuer wrote:
> > > maarten <maarten <at> ultratux.net> wrote:
> > > > On Monday 03 January 2005 21:41, Peter T. Breuer wrote:
> > > > > maarten <maarten <at> ultratux.net> wrote:

> > Well, that IS an eye-opener for me.  I was unaware studying math was a
>
> One doesn't "study" math, one _does_ math, just as one _does_ walking
> down the street, talking, and opening fridge doors. Your competency
> at it gets certified in school and uni, that's all.

I know a whole mass of people who can't calculate what chance the toss of a 
coin has.  Or who don't know how to verify their money change is correct.
So it seems math is not an essential skill, like walking and talking is.
I'll not even go into gambling, which is immensely popular.  I'm sure there 
are even mathematicians who gamble.  How do you figure that ?? 

> > but that doesn't make it so that any harddrive has a life expectancy of
> > 20+ years, as the daily facts prove all the time.
>
> It does mean it. It means precisely that (given certain experimental
> conditions). If you want to calculate the MTBF in a real dusty noisy
> environment, I would say it is about ten years. That is, 10% chance of
> failure per year.
>
> If they say it is 20 years and not 10 years, well I believe that too,
> but they must be keeping the monkeys out of the room.


Mikael Abrahamsson | 4 Jan 20:57

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tue, 4 Jan 2005, Maarten wrote:

> failures within the first 10 years, let alone 20, to even remotely support 
> that outrageous MTBF claim.

One should note that environment seriously affects MTBF, even on 
non-movable parts, and probably even more on movable parts.

I've talked to people in the reliability business, and they use models 
that say that MTBF for a part at 20 C as opposed to 40 C can differ by a 
factor of 3 or 4, or even more. A lot of people skimp on cooling and then 
get upset when their drives fail.

I'd venture to guess that a drive that has an MTBF of 1.2M at 25C will 
have less than 1/10th of that at 55-60C.
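
One common way to model this kind of temperature dependence is the Arrhenius acceleration factor. The sketch below is illustrative only: the choice of the Arrhenius model and the activation energy of 0.5 eV are assumptions, not figures from this thread or from any drive data sheet, and the function names are invented. With that activation energy the factor between 20 C and 40 C lands in the 3 to 4 range mentioned above; a different activation energy gives a different answer.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV per kelvin


def acceleration_factor(cool_c: float, hot_c: float,
                        activation_energy_ev: float = 0.5) -> float:
    """Arrhenius factor by which the failure rate grows from cool_c to hot_c."""
    cool_k = cool_c + 273.15
    hot_k = hot_c + 273.15
    exponent = activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / cool_k - 1.0 / hot_k)
    return math.exp(exponent)


def derated_mtbf(mtbf_hours: float, rated_c: float, actual_c: float,
                 activation_energy_ev: float = 0.5) -> float:
    """MTBF at actual_c, given the MTBF quoted at rated_c."""
    return mtbf_hours / acceleration_factor(rated_c, actual_c, activation_energy_ev)


if __name__ == "__main__":
    print("20 C -> 40 C factor:", acceleration_factor(20, 40))
    print("1.2M hours at 25 C, run at 55 C:", derated_mtbf(1_200_000, 25, 55))
```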

--

-- 
Mikael Abrahamsson    email: swmike <at> swm.pp.se


maarten | 4 Jan 22:05

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tuesday 04 January 2005 20:57, Mikael Abrahamsson wrote:
> On Tue, 4 Jan 2005, Maarten wrote:
> > failures within the first 10 years, let alone 20, to even remotely
> > support that outrageous MTBF claim.
>
> One should note that environment seriously affects MTBF, even on
> non-movable parts, and probably even more on movable parts.

Yes.  Heat especially above all else.

> I've talked to people in the reliability business, and they use models
> that say that MTBF for a part at 20 C as opposed to 40 C can differ by a
> factor of 3 or 4, or even more. A lot of people skimp on cooling and then
> get upset when their drives fail.
>
> I'd venture to guess that a drive that has an MTBF of 1.2M at 25C will
> have less than 1/10th of that at 55-60C.

Yes. I know that full well.  Therefore my server drives are mounted directly 
behind two monstrous 12cm fans...  I don't take no risks.  :-)

Still, two western digitals have died within the first or second year in that 
enclosure. So much for MTBF vs. real world expectancy I guess.

It should be public knowledge by now that heat is the number 1 killer for 
harddisks.  However, you still see PC cases everywhere where disks are 
sandwiched together and with no possible airflow at all. Go figure... 

Maarten


Guy | 4 Jan 22:46

RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

I have a PC with 2 disks, these disks are much too hot to touch for more
than a second or less.  The system has been like that for 3-4 years.  I have
no idea how they lasted so long!  1 is an IBM the other is Seagate.  Both
are 18 Gig SCSI disks.  The Seagate is 10,000 RPM.

As you said: "Go figure..."!  :)

Guy

-----Original Message-----
From: linux-raid-owner <at> vger.kernel.org
[mailto:linux-raid-owner <at> vger.kernel.org] On Behalf Of maarten
Sent: Tuesday, January 04, 2005 4:05 PM
To: linux-raid <at> vger.kernel.org
Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10
crashing repeatedly and hard)

On Tuesday 04 January 2005 20:57, Mikael Abrahamsson wrote:
> On Tue, 4 Jan 2005, Maarten wrote:
> > failures within the first 10 years, let alone 20, to even remotely
> > support that outrageous MTBF claim.
>
> One should note that environment seriously affects MTBF, even on
> non-movable parts, and probably even more on movable parts.

Yes.  Heat especially above all else.

> I've talked to people in the reliability business, and they use models
> that say that MTBF for a part at 20 C as opposed to 40 C can differ by a
> factor of 3 or 4, or even more. A lot of people skimp on cooling and then

Alvin Oga | 4 Jan 22:26

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)


On Tue, 4 Jan 2005, maarten wrote:

> Yes. I know that full well.  Therefore my server drives are mounted directly 
> behind two monstrous 12cm fans...  I don't take no risks.  :-)

exactly... lots of air for the drives ( treat them like a cpu ): they
should be kept as cool as possible

> Still, two western digitals have died within the first or second year in that 
> enclosure. So much for MTBF vs. real world expectancy I guess.

wd is famous for various reasons ..

> It should be public knowledge by now that heat is the number 1 killer for 
> harddisks.  However, you still see PC cases everywhere where disks are 
> sandwiched together and with no possible airflow at all. Go figure... 

it's a conspiracy, to get you/us to buy new disks when the old one dies

but if we all kept a 3" fan cooling each disk ... inside the pcs,
there'd be fewer disk failures
	- and equal amounts of fresh cooler air coming in as 
	hot air going out

c ya
alvin


Peter T. Breuer | 4 Jan 17:21

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Maarten <maarten <at> ultratux.net> wrote:
> I'll not even go into gambling, which is immensely popular.  I'm sure there 
> are even mathematicians who gamble.  How do you figure that ?? 

I know plenty who do.  They win.  A friend of mine made his living at
the institute of advanced studies at princeton for two years after his
grant ran out by winning at blackjack in casinos all over the states.
(never play him at poker!  I used to lose all my matchsticks ..)

> > If they say it is 20 years and not 10 years, well I believe that too,
> > but they must be keeping the monkeys out of the room.
> 
> Nope, not 10 years, not 20 years, not even 40 years.  See this Seagate sheet 
> below where they go on record with a whopping 1,200,000 hours MTBF.  That 
> translates to 137 years.

I believe that too.  They REALLY have kept the monkeys well away.
They're only a factor of ten out from what I think it is, so I certainly
believe them.  And they probably discarded the ones that failed burn-in
too.

> Now can you please state here and now that you 
> actually believe that figure ?

Of course. Why wouldn't I? They are stating something like 1% lossage
per year under perfect ideal conditions, no dust, no power spikes, no
a/c overloads, etc. I'd easily believe that.
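
The arithmetic behind the 137 years and the roughly 1% per year can be checked in a few lines. This is a sketch that assumes a constant failure rate equal to 1/MTBF, which is what the data-sheet figure implies but not necessarily how drives behave in the field; the function names are made up.

```python
import math

HOURS_PER_YEAR = 24 * 365


def mtbf_in_years(mtbf_hours: float) -> float:
    """Convert a data-sheet MTBF from hours to years of continuous running."""
    return mtbf_hours / HOURS_PER_YEAR


def annual_failure_probability(mtbf_hours: float) -> float:
    """Chance that one drive fails within a year, for a constant failure rate."""
    # -expm1(-x) is 1 - exp(-x), computed without losing precision for small x.
    return -math.expm1(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    for label, hours in (("data sheet, 1,200,000 h", 1_200_000),
                         ("ten-year MTBF", 10 * HOURS_PER_YEAR)):
        print(f"{label}: {mtbf_in_years(hours):.1f} years, "
              f"{annual_failure_probability(hours):.2%} per year")
```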

> Cause it would show that you have indeed 
> fully and utterly lost touch with reality.  No sane human being would take 

maarten | 4 Jan 21:55

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tuesday 04 January 2005 17:21, Peter T. Breuer wrote:
> Maarten <maarten <at> ultratux.net> wrote:

> > Nope, not 10 years, not 20 years, not even 40 years.  See this Seagate
> > sheet below where they go on record with a whopping 1,200,000 hours MTBF. 
> > That translates to 137 years.
>
> I believe that too.  They REALLY have kept the monkeys well away.
> They're only a factor of ten out from what I think it is, so I certainly
> believe them.  And they probably discarded the ones that failed burn-in
> too.
>
> > Now can you please state here and now that you
> > actually believe that figure ?
>
> Of course. Why wouldn't I? They are stating something like 1% lossage
> per year under perfect ideal conditions, no dust, no power spikes, no
> a/c overloads, etc. I'd easily believe that.

No spindle will take 137 years of abuse at the incredibly high speed of 10000 
rpm and not show enough wear so that the heads will either collide with the 
platters or read on adjacent tracks.  Any mechanic can tell you this.
I don't care what kind of special diamond bearings you use, it's just not 
feasible.  We could even start a debate of how much decay we would see in the 
silicon junctions in the chips, but that is not useful nor on-topic.  Let's 
just say that the transistor has barely existed for 50 years and it is utter nonsense 
to try to say anything meaningful about what 137 years will do to 
semiconductors and their molecular structures over that vast a timespan.  
Remember, it was not too long ago they said CDs were indestructible (by time 
elapsed, not by force, obviously). And look what they say now. 

Peter T. Breuer | 4 Jan 22:38

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

maarten <maarten <at> ultratux.net> wrote:
> I don't see where you come up with 1% per year.

Because that is 1/137 approx (hey, is that Planck's constant or
something...)

> Remember that MTBF means MEAN 
> time between failures,

I.e. it's the inverse of the probability of failure per unit time, in a
Poisson distribution.  A Poisson distribution only has one parameter
and that's it! The standard deviation is that too. No, I don't recall
the third moment offhand.
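
To pin the model down a little: with a constant failure rate, failures arrive as a Poisson process and the lifetime of an individual drive follows the exponential distribution, whose mean and standard deviation are both equal to the MTBF. The sketch below illustrates that under exactly this assumption; real drives with infant mortality and wear-out do not follow it. The 137-year figure is taken from the thread, while the function names and the seeded simulation are only for illustration.

```python
import math
import random
import statistics
from typing import List


def survival(t_years: float, mtbf_years: float) -> float:
    """Chance that a drive is still alive after t_years."""
    return math.exp(-t_years / mtbf_years)


def conditional_survival(extra_years: float, age_years: float,
                         mtbf_years: float) -> float:
    """Chance of lasting extra_years more, given the drive reached age_years."""
    return survival(age_years + extra_years, mtbf_years) / survival(age_years, mtbf_years)


def simulate_lifetimes(mtbf_years: float, n_drives: int, seed: int = 0) -> List[float]:
    """Draw n_drives exponential lifetimes with the given mean."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mtbf_years) for _ in range(n_drives)]


if __name__ == "__main__":
    mtbf = 137.0
    lifetimes = simulate_lifetimes(mtbf, 100_000)
    print("sample mean          :", statistics.mean(lifetimes))
    print("sample std deviation :", statistics.pstdev(lifetimes))
    print("P(survive 1 year)    :", survival(1, mtbf))
    print("P(1 more | aged 50)  :", conditional_survival(1, 50, mtbf))
    print("P(outlive the MTBF)  :", survival(mtbf, mtbf))
```

In this model a drive that has already survived 50 years has the same chance of lasting one more year as a new one, and only about 37% of drives outlive the mean; no surviving drive has to "make up" for an early death.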

> so for every single drive that dies in year one, one 
> other drive has to double its life expectancy to twice 137, which is 274 

Complete nonsense. Please go back to remedial statistics.

> years.  If your reasoning is correct with one drive dying per year, the 

Who said that? I said the probability of failure is 1% per year. Not
one drive per year! If you have a hundred drives, you expect about one
death in the first year.
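
As a rough check of the arithmetic above, the sketch below converts an MTBF
figure into a per-year failure probability.  It assumes a constant failure
rate (exponentially distributed lifetimes), which is the usual reading of an
MTBF number and which ignores burn-in and wear-out.  The function names are
made up for this illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def mtbf_years(mtbf_hours):
    """Express an MTBF given in hours as a number of years."""
    return mtbf_hours / HOURS_PER_YEAR


def annual_failure_probability(mtbf_hours):
    """Chance that one drive fails within a year, assuming a constant
    failure rate of 1/MTBF (exponential lifetime model)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures(n_drives, mtbf_hours, years=1.0):
    """Expected number of drives, out of n_drives, that have failed
    after the given number of years under the same model."""
    p = 1.0 - math.exp(-years * HOURS_PER_YEAR / mtbf_hours)
    return n_drives * p


if __name__ == "__main__":
    mtbf = 1_200_000  # hours, the datasheet figure quoted in the thread
    print("MTBF in years:          ", mtbf_years(mtbf))
    print("P(failure in one year): ", annual_failure_probability(mtbf))
    print("Expected deaths per 100:", expected_failures(100, mtbf))
```

With the quoted 1,200,000 hours this comes out at roughly 0.7% per year, the
same order as the 1% figure.  A 137-year MTBF is a statement about the
failure rate, not a claim that any single drive lasts 137 years.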

> remaining bunch after 50 years will have to survive another 250(!) years, on 
> average.  ...But wait, you're still not convinced, eh ?

Complete and utter disgraceful nonsense! Did you even get as far as the
11-year old standard in your math?

Guy | 5 Jan 00:29

RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

I think in this example, if you had 137 disks, you should expect an average
of 1 failed drive per year.  But, I would bet after 5 years you would have
much more than 5 failed disks!

Guy 

-----Original Message-----
From: linux-raid-owner <at> vger.kernel.org
[mailto:linux-raid-owner <at> vger.kernel.org] On Behalf Of Peter T. Breuer
Sent: Tuesday, January 04, 2005 4:38 PM
To: linux-raid <at> vger.kernel.org
Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10
crashing repeatedly and hard)

maarten <maarten <at> ultratux.net> wrote:
> I don't see where you come up with 1% per year.

Because that is 1/137 approx (hey, is that Planck's constant or
something...)

> Remember that MTBF means MEAN 
> time between failures,

I.e. it's the inverse of the probability of failure per unit time, in a
Poisson distribution.  A Poisson distribution only has one parameter
and that's it! The standard deviation is that too. No, I don't recall
the third moment offhand.

> so for every single drive that dies in year one, one 
> other drive has to double its life expectancy to twice 137, which is 274 

Peter T. Breuer | 4 Jan 22:11

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

maarten <maarten <at> ultratux.net> wrote:
> On Tuesday 04 January 2005 17:21, Peter T. Breuer wrote:
> > Maarten <maarten <at> ultratux.net> wrote:
> 
> 
> > > Nope, not 10 years, not 20 years, not even 40 years.  See this Seagate
> > > sheet below where they go on record with a whopping 1,200,000 hours MTBF. 
> > > That translates to 137 years.
> >
> > I believe that too.  They REALLY have kept the monkeys well away.
> > They're only a factor of ten out from what I think it is, so I certainly
> > believe them.  And they probably discarded the ones that failed burn-in
> > too.
> >
> > > Now can you please state here and now that you
> > > actually believe that figure ?
> >
> > Of course. Why wouldn't I? They are stating something like 1% lossage
> > per year under perfect ideal conditions, no dust, no power spikes, no
> > a/c overloads, etc. I'd easily believe that.
> 
> No spindle will take 137 years of abuse at the incredibly high speed of 10000 
> rpm and not show enough wear so that the heads will either collide with the 

Nor does anyone say it will! That's the MTBF, that's all. It's a
parameter in a statistical distribution. The inverse of the
probability of failure per unit time.

Peter


Peter T. Breuer | 4 Jan 00:19

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Peter T. Breuer <ptb <at> lab.it.uc3m.es> wrote:
> No, call it "p". That is the correct name. And I presume you mean "an
> error", not "a failure".

I'll do this thoroughly, so you can see how it goes.

Let 

   p = probability of a detectable error occurring on a disk in a unit time
   p'= probability of an undetectable error occurring on a disk in a unit time

Then the probability of an error occurring UNdetected on an n-disk raid
array is

       (n-1)p + np'

and on a 1 disk system (a 1-disk raid array :) it is

       p'

OK? (hey, I'm a mathematician, it's obvious to me).

Exercise .. calculate effect of majority voting! 

Peter

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Neil Brown | 4 Jan 00:46

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tuesday January 4, ptb <at> lab.it.uc3m.es wrote:
> Peter T. Breuer <ptb <at> lab.it.uc3m.es> wrote:
> > No, call it "p". That is the correct name. And I presume you mean "an
> > error", not "a failure".
> 
> I'll do this thoroughly, so you can see how it goes.
> 
> Let 
> 
>    p = probability of a detectable error occurring on a disk in a unit time
>    p'= probability of an undetectable error occurring on a disk in a unit time
> 
> Then the probability of an error occurring UNdetected on an n-disk raid
> array is
> 
>        (n-1)p + np'
>   
> and on a 1 disk system (a 1-disk raid array :) it is
> 
>        p'
> 
> OK? (hey, I'm a mathematician, it's obvious to me).

It may be obvious, but it is also wrong.  But then probability is, I
think, the branch of mathematics that has the highest ratio of people
who think they understand it to people who actually do (witness the
success of lotteries).

The probability of an event occurring lies between 0 and 1 inclusive.
You have given a formula for a probability which could clearly evaluate

Peter T. Breuer | 4 Jan 01:28

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Neil Brown <neilb <at> cse.unsw.edu.au> wrote:
> On Tuesday January 4, ptb <at> lab.it.uc3m.es wrote:
> > Peter T. Breuer <ptb <at> lab.it.uc3m.es> wrote:
> > > No, call it "p". That is the correct name. And I presume you mean "an
> > > error", not "a failure".
> > 
> > I'll do this thoroughly, so you can see how it goes.
> > 
> > Let 
> > 
> >    p = probability of a detectable error occurring on a disk in a unit time
> >    p'= probability of an undetectable error occurring on a disk in a unit time
> > 
> > Then the probability of an error occurring UNdetected on an n-disk raid
> > array is
> > 
> >        (n-1)p + np'
> >   
> > and on a 1 disk system (a 1-disk raid array :) it is
> > 
> >        p'
> > 
> > OK? (hey, I'm a mathematician, it's obvious to me).
> 
> It may be obvious, but it is also wrong.

No, it's quite correct.

> But then probability is, I
> think, the branch of mathematics that has the highest ratio of people

Neil Brown | 4 Jan 03:07

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tuesday January 4, ptb <at> lab.it.uc3m.es wrote:
> > > Then the probability of an error occurring UNdetected on an n-disk raid
> > > array is
> > > 
> > >        (n-1)p + np'
> > >   
> 
> > The probability of an event occurring lies between 0 and 1 inclusive.
> > You have given a formula for a probability which could clearly evaluate
> > to a number greater than 1.  So it must be wrong.
> 
> The hypothesis here is that p is vanishingly small.  I.e. this is a Poisson
> distribution - the analysis assumes that only one event can occur per
> unit time.  Take the unit to be one second if you like.  Does that make
> it true enough for you?

Sorry, I didn't see any such hypothesis stated and I don't like to
assUme.

So what you are really saying is that:
  for sufficiently small p and p' (i.e. p-squared terms can be ignored)
  the probability of an error occurring undetected approximates
     (n-1)p + np'

this may be true, but I'm still having trouble understanding what your
p and p' really mean.
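
To illustrate the small-probability approximation just described: if each of
k independent opportunities for an error has probability q, the exact
probability that at least one occurs is 1 - (1 - q)^k.  That value always
lies between 0 and 1, and it is close to k*q only when k*q is small.  The
sketch below compares the two.  Here k and q are generic placeholders, not
the p and p' of the thread, and the function names are invented for the
example.

```python
def exact_at_least_one(q, k):
    """Exact probability that at least one of k independent events,
    each with probability q, occurs."""
    return 1.0 - (1.0 - q) ** k


def linear_approximation(q, k):
    """First-order approximation k*q, which drops all q-squared and
    higher terms and can exceed 1 when q is not small."""
    return k * q


if __name__ == "__main__":
    for q in (1e-6, 1e-3, 0.1, 0.5):
        for k in (2, 4, 10):
            print(f"q={q:<8} k={k:<3} "
                  f"exact={exact_at_least_one(q, k):.9f} "
                  f"linear={linear_approximation(q, k):.9f}")
```

For very small q the two columns agree to many digits.  For q = 0.5 the
linear form climbs past 1, which is the objection raised above.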

> > You have also been very sloppy in your language, or your definitions.
> > What do you mean by a "detectable error occurring"? 
> 

Peter T. Breuer | 4 Jan 10:40

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Neil Brown <neilb <at> cse.unsw.edu.au> wrote:
> On Tuesday January 4, ptb <at> lab.it.uc3m.es wrote:
> > > > Then the probability of an error occurring UNdetected on an n-disk raid
> > > > array is
> > > > 
> > > >        (n-1)p + np'
> > > >   
> > 
> > > The probability of an event occurring lies between 0 and 1 inclusive.
> > > You have given a formula for a probability which could clearly evaluate
> > > to a number greater than 1.  So it must be wrong.
> > 
> > The hypothesis here is that p is vanishingly small.  I.e. this is a Poisson
> > distribution - the analysis assumes that only one event can occur per
> > unit time.  Take the unit to be one second if you like.  Does that make
> > it true enough for you?
> 
> Sorry, I didn't see any such hypothesis stated and I don't like to
> assUme.

You don't have to. It is conventional. It doesn't need saying.

> So what you are really saying is that:
>   for sufficiently small p and p' (i.e. p-squared terms can be ignored)
>   the probability of an error occurring undetected approximates
>      (n-1)p + np'
> 
> this may be true, but I'm still having trouble understanding what your
> p and p' really mean.


David Greaves | 4 Jan 15:03

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Peter T. Breuer wrote:

>Then I guess you have helped clarify to yourself what type of errors
>falls in which class! Apparently errors caused by drive failure fall in
>the class of "indetectible error" for you!
>
>But in any case, you are wrong, because it is quite possible for an
>error to spontaneously arise on a disk which WOULD be detected by fsck.
>What does fsck detect normally if it is not that! 
>  
>
It checks the filesystem metadata - not the data held in the filesystem.

David


Peter T. Breuer | 4 Jan 15:07

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

David Greaves <david <at> dgreaves.com> wrote:
> Peter T. Breuer wrote:
> 
> >Then I guess you have helped clarify to yourself what type of errors
> >falls in which class! Apparently errors caused by drive failure fall in
> >the class of "indetectible error" for you!
> >
> >But in any case, you are wrong, because it is quite possible for an
> >error to spontaneously arise on a disk which WOULD be detected by fsck.
> >What does fsck detect normally if it is not that! 
> >
> It checks the filesystem metadata - not the data held in the filesystem.

So you should deduce that your test (if fsck be it) won't detect errors
in the file data, but only errors in the filesystem metadata.

So? Is there some problem here?

(yes, and one could add a md5sum per block to a fs, but I don't know a
fs that does).
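
The per-block checksum idea in the parenthesis can be sketched in a few
lines.  This is a toy in-memory illustration only, not how any filesystem or
the md driver works: the block size, the class name ChecksummedStore and its
methods are all invented for the example, and MD5 is used only because the
text mentions md5sum.

```python
import hashlib

BLOCK_SIZE = 4096  # bytes; an arbitrary choice for this sketch


class ChecksummedStore:
    """Toy block store that keeps an MD5 digest alongside every block."""

    def __init__(self):
        self.blocks = {}  # block number -> block contents (bytes)
        self.sums = {}    # block number -> hex digest recorded at write time

    def write(self, number, data):
        """Store a block and record its checksum."""
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block long")
        self.blocks[number] = bytes(data)
        self.sums[number] = hashlib.md5(data).hexdigest()

    def read(self, number):
        """Return a block, raising IOError if it no longer matches the
        checksum recorded when it was written."""
        data = self.blocks[number]
        if hashlib.md5(data).hexdigest() != self.sums[number]:
            raise IOError(f"checksum mismatch in block {number}")
        return data

    def corrupt_bit(self, number, bit=0):
        """Simulate silent corruption: flip one bit of the stored data
        without updating the recorded checksum."""
        raw = bytearray(self.blocks[number])
        raw[bit // 8] ^= 1 << (bit % 8)
        self.blocks[number] = bytes(raw)


if __name__ == "__main__":
    store = ChecksummedStore()
    store.write(0, b"\x00" * BLOCK_SIZE)
    store.corrupt_bit(0, bit=5)
    try:
        store.read(0)
    except IOError as err:
        print("detected:", err)
```

A check like this turns a silent change in file data into a detected error
at read time, which is exactly the class of error fsck does not look for.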

Peter


David Greaves | 4 Jan 15:43

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Peter,

Can I make a serious attempt to sum up your argument as:

Disks suffer from random *detectable* corruption events on (or after) 
write (eg media or transient cache being hit by a cosmic ray, cpu 
fluctuations during write, e/m or thermal variations).

Disks suffer from random *undetectable* corruption events on (or after) 
write (eg media or transient cache being hit by a cosmic ray, cpu 
fluctuations during write, e/m or thermal variations)

Raid disks have more 'corruption-susceptible' data capacity per useable 
data capacity and so the probability of a corruption event is higher. 
Since a detectable error is detected it can be retried and dealt with.

This leaves the fact that essentially, raid disks are less reliable than 
non-raid disks wrt undetectable corruption events.

However, we need to carry out risk analysis to decide if the increase in 
susceptibility to certain kinds of corruption (cosmic rays) is 
acceptable given the reduction in susceptibility to other kinds (bearing 
or head failure).

David

tentative definitions:
detectable = noticed by normal OS I/O. ie CRC sector failure etc
undetectable = noticed by special analysis (fsck, md5sum verification etc)


Peter T. Breuer | 4 Jan 16:12

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

David Greaves <david <at> dgreaves.com> wrote:
> Disks suffer from random *detectable* corruption events on (or after) 
> write (eg media or transient cache being hit by a cosmic ray, cpu 
> fluctuations during write, e/m or thermal variations).

Well, and also people hitting the off switch (or the power going off)
during a write sequence to a mirror, but after one of a pair of mirror
writes has gone to disk, but before the other of the pair has.

(If you want to say "but the fs is journalled", then consider what if 
the write is to the journal ...).

> Disks suffer from random *undetectable* corruption events on (or after) 
> write (eg media or transient cache being hit by a cosmic ray, cpu 
> fluctuations during write, e/m or thermal variations)

Yes. This is not different from what I have said. I didn't have any
particular scenario in mind.

But I see that you are correct in pointing out that some error
possibilities are _created_ by the presence of raid that would not
ordinarily be present. So there is some scaling with the
number of disks that needs clarification.

> Raid disks have more 'corruption-susceptible' data capacity per useable 
> data capacity and so the probability of a corruption event is higher. 

Well, the probability is larger no matter what the nature of the event.
In principle, and very approximately, there are simply more places (and
times!) for it to happen TO.

David Greaves | 4 Jan 17:54

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Peter T. Breuer wrote:

>David Greaves <david <at> dgreaves.com> wrote:
>  
>
>>Disks suffer from random *detectable* corruption events on (or after) 
>>write (eg media or transient cache being hit by a cosmic ray, cpu 
>>fluctuations during write, e/m or thermal variations).
>>    
>>
>
>Well, and also people hitting the off switch (or the power going off)
>during a write sequence to a mirror, but after one of a pair of mirror
>writes has gone to disk, but before the other of the pair has.
>
>(If you want to say "but the fs is journalled", then consider what if 
>the write is to the journal ...).
>  
>
Hmm.
In neither case would a journalling filesystem be corrupted.

The md driver (somehow) gets to decide which half of the mirror is 'best'.

If the journal uses the fully written half of the mirror then it's replayed.
If the journal uses the partially written half of the mirror then it's 
not replayed.
It's just the same as powering off a normal non-resilient device.

(Is your point here back to the failure to guarantee write ordering? I 

Peter T. Breuer | 4 Jan 18:42

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

David Greaves <david <at> dgreaves.com> wrote:
> >(If you want to say "but the fs is journalled", then consider what if 
> >the write is to the journal ...).

> Hmm.
> In neither case would a journalling filesystem be corrupted.

A journalled file system is always _consistent_. That does not mean it
is correct!

> The md driver (somehow) gets to decide which half of the mirror is 'best'.

Yep - and which is correct?

> If the journal uses the fully written half of the mirror then it's replayed.
> If the journal uses the partially written half of the mirror then it's 
> not replayed.

Which is correct?

> It's just the same as powering off a normal non-resilient device.

Well, I see what you mean - yes, it is the same in terms of the total
event space.  It's just that with a single disk, the possible outcomes
are randomized only over time, as you repeat the experiment.  Here you
have randomization of outcomes over space as well, depending on which
disk you test (or how you interleave the test across the disks).

And the question remains - which outcome is correct?


David Greaves | 4 Jan 20:12

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Peter T. Breuer wrote:

>A journalled file system is always _consistent_. That does not mean it
>is correct!
>  
>
To my knowledge no computers have the philosophical wherewithal to 
provide that service ;)

If one is rude enough to stab a journalling filesystem in the back as it 
tries to save your data it promises only to be consistent when it is 
revived - it won't provide application correctness.

I think we agree on that.

>>The md driver (somehow) gets to decide which half of the mirror is 'best'.
>>    
>>
>Yep - and which is correct?
>  
>
Both are 'correct' - they simply represent different points in the 
series of system calls made before the power went.
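
The point can be made concrete with a toy model of a two-way mirror that
loses power between the two member writes.  This is only an illustration
under simplifying assumptions (a single block, whole-block writes, resync by
copying one member over the other).  The Mirror class and its methods are
invented here and do not describe the md driver's actual resync policy.

```python
class Mirror:
    """Toy two-way mirror holding a single block on each member."""

    def __init__(self, initial):
        self.disks = [initial, initial]

    def write(self, data, crash_after=None):
        """Write data to each member in turn.

        crash_after=k simulates power loss once k member writes have
        completed.  Returns True only if every member was written, i.e.
        only then would the write be acknowledged to the caller.
        """
        for i in range(len(self.disks)):
            if crash_after is not None and i >= crash_after:
                return False
            self.disks[i] = data
        return True

    def resync_from(self, source):
        """After restart, copy the chosen member over all the others."""
        self.disks = [self.disks[source]] * len(self.disks)

    def consistent(self):
        """True when every member holds the same data."""
        return all(d == self.disks[0] for d in self.disks)


if __name__ == "__main__":
    for source in (0, 1):
        m = Mirror("old")
        acked = m.write("new", crash_after=1)  # power fails after one write
        m.resync_from(source)
        print(f"resync from disk {source}: acked={acked} "
              f"disks={m.disks} consistent={m.consistent()}")
```

Whichever member is chosen, the array ends up consistent, holding either the
state before the write or the state after it.  In both cases the interrupted
write was never acknowledged, so the caller had no promise that it had
reached the disk.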

>Which is correct?
>  
>
<grumble> ditto

>And the question remains - which outcome is correct?

Michael Tokarev | 4 Jan 12:57

Which drive gets read in case of inconsistency? [was: ext3 journal on software raid etc]

Peter T. Breuer wrote:
> Neil Brown <neilb <at> cse.unsw.edu.au> wrote:
[]
>>If there is a system crash before correct, consistent data is written,
>>then on restart, disk B will not be read at all until disk A has been
> 
> Why do you think so? I know of no mechanism in RAID that records to
> which of the two disks paired data has been written and to which it has
> not!
> 
> Please clarify - this is important. If you are thinking of the "event
> count" that is stamped on the superblocks, that is only updated from
> time to time as far as I know! Can you please specify (for my
> curiosity) exactly when it is updated? That would be useful to know.

Yes, this is still the darkest corner of the whole raid stuff for me.
I just looked at the code again, re-read it several times, but the
code is a bit.. large to understand in a relatively short time.  This
very question has bothered me for quite some time now.  How does the md code
"know" which drive has "more recent" data on it in case of a system crash
(power loss, whatever) after one drive has completed the write but before
another has?  The "event counter" isn't updated on every write
(it'd be very expensive in both time and disk health -- too much
seeking and too many writes to the single block where the superblock
is located).

For me, and I'm just thinking how it can be done, the only possible
solution in this case is to choose a "random" drive and declare it as
"up-to-date" -- it will not necessarily be really up-to-date.  Or,
maybe, write to the "first" drive first and to the "second" next, and assume

Peter T. Breuer | 4 Jan 13:44

Re: Which drive gets read in case of inconsistency? [was: ext3 journal on software raid etc]

Michael Tokarev <mjt <at> tls.msk.ru> wrote:
> How does it all fit together?
> Which drive will be declared "fresh"?

I'd like details of the event count too. No, I haven't been able to
figure it out from the code either. In this case "ask an author" is
indicated. :).

> How about several (>2) drives in raid1 array?
> How about data written without a concept of "commits", if "wrong"
> drive will be chosen -- will it contain some old data in it, while
> another drive contained new data but was declared "non fresh" at
> reconstruction?

To answer a question of yours which I seem to have missed quoting here,
standard software raid only acks the user (does end_request) when ALL the
i/os corresponding to mirrored requests have finished.

This is precisely the condition Stephen wants for ext3, and it is
satisfied.  However, the last time I asked Hans Reiser what his
conditions were for reiserfs, he told me that he required write order to
be preserved, which is a different condition.  It's not precisely
stronger as it is, but it becomes precisely stronger than Stephen's when
you add in some extra "normal" hypotheses about the rest of the universe
it lives in.

However, the media underneath raid is free to lie.  In many respects, it
is likely to lie!  Hardware disks, for example, ack back the write when
they have buffered it, not when they have written it (and manufacturers
claim there is always enough capacitative energy in the disk

Maarten | 4 Jan 15:22

Re: Which drive gets read in case of inconsistency? [was: ext3 journal on software raid etc]

On Tuesday 04 January 2005 13:44, Peter T. Breuer wrote:
> Michael Tokarev <mjt <at> tls.msk.ru> wrote:

Hm, Peter, you did it again.  At the very end of an admittedly interesting 
discussion you come out with the baseless assumptions and conclusions.
Just when I was prepared to give you the benefit of the doubt...

>
> Anyway, strictly speaking, the answer to your question is "yes". It
> does not decrease the probability, and therefore it increases it. The
> question is by how much, and that is unanswerable.

You continue to amaze me. If it does not decrease, it automatically 
increases ??  What happened to the "stays equal" possibility ?
Do you exclusively use ">" and "<" instead of "=" in your math too ?  

Maybe the increase is zero. Oh wait, it could even be negative, right ? Just 
as with probability. So it possibly has an increase of, say, -0.5 ?
(see how easy it is to confuse people ?)

Maarten


Peter T. Breuer | 4 Jan 15:56

Re: Which drive gets read in case of inconsistency? [was: ext3 journal on software raid etc]

Maarten <maarten <at> ultratux.net> wrote:
> On Tuesday 04 January 2005 13:44, Peter T. Breuer wrote:
> > Michael Tokarev <mjt <at> tls.msk.ru> wrote:
> 
> Hm, Peter, you did it again.  At the very end of an admittedly interesting 
> discussion you come out with the baseless assumptions and conclusions.
> Just when I was prepared to give you the benefit of the doubt...

:-(.

> > Anyway, strictly speaking, the answer to your question is "yes". It
> > does not decrease the probability, and therefore it increases it. The
> > question is by how much, and that is unanswerable.
> 
> You continue to amaze me. If it does not decrease, it automatically 
> increases ?? 

Yes.

> What happened to the "stays equal" possibility ?

It's included in the "automatically increases". But anyway, it's
negligible.  Any particular precise outcome (such as "stays precisely the
same") is negligibly likely in a continuous universe.  Probability
distributions are only stated to "almost everywhere" equivalence, since
they are fundamentally just measures on the universe, so we can't even
talk about "=", properly speaking.

> Do you exclusively use ">" and "<" instead of "=" in your math too ?  


Re: Which drive gets read in case of inconsistency? [was: ext3 journal on software raid etc]

> Yes, this is still the darkest corner of the whole raid stuff for me.
> I just looked at the code again, re-read it several times, but the
> code is a bit.. large to understand in a relatively short time.  This
> very question has bothered me for quite some time now.  How does the md code
> "know" which drive has "more recent" data on it in case of a system crash
> (power loss, whatever) after one drive has completed the write but before
> another has?  The "event counter" isn't updated on every write
> (it'd be very expensive in both time and disk health -- too much
> seeking and too many writes to the single block where the superblock
> is located).
>
> For me, and I'm just thinking how it can be done, the only possible
> solution in this case is to choose a "random" drive and declare it as
> "up-to-date" -- it will not necessarily be really up-to-date.  Or,
> maybe, write to the "first" drive first and to the "second" next, and assume
> the first drive has the data written before the second (no guarantee here
> because of reordering, differences in drive speed etc, but it is --
> sort of -- a valid assumption).

Funny, I've been thinking a lot about this lately, because I use RAID in
a strange setup with failover (admittedly a stupid setup, I did not know any
better). I have only been looking at scenarios for RAID-1. I can't
even begin to think about what might happen with RAID-5. But as the RAID
howto says, RAID does not protect you from power failures and the like,
and you should have a UPS.

The md layer will not acknowledge a write before it has been written to
all disks. I have not checked this, but the raid developers are smart
people, and otherwise I would lose my sanity. IMHO this means that it
doesn't really matter which disk is chosen as the one to synchronize from

Ewan Grantham | 4 Jan 03:16

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

I are confused...

Which perhaps should be a lesson to a slightly knowledgeable user not
to read a thread like this.

But having given myself a headache trying to figure this all out, I
guess I'll just go ahead and ask directly.

I've set up a RAID-5 array using two internal 250 Gig HDs and two
external 250 Gig HDs through a USB-2 interface. Each of the externals
is on its own card, and the internals are on separate IDE channels.

I "thought" I was doing a good thing by doing all of this and then
setting them up using an ext3 filesystem.

From the reading on here I'm not clear if I should have specified
something besides whatever ext3 does by default when you set it up,
and if so if it's something I can still do without having to redo
everything. Something I'd rather not do to be honest.

Thanks in advance,
Ewan
---
http://a1.blogspot.com - commentary since 2002


Neil Brown | 4 Jan 03:22

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Monday January 3, ewan.grantham <at> gmail.com wrote:
> I are confused...
> 
> Which perhaps should be a lesson to a slightly knowledgeable user not
> to read a thread like this.
> 
> But having given myself a headache trying to figure this all out, I
> guess I'll just go ahead and ask directly.
> 
> I've set up a RAID-5 array using two internal 250 Gig HDs and two
> external 250 Gig HDs through a USB-2 interface. Each of the externals
> is on its own card, and the internals are on separate IDE channels.
> 
> I "thought" I was doing a good thing by doing all of this and then
> setting them up using an ext3 filesystem.

Sounds like a perfectly fine setup (providing always that external
cables are safe from stray feet etc).

No need to change anything.

NeilBrown

> 
> From the reading on here I'm not clear if I should have specified
> something besides whatever ext3 does by default when you set it up,
> and if so if it's something I can still do without having to redo
> everything. Something I'd rather not do to be honest.
> 
> Thanks in advance,

Andy Smith | 4 Jan 03:41

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote:
> On Monday January 3, ewan.grantham <at> gmail.com wrote:
> > I've set up a RAID-5 array using two internal 250 Gig HDs and two
> > external 250 Gig HDs through a USB-2 interface. Each of the externals
> > is on its own card, and the internals are on separate IDE channels.
> > 
> > I "thought" I was doing a good thing by doing all of this and then
> > setting them up using an ext3 filesystem.
> 
> Sounds like a perfectly fine setup (providing always that external
> cables are safe from stray feet etc).
> 
> No need to change anything.

Except that Peter says that the ext3 journals should be on separate
non-mirrored devices and the reason this is not mentioned in any
documentation (md / ext3) is that everyone sees it as obvious.
Whether it is true or not it's clear to me that it's not obvious to
everyone.

Peter T. Breuer | 4 Jan 10:46

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Andy Smith <andy <at> strugglers.net> wrote:
> On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote:
> > On Monday January 3, ewan.grantham <at> gmail.com wrote:
> > > I've set up a RAID-5 array using two internal 250 Gig HDs and two
> > > external 250 Gig HDs through a USB-2 interface. Each of the externals
> > > is on its own card, and the internals are on separate IDE channels.
> > > 
> > > I "thought" I was doing a good thing by doing all of this and then
> > > setting them up using an ext3 filesystem.
> > 
> > Sounds like a perfectly fine setup (providing always that external
> > cables are safe from stray feet etc).
> > 
> > No need to change anything.
> 
> Except that Peter says that the ext3 journals should be on separate
> non-mirrored devices and the reason this is not mentioned in any
> documentation (md / ext3) is that everyone sees it as obvious.

No, I don't say the "SHOULD BE" is obvious.  I say the issues are
obvious.  The "should be" is up to you to decide, based on the obvious
issues involved :-).

> Whether it is true or not it's clear to me that it's not obvious to
> everyone.

It's not obvious to anyone, where by "it" I mean whether or not you
"should" put a journal on the same raid device.  There are pros and

maarten | 4 Jan 20:02

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tuesday 04 January 2005 10:46, Peter T. Breuer wrote:
> Andy Smith <andy <at> strugglers.net> wrote:
> > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote:
> > > On Monday January 3, ewan.grantham <at> gmail.com wrote:

> > Except that Peter says that the ext3 journals should be on separate
> > non-mirrored devices and the reason this is not mentioned in any
> > documentation (md / ext3) is that everyone sees it as obvious.

>
> It's not obvious to anyone, where by "it" I mean whether or not you
> "should" put a journal on the same raid device.  There are pros and
> cons.  I would not.  My reasoning is that I don't want data in the
> journal to be subject to the same kinds of creeping invisible corruption
> on reboot and resync that raid is subject to.  But you can achieve that

[ I'll attempt to address all issues that have come up in this entire thread 
until now here...  please bear with me. ]

@Peter:
I still need you to clarify what can cause such creeping corruption.
There are several possible cases:

1) A bit flipped on the platter or the drive firmware had a 'thinko'.

This will be signalled by the CRC / ECC on the drive.  You can't flip a bit 
unnoticed.  Or in fact, bits get 'flipped' constantly, therefore the highly 

Peter T. Breuer | 4 Jan 22:08

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

maarten <maarten <at> ultratux.net> wrote:
> @Peter:
> I still need you to clarify what can cause such creeping corruption.

The classical causes in raid systems are 

  1) that data is only partially written to the array on system crash
     and on recovery the inappropriate choice of alternate datasets
     from the redundant possibles is propagated.

  2) corruption occurs unnoticed in a part of the redundant data that
     is not currently in use, but a disk in the array then drops out,
     bringing the data with the error into use. On recovery of the
     failed disk, the error data is then propagated over the correct 
     data.

Plus the usual causes. And anything else I can't think of just now.

> 1) A bit flipped on the platter or the drive firmware had a 'thinko'.
> 
> This will be signalled by the CRC / ECC on the drive.

Bits flip on our client disks all the time :(.  It would be nice if it
were the case that they didn't, but it isn't.  Mind you, I don't know
precisely HOW.  I suppose more bits than the CRC can recover change, or
something, and the CRC coincides.  Anyway, it happens.  Probably
cpu-mediated.  Sorry but I haven't kept any recent logs of 1-bit errors in
files on readonly file systems for you to look at.

> You can't flip a bit 

maarten | 5 Jan 01:38

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)


[ Spoiler: this text may or may not contain harsh language and/or insulting ] 
[  remarks, specifically in the middle part. The reader is advised to exert ] 
[  some mild caution here and there.  Sorry for that but my patience can ] 
[  and does really reach its limits, too.     -   Maarten                ]

On Tuesday 04 January 2005 22:08, Peter T. Breuer wrote:
> maarten <maarten <at> ultratux.net> wrote:
> > @Peter:
> > I still need you to clarify what can cause such creeping corruption.
>
>   1) that data is only partially written to the array on system crash
>      and on recovery the inappropriate choice of alternate datasets
>      from the redundant possibles is propagated.
>
>   2) corruption occurs unnoticed in a part of the redundant data that
>      is not currently in use, but a disk in the array then drops out,
>      bringing the data with the error into use. On recovery of the
>      failed disk, the error data is then propagated over the correct
>      data.

Congrats, you just described the _symptoms_.  We all know the alleged 
symptoms, if only for you repeating them over and over and over...
My question was HOW they [can] occur.   Disks don't go around randomly 
changing bits just because they dislike you, you know.

> > 1) A bit flipped on the platter or the drive firmware had a 'thinko'.
> >
> > This will be signalled by the CRC / ECC on the drive.
>

Neil Brown | 4 Jan 23:21

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tuesday January 4, ptb <at> lab.it.uc3m.es wrote:
> 
> Uh, that's not at issue. The question is whether it is CORRECT, not
> whether it is consistent.
> 

What exactly do you mean by "correct"?

If I have a program that writes some data:
   write(fd, buffer, 8192);
and then makes sure the data is on disk:
   fsync(fd);

but the computer crashes sometime between when the write call started
and the fsync call ended, then I reboot and read back that block of
data from disc, what is the "CORRECT" value that I should read back?

The answer is, of course, that there is no one "correct" value.
It would be correct to find the data that I had tried to write.  It
would also be correct to find the data that had been in the file
before I started the write.  If the size of the write is larger than
the blocksize of the filesystem, it would also be correct to find a
mixture of the old data and the new data.

Exactly the same is true at every level of the storage stack.  There
is a point in time where a write request starts, and a point in time
where the request is known to complete, and between those two times
the content of the affected area of storage is undefined, and could
have any of several (probably 2) "correct" values.
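
A minimal sketch of the point above, not taken from the original post: it
assumes each filesystem block is written atomically and uses a hypothetical
4096-byte block size (the names `acceptable_states`, `old` and `new` are
illustrative only).  Under those assumptions it enumerates every image that
could legitimately be read back after a crash between the write() and the
return of fsync().

```python
from itertools import product


def acceptable_states(old, new, block_size):
    """Return every image a crash-interrupted write may leave behind.

    Assumption: each block lands atomically, so after the crash every
    block holds either its old or its new contents, independently.
    """
    if len(old) != len(new) or len(old) % block_size:
        raise ValueError("old/new must be equal-sized multiples of block_size")
    offsets = range(0, len(old), block_size)
    states = set()
    # For each block, pick the old or the new image as its source.
    for sources in product((old, new), repeat=len(offsets)):
        states.add(b"".join(src[off:off + block_size]
                            for src, off in zip(sources, offsets)))
    return states


old = b"A" * 8192          # file contents before write(fd, buffer, 8192)
new = b"B" * 8192          # the buffer being written
states = acceptable_states(old, new, block_size=4096)
print(len(states))         # all-old, all-new, and the two mixtures
```

With an 8192-byte write spanning two such blocks there are four candidates,
and none of them is more "correct" than the others until fsync() has
returned.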


Peter T. Breuer | 5 Jan 01:08

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Neil Brown <neilb <at> cse.unsw.edu.au> wrote:
> On Tuesday January 4, ptb <at> lab.it.uc3m.es wrote:
> > 
> > Uh, that's not at issue. The question is whether it is CORRECT, not
> > whether it is consistent.
> > 
> 
> What exactly do you mean by "correct"?

Whatever you mean by it - I don't have a preference myself, though I
might have an opinion in specific situations.  It means whatever you
consider and it is up to you to make your own definition for yourself,
to your own satisfaction in particular circumstances, if you feel you
need a constructive definition in other terms (and I don't!).  I merely
gave the concept a name for you.

> If I have a program that writes some data:
>    write(fd, buffer, 8192);
> and then makes sure the data is on disk:
>    fsync(fd);
> 
> but the computer crashes sometime between when the write call started
> and the fsync call ended, then I reboot and read back that block of
> data from disc, what is the "CORRECT" value that I should read back?

I would say that if nothing on your machine or elsewhere "noticed" you
doing the write of any part of the block, then the correct answer is
"the block as it was before you wrote any of it".  However, if nothing
cares at all one way or the other, then it could be anything, what you
wrote, what you got, or even any old nonsense.

Neil Brown | 4 Jan 23:29

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tuesday January 4, ptb <at> lab.it.uc3m.es wrote:
> Bits flip on our client disks all the time :(.  

You seem to be alone in reporting this.  I certainly have never
experienced anything quite like what you seem to be reporting.

Certainly there are reports of flipped bits in memory.  If you have
non-ecc memory, then this is a real risk and when it happens you
replace the memory.  Usually it happens with a sufficiently high
frequency that the computer is effectively unusable.

But bits being flipped on disk, without the drive reporting an error,
and without the filesystem very quickly becoming unusable, is (except
for your report) unheard of.

md/raid would definitely not help that sort of situation at all.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Peter T. Breuer | 5 Jan 01:19

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Neil Brown <neilb <at> cse.unsw.edu.au> wrote:
> On Tuesday January 4, ptb <at> lab.it.uc3m.es wrote:
> > Bits flip on our client disks all the time :(.  
> 
> You seem to be alone in reporting this.  I certainly have never
> experienced anything quite like what you seem to be reporting.

I don't feel the need to prove it to you via actual evidence.  You
already know of mechanisms which produce such an effect:

> Certainly there are reports of flipped bits in memory. 

 .. and that is all the same to your code when it comes to resyncing.
 You don't care whether the change is real or produced in the cpu, on the
bus, or wherever. It still is what you will observe and copy.

> If you have
> non-ecc memory, then this is a real risk and when it happens you
> replace the memory.

Sure.

> Usually it happens with a sufficiently high
> frequency that the computer is effectively unusable.

Well, there are many computers that remain usable. When I see bit flips
the first thing I request the techs to do is check the memory and keep
on checking it until they find a fault. I also ask them to check the
fans, clean out dust and so on.


Jure Pečar | 5 Jan 02:19

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Wed, 5 Jan 2005 01:19:34 +0100
ptb <at> lab.it.uc3m.es (Peter T. Breuer) wrote:

> Neil Brown <neilb <at> cse.unsw.edu.au> wrote:
> > On Tuesday January 4, ptb <at> lab.it.uc3m.es wrote:
> > > Bits flip on our client disks all the time :(.  
> > 
> > You seem to be alone in reporting this.  I certainly have never
> > experienced anything quite like what you seem to be reporting.
> 
> I don't feel the need to prove it to you via actual evidence.  You
> already know of mechanisms which produce such an effect:
> 
> > Certainly there are reports of flipped bits in memory. 
> 
>  .. and that is all the same to your code when it comes to resyncing.
>  You don't care whether the change is real or produced in the cpu, on the
> bus, or wherever. It still is what you will observe and copy.

You work with PC servers, so live with it. 

If you want to have the right to complain about bits being flipped in
hardware randomly, go get a job with IBM mainframes or something. 

And since you like a theoretical approach to problems, I might have a suggestion
for you: pick a linux kernel subsystem of your choice, think of it as a
state machine, roll out all the states and then check which states are not
covered by the code.
I think that will keep you busy and the result might have some value for the
community. 

Peter T. Breuer | 5 Jan 03:29

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Jure Pečar <pegasus <at> nerv.eu.org> wrote:
> And since you like a theoretical approach to problems, I might have a suggestion
> for you: pick a linux kernel subsystem of your choice, think of it as a
> state machine, roll out all the states and then check which states are not
> covered by the code.

I have no idea what you mean (I suspect you are asking about reachable
states). If you want a static analyzer for the linux kernel written by
me, you can try

  ftp://oboe.it.uc3m.es/pub/Programs/c-1.2.2.tgz

> I think that will keep you busy and the result might have some value for the
> community. 

If you wish to sneer about something, please try and put some technical
expertise and effort into it.

Peter


Brad Campbell | 4 Jan 23:02

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Peter T. Breuer wrote:
> maarten <maarten <at> ultratux.net> wrote:

>>You can't flip a bit 
>>unnoticed. 
> 
> 
> Not by me, but then I run md5sum every day. Of course, there is a
> question if the bit changed on disk, in ram, or in the cpu's fevered
> miscalculations. I've seen all of those. One can tell which after a bit
> more detective work.
> 

I'm wondering how difficult it may be for you to extend your md5sum script to diff the pair of files 
and actually determine the extent of the corruption. bit/byte/word/.../sector/.../stripe wise?

I have 2 RAID-5 arrays here, a 3x233GiB and a 10x233GiB, and when I install new data on the drives 
I add the md5sum of that data to an existing database stored on another machine. This gets compared 
against the data on the arrays weekly and I have yet to see a silent corruption in 18 months.
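
A minimal sketch of that kind of check, not Brad's actual setup: the helper
names (`md5_of`, `build_manifest`, `verify`) and the in-memory dictionary
standing in for the database on the other machine are assumptions made for
illustration.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its md5 digest."""
    root = Path(root)
    return {str(p.relative_to(root)): md5_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def verify(root, manifest):
    """Return the recorded files that are missing or whose digest changed."""
    current = build_manifest(root)
    return sorted(name for name, digest in manifest.items()
                  if current.get(name) != digest)
```

The manifest would be built once when the data is installed and kept off the
array; a weekly job then calls `verify()` and reports any names it returns.
Note that this only covers blocks that belong to files, not free space or
filesystem metadata.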

I do occasionally remove/re-add a drive to each array, which causes a full resync of the array and 
should show up any parity inconsistency by a faulty fsck or md5sum. It has not as yet.

Honestly, in my years running Linux and multiple drive arrays I have never experienced errors such 
as you are getting.

Oh.. and both my arrays are running ext3 with an internal journal (as are all my other partitions on 
all my other machines).

Perhaps I'm lucky?

Peter T. Breuer | 5 Jan 00:20

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Brad Campbell <brad <at> wasp.net.au> wrote:
> I'm wondering how difficult it may be for you to extend your md5sum script to diff the pair of files 
> and actually determine the extent of the corruption. bit/byte/word/.../sector/.../stripe wise?

Not much.  But I don't bother.  It's a majority vote amongst all the
identical machines involved and the loser gets rewritten. The script
identifies a majority group and a minority group. If the minority is 1
it rewrites it without question.  If the minority group is bigger it
refers the notice to me.
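
The decision rule described above can be sketched as follows.  This is an
illustration only, not Peter's script: the function name `vote`, the host
names, and the handling of the "no strict majority" case (referred to a
human) are assumptions.

```python
from collections import Counter


def vote(checksums):
    """Decide what to do with one file, given {host: digest}.

    Returns (action, minority_hosts) where action is
    "ok", "rewrite" (lone dissenter) or "refer" (needs a human).
    """
    if not checksums:
        raise ValueError("no checksums supplied")
    winner, winner_count = Counter(checksums.values()).most_common(1)[0]
    minority = sorted(h for h, d in checksums.items() if d != winner)
    if not minority:
        return "ok", []
    # Assumption: without a strict majority nobody is trusted automatically.
    if winner_count * 2 <= len(checksums):
        return "refer", minority
    if len(minority) == 1:
        return "rewrite", minority
    return "refer", minority


print(vote({"host1": "aa", "host2": "aa", "host3": "bb"}))
```

A vote of this kind says which copy is the odd one out; it does not by
itself say whether the difference arose on disk, in memory or in the cpu.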

> I have 2 RAID-5 arrays here, a 3x233GiB and a 10x233GiB, and when I install new data on the drives 
> I add the md5sum of that data to an existing database stored on another machine. This gets compared 
> against the data on the arrays weekly and I have yet to see a silent corruption in 18 months.

Looking at the lists of pending repairs over xmas, I see a pile that
will have to be investigated. I am about to do it, since you reminded me
to look at these.

> I do occasionally remove/re-add a drive to each array, which causes a full resync of the array and 
> should show up any parity inconsistency by a faulty fsck or md5sum. It has not as yet.

No - it should not show it. 

> Honestly, in my years running Linux and multiple drive arrays I have never experienced errors such 
> as you are getting.

Then you are not trying to manage hundreds of clients at a time.

> Oh.. and both my arrays are running ext3 with an internal journal (as are all my other partitions on 
> all my other machines).

Brad Campbell | 5 Jan 06:44

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Peter T. Breuer wrote:
> Brad Campbell <brad <at> wasp.net.au> wrote:
> 
>>I do occasionally remove/re-add a drive to each array, which causes a full resync of the array and 
>>should show up any parity inconsistency by a faulty fsck or md5sum. It has not as yet.
> 
> 
> No - it should not show it. 
> 

If a bit has flipped on a parity stripe, then the parity is inconsistent. When I pop out a disk 
and put it back in, the array is going to be written from parity data that is not quite right. (The 
problem I believe you were talking about where you have two identical disks and one is inconsistent, 
which one do you read from? is similar). And thus the reconstructed array is going to have different 
contents to the array before I failed the disk.

Therefore it should show the error. No?

Brad

Peter T. Breuer | 5 Jan 10:00

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Brad Campbell <brad <at> wasp.net.au> wrote:
> If a bit has flipped on a parity stripe, then the parity is inconsistent. When I pop out a disk 
> and put it back in, the array is going to be written from parity data that is not quite right. (The 
> problem I believe you were talking about where you have two identical disks and one is inconsistent, 
> which one do you read from? is similar). And thus the reconstructed array is going to have different 
> contents to the array before I failed the disk.
> 
> Therefore it should show the error. No?

It will not detect it as an error, if that is what you mean.

Peter


Brad Campbell | 5 Jan 10:14

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Peter T. Breuer wrote:
> Brad Campbell <brad <at> wasp.net.au> wrote:
> 
>>If a bit has flipped on a parity stripe, then the parity is inconsistent. When I pop out a disk 
>>and put it back in, the array is going to be written from parity data that is not quite right. (The 
>>problem I believe you were talking about where you have two identical disks and one is inconsistent, 
>>which one do you read from? is similar). And thus the reconstructed array is going to have different 
>>contents to the array before I failed the disk.
>>
>>Therefore it should show the error. No?
> 
> 
> It will not detect it as an error, if that is what you mean.

Now here we have a difference of opinion.

I'm detecting errors using md5sums and fsck.

If the drive checks out clean 1 minute, but has a bit error in a parity stripe and I remove/re-add a 
drive the array is going to rebuild that disk from the remaining data and parity. Therefore the data 
on that drive is going to differ compared to what it was previously.

Next time I do an fsck or md5sum I'm going to notice that something has changed. I'd call that an error.
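
The scenario can be made concrete with a toy stripe.  A minimal sketch,
assuming a three-disk RAID-5 stripe with XOR parity and assuming that
normal (non-degraded) reads return the data blocks without consulting
parity; the block contents and names are invented for illustration.

```python
import hashlib


def xor(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


d0 = b"data block on d0"
d1 = b"data block on d1"
parity = xor(d0, d1)                      # what the parity disk should hold

# One bit silently flips on the parity disk.
damaged = bytearray(parity)
damaged[3] ^= 0x01
bad_parity = bytes(damaged)

# Healthy array: reads come from d0 and d1, so the checksum is still clean.
before = hashlib.md5(d0 + d1).hexdigest()

# Remove and re-add d1: it is rebuilt from d0 and the damaged parity.
rebuilt_d1 = xor(d0, bad_parity)
after = hashlib.md5(d0 + rebuilt_d1).hexdigest()

print(rebuilt_d1 == d1)    # the rebuilt block differs in exactly one bit
print(before == after)     # so a later md5sum of this file changes
```

The resync itself raises no error, since it has nothing to compare against;
the change only becomes visible to a later check that covers the affected
block.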

Brad

Peter T. Breuer | 5 Jan 10:28

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Brad Campbell <brad <at> wasp.net.au> wrote:
> I'm detecting errors using md5sums and fsck.
> 
> If the drive checks out clean 1 minute, but has a bit error in a parity stripe and I remove/re-add a 
> drive the array is going to rebuild that disk from the remaining data and parity. Therefore the data 
> on that drive is going to differ compared to what it was previously.

Indeed.

> Next time I do an fsck or md5sum I'm going to notice that something has changed. I'd call that an error.

If your check can find that type of error, then it will detect it, but
it is intrinsically unlikely that an fsck will see it because the "real
estate" argument says that it is 99% likely that the error occurs inside
a file or in free space rather than in metadata, so it is 99% likely
that fsck will not see anything amiss.

If you do an md5sum on file contents and compare with a previous md5sum
run, then it will be detected provided that the error occurs in a file,
but assuming that your disk is 50% full, that is only 50% likely.

I.e. "it depends on your test".

Peter


Andy Smith | 5 Jan 11:04

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Wed, Jan 05, 2005 at 10:28:53AM +0100, Peter T. Breuer wrote:
> If you do an md5sum on file contents and compare with a previous md5sum
> run, then it will be detected provided that the error occurs in a file,
> but assuming that your disk is 50% full, that is only 50% likely.

"If a bit flips in the unused area of the disk and there is no one
there to md5sum it, did it really flip at all?"

:)

Out of interest Peter could you go into some details about how you
automate the md5sum of your filesystems?  Obviously I can think of
ways I would do it but I'm interested to hear how you have it set up
first.

Brad Campbell | 5 Jan 10:43

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Sorry, sent this privately by mistake.

Peter T. Breuer wrote:

> If you do an md5sum on file contents and compare with a previous md5sum
> run, then it will be detected provided that the error occurs in a file,
> but assuming that your disk is 50% full, that is only 50% likely.
> 
> I.e. "it depends on your test".

brad <at> srv:~$ df -h | grep md0
/dev/md0              2.1T  2.1T  9.2G 100% /raid

I'd say likely :p)

Regards,
Brad


Guy | 5 Jan 16:09

RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Dude!  That's a lot of mp3 files!  :)

-----Original Message-----
From: linux-raid-owner <at> vger.kernel.org
[mailto:linux-raid-owner <at> vger.kernel.org] On Behalf Of Brad Campbell
Sent: Wednesday, January 05, 2005 4:44 AM
To: RAID Linux
Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10
crashing repeatedly and hard)

Sorry, sent this privately by mistake.

Peter T. Breuer wrote:

> If you do an md5sum on file contents and compare with a previous md5sum
> run, then it will be detected provided that the error occurs in a file,
> but assuming that your disk is 50% full, that is only 50% likely.
> 
> I.e. "it depends on your test".

brad <at> srv:~$ df -h | grep md0
/dev/md0              2.1T  2.1T  9.2G 100% /raid

I'd say likely :p)

Regards,
Brad


maarten | 5 Jan 16:52

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Wednesday 05 January 2005 16:09, Guy wrote:
> Dude!  That's a lot of mp3 files!  :)

Indeed.  I "only" have this now:

/dev/md1              590G  590G  187M 100% /disk

md1 : active raid5 sdb3[4] sda3[3] hda3[0] hdc3[5] hde3[1] hdg3[2]
      618437888 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]

...but 4 new 250 GB disks are on their way as we speak. :-)

P.S.:  This is my last post for a while, I have very important work to get 
done the rest of this week.  So see you all next time!

Regards,
Maarten

> -----Original Message-----
> From: linux-raid-owner <at> vger.kernel.org
> [mailto:linux-raid-owner <at> vger.kernel.org] On Behalf Of Brad Campbell
> Sent: Wednesday, January 05, 2005 4:44 AM
> To: RAID Linux
> Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10
> crashing repeatedly and hard)

> Peter T. Breuer wrote:
> > If you do an md5sum on file contents and compare with a previous md5sum
> > run, then it will be detected provided that the error occurs in a file,
> > but assuming that your disk is 50% full, that is only 50% likely.

David Greaves | 4 Jan 20:12

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

maarten wrote:

>Does this make any sense to anybody ?  (I sure hope so...)
>
>Maarten
>
Oh yeah!

David


Maarten | 4 Jan 10:30

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tuesday 04 January 2005 03:41, Andy Smith wrote:
> On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote:
> > On Monday January 3, ewan.grantham <at> gmail.com wrote:

> > No need to change anything.
>
> Except that Peter says that the ext3 journals should be on separate
> non-mirrored devices and the reason this is not mentioned in any
> documentation (md / ext3) is that everyone sees it as obvious.
> Whether it is true or not it's clear to me that it's not obvious to
> everyone.

Be that as it may, with all that Peter wrote in the last 24 hours I tend to 
weigh his expertise a bit less than I did before.  YMMV, but his descriptions 
of his data center do not instill a very high confidence, do they ?

While it may be true that genius math people may make lousy server admins (and 
vice versa), when I read someone claiming there are random undetected errors 
propagating through raid, yet this person cannot even regulate his own 
"random, undetected" power supply problems, then I start to wonder.  

Would you believe that at one point, for a minute I wondered whether Peter was 
actually a troll ?  (yeah, sorry for that, but it happened...)
So no, he apparently is employed at a Spanish university, and he even has a 
Freshmeat project entry, something to do with raid...  

So I'm left with a blank stare, trying to figure out what to make of it. 

Maarten


Peter T. Breuer | 4 Jan 11:18

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Maarten <maarten <at> ultratux.net> wrote:
> On Tuesday 04 January 2005 03:41, Andy Smith wrote:
> > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote:
> > > On Monday January 3, ewan.grantham <at> gmail.com wrote:
> 
> > > No need to change anything.
> >
> > Except that Peter says that the ext3 journals should be on separate
> > non-mirrored devices and the reason this is not mentioned in any
> > documentation (md / ext3) is that everyone sees it as obvious.
> > Whether it is true or not it's clear to me that it's not obvious to
> > everyone.
> 
> Be that as it may, with all that Peter wrote in the last 24 hours I tend to 
> weigh his expertise a bit less than I did before.  YMMV, but his descriptions 
> of his data center do not instill a very high confidence, do they ?

It's not "my" data center.  It is what it is.  I can only control
certain things in it, such as the software on the machines, and which
machines are bought.  Nor is it a "data center", but a working
environment for about 200 scientists and engineers, plus thousands of
incompetent monkeys.  I.e., a university department.

It would be good of you to refrain from justifications based on
denigration.

Peter


Maarten | 4 Jan 14:36

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tuesday 04 January 2005 11:18, Peter T. Breuer wrote:
> Maarten <maarten <at> ultratux.net> wrote:
> > On Tuesday 04 January 2005 03:41, Andy Smith wrote:
> > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote:
> > > > On Monday January 3, ewan.grantham <at> gmail.com wrote:

>
> It's not "my" data center.  It is what it is.  I can only control
> certain things in it, such as the software on the machines, and which
> machines are bought.  Nor is it a "data center", but a working
> environment for about 200 scientists and engineers, plus thousands of
> incompetent monkeys.  I.e., a university department.
>
> It would be good of you to refrain from justifications based on
> denigration.

I seem to recall you starting off boasting about the systems you had in place, 
with the rsync mirroring and all-servers-bought-in-duplicate.  If then later 
on your whole secure data center turns out to be a school department, 
undoubtedly with viruses rampant, students hacking at the schools' systems, 
peer to peer networks installed on the big fileservers unbeknownst to the 
admins, and only mains power when you're lucky, yes, then I get a completely 
other picture than you drew at first.  You can't blame me for that.

This does not mean you're incompetent, it just means you called a univ IT dept 
something that it is not, and never will be: secure, stable and organized. 
In other words, if you dislike being put down, you best not boast so much.

Now you'll have to excuse me, I have things to get done today.


Peter T. Breuer | 4 Jan 15:13

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Maarten <maarten <at> ultratux.net> wrote:
> On Tuesday 04 January 2005 11:18, Peter T. Breuer wrote:
> > Maarten <maarten <at> ultratux.net> wrote:
> > > On Tuesday 04 January 2005 03:41, Andy Smith wrote:
> > > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote:
> > > > > On Monday January 3, ewan.grantham <at> gmail.com wrote:
> >
> > It's not "my" data center.  It is what it is.  I can only control
> > certain things in it, such as the software on the machines, and which
> > machines are bought.  Nor is it a "data center", but a working
> > environment for about 200 scientists and engineers, plus thousands of
> > incompetent monkeys.  I.e., a university department.
> >
> > It would be good of you to refrain from justifications based on
> > denigration.
> 
> I seem to recall you starting off boasting about the systems you had in place, 

I'm not "boasting"  about them. They simply ARE.

> with the rsync mirroring and all-servers-bought-in-duplicate.  If then later 

That's what there is.  Is that supposed to be boasting?  The servers are
always bought in pairs.  They always failover to each other.  They
contain each others mirrors.  Etc.

> on your whole secure data center turns out to be a school department, 

Eh?


maarten | 4 Jan 20:22

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tuesday 04 January 2005 15:13, Peter T. Breuer wrote:
> Maarten <maarten <at> ultratux.net> wrote:
> > On Tuesday 04 January 2005 11:18, Peter T. Breuer wrote:
> > > Maarten <maarten <at> ultratux.net> wrote:
> > > > On Tuesday 04 January 2005 03:41, Andy Smith wrote:
> > > > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote:
> > > > > > On Monday January 3, ewan.grantham <at> gmail.com wrote:
> > >
> > > It's not "my" data center.  It is what it is.  I can only control
> > > certain things in it, such as the software on the machines, and which
> > > machines are bought.  Nor is it a "data center", but a working
> > > environment for about 200 scientists and engineers, plus thousands of
> > > incompetent monkeys.  I.e., a university department.

> I'm not "boasting"  about them. They simply ARE.

Are you not boasting about it, simply by providing all the little details no 
one cares about, except that it makes your story more believable ?

If I state my IQ was tested as above 140, am I then boasting, or simply 
stating a fact ?  Stating a fact and boasting are not mutually exclusive. 

> > on your whole secure data center turns out to be a school department,
>
> Eh?

What, "Eh?" ?  
Are you taking offense to me calling a "university department" a school ?  Is 
it not what you are, you are an educational institution, ie. a school.


Peter T. Breuer | 4 Jan 21:05

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

maarten <maarten <at> ultratux.net> wrote:
> On Tuesday 04 January 2005 15:13, Peter T. Breuer wrote:
> > Maarten <maarten <at> ultratux.net> wrote:
> > > On Tuesday 04 January 2005 11:18, Peter T. Breuer wrote:
> > > > Maarten <maarten <at> ultratux.net> wrote:
> > > > > On Tuesday 04 January 2005 03:41, Andy Smith wrote:
> > > > > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote:
> > > > > > > On Monday January 3, ewan.grantham <at> gmail.com wrote:
> > > >
> > > > It's not "my" data center.  It is what it is.  I can only control
> > > > certain things in it, such as the software on the machines, and which
> > > > machines are bought.  Nor is it a "data center", but a working
> > > > environment for about 200 scientists and engineers, plus thousands of
> > > > incompetent monkeys.  I.e., a university department.
> 
> > I'm not "boasting"  about them. They simply ARE.
> 
> Are you not boasting about it, simply by providing all the little details no 
> one cares about, except that it makes your story more believable ?

What "little details"? Really, this is most aggravating!

> If I state my IQ was tested as above 140, am I then boasting, or simply 
> stating a fact ?

You're being an improbability.

> Stating a fact and boasting are not mutually exclusive. 

But about WHAT? I have no idea what you may consider boasting!

Guy | 4 Jan 22:38

RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Back to MTBF please.....

I agree that 1M hours MTBF is very bogus.  I don't really know how they
compute MTBF.  But I would like to see them compute the MTBF of a birthday
candle.

A birthday candle lasts about 2 minutes (as a guess).  I think they would
light 1000 candles at the same time.  Then monitor them until the first one
fails, say at 2 minutes.  I think the MTBF would then be computed as 2000
minutes MTBF!  But we can be sure that by 2.5 minutes, at least 90% of them
would have failed.
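
The candle arithmetic can be written out explicitly. This is a minimal sketch, assuming the usual point estimate of accumulated operating time divided by failures observed; the function name and parameters are made up for illustration and are not taken from any vendor's published method.

```python
def mtbf_estimate(units_on_test, elapsed_minutes, failures):
    """Point estimate of MTBF: accumulated unit-minutes divided by failures seen.

    Assumes every unit ran for the whole elapsed time, which holds when the
    test is stopped at the first failure, as in the candle example.
    """
    if failures <= 0:
        raise ValueError("need at least one observed failure")
    return units_on_test * elapsed_minutes / failures


# 1000 candles, first failure at 2 minutes
candle_mtbf = mtbf_estimate(units_on_test=1000, elapsed_minutes=2, failures=1)
print(candle_mtbf)  # 2000.0
```

The figure says nothing about how long any single candle lasts; it only describes the failure rate of the population during the short test window.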

Guy

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mikael Abrahamsson | 5 Jan 01:58

RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tue, 4 Jan 2005, Guy wrote:

> light 1000 candles at the same time.  Then monitor them until the first one
> fails, say at 2 minutes.  I think the MTBF would then be computed as 2000
> minutes MTBF!  But we can be sure that by 2.5 minutes, at least 90% of them
> would have failed.

Which is why you, when you purchase a lot of stuff, should ask for an 
annual return rate value, which probably makes more sense than MTBF, even 
though these values are related.
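
One way to see how the two values are related is to convert an MTBF figure into an annual failure fraction. The sketch below assumes a constant failure rate and continuous operation; the function name is hypothetical, and a vendor's "annual return rate" may be defined differently (it counts returned units, not failures).

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def annual_failure_rate(mtbf_hours):
    """Fraction of units expected to fail within one year of continuous use,
    assuming a constant failure rate (exponentially distributed lifetimes)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


afr = annual_failure_rate(1_000_000)
print(f"{afr:.2%}")  # 0.87%
```

Under those assumptions a 1M hour MTBF corresponds to a bit under one failure per hundred drives per year, which is easier to compare against what actually happens in a machine room.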

--

-- 
Mikael Abrahamsson    email: swmike <at> swm.pp.se


Peter T. Breuer | 5 Jan 00:53

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Guy <bugzilla <at> watkins-home.com> wrote:
> A birthday candle lasts about 2 minutes (as a guess).  I think they would
> light 1000 candles at the same time.  Then monitor them until the first one
> fails, say at 2 minutes.  I think the MTBF would then be computed as 2000
> minutes MTBF!

If the distribution is Poisson (i.e. the probability of dying per unit
time is constant over time) then that is correct. I don't know offhand
if that is an unbiassed estimator. I would imagine not. It would be
biassed to the short side.
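
The question of bias can at least be probed numerically. The following is a rough simulation sketch, assuming lifetimes are exponentially distributed (the constant-rate case); the true MTBF, the seed and the trial count are arbitrary choices for illustration. Under that assumption the time to first failure among n units is itself exponential with mean MTBF/n, so n times the first-failure time should average out near the true value, while most individual estimates still land on the short side because the distribution is heavily skewed.

```python
import random
import statistics


def first_failure_estimate(rng, n_units, true_mtbf):
    """Run n_units until the first one fails; return n_units * that time."""
    first = min(rng.expovariate(1.0 / true_mtbf) for _ in range(n_units))
    return n_units * first


rng = random.Random(12345)  # fixed seed so the run is repeatable
TRUE_MTBF = 2000.0          # minutes; hypothetical value for this sketch
estimates = [first_failure_estimate(rng, 1000, TRUE_MTBF) for _ in range(2000)]

mean_estimate = statistics.mean(estimates)
median_estimate = statistics.median(estimates)
share_below = sum(e < TRUE_MTBF for e in estimates) / len(estimates)
print(round(mean_estimate), round(median_estimate), round(share_below, 2))
```

If the mean comes out near 2000 while the median sits well below it, that would suggest the estimator is unbiased on average but usually short in any single test, and in any case very noisy when it rests on one failure.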

> But we can be sure that by 2.5 minutes, at least 90% of them
> would have failed.

Then you would be sure that the distribution was not Poisson. What is
the problem here, exactly?  Many different distributions can have the
same mean.  For example, this one:

deaths per unit time
|
|   /\
|  /  \
| /    \
|/      \
---------->t

and this one

deaths per unit time
|

maarten | 4 Jan 22:48

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tuesday 04 January 2005 21:05, Peter T. Breuer wrote:
> maarten <maarten <at> ultratux.net> wrote:
> > On Tuesday 04 January 2005 15:13, Peter T. Breuer wrote:
> > > Maarten <maarten <at> ultratux.net> wrote:

> > Are you not boasting about it, simply by providing all the little details
> > no one cares about, except that it makes your story more believable ?
>
> What "little details"? Really, this is most aggravating!

These little details, as you scribbled, very helpfully I might add, below. ;)
  |
  |
  V

> over to backup pairs.  Last xmas I distinctly remember holding up the
> department on a single surviving server because a faulty cable had
> intermittently taken out one pair, and a faulty router had taken out
> another.  I forget what had happened to the remaining server.  Probably
> the cleaners switched it off!  Anyway, one survived and everything
> failed over to it, in a planned degradation.
>
> It would have been amusing, if I hadn't had to deal with a horrible
> mail loop caused by mail being bounced by the server with intermittent
> contact through the faulty cable. There was no way of stopping it,
> since I couldn't open the building till Jan 6!

And another fine example of the various hurdles you encounter ;-)
Couldn't you just get the key from someone ?  If not, what if you saw 
something far worse happening, like all servers in one room dying shortly 

Peter T. Breuer | 5 Jan 00:14

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

maarten <maarten <at> ultratux.net> wrote:
> On Tuesday 04 January 2005 21:05, Peter T. Breuer wrote:
> > maarten <maarten <at> ultratux.net> wrote:
> > > On Tuesday 04 January 2005 15:13, Peter T. Breuer wrote:
> > > > Maarten <maarten <at> ultratux.net> wrote:
> 
> 
> > > Are you not boasting about it, simply by providing all the little details
> > > no one cares about, except that it makes your story more believable ?
> >
> > What "little details"? Really, this is most aggravating!

> These little details, as you scribbled, very helpfully I might add, below. ;)
>   |
>   |
>   V
> 
> > over to backup pairs.  Last xmas I distinctly remember holding up the
> > department on a single surviving server because a faulty cable had
> > intermittently taken out one pair, and a faulty router had taken out
> > another.  I forget what had happened to the remaining server.  Probably
> > the cleaners switched it off!  Anyway, one survived and everything
> > failed over to it, in a planned degradation.

This is in response to your strange statement that I had a "data center".
I hope it gives you a better idea.

> > It would have been amusing, if I hadn't had to deal with a horrible
> > mail loop caused by mail being bounced by the server with intermittent
> > contact through the faulty cable. There was no way of stopping it,

maarten | 5 Jan 02:53

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Wednesday 05 January 2005 00:14, Peter T. Breuer wrote:
> maarten <maarten <at> ultratux.net> wrote:
> > On Tuesday 04 January 2005 21:05, Peter T. Breuer wrote:
> > > maarten <maarten <at> ultratux.net> wrote:

> > If not, what if you saw
> > something far worse happening, like all servers in one room dying shortly
> > after another, or a full encompassing system compromise going on ??
>
> Nothing - I could not get in.

Now that is a sensible solution !  The fans in the server died off, you have 
30 minutes before everything overheats and subsequently incinerates the whole 
building, and you have no way to prevent that.  Great !  Well played.

> No - they can't do any of those things.  P2p nets are not illegal, and
> we would see the traffic if there were any.  They cannot "change their
> grades" because they do not have access to them - nobody does.  They are
> sent to goodness knows where in a govt building somewhere via ssl (an
> improvement from the times when we had to fill in a computer card marked
> in ink, for goodness sake, but I haven't done the sending in myself
> lately, so I don't know the details - I give the list to the secretary
> rather than suffer).  As to reading MY disk, anyone can do that.  I
> don't have secrets, be it marks or anything else.  Indeed, my disk will
> nfs mount on the student machines if they so much as cd to my home
> directory (but don't tell them that!).  Of course they'd then have to
> figure out how to become root in order to change uid so they could read
> my data, and they can't do that - all the alarms in the building would
> go off!  su isn't even executable, let alone suid, and root login is
> disabled so many places I forget (heh, .profile in /root says something

Neil Brown | 4 Jan 04:42

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Tuesday January 4, andy <at> strugglers.net wrote:
> On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote:
> > On Monday January 3, ewan.grantham <at> gmail.com wrote:
> > > I've setup a RAID-5 array using two internal 250 Gig HDs and two
> > > external 250 Gig HDs through a USB-2 interface. Each of the externals
> > > is on its own card, and the internals are on separate IDE channels.
> > > 
> > > I "thought" I was doing a good thing by doing all of this and then
> > > setting them up using an ext3 filesystem.
> > 
> > Sounds like a perfectly fine setup (providing always that external
> > cables are safe from stray feet etc).
> > 
> > No need to change anything.
> 
> Except that Peter says that the ext3 journals should be on separate
> non-mirrored devices and the reason this is not mentioned in any
> documentation (md / ext3) is that everyone sees it as obvious.
> Whether it is true or not it's clear to me that it's not obvious to
> everyone.

If Peter says that, then Peter is WRONG.

ext3 journals are much safer on mirrored devices than on non-mirrored
devices just the same as any other data is safer on mirrored than on
non-mirrored. 
In the case in question, it is raid5, not mirrored, but still raid5 is
safer than raid0 or single devices (possibly not quite as safe as raid1).

NeilBrown

Peter T. Breuer | 4 Jan 10:50

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Neil Brown <neilb <at> cse.unsw.edu.au> wrote:
> On Tuesday January 4, andy <at> strugglers.net wrote:
> > Except that Peter says that the ext3 journals should be on separate
> > non-mirrored devices and the reason this is not mentioned in any
> > documentation (md / ext3) is that everyone sees it as obvious.
> > Whether it is true or not it's clear to me that it's not obvious to
> > everyone.
> 
> If Peter says that, then Peter is WRONG.

But Peter does NOT say that.

> ext3 journals are much safer on mirrored devices than on non-mirrored

That's irrelevant - you don't care what's in the journal, because if
your system crashes before committal you WANT the data in the journal
to be lost, rolled back, whatever, and you don't want your machine to
have acked the write until it actually has gone to disk.

Or at least that's what *I* want. But then everyone has different
wants and needs. What is obvious, however, are the issues involved.

Peter


Guy | 4 Jan 17:42

RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

This may be a stupid question...  But it seems obvious to me!
If you don't want your journal after a crash, why have a journal?

Guy

-----Original Message-----
From: linux-raid-owner <at> vger.kernel.org
[mailto:linux-raid-owner <at> vger.kernel.org] On Behalf Of Peter T. Breuer
Sent: Tuesday, January 04, 2005 4:51 AM
To: linux-raid <at> vger.kernel.org
Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10
crashing repeatedly and hard)

Neil Brown <neilb <at> cse.unsw.edu.au> wrote:
> On Tuesday January 4, andy <at> strugglers.net wrote:
> > Except that Peter says that the ext3 journals should be on separate
> > non-mirrored devices and the reason this is not mentioned in any
> > documentation (md / ext3) is that everyone sees it as obvious.
> > Whether it is true or not it's clear to me that it's not obvious to
> > everyone.
> 
> If Peter says that, then Peter is WRONG.

But Peter does NOT say that.

> ext3 journals are much safer on mirrored devices than on non-mirrored

That's irrelevant - you don't care what's in the journal, because if
your system crashes before committal you WANT the data in the journal
to be lost, rolled back, whatever, and you don't want your machine to

Peter T. Breuer | 4 Jan 18:46

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Guy <bugzilla <at> watkins-home.com> wrote:
> This may be a stupid question...  But it seems obvious to me!
> If you don't want your journal after a crash, why have a journal?

Journalled fs's have the property that their file systems are always
coherent (provided other corruption has not occurred).  This is often
advantageous in terms of providing you with the ability to at least
boot. The fs code is organised so that everything is set up for a
metadata change, and then a single "final" atomic operation occurs that
finalizes the change.

It is THAT property that is desirable. It is not intrinsic to journalled
file systems, but in practice only journalled file systems have
implemented it.

In other words, what I'd like here is a journalled file system with a
zero size journal.

Peter


David Greaves | 4 Jan 15:15

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Peter T. Breuer wrote:

>>ext3 journals are much safer on mirrored devices than on non-mirrored
>>    
>>
>That's irrelevant - you don't care what's in the journal, because if
>your system crashes before committal you WANT the data in the journal
>to be lost, rolled back, whatever, and you don't want your machine to
>have acked the write until it actually has gone to disk.
>
>Or at least that's what *I* want. But then everyone has different
>wants and needs. What is obvious, however, are the issues involved.
>  
>
err, no.

If the journal is safely written to the journal device and the machine 
crashes whilst updating the main filesystem you want the journal to be 
replayed, not erased. The journal entries are designed to be replayable 
to a partially updated filesystem.

That's the whole point of journalling filesystems, write the deltas to 
the journal, make the changes to the fs, delete the deltas from the journal.

If the machine crashes whilst the deltas are being written then you 
won't play them back - but your fs will be consistent.

Journaled filesystems simply ensure the integrity of the fs metadata - 
they don't protect against random acts of application/user level 
vandalism (ie power failure).
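
The write-deltas, apply, delete-deltas cycle described above can be modelled in a few lines. This is a toy sketch only, not how ext3 or its journalling layer is implemented: the class, its method names and the `crash_after` stages are all invented for illustration, and a "crash" is simulated by returning early.

```python
class ToyJournaledStore:
    """Toy model: 'disk' maps block name -> value, 'journal' holds pending deltas."""

    def __init__(self):
        self.disk = {}
        self.journal = []       # deltas written for the current transaction
        self.committed = False  # True once the journal entry is complete

    def write_transaction(self, deltas, crash_after=None):
        """Apply deltas via the journal. crash_after simulates a power cut at
        'journal_partial', 'commit' or 'apply_partial'."""
        items = list(deltas.items())
        # 1. write the deltas to the journal
        for i, item in enumerate(items):
            self.journal.append(item)
            if crash_after == "journal_partial" and i == 0:
                return
        # 2. mark the journal entry as complete
        self.committed = True
        if crash_after == "commit":
            return
        # 3. make the changes to the main filesystem
        for i, (block, value) in enumerate(items):
            self.disk[block] = value
            if crash_after == "apply_partial" and i == 0:
                return
        # 4. delete the deltas from the journal
        self.journal.clear()
        self.committed = False

    def recover(self):
        """Run after a crash: replay complete entries, discard incomplete ones."""
        if self.committed:
            for block, value in self.journal:
                self.disk[block] = value  # replaying twice gives the same result
        self.journal.clear()
        self.committed = False


store = ToyJournaledStore()
store.write_transaction({"inode": "v1", "bitmap": "v1"})

# Crash while updating the main filesystem: the journal is replayed.
store.write_transaction({"inode": "v2", "bitmap": "v2"}, crash_after="apply_partial")
half_applied = dict(store.disk)   # inode is v2, bitmap is still v1
store.recover()
after_replay = dict(store.disk)   # both blocks are v2

# Crash while writing the journal: nothing is replayed, the old state stays.
store.write_transaction({"inode": "v3", "bitmap": "v3"}, crash_after="journal_partial")
store.recover()
after_discard = dict(store.disk)  # both blocks are still v2
```

In both crash cases the store ends up consistent; what differs is whether the interrupted change survives, which is the distinction the replay-versus-discard argument turns on.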

Peter T. Breuer | 4 Jan 16:20

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

David Greaves <david <at> dgreaves.com> wrote:
> Peter T. Breuer wrote:
> 
> >>ext3 journals are much safer on mirrored devices than on non-mirrored
> >That's irrelevant - you don't care what's in the journal, because if
> >your system crashes before committal you WANT the data in the journal
> >to be lost, rolled back, whatever, and you don't want your machine to
> >have acked the write until it actually has gone to disk.
> >
> >Or at least that's what *I* want. But then everyone has different
> >wants and needs. What is obvious, however, are the issues involved.
> 
> If the journal is safely written to the journal device and the machine 

You don't know it has been. Raid can't tell.

> crashes whilst updating the main filesystem you want the journal to be 
> replayed, not erased. The journal entries are designed to be replayable 
> to a partially updated filesystem.

It doesn't work. You can easily get a block  written to the journal on
disk A, but not on disk B (supposing raid 1 with disks A and B).
According to you "this" should be replayed. Well, which result do you
want? Raid has no way of telling.

Suppose that A contains the last block to be written to a file, and
B does not. Yet B is chosen by raid as the "reliable" source.

Then what happens? 
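
The scenario can be made concrete with a toy model of a two-disk mirror. This is only a sketch of the argument, with invented function names; it does not describe how the md driver actually orders writes or picks a resync source.

```python
def mirrored_write(disk_a, disk_b, block, value, crash_between=False):
    """Write a block to both halves of a toy RAID1 pair, A first and then B."""
    disk_a[block] = value
    if crash_between:
        return  # power lost: A has the new block, B still has the old one
    disk_b[block] = value


def resync(source, target):
    """Toy resync: make the target an exact copy of the chosen source."""
    target.clear()
    target.update(source)


disk_a, disk_b = {}, {}
mirrored_write(disk_a, disk_b, "journal-block", "old entry")
mirrored_write(disk_a, disk_b, "journal-block", "new entry", crash_between=True)

halves_disagree = disk_a["journal-block"] != disk_b["journal-block"]

# If B is picked as the "reliable" half, the newer block on A is overwritten.
resync(source=disk_b, target=disk_a)
surviving = disk_a["journal-block"]
```

In this model the mirror is consistent again after the resync, but whether the last journal block survives depends entirely on which half was chosen as the source.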


Alvin Oga | 4 Jan 02:18

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)


On Tue, 4 Jan 2005, Peter T. Breuer wrote:

> Neil Brown <neilb <at> cse.unsw.edu.au> wrote:

> > > Let 
> > > 
> > >    p = probability of a detectable error occurring on a disk in a unit time
> > >    p'= probability of an undetectable error occurring on a disk in a unit time
> > > 

i think the definitions and modes of failure are what each reader is
interpreting from their perspective ??

> > think, the branch of mathematics that has the highest ratio of people
> > who think that they understand it to people who actually do (witness the
> > success of lotteries).

ahh ... but the stock market is the world's largest casino

> Possibly. But not all of them teach probability at university level
> (and did so when they were 21, at the University of Cambridge to boot,
> and continued teaching pure math there at all subjects and all levels
> until the age of twenty-eight - so puhleeeze don't bother!).

:-)

> I mean an error occurs that can be detected (by the experiment you run,
> which is presumably an fsck, but I don't presume to dictate to you).


Neil Brown | 4 Jan 05:29

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

On Monday January 3, aoga <at> ns.Linux-Consulting.com wrote:
> 
> > > think, the branch of mathematics that has the highest ratio of people
> > > who think that they understand it to people who actually do (witness the
> > > success of lotteries).
> 
> ahh ... but the stock market is the world's largest casino

and how many people do you know who make money on stock markets?
Now compare that with how many lose money on lotteries.
Find out the ratio and .....

>  
> > Possibly. But not all of them teach probability at university level
> > (and did so when they were 21, at the University of Cambridge to boot,
> > and continued teaching pure math there at all subjects and all levels
> > until the age of twenty-eight - so puhleeeze don't bother!).

Apparently teaching probability at University doesn't necessarily mean
that you understand it.  I cannot comment on your understanding, but
if you ask google about the Monty Hall problem  and include search
terms like "professor" or "maths department" you will find plenty of
(reported) cases of University staff not getting it.

e.g.  http://www25.brinkster.com/ranmath/marlright/montynyt.htm

 "Our math department had a good, self-righteous laugh at your
 expense," wrote Mary Jane Still, a professor at Palm Beach Junior
 College. Robert Sachs, a professor of mathematics at George Mason
 University in Fairfax, Va., expressed the prevailing view that there

Peter T. Breuer | 4 Jan 09:43

Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Neil Brown <neilb <at> cse.unsw.edu.au> wrote:
> On Monday January 3, aoga <at> ns.Linux-Consulting.com wrote:
> > 
> > > > think, the branch of mathematics that has the highest ratio of people
> > > > who think that they understand it to people who actually do (witness the
> > > > success of lotteries).
> > 
> > ahh ... but the stock market is the world's largest casino
> 
> and how many people do you know who make money on stock markets?

Ooh .. several mathematicians (pay is not very high!). 

> Now compare that with how many lose money on lotteries.

I don't have to - I wouldn't place money in a lottery.  The expected
gain is negative whatever you do.  I stick to investments where I have
an expectation of a positive gain with at least some strategy.

Mind you, as Conway often said,  statistics don't apply to improbable
events. So you should bet on anything which is not likely to occur more
than once or twice a lifetime (theory - if you win, just don't try it
again; if you die first, well, you won't care).

Stick to something more certain, like blackjack, if you want to make $.

> Apparently teaching probability at University doesn't necessarily mean
> that you understand it. 

Perhaps the problem is at your end?

Guy | 3 Jan 18:34

RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Having a filesystem go into read only mode is a "down system".  Not
acceptable to me!  Maybe ok for a home system, but I don't assume Linux is
limited to home use.  In my case, this is not acceptable for my home system.
Time is money!

About user intervention.  If the system stops working until someone does
something, that is a down system.  That is what I meant by user
intervention.  Replacing a disk Monday that failed Friday night, is what I
would expect.  This is a normal failure to me.  Even if a re-boot is
required, as long as it can be scheduled, it is acceptable to me.

You and I have had very different failures over the years!
In my case, most failures are disks, and most of the time the system
continues to work just fine, without user intervention.  If spare disks are
configured, the array re-builds to the spare.  At my convenience, I replace
the disk, without a system re-boot.  Most Unix systems I have used have SCSI
disks.  IDE tends to be in home systems.  My home system is Linux with 17
SCSI disks.  I have replaced a disk without a re-boot, but the disk cabinet
is not hot-swap, so I tend to shut down the system to replace a disk.

My 20 systems had anywhere from 4 to about 44 disks.  You should expect 1
disk failure out of 25-100 disks per year.  There are good years and bad!
Our largest customer system has more than 300 disks.  I don't know the
failure rate, but most failures do not take the system down!  Our customer
systems tend to have hardware RAID systems.  HP, EMC, DG (now EMC).
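
The rule of thumb above (one failure per 25 to 100 disks per year) can be turned into rough expectations for a given system. This is a back-of-envelope sketch: the function names are invented, and the second function additionally assumes that disks fail independently, which lightning strikes and overheated rooms clearly violate.

```python
def expected_failures_per_year(n_disks, disks_per_failure):
    """Expected number of failed disks per year for a fleet of n_disks."""
    return n_disks / disks_per_failure


def prob_at_least_one_failure(n_disks, disks_per_failure):
    """Chance of one or more failures in a year, assuming independent disks."""
    per_disk = 1.0 / disks_per_failure
    return 1.0 - (1.0 - per_disk) ** n_disks


# The 300-disk customer system
low = expected_failures_per_year(300, 100)   # 3.0
high = expected_failures_per_year(300, 25)   # 12.0

# The 17-disk home system
p17_low = prob_at_least_one_failure(17, 100)
p17_high = prob_at_least_one_failure(17, 25)
print(low, high, round(p17_low, 2), round(p17_high, 2))
```

On these numbers a 300-disk system should expect a handful of replacements every year, so the array has to keep running through them.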

If you have a 10% disk failure rate per year, something else is wrong!  You
may have a bad building ground, or too much current flowing on the building
ground line.  All sorts of power problems are very common.  Most if not all
electricians only know the building code.  They are not qualified to debug

Gordon Henderson | 3 Jan 20:20

ext3 ..


Been following this with interest as just about everything I'm building
these days has raid1 to boot and data (typical small server setup), and
raid5 in larger boxes for data and ext3 ...

No problems with this yet - several power failures and disks lost and it's
all generally behaved as I expected it to. I've hot-changed SCSI drives
which have failed and cold changed IDE drives at a convenient time for the
server...

I did have a problem recently though - had a disk fail in an 8-disk
external SCSI array, arranged as a 7+1 RAID5 ... Then 5 minutes later had
a 2nd disk fail.

So to the upper layers, ext3, userland, etc. that should look like a
catastrophic hardware failure -- anything trying to read/write to it
should (IMO) have simply returned with IO errors.

What actually happened was that the kernel panicked and the whole box
ground to a halt. The server could have carried on doing useful stuff
without this disk partition, but a big oops and halt wasn't useful.

(This is 2.4.27 in case it matters)

I didn't have time to work out the why/what/wherevers of the problem, the
box was power cycled and brought online minus the external array. Ext3 did
its thing and enabled the box to come up in seconds rather than hours
(it's a big Dell - it boots Linux faster than it goes through its BIOS!)

As for the external array, well, that was resurrected with mdadm with no


Re: ext3 ..

> Been following this with interest as just about everything I'm building
> these days has raid1 to boot and data (typical small server setup), and
> raid5 in larger boxes for data and ext3 ...
>
> No problems with this yet - several power failures and disks lost and it's
> all generally behaved as I expected it to. I've hot-changed SCSI drives
> which have failed and cold changed IDE drives at a convenient time for the
> server...
>
> I did have a problem recently though - had a disk fail in an 8-disk
> external SCSI array, arranged as a 7+1 RAID5 ... Then 5 minutes later had
> a 2nd disk fail.
>
> So to the upper layers, ext3, userland, etc. that should look like a
> catastrophic hardware failure -- anything trying to read/write to it
> should (IMO) have simply returned with IO errors.

That depends on the options when the filesystem was mounted. Or the
options set in the superblock. The choices are continue, remount
read-only or panic.

Regards, Morten
----
A: No.
Q: Should I include quotations after my reply?


Gordon Henderson | 3 Jan 21:05

Re: ext3 ..

On Mon, 3 Jan 2005, Morten Sylvest Olsen wrote:

> > So to the upper layers, ext3, userland, etc. that should look like a
> > catastrophic hardware failure -- anything trying to read/write to it
> > should (IMO) have simply returned with IO errors.
>
> That depends on the options when the filesystem was mounted. Or the
> options set in the superblock. The choices are continue, remount
> read-only or panic.

however - on the system in question:

  xena:/home/gordonh# dumpe2fs -h /dev/md5
  ...
  Errors behavior:          Continue
  ...

and there are no mount options to say otherwise.
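
For checking this across several filesystems, the "Errors behavior" field can be pulled out of saved `dumpe2fs -h` output with a few lines of script. This is a small sketch: the helper name is invented and the sample text is illustrative, not the actual output from the box above. The superblock default is normally changed with `tune2fs -e` (continue, remount-ro or panic) and can be overridden per mount with the `errors=` option.

```python
def errors_behavior(dumpe2fs_header):
    """Return the 'Errors behavior' value from dumpe2fs -h text, or None."""
    for line in dumpe2fs_header.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Errors behavior":
            return value.strip()
    return None


# Illustrative sample only; real output has many more fields.
sample = """\
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
"""

behavior = errors_behavior(sample)
print(behavior)  # Continue
```

A `None` result means the field was not found in the text, which is worth treating as an error, not as a default.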

There are lots of ext3 whinges in the log-file so I guess it just got
fed-up... And this really isn't a linux-raid issue, just something I
noticed recently to do with ext3...

I did try XFS a while back, but had more problems with it and no
satisfactory answers, so gave up on it... The trouble is, you get too used
to what works and tend to stick with it...

Ah well.

Gordon

