Aaron Lehmann | 1 Apr 2007 06:39

Re: Silent corruption on AMD64

On Sat, Mar 31, 2007 at 08:03:16PM -0700, Jim Paris wrote:
> Since it shows up under heavy load that includes unrelated devices, I
> think ruling out hardware problems is important.  Some suggestions:

I've been able to narrow it down to the Realtek Ethernet card. I can't
reproduce the problem using onboard Ethernet, whereas the Realtek card
causes trouble in any slot. However, I still don't know whether it's a
hardware or software issue, or whether it's caused directly or
indirectly by the Realtek card.
-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Stuart MacDonald | 2 Apr 2007 15:30

RE: Silent corruption on AMD64

From: On Behalf Of Aaron Lehmann
> I've been able to narrow it down to the Realtek Ethernet card. I can't
> reproduce the problem using onboard Ethernet, whereas the Realtek card
> causes trouble in any slot. However, I still don't know whether it's a
> hardware or software issue, or whether it's caused directly or
> indirectly by the Realtek card.

I had a similar issue recently:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=223216

I recommend trying Doug Ledford's memtest script:
http://people.redhat.com/dledford/memtest.html

It helped me prove the issue was the hardware and not something else.

..Stu

Andi Kleen | 1 Apr 2007 15:58

Re: Silent corruption on AMD64

Aaron Lehmann <aaronl <at> vitelus.com> writes:

[adding netdev]
[meta-comment: I wish people wouldn't use such unnecessarily broad subjects 
-- how is it the x86-64 port's or AMD's fault when you have broken hardware? 
Would anybody write "Silent corruption on i386" or "Silent corruption 
on Intel" or "Silent corruption on Linux"?]

> On Sat, Mar 31, 2007 at 08:03:16PM -0700, Jim Paris wrote:
> > Since it shows up under heavy load that includes unrelated devices, I
> > think ruling out hardware problems is important.  Some suggestions:
> 
> I've been able to narrow it down to the Realtek Ethernet card. I can't
> reproduce the problem using onboard Ethernet, whereas the Realtek card
> causes trouble in any slot. However, I still don't know whether it's a
> hardware or software issue, or whether it's caused directly or
> indirectly by the Realtek card.

You could disable the hardware checksumming support in the card with
the appended patch. Then hopefully Linux will catch most corruptions
(but perhaps not all, because TCP checksums are not very strong).
You can then watch for failed checksums with netstat -s.
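
To watch the counters over time rather than eyeballing netstat -s, a
minimal Python sketch follows. It assumes a Linux system that exposes
/proc/net/snmp in the usual "header row, value row" layout; the function
names are hypothetical. The InCsumErrors column only exists on newer
kernels, so the sketch also reports the broader InErrs/InErrors counters,
which include errors other than bad checksums.

```python
def parse_snmp(text):
    """Parse /proc/net/snmp-style text into {protocol: {counter: value}}."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    stats = {}
    # Rows come in pairs: a row of counter names, then a row of values.
    for names, values in zip(rows[0::2], rows[1::2]):
        proto = names[0].rstrip(":")
        stats[proto] = dict(zip(names[1:], (int(v) for v in values[1:])))
    return stats


def receive_error_counters(stats):
    """Pick out the receive-error counters that are present.

    InCsumErrors counts checksum failures only; InErrs and InErrors are
    broader and also include other kinds of bad packets.
    """
    wanted = ("InCsumErrors", "InErrs", "InErrors")
    return {
        (proto, name): counters[name]
        for proto, counters in stats.items()
        for name in wanted
        if name in counters
    }


def read_counters(path="/proc/net/snmp"):
    """Read and filter the live counters (assumes the file exists)."""
    with open(path) as f:
        return receive_error_counters(parse_snmp(f.read()))
```

Calling read_counters() before and after a test run and comparing the two
results shows whether any of these counters moved while the load was
applied.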

-Andi

Index: linux-2.6.21-rc3-net/drivers/net/r8169.c
===================================================================
--- linux-2.6.21-rc3-net.orig/drivers/net/r8169.c
+++ linux-2.6.21-rc3-net/drivers/net/r8169.c
@@ -2477,6 +2477,7 @@ static inline int rtl8169_fragmented_fra
(Continue reading)

Francois Romieu | 4 Apr 2007 20:45

Re: Silent corruption with r8169

Andi Kleen <andi <at> firstfloor.org> :
> Aaron Lehmann <aaronl <at> vitelus.com> writes:
> 
> [adding netdev]
> [meta-comment: I wish people wouldn't use such unnecessarily broad subjects 
> -- how is it the x86-64 port's or AMD's fault when you have broken hardware? 
> Would anybody write "Silent corruption on i386" or "Silent corruption 
> on Intel" or "Silent corruption on Linux"?]

I hope you feel better now that I changed the subject.

Aaron, I see no clear suspect between 2.6.20.1 and current -git
that could explain or fix a corruption in the r8169 driver.

Can you apply on top of latest 2.6.21-rc5-git the patches available at
http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.21-rc5/r8169-20070402

000[12]-r8169-foo-bar.patch have been committed a few minutes ago: you
should check if they apply or not.

netconsole appears compiled as a module. Is it used ?

-- 
Ueimor

Aaron Lehmann | 4 Apr 2007 22:06

Re: Silent corruption with r8169

On Wed, Apr 04, 2007 at 08:45:04PM +0200, Francois Romieu wrote:
> > [adding netdev]
> > [meta-comment: I wish people wouldn't use such unnecessarily broad subjects 
> > -- how is it the x86-64 port's or AMD's fault when you have broken hardware? 
> > Would anybody write "Silent corruption on i386" or "Silent corruption 
> > on Intel" or "Silent corruption on Linux"?]
> 
> I hope you feel better now that I changed the subject.
> 
> Aaron, I see no clear suspect between 2.6.20.1 and current -git
> that could explain or fix a corruption in the r8169 driver.
> 
> Can you apply on top of latest 2.6.21-rc5-git the patches available at
> http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.21-rc5/r8169-20070402
> 
> 000[12]-r8169-foo-bar.patch have been committed a few minutes ago: you
> should check if they apply or not.

I'll try to get to testing this, but I'm wondering if people may have
misunderstood my original post. I don't get any corruption over
Ethernet; it's just corruption on the filesystem during certain load
patterns that involve the Realtek ethernet card.

Aaron

Andi Kleen | 5 Apr 2007 13:41

Re: Silent corruption with r8169

> I'll try to get to testing this, but I'm wondering if people may have
> misunderstood my original post. I don't get any corruption over
> Ethernet; it's just corruption on the filesystem during certain load
> patterns that involve the Realtek ethernet card.

If disabling hardware checksums helps, then you know the corruption
is on the Ethernet side. Otherwise it's somewhere else.

-Andi

Francois Romieu | 4 Apr 2007 22:45

Re: Silent corruption with r8169

Aaron Lehmann <aaronl <at> vitelus.com> :
[...]
> I'll try to get to testing this, but I'm wondering if people may have
> misunderstood my original post. I don't get any corruption over
> Ethernet; it's just corruption on the filesystem during certain load
> patterns that involve the Realtek ethernet card.

It is too soon to label it a debug feature or a genuine bug in the
r8169 driver, but at least there is an r8169 patchkit to help with
various annoyances (an obscure bug on AMD platforms, for instance).
It could make a difference.

The disk io + r8169 driver bugs can be very frustrating to debug :o/

-- 
Ueimor

Anybody got a battery for my Ultra 10 ?

Rick Jones | 2 Apr 2007 19:46

NIC data corruption

I changed the title to be more accurate, and culled the distribution to 
individuals and netdev

The mention of trying to turn off CKO (checksum offload) to see if the
data corruption goes away leads me to ask a possibly "delicate" question:

   Should "Linux" only enable CKO on those NICs certified to have
   ECC/parity throughout their _entire_ data path?

rick jones
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andi Kleen | 2 Apr 2007 20:07

Re: NIC data corruption

On Mon, Apr 02, 2007 at 10:46:00AM -0700, Rick Jones wrote:
> I changed the title to be more accurate, and culled the distribution to 
> individuals and netdev
> 
> The mention of trying to turn-off CKO and see if the data corruption 
> goes away leads me to ask a possibly "delicate" question:
> 
>   Should "Linux" only enable CKO on those NICs certified to have
>   ECC/parity throughout their _entire_ data path?

Even with reliable software checksumming you can have quite a lot of undetected
errors (there was an interesting study about this some years ago).
If you really care about your data you should use SSL or some
other protocol with strong checksums.
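
To make the weakness concrete, here is a small Python sketch of the 16-bit
ones' complement sum that TCP and UDP checksums are built on (in the style
of RFC 1071). It is an illustration only: it leaves out the pseudo-header,
and the function name and sample bytes are made up. Because the sum is
commutative, reordered 16-bit words, or two errors that cancel each other
out, produce the same checksum.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"\x12\x34\x56\x78\x9a\xbc"
# Corruption 1: the first two 16-bit words are swapped.
swapped = b"\x56\x78\x12\x34\x9a\xbc"
# Corruption 2: one word is incremented and another decremented by one.
offset = b"\x12\x35\x56\x77\x9a\xbc"

for name, payload in (("swapped", swapped), ("offset", offset)):
    same = internet_checksum(payload) == internet_checksum(original)
    print(name, "differs from original but checksum matches:", same)
```

Both corrupted payloads pass the check, which is why a cryptographic hash
or MAC at a higher layer is the safer choice when the data matters.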

That said it would probably make quite a lot of people unhappy
because it would make their NICs much slower and the occasional
bit error that is missed by the NIC is likely not a big issue for them.

What might be a good idea would be to have some optional knob
somewhere that allows disabling hardware checksumming for NICs
that are not trusted this way. But then someone would need to
do the necessary research for the hundreds of NICs Linux supports.

Just providing a general global "disable hardware checksumming"
knob for the paranoid would be much easier. I guess that would
be a good idea.

-Andi


Herbert Xu | 4 Apr 2007 07:00

Re: NIC data corruption

Andi Kleen <andi <at> firstfloor.org> wrote:
> 
> Just providing a general global "disable hardware checksumming"
> knob for the paranoid would be much easier. I guess that would
> be a good idea.

FWIW you can disable RX checksumming with

	ethtool -K <ifname> rx off
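
When the same toggle has to be applied repeatedly during testing, a small
Python wrapper can help. This is a sketch under assumptions: the helper
names are hypothetical, "eth0" is a placeholder interface name, the extra
"tx" setting is an addition that goes beyond the command above, and
running ethtool -K normally requires root.

```python
import subprocess


def checksum_offload_cmd(ifname, rx, tx):
    """Build the ethtool argument list for RX/TX checksum offload."""
    def state(enabled):
        return "on" if enabled else "off"

    return ["ethtool", "-K", ifname, "rx", state(rx), "tx", state(tx)]


def apply_cmd(cmd):
    """Run the command and raise CalledProcessError if ethtool fails."""
    subprocess.run(cmd, check=True)


# Example: turn both offloads off on a placeholder interface.
cmd = checksum_offload_cmd("eth0", rx=False, tx=False)
print(" ".join(cmd))
# apply_cmd(cmd)  # uncomment to actually change the NIC settings
```

Running "ethtool -k <ifname>" (lowercase k) afterwards shows the offload
settings currently in effect.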

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert <at> gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt