Alex Chekholko | 17 Feb 00:23 2010

sf bugid 2945058 - e1000e and 82574L problems

Hi all,

I have 17 new nodes, all identical.  Identical hw, identical sw, and I can reproduce this problem on any one of
the nodes (well, honestly, I only tried three, but I assume the other 14 will have the same issue).  But I
suppose that doesn't necessarily rule out a hw problem, they could all have the same hw problem.

The problem is that the NIC drops off the network under heavy load.  I am now able to reproduce this quickly and
easily by booting the node and running a disk benchmark over the network to my I/O server.

So I run 'bonnie++ -u nobody -d /gpfs/fs0/test/'

In another terminal, I can do 'watch --differences "cat /proc/interrupts"'

After a while, some interrupts appear on the interrupt 82 line:

Every 2.0s: cat /proc/interrupts                                                                   Tue Feb 16 18:13:45 2010

           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  0:     836608          0          0          0          0          0          0          0    IO-APIC-edge  timer
  1:          3          0          0          0          0          0          0          0    IO-APIC-edge  i8042
  8:          1          0          0          0          0          0          0          0    IO-APIC-edge  rtc
  9:          0          0          0          0          0          0          0          0   IO-APIC-level  acpi
 12:          4          0          0          0          0          0          0          0    IO-APIC-edge  i8042
 58:       8159       1010       2069          0          0          0          0          0         PCI-MSI  ahci
 74:        637          0          0          0          0          0          0    1532887       PCI-MSI-X  eth0-Q0
 82:       7585          0          0          0          0          0      83121          0       PCI-MSI-X  eth0
169:         60          0          0          0          0          0          0          0   IO-APIC-level  ehci_hcd:usb1
233:         63          0          0          0          0          0          0          0   IO-APIC-level  ehci_hcd:usb2
NMI:        854         87        531         96        100         85        166        107
LOC:     836190     836569     836462     836354     835971     836134     836028     835070
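
For anyone wanting to log the same counters rather than eyeball 'watch' output, here is a minimal sketch (not part of the original report) that takes two snapshots of /proc/interrupts and prints which lines moved. The function names parse_interrupts and delta are made up for illustration, and the parser assumes the layout shown above: a CPU header row followed by "label: counts... description" rows.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts text into {label: (per_cpu_counts, description)}."""
    lines = text.strip("\n").splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Some rows (e.g. ERR) carry fewer counters than there are CPUs.
        for field in fields[:ncpus]:
            if not field.isdigit():
                break
            counts.append(int(field))
        table[label.strip()] = (counts, " ".join(fields[len(counts):]))
    return table


def delta(before, after):
    """Total increase per interrupt line between two snapshots; idle lines omitted."""
    changes = {}
    for irq, (counts, desc) in after.items():
        old_total = sum(before.get(irq, ([], ""))[0])
        diff = sum(counts) - old_total
        if diff:
            changes[irq] = (diff, desc)
    return changes


def snapshot(path="/proc/interrupts"):
    with open(path) as handle:
        return parse_interrupts(handle.read())


if __name__ == "__main__":
    first = snapshot()
    time.sleep(2.0)  # same interval as the default for 'watch'
    for irq, (diff, desc) in sorted(delta(first, snapshot()).items()):
        print(f"{irq:>4}: +{diff:<10} {desc}")
```

Run it in a second terminal while the benchmark is going; a sudden jump on the plain "eth0" line (82 in the output above) is the thing to look for.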

Tantilov, Emil S | 17 Feb 00:40 2010

Re: sf bugid 2945058 - e1000e and 82574L problems

Alex Chekholko wrote:
> Hi all,
> 
> I have 17 new nodes, all identical.  Identical hw, identical sw, and
> I can reproduce this problem on any one of the nodes (well, honestly,
> I only tried three, but I assume the other 14 will have the same
> issue).  But I suppose that doesn't necessarily rule out a hw
> problem, they could all have the same hw problem.

What is the kernel version and the version of the driver you are using? 
Also information about the system (including lspci output) would be very helpful.

Thanks,
Emil

_______________________________________________
E1000-devel mailing list
E1000-devel <at> lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired

Alex Chekholko | 17 Feb 00:47 2010

Re: sf bugid 2945058 - e1000e and 82574L problems

On Tue, 16 Feb 2010 16:40:11 -0700
"Tantilov, Emil S" <emil.s.tantilov <at> intel.com> wrote:

> Alex Chekholko wrote:
> > Hi all,
> > 
> > I have 17 new nodes, all identical.  Identical hw, identical sw, and
> > I can reproduce this problem on any one of the nodes (well, honestly,
> > I only tried three, but I assume the other 14 will have the same
> > issue).  But I suppose that doesn't necessarily rule out a hw
> > problem, they could all have the same hw problem.
> 
> What is the kernel version and the version of the driver you are using? 
> Also information about the system (including lspci output) would be very helpful.
> 

Sorry, forgot to mention that I have that listed in the bug tracker.  But I will reproduce it here.
This link to the bug may work:
http://sourceforge.net/tracker/?func=detail&aid=2945058&group_id=42302&atid=447449

kernel (EL 5.4 stock latest): 2.6.18-164.11.1.el5

e1000e latest from sourceforge:
e1000e: Intel(R) PRO/1000 Network Driver - 1.1.2-NAPI
same issue with the driver included in EL5.4 by default, that one is 1.0.2-k2

'lspci -v' output:
04:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
        Subsystem: Super Micro Computer Inc Unknown device 0605
        Flags: bus master, fast devsel, latency 0, IRQ 169
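
A small sketch for gathering the details requested above (kernel version, driver version, lspci output) in one go. It is an illustration only, not something from the original thread: the interface name eth0 is taken from the /proc/interrupts output earlier, the helper names run and collect are made up, and it assumes uname, ethtool and lspci are installed. A missing tool is reported in the output instead of aborting the run.

```python
import subprocess

# Commands assumed to be useful for a bug report; adjust the interface name
# if the affected NIC is not eth0.
COMMANDS = {
    "kernel": ["uname", "-r"],
    "driver": ["ethtool", "-i", "eth0"],
    "pci": ["lspci", "-v"],
}


def run(cmd):
    """Return the command's output, or a note if it cannot be run."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<could not run {cmd[0]}: {exc}>"
    return result.stdout.strip() or result.stderr.strip()


def collect(commands=COMMANDS):
    """Run every command and return {section_name: output}."""
    return {name: run(cmd) for name, cmd in commands.items()}


if __name__ == "__main__":
    for name, output in collect().items():
        print(f"== {name} ==\n{output}\n")
```

Redirecting the output to a file gives something that can be attached to the tracker entry as-is.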

Tantilov, Emil S | 2 Mar 19:51 2010

Re: sf bugid 2945058 - e1000e and 82574L problems

We have one of the Supermicro models (X8DTL-6F) here in the lab and were
able to reproduce some of the issues that were reported on those platforms
with on-board 82574 parts, specifically Tx hangs that can lead to loss of
connectivity. The system became much more stable after we updated the IPMI FW.

Could you please update your IPMI FW and report back with results?

To get the updated FW, go to the Supermicro web page for your
specific system model and click on the "Update Your Bios" link. The current
IPMI version for those systems is 1.29.

Thanks,
Emil

-----Original Message-----
From: Alex Chekholko [mailto:chekh <at> pcbi.upenn.edu] 
Sent: Tuesday, February 16, 2010 3:48 PM
To: Tantilov, Emil S
Cc: e1000-devel <at> lists.sourceforge.net
Subject: Re: [E1000-devel] sf bugid 2945058 - e1000e and 82574L problems

On Tue, 16 Feb 2010 16:40:11 -0700
"Tantilov, Emil S" <emil.s.tantilov <at> intel.com> wrote:

> Alex Chekholko wrote:
> > Hi all,
> > 
> > I have 17 new nodes, all identical.  Identical hw, identical sw, and
> > I can reproduce this problem on any one of the nodes (well, honestly,
> > I only tried three, but I assume the other 14 will have the same

