17 Feb 2010 00:23
sf bugid 2945058 - e1000e and 82574L problems
Alex Chekholko <chekh <at> pcbi.upenn.edu>
2010-02-16 23:23:18 GMT
Hi all,
I have 17 new nodes, all identical: identical hw, identical sw. I can reproduce this problem on any one of
the nodes (well, honestly, I only tried three, but I assume the other 14 will have the same issue). I
suppose that doesn't necessarily rule out a hw problem, though; they could all have the same hw problem.
The problem is that the NIC drops off the network under heavy load. I am now able to reproduce this quickly and
easily by booting the node and running a disk benchmark over the network to my I/O server.
So I run 'bonnie++ -u nobody -d /gpfs/fs0/test/'
In another terminal, I can do 'watch --differences "cat /proc/interrupts"'
After a while, some interrupts appear on the interrupt 82 line:
Every 2.0s: cat /proc/interrupts                          Tue Feb 16 18:13:45 2010
         CPU0     CPU1     CPU2     CPU3     CPU4     CPU5     CPU6     CPU7
  0:   836608        0        0        0        0        0        0        0  IO-APIC-edge   timer
  1:        3        0        0        0        0        0        0        0  IO-APIC-edge   i8042
  8:        1        0        0        0        0        0        0        0  IO-APIC-edge   rtc
  9:        0        0        0        0        0        0        0        0  IO-APIC-level  acpi
 12:        4        0        0        0        0        0        0        0  IO-APIC-edge   i8042
 58:     8159     1010     2069        0        0        0        0        0  PCI-MSI        ahci
 74:      637        0        0        0        0        0        0  1532887  PCI-MSI-X      eth0-Q0
 82:     7585        0        0        0        0        0    83121        0  PCI-MSI-X      eth0
169:       60        0        0        0        0        0        0        0  IO-APIC-level  ehci_hcd:usb1
233:       63        0        0        0        0        0        0        0  IO-APIC-level  ehci_hcd:usb2
NMI:      854       87      531       96      100       85      166      107
LOC:   836190   836569   836462   836354   835971   836134   836028   835070
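
For anyone who wants to log this rather than eyeball the watch output, the comparison above can be sketched in Python. This is only a rough illustration, not part of the original report: the function names (parse_interrupts, irq_deltas, sample) are made up for this example, and it assumes the usual /proc/interrupts layout of a CPU header row followed by "label: counts... description" rows.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts text into {label: (per_cpu_counts, description)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        counts = []
        # Leading numeric fields (at most one per CPU) are the counters.
        for field in fields[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (counts, description)
    return table


def irq_deltas(before, after):
    """Return {label: increase} for rows whose total count grew."""
    deltas = {}
    for label, (counts, _) in after.items():
        old_counts = before.get(label, ([], ""))[0]
        diff = sum(counts) - sum(old_counts)
        if diff > 0:
            deltas[label] = diff
    return deltas


def sample(path="/proc/interrupts", interval=2.0):
    """Take two snapshots `interval` seconds apart and return the deltas."""
    with open(path) as f:
        first = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_interrupts(f.read())
    return irq_deltas(first, second)
```

Calling sample() in a loop and printing the entry for "82" would show when the plain eth0 vector starts firing, which is the symptom described above. The 2-second default only mirrors the watch interval; it is not a recommendation.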