Nicolas Jungers | 29 Dec 2011 18:08

Massive corruption - obvious error ?

Hello,

I have a test setup that I can't get right. When accessing the target, 
my filesystem is hopelessly corrupted. The setup follows.

The target uses a cheap NIC, that's why the payload isn't better, and 
both servers are connected back to back, but I had the same result with 
a cheap switch.

What did I miss?

Tia,
Nicolas

Target
======
root@oserver:~# uname -a
Linux oserver 3.0.0-14-server #23-Ubuntu SMP Mon Nov 21 20:49:05 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

root@oserver:~# aoe-version
               aoetools: 32
   installed aoe driver: 78
     running aoe driver: (none)

root@oserver:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid5 sdd2[3] sda2[0] sde2[5] sdc2[2] sdb2[1]
       7814043648 blocks super 1.2 level 5, 512k chunk, algorithm 2 

Nicolas Jungers | 29 Dec 2011 18:52

Re: Massive corruption - obvious error ?

On 2011-12-29 18:08, Nicolas Jungers wrote:
> Hello,
>
> I have a test setup that I can't get right. When accessing the target,
> my filesystem is hopelessly corrupted. The setup follows.
>
> The target uses a cheap NIC, that's why the payload isn't better, and
> both servers are connected back to back, but I had the same result with
> a cheap switch.
>
> What did I miss?
>

I forgot the log extract...

Dec 28 08:12:14 be0 kernel: [53427.511891] aoe: e1.1: setting 6656 byte data frames
Dec 28 08:12:14 be0 kernel: [53427.512108] aoe: 002522adf3d0 e1.1 v4014 has 15628087296 sectors
Dec 28 08:12:14 be0 kernel: [53427.512259]  etherd/e1.1: unknown partition table
Dec 28 08:13:03 be0 kernel: [53477.071975] aoe: warning: unknown aoe command type 0x20
Dec 28 08:13:05 be0 kernel: [53478.441995] aoe: warning: unknown aoe command type 0x20
Dec 28 08:13:05 be0 kernel: [53478.444485] aoe: warning: unknown aoe command type 0x20
Dec 28 08:13:05 be0 kernel: [53478.454679] aoe: warning: unknown aoe command type 0x20
Dec 28 08:13:05 be0 kernel: [53478.545570] aoe: ata error cmd=24h

Tracy Reed | 29 Dec 2011 20:12

Re: Massive corruption - obvious error ?

On Thu, Dec 29, 2011 at 06:08:57PM +0100, Nicolas Jungers spake thusly:
> The target uses a cheap NIC, that's why the payload isn't better, and 
> both servers are connected back to back, but I had the same result with 
> a cheap switch.

To isolate the issue, I would devise some sort of network test, such as
sending a few gigabytes from /dev/urandom with dd over netcat from one machine
to the other and running md5sum on the data at both ends to see whether there
is any corruption in the network, independent of AoE, the filesystem, etc.
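
A minimal sketch of such a test, written in Python rather than with dd and
netcat so that both ends fit in one file. The function names, the chunk size
and the loopback demo are illustrative assumptions, not something from this
thread. In a real test the receiving half would run on one machine and the
sending half on the other, and the two printed checksums would be compared.

```python
import hashlib
import os
import socket
import threading

CHUNK = 64 * 1024  # bytes per send/recv call (arbitrary choice)


def send_random(host, port, total_bytes):
    """Send total_bytes of random data to host:port; return the MD5 of what was sent."""
    digest = hashlib.md5()
    with socket.create_connection((host, port)) as conn:
        remaining = total_bytes
        while remaining > 0:
            block = os.urandom(min(CHUNK, remaining))
            digest.update(block)
            conn.sendall(block)
            remaining -= len(block)
    return digest.hexdigest()


def receive_and_hash(listener):
    """Accept one connection on listener; return the MD5 of everything received."""
    digest = hashlib.md5()
    conn, _addr = listener.accept()
    with conn:
        while True:
            block = conn.recv(CHUNK)
            if not block:  # sender closed the connection
                break
            digest.update(block)
    return digest.hexdigest()


def loopback_demo(total_bytes=1 << 20):
    """Run both ends on 127.0.0.1 to show the flow; returns (sent_md5, received_md5)."""
    result = {}
    # Port 0 asks the OS for any free port; the socket is already listening
    # when create_server returns, so the sender cannot connect too early.
    with socket.create_server(("127.0.0.1", 0)) as listener:
        port = listener.getsockname()[1]
        worker = threading.Thread(
            target=lambda: result.update(received=receive_and_hash(listener))
        )
        worker.start()
        sent = send_random("127.0.0.1", port, total_bytes)
        worker.join()
    return sent, result["received"]


if __name__ == "__main__":
    sent_md5, received_md5 = loopback_demo()
    print("sent    ", sent_md5)
    print("received", received_md5)
    print("match" if sent_md5 == received_md5 else "MISMATCH")
```

One caveat on the design: this runs over TCP, which has its own checksum and
retransmission, whereas AoE frames travel directly over Ethernet. A clean
result therefore makes a network fault less likely but does not fully rule it
out.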

For what it's worth, I have never had an AoE data corruption problem.

-- 
Tracy Reed
http://tracyreed.org
Digital signature attached for your safety.
------------------------------------------------------------------------------
Ridiculously easy VDI. With Citrix VDI-in-a-Box, you don't need a complex
infrastructure or vast IT resources to deliver seamless, secure access to
virtual desktops. With this all-in-one solution, easily deploy virtual 
desktops for less than the cost of PCs and save 60% on VDI infrastructure 
costs. Try it free! http://p.sf.net/sfu/Citrix-VDIinabox
_______________________________________________
Aoetools-discuss mailing list
Aoetools-discuss@...
https://lists.sourceforge.net/lists/listinfo/aoetools-discuss

Nicolas Jungers | 31 Dec 2011 18:01

Re: Massive corruption - obvious error ?

On 2011-12-29 20:12, Tracy Reed wrote:
> On Thu, Dec 29, 2011 at 06:08:57PM +0100, Nicolas Jungers spake thusly:
>> The target uses a cheap NIC, that's why the payload isn't better, and
>> both servers are connected back to back, but I had the same result with
>> a cheap switch.
>
> To isolate the issue, I would devise some sort of network test, such as
> sending a few gigabytes from /dev/urandom with dd over netcat from one machine
> to the other and running md5sum on the data at both ends to see whether there
> is any corruption in the network, independent of AoE, the filesystem, etc.
>
> For what it's worth, I have never had an AoE data corruption problem.
>

It is definitely a NIC problem.  I'm one server NIC too short for my 
test, so I started with a Realtek. It doesn't handle jumbo frames 
that well.

N.
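
For context on why jumbo frames are involved at all: the initiator log
earlier in the thread reports "setting 6656 byte data frames". A small sketch
of the arithmetic follows. The header sizes are assumptions taken from a
reading of the AoE protocol layout (a 10-byte AoE header after the Ethernet
header, then a 12-byte ATA header), and the function names are made up for
illustration.

```python
SECTOR_BYTES = 512
AOE_HEADER_BYTES = 10  # assumed: AoE common header following the Ethernet header
ATA_HEADER_BYTES = 12  # assumed: ATA command header in an AoE ATA frame


def sectors_per_frame(data_bytes):
    """Number of whole 512-byte sectors carried by one data frame."""
    if data_bytes % SECTOR_BYTES:
        raise ValueError("data frame size is not a multiple of the sector size")
    return data_bytes // SECTOR_BYTES


def min_mtu_for_payload(data_bytes):
    """Smallest interface MTU (Ethernet header excluded) that fits the frame."""
    return data_bytes + AOE_HEADER_BYTES + ATA_HEADER_BYTES


if __name__ == "__main__":
    print(sectors_per_frame(6656))    # 13
    print(min_mtu_for_payload(6656))  # 6678
```

Under those assumptions each data frame carries 13 sectors and needs an MTU of
at least 6678 bytes, well above the standard 1500, so every NIC on the path
has to cope with jumbo frames.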


Gmane