Anthony Tong | 6 Oct 2007 18:21

Re: 2.6.3, file hole issues?

On Sat, Oct 06, 2007 at 08:41:59AM -0500, Sam Lang wrote:
> Hi Anthony,
>
> Are you able to reproduce the problem with pvfs2-client-core?  Also, are 
> you running pvfs2-client-core-threaded directly or through pvfs2-client 
> --threaded?

Just tried pvfs2-client-core for several test runs.
Seems to work ok!

Previously, I had pvfs2-client start the threaded version by:

/usr/local/pvfs2/bin/pvfs2-client -p /usr/local/pvfs2/bin/pvfs2-client-core-threaded

> From the traces you've included below, it looks like you're 
> mounting/unmounting the filesystem over and over between each IO.  Any 
> reason to do that?

I definitely am not conscious of anything doing that.

To clarify what you said & what I see in the logs: before each problematic
IO, it looks like the client daemons are remounting the fs? Maybe
the threaded client core was getting restarted for some reason?

Let me know what debug settings/logs would be useful.

Anthony Tong | 6 Oct 2007 23:45

Re: 2.6.3, file hole issues?

On Sat, Oct 06, 2007 at 11:21:58AM -0500, Anthony Tong wrote:
> To clarify what you said & what I see in the logs: before each problematic
> IO, it looks like the client daemons are remounting the fs? Maybe
> the threaded client core was getting restarted for some reason?

After some fiddling, got cores turned on for the threaded client,
and they all tend to look like this.

#0  0x0075c7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) where
#0  0x0075c7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x007a17a5 in raise () from /lib/tls/libc.so.6
#2  0x007a3209 in abort () from /lib/tls/libc.so.6
#3  0x0079ad91 in __assert_fail () from /lib/tls/libc.so.6
#4  0x08076773 in io_datafile_complete_operations ()
#5  0x0806dcef in PINT_client_state_machine_testsome ()
#6  0x08056537 in process_vfs_requests ()
#7  0x08058a2a in main ()

This is the actual assert

pvfs2-client-core: src/client/sysint/sys-io.sm:1860: io_post_write_ack_recv: Assertion `ret == 0' failed.

Also, I'm not quite sure why I'm not seeing the error messages (that
I'd expect) in the client log.

Sam Lang | 8 Oct 2007 02:06

Re: 2.6.3, file hole issues?


Hi Anthony,

It sounds like you've discovered a bug in the threaded version of the  
client.  We'll take a look at that and figure it out, but if  
possible, just use the non-threaded version for now.

Thanks,
-sam

On Oct 6, 2007, at 4:45 PM, Anthony Tong wrote:

> On Sat, Oct 06, 2007 at 11:21:58AM -0500, Anthony Tong wrote:
>> To clarify what you said & what I see in the logs: before each  
>> problematic
>> IO, it looks like the client daemons are remounting the fs? Maybe
>> the threaded client core was getting restarted for some reason?
>
> After some fiddling, got cores turned on for the threaded client,
> and they all tend to look like this.
>
> #0  0x0075c7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> (gdb) where
> #0  0x0075c7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1  0x007a17a5 in raise () from /lib/tls/libc.so.6
> #2  0x007a3209 in abort () from /lib/tls/libc.so.6
> #3  0x0079ad91 in __assert_fail () from /lib/tls/libc.so.6
> #4  0x08076773 in io_datafile_complete_operations ()
> #5  0x0806dcef in PINT_client_state_machine_testsome ()
> #6  0x08056537 in process_vfs_requests ()

Anthony Tong | 8 Oct 2007 17:18

Re: 2.6.3, file hole issues?

Hi Sam,

One of my original concerns is that an error/unknown condition was
encountered, but 1) the calling application never knew about it, and 2)
there was no indication of any trouble in any logs, which resulted in
silent data corruption.

-at

On Sun, Oct 07, 2007 at 07:06:02PM -0500, Sam Lang wrote:
>
> Hi Anthony,
>
> It sounds like you've discovered a bug in the threaded version of the 
> client.  We'll take a look at that and figure it out, but if possible, just 
> use the non-threaded version for now.
>
> Thanks,
> -sam
>
> On Oct 6, 2007, at 4:45 PM, Anthony Tong wrote:
>
>> On Sat, Oct 06, 2007 at 11:21:58AM -0500, Anthony Tong wrote:
>>> To clarify what you said & what I see in the logs: before each 
>>> problematic
>>> IO, it looks like the client daemons are remounting the fs? Maybe
>>> the threaded client core was getting restarted for some reason?
>>
>> After some fiddling, got cores turned on for the threaded client,
>> and they all tend to look like this.

Murali Vilayannur | 7 Oct 2007 10:48

Re: 2.6.3, file hole issues?

Hi Anthony,
>
> #0  0x0075c7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> (gdb) where
> #0  0x0075c7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1  0x007a17a5 in raise () from /lib/tls/libc.so.6
> #2  0x007a3209 in abort () from /lib/tls/libc.so.6
> #3  0x0079ad91 in __assert_fail () from /lib/tls/libc.so.6
> #4  0x08076773 in io_datafile_complete_operations ()
> #5  0x0806dcef in PINT_client_state_machine_testsome ()
> #6  0x08056537 in process_vfs_requests ()
> #7  0x08058a2a in main ()
>
> This is the actual assert
>
> pvfs2-client-core: src/client/sysint/sys-io.sm:1860: io_post_write_ack_recv: Assertion `ret == 0' failed.

Interesting.. It looks like job_bmi_recv() was returning 1 indicating
immediate completion for the final write ack's receive which seems
impossible..
Sam, Phil: That shouldn't happen, right?
I don't see this on my setup even with -threaded client-cores...

Sam: What is the reason for building the -threaded version of pvfs2-client-core?
I forget now. Does BMI on the client drive progress using threads?
The main thread/event loop is certainly single-threaded as far as I can tell.
thanks,
Murali

Sam Lang | 8 Oct 2007 02:10

Re: 2.6.3, file hole issues?


On Oct 7, 2007, at 3:48 AM, Murali Vilayannur wrote:

> Hi Anthony,
>>
>> #0  0x0075c7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
>> (gdb) where
>> #0  0x0075c7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
>> #1  0x007a17a5 in raise () from /lib/tls/libc.so.6
>> #2  0x007a3209 in abort () from /lib/tls/libc.so.6
>> #3  0x0079ad91 in __assert_fail () from /lib/tls/libc.so.6
>> #4  0x08076773 in io_datafile_complete_operations ()
>> #5  0x0806dcef in PINT_client_state_machine_testsome ()
>> #6  0x08056537 in process_vfs_requests ()
>> #7  0x08058a2a in main ()
>>
>> This is the actual assert
>>
>> pvfs2-client-core: src/client/sysint/sys-io.sm:1860:  
>> io_post_write_ack_recv: Assertion `ret == 0' failed.
>
> Interesting.. It looks like job_bmi_recv() was returning 1 indicating
> immediate completion for the final write ack's receive which seems
> impossible..
> Sam, Phil: That shouldn't happen, right?

Right.  It should be returning an error if it fails -- never 1.

Anthony, which BMI method (tcp, ib, gm) are you using with your setup?


Anthony Tong | 8 Oct 2007 16:57

Re: 2.6.3, file hole issues?

On Sun, Oct 07, 2007 at 07:10:20PM -0500, Sam Lang wrote:
>>> pvfs2-client-core: src/client/sysint/sys-io.sm:1860: 
>>> io_post_write_ack_recv: Assertion `ret == 0' failed.
>>
>> Interesting.. It looks like job_bmi_recv() was returning 1 indicating
>> immediate completion for the final write ack's receive which seems
>> impossible..
>> Sam, Phil: That shouldn't happen, right?
>
> Right.  It should be returning an error if it fails -- never 1.
>
> Anthony, which BMI method (tcp, ib, gm) are you using with your setup?

I'm using tcp BMI.

It is returning an immediate successful receive of 0x18 bytes.
I don't know enough about pvfs internals to be sure this is the correct
fix, but for this state it seems to work if I set write_ack_has_been_posted
and return success. I get successful writes with correct data, no
io hangs, no crashes.
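
A minimal sketch of that idea in C, assuming the three-way return
convention described in this thread (negative for an error, 0 for a
receive that was posted and will complete later, 1 for a receive that
completed immediately). The struct, its fields, and handle_ack_post()
are hypothetical names for illustration only; this is not the actual
sys-io.sm code, and the return value is passed in as a plain int rather
than obtained from a real job_bmi_recv() call:

```c
#include <stdio.h>

/* Hypothetical return convention, modeled on the thread's description:
 *   ret <  0  : the post failed
 *   ret == 0  : the receive was posted and will complete later
 *   ret == 1  : the receive completed immediately
 */
enum post_result { POST_PENDING = 0, POST_IMMEDIATE = 1 };

/* Hypothetical per-operation state; field names are illustrative. */
struct write_ack_state {
    int ack_posted;     /* the ack receive has been posted (or satisfied) */
    int ack_completed;  /* the ack is already in hand                     */
    int error_code;     /* last error seen, 0 if none                     */
};

/* Handle the result of posting the write-ack receive without asserting.
 * Returns 0 on success, or the negative error code on failure. */
static int handle_ack_post(struct write_ack_state *st, int ret)
{
    if (ret < 0) {
        /* Report the failure to the caller instead of aborting. */
        st->error_code = ret;
        fprintf(stderr, "write ack post failed: %d\n", ret);
        return ret;
    }

    /* Both 0 and 1 mean the receive no longer needs to be posted. */
    st->ack_posted = 1;

    if (ret == POST_IMMEDIATE) {
        /* Immediate completion: the ack has already arrived. */
        st->ack_completed = 1;
    }
    return 0;
}
```

Compared with asserting `ret == 0', the error branch hands the failure
back to the caller and logs it rather than killing the client, and the
immediate-completion branch is treated as a normal outcome.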

>> I don't see this on my setup even with -threaded client-cores...
>>
>> Sam: What is the reason for building the -threaded version of 
>> pvfs2-client-core?
>
> Multiple threads allows for a device thread to handle unexpected operations 
> from the kernel module.  We were seeing minor performance improvements with 
> some tests we did, especially for smaller IOs.
>

