Peng Tao | 25 Jul 2012 09:31

pnfs LD partial sector write

Hi Boaz,

Sorry about the long delay; I was interrupted by some internal work.
Now I'm looking at the partial LD write problem again. Instead of
blindly bailing out on unaligned writes, this time I want to fix the
write code to handle partial writes, as you suggested before. However,
it seems to be more problematic than I used to think.

The dirty range of a page passed to LD->write_pagelist may not be
aligned to the sector size, in which case the block layer cannot handle
it correctly. Even worse, I cannot do a read-modify-write cycle within
the same page, because the bio would read in the entire sector and thus
overwrite the user data within that sector. Currently I'm thinking of
creating shadow pages for partial sector writes, and using them to read
in the sector and copy the necessary data into the user pages. But that
is way too tricky and I don't like it at all. So I want to ask how you
solve the partial sector write problem in the object layout driver.

I looked at the ore code and found that you are using bios to deal with
partial page read/write as well. But in places like _add_to_r4w(), I
don't see how partial sectors are handled. Maybe I am misreading the
code. Would you please shed some light? More specifically, how does the
object layout driver handle partial sector writes like those in the
simple testcase below? Thanks in advance.
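
To make the problem concrete, here is a minimal user-space sketch of the
alignment arithmetic involved. It is not taken from the block layout
driver: the 512-byte sector size, the struct rmw_plan type and the
plan_partial_write() helper are all assumptions made for illustration.
Given a dirty byte range, it reports which bytes in front of and behind
the range would have to be read from disk before whole sectors could be
written out.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u  /* assumed sector size */

/* Byte ranges that must be read before a dirty range can be written
 * out as whole sectors. A length of 0 means no read is needed. */
struct rmw_plan {
    uint64_t head_off;  /* start of the first sector touched */
    uint32_t head_len;  /* bytes in that sector before the dirty range */
    uint64_t tail_off;  /* first byte after the dirty range */
    uint32_t tail_len;  /* bytes after the dirty range in the last sector */
};

/* Hypothetical helper: compute the head and tail fill-in ranges for a
 * dirty range of len bytes starting at byte offset off. */
static struct rmw_plan plan_partial_write(uint64_t off, uint32_t len)
{
    struct rmw_plan p;
    uint64_t end = off + len;

    assert(len > 0);
    p.head_off = off & ~(uint64_t)(SECTOR_SIZE - 1);
    p.head_len = (uint32_t)(off - p.head_off);
    p.tail_off = end;
    p.tail_len = (uint32_t)((SECTOR_SIZE - (end % SECTOR_SIZE)) % SECTOR_SIZE);
    return p;
}
```

For a dirty range of 50 bytes at offset 100, the plan asks for bytes
0..99 and 150..511 of sector 0; for a fully aligned range both lengths
are zero and no read is needed.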

-- 
Best,
Tao

flock-partial-write.c:
(Continue reading)

Boaz Harrosh | 25 Jul 2012 12:28

Re: pnfs LD partial sector write

On 07/25/2012 10:31 AM, Peng Tao wrote:

> Hi Boaz,
> 
> Sorry about the long delay. I had some internal interrupt. Now I'm
> looking at the partial LD write problem again. Instead of trying to
> bail out unaligned writes blindly, this time I want to fix the write
> code to handle partial write as you suggested before. However, it
> seems to be more problematic than I used to think.
> 
> The dirty range of a page passed to LD->write_pagelist may be
> unaligned to sector size, in which case block layer cannot handle it
> correctly. Even worse, I cannot do a read-modify-write cycle within
> the same page because bio would read in the entire sector and thus
> ruin user data within the same sector. Currently I'm thinking of
> creating shadow pages for partial sector write and use them to read in
> the sector and copy necessary data into user pages. But it is way too
> tricky and I don't feel like it at all. So I want to ask how you solve
> the partial sector write problem in object layout driver.
> 
> I looked at the ore code and found that you are using bio to deal with
> partial page read/write as well. But in places like _add_to_r4w(), I
> don't see how partial sectors are handled. Maybe I was misreading the
> code. Would you please shed some light? More specifically, how does
> object layout driver handle partial sector writers like in bellow
> simple testcase? Thanks in advance.
> 

The objlayout does not have this problem. OSD-SCSI is a byte aligned
protocol, unlike DISK-SCSI.

Boaz Harrosh | 25 Jul 2012 12:45

Re: pnfs LD partial sector write


Peng Tao | 25 Jul 2012 16:43

Re: pnfs LD partial sector write


Boaz Harrosh | 25 Jul 2012 22:29

Re: pnfs LD partial sector write


Peng Tao | 26 Jul 2012 04:43

Re: pnfs LD partial sector write


Boaz Harrosh | 26 Jul 2012 09:29

Re: pnfs LD partial sector write


Peng Tao | 26 Jul 2012 10:25

Re: pnfs LD partial sector write


Boaz Harrosh | 26 Jul 2012 14:16

Re: pnfs LD partial sector write

On 07/26/2012 11:25 AM, Peng Tao wrote:

> For these two sectors, I need to allocate two pages... Just look at
> struct bio_vec.
> 

NO! I know all about bio_vecs.

You need 1024 bytes and two one-entry BIOs, which is a few bytes. Where
did you get the "two pages" from?

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@...
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Peng Tao | 26 Jul 2012 15:57

Re: pnfs LD partial sector write

On Thu, Jul 26, 2012 at 8:16 PM, Boaz Harrosh <bharrosh@...> wrote:
> On 07/26/2012 11:25 AM, Peng Tao wrote:
>
>> For these two sectors, I need to allocate two pages... Just look at
>> struct bio_vec.
>>
>
>
> NO! I know all about bio_vecs
>
> You need 1024 bytes, and 2 x one entry BIOs which is a few bytes, where
> did you get the "two pages" from?
>
What do you put in bio_vec->bv_page? Even if you use just 512 bytes
of a page, it is still an allocated page.

-- 
Thanks,
Tao

Boaz Harrosh | 26 Jul 2012 16:30

Re: pnfs LD partial sector write

On 07/26/2012 04:57 PM, Peng Tao wrote:

> On Thu, Jul 26, 2012 at 8:16 PM, Boaz Harrosh <bharrosh@...> wrote:
>> On 07/26/2012 11:25 AM, Peng Tao wrote:
>>
>>> For these two sectors, I need to allocate two pages... Just look at
>>> struct bio_vec.
>>>
>>
>>
>> NO! I know all about bio_vecs
>>
>> You need 1024 bytes, and 2 x one entry BIOs which is a few bytes, where
>> did you get the "two pages" from?
>>
> What do you put int bio_vec->bv_page? Even if you just use 512 bytes
> of a page, it is still allocated page.
> 

No!!

You just use bio_map_kern(), or in one go blk_rq_map_kern(), with any
kmalloc, stack, or kernel pointer. And that's that. It will take what it
will take.

Two such BIOs can use different regions of the same page, or a small
region sharing a page with other kmalloc allocations.

I don't see where you got your idea from.

And for the bio itself you use bio_kmalloc(GFP_KERNEL/GFP_NOIO, numentries);
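
The point about small buffers can be illustrated with a user-space
sketch of a read-modify-write cycle. Nothing here uses the kernel block
layer: struct fake_disk, disk_read_sector(), disk_write_sector() and
write_bytes() are hypothetical stand-ins, and the 512-byte sector size is
an assumption. What it shows is that a partial sector only needs a
sector-sized heap buffer as a bounce area, not a whole page, and that the
caller's own buffer is never overwritten by the read.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SECTOR_SIZE 512u  /* assumed sector size */

/* Hypothetical stand-in for a block device: whole-sector I/O only. */
struct fake_disk {
    uint8_t *data;
    uint64_t nr_sectors;
};

static void disk_read_sector(const struct fake_disk *d, uint64_t s, uint8_t *buf)
{
    assert(s < d->nr_sectors);
    memcpy(buf, d->data + s * SECTOR_SIZE, SECTOR_SIZE);
}

static void disk_write_sector(struct fake_disk *d, uint64_t s, const uint8_t *buf)
{
    assert(s < d->nr_sectors);
    memcpy(d->data + s * SECTOR_SIZE, buf, SECTOR_SIZE);
}

/* Write len bytes from src at byte offset off. Full sectors are written
 * straight from src; partial sectors are read into a sector-sized bounce
 * buffer, patched, and written back. Returns 0, or -1 if the bounce
 * buffer cannot be allocated. */
static int write_bytes(struct fake_disk *d, uint64_t off,
                       const uint8_t *src, size_t len)
{
    uint8_t *bounce = malloc(SECTOR_SIZE);

    if (!bounce)
        return -1;
    while (len > 0) {
        uint64_t sector = off / SECTOR_SIZE;
        size_t in_off = (size_t)(off % SECTOR_SIZE);
        size_t chunk = SECTOR_SIZE - in_off;

        if (chunk > len)
            chunk = len;
        if (chunk == SECTOR_SIZE) {
            disk_write_sector(d, sector, src);
        } else {
            disk_read_sector(d, sector, bounce);
            memcpy(bounce + in_off, src, chunk);
            disk_write_sector(d, sector, bounce);
        }
        off += chunk;
        src += chunk;
        len -= chunk;
    }
    free(bounce);
    return 0;
}
```

This sketch reuses one bounce buffer for the head and the tail sector in
turn; keeping both in flight at once would take two such buffers, which
is the 1024 bytes mentioned earlier in the thread.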

Peng Tao | 26 Jul 2012 17:30

Re: pnfs LD partial sector write

On Thu, Jul 26, 2012 at 10:30 PM, Boaz Harrosh <bharrosh@...> wrote:
> On 07/26/2012 04:57 PM, Peng Tao wrote:
>
>> On Thu, Jul 26, 2012 at 8:16 PM, Boaz Harrosh <bharrosh@...> wrote:
>>> On 07/26/2012 11:25 AM, Peng Tao wrote:
>>>
>>>> For these two sectors, I need to allocate two pages... Just look at
>>>> struct bio_vec.
>>>>
>>>
>>>
>>> NO! I know all about bio_vecs
>>>
>>> You need 1024 bytes, and 2 x one entry BIOs which is a few bytes, where
>>> did you get the "two pages" from?
>>>
>> What do you put int bio_vec->bv_page? Even if you just use 512 bytes
>> of a page, it is still allocated page.
>>
>
>
> No!!
>
> You just use bio_map_kern or in one go blk_rq_map_kern() with any: kmalloc,
> stack, or kernel pointer. And that's that. It will take what it will take.
>
First, I should admit I didn't know about bio_map_kern() and the like.
Thanks for teaching me about them (see, here is your credit :)

Looking at them, I don't think it is proper to use them in block

Boaz Harrosh | 26 Jul 2012 17:44

Re: pnfs LD partial sector write


Boaz Harrosh | 26 Jul 2012 09:47

Re: pnfs LD partial sector write

On 07/26/2012 05:43 AM, Peng Tao wrote:

> Another thing is, this further complicates direct writes, where I
> cannot use pagecache to ensure proper locking for concurrent writers
> in the same BLOCK, and sector-aligned partial BLOCK DIO writes need to
> be serialized internally. IOW, the same code cannot be reused by DIO
> writes. sigh...
> 

One last thing. Applications that use direct IO know to allocate
and issue sector-aligned requests, in both offset and length.
That's a kernel requirement. It is not one for NFS, but even so.

Just refuse sector-unaligned DIO and fall back to the MDS.

With sector-aligned IO you DIO directly to the DIO pages,
problem solved.

If you need the COW of partial blocks, you still use
page-cache pages, which is fine because they do not
intersect any of the DIO.

Cheers
Boaz
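
A minimal sketch of the alignment check suggested above, in plain
user-space C. The names dio_is_sector_aligned(), choose_dio_path() and
the enum values are hypothetical and do not come from the pNFS code; the
sketch only assumes that the sector size is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when both the offset and the length are multiples of
 * sector_size. sector_size must be a non-zero power of two. */
static bool dio_is_sector_aligned(uint64_t off, uint64_t len,
                                  uint32_t sector_size)
{
    assert(sector_size != 0 && (sector_size & (sector_size - 1)) == 0);
    return ((off | len) & (uint64_t)(sector_size - 1)) == 0;
}

/* Hypothetical routing decision: aligned direct I/O goes through the
 * layout driver, anything else falls back to the metadata server. */
enum io_path { IO_PATH_LAYOUT, IO_PATH_MDS };

static enum io_path choose_dio_path(uint64_t off, uint64_t len,
                                    uint32_t sector_size)
{
    return dio_is_sector_aligned(off, len, sector_size)
        ? IO_PATH_LAYOUT : IO_PATH_MDS;
}
```

With 512-byte sectors, a request at offset 4096 of length 8192 takes the
layout path, while a request with offset 100 or length 100 is sent to the
MDS.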

Peng Tao | 26 Jul 2012 11:12

Re: pnfs LD partial sector write

On Thu, Jul 26, 2012 at 3:47 PM, Boaz Harrosh <bharrosh@...> wrote:
> On 07/26/2012 05:43 AM, Peng Tao wrote:
>
>> Another thing is, this further complicates direct writes, where I
>> cannot use pagecache to ensure proper locking for concurrent writers
>> in the same BLOCK, and sector-aligned partial BLOCK DIO writes need to
>> be serialized internally. IOW, the same code cannot be reused by DIO
>> writes. sigh...
>>
>
>
> One last thing. Applications who use direct IO know to allocate
> and issue sector aligned requests both at offset and length.
> That's a Kernel requirement. It is not for NFS, but even so.
>
> Just refuse sector unaligned DIO and revert to MDS.
>
> With sector aligned IO you directly DIO to DIO pages,
> problem solved.
>
> If you need the COW of partial blocks, you still use
> page-cache pages, which is fine because they do not
> intersect any of the DIO.
>
I certainly thought about it, but it doesn't work for the AIO DIO case.
Assuming the BLOCK size is 8K, process A writes bytes 0~4095 of file foo
with AIO DIO while, at the same time, process B writes bytes 4096~8191
with AIO DIO. If the kernel ever tries to rely on the page cache to
cope with the invalid extent, it ends up with data corruption.
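
The scenario above can be checked with a little arithmetic. The sketch
below is illustrative only; block_index() and ranges_share_block() are
hypothetical helpers. It shows that two byte ranges which do not overlap
can still land in the same BLOCK, which is why both writers would want to
initialize the same invalid extent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Index of the BLOCK that contains byte_off. */
static uint64_t block_index(uint64_t byte_off, uint32_t block_size)
{
    assert(block_size != 0);
    return byte_off / block_size;
}

/* True when two non-empty byte ranges touch at least one common BLOCK,
 * even if the ranges themselves do not overlap. */
static bool ranges_share_block(uint64_t off_a, uint64_t len_a,
                               uint64_t off_b, uint64_t len_b,
                               uint32_t block_size)
{
    uint64_t a_first, a_last, b_first, b_last;

    assert(len_a > 0 && len_b > 0);
    a_first = block_index(off_a, block_size);
    a_last = block_index(off_a + len_a - 1, block_size);
    b_first = block_index(off_b, block_size);
    b_last = block_index(off_b + len_b - 1, block_size);
    return a_first <= b_last && b_first <= a_last;
}
```

With an 8K BLOCK size, the ranges 0~4095 and 4096~8191 share block 0;
with a 4K BLOCK size they would not share a block.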


Boaz Harrosh | 26 Jul 2012 16:12

Re: pnfs LD partial sector write


Peng Tao | 26 Jul 2012 17:07

Re: pnfs LD partial sector write

On Thu, Jul 26, 2012 at 10:12 PM, Boaz Harrosh <bharrosh@...> wrote:
> There is an easy locking solution for DIO which will not cost much
> for DIO and will cost nothing for buffered IO. You use the page-cache
> page lock.
>
> What you do is grab the lock of page zero of each block before/during
> writing to any block. So for your example above, they will all be
> serialized by the page-zero lock.
Yeah, I agree this can work. But I'd prefer not to mix DIO with buffered
IO, which is often error-prone. If I need to serialize AIO DIO in any
case, I'd prefer to do it in easier ways, like locking invalid
extents etc., without messing with the page cache.
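
For reference, the page-zero lock quoted above needs one key per BLOCK.
A minimal sketch of how that key could be derived follows; the 4096-byte
page size and the helper name block_lock_page_index() are assumptions for
illustration, not code from either layout driver.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u  /* assumed page size */

/* Page-cache index of the first page ("page zero") of the BLOCK that
 * contains byte_off. Every writer touching that BLOCK would take the
 * lock of this one page. block_size must be a multiple of the page size. */
static uint64_t block_lock_page_index(uint64_t byte_off, uint32_t block_size)
{
    uint64_t pages_per_block;

    assert(block_size >= PAGE_SIZE_ASSUMED &&
           block_size % PAGE_SIZE_ASSUMED == 0);
    pages_per_block = block_size / PAGE_SIZE_ASSUMED;
    return (byte_off / block_size) * pages_per_block;
}
```

With an 8K BLOCK size, offsets 0 and 4096 both map to page index 0, so
the two AIO DIO writers from the earlier example would contend on the
same page lock.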

>
> Of course you need, like before, to flush the page-cache pages before DIO
> and invalidate all pages (NotUpToDate). You keep at least one page in the
> page-cache per block, but during DIO it will always be in the
> Not-Up-To-Date empty state.
>
> Then, if needed, as in the example above, the first-time COW is still done
> through the page-cache.
>
> *
> * That said I think your solution for only allowing BLOCK aligned DIO is good
> * Applications should learn. They should however find out what BLOCK size is.
> *
>
> You could keep the proper info at the DM device you create for each device_id
> See here: http://people.redhat.com/msnitzer/docs/io-limits.txt
> The "logical_block_size" should be the proper BLOCK size above.
>

Boaz Harrosh | 26 Jul 2012 18:00

Re: pnfs LD partial sector write

On 07/26/2012 06:07 PM, Peng Tao wrote:

> On Thu, Jul 26, 2012 at 10:12 PM, Boaz Harrosh <bharrosh@...> wrote:
>> There is an easy locking solution for DIO which will not cost much
>> for DIO and will cost nothing for buffered IO. You use the page-cache
>> page lock.
>>
>> What you do is grab the lock of page zero of each block before/during
>> writing to any block. So for your example above, they will all be
>> serialized by the page-zero lock.
> Yeah, I agree this can work. But I'd prefer not to mix DIO with buffer
> IO, which is often error prone. If in any case I need to serialize
> AIODIO, I'd prefer to do it in easier ways like locking invalid
> extents etc, without messing with page cache.
> 

Yeah, just keep it BLOCK aligned and that's it. Apps will learn fast enough.
Simple is always better.

Currently I support any alignment, but I might do the same in objlayout
for the raid5/6 case and for DIO.

<>

> Or maybe somehow through statfs(2), since the blocksize attribute is
> actually a file system's attribute instead of block device's.
> 

Good point!!
	statfs->f_bsize
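
A minimal sketch of how an application could query that field on Linux
with statfs(2). The helper name fs_block_size() is hypothetical, and
whether a pNFS block layout client would report its BLOCK size in
f_bsize is an assumption, not something the thread settles.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/vfs.h>  /* Linux statfs(2) */

/* Returns the f_bsize value reported for the file system that contains
 * path, or -1 if statfs(2) fails. */
static long fs_block_size(const char *path)
{
    struct statfs sfs;

    assert(path != NULL);
    if (statfs(path, &sfs) != 0)
        return -1;
    return (long)sfs.f_bsize;
}
```

An application doing direct IO could call fs_block_size() on its target
directory once at startup and size its requests as multiples of the
result.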