Linus Torvalds | 11 Mar 2009 18:33
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Wed, 11 Mar 2009, Ingo Molnar wrote:
> 
> FYI, in case you missed it. Large MM fix - and it's awfully late 
> in -rc7.

Yeah, I'm not taking this at this point. No way, no-how.

If there is no simpler and obvious fix, it needs to go through -stable, 
after having cooked in 2.6.30-rc for a while. Especially as this is a 
totally uninteresting usage case that I can't see as being at all relevant 
to any real world.

Anybody who mixes O_DIRECT and fork() (and threads) is already doing some 
seriously strange things. Nothing new there.

And quite frankly, the patch is so ugly as-is that I'm not likely to take 
it even into the 2.6.30 merge window unless it can be cleaned up. That 
whole fork_pre_cow function is too f*cking ugly to live. We just don't 
write code like this in the kernel.

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

(Continue reading)

Andrea Arcangeli | 11 Mar 2009 19:22
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Wed, Mar 11, 2009 at 10:33:00AM -0700, Linus Torvalds wrote:
> 
> On Wed, 11 Mar 2009, Ingo Molnar wrote:
> > 
> > FYI, in case you missed it. Large MM fix - and it's awfully late 
> > in -rc7.

I didn't specify it, but I didn't mean to submit it for immediate
inclusion. I posted it because it's ready and I wanted feedback from
Hugh/Nick/linux-mm so we can get this fixed when next merge window
open.

> Yeah, I'm not taking this at this point. No way, no-how.
> 
> If there is no simpler and obvious fix, it needs to go through -stable, 
> after having cooked in 2.6.30-rc for a while. Especially as this is a 
> totally uninteresting usage case that I can't see as being at all relevant 
> to any real world.

Actually AFIK there are mission critical real world applications that
used 512byte blocksize that were affected by this (I CC'ed relevant
people who knows). However this is rare thing so it almost never
triggers because the window is so small.

> Anybody who mixes O_DIRECT and fork() (and threads) is already doing some 
> seriously strange things. Nothing new there.

Most apps aren't affected of course. But almost all apps eventually
call fork (system/fork/exec/anything). Calling fork currently is
enough to generate memory corruption in the parent (i.e. lost O_DIRECT
(Continue reading)

Ingo Molnar | 11 Mar 2009 20:06
Picon
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


* Andrea Arcangeli <aarcange <at> redhat.com> wrote:

> On Wed, Mar 11, 2009 at 10:33:00AM -0700, Linus Torvalds wrote:
> > 
> > On Wed, 11 Mar 2009, Ingo Molnar wrote:
> > > 
> > > FYI, in case you missed it. Large MM fix - and it's awfully late 
> > > in -rc7.
> 
> I didn't specify it, but I didn't mean to submit it for 
> immediate inclusion. I posted it because it's ready and I 
> wanted feedback from Hugh/Nick/linux-mm so we can get this 
> fixed when next merge window open.

Good - i saw the '(fast-)gup fix' qualifier and fast-gup is a 
fresh feature. If the problem existed in earlier kernels too 
then i guess it isnt urgent.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Andrea Arcangeli | 11 Mar 2009 20:15
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Wed, Mar 11, 2009 at 08:06:55PM +0100, Ingo Molnar wrote:
> Good - i saw the '(fast-)gup fix' qualifier and fast-gup is a 
> fresh feature. If the problem existed in earlier kernels too 
> then i guess it isnt urgent.

It always existed yes. The reason of the (fast-) qualifier is because
gup-fast made it harder to fix this in mainline (there is also a patch
floating around for 2.6.18 based kernels that is simpler thanks to
gup-fast not being there). The trouble of gup-fast is that doing the
check of page_count inside PT lock (or mmap_sem write mode like in
fork(), but ksm only takes mmap_sem in read mode and it relied on PT
lock only) wasn't enough anymore to be sure the page_count wouldn't
increase from under us just after we read it, because a gup-fast could
be running in another CPU without mmap_sem and without PT lock
taken. So fixing this on mainline has been a bit harder as I had to
prevent gup-fast to go ahead in the fast path, in a way that didn't
send IPIs to flush the smp-tlb before reading the page_count (so to
avoid sending IPIs for every anon page mapped writeable).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Ingo Molnar | 11 Mar 2009 18:41
Picon
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


* Linus Torvalds <torvalds <at> linux-foundation.org> wrote:

> On Wed, 11 Mar 2009, Ingo Molnar wrote:
> > 
> > FYI, in case you missed it. Large MM fix - and it's awfully 
> > late in -rc7.
> 
> Yeah, I'm not taking this at this point. No way, no-how.
> 
> If there is no simpler and obvious fix, it needs to go through 
> -stable, after having cooked in 2.6.30-rc for a while. 
> Especially as this is a totally uninteresting usage case that 
> I can't see as being at all relevant to any real world.
> 
> Anybody who mixes O_DIRECT and fork() (and threads) is already 
> doing some seriously strange things. Nothing new there.

Hm, is there any security impact? Andrea is talking about data 
corruption. I'm wondering whether that's just corruption 
relative to whatever twisted semantics O_DIRECT has in this case 
[which would be harmless], or some true pagecache corruption 
going across COW (or other) protection domains that could be 
exploited [which would not be harmless].

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
(Continue reading)

Andrea Arcangeli | 11 Mar 2009 19:53
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

Hello,

On Wed, Mar 11, 2009 at 06:41:03PM +0100, Ingo Molnar wrote:
> Hm, is there any security impact? Andrea is talking about data 
> corruption. I'm wondering whether that's just corruption 
> relative to whatever twisted semantics O_DIRECT has in this case 
> [which would be harmless], or some true pagecache corruption 

I don't think it's exploitable and I don't see this much as a security
issue. This can only corrupt user data inside anonymous pages (not
filesystem metadata or kernel pagecache). Side effects will be the
usual ones of random user memory corruption or as worse it can lead to
I/O corruption on disk, but only in user data.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Linus Torvalds | 11 Mar 2009 18:58
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Wed, 11 Mar 2009, Ingo Molnar wrote:
> 
> Hm, is there any security impact? Andrea is talking about data 
> corruption. I'm wondering whether that's just corruption 
> relative to whatever twisted semantics O_DIRECT has in this case 
> [which would be harmless], or some true pagecache corruption 
> going across COW (or other) protection domains that could be 
> exploited [which would not be harmless].

As far as I can tell, it's the same old problem that we've always had: if 
you fork(), it's unclear who is going to do the first write - parent or 
child (and "parent" in this case can include any number of threads that 
share the VM, of course).

And that means that anything that relies on pinned pages will never know 
whether it is pinning a page in the parent or the child - because whoever 
does the first COW of that page is the one that just gets a _copy_, not 
the original pinned page.

This isn't anything new. Anything that does anything by physical address 
will simply not do the right thing over a fork. The physical page may have 
started out as the parents physical page, but it may end up in the end 
being the _childs_ physical page if the parent wrote to it and triggered 
the cow.

The rule has always been: don't mix fork() with page pinning. It doesn't 
work. It never worked. It likely never will.

			Linus
(Continue reading)

Andrea Arcangeli | 11 Mar 2009 19:37
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Wed, Mar 11, 2009 at 10:58:17AM -0700, Linus Torvalds wrote:
> As far as I can tell, it's the same old problem that we've always had: if 
> you fork(), it's unclear who is going to do the first write - parent or 
> child (and "parent" in this case can include any number of threads that 
> share the VM, of course).

The child doesn't touch any page. Calling fork just generates O_DIRECT
corruption in the parent regardless of what the child does.

> This isn't anything new. Anything that does anything by physical address 

This is nothing new also in the sense that all linux kernels out there
had this bug thus far.

> will simply not do the right thing over a fork. The physical page may have 
> started out as the parents physical page, but it may end up in the end 
> being the _childs_ physical page if the parent wrote to it and triggered 
> the cow.

Actually the child will get corrupted too. Not just the parent by
losing the O_DIRECT reads. The child always assumes its anon page
contents will not get lost or overwritten after changing them in the
child.

> The rule has always been: don't mix fork() with page pinning. It doesn't 
> work. It never worked. It likely never will.

I never heard this rule here, but surely I agree there will not be
many apps out there capable of triggering this. Mostly because most
apps uses O_DIRECT on top of shm (surely not because they're not
(Continue reading)

Linus Torvalds | 11 Mar 2009 19:46
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Wed, 11 Mar 2009, Andrea Arcangeli wrote:

> On Wed, Mar 11, 2009 at 10:58:17AM -0700, Linus Torvalds wrote:
> > As far as I can tell, it's the same old problem that we've always had: if 
> > you fork(), it's unclear who is going to do the first write - parent or 
> > child (and "parent" in this case can include any number of threads that 
> > share the VM, of course).
> 
> The child doesn't touch any page. Calling fork just generates O_DIRECT
> corruption in the parent regardless of what the child does.

You aren't listening.

It depends on who does the write. If the _parent_ does the write (with 
another thread or not), then the _parent_ gets the COW.

That's all I said.

> > The rule has always been: don't mix fork() with page pinning. It doesn't 
> > work. It never worked. It likely never will.
> 
> I never heard this rule here

It's never been written down, but it's obvious to anybody who looks at how 
COW works for even five seconds. The fact is, the person doing the COW 
after a fork() is the person who no longer has the same physical page 
(because he got a new page).

So _anything- that depends on physical addresses simply _cannot_ work 
(Continue reading)

Nick Piggin | 12 Mar 2009 06:36
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Thursday 12 March 2009 05:46:17 Linus Torvalds wrote:
> On Wed, 11 Mar 2009, Andrea Arcangeli wrote:

> > > The rule has always been: don't mix fork() with page pinning. It
> > > doesn't work. It never worked. It likely never will.
> >
> > I never heard this rule here
>
> It's never been written down, but it's obvious to anybody who looks at how
> COW works for even five seconds. The fact is, the person doing the COW
> after a fork() is the person who no longer has the same physical page
> (because he got a new page).
>
> So _anything- that depends on physical addresses simply _cannot_ work
> concurrently with a fork. That has always been true.
>
> If the idiots who use O_DIRECT don't understand that, then hey, it's their
> problem. I have long been of the opinion that we should not support
> O_DIRECT at all, and that it's a totally broken premise to start with.
>
> This is just one of millions of reasons.

Well it is a quite well known issue at this stage I think. We've had
MADV_DONTFORK since 2.6.16 which is basically to solve this issue I
think with infiniband library. I guess if it would be really helpful
we *could* add MADV_DONTCOW.

Assuming we want to try fixing it transparently... what about another
approach, mark a vma as VM_DONTCOW and uncow all existing pages in it
if it ever has get_user_pages run on it. Big hammer approach.
(Continue reading)

Nick Piggin | 12 Mar 2009 17:23
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Thursday 12 March 2009 16:36:18 Nick Piggin wrote:

> Assuming we want to try fixing it transparently... what about another
> approach, mark a vma as VM_DONTCOW and uncow all existing pages in it
> if it ever has get_user_pages run on it. Big hammer approach.
>
> fast gup would be a little bit harder because looking up the vma
> defeats the purpose. However if we use another page bit to say the
> page belongs to a VM_DONTCOW vma, then we only need to check that
> once and fall back to slow gup if it is clear. So there would be no
> extra atomics in the repeat case. Yes it would be slower, but apps
> that really care should know what they are doing and set
> MADV_DONTFORK or MADV_DONTCOW on the vma by hand before doing the
> zero copy IO.
>
> Would this work? Anyone see any holes? (I imagine someone might argue
> against big hammer, but I would prefer it if it is lighter impact on
> the VM and still allows good applications to avoid the hammer)

OK, this is as far as I got tonight.

This passes Andrea's dma_thread test case. I haven't started hugepages,
and it isn't quite right to drop the mmap_sem and retake it for write
in get_user_pages (firstly, caller might hold mmap_sem for write,
secondly, it may not be able to tolerate mmap_sem being dropped).

Annoying that it has to take mmap_sem for write to add this bit to
vm_flags. Possibly we could use a different way to signal it is a
"dontcow" vma... something in anon_vma maybe?

(Continue reading)

Andrea Arcangeli | 12 Mar 2009 18:00
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Fri, Mar 13, 2009 at 03:23:40AM +1100, Nick Piggin wrote:
> OK, this is as far as I got tonight.
> 
> This passes Andrea's dma_thread test case. I haven't started hugepages,
> and it isn't quite right to drop the mmap_sem and retake it for write
> in get_user_pages (firstly, caller might hold mmap_sem for write,
> secondly, it may not be able to tolerate mmap_sem being dropped).

What's the point? I mean this will simply work worse than my patch
because it'll have to don't-cow the whole range regardless if it's
pinned or not. Which will slowdown fork in the O_DIRECT case even
more, for no good reason. I thought the complaint here was only a
beauty issue of not wanting to add a function called fork_pre_cow or
your equivalent decow_one_pte in the fork path, not any practical
issue with my patch which already passed all sort of regression
testing and performance valuations. Plus you still have a per-page
bitflag, and I think you have implementation issues in the patch (the
parent pte can't be left writeable if you are in a don't-cow vma, or
the copy will not be atomic, and glibc will have no chance to fix its
bugs). You're not removing the fork_pre_cow logic from fork, so I can
only see it as a regression to make the logic less granular in the
vma.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

(Continue reading)

Nick Piggin | 12 Mar 2009 18:20
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Friday 13 March 2009 04:00:11 Andrea Arcangeli wrote:
> On Fri, Mar 13, 2009 at 03:23:40AM +1100, Nick Piggin wrote:
> > OK, this is as far as I got tonight.
> >
> > This passes Andrea's dma_thread test case. I haven't started hugepages,
> > and it isn't quite right to drop the mmap_sem and retake it for write
> > in get_user_pages (firstly, caller might hold mmap_sem for write,
> > secondly, it may not be able to tolerate mmap_sem being dropped).
>
> What's the point?

Well the main point is to avoid atomics and barriers and stuff like
that especially in the fast gup path. It also seems very much smaller
(the vast majority of the change is the addition of decow function).

> I mean this will simply work worse than my patch
> because it'll have to don't-cow the whole range regardless if it's
> pinned or not. Which will slowdown fork in the O_DIRECT case even
> more, for no good reason. 

Hmm, maybe. It probably can possibly work entirely without the vm_flag
and just use the page flag, however. Yes I think it could, and that
might just avoid the whole problem of modifying vm_flags in gup. I'll
have to consider it more tomorrow.

But this case is just if we want to transparently support this without
too much intrusive. Apps that know and care very much could use
MADV_DONTFORK to avoid the copy completely.

> I thought the complaint here was only a
(Continue reading)

Andrea Arcangeli | 12 Mar 2009 19:06
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Fri, Mar 13, 2009 at 04:20:27AM +1100, Nick Piggin wrote:
> Well the main point is to avoid atomics and barriers and stuff like
> that especially in the fast gup path. It also seems very much smaller
> (the vast majority of the change is the addition of decow function).

Well if you remove the hugetlb part and you remove the pass of src/dst
vma that is needed anyway to fix PAT bugs, my patch will get quite
smaller too.

Agree about the gup-fast path, but frankly I miss how you avoid having
to change gup-fast... I wanted to asked about that...

> Hmm, maybe. It probably can possibly work entirely without the vm_flag
> and just use the page flag, however. Yes I think it could, and that

Right I only use the page flag, and you seem to have a page flag
PG_dontcow too after all.

> might just avoid the whole problem of modifying vm_flags in gup. I'll
> have to consider it more tomorrow.

Ok.

> But this case is just if we want to transparently support this without
> too much intrusive. Apps that know and care very much could use
> MADV_DONTFORK to avoid the copy completely.

Well those apps aren't the problem.

> My complaint is not decow / pre cow (I think I suggested it as the
(Continue reading)

Nick Piggin | 13 Mar 2009 17:09
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Friday 13 March 2009 05:06:48 Andrea Arcangeli wrote:
> On Fri, Mar 13, 2009 at 04:20:27AM +1100, Nick Piggin wrote:
> > Well the main point is to avoid atomics and barriers and stuff like
> > that especially in the fast gup path. It also seems very much smaller
> > (the vast majority of the change is the addition of decow function).
>
> Well if you remove the hugetlb part and you remove the pass of src/dst
> vma that is needed anyway to fix PAT bugs, my patch will get quite
> smaller too.

Possibly true. OK, it wasn't a very good argument to compare my incomplete,
RFC patch based on size alone :)

> Agree about the gup-fast path, but frankly I miss how you avoid having
> to change gup-fast... I wanted to asked about that...

It is more straightforward than your version because it does not try to
make the page re-cow-able again after the GUP is finished. The main
conceptual difference between our fixes I think (ignoring my silly
vma-wide decow), is this issue.

Of course I could have a race in fast-gup, but I don't think I can see
one. I'm working on removing the vma stuff and just making it per-page,
which might make it easier to review.

> > Oh, we need to do that? OK, then just take out that statement, and
> > change VM_BUG_ON(PageDontCOW()) in do_wp_page to
> > VM_BUG_ON(PageDontCOW() && !reuse);
>
> Not sure how do_wp_page is relevant, the problem I pointed out is in
(Continue reading)

Nick Piggin | 14 Mar 2009 05:46
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Saturday 14 March 2009 03:09:39 Nick Piggin wrote:
> On Friday 13 March 2009 05:06:48 Andrea Arcangeli wrote:

> > The thing is quite simple, if an app has a 1G of vma loaded, you'll
> > allocate 1G of ram for no good reason. It can even OOM, it's not just
> > a performance issue. While doing it per-page like I do, won't be
> > noticeable, as the in-flight I/O will be minor.
>
> Yes I agree now it is a silly way to do it.

Here is an updated patch that just does it on a per-page basis.
Actually it is still a bit sloppy because I just reused some code
from my last patch for the decow logic... possibly I can just use
the same precow code that you do for small and huge pages (although
Linus didn't like it so much... it is very hard to do nicely right
down there in the call chain :()

Anyway, ignoring the decow implementation (that's not really the
interesting part of the patch), I think this is looking pretty good
now.
---
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2009-03-14 02:48:06.000000000 +1100
+++ linux-2.6/include/linux/mm.h	2009-03-14 15:12:13.000000000 +1100
 <at>  <at>  -789,7 +789,7  <at>  <at>  int walk_page_range(unsigned long addr, 
 void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma);
(Continue reading)

Nick Piggin | 14 Mar 2009 06:06
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Saturday 14 March 2009 15:46:30 Nick Piggin wrote:

> Index: linux-2.6/arch/x86/mm/gup.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/gup.c	2009-03-14 02:48:06.000000000 +1100
> +++ linux-2.6/arch/x86/mm/gup.c	2009-03-14 02:48:12.000000000 +1100
>  <at>  <at>  -83,11 +83,14  <at>  <at>  static noinline int gup_pte_range(pmd_t
>  		struct page *page;
>
>  		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
> +failed:
>  			pte_unmap(ptep);
>  			return 0;
>  		}
>  		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  		page = pte_page(pte);
> +		if (unlikely(!PageDontCOW(page)))
> +			goto failed;
>  		get_page(page);
>  		pages[*nr] = page;
>  		(*nr)++;

Ah, that's stupid, the test should be confined just to PageAnon 
&& !PageDontCOW pages, of course.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>
(Continue reading)

Andrea Arcangeli | 13 Mar 2009 20:34
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Sat, Mar 14, 2009 at 03:09:39AM +1100, Nick Piggin wrote:
> Of course I could have a race in fast-gup, but I don't think I can see
> one. I'm working on removing the vma stuff and just making it per-page,
> which might make it easier to review.

If you didn't touch gup-fast and you don't send ipis in fork, you most
certainly have one, it's the one Linus pointed out and that I've fixed
(with Izik, then I sorted out the ordering details and how to make it
safe on frok side).

> Well, it would save having to touch the parent's pagetables after
> doing the atomic copy-on-fork in the child. Just have the parent do
> a do_wp_page, which will notice it is the only user of the page and
> reuse it rather than COW it (now that Hugh has fixed the races in
> the reuse check that should be fine).

If we're into the trouble path, it means parent already owns the
page. I just leave it owned to the parent, pte remains the same before
and after fork. No point in changing the pte value if we're in the
troublesome path as far as I can tell. I only verify that the parent pte
didn't go away from under fork when I temporarily release the parent
PT lock to allocate the cow page in the slow path (see the -EAGAIN
path, I also verified it triggers with swapping and system survives fine ;).

> Now I also see that your patch still hasn't covered the other side of
> the race, wheras my scheme should do. Hmm, I think that if we want to

Sorry, but can you elaborate again what the other side of the race is?

If child gets a whole new page, and parent keeps its own page with pte
(Continue reading)

Nick Piggin | 14 Mar 2009 05:59
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Saturday 14 March 2009 06:34:16 Andrea Arcangeli wrote:
> On Sat, Mar 14, 2009 at 03:09:39AM +1100, Nick Piggin wrote:
> > Of course I could have a race in fast-gup, but I don't think I can see
> > one. I'm working on removing the vma stuff and just making it per-page,
> > which might make it easier to review.
>
> If you didn't touch gup-fast and you don't send ipis in fork, you most
> certainly have one, it's the one Linus pointed out and that I've fixed
> (with Izik, then I sorted out the ordering details and how to make it
> safe on frok side).

It does touch gup-fast, but it just adds one branch and no barrier in the
case the page is de-cowed (and would be able to work with hugepages with
the get_page_multiple still I think although I haven't done hugepage
implementation yet).

> > Well, it would save having to touch the parent's pagetables after
> > doing the atomic copy-on-fork in the child. Just have the parent do
> > a do_wp_page, which will notice it is the only user of the page and
> > reuse it rather than COW it (now that Hugh has fixed the races in
> > the reuse check that should be fine).
>
> If we're into the trouble path, it means parent already owns the
> page. I just leave it owned to the parent, pte remains the same before
> and after fork. No point in changing the pte value if we're in the
> troublesome path as far as I can tell. I only verify that the parent pte
> didn't go away from under fork when I temporarily release the parent
> PT lock to allocate the cow page in the slow path (see the -EAGAIN
> path, I also verified it triggers with swapping and system survives fine
> ;).
(Continue reading)

Andrea Arcangeli | 16 Mar 2009 14:56
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Sat, Mar 14, 2009 at 03:59:11PM +1100, Nick Piggin wrote:
> It does touch gup-fast, but it just adds one branch and no barrier in the

My question is what trick to you use to stop gup-fast from returning
the page mapped read-write by the pte if gup-fast doesn't take any
lock whatsoever, it doesn't set any bit in any page or vma, and it
doesn't recheck the pte is still viable after having set any bit on
page or vmas, and you still don't send a flood of ipis from fork fast
path (no race case).

> case the page is de-cowed (and would be able to work with hugepages with
> the get_page_multiple still I think although I haven't done hugepage
> implementation yet).

Yes let's ignore hugetlb for now, I fixed hugetlb too but that can be
left for later.

> Possibly that's the right way to go. Depends if it is in the slightest
> performance critical. If not, I would just let do_wp_page do the work
> to avoid a little bit of logic, but either way is not a big deal to me.

fork is less performance critical than do_wp_page, still in fork
microbenchmark no slowdown is measured with the patch. Before I
introduced PG_gup there were false positives triggered by the pagevec
temporary pins, that was measurable, after PG_gup the fast path is
unaffected (I've still to measure gup-fast slowdown in setting PG_gup
but I'm rather optimistic that you're understimating the cost of
walking 4 layers of pagetables compared to a locked op on a l1
exclusive cacheline, so I think it'll be lost in the noise). I think
the big thing of gup-fast is primarly in not having to search vmas,
(Continue reading)

Nick Piggin | 16 Mar 2009 17:01
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Tuesday 17 March 2009 00:56:54 Andrea Arcangeli wrote:
> On Sat, Mar 14, 2009 at 03:59:11PM +1100, Nick Piggin wrote:
> > It does touch gup-fast, but it just adds one branch and no barrier in the
>
> My question is what trick to you use to stop gup-fast from returning
> the page mapped read-write by the pte if gup-fast doesn't take any
> lock whatsoever, it doesn't set any bit in any page or vma, and it
> doesn't recheck the pte is still viable after having set any bit on
> page or vmas, and you still don't send a flood of ipis from fork fast
> path (no race case).

If the page is not marked PageDontCOW, then it decows it, which
gives synchronisation against fork. If it is marked PageDontCOW,
then it can't possibly be COWed by fork, previous or subsequent.

> > Possibly that's the right way to go. Depends if it is in the slightest
> > performance critical. If not, I would just let do_wp_page do the work
> > to avoid a little bit of logic, but either way is not a big deal to me.
>
> fork is less performance critical than do_wp_page, still in fork
> microbenchmark no slowdown is measured with the patch. Before I
> introduced PG_gup there were false positives triggered by the pagevec
> temporary pins, that was measurable, after PG_gup the fast path is

OK. Mine doesn't get false positives, but it doesn't try to reintroduce
pages as COW candidates after the get_user_pages is finished. This is
how it is simpler than your patch.

> unaffected (I've still to measure gup-fast slowdown in setting PG_gup
> but I'm rather optimistic that you're understimating the cost of
(Continue reading)

Andrea Arcangeli | 12 Mar 2009 19:58
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Thu, Mar 12, 2009 at 07:06:48PM +0100, Andrea Arcangeli wrote:
> again. BTW, I start to think I forgot a tlb flush after setting the
> pte writable again, that could generate a minor fault that we can
> avoid by flushing the tlb, right? But this is a minor thing, and it'd

Ah no, that is already taken care of by the fork flush in the parent
before returning, so no problem (and it would have been a minor thing
anyway).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Nick Piggin | 12 Mar 2009 18:23
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Friday 13 March 2009 04:20:27 Nick Piggin wrote:
> On Friday 13 March 2009 04:00:11 Andrea Arcangeli wrote:

> > and I think you have implementation issues in the patch (the
> > parent pte can't be left writeable if you are in a don't-cow vma, or
> > the copy will not be atomic, and glibc will have no chance to fix its
> > bugs)
>
> Oh, we need to do that? OK, then just take out that statement, and

Should read: "take out that *if* statement" (the one which I put in to
avoid wrprotect in the parent)

> change VM_BUG_ON(PageDontCOW()) in do_wp_page to
> VM_BUG_ON(PageDontCOW() && !reuse);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Linus Torvalds | 11 Mar 2009 20:01
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Wed, 11 Mar 2009, Linus Torvalds wrote:
> 
> It's never been written down, but it's obvious to anybody who looks at how 
> COW works for even five seconds. The fact is, the person doing the COW 
> after a fork() is the person who no longer has the same physical page 
> (because he got a new page).

Btw, I think your patch has a race. Admittedly a really small one.

When you look up the page in gup.c, and then set the GUP flag on the 
"struct page", in between the lookup and the setting of the flag, another 
thread can come in and do that same fork+write thing.

	CPU0:			CPU1

	gup:			fork:
	 - look up page
	 - it's read-write
	...
				set_wr_protect
				test GUP bit - not set, good
				done

	- Mark it GUP
				tlb_flush

				write to it from user space - COW

since there is no lockng on the GUP side (there's the TLB flush that will 
(Continue reading)

Andrea Arcangeli | 11 Mar 2009 20:59
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Wed, Mar 11, 2009 at 12:01:56PM -0700, Linus Torvalds wrote:
> Btw, I think your patch has a race. Admittedly a really small one.
> 
> When you look up the page in gup.c, and then set the GUP flag on the 
> "struct page", in between the lookup and the setting of the flag, another 
> thread can come in and do that same fork+write thing.
> 
> 	CPU0:			CPU1
> 
> 	gup:			fork:
> 	 - look up page
> 	 - it's read-write
> 	...
> 				set_wr_protect
> 				test GUP bit - not set, good
> 				done
> 
> 	- Mark it GUP
> 				tlb_flush
> 
> 				write to it from user space - COW

Did you notice the check after 'mark it gup' that will run in CPU0?

+		if (PageAnon(page)) {
+			if (!PageGUP(page))
+				SetPageGUP(page);
+			smp_mb();
+			/*
+			 * Fork doesn't want to flush the smp-tlb for
(Continue reading)

Linus Torvalds | 11 Mar 2009 21:19
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Wed, 11 Mar 2009, Andrea Arcangeli wrote:
> 
> Did you notice the check after 'mark it gup' that will run in CPU0?

Ahh, no. I just read the patch through fairly quickly, and the whole 
"(gup_get_pte & mask) != mask" didn't trigger as obvious. But yeah, I see 
that it ends up re-checking the RW bit.

> gup-fast will _not_ succeed because of the set_wr_protect that just 
> happened on CPU1. That's why I added the above check after 
> setpagegup/get_page.

Ok, with the recheck I think it's fine.

> > Also, having to set the PG_GUP bit means that the "fast" gup is likely not 
> > much faster than the slow one. It now has two atomics per page it looks 
> > up, afaik, which sounds like it would delete any advantage it had over the 
> > slow version that needed locking.
> 
> gup-fast has already to get_page, so I don't see it.

That's my point. It used to have one atomic. Now it has two (and a memory 
barrier). Those tend to be pretty expensive - even when there's no 
cacheline bouncing.

> Furthermore starting from the second access GUP is already
> set

That's a totally bogus argument. It will be true for _benchmarks_, but if 
(Continue reading)

Benjamin Herrenschmidt | 14 Mar 2009 06:06

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Wed, 2009-03-11 at 13:19 -0700, Linus Torvalds wrote:
> 
> That said, I don't know who the crazy O_DIRECT users are. It may be true 
> that some O_DIRECT users end up using the same pages over and over again, 
> and that this is a good optimization for them.

Just my 2 cents here...

While I agree mostly with what you say about O_DIRECT crazyness,
unfortunately, gup is also a fashionable interface in a few other areas,
such as IB or RDMA'ish things, and I'm pretty sure we'll see others
popping here or there.

Right, it's a bit stinky, but it -is- somewhat nice for a driver to be
able to take a chunk of existing user addresses and not care whether
they are anonymous, shmem, file mappings, large pages, ... and just gup
and get some DMA pounding on them. There are various usage scenarios
where it's in fact less ugly than anything else you can come up with ...
pretty much.

IB folks so far have been avoiding the fork() trap thanks to
madvise(MADV_DONTFORK) afaik. And it all goes generally well when the
whole application knows what it's doing and just plain avoids fork.

-But- things get nasty if for some reason, the user of gup is somewhere
deep in some kind of library that an application uses without knowing,
while forking here or there to run shell scripts or other helpers.

I've seen it :-)

(Continue reading)

Nick Piggin | 14 Mar 2009 06:20
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Saturday 14 March 2009 16:06:29 Benjamin Herrenschmidt wrote:
> On Wed, 2009-03-11 at 13:19 -0700, Linus Torvalds wrote:
> > That said, I don't know who the crazy O_DIRECT users are. It may be true
> > that some O_DIRECT users end up using the same pages over and over again,
> > and that this is a good optimization for them.
>
> Just my 2 cents here...
>
> While I agree mostly with what you say about O_DIRECT crazyness,
> unfortunately, gup is also a fashionable interface in a few other areas,
> such as IB or RDMA'ish things, and I'm pretty sure we'll see others
> popping here or there.
>
> Right, it's a bit stinky, but it -is- somewhat nice for a driver to be
> able to take a chunk of existing user addresses and not care whether
> they are anonymous, shmem, file mappings, large pages, ... and just gup
> and get some DMA pounding on them. There are various usage scenarios
> where it's in fact less ugly than anything else you can come up with ...
> pretty much.
>
> IB folks so far have been avoiding the fork() trap thanks to
> madvise(MADV_DONTFORK) afaik. And it all goes generally well when the
> whole application knows what it's doing and just plain avoids fork.
>
> -But- things get nasty if for some reason, the user of gup is somewhere
> deep in some kind of library that an application uses without knowing,
> while forking here or there to run shell scripts or other helpers.
>
> I've seen it :-)
>
(Continue reading)

KOSAKI Motohiro | 16 Mar 2009 17:01
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

Hi

> > IB folks so far have been avoiding the fork() trap thanks to
> > madvise(MADV_DONTFORK) afaik. And it all goes generally well when the
> > whole application knows what it's doing and just plain avoids fork.
> >
> > -But- things get nasty if for some reason, the user of gup is somewhere
> > deep in some kind of library that an application uses without knowing,
> > while forking here or there to run shell scripts or other helpers.
> >
> > I've seen it :-)
> >
> > So if a solution can be found that doesn't uglify the whole thing beyond
> > recognition, it's probably worth it.
> 
> AFAIKS, the approach I've posted is probably the simplest (and maybe only
> way) to really fix it. It's not too ugly.

May I join this discussion?

if we only need concern to O_DIRECT, below patch is enough.

Yes, my patch isn't realy solusion.
Andrea already pointed out that it's not O_DIRECT issue, it's gup vs fork issue.
*and* my patch is crazy slow :)

So, my point is, I merely oppose easily decision to give up fixing.

Currently, I agree we don't have easily fixinig way.
but I believe we can solve this problem completely in the nealy future 
(Continue reading)

Linus Torvalds | 17 Mar 2009 01:44
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Tue, 17 Mar 2009, KOSAKI Motohiro wrote:
> 
> if we only need concern to O_DIRECT, below patch is enough.

.. together with something like this, to handle the other direction. This 
should take care of the case of an O_DIRECT write() call using a page that 
was duplicated by an _earlier_ fork(), and then got split up by a COW in
the wrong direction (ie having data from the child show up in the write).

Untested. But fairly trivial, after all. We simply do the same old 
"reuse_swap_page()" count, but we only break the COW if the page count 
afterwards is 1 (reuse_swap_page will have removed it from the swap cache 
if it returns success).

Does this (together with Kosaki's patch) pass the tests that Andrea had?

		Linus

---
 mm/memory.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index baa999e..2bd5fb0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
 <at>  <at>  -1928,7 +1928,13  <at>  <at>  static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			}
 			page_cache_release(old_page);
(Continue reading)

Andrea Arcangeli | 17 Mar 2009 13:19
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Mon, Mar 16, 2009 at 05:44:25PM -0700, Linus Torvalds wrote:
> -		reuse = reuse_swap_page(old_page);
> +		/*
> +		 * If we can re-use the swap page _and_ the end
> +		 * result has only one user (the mapping), then
> +		 * we reuse the whole page
> +		 */
> +		if (reuse_swap_page(old_page))
> +			reuse = page_count(old_page) == 1;
>  		unlock_page(old_page);

Think if the anon page is added to swapcache and the pte is unmapped
by the VM and set non present after GUP taken the page for a O_DIRECT
read (write to memory). If a thread writes to the page while the
O_DIRECT read is running in another thread (or aio), then do_wp_page
will make a copy of the swapcache under O_DIRECT read, and part of the
read operation will get lost.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Linus Torvalds | 17 Mar 2009 17:43
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Tue, 17 Mar 2009, Andrea Arcangeli wrote:
> 
> Think if the anon page is added to swapcache and the pte is unmapped
> by the VM and set non present after GUP taken the page for a O_DIRECT
> read (write to memory). If a thread writes to the page while the
> O_DIRECT read is running in another thread (or aio), then do_wp_page
> will make a copy of the swapcache under O_DIRECT read, and part of the
> read operation will get lost.

In that case, you aren't getting to the "do_wp_page()" case at all, you're 
getting the "do_swap_page()" case. Which does its own reuse_swap_page() 
thing (and that one I didn't touch - on purpose).

But you're right - it only does that for writes. If we _first_ do a read 
(to swap it back in), it will mark it read-only and _then_ we can get a 
"do_wp_page()" that splits it.

So yes - I had expected our VM to be sane, and have a writable private 
page _stay_ writable (in the absense of fork() it should never turn into a 
COW page), but the swapout+swapin code can result in a rw page that turns 
read-only in order to catch a swap cache invalidation.

Good catch. Let me think about it.

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
(Continue reading)

Linus Torvalds | 17 Mar 2009 18:01
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Tue, 17 Mar 2009, Linus Torvalds wrote:
> 
> So yes - I had expected our VM to be sane, and have a writable private 
> page _stay_ writable (in the absense of fork() it should never turn into a 
> COW page), but the swapout+swapin code can result in a rw page that turns 
> read-only in order to catch a swap cache invalidation.
> 
> Good catch. Let me think about it.

Btw, I think this is actually a pre-existing bug regardless of my patch.

That same swapout+swapin problem seems to lose the dirty bit on a O_DIRECT 
write - exactly for the same reason. When swapin turns the page into a 
read-only page in order to keep the physical page in the swap cache, the 
write to the physical page (that was gotten by get_user_pages() earlier) 
will bypass all that.

So the get_user_pages() users will then write to the page, but the next 
time we swap things out, if nobody _else_ wrote to it, that write will be 
lost because we'll just drop the page (it was in the swap cache!) even 
though it had changed data on it.

My patch changed the schenario a bit (split page rather than dropped 
page), but the fundamental cause seems to be the same - the swap cache 
code very much depends on writes to the _virtual_ address.

Or am I missing something?

			Linus
(Continue reading)

Andrea Arcangeli | 17 Mar 2009 18:10
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Tue, Mar 17, 2009 at 10:01:06AM -0700, Linus Torvalds wrote:
> That same swapout+swapin problem seems to lose the dirty bit on a O_DIRECT 

I think the dirty bit is set in dio_bio_complete (or
bio_check_pages_dirty for the aio case) so forcing the swapcache to be
written out again before the page can be freed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Linus Torvalds | 17 Mar 2009 18:43
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Tue, 17 Mar 2009, Andrea Arcangeli wrote:

> On Tue, Mar 17, 2009 at 10:01:06AM -0700, Linus Torvalds wrote:
> > That same swapout+swapin problem seems to lose the dirty bit on a O_DIRECT 
> 
> I think the dirty bit is set in dio_bio_complete (or
> bio_check_pages_dirty for the aio case) so forcing the swapcache to be
> written out again before the page can be freed.

Do all the other get_user_pages() users do that, though?

[ Looks around - at least access_process_vm(), IB and the NFS direct code 
  do. So we seem to be mostly ok, at least for the main users ]

Ok, no worries.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Linus Torvalds | 17 Mar 2009 19:09
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Tue, 17 Mar 2009, Linus Torvalds wrote:
> 
> Do all the other get_user_pages() users do that, though?
> 
> [ Looks around - at least access_process_vm(), IB and the NFS direct code 
>   do. So we seem to be mostly ok, at least for the main users ]
> 
> Ok, no worries.

This problem is actually pretty easy to fix for anonymous pages: since the 
act of pinning (for writes) should have done all the COW stuff and made 
sure the page is not in the swap cache, we only need to avoid adding it 
back.

IOW, something like the following makes sense on all levels regardless 
(note: I didn't check if there is some off-by-one issue where we've raised 
the page count for other reasons when scanning it, so this is not meant to 
be a serious patch, just a "something along these lines" thing).

This does not obviate the need to mark pages dirty afterwards, though, 
since true shared mappings always cause that (and we cannot keep them 
dirty, since somebody may be doing fsync() on them or something like 
that).

But since the COW issue is only a matter of private pages, this handles 
that trivially.

			Linus

(Continue reading)

Linus Torvalds | 17 Mar 2009 19:19
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Tue, 17 Mar 2009, Linus Torvalds wrote:
> 
> This problem is actually pretty easy to fix for anonymous pages: since the 
> act of pinning (for writes) should have done all the COW stuff and made 
> sure the page is not in the swap cache, we only need to avoid adding it 
> back.

An alternative approach would have been to just count page pinning as 
being a "referenced", which to some degree would be even more logical (we 
don't set the referenced flag when we look those pages up). That would 
also affect pages that were get_user_page'd just for reading, which might 
be seen as an additional bonus.

The "don't turn pinned pages into swap cache pages" is a somewhat more 
direct patch, though. It gives more obvious guarantees about the lifetime 
behaviour of anon pages wrt get_user_pages[_fast]().. 

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Andrea Arcangeli | 17 Mar 2009 19:46
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Tue, Mar 17, 2009 at 11:19:59AM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 17 Mar 2009, Linus Torvalds wrote:
> > 
> > This problem is actually pretty easy to fix for anonymous pages: since the 
> > act of pinning (for writes) should have done all the COW stuff and made 
> > sure the page is not in the swap cache, we only need to avoid adding it 
> > back.
> 
> An alternative approach would have been to just count page pinning as 
> being a "referenced", which to some degree would be even more logical (we 
> don't set the referenced flag when we look those pages up). That would 
> also affect pages that were get_user_page'd just for reading, which might 
> be seen as an additional bonus.
> 
> The "don't turn pinned pages into swap cache pages" is a somewhat more 
> direct patch, though. It gives more obvious guarantees about the lifetime 
> behaviour of anon pages wrt get_user_pages[_fast]().. 

I don't think you can tackle this from add_to_swap because the page
may be in the swapcache well before gup runs (gup(write=1) can map the
swapcache as exclusive and read-write in the pte). So then what
happens is again that the VM unmaps the page, do_swap_page map it as
readonly swapcache (so far so good), and the do_wp_page copies the
page under O_DIRECT read again.

The off by one is most certain as it's invoked by the VM but that's an
implementation detail not relevant for this discussion agreed, and I
guess you also meant page_mapcount instead of page_mapped or I think
(Continue reading)

Linus Torvalds | 17 Mar 2009 20:03
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Tue, 17 Mar 2009, Andrea Arcangeli wrote:
> 
> I don't think you can tackle this from add_to_swap because the page
> may be in the swapcache well before gup runs (gup(write=1) can map the
> swapcache as exclusive and read-write in the pte).

If it's in the swap cache, it should be mapped read-only, and gup(write=1) 
will do the COW break and un-swapcache it.

When can it be writably in the swap cache? The write-only thing is the one 
we use to invalidate stale swap cache entries, and when we mark those 
pages writable (in do_wp_page or do_swap_page) we always remove the page 
from the swap cache at the same time.

Or is there some other path I missed?

> My preference is still to keeps pages with elevated refcount pinned in
> the ptes like 2.6.7 did, that will allow do_wp_page to takeover only
> pages with page_count not elevated without risk of calling do_wp_page
> on any page under gup.

I agree that that would also work - and be even simpler. If done right, we 
can even avoid clearing the dirty bit (in page_mkclean()) for such pages, 
and now it works for _all_ pages, not just anonymous pages.

IOW, even if you had a shared mapping and were to GUP() those pages for 
writing, they'd _stay_ dirty until you free'd them - no need to re-dirty 
them in case somebody did IO on them. 

(Continue reading)

Andrea Arcangeli | 17 Mar 2009 20:35
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Tue, Mar 17, 2009 at 12:03:55PM -0700, Linus Torvalds wrote:
> If it's in the swap cache, it should be mapped read-only, and gup(write=1) 
> will do the COW break and un-swapcache it.

It may turn it read-write instead of COW break and un-swapcache.

   if (write_access && reuse_swap_page(page)) {
      pte = maybe_mkwrite(pte_mkdirty(pte), vma);

This is done to avoid fragmenting the swap device.

> I agree that that would also work - and be even simpler. If done right, we 
> can even avoid clearing the dirty bit (in page_mkclean()) for such pages, 
> and now it works for _all_ pages, not just anonymous pages.
> 
> IOW, even if you had a shared mapping and were to GUP() those pages for 
> writing, they'd _stay_ dirty until you free'd them - no need to re-dirty 
> them in case somebody did IO on them. 

I agree in principle, if the VM stays away from pages under GUP
theoretically the dirty bit shouldn't be transferred to the PG_dirty
of the page until after the I/O is complete, so the dirty bit set by
gup in the pte may be enough. Not sure if there are other places that
could transfer the dirty bit of the pte before the gup user releases
the page-pin.

> I don't think you can use just mapcount on its own - you have to compare 
> it to page_count(). Otherwise perfectly normal (non-gup) pages will 
> trigger, since that page count is the only thing that differs between the 
> two cases.
(Continue reading)

Linus Torvalds | 17 Mar 2009 20:55
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Tue, 17 Mar 2009, Andrea Arcangeli wrote:

> On Tue, Mar 17, 2009 at 12:03:55PM -0700, Linus Torvalds wrote:
> > If it's in the swap cache, it should be mapped read-only, and gup(write=1) 
> > will do the COW break and un-swapcache it.
> 
> It may turn it read-write instead of COW break and un-swapcache.
> 
>    if (write_access && reuse_swap_page(page)) {
>       pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> 
> This is done to avoid fragmenting the swap device.

Right, but reuse_swap_page() will have removed it from the swapcache if it 
returns success.

So if the page is writable in the page tables, it should not be in the 
swap cache.

Oh, except that we do it in shrink_page_list(), and while we're going to 
do that whole "try_to_unmap()", I guess it can fail to unmap there? In 
that case, you could actually have it in the page tables while in the swap 
cache.

And besides, we do remove it from the page tables in the wrong order (ie 
we add it to the swap cache first, _then_ remove it), so I guess that also 
ends up being a race with another CPU doing fast-gup. And we _have_ to do 
it in that order at least for the map_count > 1 case, since a read-only 
swap page may be shared by multiple mm's, and the swap-cache is how we 
(Continue reading)

KAMEZAWA Hiroyuki | 17 Mar 2009 01:56
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Mon, 16 Mar 2009 17:44:25 -0700 (PDT)
Linus Torvalds <torvalds <at> linux-foundation.org> wrote:

> 
> 
> On Tue, 17 Mar 2009, KOSAKI Motohiro wrote:
> > 
> > if we only need concern to O_DIRECT, below patch is enough.
> 
> .. together with something like this, to handle the other direction. This 
> should take care of the case of an O_DIRECT write() call using a page that 
> was duplicated by an _earlier_ fork(), and then got split up by a COW in
> the wrong direction (ie having data from the child show up in the write).
> 
> Untested. But fairly trivial, after all. We simply do the same old 
> "reuse_swap_page()" count, but we only break the COW if the page count 
> afterwards is 1 (reuse_swap_page will have removed it from the swap cache 
> if it returns success).
> 
> Does this (together with Kosaki's patch) pass the tests that Andrea had?
> 
I'm not sure but I doubt "AIO" case.

+	down_read(&current->mm->directio_sem);
 	retval = direct_io_worker(rw, iocb, inode, iov, offset,
 				nr_segs, blkbits, get_block, end_io, dio);
+	up_read(&current->mm->directio_sem);

If AIO, this semaphore range seems to be not enough. 

(Continue reading)

Nick Piggin | 16 Mar 2009 17:23
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Tuesday 17 March 2009 03:01:42 KOSAKI Motohiro wrote:
> Hi
>

> > AFAIKS, the approach I've posted is probably the simplest (and maybe only
> > way) to really fix it. It's not too ugly.
>
> May I join this discussion?

Of course :)

> if we only need concern to O_DIRECT, below patch is enough.
>
> Yes, my patch isn't realy solusion.
> Andrea already pointed out that it's not O_DIRECT issue, it's gup vs fork
> issue. *and* my patch is crazy slow :)

Well, it's an interesting question. I'd say it probably is more than
just O_DIRECT. vmsplice too, for example (which I think is much harder
to fix this way because the pages are retired by the other end of
the pipe, so I don't think you can hold a lock across it).

For other device drivers, one could argue that they are "special" and
require special knowledge and apps to use MADV_DONTFORK... Ben didn't
like that so much, and also some other users of get_user_pages might
come up.

But your patch is interesting. I don't think it is crazy slow... well
it might be a bit slow in the case that a threaded app doing a lot of
direct IO or an app doing async IO forks. But how common is that?
(Continue reading)

KOSAKI Motohiro | 18 Mar 2009 03:04
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

Hi

> > ---
> >  fs/direct-io.c            |    2 ++
> >  include/linux/init_task.h |    1 +
> >  include/linux/mm_types.h  |    3 +++
> >  kernel/fork.c             |    3 +++
> >  4 files changed, 9 insertions(+), 0 deletions(-)
> 
> It is an interesting patch. Thanks for throwing it into the discussion.
> I do prefer to close the race up for all cases if we decide to do
> anything at all about it, ie. all or nothing. But maybe others disagree.

Honestly, I wan't excepting linus's reaction. but I hope to make my v2.

My point is:
  - my patch don't prevent implement madvice(DONTCOW), I think.
  - andrea patch's complexity is mainly caused by avoiding perfromance degression effort,
    then, kernel later improvement can shrink his patch automatically.
    furtunately KSM don't merge yet. we can discuss his patch again at KSM submitting.
  - anyway, it can fix the bug.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

KOSAKI Motohiro | 22 Mar 2009 13:23
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

Hi

following patch is my v2 approach.
it survive Andrea's three dio test-case.

Linus suggested to change add_to_swap() and shrink_page_list() stuff
for avoid false cow in do_wp_page() when page become to swapcache.

I think it's good idea. but it's a bit radical. so I think it's for development
tree tackle.

Then, I decide to use Nick's early decow in 
get_user_pages() and RO mapped page don't use gup_fast.

yeah, my approach is extream brutal way and big hammer. but I think 
it don't have performance issue in real world.

why?

Practically, we can assume following two thing.

(1) the buffer of passed write(2) syscall argument is RW mapped
    page or COWed RO page.

if anybody write following code, my path cause performance degression.

   buf = mmap()
   memset(buf, 0x11, len);
   mprotect(buf, len, PROT_READ)
   fd = open(O_DIRECT)
(Continue reading)

Nick Piggin | 24 Mar 2009 14:43
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Sunday 22 March 2009 23:23:56 KOSAKI Motohiro wrote:
> Hi
>
> following patch is my v2 approach.
> it survive Andrea's three dio test-case.
>
> Linus suggested to change add_to_swap() and shrink_page_list() stuff
> for avoid false cow in do_wp_page() when page become to swapcache.
>
> I think it's good idea. but it's a bit radical. so I think it's for
> development tree tackle.
>
> Then, I decide to use Nick's early decow in
> get_user_pages() and RO mapped page don't use gup_fast.

You probably should be testing for PageAnon pages in gup_fast.
Also, using a bit in page->flags you could potentially get
anonymous, readonly mappings working again (I thought I had
them working in my patch, but on second thoughts perhaps I
had a bug in tagging them, I'll try to fix that).

> yeah, my approach is extream brutal way and big hammer. but I think
> it don't have performance issue in real world.
>
> why?
>
> Practically, we can assume following two thing.
>
> (1) the buffer of passed write(2) syscall argument is RW mapped
>     page or COWed RO page.
(Continue reading)

KOSAKI Motohiro | 30 Mar 2009 12:52
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

Hi Nick,

> > Am I missing any thing?
> 
> I still don't understand why this way is so much better than
> my last proposal. I just wanted to let that simmer down for a 
> few days :) But I'm honestly really just interested in a good
> discussion and I don't mind being sworn at if I'm being stupid,
> but I really want to hear opinions of why I'm wrong too.
> 
> Yes my patch has downsides I'm quite happy to admit. But I just
> don't see that copy-on-fork rather than wrprotect-on-fork is
> the showstopper. To me it seemed nice because it is practically
> just reusing code straight from do_wp_page, and pretty well
> isolated out of the fastpath.

Firstly, I'm very sorry for very long delay responce. This month, I'm
very busy and I don't have enough developing time ;)

Secondly, I have strongly obsession to bugfix. (I guess you alread know it)
but I don't have obsession to bugfix _way_. my patch was made for
creating good discussion, not NAK your patch.

I think your patch is good. but it have few disadvantage.
(yeah, I agree mine have lot disadvantage)

1. using page->flags
   nowadays, page->flags is one of most prime estate in linux.
   as far as possible, we can avoid to use it.
2. don't have GUP_FLAGS_PINNING_PAGE flag
(Continue reading)

Linus Torvalds | 24 Mar 2009 18:56
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Wed, 25 Mar 2009, Nick Piggin wrote:
> 
> I still don't understand why this way is so much better than
> my last proposal.

Take a look at the diffstat.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

KOSAKI Motohiro | 23 Mar 2009 01:13
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

> Hi
> 
> following patch is my v2 approach.
> it survive Andrea's three dio test-case.
> 
> Linus suggested to change add_to_swap() and shrink_page_list() stuff
> for avoid false cow in do_wp_page() when page become to swapcache.
> 
> I think it's good idea. but it's a bit radical. so I think it's for development
> tree tackle.
> 
> Then, I decide to use Nick's early decow in 
> get_user_pages() and RO mapped page don't use gup_fast.
> 
> yeah, my approach is extream brutal way and big hammer. but I think 
> it don't have performance issue in real world.
> 
> why?
> 
> Practically, we can assume following two thing.
> 
> (1) the buffer of passed write(2) syscall argument is RW mapped
>     page or COWed RO page.
> 
> if anybody write following code, my path cause performance degression.
> 
>    buf = mmap()
>    memset(buf, 0x11, len);
>    mprotect(buf, len, PROT_READ)
>    fd = open(O_DIRECT)
(Continue reading)

Ingo Molnar | 23 Mar 2009 17:29
Picon
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


* KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com> wrote:

> > following patch is my v2 approach.
> > it survive Andrea's three dio test-case.
> >
> > [...]

> Frankly, linus sugessted to insert one branch into do_wp_page(), 
> but I remove one branch from gup_fast.
> 
> I think it's good performance trade-off. but if anybody hate my 
> approach, I'll drop my chicken heart and try to linus suggested 
> way.

We started out with a difficult corner case problem (for an arguably 
botched syscall promise we made to user-space many moons ago), and 
an invasive and unmaintainable looking patch:

    8 files changed, 342 insertions(+), 77 deletions(-)

And your v2 is now:

    9 files changed, 66 insertions(+), 21 deletions(-)

... and it is also speeding up fast-gup. Which is a marked 
improvement IMO.

	Ingo

(Continue reading)

Linus Torvalds | 23 Mar 2009 17:46
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Mon, 23 Mar 2009, Ingo Molnar wrote:
> 
> And your v2 is now:
> 
>     9 files changed, 66 insertions(+), 21 deletions(-)
> 
> ... and it is also speeding up fast-gup. Which is a marked 
> improvement IMO.

Yeah, I have no problems with that patch. I'd just suggest a final 
simplification, and getting rid of the

        mask = _PAGE_PRESENT|_PAGE_USER;
        /* Maybe the read only pte is cow mapped page. (or not maybe)
           So, falling back to get_user_pages() is better */
        mask |= _PAGE_RW;

and just doing something like

	/*
	 * fast-GUP only handles the simple cases where we have
	 * full access to the page (ie private pages are copied
	 * etc).
	 */
	#define GUP_MASK (_PAGE_PRESENT|_PAGE_USER|_PAGE_RW)

and leaving it at that.

Of course, maybe somebody does O_DIRECT writes on a fork'ed image in order 
(Continue reading)

KOSAKI Motohiro | 24 Mar 2009 06:08
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

Hi

> > And your v2 is now:
> > 
> >     9 files changed, 66 insertions(+), 21 deletions(-)
> > 
> > ... and it is also speeding up fast-gup. Which is a marked 
> > improvement IMO.
> 
> Yeah, I have no problems with that patch. I'd just suggest a final 
> simplification, and getting rid of the
> 
>         mask = _PAGE_PRESENT|_PAGE_USER;
>         /* Maybe the read only pte is cow mapped page. (or not maybe)
>            So, falling back to get_user_pages() is better */
>         mask |= _PAGE_RW;
> 
> and just doing something like
> 
> 	/*
> 	 * fast-GUP only handles the simple cases where we have
> 	 * full access to the page (ie private pages are copied
> 	 * etc).
> 	 */
> 	#define GUP_MASK (_PAGE_PRESENT|_PAGE_USER|_PAGE_RW)

OK! I'll do that.
Thanks good reviewing!

> and leaving it at that.
(Continue reading)

Linus Torvalds | 16 Mar 2009 17:32
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Tue, 17 Mar 2009, Nick Piggin wrote:
> > Yes, my patch isn't realy solusion.
> > Andrea already pointed out that it's not O_DIRECT issue, it's gup vs fork
> > issue. *and* my patch is crazy slow :)
> 
> Well, it's an interesting question. I'd say it probably is more than
> just O_DIRECT. vmsplice too, for example (which I think is much harder
> to fix this way because the pages are retired by the other end of
> the pipe, so I don't think you can hold a lock across it).

Well, only the "fork()" has the race problem.

So having a fork-specific lock (but not naming it by directio) actually 
does make sense. The fork is much less performance-critical than most 
random mmap_sem users - and doesn't have the same scalability issues 
either (ie people probably _do_ want to do mmap/munmap/brk concurrently 
with gup lookup, but there's much less worry about concurrent fork() 
performance).

It doesn't necessarily make the general problem go away, but it makes the 
_particular_ race between get_user_pages() and fork() go away. Then you 
can do per-page flags or whatever and not have to worry about concurrent 
lookups.

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
(Continue reading)

Nick Piggin | 16 Mar 2009 17:50
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Tuesday 17 March 2009 03:32:11 Linus Torvalds wrote:
> On Tue, 17 Mar 2009, Nick Piggin wrote:
> > > Yes, my patch isn't realy solusion.
> > > Andrea already pointed out that it's not O_DIRECT issue, it's gup vs
> > > fork issue. *and* my patch is crazy slow :)
> >
> > Well, it's an interesting question. I'd say it probably is more than
> > just O_DIRECT. vmsplice too, for example (which I think is much harder
> > to fix this way because the pages are retired by the other end of
> > the pipe, so I don't think you can hold a lock across it).
>
> Well, only the "fork()" has the race problem.
>
> So having a fork-specific lock (but not naming it by directio) actually
> does make sense. The fork is much less performance-critical than most
> random mmap_sem users - and doesn't have the same scalability issues
> either (ie people probably _do_ want to do mmap/munmap/brk concurrently
> with gup lookup, but there's much less worry about concurrent fork()
> performance).
>
> It doesn't necessarily make the general problem go away, but it makes the
> _particular_ race between get_user_pages() and fork() go away. Then you
> can do per-page flags or whatever and not have to worry about concurrent
> lookups.

Hmm, I see what you mean there; it can be used to solve Andrea's race
instead of using set_bit/memory barriers. But I think then you would
still need to put this lock in fork and get_user_pages[_fast], *and*
still do most of the other stuff required in Andrea's patch.

(Continue reading)

KAMEZAWA Hiroyuki | 17 Mar 2009 00:59
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Tue, 17 Mar 2009 03:50:12 +1100
Nick Piggin <nickpiggin <at> yahoo.com.au> wrote:

> On Tuesday 17 March 2009 03:32:11 Linus Torvalds wrote:
> > On Tue, 17 Mar 2009, Nick Piggin wrote:
> > > > Yes, my patch isn't realy solusion.
> > > > Andrea already pointed out that it's not O_DIRECT issue, it's gup vs
> > > > fork issue. *and* my patch is crazy slow :)
> > >
> > > Well, it's an interesting question. I'd say it probably is more than
> > > just O_DIRECT. vmsplice too, for example (which I think is much harder
> > > to fix this way because the pages are retired by the other end of
> > > the pipe, so I don't think you can hold a lock across it).
> >
> > Well, only the "fork()" has the race problem.
> >
> > So having a fork-specific lock (but not naming it by directio) actually
> > does make sense. The fork is much less performance-critical than most
> > random mmap_sem users - and doesn't have the same scalability issues
> > either (ie people probably _do_ want to do mmap/munmap/brk concurrently
> > with gup lookup, but there's much less worry about concurrent fork()
> > performance).
> >
> > It doesn't necessarily make the general problem go away, but it makes the
> > _particular_ race between get_user_pages() and fork() go away. Then you
> > can do per-page flags or whatever and not have to worry about concurrent
> > lookups.
> 
> Hmm, I see what you mean there; it can be used to solve Andrea's race
> instead of using set_bit/memory barriers. But I think then you would
(Continue reading)

Linus Torvalds | 16 Mar 2009 18:02
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Tue, 17 Mar 2009, Nick Piggin wrote:
> 
> Hmm, I see what you mean there; it can be used to solve Andrea's race
> instead of using set_bit/memory barriers. But I think then you would
> still need to put this lock in fork and get_user_pages[_fast], *and*
> still do most of the other stuff required in Andrea's patch.

Well, yes and no. 

What if we just did the caller get the lock? And then leave it entirely to 
the caller to decide how it wants to synchronize with fork?

In particular, we really _could_ just say "hold the lock for reading for 
as long as you hold the reference count to the page" - since now the lock 
only matters for fork(), nothing else.

And make the forking part use "down_write_killable()", so that you can 
kill the process if it does something bad.

Now you can make vmsplice literally get a read-lock for the whole IO 
operation. The process that does "vmsplice()" will not be able to fork 
until the IO is done, but let's be honest here: if you're doing 
vmsplice(), that is damn well what you WANT!

splice() already has a callback for releasing the pages, so it's doable.

O_DIRECT has similar issues - by the time we return from an O_DIRECT 
write, the pages had better already be written out, so we could just take 
the read-lock over the whole operation.
(Continue reading)

Nick Piggin | 16 Mar 2009 18:19
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Tuesday 17 March 2009 04:02:02 Linus Torvalds wrote:
> On Tue, 17 Mar 2009, Nick Piggin wrote:
> > Hmm, I see what you mean there; it can be used to solve Andrea's race
> > instead of using set_bit/memory barriers. But I think then you would
> > still need to put this lock in fork and get_user_pages[_fast], *and*
> > still do most of the other stuff required in Andrea's patch.
>
> Well, yes and no.
>
> What if we just did the caller get the lock? And then leave it entirely to
> the caller to decide how it wants to synchronize with fork?
>
> In particular, we really _could_ just say "hold the lock for reading for
> as long as you hold the reference count to the page" - since now the lock
> only matters for fork(), nothing else.

Well that in theory should close the race in one direction (writing into
the wrong page).

I don't think it closes it in the other direction (reading the wrong data
from the page).

I'm also not quite convinced of vmsplice.

> And make the forking part use "down_write_killable()", so that you can
> kill the process if it does something bad.
>
> Now you can make vmsplice literally get a read-lock for the whole IO
> operation. The process that does "vmsplice()" will not be able to fork
> until the IO is done, but let's be honest here: if you're doing
(Continue reading)

Linus Torvalds | 16 Mar 2009 18:42
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Tue, 17 Mar 2009, Nick Piggin wrote:
> 
> Well that in theory should close the race in one direction (writing into
> the wrong page).
> 
> I don't think it closes it in the other direction (reading the wrong data
> from the page).

Why?

If somebody does a COW while we have a get_user_pages() page frame cached, 
the get_user_pages() will have increased the page count, so regardless of 
_who_ writes to the page, the writer will always get a new page. No?

So reading data from the page will always get the old pre-cow data. 

[ goes to reading code ]

Oh, damn. That's how it used to work a long time ago when we looked at the 
page count. Now we just look at the page *map* count, we don't look at any 
other counts. So the COW logic won't see that somebody else has a copy.

Maybe we could go back to also looking at page counts?

> BTW. have you looked at my approach yet? I've tried to solve the fork
> vs gup race in yet another way. Don't know if you think it is palatable.

I really think we should be able to fix this without _anything_ like that 
at all. Just the lock (and some reuse_swap_page() logic changes).
(Continue reading)

Andrea Arcangeli | 16 Mar 2009 19:28
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Mon, Mar 16, 2009 at 10:42:48AM -0700, Linus Torvalds wrote:
> Maybe we could go back to also looking at page counts?

Hugh just recently reminded me why we switched to mapcount and
explanation is here: c475a8ab625d567eacf5e30ec35d6d8704558062
which wasn't entirely safe until this was added too:
ab967d86015a19777955370deebc8262d50fed63 which reliably allowed to
takeover swapcache pages taken by gup and at the same time it allowed
the VM to unmap ptes pointing to swapcache taken by GUP.

Yes it's possible to go back to page counts, then we have only to
reintroduce by 2.6.7 solution that will prevent the VM to unmap ptes
that are mapping pages take by GUP. Otherwise do_wp_page won't be able
to remap into the pte the same swapcache that was unmapped by the pte
by the VM leading to disk corruption with swapping (the 2.4 bug, fixed
in 2.4 with a simpler PG_lock local to direct-io, that prevented the
VM to unmap ptes on the page as long as I/O was in progress, and
PG_lock was released by the ->end_io async handler from irq IIRC).

The only problem I can see is if mapcount and page count can change
freely while PT lock and rmap locks are taken, comparing them won't be
as reliable as in ksm/fork (in my version of the fix) where we're
guaranteed mapcount is 1 and stays 1 as long as we hold PT lock,
because pte_write(pte) == true and PageAnon == true (I also added a
BUG_ON to check mapcount to be always 1 with the other two conditions
are true). That makes ksm/forkfix quite obviously safe in this regard.

But for the VM to decide not to unmap a pte taken by GUP, we also have
to deal with a mapcount > 1 and pte_write(pte) == false and PageAnon
== true. So if we solve that ordering issue between reading mapcount
(Continue reading)

Nick Piggin | 16 Mar 2009 19:02
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Tuesday 17 March 2009 04:42:48 Linus Torvalds wrote:
> On Tue, 17 Mar 2009, Nick Piggin wrote:
> > Well that in theory should close the race in one direction (writing into
> > the wrong page).
> >
> > I don't think it closes it in the other direction (reading the wrong data
> > from the page).
>
> Why?
>
> If somebody does a COW while we have a get_user_pages() page frame cached,
> the get_user_pages() will have increased the page count, so regardless of
> _who_ writes to the page, the writer will always get a new page. No?

[(no)]

> Maybe we could go back to also looking at page counts?

Hmm, possibly could.

> > BTW. have you looked at my approach yet? I've tried to solve the fork
> > vs gup race in yet another way. Don't know if you think it is palatable.
>
> I really think we should be able to fix this without _anything_ like that
> at all. Just the lock (and some reuse_swap_page() logic changes).

What part of that do you dislike, though? I don't think the lock is a
particularly elegant idea either (shared cacheline, vmsplice, converting
callers).

(Continue reading)

Linus Torvalds | 16 Mar 2009 19:14
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Tue, 17 Mar 2009, Nick Piggin wrote:
> 
> What part of that do you dislike, though? I don't think the lock is a
> particularly elegant idea either (shared cacheline, vmsplice, converting
> callers).

All of the absolute *crap* for no good reason.

Did you even look at your patch? It wasn't as ugly as Andrea's, but it was 
ugly enough, and it was buggy. That whole "decow" stuff was too f*cking 
ugly to live.

Couple that with the fact that no real-life user can possibly care, and 
that O_DIRECT is broken to begin with, and I say: "let's fix this with a 
_much_ smaller patch".

You may think that the lock isn't particularly "elegant", but I can only 
say "f*ck that, look at the number of lines of code, and the simplicity".

Your "elegant" argument is total and utter sh*t, in other words. The lock 
approach is tons more elegant, considering that it solves the problem much 
more cleanly, and with _much_ less crap.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
(Continue reading)

Andrea Arcangeli | 16 Mar 2009 19:37
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Mon, Mar 16, 2009 at 11:14:59AM -0700, Linus Torvalds wrote:
> You may think that the lock isn't particularly "elegant", but I can only 
> say "f*ck that, look at the number of lines of code, and the simplicity".

I'm sorry but the number of lines that you're reading in the
direct_io_worker patch, aren't representative of what it takes to fix
it with a mm wide lock. It may be conceptually simpler to fix it
outside GUP, on that I can certainly agree (with the downside of
leaving splice broken etc..), but I can't see how that small patch can
fix anything as releasing the semaphore after direct_io_worker returns
with O_DIRECT mixed with async-io. Before claiming that the outer lock
results in less number of lines of code, I'd wait to see a fix that
works with O_DIRECT+async-io too as well as mine and Nick's do.

> Your "elegant" argument is total and utter sh*t, in other words. The lock 
> approach is tons more elegant, considering that it solves the problem much 
> more cleanly, and with _much_ less crap.

I guess elegant is relative, but the size argument is objective, and
that should be possible to compare if somebody writes a full fix that
doesn't fall apart if return value of direct_io_worker is -EIOCBQUEUED.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Nick Piggin | 16 Mar 2009 19:29
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Tuesday 17 March 2009 05:14:59 Linus Torvalds wrote:
> On Tue, 17 Mar 2009, Nick Piggin wrote:
> > What part of that do you dislike, though? I don't think the lock is a
> > particularly elegant idea either (shared cacheline, vmsplice, converting
> > callers).
>
> All of the absolute *crap* for no good reason.
>
> Did you even look at your patch? It wasn't as ugly as Andrea's, but it was
> ugly enough, and it was buggy. That whole "decow" stuff was too f*cking
> ugly to live.

What's buggy about it? Stupid bugs, or fundamentally broken?

> Couple that with the fact that no real-life user can possibly care, and
> that O_DIRECT is broken to begin with, and I say: "let's fix this with a
> _much_ smaller patch".

If it is based on nobody caring, I would prefer not to add anything at
all to "fix" it? We have MADV_DONTFORK already...

> You may think that the lock isn't particularly "elegant", but I can only
> say "f*ck that, look at the number of lines of code, and the simplicity".
>
> Your "elegant" argument is total and utter sh*t, in other words. The lock
> approach is tons more elegant, considering that it solves the problem much
> more cleanly, and with _much_ less crap.

In my opinion it is not, given that you have to convert callers. If you
say that you only care about fixing O_DIRECT, then yes I would probably
(Continue reading)

Linus Torvalds | 16 Mar 2009 20:17
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Tue, 17 Mar 2009, Nick Piggin wrote:
> 
> What's buggy about it? Stupid bugs, or fundamentally broken?

The lack of locking.

> In my opinion it is not, given that you have to convert callers. If you
> say that you only care about fixing O_DIRECT, then yes I would probably
> agree the lock is nicer in that case.

F*ck me, I'm not going to bother to argue. I'm not going to merge your 
patch, it's that easy.

Quite frankly, I don't think that the "bug" is a bug to begin with. 
O_DIRECT+fork() can damn well continue to be broken. But if we fix it, we 
fix it the _clean_ way with a simple patch, not with that shit-for-logic 
horrible decow crap.

It's that simple. I refuse to take putrid industrial waste patches for 
something like this.

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

(Continue reading)

Nick Piggin | 17 Mar 2009 06:42
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Tuesday 17 March 2009 06:17:21 Linus Torvalds wrote:
> On Tue, 17 Mar 2009, Nick Piggin wrote:
> > What's buggy about it? Stupid bugs, or fundamentally broken?
>
> The lack of locking.

I don't think it's broken. I can't see a problem.

> > In my opinion it is not, given that you have to convert callers. If you
> > say that you only care about fixing O_DIRECT, then yes I would probably
> > agree the lock is nicer in that case.
>
> F*ck me, I'm not going to bother to argue. I'm not going to merge your
> patch, it's that easy.
>
> Quite frankly, I don't think that the "bug" is a bug to begin with.
> O_DIRECT+fork() can damn well continue to be broken. But if we fix it, we
> fix it the _clean_ way with a simple patch, not with that shit-for-logic
> horrible decow crap.
>
> It's that simple. I refuse to take putrid industrial waste patches for
> something like this.

I consider it is clean because it only adds branches in 3 places that
are not taken unless direct IO and fork are used, and it fixes the
"problem" in the VM directly leaving get_user_pages unchanged.

I don't think it is conceptually such a problem to copy pages rather
than COW them in fork. Seems fairly straightforward to me.

(Continue reading)

Nick Piggin | 17 Mar 2009 06:58
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Tuesday 17 March 2009 16:42:24 Nick Piggin wrote:

> I consider it is clean because it only adds branches in 3 places that
> are not taken unless direct IO and fork are used, and it fixes the
> "problem" in the VM directly leaving get_user_pages unchanged.

leaving get_user_pages callers unchanged.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Nick Piggin | 16 Mar 2009 19:05
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Tuesday 17 March 2009 05:02:56 Nick Piggin wrote:
> On Tuesday 17 March 2009 04:42:48 Linus Torvalds wrote:
> > On Tue, 17 Mar 2009, Nick Piggin wrote:

> > > BTW. have you looked at my approach yet? I've tried to solve the fork
> > > vs gup race in yet another way. Don't know if you think it is
> > > palatable.
> >
> > I really think we should be able to fix this without _anything_ like that
> > at all. Just the lock (and some reuse_swap_page() logic changes).
>
> What part of that do you dislike, though?

If you disregard code motion and extra argument to copy_page_range,
my fix is a couple of dozen lines change to existing code, plus the
"decow" function (which could probably share a fair bit of code
with do_wp_page).

Do you dislike the added complexity of the code? Or the behaviour
that gets changed?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Linus Torvalds | 16 Mar 2009 19:17
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Tue, 17 Mar 2009, Nick Piggin wrote:
> 
> If you disregard code motion and extra argument to copy_page_range,
> my fix is a couple of dozen lines change to existing code, plus the
> "decow" function (which could probably share a fair bit of code
> with do_wp_page).
> 
> Do you dislike the added complexity of the code? Or the behaviour
> that gets changed?

The complexity. That decow thing is shit. So is all the extra flags for no 
good reason. 

What's your argument against "keep it simple with a single lock, and 
adding basically a single line to reuse_swap_page() to say "don't reuse 
the page if the count is elevated"?

THAT is simple and elegant, and needs none of the complexity.

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Nick Piggin | 16 Mar 2009 19:33
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Tuesday 17 March 2009 05:17:02 Linus Torvalds wrote:
> On Tue, 17 Mar 2009, Nick Piggin wrote:
> > If you disregard code motion and extra argument to copy_page_range,
> > my fix is a couple of dozen lines change to existing code, plus the
> > "decow" function (which could probably share a fair bit of code
> > with do_wp_page).
> >
> > Do you dislike the added complexity of the code? Or the behaviour
> > that gets changed?
>
> The complexity. That decow thing is shit.

copying the page on fork instead of write protecting it? The code or
the idea? Code can certainly be improved...

> So is all the extra flags for no
> good reason.

Which extra flags are you referring to?

> What's your argument against "keep it simple with a single lock, and
> adding basically a single line to reuse_swap_page() to say "don't reuse
> the page if the count is elevated"?

I made them in a previous message. It depends on what callers you want
to convert I guess. I don't think vmsplice takes to the lock approach
very well though.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
(Continue reading)

Linus Torvalds | 16 Mar 2009 20:22
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Tue, 17 Mar 2009, Nick Piggin wrote:
> 
> > So is all the extra flags for no
> > good reason.
> 
> Which extra flags are you referring to?

Fuck me, didn't you even read your own patch?

What do you call PG_dontcow? 

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Nick Piggin | 17 Mar 2009 06:44
Picon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Tuesday 17 March 2009 06:22:12 Linus Torvalds wrote:
> On Tue, 17 Mar 2009, Nick Piggin wrote:
> > > So is all the extra flags for no
> > > good reason.
> >
> > Which extra flags are you referring to?
>
> Fuck me, didn't you even read your own patch?
>
> What do you call PG_dontcow?

It is a flag, there for a good reason.

It sounded like you were seeing more than one flag, and that
you thought they were useless.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Andrea Arcangeli | 11 Mar 2009 21:48
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Wed, Mar 11, 2009 at 01:19:03PM -0700, Linus Torvalds wrote:
> That said, I don't know who the crazy O_DIRECT users are. It may be true 
> that some O_DIRECT users end up using the same pages over and over again, 
> and that this is a good optimization for them.

If it's done on new pages chances are that gup-fast fast-path can't
run in the first place, modulo glibc memalign re-using previously
freed areas. Overall I think it's worthwhile optimization, to avoid
the locked op in the rewrite case that I think it's common enough.

But I totally agree that it'd be good to benchmark gup-fast on already
instantiated ptes where SetPageGUP will run. I thought it'd be like
below measurement error and not measurable but good to check it.

> The advantage of it is that it fixes the problem not just in one place, 
> but "forever". No hacks about exactly how you access the mappings etc.
> 
> Of course, nothing _really_ solves things. If you do some delayed IO after 
> having looked up the mapping and turned it into a physical page, and the 
> original allocator actually unmaps it (or exits), then the same issue can 
> still happen (well, not the _same_ one - but the very similar issue of the 
> child seeing changes even though the IO was started in the parent). 
> 
> This is why I think any "look up by physical" is fundamentally flawed. It 
> very basically becomes a "I have a secret local TLB that cannot be changed 
> or flushed". And any single-bit solution (GUP) is always going to be 
> fairly broken. 

One of the reasons of not sharing when PG_gup is set and page_count is
shown as pinned, is also to fix all sort of drivers that are doing gup
(Continue reading)

Linus Torvalds | 11 Mar 2009 21:33
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Wed, 11 Mar 2009, Linus Torvalds wrote:
> 
> Agreed. However, I really think this is a O_DIRECT problem. Just document 
> it. Tell people that O_DIRECT simply doesn't work with COW, and 
> fundamentally can never work well.
> 
> If you use O_DIRECT with threading, you had better know what the hell 
> you're doing anyway. I do not think that the kernel should do stupid 
> things just because stupid users don't understand the semantics of the 
> _non-stupid_ thing (which is to just let people think about COW for five 
> seconds).

Btw, if we don't do that, then there are better alternatives. One is:

 - fork already always takes the write lock on mmap_sem (and f*ck no, I 
   doubt anybody will ever care one whit how "parallel" you can do forks 
   from threads, so I don't think this is an issue)

 - Just make the rule be that people who use get_user_pages() always 
   have to have the read-lock on mmap_sem until they've used the pages.

We already take the read-lock for the lookup (well, not for the gup, but 
for all the slow cases), but I'm saying that we could go one step further 
- just read-lock over the _whole_ O_DIRECT read or write. That way you 
literally protect against concurrent fork()s.

		Linus

--
(Continue reading)

Benjamin Herrenschmidt | 14 Mar 2009 06:07

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Wed, 2009-03-11 at 13:33 -0700, Linus Torvalds wrote:
>  - Just make the rule be that people who use get_user_pages() always 
>    have to have the read-lock on mmap_sem until they've used the
> pages.
> 

That's not going to work with IB and friends who gup() whole bunches of
user memory forever...

Ben.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Andrea Arcangeli | 11 Mar 2009 21:55
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Wed, Mar 11, 2009 at 01:33:17PM -0700, Linus Torvalds wrote:
> Btw, if we don't do that, then there are better alternatives. One is:
> 
>  - fork already always takes the write lock on mmap_sem (and f*ck no, I 
>    doubt anybody will ever care one whit how "parallel" you can do forks 
>    from threads, so I don't think this is an issue)
> 
>  - Just make the rule be that people who use get_user_pages() always 
>    have to have the read-lock on mmap_sem until they've used the pages.

How do you handle pages where gup already returned and I/O still in
flight? Forcing gup-fast to be called with mmap_sem already hold (like
gup used to require) only avoids the need of changes in gup-fast
AFAICT. You'll still get pages that are pinned and calling gup-fast
under mmap_sem (no matter if read or even write mode) won't make a
difference, still those pages will be pinned while fork runs and with
dma going to them (by O_DIRECT or some driver using gup, as long as
PageReserved isn't set on them).

> We already take the read-lock for the lookup (well, not for the gup, but 
> for all the slow cases), but I'm saying that we could go one step further 
> - just read-lock over the _whole_ O_DIRECT read or write. That way you 
> literally protect against concurrent fork()s.

Releasing the mmap_sem read mode in the irq-completion handler context
should be possible, however fork will end up throttled blocking for
I/O which isn't very nice behavior. BTW, direct-io.c is a total mess,
I couldn't even figure out where to release those locks in the I/O
completion handlers when I tried something like this with PG_lock
instead of the mmap_sem...  Eventually I gave it up because this isn't
(Continue reading)

Linus Torvalds | 11 Mar 2009 22:28
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Wed, 11 Mar 2009, Andrea Arcangeli wrote:

> On Wed, Mar 11, 2009 at 01:33:17PM -0700, Linus Torvalds wrote:
> > Btw, if we don't do that, then there are better alternatives. One is:
> > 
> >  - fork already always takes the write lock on mmap_sem (and f*ck no, I 
> >    doubt anybody will ever care one whit how "parallel" you can do forks 
> >    from threads, so I don't think this is an issue)
> > 
> >  - Just make the rule be that people who use get_user_pages() always 
> >    have to have the read-lock on mmap_sem until they've used the pages.
> 
> How do you handle pages where gup already returned and I/O still in
> flight?

The rule is:
 - either keep the mmap_sem for reading until the IO is done
 - admit the fact that IO is asynchronous, and has visible async behavior.

> Forcing gup-fast to be called with mmap_sem already hold (like
> gup used to require) only avoids the need of changes in gup-fast
> AFAICT. You'll still get pages that are pinned and calling gup-fast
> under mmap_sem (no matter if read or even write mode) won't make a
> difference, still those pages will be pinned while fork runs and with
> dma going to them (by O_DIRECT or some driver using gup, as long as
> PageReserved isn't set on them).

The point I'm trying to make is that anybody who thinks that pages are 
stable over various behavior that runs in another thread - be it a fork, a 
(Continue reading)

Andrea Arcangeli | 11 Mar 2009 22:57
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Wed, Mar 11, 2009 at 02:28:08PM -0700, Linus Torvalds wrote:
> The fact that the non-fast "get_user_pages()" takes the mmap semaphore for 
> reading doesn't even protect that. It just means that the pages made sense 
> at the time the get_user_pages() happened, not necessarily at the time 
> when the actual use of them did. 

Indeed this is a generic problem, not specific to
get_user_pages_fast. get_user_pages_fast just adds a few complications
to serialize against.

> O_DIRECT is actually the _simple_ case, since we won't be returning until 
> it is done (ie it's not actually a async interface). So no, O_DIRECT 
> doesn't need any interrupt handler games. It would just need to hold the 
> sem over the actual call to the filesystem (ie just over the ->direct_IO() 
> call).

I don't see how you can solve the race by only holding the sem only
over the direct_IO call (and not until the I/O completion handler
fires). I think to solve the race using mmap_sem only, the bio I/O
completion handler that eventually calls into direct-io.c from irq
context would need to up_read(&mmap_sem).

The way my patch avoids to alter the I/O completion path running from
irq context is by ensuring no I/O is going on at all to the pages that
are being shared with the child, and by ensuring that any gup or
gup-fast will trigger cow before it can write to the shared
page. Pages simply can't be shared before I/O is complete.

> People want the relaxed synchronization we give them, and that's literally 
> why get_user_pages_fast exists - because people don't want _more_ 
(Continue reading)

Linus Torvalds | 11 Mar 2009 23:06
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Wed, 11 Mar 2009, Andrea Arcangeli wrote:
> 
> > People want the relaxed synchronization we give them, and that's literally 
> > why get_user_pages_fast exists - because people don't want _more_ 
> > synchronization, they want _less_.
> > 
> > But the thing is, with less synchronization, the behavior really is 
> > surprising in the edge cases. Which is why I think "threaded fork" plus 
> > "get_user_pages_fast" just doesn't make sense to even _worry_ about. If 
> > you use O_DIRECT and mix it with fork, you get what you get, and it's 
> > random - exactly because people who want O_DIRECT don't want any locking. 
> > 
> > It's a user-space issue, not a kernel issue.
> 
> I think your point of view is clear, I sure can write userland code
> that copes it the currently altered memory protection semantics of
> read vs fork if fd is opened with O_DIRECT or drivers using gup, so
> I'll let the userland folks comment on it, some are in CC.

Btw, we could make it easier for people to not screw up.

In particular, "fork()" in a threaded program is almost always wrong. If 
you want to exec another program from a threaded one, you should either 
just do execve() (which kills all threads) or you should do vfork+execve 
(which has none of the COW issues).

An we could add a warning for it. Something like "if this is a threaded 
program, and it has ever used get_user_pages(), and it does a fork(), warn 
about it once". Maybe people would realize what a stupid thing they are 
(Continue reading)

Davide Libenzi | 11 Mar 2009 23:22

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Wed, 11 Mar 2009, Linus Torvalds wrote:

> In particular, "fork()" in a threaded program is almost always wrong. If 
> you want to exec another program from a threaded one, you should either 
> just do execve() (which kills all threads) or you should do vfork+execve 
> (which has none of the COW issues).

Didn't follow the lengthy thread, but if we make fork+exec to fail inside 
a threaded program, we might end up making a lot of people unhappy.

- Davide

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Linus Torvalds | 11 Mar 2009 23:32
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Wed, 11 Mar 2009, Davide Libenzi wrote:
> 
> Didn't follow the lengthy thread, but if we make fork+exec to fail inside 
> a threaded program, we might end up making a lot of people unhappy.

Yeah, no, we don't want to fail it, but we could do a one-time warning or 
something, to at least see who does it and perhaps see if some of them 
might realize the problems.

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Linus Torvalds | 11 Mar 2009 23:07
Gravatar

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]


On Wed, 11 Mar 2009, Linus Torvalds wrote:
> 
> An we could add a warning for it. Something like "if this is a threaded 
> program, and it has ever used get_user_pages(), and it does a fork(), warn 
> about it once". Maybe people would realize what a stupid thing they are 
> doing, and that there is a simple fix (vfork).

Ehh. vfork is only simple if you literally are going to execve. If you are 
using a fork as some kind of odd way to snapshot, I don't know what you 
should do. You can't sanely snapshot a threaded app with fork, but I bet 
some people try.

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Andrea Arcangeli | 11 Mar 2009 20:06
Picon
Favicon

Re: [aarcange <at> redhat.com: [PATCH] fork vs gup(-fast) fix]

On Wed, Mar 11, 2009 at 11:46:17AM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 11 Mar 2009, Andrea Arcangeli wrote:
> 
> > On Wed, Mar 11, 2009 at 10:58:17AM -0700, Linus Torvalds wrote:
> > > As far as I can tell, it's the same old problem that we've always had: if 
> > > you fork(), it's unclear who is going to do the first write - parent or 
> > > child (and "parent" in this case can include any number of threads that 
> > > share the VM, of course).
> > 
> > The child doesn't touch any page. Calling fork just generates O_DIRECT
> > corruption in the parent regardless of what the child does.
> 
> You aren't listening.
> 
> It depends on who does the write. If the _parent_ does the write (with 
> another thread or not), then the _parent_ gets the COW.
> 
> That's all I said.

I only wanted to clarify this doesn't require the child to touch the
page at all.

> If the idiots who use O_DIRECT don't understand that, then hey, it's their 
> problem. I have long been of the opinion that we should not support 
> O_DIRECT at all, and that it's a totally broken premise to start with. 

Well if you don't like it used by databases, O_DIRECT is still ideal for
KVM. Guest caches runs at cpu core speed unlike host cache. Not that
(Continue reading)


Gmane