Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 00/20] Unify TLB gather implementations -v3

It's been a while since I last sent this out, but here goes..

There's no arch left over, I finally got s390 converted too.
The series is compile tested on:

 arm, powerpc64, sparc64, sparc32, s390x, arm, ia64, xtensa

I lack a working toolchain for: sh, avr32
Simply wouldn't build:          mips, parisc 

---
 arch/Kconfig                         |   16 ++
 arch/alpha/include/asm/tlb.h         |    2 -
 arch/arm/Kconfig                     |    1 +
 arch/arm/include/asm/tlb.h           |  183 ++--------------------
 arch/avr32/Kconfig                   |    1 +
 arch/avr32/include/asm/tlb.h         |   11 --
 arch/blackfin/include/asm/tlb.h      |    6 -
 arch/c6x/include/asm/tlb.h           |    2 -
 arch/cris/include/asm/tlb.h          |    1 -
 arch/frv/include/asm/tlb.h           |    5 -
 arch/h8300/include/asm/tlb.h         |   13 --
 arch/hexagon/include/asm/tlb.h       |    5 -
 arch/ia64/Kconfig                    |    1 +
 arch/ia64/include/asm/tlb.h          |  233 +---------------------------
 arch/ia64/include/asm/tlbflush.h     |   25 +++
 arch/ia64/mm/tlb.c                   |   24 +++-
 arch/m32r/include/asm/tlb.h          |    6 -
 arch/m68k/include/asm/tlb.h          |    6 -
 arch/microblaze/include/asm/tlb.h    |    2 -
[...]

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 08/20] mm: Optimize fullmm TLB flushing

This originated from s390 which does something similar and would allow
s390 to use the generic TLB flushing code.

The idea is to flush the mm wide cache and TLB a priori and not bother
with multiple flushes if the batching isn't large enough.

This can be safely done since there cannot be any concurrency on this
mm, it's either after the process died (exit) or in the middle of
execve where the thread switched to the new mm.

Cc: Martin Schwidefsky <schwidefsky <at> de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 mm/memory.c |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -215,16 +215,22 @@ void tlb_gather_mmu(struct mmu_gather *t
 	tlb->active     = &tlb->local;

 	tlb_table_init(tlb);
+
+	if (fullmm) {
+		flush_cache_mm(mm);
+		flush_tlb_mm(mm);
+	}
 }

 void tlb_flush_mmu(struct mmu_gather *tlb)
 {
[...]

Linus Torvalds | 28 Jun 2012 00:26

Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing

On Wed, Jun 27, 2012 at 2:15 PM, Peter Zijlstra <a.p.zijlstra <at> chello.nl> wrote:
> This originated from s390 which does something similar and would allow
> s390 to use the generic TLB flushing code.
>
> The idea is to flush the mm wide cache and TLB a priori and not bother
> with multiple flushes if the batching isn't large enough.
>
> This can be safely done since there cannot be any concurrency on this
> mm, it's either after the process died (exit) or in the middle of
> execve where the thread switched to the new mm.

I think we actually *used* to do the final TLB flush from within the
context of the process that died. That doesn't seem to ever be the
case any more, but it does worry me a bit. Maybe a

   VM_BUG_ON(current->active_mm == mm);

or something for the fullmm case?

              Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>

Peter Zijlstra | 28 Jun 2012 01:02

Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing

On Wed, 2012-06-27 at 15:26 -0700, Linus Torvalds wrote:
> On Wed, Jun 27, 2012 at 2:15 PM, Peter Zijlstra <a.p.zijlstra <at> chello.nl> wrote:
> > This originated from s390 which does something similar and would allow
> > s390 to use the generic TLB flushing code.
> >
> > The idea is to flush the mm wide cache and TLB a priori and not bother
> > with multiple flushes if the batching isn't large enough.
> >
> > This can be safely done since there cannot be any concurrency on this
> > mm, it's either after the process died (exit) or in the middle of
> > execve where the thread switched to the new mm.
> 
> I think we actually *used* to do the final TLB flush from within the
> context of the process that died. That doesn't seem to ever be the
> case any more, but it does worry me a bit. Maybe a
> 
>    VM_BUG_ON(current->active_mm == mm);
> 
> or something for the fullmm case?

OK, added it and am rebooting the test box..


Peter Zijlstra | 28 Jun 2012 01:13

Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing

On Thu, 2012-06-28 at 01:02 +0200, Peter Zijlstra wrote:
> On Wed, 2012-06-27 at 15:26 -0700, Linus Torvalds wrote:
> > On Wed, Jun 27, 2012 at 2:15 PM, Peter Zijlstra <a.p.zijlstra <at> chello.nl> wrote:
> > > This originated from s390 which does something similar and would allow
> > > s390 to use the generic TLB flushing code.
> > >
> > > The idea is to flush the mm wide cache and TLB a priori and not bother
> > > with multiple flushes if the batching isn't large enough.
> > >
> > > This can be safely done since there cannot be any concurrency on this
> > > mm, it's either after the process died (exit) or in the middle of
> > > execve where the thread switched to the new mm.
> > 
> > I think we actually *used* to do the final TLB flush from within the
> > context of the process that died. That doesn't seem to ever be the
> > case any more, but it does worry me a bit. Maybe a
> > 
> >    VM_BUG_ON(current->active_mm == mm);
> > 
> > or something for the fullmm case?
> 
> OK, added it and am rebooting the test box..

That triggered.. is this a problem though, at this point userspace is
very dead so it shouldn't matter, right?

Will have to properly think about it tomorrow, it's 1am, brain is
mostly sleeping already.

------------[ cut here ]------------
[...]

Linus Torvalds | 28 Jun 2012 01:23

Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing

On Wed, Jun 27, 2012 at 4:13 PM, Peter Zijlstra <peterz <at> infradead.org> wrote:
>
> That triggered.. is this a problem though, at this point userspace is
> very dead so it shouldn't matter, right?

It still matters. Even if user space is dead, kernel space accesses
can result in TLB fills in user space. Exactly because of things like
speculative fills etc.

So what can happen - for example - is that the kernel does an indirect
jump, and the CPU predicts the destination of that jump using the
branch prediction tables.

But the branch prediction tables are obviously just predictions, and
they easily contain user addresses etc in them. So the kernel may well
end up speculatively doing a TLB fill on a user access.

And your whole optimization depends on this not happening, unless I
read the logic wrong. The whole "invalidate the TLB just once
up-front" approach is *only* valid if you know that nothing is going
to ever fill that TLB again. But see above - if we're still running
within that TLB context, we have no idea what speculative execution
may or may not end up filling.

That said, maybe I misread your patch?

                   Linus
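
A toy user-space model of the hazard described in this message. It assumes
nothing about real kernel or hardware interfaces: toy_tlb, toy_tlb_fill(),
toy_teardown() and the address 0x1000 are all hypothetical. It only
illustrates why a single up-front flush is insufficient if the CPU can still
fill the TLB from the mm being torn down.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_TLB_SLOTS 4

/* Hypothetical toy TLB: each slot caches one virtual page number. */
struct toy_tlb {
	unsigned long vpn[TOY_TLB_SLOTS];
	bool valid[TOY_TLB_SLOTS];
};

/* Drop every cached translation. */
void toy_tlb_flush(struct toy_tlb *tlb)
{
	for (size_t i = 0; i < TOY_TLB_SLOTS; i++)
		tlb->valid[i] = false;
}

/* Model a hardware fill (demand or speculative) into the first free slot. */
void toy_tlb_fill(struct toy_tlb *tlb, unsigned long vpn)
{
	for (size_t i = 0; i < TOY_TLB_SLOTS; i++) {
		if (!tlb->valid[i]) {
			tlb->vpn[i] = vpn;
			tlb->valid[i] = true;
			return;
		}
	}
	/* TLB full: overwrite slot 0; the replacement policy is irrelevant here. */
	tlb->vpn[0] = vpn;
}

/* Number of translations currently cached. */
size_t toy_tlb_count(const struct toy_tlb *tlb)
{
	size_t n = 0;

	for (size_t i = 0; i < TOY_TLB_SLOTS; i++)
		if (tlb->valid[i])
			n++;
	return n;
}

/*
 * Tear down an address space and return how many translations are still
 * cached afterwards. Any non-zero result is a stale entry.
 */
size_t toy_teardown(struct toy_tlb *tlb, bool cpu_still_on_mm, bool flush_at_end)
{
	toy_tlb_flush(tlb);                 /* the up-front fullmm flush */
	if (cpu_still_on_mm)
		toy_tlb_fill(tlb, 0x1000);  /* speculative fill during teardown */
	/* ... page-table pages would be freed here ... */
	if (flush_at_end)
		toy_tlb_flush(tlb);
	return toy_tlb_count(tlb);
}
```

In this model, toy_teardown(&tlb, true, false) leaves one stale entry, while
either a final flush or a CPU that is no longer running on the mm leaves none.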


Linus Torvalds | 28 Jun 2012 01:33

Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing

On Wed, Jun 27, 2012 at 4:23 PM, Linus Torvalds
<torvalds <at> linux-foundation.org> wrote:
>
> But the branch prediction tables are obviously just predictions, and
> they easily contain user addresses etc in them. So the kernel may well
> end up speculatively doing a TLB fill on a user access.

That should be ".. on a user *address*", hopefully that was clear from
the context, if not from the text.

IOW, the point I'm trying to make is that even if there are zero
*actual* accesses of user space (because user space is dead, and the
kernel hopefully does no "get_user()/put_user()" stuff at this point
any more), the CPU may speculatively use user addresses for the
bog-standard kernel addresses that happen.

Taking a user address from the BTB is just one example. Speculative
memory accesses might happen after a mis-predicted branch, where we
test a pointer against NULL, and after the branch we access it. So
doing a speculative TLB walk of the NULL address would not necessarily
even be unusual. Obviously normally nothing is actually mapped there,
but these kinds of things can *easily* result in the page tables
themselves being cached, even if the final page doesn't exist.

Also, all of this obviously depends on how aggressive the speculation
is. It's entirely possible that effects like these are really hard to
see in practice, and you'll almost never hit it. But stale TLB
contents (or stale page directory caches) are *really* nasty when they
do happen, and almost impossible to debug. So we want to be insanely
anal in this area.
[...]

Catalin Marinas | 28 Jun 2012 11:16

Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing

On Thu, Jun 28, 2012 at 12:33:44AM +0100, Linus Torvalds wrote:
> On Wed, Jun 27, 2012 at 4:23 PM, Linus Torvalds
> <torvalds <at> linux-foundation.org> wrote:
> > But the branch prediction tables are obviously just predictions, and
> > they easily contain user addresses etc in them. So the kernel may well
> > end up speculatively doing a TLB fill on a user access.
> 
> That should be ".. on a user *address*", hopefully that was clear from
> the context, if not from the text.
> 
> IOW, the point I'm trying to make is that even if there are zero
> *actual* accesses of user space (because user space is dead, and the
> kernel hopefully does no "get_user()/put_user()" stuff at this point
> any more), the CPU may speculatively use user addresses for the
> bog-standard kernel addresses that happen.

That's definitely an issue on ARM and it was hit on older kernels.
Basically ARM processors can cache any page translation level in the
TLB. We need to make sure that no page entry at any level (either cached
in the TLB or not) points to an invalid next level table (hence the TLB
shootdown). For example, in cases like free_pgd_range(), if the cached
pgd entry points to an already freed pud/pmd table (pgd_clear is not
enough) it may walk the page tables speculatively and cache another entry in
the TLB. Depending on the random data it reads from an old table page,
it may find a global entry (it's just a bit in the pte) which is not
tagged with an ASID (address space identifier). A later flush_tlb_mm()
only flushes the current ASID and doesn't touch global entries (used
only by kernel mappings). So we end up with a global TLB entry in user
space that overrides any other application mapping.
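
The ASID/global interaction described here can be sketched with a small
user-space model. This is only an illustration under assumed names
(asid_tlb, asid_tlb_hit(), asid_tlb_flush_asid() are hypothetical, not ARM
or kernel interfaces): a per-ASID flush skips entries marked global, so a
bogus global entry picked up from a freed table page survives and matches
lookups from every address space.

```c
#include <stdbool.h>
#include <stddef.h>

#define ASID_TLB_SLOTS 4

/* Hypothetical TLB entry tagged with an ASID unless marked global. */
struct asid_tlb_entry {
	bool valid;
	bool global;        /* global entries match every ASID */
	unsigned int asid;
	unsigned long vpn;
};

struct asid_tlb {
	struct asid_tlb_entry slot[ASID_TLB_SLOTS];
};

/* Returns true if a lookup of vpn under the given ASID hits in the TLB. */
bool asid_tlb_hit(const struct asid_tlb *tlb, unsigned int asid,
		  unsigned long vpn)
{
	for (size_t i = 0; i < ASID_TLB_SLOTS; i++) {
		const struct asid_tlb_entry *e = &tlb->slot[i];

		if (e->valid && e->vpn == vpn && (e->global || e->asid == asid))
			return true;
	}
	return false;
}

/* Per-ASID flush: invalidates non-global entries of one ASID only. */
void asid_tlb_flush_asid(struct asid_tlb *tlb, unsigned int asid)
{
	for (size_t i = 0; i < ASID_TLB_SLOTS; i++) {
		struct asid_tlb_entry *e = &tlb->slot[i];

		if (e->valid && !e->global && e->asid == asid)
			e->valid = false;
	}
}
```

If an entry with the global bit set is cached while ASID 5 is being torn
down, asid_tlb_flush_asid(&tlb, 5) leaves it in place and a later lookup
under any other ASID still hits it.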

[...]

Benjamin Herrenschmidt | 28 Jun 2012 12:39

Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing

On Thu, 2012-06-28 at 10:16 +0100, Catalin Marinas wrote:
> That's definitely an issue on ARM and it was hit on older kernels.
> Basically ARM processors can cache any page translation level in the
> TLB. We need to make sure that no page entry at any level (either cached
> in the TLB or not) points to an invalid next level table (hence the TLB
> shootdown). For example, in cases like free_pgd_range(), if the cached
> pgd entry points to an already freed pud/pmd table (pgd_clear is not
> enough) it may walk the page tables speculatively and cache another entry in
> the TLB. Depending on the random data it reads from an old table page,
> it may find a global entry (it's just a bit in the pte) which is not
> tagged with an ASID (address space identifier). A later flush_tlb_mm()
> only flushes the current ASID and doesn't touch global entries (used
> only by kernel mappings). So we end up with a global TLB entry in user
> space that overrides any other application mapping.

Right, that's the typical scenario. I haven't looked at your flush
implementation though, but surely you can defer the actual freeing so
you can batch them & limit the number of TLB flushes right ?

Cheers,
Ben.


Peter Zijlstra | 28 Jun 2012 12:59

Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing

On Thu, 2012-06-28 at 20:39 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2012-06-28 at 10:16 +0100, Catalin Marinas wrote:
> > That's definitely an issue on ARM and it was hit on older kernels.
> > Basically ARM processors can cache any page translation level in the
> > TLB. We need to make sure that no page entry at any level (either cached
> > in the TLB or not) points to an invalid next level table (hence the TLB
> > shootdown). For example, in cases like free_pgd_range(), if the cached
> > pgd entry points to an already freed pud/pmd table (pgd_clear is not
> > enough) it may walk the page tables speculatively and cache another entry in
> > the TLB. Depending on the random data it reads from an old table page,
> > it may find a global entry (it's just a bit in the pte) which is not
> > tagged with an ASID (address space identifier). A later flush_tlb_mm()
> > only flushes the current ASID and doesn't touch global entries (used
> > only by kernel mappings). So we end up with a global TLB entry in user
> > space that overrides any other application mapping.
> 
> Right, that's the typical scenario. I haven't looked at your flush
> implementation though, but surely you can defer the actual freeing so
> you can batch them & limit the number of TLB flushes right ?

Yes they do.. it's just the up-front TLB invalidate for fullmm that's a
problem.

s390 really wants this so it can avoid the per pte invalidate otherwise
required by ptep_get_and_clear_full().


Catalin Marinas | 28 Jun 2012 16:53

Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing

On Thu, Jun 28, 2012 at 11:59:56AM +0100, Peter Zijlstra wrote:
> On Thu, 2012-06-28 at 20:39 +1000, Benjamin Herrenschmidt wrote:
> > On Thu, 2012-06-28 at 10:16 +0100, Catalin Marinas wrote:
> > > That's definitely an issue on ARM and it was hit on older kernels.
> > > Basically ARM processors can cache any page translation level in the
> > > TLB. We need to make sure that no page entry at any level (either cached
> > > in the TLB or not) points to an invalid next level table (hence the TLB
> > > shootdown). For example, in cases like free_pgd_range(), if the cached
> > > pgd entry points to an already freed pud/pmd table (pgd_clear is not
> > > enough) it may walk the page tables speculatively and cache another entry in
> > > the TLB. Depending on the random data it reads from an old table page,
> > > it may find a global entry (it's just a bit in the pte) which is not
> > > tagged with an ASID (address space identifier). A later flush_tlb_mm()
> > > only flushes the current ASID and doesn't touch global entries (used
> > > only by kernel mappings). So we end up with a global TLB entry in user
> > > space that overrides any other application mapping.
> > 
> > Right, that's the typical scenario. I haven't looked at your flush
> > implementation though, but surely you can defer the actual freeing so
> > you can batch them & limit the number of TLB flushes right ?
> 
> Yes they do.. it's just the up-front TLB invalidate for fullmm that's a
> problem.

The upfront invalidate is fine (i.e. harmless), it's the tlb_flush_mmu()
change to check for !tlb->fullmm that's not helpful on ARM.

--

-- 
Catalin


Peter Zijlstra | 28 Jun 2012 18:20

Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing

On Thu, 2012-06-28 at 15:53 +0100, Catalin Marinas wrote:

> > Yes they do.. it's just the up-front TLB invalidate for fullmm that's a
> > problem.
> 
> The upfront invalidate is fine (i.e. harmless), it's the tlb_flush_mmu()
> change to check for !tlb->fullmm that's not helpful on ARM.

I think we're saying the same but differently. The point is that the
flush up front isn't sufficient for most of us.

Also, we'd very much want to avoid superfluous flushes since they are
somewhat expensive.

How horrid is something like the below. It detaches the mm so that
hardware speculation simply doesn't matter.

Now the switch_mm should imply the same cache+TLB flush we'd otherwise
do, and I'd think that that would be the majority of the cost. Am I
wrong there?

Also, the below seems to leak mm_structs so I did mess up the
ref-counting, it's too bloody hot here.

---
 mm/memory.c |   51 +++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 47 insertions(+), 4 deletions(-)
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -65,6 +65,7 @@
[...]

Peter Zijlstra | 28 Jun 2012 18:38

Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing

On Thu, 2012-06-28 at 18:20 +0200, Peter Zijlstra wrote:
> Now the switch_mm should imply the same cache+TLB flush we'd otherwise
> do, and I'd think that that would be the majority of the cost. Am I
> wrong there? 

The advantage of doing this is that you don't need any of the batching
and possibly multiple invalidate nonsense you otherwise need. So it
might still be an over-all win, even if the switch is slightly more
expensive than a regular flush. Simply because you can avoid most (if
not all) the usual complexities.


Linus Torvalds | 28 Jun 2012 18:45

Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing

On Thu, Jun 28, 2012 at 9:20 AM, Peter Zijlstra <peterz <at> infradead.org> wrote:
>
> How horrid is something like the below. It detaches the mm so that
> hardware speculation simply doesn't matter.

Actually, that's wrong. Even when detached, kernel threads may still
use that mm lazily. Now, that only happens on other CPU's (if any
scheduling happens on *this* CPU, they will lazily take the mm of the
thread it scheduled away from), but even if you detach the VM that
doesn't mean that hardware speculation wouldn't matter. Kernel threads
on other CPU's may still be doing TLB accesses.

Of course, I *think* that if we do an IPI on the thing, we also kick
those kernel threads out of using that mm. So it may actually work if
you also do that explicit TLB flush to make sure other CPU's don't
have this MM. I don't think switch_mm() does that for you, it only
does a local-cpu invalidate.

I didn't look at the code, though. Maybe I'm wrong in thinking that
you are wrong.

                Linus
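
The lazy-mm concern can be illustrated with a small model. This is a sketch
under stated assumptions, not kernel code: lazy_cpus, lazy_switch_local()
and lazy_kick_all() are hypothetical names, mm identities are plain
integers, and LAZY_INIT_MM stands in for init_mm. It shows that a
local-only switch leaves other CPUs still pointing at the mm, whereas an
explicit cross-CPU kick (the IPI mentioned above) removes every remaining
user.

```c
#include <stdbool.h>
#include <stddef.h>

#define LAZY_NR_CPUS 4
#define LAZY_INIT_MM 0  /* hypothetical stand-in for init_mm */

/* Hypothetical per-CPU state: the mm id each CPU's MMU currently uses. */
struct lazy_cpus {
	int active_mm[LAZY_NR_CPUS];
};

/* Local-only switch: affects just the calling CPU. */
void lazy_switch_local(struct lazy_cpus *c, size_t cpu, int mm)
{
	c->active_mm[cpu] = mm;
}

/* Models an IPI that moves every CPU still on `mm` over to the kernel mm. */
void lazy_kick_all(struct lazy_cpus *c, int mm)
{
	for (size_t cpu = 0; cpu < LAZY_NR_CPUS; cpu++)
		if (c->active_mm[cpu] == mm)
			c->active_mm[cpu] = LAZY_INIT_MM;
}

/* Count CPUs whose hardware could still walk the page tables of `mm`. */
size_t lazy_users(const struct lazy_cpus *c, int mm)
{
	size_t n = 0;

	for (size_t cpu = 0; cpu < LAZY_NR_CPUS; cpu++)
		if (c->active_mm[cpu] == mm)
			n++;
	return n;
}
```

With CPUs 0 and 1 both on mm 42, detaching only CPU 0 still leaves one
user; lazy_kick_all() is what brings the count to zero.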


Peter Zijlstra | 28 Jun 2012 12:55

Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing

On Wed, 2012-06-27 at 16:33 -0700, Linus Torvalds wrote:
> IOW, the point I'm trying to make is that even if there are zero
> *actual* accesses of user space (because user space is dead, and the
> kernel hopefully does no "get_user()/put_user()" stuff at this point
> any more), the CPU may speculatively use user addresses for the
> bog-standard kernel addresses that happen. 

Right.. and s390 having done this only says that s390 appears to be ok
with it. Martin, does s390 hardware guarantee no speculative stuff like
Linus explained, or might there even be a latent issue on s390?

But it looks like we cannot do this in general, and esp. ARM (as already
noted by Catalin) has very aggressive speculative behaviour.

The alternative is that we do a switch_mm() to init_mm instead of the
TLB flush. On x86 that should be about the same cost, but I've not
looked at other architectures yet.

The second and least favourite alternative is of course special casing
this for s390 if it turns out it's a safe thing to do for them.

/me goes look through arch code.


Martin Schwidefsky | 28 Jun 2012 13:19

Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing

On Thu, 28 Jun 2012 12:55:04 +0200
Peter Zijlstra <peterz <at> infradead.org> wrote:

> On Wed, 2012-06-27 at 16:33 -0700, Linus Torvalds wrote:
> > IOW, the point I'm trying to make is that even if there are zero
> > *actual* accesses of user space (because user space is dead, and the
> > kernel hopefully does no "get_user()/put_user()" stuff at this point
> > any more), the CPU may speculatively use user addresses for the
> > bog-standard kernel addresses that happen. 
> 
> Right.. and s390 having done this only says that s390 appears to be ok
> with it. Martin, does s390 hardware guarantee no speculative stuff like
> Linus explained, or might there even be a latent issue on s390?

The cpu can create speculative TLB entries, but only if it runs in the
mode that uses the respective mm. We have two mm's active at the same
time, the kernel mm (init_mm) and the user mm. While the cpu runs only
in kernel mode it is not allowed to create TLBs for the user mm.
While running in user mode it is allowed to speculatively create TLBs.

> But it looks like we cannot do this in general, and esp. ARM (as already
> noted by Catalin) has very aggressive speculative behaviour.
> 
> The alternative is that we do a switch_mm() to init_mm instead of the
> TLB flush. On x86 that should be about the same cost, but I've not
> looked at other architectures yet.
> 
> The second and least favourite alternative is of course special casing
> this for s390 if it turns out it's a safe thing to do for them.
> 
[...]

Peter Zijlstra | 28 Jun 2012 13:30

Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing

On Thu, 2012-06-28 at 13:19 +0200, Martin Schwidefsky wrote:

> The cpu can create speculative TLB entries, but only if it runs in the
> mode that uses the respective mm. We have two mm's active at the same
> time, the kernel mm (init_mm) and the user mm. While the cpu runs only
> in kernel mode it is not allowed to create TLBs for the user mm.
> While running in user mode it is allowed to speculatively create TLBs.

OK, that's neat.

> Basically we have two special requirements on s390:
> 1) do not modify ptes while attached to another cpu except with the
>    special IPTE / IDTE instructions

Right, and your fullmm case works by doing a global invalidate after all
threads have ceased userspace execution; this allows you to do away with
the IPTE/IDTE instructions since there are no other active CPUs on the
userspace mm anymore.

> 2) do a TLB flush before freeing any kind of page table page, s390
>    needs a flush for pud, pmd & pte tables. 

Right, we do that (now)..


Avi Kivity | 28 Jun 2012 18:00

Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing

On 06/28/2012 02:30 PM, Peter Zijlstra wrote:
> On Thu, 2012-06-28 at 13:19 +0200, Martin Schwidefsky wrote:
> 
>> The cpu can create speculative TLB entries, but only if it runs in the
>> mode that uses the respective mm. We have two mm's active at the same
>> time, the kernel mm (init_mm) and the user mm. While the cpu runs only
>> in kernel mode it is not allowed to create TLBs for the user mm.
>> While running in user mode it is allowed to speculatively create TLBs.
> 
> OK, that's neat.

Note that we can do that for x86 now using the new PCID feature.
Basically you get a tagged TLB, so you can switch between the
kernel-only address space and the kernel+user address space quickly.

It's still going to be slower than what we do now, but it might please
some security people if the kernel can't accidentally access user data.

--

-- 
error compiling committee.c: too many arguments to function


Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 04/20] mm, s390: use generic RCU page-table freeing code

Now that we fixed the problem that caused the revert cd94154cc6a
("[S390] fix tlb flushing for page table pages") of the original
36409f6353fc2 ("[S390] use generic RCU page-table freeing code"), we
can revert the revert.

Original-patch-by: Martin Schwidefsky <schwidefsky <at> de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/s390/Kconfig               |    1 
 arch/s390/include/asm/pgalloc.h |    3 +
 arch/s390/include/asm/tlb.h     |   22 +++++++++++++
 arch/s390/mm/pgtable.c          |   63 +---------------------------------------
 4 files changed, 28 insertions(+), 61 deletions(-)

--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -84,6 +84,7 @@ config S390
 	select HAVE_KERNEL_XZ
 	select HAVE_ARCH_MUTEX_CPU_RELAX
 	select HAVE_ARCH_JUMP_LABEL if !MARCH_G5
+	select HAVE_RCU_TABLE_FREE if SMP
 	select ARCH_SAVE_PAGE_KEYS if HIBERNATION
 	select HAVE_MEMBLOCK
 	select HAVE_MEMBLOCK_NODE_MAP
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -22,7 +22,10 @@ void crst_table_free(struct mm_struct *,

 unsigned long *page_table_alloc(struct mm_struct *, unsigned long);
 void page_table_free(struct mm_struct *, unsigned long *);
[...]

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 06/20] mm, sparc64: Dont use tlb_flush for external tlb flushes

Both sparc64 and powerpc64 use tlb_flush() to flush their respective
hash-tables which is entirely different from what
flush_tlb_range()/flush_tlb_mm() would do.

Powerpc64 already uses arch_*_lazy_mmu_mode() to batch and flush these
so any tlb_flush() caller should already find an empty batch, make
sparc64 do the same.

This ensures all platforms now have a tlb_flush() implementation that
is either flush_tlb_mm() or flush_tlb_range().
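
The batching idea described above can be sketched as a toy user-space
model. Everything here is an assumption made for illustration (hash_batch,
batch_enter_lazy(), batch_leave_lazy(), batch_add() and BATCH_MAX are
hypothetical and do not mirror the real sparc64 or powerpc64 code): pending
invalidations are queued while in lazy MMU mode and drained on leaving it,
so a later tlb_flush() caller would find the batch already empty.

```c
#include <stdbool.h>
#include <stddef.h>

#define BATCH_MAX 8

/* Hypothetical per-CPU batch of pending hash-table invalidations. */
struct hash_batch {
	bool active;                    /* currently inside lazy MMU mode? */
	size_t nr;                      /* entries queued but not yet flushed */
	unsigned long vaddr[BATCH_MAX];
	size_t flushed;                 /* total entries flushed so far */
};

/* Push all queued invalidations out and empty the batch. */
void batch_flush(struct hash_batch *b)
{
	b->flushed += b->nr;
	b->nr = 0;
}

/* Entering lazy mode only marks the batch as active. */
void batch_enter_lazy(struct hash_batch *b)
{
	b->active = true;
}

/* Leaving lazy mode drains whatever is still queued. */
void batch_leave_lazy(struct hash_batch *b)
{
	if (b->nr)
		batch_flush(b);
	b->active = false;
}

/* Queue while in lazy mode; flush at once otherwise or when full. */
void batch_add(struct hash_batch *b, unsigned long vaddr)
{
	b->vaddr[b->nr++] = vaddr;
	if (!b->active || b->nr == BATCH_MAX)
		batch_flush(b);
}
```

In this model the batch is always empty outside lazy mode, which is the
property the commit message relies on.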

Cc: David Miller <davem <at> davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/sparc/Makefile                  |    1 +
 arch/sparc/include/asm/tlb_64.h      |    2 +-
 arch/sparc/include/asm/tlbflush_64.h |   11 +++++++++++
 3 files changed, 13 insertions(+), 1 deletion(-)
--- a/arch/sparc/Makefile
+++ b/arch/sparc/Makefile
@@ -37,6 +37,7 @@ LDFLAGS       := -m elf64_sparc
 export BITS   := 64
 UTS_MACHINE   := sparc64

+KBUILD_CPPFLAGS += -D__HAVE_ARCH_ENTER_LAZY_MMU_MODE
 KBUILD_CFLAGS += -m64 -pipe -mno-fpu -mcpu=ultrasparc -mcmodel=medlow
 KBUILD_CFLAGS += -ffixed-g4 -ffixed-g5 -fcall-used-g7 -Wno-sign-compare
 KBUILD_CFLAGS += -Wa,--undeclared-regs
--- a/arch/sparc/include/asm/tlb_64.h
+++ b/arch/sparc/include/asm/tlb_64.h
[...]

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 01/20] mm, x86: Add HAVE_RCU_TABLE_FREE support

Implements optional HAVE_RCU_TABLE_FREE support for x86.

This is useful for things like Xen and KVM where paravirt tlb flush
means the software page table walkers like GUP-fast cannot rely on
IRQs disabling like regular x86 can.

Cc: Nikunj A Dadhania <nikunj <at> linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy <at> goop.org>
Cc: Avi Kivity <avi <at> redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/x86/include/asm/tlb.h |    1 +
 arch/x86/mm/pgtable.c      |    6 +++---
 include/asm-generic/tlb.h  |    9 +++++++++
 3 files changed, 13 insertions(+), 3 deletions(-)
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_X86_TLB_H
 #define _ASM_X86_TLB_H

+#define __tlb_remove_table(table) free_page_and_swap_cache(table)
 #define tlb_start_vma(tlb, vma) do { } while (0)
 #define tlb_end_vma(tlb, vma) do { } while (0)
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -51,21 +51,21 @@ void ___pte_free_tlb(struct mmu_gather *
 {
 	pgtable_page_dtor(pte);
[...]

Peter Zijlstra | 27 Jun 2012 23:16

[PATCH 20/20] mm, xtensa: Convert xtensa to generic tlb

Cc: Chris Zankel <chris <at> zankel.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/xtensa/Kconfig           |    1 +
 arch/xtensa/include/asm/tlb.h |   23 -----------------------
 arch/xtensa/mm/tlb.c          |    2 +-
 3 files changed, 2 insertions(+), 24 deletions(-)
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -10,6 +10,7 @@ config XTENSA
 	select HAVE_GENERIC_HARDIRQS
 	select GENERIC_IRQ_SHOW
 	select GENERIC_CPU_DEVICES
+	select HAVE_MMU_GATHER_RANGE
 	help
 	  Xtensa processors are 32-bit RISC machines designed by Tensilica
 	  primarily for embedded systems.  These processors are both
--- a/arch/xtensa/include/asm/tlb.h
+++ b/arch/xtensa/include/asm/tlb.h
@@ -14,29 +14,6 @@
 #include <asm/cache.h>
 #include <asm/page.h>

-#if (DCACHE_WAY_SIZE <= PAGE_SIZE)
-
-/* Note, read http://lkml.org/lkml/2004/1/15/6 */
-
-# define tlb_start_vma(tlb,vma)			do { } while (0)
-# define tlb_end_vma(tlb,vma)			do { } while (0)
-
[...]

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 13/20] mm, ia64: Convert ia64 to generic tlb

Cc: Tony Luck <tony.luck <at> intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/ia64/Kconfig                |    1 
 arch/ia64/include/asm/tlb.h      |  233 ---------------------------------------
 arch/ia64/include/asm/tlbflush.h |   25 ++++
 arch/ia64/mm/tlb.c               |   24 +++-
 4 files changed, 49 insertions(+), 234 deletions(-)
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -28,6 +28,7 @@ config IA64
 	select ARCH_DISCARD_MEMBLOCK
 	select GENERIC_IRQ_PROBE
 	select GENERIC_PENDING_IRQ if SMP
+	select HAVE_MMU_GATHER_RANGE
 	select IRQ_PER_CPU
 	select GENERIC_IRQ_SHOW
 	select ARCH_WANT_OPTIONAL_GPIOLIB
--- a/arch/ia64/include/asm/tlb.h
+++ b/arch/ia64/include/asm/tlb.h
@@ -46,238 +46,9 @@
 #include <asm/tlbflush.h>
 #include <asm/machvec.h>

-#ifdef CONFIG_SMP
-# define tlb_fast_mode(tlb)	((tlb)->nr == ~0U)
-#else
-# define tlb_fast_mode(tlb)	(1)
-#endif
-
[...]

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 16/20] mm, avr32: Convert avr32 to generic tlb

Cc: Hans-Christian Egtvedt <hans-christian.egtvedt <at> atmel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/avr32/Kconfig           |    1 +
 arch/avr32/include/asm/tlb.h |    6 ------
 2 files changed, 1 insertion(+), 6 deletions(-)
--- a/arch/avr32/Kconfig
+++ b/arch/avr32/Kconfig
@@ -14,6 +14,7 @@ config AVR32
 	select ARCH_HAVE_CUSTOM_GPIO_H
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select GENERIC_CLOCKEVENTS
+	select HAVE_MMU_GATHER_RANGE
 	help
 	  AVR32 is a high-performance 32-bit RISC microprocessor core,
 	  designed for cost-sensitive embedded applications, with particular
--- a/arch/avr32/include/asm/tlb.h
+++ b/arch/avr32/include/asm/tlb.h
@@ -8,12 +8,6 @@
 #ifndef __ASM_AVR32_TLB_H
 #define __ASM_AVR32_TLB_H

-#define tlb_start_vma(tlb, vma) \
-	flush_cache_range(vma, vma->vm_start, vma->vm_end)
-
-#define tlb_end_vma(tlb, vma) \
-	flush_tlb_range(vma, vma->vm_start, vma->vm_end)
-
 #define __tlb_remove_tlb_entry(tlb, pte, address) do { } while(0)

[...]

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 17/20] mm, mips: Convert mips to generic tlb

Cc: Ralf Baechle <ralf <at> linux-mips.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/mips/Kconfig           |    1 +
 arch/mips/include/asm/tlb.h |   10 ----------
 2 files changed, 1 insertion(+), 10 deletions(-)
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -34,6 +34,7 @@ config MIPS
 	select BUILDTIME_EXTABLE_SORT
 	select GENERIC_CLOCKEVENTS
 	select GENERIC_CMOS_UPDATE
+	select HAVE_MMU_GATHER_RANGE

 menu "Machine selection"

--- a/arch/mips/include/asm/tlb.h
+++ b/arch/mips/include/asm/tlb.h
@@ -1,16 +1,6 @@
 #ifndef __ASM_TLB_H
 #define __ASM_TLB_H

-/*
- * MIPS doesn't need any special per-pte or per-vma handling, except
- * we need to flush cache for area to be unmapped.
- */
-#define tlb_start_vma(tlb, vma) 				\
-	do {							\
-		if (!tlb->fullmm)				\
-			flush_cache_range(vma, vma->vm_start, vma->vm_end); \
[...]

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 03/20] mm, tlb: Remove a few #ifdefs


Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 include/asm-generic/tlb.h |   85 ++++++++++++++++++++++++++--------------------
 mm/memory.c               |    6 ---
 2 files changed, 50 insertions(+), 41 deletions(-)
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -21,6 +21,40 @@

 static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page);

+/*
+ * If we can't allocate a page to make a big batch of page pointers
+ * to work on, then just handle a few from the on-stack structure.
+ */
+#define MMU_GATHER_BUNDLE	8
+
+struct mmu_gather_batch {
+	struct mmu_gather_batch	*next;
+	unsigned int		nr;
+	unsigned int		max;
+	struct page		*pages[0];
+};
+
+#define MAX_GATHER_BATCH	\
+	((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))
+
+/* struct mmu_gather is an opaque type used by the mm code for passing around
+ * any data needed by arch specific code for tlb_remove_page.
[...]
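
Editorial sketch, not part of the posted patch: the batch sizing in the hunk above can be tried in userspace. The block below copies the struct layout and MAX_GATHER_BATCH from the hunk; SKETCH_PAGE_SIZE (assumed to be 4 KiB) and the sketch_batch_*() helpers are hypothetical names introduced only for illustration, and the real allocation path in mm/memory.c is not shown in this excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption: 4 KiB pages. The kernel gets PAGE_SIZE from the arch headers. */
#define SKETCH_PAGE_SIZE	4096UL

struct page;			/* opaque in this sketch */

/* Same layout as in the hunk, written with a C99 flexible array member. */
struct mmu_gather_batch {
	struct mmu_gather_batch	*next;
	unsigned int		nr;
	unsigned int		max;
	struct page		*pages[];
};

/* Page pointers that fit in one page after the batch header. */
#define MAX_GATHER_BATCH	\
	((SKETCH_PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))

/* Hypothetical helper: allocate one page-sized, empty batch (NULL on failure). */
static inline struct mmu_gather_batch *sketch_batch_alloc(void)
{
	struct mmu_gather_batch *b = malloc(SKETCH_PAGE_SIZE);

	if (!b)
		return NULL;
	b->next = NULL;
	b->nr = 0;
	b->max = (unsigned int)MAX_GATHER_BATCH;
	return b;
}

/* Hypothetical helper: queue a page; returns 0 once the batch is full. */
static inline int sketch_batch_add(struct mmu_gather_batch *b, struct page *page)
{
	assert(b != NULL);
	if (b->nr >= b->max)
		return 0;
	b->pages[b->nr++] = page;
	return 1;
}
```

On an LP64 target the header is 16 bytes, so one batch holds (4096 - 16) / 8 = 510 page pointers, compared with the 8 entries of the on-stack MMU_GATHER_BUNDLE fallback.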

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 07/20] mm, arch: Remove tlb_flush()

Since the tlb_flush() implementation of every asm-generic/tlb.h user is
now either a nop or flush_tlb_mm(), remove it and make the generic
code use flush_tlb_mm() directly.

Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/alpha/include/asm/tlb.h      |    2 --
 arch/arm/include/asm/tlb.h        |    2 --
 arch/avr32/include/asm/tlb.h      |    5 -----
 arch/blackfin/include/asm/tlb.h   |    6 ------
 arch/c6x/include/asm/tlb.h        |    2 --
 arch/cris/include/asm/tlb.h       |    1 -
 arch/frv/include/asm/tlb.h        |    5 -----
 arch/h8300/include/asm/tlb.h      |   13 -------------
 arch/hexagon/include/asm/tlb.h    |    5 -----
 arch/m32r/include/asm/tlb.h       |    6 ------
 arch/m68k/include/asm/tlb.h       |    6 ------
 arch/microblaze/include/asm/tlb.h |    2 --
 arch/mips/include/asm/tlb.h       |    5 -----
 arch/mn10300/include/asm/tlb.h    |    5 -----
 arch/openrisc/include/asm/tlb.h   |    1 -
 arch/parisc/include/asm/tlb.h     |    5 -----
 arch/powerpc/include/asm/tlb.h    |    2 --
 arch/powerpc/mm/tlb_hash32.c      |   15 ---------------
 arch/powerpc/mm/tlb_hash64.c      |    4 ----
 arch/powerpc/mm/tlb_nohash.c      |    5 -----
 arch/score/include/asm/tlb.h      |    1 -
 arch/sh/include/asm/tlb.h         |    1 -
 arch/sparc/include/asm/tlb_32.h   |    5 -----
 arch/sparc/include/asm/tlb_64.h   |    1 -
[...]

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 18/20] mm, parisc: Convert parisc to generic tlb

Cc: Kyle McMartin <kyle <at> mcmartin.ca>
Cc: James Bottomley <jejb <at> parisc-linux.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/parisc/Kconfig           |    1 +
 arch/parisc/include/asm/tlb.h |   10 ----------
 2 files changed, 1 insertion(+), 10 deletions(-)
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -19,6 +19,7 @@ config PARISC
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select GENERIC_SMP_IDLE_THREAD
 	select GENERIC_STRNCPY_FROM_USER
+	select HAVE_MMU_GATHER_RANGE

 	help
 	  The PA-RISC microprocessor is designed by Hewlett-Packard and used
--- a/arch/parisc/include/asm/tlb.h
+++ b/arch/parisc/include/asm/tlb.h
@@ -1,16 +1,6 @@
 #ifndef _PARISC_TLB_H
 #define _PARISC_TLB_H

-#define tlb_start_vma(tlb, vma) \
-do {	if (!(tlb)->fullmm)	\
-		flush_cache_range(vma, vma->vm_start, vma->vm_end); \
-} while (0)
-
-#define tlb_end_vma(tlb, vma)	\
-do {	if (!(tlb)->fullmm)	\
[...]

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 19/20] mm, sparc32: Convert sparc32 to generic tlb

Cc: David Miller <davem <at> davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/sparc/Kconfig              |    1 +
 arch/sparc/include/asm/tlb_32.h |   10 ----------
 2 files changed, 1 insertion(+), 10 deletions(-)
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -41,6 +41,7 @@ config SPARC32
 	def_bool !64BIT
 	select GENERIC_ATOMIC64
 	select CLZ_TAB
+	select HAVE_MMU_GATHER_RANGE

 config SPARC64
 	def_bool 64BIT
--- a/arch/sparc/include/asm/tlb_32.h
+++ b/arch/sparc/include/asm/tlb_32.h
@@ -1,16 +1,6 @@
 #ifndef _SPARC_TLB_H
 #define _SPARC_TLB_H

-#define tlb_start_vma(tlb, vma) \
-do {								\
-	flush_cache_range(vma, vma->vm_start, vma->vm_end);	\
-} while (0)
-
-#define tlb_end_vma(tlb, vma) \
-do {								\
-	flush_tlb_range(vma, vma->vm_start, vma->vm_end);	\
[...]

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 12/20] mm, arm: Convert arm to generic tlb

Might want to optimize the tlb_flush() function to do a full mm flush
when the range is 'large'; IA64 does this too.

Cc: Russell King <rmk <at> arm.linux.org.uk>
Fixes-by: Catalin Marinas <catalin.marinas <at> arm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/arm/Kconfig           |    1 
 arch/arm/include/asm/tlb.h |  181 +++------------------------------------------
 include/asm-generic/tlb.h  |    4 
 3 files changed, 19 insertions(+), 167 deletions(-)
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -45,6 +45,7 @@ config ARM
 	select GENERIC_SMP_IDLE_THREAD
 	select KTIME_SCALAR
 	select GENERIC_CLOCKEVENTS_BROADCAST if SMP
+	select HAVE_MMU_GATHER_RANGE if MMU
 	help
 	  The ARM series is a line of low-power-consumption RISC chip designs
 	  licensed by ARM Ltd and targeted at embedded applications and
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -27,183 +27,37 @@

 #else /* !CONFIG_MMU */

-#include <linux/swap.h>
-#include <asm/pgalloc.h>
-#include <asm/tlbflush.h>
[...]

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 15/20] mm, um: Convert um to generic tlb

Cc: Jeff Dike <jdike <at> addtoit.com>
Cc: Richard Weinberger <richard <at> nod.at>
Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/um/Kconfig.common    |    1 
 arch/um/include/asm/tlb.h |  111 +---------------------------------------------
 arch/um/kernel/tlb.c      |   13 -----
 3 files changed, 4 insertions(+), 121 deletions(-)
--- a/arch/um/Kconfig.common
+++ b/arch/um/Kconfig.common
@@ -11,6 +11,7 @@ config UML
 	select GENERIC_CPU_DEVICES
 	select GENERIC_IO
 	select GENERIC_CLOCKEVENTS
+	select HAVE_MMU_GATHER_RANGE

 config MMU
 	bool
--- a/arch/um/include/asm/tlb.h
+++ b/arch/um/include/asm/tlb.h
@@ -7,114 +7,9 @@
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>

-#define tlb_start_vma(tlb, vma) do { } while (0)
-#define tlb_end_vma(tlb, vma) do { } while (0)
-#define tlb_flush(tlb) flush_tlb_mm((tlb)->mm)
-
-/* struct mmu_gather is an opaque type used by the mm code for passing around
- * any data needed by arch specific code for tlb_remove_page.
[...]

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 05/20] mm, powerpc: Dont use tlb_flush for external tlb flushes

Both sparc64 and powerpc64 use tlb_flush() to flush their respective
hash-tables which is entirely different from what
flush_tlb_range()/flush_tlb_mm() would do.

Powerpc64 already uses arch_*_lazy_mmu_mode() to batch and flush these
so any tlb_flush() caller should already find an empty batch. So
remove this functionality from tlb_flush().

Cc: Benjamin Herrenschmidt <benh <at> kernel.crashing.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/powerpc/mm/tlb_hash64.c |   10 ----------
 1 file changed, 10 deletions(-)
--- a/arch/powerpc/mm/tlb_hash64.c
+++ b/arch/powerpc/mm/tlb_hash64.c
@@ -155,16 +155,6 @@ void __flush_tlb_pending(struct ppc64_tl

 void tlb_flush(struct mmu_gather *tlb)
 {
-	struct ppc64_tlb_batch *tlbbatch = &get_cpu_var(ppc64_tlb_batch);
-
-	/* If there's a TLB batch pending, then we must flush it because the
-	 * pages are going to be freed and we really don't want to have a CPU
-	 * access a freed page because it has a stale TLB
-	 */
-	if (tlbbatch->index)
-		__flush_tlb_pending(tlbbatch);
-
-	put_cpu_var(ppc64_tlb_batch);
 }
[...]

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing

From: Nikunj A. Dadhania <nikunj <at> linux.vnet.ibm.com>

Certain architectures (viz. x86, arm, s390) have hardware page-table
walkers (#PF). So during the RCU page-table teardown process make sure
we do a tlb flush of page-table pages on all relevant CPUs to
synchronize against hardware walkers, and then free the pages.

Moreover, the (mm_users < 2) condition does not hold for the above
architectures, as the hardware engine is one of the users.

This patch should also make the generic RCU page-table freeing code
suitable for s390 again since it fixes the issues raised in
cd94154cc6a ("[S390] fix tlb flushing for page table pages").

Cc: Martin Schwidefsky <schwidefsky <at> de.ibm.com>
Suggested-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
Signed-off-by: Nikunj A. Dadhania <nikunj <at> linux.vnet.ibm.com>
[ Edited Kconfig bit ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/Kconfig |   13 +++++++++++++
 mm/memory.c  |   23 +++++++++++++++++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -231,6 +231,19 @@ config HAVE_ARCH_MUTEX_CPU_RELAX
 config HAVE_RCU_TABLE_FREE
 	bool

+config HAVE_HW_PAGE_TABLE_WALKS
[...]

Linus Torvalds | 28 Jun 2012 00:23

Re: [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing

On Wed, Jun 27, 2012 at 2:15 PM, Peter Zijlstra <a.p.zijlstra <at> chello.nl> wrote:
>
> Certain architectures (viz. x86, arm, s390) have hardware page-table
> walkers (#PF). So during the RCU page-table teardown process make sure
> we do a tlb flush of page-table pages on all relevant CPUs to
> synchronize against hardware walkers, and then free the pages.

NACK.

Why would hw page table walkers be that special? Plus your config
option is horribly done anyway, where you do it as some kind of
"default y" and then have complex conditionals on it.

Plus it really isn't about hardware page table walkers at all. It's
more about the possibility of speculative TLB fills, it has nothing to
do with *how* they are done. Sure, it's likely that a software
pagetable walker wouldn't be something that gets called speculatively,
but it's not out of the question.

So I think your config option is totally mis-designed and actively
misleading. It's also horrible from a design standpoint, since it's
entirely possible that some day POWERPC will actually see the light
and do speculative TLB fills etc.

So *if* this needs to be done, it needs to be done right. That means:

 - don't talk about HW walking, since it's not about that

 - don't say "if you have speculative walkers", and use an ifndef. Say
"If you can *guarantee* that nothing else walks page tables
[...]

Peter Zijlstra | 28 Jun 2012 01:01

Re: [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing

On Wed, 2012-06-27 at 15:23 -0700, Linus Torvalds wrote:

> Plus it really isn't about hardware page table walkers at all. It's
> more about the possibility of speculative TLB fills, it has nothing to
> do with *how* they are done. Sure, it's likely that a software
> pagetable walker wouldn't be something that gets called speculatively,
> but it's not out of the question.
> 
Hmm, I would call gup_fast() as speculative as we can get in software.
It does a lock-less walk of the page-tables. That's what the RCU free'd
page-table stuff is for to begin with.
> 
> IOW, if Sparc/PPC really want to guarantee that they never fill TLB
> entries speculatively, and that if we are in a kernel thread they will
> *never* fill the TLB with anything else, then make them enable
> CONFIG_STRICT_TLB_FILL or something in their architecture Kconfig
> files. 

Since we've dealt with the speculative software side by using RCU-ish
stuff, the only thing that's left is hardware, now neither sparc64 nor
ppc actually know about the linux page-tables from what I understood,
they only look at their hash-table thing.

So even if the hardware did do speculative tlb fills, it would do them
from the hash-table, but that's already cleared out.

How about something like this

---
Subject: mm: Add missing TLB invalidate to RCU page-table freeing
[...]

Linus Torvalds | 28 Jun 2012 01:42

Re: [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing

On Wed, Jun 27, 2012 at 4:01 PM, Peter Zijlstra <a.p.zijlstra <at> chello.nl> wrote:
>
> How about something like this

Looks better.

I'd be even happier if you made the whole

  "When there's less then two users.."

(There's a misspelling there, btw, I didn't notice until I
cut-and-pasted that) logic be a helper function, and have that helper
function be inside that same #ifdef CONFIG_STRICT_TLB_FILL block
together with the tlb_table_flush_mmu() function.

IOW, something like

  static int tlb_remove_table_quick( struct mmu_gather *tlb, void *table)
  {
        if (atomic_read(&tlb->mm->mm_users) < 2) {
            __tlb_remove_table(table);
            return 1;
        }
        return 0;
  }

for the CONFIG_STRICT_TLB_FILL case, and then the default case just
does an unconditional "return 0".

So that the actual code can avoid having #ifdef's in the middle of a
[...]
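
Editorial sketch, not part of the thread: the helper-function shape suggested in the mail above can be mocked up in userspace to show how the caller stays free of #ifdefs. tlb_remove_table_quick() and CONFIG_STRICT_TLB_FILL come from the mails; the atomic_t, mm_struct and mmu_gather stand-ins, the two counters, and sketch_tlb_remove_table() are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for kernel types; only the fields used here exist. */
typedef struct { int counter; } atomic_t;
#define atomic_read(v)	((v)->counter)

struct mm_struct  { atomic_t mm_users; };
struct mmu_gather { struct mm_struct *mm; };

/* Instrumentation for the sketch only. */
static int sketch_tables_freed;
static int sketch_tables_queued;

/* Stand-in for the arch-specific immediate free of a page-table page. */
static inline void __tlb_remove_table(void *table)
{
	(void)table;
	sketch_tables_freed++;
}

#define CONFIG_STRICT_TLB_FILL 1	/* remove to build the default variant */

#ifdef CONFIG_STRICT_TLB_FILL
/* Single user and no speculative fills: free right away. */
static inline int tlb_remove_table_quick(struct mmu_gather *tlb, void *table)
{
	if (atomic_read(&tlb->mm->mm_users) < 2) {
		__tlb_remove_table(table);
		return 1;
	}
	return 0;
}
#else
/* Default: never take the shortcut. */
static inline int tlb_remove_table_quick(struct mmu_gather *tlb, void *table)
{
	(void)tlb;
	(void)table;
	return 0;
}
#endif

/* Hypothetical caller: no #ifdef in the body. */
static inline void sketch_tlb_remove_table(struct mmu_gather *tlb, void *table)
{
	assert(tlb != NULL && tlb->mm != NULL);
	if (tlb_remove_table_quick(tlb, table))
		return;
	/* Otherwise the table would be batched for RCU freeing (omitted). */
	sketch_tables_queued++;
}
```

With the define in place, a single-user mm frees the table immediately and a shared mm falls through to the batched path; without it, every call falls through.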

Benjamin Herrenschmidt | 28 Jun 2012 09:09

Re: [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing

On Thu, 2012-06-28 at 01:01 +0200, Peter Zijlstra wrote:
> On Wed, 2012-06-27 at 15:23 -0700, Linus Torvalds wrote:
> 
> > Plus it really isn't about hardware page table walkers at all. It's
> > more about the possibility of speculative TLB fills, it has nothing to
> > do with *how* they are done. Sure, it's likely that a software
> > pagetable walker wouldn't be something that gets called speculatively,
> > but it's not out of the question.
> > 
> Hmm, I would call gup_fast() as speculative as we can get in software.
> It does a lock-less walk of the page-tables. That's what the RCU free'd
> page-table stuff is for to begin with.

Strictly speaking it's not :-) To *begin with* (as in the origin of that
code) it comes from powerpc hash table code which walks the linux page
tables locklessly :-) It then came in handy with gup_fast :-)

> > IOW, if Sparc/PPC really want to guarantee that they never fill TLB
> > entries speculatively, and that if we are in a kernel thread they will
> > *never* fill the TLB with anything else, then make them enable
> > CONFIG_STRICT_TLB_FILL or something in their architecture Kconfig
> > files. 
> 
> Since we've dealt with the speculative software side by using RCU-ish
> stuff, the only thing that's left is hardware, now neither sparc64 nor
> ppc actually know about the linux page-tables from what I understood,
> they only look at their hash-table thing.

Some embedded ppc's know about the lowest level (SW loaded PMD) but
that's not an issue here. We flush these special TLB entries
[...]

Peter Zijlstra | 28 Jun 2012 13:05

Re: [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing

On Thu, 2012-06-28 at 17:09 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2012-06-28 at 01:01 +0200, Peter Zijlstra wrote:
> > On Wed, 2012-06-27 at 15:23 -0700, Linus Torvalds wrote:
> > 
> > > Plus it really isn't about hardware page table walkers at all. It's
> > > more about the possibility of speculative TLB fills, it has nothing to
> > > do with *how* they are done. Sure, it's likely that a software
> > > pagetable walker wouldn't be something that gets called speculatively,
> > > but it's not out of the question.
> > > 
> > Hmm, I would call gup_fast() as speculative as we can get in software.
> > It does a lock-less walk of the page-tables. That's what the RCU free'd
> > page-table stuff is for to begin with.
> 
> Strictly speaking it's not :-) To *begin with* (as in the origin of that
> code) it comes from powerpc hash table code which walks the linux page
> tables locklessly :-) It then came in handy with gup_fast :-)

Ah, ok my bad.

> > > IOW, if Sparc/PPC really want to guarantee that they never fill TLB
> > > entries speculatively, and that if we are in a kernel thread they will
> > > *never* fill the TLB with anything else, then make them enable
> > > CONFIG_STRICT_TLB_FILL or something in their architecture Kconfig
> > > files. 
> > 
> > Since we've dealt with the speculative software side by using RCU-ish
> > stuff, the only thing that's left is hardware, now neither sparc64 nor
> > ppc actually know about the linux page-tables from what I understood,
> > they only look at their hash-table thing.
[...]

Benjamin Herrenschmidt | 28 Jun 2012 14:00

Re: [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing

On Thu, 2012-06-28 at 13:05 +0200, Peter Zijlstra wrote:
> 
> > Some embedded ppc's know about the lowest level (SW loaded PMD) but
> > that's not an issue here. We flush these special TLB entries
> > specifically and synchronously in __pte_free_tlb().
> 
> OK, I missed that.. is that
> arch/powerpc/mm/tlb_nohash.c:tlb_flush_pgtable() ?

Yup.

> > > So even if the hardware did do speculative tlb fills, it would do them
> > > from the hash-table, but that's already cleared out.
> > 
> > Right,
> 
> Phew at least I got the important thing right ;-)

Yeah as long as we have that hash :-) The day we move on (if ever) it
will be as bad as ARM :-)

Cheers,
Ben.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email <at> kvack.org

Nikunj A Dadhania | 24 Jul 2012 07:12

Re: [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing

On Thu, 28 Jun 2012 01:01:46 +0200, Peter Zijlstra <a.p.zijlstra <at> chello.nl> wrote:

> +#ifdef CONFIG_STRICT_TLB_FILL
> +/*
> + * Some architectures (sparc64, ppc) cannot refill TLBs after they've removed
> + * the PTE entries from their hash-table. Their hardware never looks at the
> + * linux page-table structures, so they don't need a hardware TLB invalidate
> + * when tearing down the page-table structure itself.
> + */
> +static inline void tlb_table_flush_mmu(struct mmu_gather *tlb) { }
> +#else
> +static inline void tlb_table_flush_mmu(struct mmu_gather *tlb)
> +{
> +	tlb_flush_mmu(tlb);
> +}
> +#endif
> +
>  void tlb_table_flush(struct mmu_gather *tlb)
>  {
>  	struct mmu_table_batch **batch = &tlb->batch;
>  
>  	if (*batch) {
> +		tlb_table_flush_mmu(tlb);
>  		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
>  		*batch = NULL;
>  	}

Hi Peter,

When running the munmap test (https://lkml.org/lkml/2012/5/17/59) with KVM
[...]

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 14/20] mm, sh: Convert sh to generic tlb

Cc: Paul Mundt <lethal <at> linux-sh.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/sh/Kconfig           |    1 
 arch/sh/include/asm/tlb.h |   98 ++--------------------------------------------
 2 files changed, 6 insertions(+), 93 deletions(-)
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -28,6 +28,7 @@ config SUPERH
 	select IRQ_FORCED_THREADING
 	select RTC_LIB
 	select GENERIC_ATOMIC64
+	select HAVE_MMU_GATHER_RANGE if MMU
 	select GENERIC_IRQ_SHOW
 	select GENERIC_SMP_IDLE_THREAD
 	select GENERIC_CLOCKEVENTS
--- a/arch/sh/include/asm/tlb.h
+++ b/arch/sh/include/asm/tlb.h
@@ -10,100 +10,14 @@

 #ifdef CONFIG_MMU
 #include <linux/swap.h>
-#include <asm/pgalloc.h>
-#include <asm/tlbflush.h>
-#include <asm/mmu_context.h>
-
-/*
- * TLB handling.  This allows us to remove pages from the page
- * tables, and efficiently handle the TLB issues.
- */
[...]

Paul Mundt | 28 Jun 2012 20:32

Re: [PATCH 14/20] mm, sh: Convert sh to generic tlb

On Wed, Jun 27, 2012 at 11:15:54PM +0200, Peter Zijlstra wrote:
> Cc: Paul Mundt <lethal <at> linux-sh.org>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
> ---
>  arch/sh/Kconfig           |    1 
>  arch/sh/include/asm/tlb.h |   98 ++--------------------------------------------
>  2 files changed, 6 insertions(+), 93 deletions(-)

This blows up in the same way as last time.

I direct you to the same bug report and patch as before:

http://marc.info/?l=linux-kernel&m=133722116507075&w=2

Peter Zijlstra | 28 Jun 2012 22:27

Re: [PATCH 14/20] mm, sh: Convert sh to generic tlb

On Fri, 2012-06-29 at 03:32 +0900, Paul Mundt wrote:
> On Wed, Jun 27, 2012 at 11:15:54PM +0200, Peter Zijlstra wrote:
> > Cc: Paul Mundt <lethal <at> linux-sh.org>
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
> > ---
> >  arch/sh/Kconfig           |    1 
> >  arch/sh/include/asm/tlb.h |   98 ++--------------------------------------------
> >  2 files changed, 6 insertions(+), 93 deletions(-)
> 
> This blows up in the same way as last time.
> 
> I direct you to the same bug report and patch as before:
> 
> http://marc.info/?l=linux-kernel&m=133722116507075&w=2

Sorry about that.. /me goes amend.

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 10/20] mm: Provide generic range tracking and flushing

In order to convert various architectures to generic tlb we need to
provide some extra infrastructure to track the range of the flushed
page tables.

There are two mmu_gather cases to consider:

  unmap_region()
    tlb_gather_mmu()
    unmap_vmas()
      for (; vma; vma = vma->vm_next)
        unmap_page_range()
          tlb_start_vma() -> flush cache range/track vm_flags
          zap_*_range()
            arch_enter_lazy_mmu_mode()
            ptep_get_and_clear_full() -> batch/track external tlbs
            tlb_remove_tlb_entry() -> track range/external tlbs
            tlb_remove_page() -> batch page
            arch_leave_lazy_mmu_mode() -> flush external tlbs
          tlb_end_vma()
    free_pgtables()
      while (vma)
        unlink_*_vma()
        free_*_range()
          *_free_tlb() -> track range/batch page
    tlb_finish_mmu() -> flush TLBs and flush everything
  free vmas

and:

  shift_arg_pages()
[...]
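
Editorial sketch, not part of the posted patch: the "track range" steps in the first call graph above amount to keeping a running [start, end) interval in the gather structure and widening it on every tracked entry. The diff for this patch is cut off in this archive, so the struct and function names below (sketch_gather, sketch_track_range() and friends) are assumptions, not the patch's actual identifiers.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for the range fields of struct mmu_gather. */
struct sketch_gather {
	unsigned long	start;	/* lowest address touched so far */
	unsigned long	end;	/* one past the highest address touched */
};

/* Start with an empty range: start lies above end. */
static inline void sketch_range_init(struct sketch_gather *tlb)
{
	tlb->start = ULONG_MAX;
	tlb->end = 0;
}

/* Nothing has been tracked yet, so no ranged flush is needed. */
static inline int sketch_range_empty(const struct sketch_gather *tlb)
{
	return tlb->start >= tlb->end;
}

/* Widen the tracked range so that it covers [addr, end). */
static inline void sketch_track_range(struct sketch_gather *tlb,
				      unsigned long addr, unsigned long end)
{
	assert(addr < end);
	if (addr < tlb->start)
		tlb->start = addr;
	if (end > tlb->end)
		tlb->end = end;
}
```

A flush at tlb_finish_mmu() time could then cover only [start, end) rather than the whole mm, and could be skipped while the range is still empty.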

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 09/20] mm, arch: Add end argument to p??_free_tlb()

In order to facilitate range tracking we need the end address of the
object we're freeing. The callsites already compute this address so
change things to simply pass it along.

Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/arm/include/asm/tlb.h         |    6 +++---
 arch/ia64/include/asm/tlb.h        |    6 +++---
 arch/powerpc/mm/hugetlbpage.c      |    4 ++--
 arch/s390/include/asm/tlb.h        |    6 +++---
 arch/sh/include/asm/tlb.h          |    6 +++---
 arch/um/include/asm/tlb.h          |    6 +++---
 include/asm-generic/4level-fixup.h |    2 +-
 include/asm-generic/tlb.h          |    6 +++---
 mm/memory.c                        |   10 +++++-----
 9 files changed, 26 insertions(+), 26 deletions(-)
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -217,9 +217,9 @@ static inline void __pmd_free_tlb(struct
 #endif
 }

-#define pte_free_tlb(tlb, ptep, addr)	__pte_free_tlb(tlb, ptep, addr)
-#define pmd_free_tlb(tlb, pmdp, addr)	__pmd_free_tlb(tlb, pmdp, addr)
-#define pud_free_tlb(tlb, pudp, addr)	pud_free((tlb)->mm, pudp)
+#define pte_free_tlb(tlb, ptep, addr, end)	__pte_free_tlb(tlb, ptep, addr)
+#define pmd_free_tlb(tlb, pmdp, addr, end)	__pmd_free_tlb(tlb, pmdp, addr)
+#define pud_free_tlb(tlb, pudp, addr, end)	pud_free((tlb)->mm, pudp)

 #define tlb_migrate_finish(mm)		do { } while (0)
[...]

Peter Zijlstra | 27 Jun 2012 23:15

[PATCH 11/20] mm, s390: Convert to use generic mmu_gather

Now that s390 is using the generic RCU freeing of page-table pages,
all that remains different wrt the generic mmu_gather code is the lack
of mmu_gather based TLB flushing for regular entries.

S390 doesn't need a TLB flush after ptep_get_and_clear_full() and
before __tlb_remove_page() because its ptep_get_and_clear*() family
already does a full TLB invalidate. Therefore force it to use
tlb_fast_mode.

Cc: Martin Schwidefsky <schwidefsky <at> de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra <at> chello.nl>
---
 arch/s390/include/asm/pgtable.h |    1 
 arch/s390/include/asm/tlb.h     |   85 ++++------------------------------------
 include/asm-generic/tlb.h       |    7 +++
 3 files changed, 17 insertions(+), 76 deletions(-)
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1242,6 +1242,7 @@ extern int s390_enable_sie(void);
  * No page table caches to initialise
  */
 #define pgtable_cache_init()	do { } while (0)
+#define check_pgt_cache()	do { } while (0)

 #include <asm-generic/pgtable.h>

--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -28,82 +28,16 @@
 #include <asm/pgalloc.h>
[...]

Peter Zijlstra | 28 Jun 2012 00:13

Re: [PATCH 11/20] mm, s390: Convert to use generic mmu_gather

On Wed, 2012-06-27 at 23:15 +0200, Peter Zijlstra wrote:
> 
> S390 doesn't need a TLB flush after ptep_get_and_clear_full() and
> before __tlb_remove_page() because its ptep_get_and_clear*() family
> already does a full TLB invalidate. Therefore force it to use
> tlb_fast_mode. 

On that.. ptep_get_and_clear() says:

/*                                                                                             
 * This is hard to understand. ptep_get_and_clear and ptep_clear_flush                         
 * both clear the TLB for the unmapped pte. The reason is that                                 
 * ptep_get_and_clear is used in common code (e.g. change_pte_range)                           
 * to modify an active pte. The sequence is                                                    
 *   1) ptep_get_and_clear                                                                     
 *   2) set_pte_at                                                                             
 *   3) flush_tlb_range                                                                        
 * On s390 the tlb needs to get flushed with the modification of the pte                       
 * if the pte is active. The only way how this can be implemented is to                        
 * have ptep_get_and_clear do the tlb flush. In exchange flush_tlb_range                       
 * is a nop.                                                                                   
 */ 

I think there is another way, arch_{enter,leave}_lazy_mmu_mode() seems
to wrap these sites so you can do as SPARC64 and PPC do and batch
through there.

That should save a number of TLB invalidates..

[...]
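
Editorial sketch, not part of the thread: the batching idea raised in the mail above (queue invalidations while in lazy MMU mode and issue them once on leave) can be modelled in plain C. All names below (sketch_lazy_batch, sketch_invalidate() and so on) are hypothetical, and the flush counter only simulates hardware. Whether s390 can defer its invalidates this way is the open question in this exchange, and the sketch does not settle it.

```c
#include <assert.h>

#define SKETCH_BATCH_MAX	8	/* arbitrary capacity for the sketch */

struct sketch_lazy_batch {
	int		active;			/* inside an enter/leave pair? */
	unsigned int	nr;			/* queued addresses */
	unsigned long	addrs[SKETCH_BATCH_MAX];
	unsigned int	flushes;		/* simulated invalidate operations */
};

/* Issue one simulated invalidate for everything queued so far. */
static inline void sketch_flush_pending(struct sketch_lazy_batch *b)
{
	if (b->nr) {
		b->flushes++;
		b->nr = 0;
	}
}

static inline void sketch_enter_lazy_mmu_mode(struct sketch_lazy_batch *b)
{
	b->active = 1;
}

static inline void sketch_leave_lazy_mmu_mode(struct sketch_lazy_batch *b)
{
	sketch_flush_pending(b);
	b->active = 0;
}

/* Outside lazy mode: invalidate at once. Inside: queue, flushing on overflow. */
static inline void sketch_invalidate(struct sketch_lazy_batch *b,
				     unsigned long addr)
{
	assert(b->nr <= SKETCH_BATCH_MAX);
	if (!b->active) {
		b->flushes++;
		return;
	}
	if (b->nr == SKETCH_BATCH_MAX)
		sketch_flush_pending(b);
	b->addrs[b->nr++] = addr;
}
```

In this model, three invalidates issued outside lazy mode cost three simulated flushes, while the same three inside an enter/leave pair cost one.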

Martin Schwidefsky | 28 Jun 2012 09:13

Re: [PATCH 11/20] mm, s390: Convert to use generic mmu_gather

On Thu, 28 Jun 2012 00:13:19 +0200
Peter Zijlstra <a.p.zijlstra <at> chello.nl> wrote:

> On Wed, 2012-06-27 at 23:15 +0200, Peter Zijlstra wrote:
> > 
> > S390 doesn't need a TLB flush after ptep_get_and_clear_full() and
> > before __tlb_remove_page() because its ptep_get_and_clear*() family
> > already does a full TLB invalidate. Therefore force it to use
> > tlb_fast_mode. 
> 
> On that.. ptep_get_and_clear() says:
> 
> /*                                                                                             
>  * This is hard to understand. ptep_get_and_clear and ptep_clear_flush                         
>  * both clear the TLB for the unmapped pte. The reason is that                                 
>  * ptep_get_and_clear is used in common code (e.g. change_pte_range)                           
>  * to modify an active pte. The sequence is                                                    
>  *   1) ptep_get_and_clear                                                                     
>  *   2) set_pte_at                                                                             
>  *   3) flush_tlb_range                                                                        
>  * On s390 the tlb needs to get flushed with the modification of the pte                       
>  * if the pte is active. The only way how this can be implemented is to                        
>  * have ptep_get_and_clear do the tlb flush. In exchange flush_tlb_range                       
>  * is a nop.                                                                                   
>  */ 
> 
> I think there is another way, arch_{enter,leave}_lazy_mmu_mode() seems
> to wrap these sites so you can do as SPARC64 and PPC do and batch
> through there.
> 
[...]