Kirill A. Shutemov | 10 Sep 15:13 2012

[PATCH v2 00/10] Introduce huge zero page

From: "Kirill A. Shutemov" <kirill.shutemov <at> linux.intel.com>

During testing I noticed a big (up to 2.5x) memory consumption overhead
on some workloads (e.g. ft.A from NPB) if THP is enabled.

The main reason for the big difference is the lack of a zero page in the
THP case: we have to allocate a real page on a read page fault.

A program to demonstrate the issue:
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB (1024 * 1024)

int main(int argc, char **argv)
{
        char *p;
        int i;

        /* 200MB, aligned to the 2MB huge page size */
        if (posix_memalign((void **)&p, 2 * MB, 200 * MB))
                return 1;
        /* touch every 4k page with a read access */
        for (i = 0; i < 200 * MB; i += 4096)
                assert(p[i] == 0);
        pause();
        return 0;
}

With thp-never RSS is about 400k, but with thp-always it's 200M.
After the patchset, thp-always RSS is 400k too.
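
To reproduce the comparison, toggle /sys/kernel/mm/transparent_hugepage/enabled
between "never" and "always" and look at VmRSS of the paused process. The
helper below is only an illustrative way to print it from inside the test
program; it is not part of the original report.

#include <stdio.h>
#include <string.h>

/* Print the VmRSS line from /proc/self/status (illustrative only). */
static void print_rss(void)
{
	FILE *f = fopen("/proc/self/status", "r");
	char line[256];

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "VmRSS:", 6))
			fputs(line, stdout);
	fclose(f);
}

Calling print_rss() just before pause() shows roughly 200M with thp-always
before the patchset and roughly 400k after it (or with thp-never).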

(Continue reading)

Kirill A. Shutemov | 10 Sep 15:13 2012

[PATCH v2 01/10] thp: huge zero page: basic preparation

From: "Kirill A. Shutemov" <kirill.shutemov <at> linux.intel.com>

For now, let's allocate the page in hugepage_init(). We'll switch to lazy
allocation later.

We are not going to map the huge zero page until we can handle it
properly on all code paths.

The is_huge_zero_{pfn,pmd}() functions will be used by the following
patches to check whether a pfn/pmd is the huge zero page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov <at> linux.intel.com>
---
 mm/huge_memory.c |   29 +++++++++++++++++++++++++++++
 1 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57c4b93..88e0a7a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -46,6 +46,7 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
 /* during fragmentation poll the hugepage allocator once every minute */
 static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
 static struct task_struct *khugepaged_thread __read_mostly;
+static unsigned long huge_zero_pfn __read_mostly;
 static DEFINE_MUTEX(khugepaged_mutex);
 static DEFINE_SPINLOCK(khugepaged_mm_lock);
 static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -167,6 +168,28 @@ out:
 	return err;
(Continue reading)
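
The hunk is truncated above. Going by the changelog (and by patch 09, which
reworks the same allocation into init_huge_zero_pfn()), the helpers added here
presumably look roughly like the sketch below; treat it as a reconstruction,
not the literal patch:

static int init_huge_zero_page(void)
{
	struct page *hpage;

	hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
	if (!hpage)
		return -ENOMEM;
	/* Only remember the pfn; nothing maps the page until later patches. */
	huge_zero_pfn = page_to_pfn(hpage);
	return 0;
}

static inline bool is_huge_zero_pfn(unsigned long pfn)
{
	return pfn == huge_zero_pfn;
}

static inline bool is_huge_zero_pmd(pmd_t pmd)
{
	return is_huge_zero_pfn(pmd_pfn(pmd));
}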

Kirill A. Shutemov | 10 Sep 15:13 2012

[PATCH v2 03/10] thp: copy_huge_pmd(): copy huge zero page

From: "Kirill A. Shutemov" <kirill.shutemov <at> linux.intel.com>

It's easy to copy the huge zero page: just set the destination pmd to the
huge zero page.

It's safe to copy the huge zero page since we have none yet :-p

Signed-off-by: Kirill A. Shutemov <kirill.shutemov <at> linux.intel.com>
---
 mm/huge_memory.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9dcb9e6..a534f84 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -725,6 +725,18 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif

+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd)
+{
+	pmd_t entry;
+	entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
+	entry = pmd_wrprotect(entry);
+	entry = pmd_mkhuge(entry);
+	set_pmd_at(mm, haddr, pmd, entry);
+	prepare_pmd_huge_pte(pgtable, mm);
+	mm->nr_ptes++;
(Continue reading)
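
The excerpt cuts off inside set_huge_zero_page(). Given that helper, the change
to copy_huge_pmd() itself is presumably just an early branch along these lines
(a sketch; the surrounding variable names are guesses from context):

	/* Inside copy_huge_pmd(), with the destination page_table_lock held:
	 * a huge zero pmd in the parent is "copied" by simply installing the
	 * huge zero page into the child as well. */
	if (is_huge_zero_pmd(pmd)) {
		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
		ret = 0;
		goto out_unlock;
	}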

Kirill A. Shutemov | 10 Sep 15:13 2012

[PATCH v2 08/10] thp: setup huge zero page on non-write page fault

From: "Kirill A. Shutemov" <kirill.shutemov <at> linux.intel.com>

All code paths seem to be covered now, so we can map the huge zero page
on a read page fault.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov <at> linux.intel.com>
---
 mm/huge_memory.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 995894f..c788445 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -750,6 +750,16 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			return VM_FAULT_OOM;
 		if (unlikely(khugepaged_enter(vma)))
 			return VM_FAULT_OOM;
+		if (!(flags & FAULT_FLAG_WRITE)) {
+			pgtable_t pgtable;
+			pgtable = pte_alloc_one(mm, haddr);
+			if (unlikely(!pgtable))
+				goto out;
+			spin_lock(&mm->page_table_lock);
+			set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
+			spin_unlock(&mm->page_table_lock);
+			return 0;
+		}
 		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
 					  vma, haddr, numa_node_id(), 0);
(Continue reading)

Kirill A. Shutemov | 10 Sep 15:13 2012

[PATCH v2 05/10] thp: change_huge_pmd(): keep huge zero page write-protected

From: "Kirill A. Shutemov" <kirill.shutemov <at> linux.intel.com>

We want to take a page fault on a write attempt to the huge zero page, so
let's keep it write-protected.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov <at> linux.intel.com>
---
 mm/huge_memory.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f5029d4..4001f1a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1248,6 +1248,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		pmd_t entry;
 		entry = pmdp_get_and_clear(mm, addr, pmd);
 		entry = pmd_modify(entry, newprot);
+		if (is_huge_zero_pmd(entry))
+			entry = pmd_wrprotect(entry);
 		set_pmd_at(mm, addr, pmd, entry);
 		spin_unlock(&vma->vm_mm->page_table_lock);
 		ret = 1;
--

-- 
1.7.7.6

(Continue reading)

Kirill A. Shutemov | 10 Sep 15:13 2012

[PATCH v2 07/10] thp: implement splitting pmd for huge zero page

From: "Kirill A. Shutemov" <kirill.shutemov <at> linux.intel.com>

We can't split the huge zero page itself, but we can split the pmd that
points to it.

On splitting the pmd we create a page table with all ptes set to the
normal zero page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov <at> linux.intel.com>
---
 mm/huge_memory.c |   32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 48ecc46..995894f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1599,6 +1599,7 @@ int split_huge_page(struct page *page)
 	struct anon_vma *anon_vma;
 	int ret = 1;

+	BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
 	BUG_ON(!PageAnon(page));
 	anon_vma = page_lock_anon_vma(page);
 	if (!anon_vma)
@@ -2503,6 +2504,32 @@ static int khugepaged(void *none)
 	return 0;
 }

+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
(Continue reading)
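
The body of __split_huge_zero_page_pmd() is cut off above. From the
description, every pte of the resulting page table points at the normal zero
page, so it presumably looks something like this sketch (helper names such as
get_pmd_huge_pte() are taken from the other patches in this series; the exact
TLB-flush primitive is an assumption):

static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
		unsigned long haddr, pmd_t *pmd)
{
	struct mm_struct *mm = vma->vm_mm;
	pgtable_t pgtable;
	pmd_t _pmd;
	int i;

	pmdp_clear_flush(vma, haddr, pmd);
	/* leave pmd empty */

	pgtable = get_pmd_huge_pte(mm);
	pmd_populate(mm, &_pmd, pgtable);

	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
		pte_t *pte, entry;
		/* every 4k piece of the huge zero page is the normal zero page */
		entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
		entry = pte_mkspecial(entry);
		pte = pte_offset_map(&_pmd, haddr);
		VM_BUG_ON(!pte_none(*pte));
		set_pte_at(mm, haddr, pte, entry);
		pte_unmap(pte);
	}
	smp_wmb(); /* make the ptes visible before the pmd */
	pmd_populate(mm, pmd, pgtable);
}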

Kirill A. Shutemov | 10 Sep 15:13 2012

[PATCH v2 02/10] thp: zap_huge_pmd(): zap huge zero pmd

From: "Kirill A. Shutemov" <kirill.shutemov <at> linux.intel.com>

We don't have a real page to zap in the huge zero page case. Let's just
clear the pmd and remove it from the TLB.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov <at> linux.intel.com>
---
 mm/huge_memory.c |   27 +++++++++++++++++----------
 1 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 88e0a7a..9dcb9e6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1071,16 +1071,23 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		struct page *page;
 		pgtable_t pgtable;
 		pgtable = get_pmd_huge_pte(tlb->mm);
-		page = pmd_page(*pmd);
-		pmd_clear(pmd);
-		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
-		page_remove_rmap(page);
-		VM_BUG_ON(page_mapcount(page) < 0);
-		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
-		VM_BUG_ON(!PageHead(page));
-		tlb->mm->nr_ptes--;
-		spin_unlock(&tlb->mm->page_table_lock);
-		tlb_remove_page(tlb, page);
+		if (is_huge_zero_pmd(*pmd)) {
+			pmd_clear(pmd);
(Continue reading)
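
The hunk is truncated above. Following the description (nothing to unmap or
account, just clear the pmd and flush the TLB), and reusing the removed lines
for the regular-page case, the branch presumably continues roughly like this:

		if (is_huge_zero_pmd(*pmd)) {
			/* no real page behind the mapping: nothing to
			 * unmap, account or free */
			pmd_clear(pmd);
			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
			tlb->mm->nr_ptes--;
			spin_unlock(&tlb->mm->page_table_lock);
		} else {
			page = pmd_page(*pmd);
			pmd_clear(pmd);
			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
			page_remove_rmap(page);
			VM_BUG_ON(page_mapcount(page) < 0);
			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
			VM_BUG_ON(!PageHead(page));
			tlb->mm->nr_ptes--;
			spin_unlock(&tlb->mm->page_table_lock);
			tlb_remove_page(tlb, page);
		}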

Kirill A. Shutemov | 10 Sep 15:13 2012

[PATCH v2 06/10] thp: change split_huge_page_pmd() interface

From: "Kirill A. Shutemov" <kirill.shutemov <at> linux.intel.com>

Pass the vma instead of the mm and add an address parameter.

In most cases we already have the vma on the stack. We provide
split_huge_page_pmd_mm() for the few cases where we have the mm but not
the vma.

This change is preparation for the huge zero pmd splitting implementation.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov <at> linux.intel.com>
---
 Documentation/vm/transhuge.txt |    4 ++--
 arch/x86/kernel/vm86_32.c      |    2 +-
 fs/proc/task_mmu.c             |    2 +-
 include/linux/huge_mm.h        |   14 ++++++++++----
 mm/huge_memory.c               |   24 +++++++++++++++++++-----
 mm/memory.c                    |    4 ++--
 mm/mempolicy.c                 |    2 +-
 mm/mprotect.c                  |    2 +-
 mm/mremap.c                    |    2 +-
 mm/pagewalk.c                  |    2 +-
 10 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index f734bb2..677a599 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -276,7 +276,7 @@ unaffected. libhugetlbfs will also work fine as usual.
 == Graceful fallback ==

(Continue reading)
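
The include/linux/huge_mm.h hunk is cut off above. From the changelog, the
reworked interface presumably looks roughly like this sketch of the
declarations (the wrapper-macro shape mirrors the old split_huge_page_pmd()
and is an assumption):

extern void __split_huge_page_pmd(struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd);
extern void split_huge_page_pmd_mm(struct mm_struct *mm,
		unsigned long address, pmd_t *pmd);

#define split_huge_page_pmd(__vma, __address, __pmd)			\
	do {								\
		pmd_t *____pmd = (__pmd);				\
		if (unlikely(pmd_trans_huge(*____pmd)))			\
			__split_huge_page_pmd(__vma, __address,		\
					      ____pmd);			\
	} while (0)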

Kirill A. Shutemov | 10 Sep 15:13 2012

[PATCH v2 10/10] thp: implement refcounting for huge zero page

From: "Kirill A. Shutemov" <kirill.shutemov <at> linux.intel.com>

H. Peter Anvin doesn't like a huge zero page that sticks in memory forever
after the first allocation. Here's an implementation of lockless
refcounting for the huge zero page.

We have two basic primitives: {get,put}_huge_zero_page(). They manipulate
a reference counter.

If the counter is 0, get_huge_zero_page() allocates a new huge page and
takes two references: one for the caller and one for the shrinker. We free
the page only in the shrinker callback, and only if the counter is 1
(i.e. only the shrinker holds a reference).

put_huge_zero_page() only decrements the counter. The counter never
reaches zero in put_huge_zero_page() since the shrinker holds a reference.

Freeing the huge zero page in the shrinker callback helps to avoid
frequent allocate/free cycles.

Refcounting has a cost. On a 4-socket machine I observe a ~1% slowdown on
parallel (40 processes) read page faulting compared to lazy huge page
allocation. I think that's pretty reasonable for a synthetic benchmark.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov <at> linux.intel.com>
---
 mm/huge_memory.c |  108 ++++++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 84 insertions(+), 24 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
(Continue reading)
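
The diff is truncated above; get_huge_zero_page() is quoted in the review that
follows. For the release side described in the changelog, the code presumably
looks roughly like this sketch (the 2012-era shrinker callback signature is
assumed):

static void put_huge_zero_page(void)
{
	/*
	 * The counter can never reach zero here: the shrinker always holds
	 * one reference and is the only place the page is freed.
	 */
	BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
}

static int shrink_huge_zero_page(struct shrinker *shrink,
		struct shrink_control *sc)
{
	if (!sc->nr_to_scan)
		/* we can free the zero page only if the last reference remains */
		return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;

	if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
		unsigned long zero_pfn = xchg(&huge_zero_pfn, 0);
		BUG_ON(zero_pfn == 0);
		__free_page(pfn_to_page(zero_pfn));
	}

	return 0;
}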

Eric Dumazet | 10 Sep 16:02 2012

Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page

On Mon, 2012-09-10 at 16:13 +0300, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov <at> linux.intel.com>
> 
> H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> after the first allocation. Here's implementation of lockless refcounting
> for huge zero page.
> 
...

> +static unsigned long get_huge_zero_page(void)
> +{
> +	struct page *zero_page;
> +retry:
> +	if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
> +		return ACCESS_ONCE(huge_zero_pfn);
> +
> +	zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> +	if (!zero_page)
> +		return 0;
> +	if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
> +		__free_page(zero_page);
> +		goto retry;
> +	}

This might break if preemption can happen here?

The second thread might loop forever because huge_zero_refcount is 0
and huge_zero_pfn is not zero.

If preemption is already disabled, a comment would be nice.
(Continue reading)

Kirill A. Shutemov | 10 Sep 16:44 2012

Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page

On Mon, Sep 10, 2012 at 04:02:39PM +0200, Eric Dumazet wrote:
> On Mon, 2012-09-10 at 16:13 +0300, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov <at> linux.intel.com>
> > 
> > H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> > after the first allocation. Here's implementation of lockless refcounting
> > for huge zero page.
> > 
> ...
> 
> > +static unsigned long get_huge_zero_page(void)
> > +{
> > +	struct page *zero_page;
> > +retry:
> > +	if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
> > +		return ACCESS_ONCE(huge_zero_pfn);
> > +
> > +	zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> > +	if (!zero_page)
> > +		return 0;
> > +	if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
> > +		__free_page(zero_page);
> > +		goto retry;
> > +	}
> 
> This might break if preemption can happen here ?
> 
> The second thread might loop forever because huge_zero_refcount is 0,
> and huge_zero_pfn not zero.

(Continue reading)

Eric Dumazet | 10 Sep 16:48 2012

Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page

On Mon, 2012-09-10 at 17:44 +0300, Kirill A. Shutemov wrote:
> On Mon, Sep 10, 2012 at 04:02:39PM +0200, Eric Dumazet wrote:
> > On Mon, 2012-09-10 at 16:13 +0300, Kirill A. Shutemov wrote:
> > > From: "Kirill A. Shutemov" <kirill.shutemov <at> linux.intel.com>
> > > 
> > > H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> > > after the first allocation. Here's implementation of lockless refcounting
> > > for huge zero page.
> > > 
> > ...
> > 
> > > +static unsigned long get_huge_zero_page(void)
> > > +{
> > > +	struct page *zero_page;
> > > +retry:
> > > +	if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
> > > +		return ACCESS_ONCE(huge_zero_pfn);
> > > +
> > > +	zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> > > +	if (!zero_page)
> > > +		return 0;
> > > +	if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
> > > +		__free_page(zero_page);
> > > +		goto retry;
> > > +	}
> > 
> > This might break if preemption can happen here ?
> > 
> > The second thread might loop forever because huge_zero_refcount is 0,
> > and huge_zero_pfn not zero.
(Continue reading)

Kirill A. Shutemov | 10 Sep 16:50 2012

Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page

On Mon, Sep 10, 2012 at 04:48:07PM +0200, Eric Dumazet wrote:
> On Mon, 2012-09-10 at 17:44 +0300, Kirill A. Shutemov wrote:
> > On Mon, Sep 10, 2012 at 04:02:39PM +0200, Eric Dumazet wrote:
> > > On Mon, 2012-09-10 at 16:13 +0300, Kirill A. Shutemov wrote:
> > > > From: "Kirill A. Shutemov" <kirill.shutemov <at> linux.intel.com>
> > > > 
> > > > H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> > > > after the first allocation. Here's implementation of lockless refcounting
> > > > for huge zero page.
> > > > 
> > > ...
> > > 
> > > > +static unsigned long get_huge_zero_page(void)
> > > > +{
> > > > +	struct page *zero_page;
> > > > +retry:
> > > > +	if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
> > > > +		return ACCESS_ONCE(huge_zero_pfn);
> > > > +
> > > > +	zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> > > > +	if (!zero_page)
> > > > +		return 0;
> > > > +	if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
> > > > +		__free_page(zero_page);
> > > > +		goto retry;
> > > > +	}
> > > 
> > > This might break if preemption can happen here ?
> > > 
> > > The second thread might loop forever because huge_zero_refcount is 0,
(Continue reading)

Eric Dumazet | 10 Sep 16:57 2012

Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page

On Mon, 2012-09-10 at 17:44 +0300, Kirill A. Shutemov wrote:

> Yes, disabling preemption before alloc_pages() and enabling after
> atomic_set() looks reasonable. Thanks.

In fact, as alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER)
might sleep, it would be better to disable preemption after calling it:

zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
if (!zero_page)
	return 0;
preempt_disable();
if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
	preempt_enable();
	__free_page(zero_page);
	goto retry;
}
atomic_set(&huge_zero_refcount, 2);
preempt_enable();

Kirill A. Shutemov | 10 Sep 17:07 2012

Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page

On Mon, Sep 10, 2012 at 04:57:59PM +0200, Eric Dumazet wrote:
> On Mon, 2012-09-10 at 17:44 +0300, Kirill A. Shutemov wrote:
> 
> 
> > Yes, disabling preemption before alloc_pages() and enabling after
> > atomic_set() looks reasonable. Thanks.
> 
> In fact, as alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> might sleep, it would be better to disable preemption after calling it :

Yeah, I've already thought about that. :)

> zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> if (!zero_page)
> 	return 0;
> preempt_disable();
> if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
> 	preempt_enable();
> 	__free_page(zero_page);
> 	goto retry;
> }
> atomic_set(&huge_zero_refcount, 2);
> preempt_enable();
> 
> 

--

-- 
 Kirill A. Shutemov

--
(Continue reading)

Kirill A. Shutemov | 10 Sep 15:13 2012

[PATCH v2 04/10] thp: do_huge_pmd_wp_page(): handle huge zero page

From: "Kirill A. Shutemov" <kirill.shutemov <at> linux.intel.com>

On a write access to the huge zero page we allocate a new page and clear
it.

In the fallback path we create a new page table and set the pte around
the fault address to the newly allocated page. All other ptes are set to
the normal zero page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov <at> linux.intel.com>
---
 include/linux/mm.h |    8 ++++
 mm/huge_memory.c   |  102 ++++++++++++++++++++++++++++++++++++++++++++--------
 mm/memory.c        |    7 ----
 3 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311be90..179a41c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -514,6 +514,14 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 }
 #endif

+#ifndef my_zero_pfn
+static inline unsigned long my_zero_pfn(unsigned long addr)
+{
+	extern unsigned long zero_pfn;
+	return zero_pfn;
+}
+#endif
+
(Continue reading)
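
The mm/huge_memory.c part of the diff is cut off above. From the description,
the fallback path presumably fills the new page table along these lines (a
sketch; variable names like page, address and _pmd are guesses):

	/* Fallback: map the freshly allocated, zeroed 4k page at the
	 * faulting address and the normal zero page everywhere else. */
	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
		pte_t *pte, entry;
		if (haddr == (address & PAGE_MASK)) {
			entry = mk_pte(page, vma->vm_page_prot);
			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
			page_add_new_anon_rmap(page, vma, haddr);
		} else {
			entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
			entry = pte_mkspecial(entry);
		}
		pte = pte_offset_map(&_pmd, haddr);
		VM_BUG_ON(!pte_none(*pte));
		set_pte_at(mm, haddr, pte, entry);
		pte_unmap(pte);
	}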

Kirill A. Shutemov | 10 Sep 15:13 2012

[PATCH v2 09/10] thp: lazy huge zero page allocation

From: "Kirill A. Shutemov" <kirill.shutemov <at> linux.intel.com>

Instead of allocating the huge zero page in hugepage_init(), we can
postpone it until the first huge zero page mapping. This saves memory if
THP is not in use.

cmpxchg() is used to avoid a race on huge_zero_pfn initialization.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov <at> linux.intel.com>
---
 mm/huge_memory.c |   20 ++++++++++----------
 1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c788445..0981b09 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -168,21 +168,23 @@ out:
 	return err;
 }

-static int init_huge_zero_page(void)
+static int init_huge_zero_pfn(void)
 {
 	struct page *hpage;
+	unsigned long pfn;

 	hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
 	if (!hpage)
 		return -ENOMEM;
-
(Continue reading)
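
The rest of the diff is truncated. Presumably the read-fault path added in
patch 08 then grows a lazy-initialization check along these lines (a sketch
based on the hunk shown there):

		if (!(flags & FAULT_FLAG_WRITE)) {
			pgtable_t pgtable;
			/* allocate the huge zero page on first use; racing
			 * initializers are resolved by the cmpxchg() inside
			 * init_huge_zero_pfn() */
			if (unlikely(!huge_zero_pfn && init_huge_zero_pfn()))
				goto out;
			pgtable = pte_alloc_one(mm, haddr);
			if (unlikely(!pgtable))
				goto out;
			spin_lock(&mm->page_table_lock);
			set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
			spin_unlock(&mm->page_table_lock);
			return 0;
		}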

Kirill A. Shutemov | 12 Sep 12:07 2012

[PATCH v3 10/10] thp: implement refcounting for huge zero page

From: "Kirill A. Shutemov" <kirill.shutemov <at> linux.intel.com>

H. Peter Anvin doesn't like a huge zero page that sticks in memory forever
after the first allocation. Here's an implementation of lockless
refcounting for the huge zero page.

We have two basic primitives: {get,put}_huge_zero_page(). They manipulate
a reference counter.

If the counter is 0, get_huge_zero_page() allocates a new huge page and
takes two references: one for the caller and one for the shrinker. We free
the page only in the shrinker callback, and only if the counter is 1
(i.e. only the shrinker holds a reference).

put_huge_zero_page() only decrements the counter. The counter never
reaches zero in put_huge_zero_page() since the shrinker holds a reference.

Freeing the huge zero page in the shrinker callback helps to avoid
frequent allocate/free cycles.

Refcounting has a cost. On a 4-socket machine I observe a ~1% slowdown on
parallel (40 processes) read page faulting compared to lazy huge page
allocation. I think that's pretty reasonable for a synthetic benchmark.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov <at> linux.intel.com>
---
 mm/huge_memory.c |  111 ++++++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 87 insertions(+), 24 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
(Continue reading)

Andrea Arcangeli | 13 Sep 19:16 2012

Re: [PATCH v3 10/10] thp: implement refcounting for huge zero page

Hi Kirill,

On Wed, Sep 12, 2012 at 01:07:53PM +0300, Kirill A. Shutemov wrote:
> -	hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);

The page is likely as large as a pageblock so it's unlikely to create
much fragmentation even if __GFP_MOVABLE is set. That said, I guess it
would be more correct if __GFP_MOVABLE were clear, i.e.
(GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE, because this page isn't
really movable (it's only reclaimable).

The xchg vs cmpxchg locking also looks good.

Reviewed-by: Andrea Arcangeli <aarcange <at> redhat.com>

Thanks,
Andrea
Kirill A. Shutemov | 13 Sep 19:37 2012

Re: [PATCH v3 10/10] thp: implement refcounting for huge zero page

On Thu, Sep 13, 2012 at 07:16:13PM +0200, Andrea Arcangeli wrote:
> Hi Kirill,
> 
> On Wed, Sep 12, 2012 at 01:07:53PM +0300, Kirill A. Shutemov wrote:
> > -	hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> 
> The page is likely as large as a pageblock so it's unlikely to create
> much fragmentation even if the __GFP_MOVABLE is set. Said that I guess
> it would be more correct if __GFP_MOVABLE was clear, like
> (GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE because this page isn't
> really movable (it's only reclaimable).

Good point. I'll update the patchset.

> The xchg vs xchgcmp locking also looks good.
> 
> Reviewed-by: Andrea Arcangeli <aarcange <at> redhat.com>

Is it for the whole patchset? :)

--

-- 
 Kirill A. Shutemov
Andrea Arcangeli | 13 Sep 23:17 2012

Re: [PATCH v3 10/10] thp: implement refcounting for huge zero page

Hi Kirill,

On Thu, Sep 13, 2012 at 08:37:58PM +0300, Kirill A. Shutemov wrote:
> On Thu, Sep 13, 2012 at 07:16:13PM +0200, Andrea Arcangeli wrote:
> > Hi Kirill,
> > 
> > On Wed, Sep 12, 2012 at 01:07:53PM +0300, Kirill A. Shutemov wrote:
> > > -	hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> > 
> > The page is likely as large as a pageblock so it's unlikely to create
> > much fragmentation even if the __GFP_MOVABLE is set. Said that I guess
> > it would be more correct if __GFP_MOVABLE was clear, like
> > (GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE because this page isn't
> > really movable (it's only reclaimable).
> 
> Good point. I'll update the patchset.
> 
> > The xchg vs xchgcmp locking also looks good.
> > 
> > Reviewed-by: Andrea Arcangeli <aarcange <at> redhat.com>
> 
> Is it for the whole patchset? :)

It was meant for this one, but I reviewed the whole patchset and it
looks fine to me, so in this case it can apply to the whole patchset ;)
