Cody P Schafer | 28 Feb 03:41 2013

[RFC] DNUMA: Runtime NUMA memory layout reconfiguration

These patches allow the NUMA memory layout (meaning the mapping of a page to a
node) to be changed at runtime in place (without hotplugging).

= Why/when is this useful? =

In virtual machines (VMs) running on NUMA systems, this is useful both [a]
when the hypervisor decides to move their backing memory around (compacting,
prioritizing another VM's desired layout, etc.) and [b] in general for the
migration of VMs.

The hardware is _already_ changing the NUMA layout underneath us. We have
powerpc64 systems whose firmware currently moves the backing memory around,
and which have the ability to notify Linux of new NUMA info.

= Code & testing =

web:
	https://github.com/jmesmon/linux/tree/dnuma/v26
git:
	https://github.com/jmesmon/linux.git dnuma/v26

commit range:
	7e4f3230c9161706ebe9d37d774398082dc352de^..01e16461cf4a914feb1a34ed8dd7b28f3e842645

Some patches are marked "XXX: ..."; they are only for testing or
temporary documentation purposes.

A debugfs interface allows the NUMA memory layout to be changed.  Basically,
you don't need a weird system to test this; in fact, I've done all my
testing so far in plain old qemu-i386.

Cody P Schafer | 28 Feb 21:44 2013

[RFC][PATCH 00/24] DNUMA: Runtime NUMA memory layout reconfiguration

Some people asked me to send the email patches for this instead of just
posting a git tree link.

For reference, this is the original message:
	http://lkml.org/lkml/2013/2/27/374

--

 arch/x86/Kconfig                 |   1 -
 arch/x86/include/asm/sparsemem.h |   4 +-
 arch/x86/mm/numa.c               |  32 +++-
 include/linux/dnuma.h            |  96 +++++++++++
 include/linux/memlayout.h        | 111 +++++++++++++
 include/linux/memory_hotplug.h   |   4 +
 include/linux/mm.h               |   7 +-
 include/linux/page-flags.h       |  18 ++
 include/linux/rbtree.h           |  11 ++
 init/main.c                      |   2 +
 lib/rbtree.c                     |  40 +++++
 mm/Kconfig                       |  44 +++++
 mm/Makefile                      |   2 +
 mm/dnuma.c                       | 351 +++++++++++++++++++++++++++++++++++++++
 mm/internal.h                    |  13 +-
 mm/memlayout-debugfs.c           | 323 +++++++++++++++++++++++++++++++++++
 mm/memlayout-debugfs.h           |  35 ++++
 mm/memlayout.c                   | 267 +++++++++++++++++++++++++++++
 mm/memory_hotplug.c              |  53 +++---
 mm/page_alloc.c                  | 112 +++++++++++--
 20 files changed, 1486 insertions(+), 40 deletions(-)

--

Cody P Schafer | 28 Feb 21:44 2013

[PATCH 02/24] XXX: x86/Kconfig: simplify NUMA config for NUMA_EMU on X86_32.

NUMA_EMU depends on NUMA.
NUMA depends on (X86_64 || (X86_32 && ( list of extended platforms))).

This forced enabling an extended platform when using numa emulation on
x86_32, which is silly.

Removing the list of extended platforms (plus EXPERIMENTAL) results in
NUMA depending on X86_64 || X86_32, so simply remove all dependencies
(except SMP) from NUMA.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 arch/x86/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6a93833..58cd8fb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1228,7 +1228,6 @@ config DIRECT_GBPAGES
 config NUMA
 	bool "Numa Memory Allocation and Scheduler Support"
 	depends on SMP
-	depends on X86_64 || (X86_32 && HIGHMEM64G && (X86_NUMAQ || X86_BIGSMP || X86_SUMMIT && ACPI))
 	default y if (X86_NUMAQ || X86_SUMMIT || X86_BIGSMP)
 	---help---
 	  Enable NUMA (Non Uniform Memory Access) support.
--

-- 
1.8.1.1


Cody P Schafer | 28 Feb 21:44 2013

[PATCH 05/24] rbtree: add rbtree_postorder_for_each_entry_safe() helper.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/rbtree.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 2879e96..8ff52b2 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -85,4 +85,11 @@ static inline void rb_link_node(struct rb_node * node, struct rb_node * parent,
 	*rb_link = node;
 }

+#define rbtree_postorder_for_each_entry_safe(pos, n, root, field)		\
+	for (pos = rb_entry(rb_first_postorder(root), typeof(*pos), field),	\
+	      n = rb_entry(rb_next_postorder(&pos->field), typeof(*pos), field);	\
+	     &pos->field;							\
+	     pos = n,								\
+	      n = rb_entry(rb_next_postorder(&pos->field), typeof(*pos), field))
+
 #endif	/* _LINUX_RBTREE_H */
--

-- 
1.8.1.1
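
For illustration, a minimal sketch of how this helper can be used to tear
down a whole tree; the entry struct, its field names, and free_all_entries()
are made up for the example:

        #include <linux/rbtree.h>
        #include <linux/slab.h>

        struct entry {
                int key;
                struct rb_node node;    /* linked into the tree */
        };

        static void free_all_entries(struct rb_root *root)
        {
                struct entry *pos, *n;

                /* postorder visits children before their parent, so each
                 * entry can be freed without rebalancing or revisiting the
                 * already-freed subtree beneath it */
                rbtree_postorder_for_each_entry_safe(pos, n, root, node)
                        kfree(pos);
                *root = RB_ROOT;
        }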



Cody P Schafer | 28 Feb 21:44 2013

[PATCH 07/24] memory_hotplug: export ensure_zone_is_initialized() in mm/internal.h

Export ensure_zone_is_initialized() so that it can be used to initialize
new zones within the dynamic numa code.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/internal.h       | 8 ++++++++
 mm/memory_hotplug.c | 2 +-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 1c0c4cc..6c63752 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -105,6 +105,14 @@ extern void prep_compound_page(struct page *page, unsigned long order);
 extern bool is_free_buddy_page(struct page *page);
 #endif

+#ifdef CONFIG_MEMORY_HOTPLUG
+/*
+ * in mm/memory_hotplug.c
+ */
+extern int ensure_zone_is_initialized(struct zone *zone,
+			unsigned long start_pfn, unsigned long num_pages);
+#endif
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA

 /*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9e4c32b..9f43c80 100644

Cody P Schafer | 28 Feb 21:44 2013

[PATCH 06/24] mm/memory_hotplug: factor out zone+pgdat growth.

Create a new function grow_pgdat_and_zone() which handles locking and
growth of both a zone and the pgdat it is associated with.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/memory_hotplug.h |  3 +++
 mm/memory_hotplug.c            | 17 +++++++++++------
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index b6a3be7..cd393014 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -78,6 +78,9 @@ static inline void zone_seqlock_init(struct zone *zone)
 {
 	seqlock_init(&zone->span_seqlock);
 }
+extern void grow_pgdat_and_zone(struct zone *zone, unsigned long start_pfn,
+				unsigned long end_pfn);
+
 extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
 extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 102c06a..9e4c32b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -390,13 +390,22 @@ static void grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
 					pgdat->node_start_pfn;
 }
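
The rest of the hunk is truncated in the archive. Per the description, the
factored-out helper presumably takes the pgdat resize lock once and grows
both spans under it; a sketch, not necessarily the committed code:

        void grow_pgdat_and_zone(struct zone *zone, unsigned long start_pfn,
                                 unsigned long end_pfn)
        {
                unsigned long flags;

                /* one lock acquisition covers both resizes */
                pgdat_resize_lock(zone->zone_pgdat, &flags);
                grow_zone_span(zone, start_pfn, end_pfn);
                grow_pgdat_span(zone->zone_pgdat, start_pfn, end_pfn);
                pgdat_resize_unlock(zone->zone_pgdat, &flags);
        }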

Cody P Schafer | 28 Feb 21:44 2013

[PATCH 08/24] mm/memory_hotplug: use {pgdat,zone}_is_empty() when resizing zones & pgdats

Use the *_is_empty() helpers to be more clear about what we're actually
checking for.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/memory_hotplug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9f43c80..eae4a2a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -242,7 +242,7 @@ static void grow_zone_span(struct zone *zone, unsigned long start_pfn,
 	zone_span_writelock(zone);

 	old_zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
-	if (!zone->spanned_pages || start_pfn < zone->zone_start_pfn)
+	if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
 		zone->zone_start_pfn = start_pfn;

 	zone->spanned_pages = max(old_zone_end_pfn, end_pfn) -
@@ -383,7 +383,7 @@ static void grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
 	unsigned long old_pgdat_end_pfn =
 		pgdat->node_start_pfn + pgdat->node_spanned_pages;

-	if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn)
+	if (pgdat_is_empty(pgdat) || start_pfn < pgdat->node_start_pfn)
 		pgdat->node_start_pfn = start_pfn;

 	pgdat->node_spanned_pages = max(old_pgdat_end_pfn, end_pfn) -
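
The diff is cut off above. For reference, the helpers substituted in simply
test the span fields; paraphrased to match the expressions they replace:

        static inline bool zone_is_empty(struct zone *zone)
        {
                return zone->spanned_pages == 0;
        }

        static inline bool pgdat_is_empty(pg_data_t *pgdat)
        {
                return !pgdat->node_spanned_pages;      /* spans no pages */
        }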

Cody P Schafer | 28 Feb 21:44 2013

[PATCH 01/24] XXX: reduce MAX_PHYSADDR_BITS & MAX_PHYSMEM_BITS in PAE.

This is a hack I use to allow PAE to be enabled & still fit the node
into the pageflags (PAE is enabled as a workaround for a kvm bug).
Shrinking MAX_PHYSMEM_BITS from 36 to 32 shrinks the sparsemem section
field in page->flags from 7 bits (36 - 29) to 3, freeing room for the
node number.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 arch/x86/include/asm/sparsemem.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
index 4517d6b..548e612 100644
--- a/arch/x86/include/asm/sparsemem.h
+++ b/arch/x86/include/asm/sparsemem.h
@@ -17,8 +17,8 @@
 #ifdef CONFIG_X86_32
 # ifdef CONFIG_X86_PAE
 #  define SECTION_SIZE_BITS	29
-#  define MAX_PHYSADDR_BITS	36
-#  define MAX_PHYSMEM_BITS	36
+#  define MAX_PHYSADDR_BITS	32
+#  define MAX_PHYSMEM_BITS	32
 # else
 #  define SECTION_SIZE_BITS	26
 #  define MAX_PHYSADDR_BITS	32
--

-- 
1.8.1.1


Cody P Schafer | 28 Feb 21:44 2013

[PATCH 04/24] rbtree: add postorder iteration functions.

Add postorder iteration functions for rbtree. These are useful for
safely freeing an entire rbtree without modifying the tree at all.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/rbtree.h |  4 ++++
 lib/rbtree.c           | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 0022c1b..2879e96 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -68,6 +68,10 @@ extern struct rb_node *rb_prev(const struct rb_node *);
 extern struct rb_node *rb_first(const struct rb_root *);
 extern struct rb_node *rb_last(const struct rb_root *);

+/* Postorder iteration - always visit the parent after its children */
+extern struct rb_node *rb_first_postorder(const struct rb_root *);
+extern struct rb_node *rb_next_postorder(const struct rb_node *);
+
 /* Fast replacement of a single node without remove/rebalance/add/rebalance */
 extern void rb_replace_node(struct rb_node *victim, struct rb_node *new, 
 			    struct rb_root *root);
diff --git a/lib/rbtree.c b/lib/rbtree.c
index c0e31fe..65f4eff 100644
--- a/lib/rbtree.c
+++ b/lib/rbtree.c
@@ -518,3 +518,43 @@ void rb_replace_node(struct rb_node *victim, struct rb_node *new,
 	*new = *victim;
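
The lib/rbtree.c hunk is truncated in the archive. The walk (essentially
the version that later landed upstream) descends to the leftmost deepest
node first, then moves to a sibling's subtree or up to the parent; a sketch:

        static struct rb_node *rb_left_deepest_node(const struct rb_node *node)
        {
                for (;;) {
                        if (node->rb_left)
                                node = node->rb_left;
                        else if (node->rb_right)
                                node = node->rb_right;
                        else
                                return (struct rb_node *)node;
                }
        }

        struct rb_node *rb_next_postorder(const struct rb_node *node)
        {
                const struct rb_node *parent;

                if (!node)
                        return NULL;
                parent = rb_parent(node);

                /* a left child with a sibling hands off to the sibling's
                 * deepest leaf; otherwise the parent itself comes next */
                if (parent && node == parent->rb_left && parent->rb_right)
                        return rb_left_deepest_node(parent->rb_right);
                return (struct rb_node *)parent;
        }

        struct rb_node *rb_first_postorder(const struct rb_root *root)
        {
                if (!root->rb_node)
                        return NULL;
                return rb_left_deepest_node(root->rb_node);
        }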

Cody P Schafer | 28 Feb 21:44 2013

[PATCH 03/24] XXX: memory_hotplug locking note in online_pages.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/memory_hotplug.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b81a367b..102c06a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -984,6 +984,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ

 	zone->managed_pages += onlined_pages;
 	zone->present_pages += onlined_pages;
+	/* FIXME: should be protected by pgdat_resize_lock() */
 	zone->zone_pgdat->node_present_pages += onlined_pages;
 	if (onlined_pages) {
 		node_states_set_node(zone_to_nid(zone), &arg);
--

-- 
1.8.1.1


Cody P Schafer | 28 Feb 21:44 2013

[PATCH 09/24] mm: add nid_zone() helper

Add nid_zone(), which returns the zone corresponding to a given nid & zonenum.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/mm.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e7c3f9a..562304a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -707,9 +707,14 @@ static inline void page_nid_reset_last(struct page *page)
 }
 #endif

+static inline struct zone *nid_zone(int nid, enum zone_type zonenum)
+{
+	return &NODE_DATA(nid)->node_zones[zonenum];
+}
+
 static inline struct zone *page_zone(const struct page *page)
 {
-	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
+	return nid_zone(page_to_nid(page), page_zonenum(page));
 }

 #ifdef SECTION_IN_PAGE_FLAGS
--

-- 
1.8.1.1


Cody P Schafer | 28 Feb 22:26 2013

[PATCH 10/24] page_alloc: add return_pages_to_zone() when DYNAMIC_NUMA is enabled.

Add return_pages_to_zone(), which uses return_page_to_zone().
It is a minimized version of __free_pages_ok() that handles adding
pages which have been removed from another zone into a new zone.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/internal.h   |  5 ++++-
 mm/page_alloc.c | 17 +++++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 6c63752..b075e34 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -104,6 +104,10 @@ extern void prep_compound_page(struct page *page, unsigned long order);
 #ifdef CONFIG_MEMORY_FAILURE
 extern bool is_free_buddy_page(struct page *page);
 #endif
+#ifdef CONFIG_DYNAMIC_NUMA
+void return_pages_to_zone(struct page *page, unsigned int order,
+			  struct zone *zone);
+#endif

 #ifdef CONFIG_MEMORY_HOTPLUG
 /*
@@ -114,7 +118,6 @@ extern int ensure_zone_is_initialized(struct zone *zone,
 #endif

 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
-
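
The page_alloc.c hunk is truncated in the archive. A minimized
__free_pages_ok() that frees into a caller-chosen zone might look roughly
like this (a sketch; the committed code may differ):

        void return_pages_to_zone(struct page *page, unsigned int order,
                                  struct zone *zone)
        {
                unsigned long flags;

                local_irq_save(flags);
                /* free into the destination zone rather than
                 * page_zone(page), which may still be the old zone */
                free_one_page(zone, page, order,
                              get_pageblock_migratetype(page));
                local_irq_restore(flags);
        }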

Cody P Schafer | 28 Feb 22:26 2013

[PATCH 12/24] page_alloc: when dynamic numa is enabled, don't check that all pages in a block belong to the same zone

When dynamic numa is enabled, the last or first page in a pageblock may
have been transplanted to a new zone (or may not yet be transplanted to
a new zone).

Disable a BUG_ON() which checks that the start_page and end_page are in
the same zone; if they are not in the proper zone, they will simply be
skipped.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 972d7cc..274826c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -966,13 +966,16 @@ int move_freepages(struct zone *zone,
 	int pages_moved = 0;
 	int zone_nid = zone_to_nid(zone);

-#ifndef CONFIG_HOLES_IN_ZONE
+#if !defined(CONFIG_HOLES_IN_ZONE) && !defined(CONFIG_DYNAMIC_NUMA)
 	/*
-	 * page_zone is not safe to call in this context when
-	 * CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant
-	 * anyway as we check zone boundaries in move_freepages_block().
-	 * Remove at a later date when no bug reports exist related to
-	 * grouping pages by mobility
+	 * With CONFIG_HOLES_IN_ZONE set, this check is unsafe as start_page or

Cody P Schafer | 28 Feb 22:26 2013

[PATCH 14/24] memory_hotplug: factor out locks in mem_online_node()

In dynamic numa, when onlining nodes, lock_memory_hotplug() is already
held when mem_online_node()'s functionality is needed.

Factor out the locking and create a new function __mem_online_node() to
allow reuse.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/memory_hotplug.h |  1 +
 mm/memory_hotplug.c            | 29 ++++++++++++++++-------------
 2 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index cd393014..391824d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -248,6 +248,7 @@ static inline int is_mem_section_removable(unsigned long pfn,
 static inline void try_offline_node(int nid) {}
 #endif /* CONFIG_MEMORY_HOTREMOVE */

+extern int __mem_online_node(int nid);
 extern int mem_online_node(int nid);
 extern int add_memory(int nid, u64 start, u64 size);
 extern int arch_add_memory(int nid, u64 start, u64 size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index eae4a2a..7b0ab4f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1058,26 +1058,29 @@ static void rollback_node_hotadd(int nid, pg_data_t *pgdat)
 	return;
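
The hunk is truncated in the archive; after the split, the locked wrapper
presumably reduces to something like this sketch:

        int mem_online_node(int nid)
        {
                int ret;

                lock_memory_hotplug();
                ret = __mem_online_node(nid);
                unlock_memory_hotplug();
                return ret;
        }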

Cody P Schafer | 28 Feb 22:26 2013

[PATCH 13/24] page-flags dnuma: reserve a pageflag for determining if a page needs a node lookup.

Add a pageflag called "lookup_node" (PG_lookup_node / Page*LookupNode()).

Used by dynamic numa to indicate when a page has a new node assignment
waiting for it.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/page-flags.h | 18 ++++++++++++++++++
 mm/page_alloc.c            |  3 +++
 2 files changed, 21 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..e0241d8 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,9 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+#ifdef CONFIG_DYNAMIC_NUMA
+	PG_lookup_node,		/* need to do an extra lookup to determine actual node */
+#endif
 	__NR_PAGEFLAGS,

 	/* Filesystems */
@@ -275,6 +278,17 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif

+/* Setting is unconditional, simply leads to an extra lookup.
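
The hunk is cut off above, mid-comment. The accessors it goes on to define
presumably follow the usual page-flags pattern; a sketch (the exact macros
are a guess):

        #ifdef CONFIG_DYNAMIC_NUMA
        /* defines Page/SetPage/ClearPageLookupNode() */
        PAGEFLAG(LookupNode, lookup_node)
        #else
        PAGEFLAG_FALSE(LookupNode)
        #endif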

Cody P Schafer | 28 Feb 22:26 2013

[PATCH 17/24] page_alloc: use dnuma to transplant newly freed pages in __free_pages_ok()

__free_pages_ok() handles higher-order (order != 0) pages. The transplant
hook is added here as this is where the struct zone to free to is
decided.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5eeb547..5c7930f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
 #include <linux/migrate.h>
 #include <linux/page-debug-flags.h>
 #include <linux/sched/rt.h>
+#include <linux/dnuma.h>

 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -739,6 +740,13 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
 	int migratetype;
+	int dest_nid = dnuma_page_needs_move(page);
+	struct zone *zone;
+
+	if (dest_nid != NUMA_NO_NODE)
+		zone = nid_zone(dest_nid, page_zonenum(page));
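
The hunk is cut off above. Folding the visible additions into the
__free_pages_ok() of that era, the completed function presumably reads
roughly like this sketch:

        static void __free_pages_ok(struct page *page, unsigned int order)
        {
                unsigned long flags;
                int migratetype;
                int dest_nid = dnuma_page_needs_move(page);
                struct zone *zone;

                if (dest_nid != NUMA_NO_NODE)
                        zone = nid_zone(dest_nid, page_zonenum(page));
                else
                        zone = page_zone(page);

                if (!free_pages_prepare(page, order))
                        return;

                local_irq_save(flags);
                __count_vm_events(PGFREE, 1 << order);
                migratetype = get_pageblock_migratetype(page);
                set_freepage_migratetype(page, migratetype);
                /* the free targets the chosen zone, which differs from
                 * page_zone(page) when a transplant is pending */
                free_one_page(zone, page, order, migratetype);
                local_irq_restore(flags);
        }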

Cody P Schafer | 28 Feb 22:26 2013

[PATCH 16/24] mm: memlayout+dnuma: add debugfs interface

Add a debugfs interface to dnuma/memlayout. It keeps track of a
variable backlog of memory layouts, provides some statistics on
dnuma-moved pages & cache performance, and allows a new global
memlayout to be set.

TODO: split the statistics, backlog, & write interfaces out from each other.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/memlayout.h |   1 +
 mm/Kconfig                |  25 ++++
 mm/Makefile               |   1 +
 mm/dnuma.c                |   2 +
 mm/memlayout-debugfs.c    | 323 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/memlayout-debugfs.h    |  35 +++++
 mm/memlayout.c            |  17 ++-
 7 files changed, 402 insertions(+), 2 deletions(-)
 create mode 100644 mm/memlayout-debugfs.c
 create mode 100644 mm/memlayout-debugfs.h

diff --git a/include/linux/memlayout.h b/include/linux/memlayout.h
index eeb88e0..499ab4d 100644
--- a/include/linux/memlayout.h
+++ b/include/linux/memlayout.h
@@ -53,6 +53,7 @@ struct memlayout {
 };

 extern __rcu struct memlayout *pfn_to_node_map;
+extern struct mutex memlayout_lock; /* update-side lock */


Cody P Schafer | 28 Feb 22:26 2013

[PATCH 19/24] page_alloc: transplant pages that are being flushed from the per-cpu lists

In free_pcppages_bulk(), check if a page needs to be moved to a new
node/zone & then perform the transplant (in a slightly deferred manner).

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5579eda..11947c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -650,13 +650,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	int migratetype = 0;
 	int batch_free = 0;
 	int to_free = count;
+	struct page *pos, *page;
+	LIST_HEAD(need_move);

 	spin_lock(&zone->lock);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;

 	while (to_free) {
-		struct page *page;
 		struct list_head *list;

 		/*
@@ -679,11 +680,23 @@

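
The second hunk is truncated above. Given the need_move list added in the
first hunk, the flow is presumably: under zone->lock, divert pages whose
node changed onto need_move instead of freeing them in place, then
transplant them once the lock is dropped. Roughly (a sketch, not the
committed code):

        /* inside the freeing loop, under zone->lock: */
        if (dnuma_page_needs_move(page) != NUMA_NO_NODE)
                list_move(&page->lru, &need_move);      /* defer the move */
        else
                __free_one_page(page, zone, 0,
                                get_freepage_migratetype(page));

        /* ...after spin_unlock(&zone->lock): */
        list_for_each_entry_safe(page, pos, &need_move, lru) {
                int nid = memlayout_pfn_to_nid(page_to_pfn(page));

                list_del(&page->lru);
                return_pages_to_zone(page, 0,
                                     nid_zone(nid, page_zonenum(page)));
        }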

Cody P Schafer | 28 Feb 22:26 2013

[PATCH 18/24] page_alloc: use dnuma to transplant newly freed pages in free_hot_cold_page()

free_hot_cold_page() is used for order == 0 pages, and is where the
page's zone is decided.

In the normal case, these pages are freed to the per-cpu lists. When a
page needs transplanting (i.e., the actual node it belongs to has changed
and it needs to be moved to another zone), the pcp lists are skipped &
the page is freed via free_one_page().

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5c7930f..5579eda 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1364,6 +1364,7 @@ void mark_free_pages(struct zone *zone)
  */
 void free_hot_cold_page(struct page *page, int cold)
 {
+	int dest_nid;
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
@@ -1377,6 +1378,15 @@ void free_hot_cold_page(struct page *page, int cold)
 	local_irq_save(flags);
 	__count_vm_event(PGFREE);

+	dest_nid = dnuma_page_needs_move(page);
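
The hunk is cut off above; per the description, the check presumably
continues along these lines (a sketch; 'out' is the function's existing
label that restores interrupts):

        dest_nid = dnuma_page_needs_move(page);
        if (dest_nid != NUMA_NO_NODE) {
                /* transplant: skip the pcp lists and free directly
                 * into the destination zone */
                free_one_page(nid_zone(dest_nid, page_zonenum(page)),
                              page, 0, migratetype);
                goto out;
        }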

Cody P Schafer | 28 Feb 22:26 2013

[PATCH 15/24] mm: add memlayout & dnuma to track pfn->nid & transplant pages between nodes

On certain systems, the hypervisor can (and will) relocate physical
addresses as seen in a VM between real NUMA nodes. One example is IBM's
Power systems running PHYP (their proprietary hypervisor).

This change set introduces the infrastructure for tracking & dynamically
changing "memory layouts" (or "memlayouts"): the mapping between page
ranges & the actual backing NUMA node.

A memlayout is an rbtree which maps pfns (really, ranges of pfns) to a
node. This mapping (combined with the LookupNode pageflag) is used to
"transplant" pages (move them between nodes) when they are freed back
to the page allocator.

Additionally, when a new memlayout is committed, the currently free pages
that now sit on the wrong zone's freelist are immediately transplanted.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/dnuma.h     |  96 +++++++++++++
 include/linux/memlayout.h | 110 +++++++++++++++
 mm/Kconfig                |  19 +++
 mm/Makefile               |   1 +
 mm/dnuma.c                | 349 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/memlayout.c            | 238 +++++++++++++++++++++++++++++++
 6 files changed, 813 insertions(+)
 create mode 100644 include/linux/dnuma.h
 create mode 100644 include/linux/memlayout.h
 create mode 100644 mm/dnuma.c
 create mode 100644 mm/memlayout.c

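
The patch body is truncated in the archive. From the description, the
free-path check presumably looks roughly like this (a sketch;
dnuma_page_needs_move(), memlayout_pfn_to_nid(), and the LookupNode flag
all appear elsewhere in the series, but this exact body is illustrative):

        /* Returns the node this page should move to, or NUMA_NO_NODE. */
        static inline int dnuma_page_needs_move(struct page *page)
        {
                int new_nid;

                if (!PageLookupNode(page))
                        return NUMA_NO_NODE;

                /* the flag is set when a new memlayout is committed; clear
                 * it and do the (more expensive) rbtree lookup */
                ClearPageLookupNode(page);
                new_nid = memlayout_pfn_to_nid(page_to_pfn(page));
                if (new_nid == NUMA_NO_NODE || new_nid == page_to_nid(page))
                        return NUMA_NO_NODE;
                return new_nid;
        }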

Cody P Schafer | 28 Feb 22:26 2013

[PATCH 20/24] x86: memlayout: add an arch-specific initial memlayout setter.

On x86, we have numa_meminfo specifically to track the NUMA layout, which
is precisely the data memlayout needs, so use it to create an initial
memlayout.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 arch/x86/mm/numa.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index ff3633c..a2a8dd5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -11,6 +11,7 @@
 #include <linux/nodemask.h>
 #include <linux/sched.h>
 #include <linux/topology.h>
+#include <linux/dnuma.h>

 #include <asm/e820.h>
 #include <asm/proto.h>
@@ -32,6 +33,29 @@ __initdata
 #endif
 ;

+#ifdef CONFIG_DYNAMIC_NUMA
+void __init memlayout_global_init(void)
+{
+	struct numa_meminfo *mi = &numa_meminfo;
+	int i;
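
The function body is truncated above; presumably it walks numa_meminfo and
feeds each block into a new memlayout, roughly (a sketch; memlayout_create()
and memlayout_new_range() are illustrative names, though memlayout_commit()
does appear in patch 22):

        struct memlayout *ml = memlayout_create();      /* illustrative */

        if (!ml)
                return;

        for (i = 0; i < mi->nr_blks; i++) {
                struct numa_memblk *blk = &mi->blk[i];

                /* record the pfn-range -> nid mapping for each block */
                memlayout_new_range(ml, PFN_DOWN(blk->start),
                                    PFN_DOWN(blk->end) - 1, blk->nid);
        }
        memlayout_commit(ml);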

Cody P Schafer | 28 Feb 22:26 2013

[PATCH 11/24] page_alloc: in move_freepages(), skip pages instead of VM_BUG on node differences.

With dynamic numa, pages are going to be gradually moved from one node to
another, causing the page ranges that move_freepages() examines to
contain pages that actually belong to another node.

When dynamic numa is enabled, we skip these pages instead of VM_BUGing
out on them.

This additionally moves the VM_BUG_ON() (which detects a change in node)
so that it follows the pfn_valid_within() check.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bbc9b6e..972d7cc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -964,6 +964,7 @@ int move_freepages(struct zone *zone,
 	struct page *page;
 	unsigned long order;
 	int pages_moved = 0;
+	int zone_nid = zone_to_nid(zone);

 #ifndef CONFIG_HOLES_IN_ZONE
 	/*
@@ -977,14 +978,24 @@
 #endif

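
The second hunk is truncated above; per the description, the loop body
presumably becomes (sketch):

        for (page = start_page; page <= end_page;) {
                if (!pfn_valid_within(page_to_pfn(page))) {
                        page++;
                        continue;
                }

                if (page_to_nid(page) != zone_nid) {
        #ifndef CONFIG_DYNAMIC_NUMA
                        /* without dynamic NUMA this is still a bug */
                        VM_BUG_ON(1);
        #endif
                        /* with dynamic NUMA the page simply hasn't been
                         * transplanted yet (or was already moved); skip it */
                        page++;
                        continue;
                }

                /* ... remainder of the loop is unchanged ... */
        }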

Cody P Schafer | 28 Feb 22:57 2013

[PATCH 21/24] init/main: call memlayout_global_init() in start_kernel().

memlayout_global_init() initializes the first memlayout, which is
assumed to match the initial page-flag nid settings.

This is done in start_kernel() as the initdata used to populate the
memlayout is purged from memory early in the boot process (XXX: When?).

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 init/main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/init/main.c b/init/main.c
index 63534a1..a1c2094 100644
--- a/init/main.c
+++ b/init/main.c
@@ -72,6 +72,7 @@
 #include <linux/ptrace.h>
 #include <linux/blkdev.h>
 #include <linux/elevator.h>
+#include <linux/memlayout.h>

 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -618,6 +619,7 @@ asmlinkage void __init start_kernel(void)
 	security_init();
 	dbg_late_init();
 	vfs_caches_init(totalram_pages);
+	memlayout_global_init();
 	signals_init();
 	/* rootfs populating might need page-writeback */

Cody P Schafer | 28 Feb 22:57 2013

[PATCH 23/24] x86/mm/numa: when dnuma is enabled, use memlayout to handle memory hotplug's physaddr_to_nid.

When a memlayout is tracked (i.e., CONFIG_DYNAMIC_NUMA is enabled), rather
than iterating over numa_meminfo, the lookup can be done using memlayout.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 arch/x86/mm/numa.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index a2a8dd5..1ed76d5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
 <at>  <at>  -28,7 +28,7  <at>  <at>  struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);

 static struct numa_meminfo numa_meminfo
-#ifndef CONFIG_MEMORY_HOTPLUG
+#if !defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DYNAMIC_NUMA)
 __initdata
 #endif
 ;
@@ -832,7 +832,7 @@ EXPORT_SYMBOL(cpumask_of_node);

 #endif	/* !CONFIG_DEBUG_PER_CPU_MAPS */

-#ifdef CONFIG_MEMORY_HOTPLUG
+#if defined(CONFIG_MEMORY_HOTPLUG) && !defined(CONFIG_DYNAMIC_NUMA)
 int memory_add_physaddr_to_nid(u64 start)
 {
 	struct numa_meminfo *mi = &numa_meminfo;

Cody P Schafer | 28 Feb 22:57 2013

[PATCH 24/24] XXX: x86/mm/numa: Avoid spamming warnings due to lack of cpu reconfig

The code wants to map a node id to a cpu mask, but we don't update the
arch-specific cpu masks when onlining a new node. For now, avoid this
warning (as it is expected) when DYNAMIC_NUMA is enabled.

Modifying __mem_online_node() to fix this up would be ideal.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 arch/x86/mm/numa.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1ed76d5..e9a50df 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -813,10 +813,14 @@ void __cpuinit numa_remove_cpu(int cpu)
 const struct cpumask *cpumask_of_node(int node)
 {
 	if (node >= nr_node_ids) {
+		/* XXX: this ifdef should be removed when proper cpu to node
+		 * mapping updates are added */
+#ifndef CONFIG_DYNAMIC_NUMA
 		printk(KERN_WARNING
 			"cpumask_of_node(%d): node > nr_node_ids(%d)\n",
 			node, nr_node_ids);
 		dump_stack();
+#endif
 		return cpu_none_mask;
 	}
 	if (node_to_cpumask_map[node] == NULL) {

Cody P Schafer | 28 Feb 22:57 2013

[PATCH 22/24] dnuma: memlayout: add memory_add_physaddr_to_nid() for memory_hotplug

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/memlayout.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/mm/memlayout.c b/mm/memlayout.c
index 5fef032..b432b3a 100644
--- a/mm/memlayout.c
+++ b/mm/memlayout.c
@@ -249,3 +249,19 @@ void memlayout_global_init(void)

 	memlayout_commit(ml);
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+/*
+ * Provides a default memory_add_physaddr_to_nid() for memory hotplug, unless
+ * overridden by the arch.
+ */
+__weak
+int memory_add_physaddr_to_nid(u64 start)
+{
+	int nid = memlayout_pfn_to_nid(PFN_DOWN(start));
+	if (nid == NUMA_NO_NODE)
+		return 0;
+	return nid;
+}
+EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
+#endif
--

-- 

Simon Jeons | 4 Apr 07:28 2013

Re: [RFC][PATCH 00/24] DNUMA: Runtime NUMA memory layout reconfiguration

Hi Cody,
On 03/01/2013 04:44 AM, Cody P Schafer wrote:
> Some people asked me to send the email patches for this instead of just posting a git tree link
>
> For reference, this is the original message:
> 	http://lkml.org/lkml/2013/2/27/374

Could you show me your test codes?


Cody P Schafer | 4 Apr 21:07 2013

Re: [RFC][PATCH 00/24] DNUMA: Runtime NUMA memory layout reconfiguration

On 04/03/2013 10:28 PM, Simon Jeons wrote:
> Hi Cody,
> On 03/01/2013 04:44 AM, Cody P Schafer wrote:
>> Some people asked me to send the email patches for this instead of
>> just posting a git tree link
>>
>> For reference, this is the original message:
>>     http://lkml.org/lkml/2013/2/27/374
>
> Could you show me your test codes?
>

Sure, I linked to it in the original email:

	https://raw.github.com/jmesmon/trifles/master/bin/dnuma-test

I generally run something like `dnuma-test s 1 3 512`, which creates 
stripes with size='512 pages' and distributes them between nodes 1, 2, 
and 3.

Also, this patchset has some major issues (not updating the watermarks, 
for example). I've been working on ironing them out, and plan on sending 
another patchset out "soon". Current tree is 
https://github.com/jmesmon/linux/tree/dnuma/v31 (keep in mind that this 
has a few commits in it that I just use for development).


