Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 00/25] Dynamic NUMA: Runtime NUMA memory layout reconfiguration

These patches allow the NUMA memory layout (the mapping from physical pages to
NUMA nodes, i.e. which node each physical page belongs to) to be changed at
runtime, in place, without memory hotplug.

Depends on "mm: avoid duplication of setup_nr_node_ids()",
http://comments.gmane.org/gmane.linux.kernel.mm/96880, which is merged into the
current MMOTM.

TODO:

 - Update sysfs node information when reconfiguration occurs
 - Currently, I use pageflag setters without "owning" the pages, which could
   cause loss of pageflag updates when combined with non-atomic pageflag users
   in mm/*. Some options for solving this: (a) make all pageflag accesses
   atomic, (b) use pageblock flags, (c) use bits in a new bitmap, or (d) attempt
   to work around the races in a similar way to memory-failure.

= Why/when is this useful? =

This is useful for virtual machines (VMs) running on NUMA systems, both [a]
when the hypervisor decides to move their backing memory around (compacting,
prioritizing another VM's desired layout, etc.) and [b] more generally for
migration of VMs.

The hardware is _already_ changing the NUMA layout underneath us. We have
powerpc64 systems whose firmware currently moves the backing memory around and
has the ability to notify Linux of new NUMA information.

= How are you managing to do this? =

(Continue reading)

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 01/25] rbtree: add postorder iteration functions.

Add postorder iteration functions for rbtree. These are useful for
safely freeing an entire rbtree without modifying the tree at all.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 include/linux/rbtree.h |  4 ++++
 lib/rbtree.c           | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 0022c1b..2879e96 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -68,6 +68,10 @@ extern struct rb_node *rb_prev(const struct rb_node *);
 extern struct rb_node *rb_first(const struct rb_root *);
 extern struct rb_node *rb_last(const struct rb_root *);

+/* Postorder iteration - always visit the parent after its children */
+extern struct rb_node *rb_first_postorder(const struct rb_root *);
+extern struct rb_node *rb_next_postorder(const struct rb_node *);
+
 /* Fast replacement of a single node without remove/rebalance/add/rebalance */
 extern void rb_replace_node(struct rb_node *victim, struct rb_node *new, 
 			    struct rb_root *root);
diff --git a/lib/rbtree.c b/lib/rbtree.c
index c0e31fe..65f4eff 100644
--- a/lib/rbtree.c
+++ b/lib/rbtree.c
@@ -518,3 +518,43 @@ void rb_replace_node(struct rb_node *victim, struct rb_node *new,
 	*new = *victim;
(Continue reading)
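
The implementation hunk is cut off above. As a purely illustrative sketch of
the use case named in the changelog (safely freeing an entire rbtree), assuming
entries embed their rb_node in a field called "node" (the struct and field
names here are hypothetical, not from the patch):

#include <linux/rbtree.h>
#include <linux/slab.h>

struct thing {
	struct rb_node node;
	/* ... payload ... */
};

static void free_all_things(struct rb_root *root)
{
	struct rb_node *pos = rb_first_postorder(root);

	while (pos) {
		struct thing *t = rb_entry(pos, struct thing, node);

		/* Postorder visits children before their parent, so advancing
		 * the iterator before kfree() never touches a node that has
		 * already been freed. */
		pos = rb_next_postorder(pos);
		kfree(t);
	}
	*root = RB_ROOT;
}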

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 02/25] rbtree: add rbtree_postorder_for_each_entry_safe() helper.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 include/linux/rbtree.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 2879e96..1b239ca 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -85,4 +85,12 @@ static inline void rb_link_node(struct rb_node * node, struct rb_node * parent,
 	*rb_link = node;
 }

+#define rbtree_postorder_for_each_entry_safe(pos, n, root, field) \
+	for (pos = rb_entry(rb_first_postorder(root), typeof(*pos), field),\
+	      n = rb_entry(rb_next_postorder(&pos->field), \
+		      typeof(*pos), field); \
+	     &pos->field; \
+	     pos = n, \
+	      n = rb_entry(rb_next_postorder(&pos->field), typeof(*pos), field))
+
 #endif	/* _LINUX_RBTREE_H */
--

-- 
1.8.2.1
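
As a usage sketch (struct and field names hypothetical, not from the patch),
the helper lets a postorder teardown of a tree collapse to:

struct thing {
	struct rb_node node;
};

static void free_all_things(struct rb_root *root)
{
	struct thing *pos, *n;

	/* Safe against freeing: 'n' is fetched before 'pos' is released. */
	rbtree_postorder_for_each_entry_safe(pos, n, root, node)
		kfree(pos);

	*root = RB_ROOT;
}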

(Continue reading)

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 03/25] mm/memory_hotplug: factor out zone+pgdat growth.

Create a new function grow_pgdat_and_zone() which handles locking +
growth of a zone & the pgdat which it is associated with.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 include/linux/memory_hotplug.h |  3 +++
 mm/memory_hotplug.c            | 17 +++++++++++------
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index b6a3be7..cd393014 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -78,6 +78,9 @@ static inline void zone_seqlock_init(struct zone *zone)
 {
 	seqlock_init(&zone->span_seqlock);
 }
+extern void grow_pgdat_and_zone(struct zone *zone, unsigned long start_pfn,
+				unsigned long end_pfn);
+
 extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
 extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 46de32a..8f4d8d3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -390,13 +390,22 @@ static void grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
 					pgdat->node_start_pfn;
 }
(Continue reading)
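
The hunk is truncated above. A plausible shape for the factored-out helper,
assuming it simply wraps the existing static grow_zone_span()/grow_pgdat_span()
calls under the pgdat resize lock (as the caller previously did inline), would
be:

void grow_pgdat_and_zone(struct zone *zone, unsigned long start_pfn,
			 unsigned long end_pfn)
{
	unsigned long flags;
	struct pglist_data *pgdat = zone->zone_pgdat;

	pgdat_resize_lock(pgdat, &flags);
	grow_zone_span(zone, start_pfn, end_pfn);
	grow_pgdat_span(pgdat, start_pfn, end_pfn);
	pgdat_resize_unlock(pgdat, &flags);
}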

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 04/25] memory_hotplug: export ensure_zone_is_initialized() in mm/internal.h

Export ensure_zone_is_initialized() so that it can be used to initialize
new zones within the dynamic numa code.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 mm/internal.h       | 8 ++++++++
 mm/memory_hotplug.c | 2 +-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 8562de0..b11e574 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -105,6 +105,14 @@ extern void prep_compound_page(struct page *page, unsigned long order);
 extern bool is_free_buddy_page(struct page *page);
 #endif

+#ifdef CONFIG_MEMORY_HOTPLUG
+/*
+ * in mm/memory_hotplug.c
+ */
+extern int ensure_zone_is_initialized(struct zone *zone,
+			unsigned long start_pfn, unsigned long num_pages);
+#endif
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA

 /*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8f4d8d3..df04c36 100644
(Continue reading)

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 05/25] mm/memory_hotplug: use {pgdat,zone}_is_empty() when resizing zones & pgdats

Use the *_is_empty() helpers to be more clear about what we're actually
checking for.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 mm/memory_hotplug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index df04c36..deea8c2 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -242,7 +242,7 @@ static void grow_zone_span(struct zone *zone, unsigned long start_pfn,
 	zone_span_writelock(zone);

 	old_zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
-	if (!zone->spanned_pages || start_pfn < zone->zone_start_pfn)
+	if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
 		zone->zone_start_pfn = start_pfn;

 	zone->spanned_pages = max(old_zone_end_pfn, end_pfn) -
@@ -383,7 +383,7 @@ static void grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
 	unsigned long old_pgdat_end_pfn =
 		pgdat->node_start_pfn + pgdat->node_spanned_pages;

-	if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn)
+	if (pgdat_is_empty(pgdat) || start_pfn < pgdat->node_start_pfn)
 		pgdat->node_start_pfn = start_pfn;

 	pgdat->node_spanned_pages = max(old_pgdat_end_pfn, end_pfn) -
(Continue reading)

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 06/25] mm: add nid_zone() helper

Add nid_zone(), which returns the zone corresponding to a given nid & zonenum.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 include/linux/mm.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9ddae00..1b6abae 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -708,9 +708,14 @@ static inline void page_nid_reset_last(struct page *page)
 }
 #endif

+static inline struct zone *nid_zone(int nid, enum zone_type zonenum)
+{
+	return &NODE_DATA(nid)->node_zones[zonenum];
+}
+
 static inline struct zone *page_zone(const struct page *page)
 {
-	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
+	return nid_zone(page_to_nid(page), page_zonenum(page));
 }

 #ifdef SECTION_IN_PAGE_FLAGS
--

-- 
1.8.2.1

(Continue reading)

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 08/25] page_alloc: in move_freepages(), skip pages instead of VM_BUG on node differences.

With dynamic numa, pages are going to be gradually moved from one node to
another, causing the page ranges that move_freepages() examines to
contain pages that actually belong to another node.

When dynamic numa is enabled, we skip these pages instead of VM_BUGing
out on them.

This additionally moves the VM_BUG_ON() (which detects a change in node)
so that it follows the pfn_valid_within() check.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 mm/page_alloc.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1fbf5f2..75192eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -957,6 +957,7 @@ int move_freepages(struct zone *zone,
 	struct page *page;
 	unsigned long order;
 	int pages_moved = 0;
+	int zone_nid = zone_to_nid(zone);

 #ifndef CONFIG_HOLES_IN_ZONE
 	/*
@@ -970,14 +971,24 @@ int move_freepages(struct zone *zone,
 #endif

(Continue reading)
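
The hunk is cut off above; the behaviour described in the changelog amounts to
something like the following inside move_freepages()'s scan loop (a sketch, not
the literal patch):

	for (page = start_page; page <= end_page;) {
		if (!pfn_valid_within(page_to_pfn(page))) {
			page++;
			continue;
		}

#ifdef CONFIG_DYNAMIC_NUMA
		/* The page may be mid-transplant and still belong to another
		 * node; just skip it. */
		if (page_to_nid(page) != zone_nid) {
			page++;
			continue;
		}
#else
		/* Make sure we are not inadvertently changing nodes. */
		VM_BUG_ON(page_to_nid(page) != zone_nid);
#endif
		/* ... existing buddy handling of the page continues here ... */
	}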

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 09/25] page_alloc: when dynamic numa is enabled, don't check that all pages in a block belong to the same zone

When dynamic numa is enabled, the last or first page in a pageblock may
have been transplanted to a new zone (or may not yet be transplanted to
a new zone).

Disable a BUG_ON() which checks that start_page and end_page are in the
same zone; if they are not in the proper zone, they will simply be
skipped.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 mm/page_alloc.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 75192eb..95e4a23 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -959,13 +959,16 @@ int move_freepages(struct zone *zone,
 	int pages_moved = 0;
 	int zone_nid = zone_to_nid(zone);

-#ifndef CONFIG_HOLES_IN_ZONE
+#if !defined(CONFIG_HOLES_IN_ZONE) && !defined(CONFIG_DYNAMIC_NUMA)
 	/*
-	 * page_zone is not safe to call in this context when
-	 * CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant
-	 * anyway as we check zone boundaries in move_freepages_block().
-	 * Remove at a later date when no bug reports exist related to
-	 * grouping pages by mobility
+	 * With CONFIG_HOLES_IN_ZONE set, this check is unsafe as start_page or
(Continue reading)

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 10/25] page-flags dnuma: reserve a pageflag for determining if a page needs a node lookup.

Add a pageflag called "lookup_node"/ PG_lookup_node / Page*LookupNode().

Used by dynamic numa to indicate when a page has a new node assignment
waiting for it.

FIXME: This also exempts PG_lookup_node from PAGE_FLAGS_CHECK_AT_PREP
due to the asynchronous usage of PG_lookup_node, which needs to be
avoided.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 include/linux/page-flags.h | 19 +++++++++++++++++++
 mm/page_alloc.c            |  3 +++
 2 files changed, 22 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..09dd94e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,9 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+#ifdef CONFIG_DYNAMIC_NUMA
+	PG_lookup_node,		/* extra lookup required to find real node */
+#endif
 	__NR_PAGEFLAGS,

 	/* Filesystems */
@@ -275,6 +278,17 @@ PAGEFLAG_FALSE(HWPoison)
(Continue reading)
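
The rest of the hunk is truncated. Following the existing conventions in
page-flags.h, the accessors for the new flag presumably look roughly like the
sketch below (the exact set of modifiers used by the patch is not visible
here):

#ifdef CONFIG_DYNAMIC_NUMA
PAGEFLAG(LookupNode, lookup_node)
TESTCLEARFLAG(LookupNode, lookup_node)
#else
PAGEFLAG_FALSE(LookupNode)
TESTCLEARFLAG_FALSE(LookupNode)
#endif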

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 07/25] page_alloc: add return_pages_to_zone() when DYNAMIC_NUMA is enabled.

Add return_pages_to_zone(), which uses return_page_to_zone().
It is a minimized version of __free_pages_ok() which handles adding
pages which have been removed from another zone into a new zone.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 mm/internal.h   |  5 ++++-
 mm/page_alloc.c | 17 +++++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index b11e574..a70c77b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -104,6 +104,10 @@ extern void prep_compound_page(struct page *page, unsigned long order);
 #ifdef CONFIG_MEMORY_FAILURE
 extern bool is_free_buddy_page(struct page *page);
 #endif
+#ifdef CONFIG_DYNAMIC_NUMA
+void return_pages_to_zone(struct page *page, unsigned int order,
+			  struct zone *zone);
+#endif

 #ifdef CONFIG_MEMORY_HOTPLUG
 /*
@@ -114,7 +118,6 @@ extern int ensure_zone_is_initialized(struct zone *zone,
 #endif

 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
-
(Continue reading)

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 11/25] memory_hotplug: factor out locks in mem_online_node()

In dynamic numa, when onlining nodes, lock_memory_hotplug() is already
held when mem_online_node()'s functionality is needed.

Factor out the locking and create a new function __mem_online_node() to
allow reuse.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 include/linux/memory_hotplug.h |  1 +
 mm/memory_hotplug.c            | 29 ++++++++++++++++-------------
 2 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index cd393014..391824d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -248,6 +248,7 @@ static inline int is_mem_section_removable(unsigned long pfn,
 static inline void try_offline_node(int nid) {}
 #endif /* CONFIG_MEMORY_HOTREMOVE */

+extern int __mem_online_node(int nid);
 extern int mem_online_node(int nid);
 extern int add_memory(int nid, u64 start, u64 size);
 extern int arch_add_memory(int nid, u64 start, u64 size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index deea8c2..f5ea9b7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1058,26 +1058,29 @@ static void rollback_node_hotadd(int nid, pg_data_t *pgdat)
 	return;
(Continue reading)

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 12/25] mm: add memlayout & dnuma to track pfn->nid & transplant pages between nodes

On some systems, the hypervisor can (and will) relocate the physical
addresses seen inside a VM between real NUMA nodes; for example, IBM
Power systems running particular revisions of PHYP (IBM's proprietary
hypervisor) do this.

This change set introduces the infrastructure for tracking & dynamically
changing "memory layouts" (or "memlayouts"): the mapping between page
ranges & the actual backing NUMA node.

A memlayout is stored as an rbtree which maps pfns (really, ranges of
pfns) to a node. This mapping (combined with the LookupNode pageflag) is
used to "transplant" pages (move them between nodes) when they are freed
back to the page allocator.

Additionally, when a new memlayout is committed, the currently free pages
that now sit on the 'wrong' zone's freelists are immediately transplanted.

Hooks that tie this into the page allocator to actually perform the
"transplant on free" are in later patches.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 include/linux/dnuma.h     |  97 ++++++++++
 include/linux/memlayout.h | 126 +++++++++++++
 mm/Kconfig                |  24 +++
 mm/Makefile               |   1 +
 mm/dnuma.c                | 439 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/memlayout.c            | 237 +++++++++++++++++++++++++
 6 files changed, 924 insertions(+)
 create mode 100644 include/linux/dnuma.h
(Continue reading)
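
The new files are not visible in this truncated view. Conceptually, a memlayout
is an rbtree of pfn ranges, each carrying the node the range now belongs to; a
rough sketch of the idea follows (the structure names, field names, and the
current_memlayout pointer are illustrative assumptions, and locking/RCU details
are omitted):

/* One contiguous pfn range and the node it is assigned to. */
struct rangemap_entry {
	struct rb_node node;
	unsigned long pfn_start;
	unsigned long pfn_end;		/* inclusive */
	int nid;
};

struct memlayout {
	struct rb_root root;		/* rangemap_entries keyed by pfn_start */
	/* ... bookkeeping (sequence number, refcount, etc.) ... */
};

/* The currently committed layout (illustrative; publication details omitted). */
static struct memlayout *current_memlayout;

int memlayout_pfn_to_nid(unsigned long pfn)
{
	struct memlayout *ml = current_memlayout;
	struct rb_node *n = ml ? ml->root.rb_node : NULL;

	while (n) {
		struct rangemap_entry *rme =
			rb_entry(n, struct rangemap_entry, node);

		if (pfn < rme->pfn_start)
			n = n->rb_left;
		else if (pfn > rme->pfn_end)
			n = n->rb_right;
		else
			return rme->nid;
	}
	return NUMA_NO_NODE;
}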

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 14/25] page_alloc: use dnuma to transplant newly freed pages in __free_pages_ok()

__free_pages_ok() handles higher-order (order != 0) pages. The transplant
hook is added here because this is where the struct zone to free to is
decided.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 mm/page_alloc.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4628443..f8ae178 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
 #include <linux/migrate.h>
 #include <linux/page-debug-flags.h>
 #include <linux/sched/rt.h>
+#include <linux/dnuma.h>

 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -732,6 +733,13 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
 	int migratetype;
+	int dest_nid = dnuma_page_needs_move(page);
+	struct zone *zone;
+
+	if (dest_nid != NUMA_NO_NODE)
+		zone = nid_zone(dest_nid, page_zonenum(page));
(Continue reading)
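
The hunk is truncated right after the destination node lookup. The rest of the
change presumably falls back to the page's current zone and notifies dnuma
after the free, roughly as below (a sketch, not the literal diff):

	if (dest_nid != NUMA_NO_NODE)
		zone = nid_zone(dest_nid, page_zonenum(page));
	else
		zone = page_zone(page);

	/* ... later, where the page is handed to the buddy allocator ... */
	free_one_page(zone, page, order, migratetype);
	if (dest_nid != NUMA_NO_NODE)
		dnuma_post_free_to_new_zone(page, order);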

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 13/25] mm: memlayout+dnuma: add debugfs interface

Add a debugfs interface to dnuma/memlayout. It keeps track of a
variable backlog of memory layouts, provides some statistics on
dnuma-moved pages & cache performance, and allows setting a new global
memlayout.

TODO: split out the statistics, backlog, & write interfaces from each other.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 include/linux/dnuma.h     |   2 +-
 include/linux/memlayout.h |   7 +
 mm/Kconfig                |  30 ++++
 mm/Makefile               |   1 +
 mm/dnuma.c                |   4 +-
 mm/memlayout-debugfs.c    | 339 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/memlayout-debugfs.h    |  39 ++++++
 mm/memlayout.c            |  20 ++-
 8 files changed, 436 insertions(+), 6 deletions(-)
 create mode 100644 mm/memlayout-debugfs.c
 create mode 100644 mm/memlayout-debugfs.h

diff --git a/include/linux/dnuma.h b/include/linux/dnuma.h
index 029a984..7a33131 100644
--- a/include/linux/dnuma.h
+++ b/include/linux/dnuma.h
@@ -64,7 +64,7 @@ static inline int dnuma_page_needs_move(struct page *page)
 	return new_nid;
 }

-void dnuma_post_free_to_new_zone(struct page *page, int order);
(Continue reading)

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 18/25] init/main: call memlayout_global_init() in start_kernel().

memlayout_global_init() initializes the first memlayout, which is
assumed to match the initial page-flag nid settings.

This is done in start_kernel() as the initdata used to populate the
memlayout is purged from memory early in the boot process (XXX: When?).

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 init/main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/init/main.c b/init/main.c
index 63534a1..a1c2094 100644
--- a/init/main.c
+++ b/init/main.c
@@ -72,6 +72,7 @@
 #include <linux/ptrace.h>
 #include <linux/blkdev.h>
 #include <linux/elevator.h>
+#include <linux/memlayout.h>

 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -618,6 +619,7 @@ asmlinkage void __init start_kernel(void)
 	security_init();
 	dbg_late_init();
 	vfs_caches_init(totalram_pages);
+	memlayout_global_init();
 	signals_init();
 	/* rootfs populating might need page-writeback */
(Continue reading)

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 22/25] mm/page_alloc: in page_outside_zone_boundaries(), avoid premature decisions.

With some code that expands the zone boundaries, VM_BUG_ON(bad_range()) was being triggered.

Previously, page_outside_zone_boundaries() decided that once it detected
a page outside the boundaries, it was certainly outside, even if the
seqlock indicated the data was invalid & needed to be reread. This
approach _almost_ works because zones are only ever grown. However,
because the zone span is stored as a start and a length, some expansions
momentarily appear as shifts to the left (when zone_start_pfn is
assigned prior to spanned_pages).

If we want to remove the seqlock around zone_start_pfn & spanned_pages,
always writing spanned_pages first, issuing a memory barrier, and then
writing the new zone_start_pfn _may_ work. The concern there is that we
could be seen as shrinking the span when zone_start_pfn is written (the
entire span would shift to the left). As there will be no pages in the
excess span that actually belong to the zone being manipulated, I don't
expect there to be issues.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97bdf6b..a54baa9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -238,12 +238,13 @@ bool oom_killer_disabled __read_mostly;
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
(Continue reading)
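
The hunk is cut off above. The fix described amounts to making sure a stale
"outside" verdict is discarded whenever the seqlock forces a retry, roughly
(a sketch):

	do {
		seq = zone_span_seqbegin(zone);
		start_pfn = zone->zone_start_pfn;
		sp = zone->spanned_pages;
		ret = 0;	/* forget any verdict based on an inconsistent read */
		if (pfn < start_pfn || pfn >= start_pfn + sp)
			ret = 1;
	} while (zone_span_seqretry(zone, seq));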

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 25/25] mm: add an early_param "extra_nr_node_ids" to increase nr_node_ids above the minimum by a percentage.

For dynamic numa, sometimes the hypervisor we're running under will want
to split a single NUMA node into multiple NUMA nodes. If the number of
numa nodes is limited to the number available when the system booted (as
it is on x86), we may not be able to fully adopt the new memory layout
provided by the hypervisor.

This option allows reserving some extra node ids as a percentage of the
boot-time node ids. While not perfect (ideally nr_node_ids would be fully
dynamic), this allows decent functionality without invasive changes to
the SL{U,A}B allocators.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 Documentation/kernel-parameters.txt |  6 ++++++
 mm/page_alloc.c                     | 24 ++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4609e81..b0523d8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2033,6 +2033,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			use hotplug cpu feature to put more cpu back to online.
 			just like you compile the kernel NR_CPUS=n

+	extra_nr_node_ids= [NUMA] Increase the maximum number of NUMA nodes
+			above the number detected at boot by the specified
+			percentage (rounded up). For example:
+			extra_nr_node_ids=100 would double the number of
+			node_ids available (up to a max of MAX_NUMNODES).
(Continue reading)
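
The mm/page_alloc.c half of the diff is truncated. A minimal sketch of what
such an early_param could look like (the variable and helper names, and exactly
where the adjustment is applied, are assumptions rather than the patch's own):

static int extra_nr_node_ids_pct __initdata;

static int __init parse_extra_nr_node_ids(char *arg)
{
	get_option(&arg, &extra_nr_node_ids_pct);
	return 0;
}
early_param("extra_nr_node_ids", parse_extra_nr_node_ids);

/* Applied once the boot-time node count is known, e.g. from
 * setup_nr_node_ids(): grow nr_node_ids by the requested percentage,
 * rounding up and clamping to MAX_NUMNODES. */
static void __init apply_extra_nr_node_ids(void)
{
	int extra = DIV_ROUND_UP(nr_node_ids * extra_nr_node_ids_pct, 100);

	nr_node_ids = min_t(int, MAX_NUMNODES, nr_node_ids + extra);
}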

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 24/25] mm/page_alloc: use managed_pages instead of present_pages when calculating default_zonelist_order()

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 mm/page_alloc.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20304cb..686d8f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3488,8 +3488,8 @@ static int default_zonelist_order(void)
 			z = &NODE_DATA(nid)->node_zones[zone_type];
 			if (populated_zone(z)) {
 				if (zone_type < ZONE_NORMAL)
-					low_kmem_size += z->present_pages;
-				total_size += z->present_pages;
+					low_kmem_size += z->managed_pages;
+				total_size += z->managed_pages;
 			} else if (zone_type == ZONE_NORMAL) {
 				/*
 				 * If any node has only lowmem, then node order
@@ -3519,8 +3519,8 @@ static int default_zonelist_order(void)
 			z = &NODE_DATA(nid)->node_zones[zone_type];
 			if (populated_zone(z)) {
 				if (zone_type < ZONE_NORMAL)
-					low_kmem_size += z->present_pages;
-				total_size += z->present_pages;
+					low_kmem_size += z->managed_pages;
+				total_size += z->managed_pages;
 			}
 		}
(Continue reading)

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 16/25] page_alloc: transplant pages that are being flushed from the per-cpu lists

In free_pcppages_bulk(), check if a page needs to be moved to a new
node/zone & then perform the transplant (in a slightly deferred manner).

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 mm/page_alloc.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 98ac7c6..97bdf6b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -643,13 +643,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	int migratetype = 0;
 	int batch_free = 0;
 	int to_free = count;
+	struct page *pos, *page;
+	LIST_HEAD(need_move);

 	spin_lock(&zone->lock);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;

 	while (to_free) {
-		struct page *page;
 		struct list_head *list;

 		/*
@@ -672,11 +673,23 @@ static void free_pcppages_bulk(struct zone *zone, int count,

(Continue reading)
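
The rest of the hunk is truncated. Given the need_move list introduced above,
the deferred transplant presumably looks something like this (a sketch; the
exact details are not visible here):

	/* in the pcp drain loop, instead of handing the page straight to
	 * __free_one_page(): */
	if (dnuma_page_needs_move(page) != NUMA_NO_NODE)
		list_add(&page->lru, &need_move);
	else
		__free_one_page(page, zone, 0, get_freepage_migratetype(page));

	/* ... */

	spin_unlock(&zone->lock);

	/* after dropping zone->lock, hand each diverted page to its new zone */
	list_for_each_entry_safe(page, pos, &need_move, lru) {
		int nid = dnuma_page_needs_move(page);

		list_del(&page->lru);
		return_pages_to_zone(page, 0, nid_zone(nid, page_zonenum(page)));
		dnuma_post_free_to_new_zone(page, 0);
	}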

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 19/25] dnuma: memlayout: add memory_add_physaddr_to_nid() for memory_hotplug

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 mm/memlayout.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/mm/memlayout.c b/mm/memlayout.c
index 45e7df6..4dc6706 100644
--- a/mm/memlayout.c
+++ b/mm/memlayout.c
@@ -247,3 +247,19 @@ void memlayout_global_init(void)

 	memlayout_commit(ml);
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+/*
+ * Provides a default memory_add_physaddr_to_nid() for memory hotplug, unless
+ * overridden by the arch.
+ */
+__weak
+int memory_add_physaddr_to_nid(u64 start)
+{
+	int nid = memlayout_pfn_to_nid(PFN_DOWN(start));
+	if (nid == NUMA_NO_NODE)
+		return 0;
+	return nid;
+}
+EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
+#endif
--

-- 
(Continue reading)

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 15/25] page_alloc: use dnuma to transplant newly freed pages in free_hot_cold_page()

free_hot_cold_page() is used for order == 0 pages, and is where the
page's zone is decided.

In the normal case, these pages are freed to the per-cpu lists. When a
page needs transplanting (ie: the actual node it belongs to has changed,
and it needs to be moved to another zone), the pcp lists are skipped &
the page is freed via free_one_page().

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f8ae178..98ac7c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1357,6 +1357,7 @@ void mark_free_pages(struct zone *zone)
  */
 void free_hot_cold_page(struct page *page, int cold)
 {
+	int dest_nid;
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
@@ -1370,6 +1371,15 @@ void free_hot_cold_page(struct page *page, int cold)
 	local_irq_save(flags);
 	__count_vm_event(PGFREE);

+	dest_nid = dnuma_page_needs_move(page);
(Continue reading)
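
The hunk is truncated at the lookup. The behaviour described (skipping the pcp
lists when a page must change zones) would look roughly like the following,
assuming free_hot_cold_page()'s existing out: label that restores interrupts
(a sketch, not the literal diff):

	dest_nid = dnuma_page_needs_move(page);
	if (dest_nid != NUMA_NO_NODE) {
		/* The page now belongs to another node: bypass the pcp lists
		 * and free it straight into its new zone's buddy lists. */
		free_one_page(nid_zone(dest_nid, page_zonenum(page)), page, 0,
			      migratetype);
		dnuma_post_free_to_new_zone(page, 0);
		goto out;
	}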

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 17/25] x86: memlayout: add an arch-specific initial memlayout setter.

On x86, we have numa_meminfo specifically to track the numa layout, which
is precisely the data a memlayout needs, so use it to create the initial
memlayout.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 arch/x86/mm/numa.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index a71c4e2..75819ef 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -11,6 +11,7 @@
 #include <linux/nodemask.h>
 #include <linux/sched.h>
 #include <linux/topology.h>
+#include <linux/dnuma.h>

 #include <asm/e820.h>
 #include <asm/proto.h>
@@ -32,6 +33,33 @@ __initdata
 #endif
 ;

+#ifdef CONFIG_DYNAMIC_NUMA
+void __init memlayout_global_init(void)
+{
+	struct numa_meminfo *mi = &numa_meminfo;
+	int i;
(Continue reading)
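
The function body is truncated above. Since numa_meminfo already holds
start/end/nid triples, the initial layout is presumably built by walking it,
roughly as below (memlayout_create(), memlayout_new_range() and ML_INITIAL are
names assumed from the series' memlayout API and may differ; memlayout_commit()
does appear in patch 19's context):

#ifdef CONFIG_DYNAMIC_NUMA
void __init memlayout_global_init(void)
{
	struct numa_meminfo *mi = &numa_meminfo;
	struct memlayout *ml = memlayout_create(ML_INITIAL);	/* assumed API */
	int i;

	for (i = 0; i < mi->nr_blks; i++) {
		struct numa_memblk *blk = &mi->blk[i];

		/* assumed helper: record the block's pfn range -> nid */
		memlayout_new_range(ml, PFN_DOWN(blk->start),
				    PFN_DOWN(blk->end) - 1, blk->nid);
	}

	memlayout_commit(ml);
}
#endif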

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 21/25] mm/memory_hotplug: VM_BUG if nid is too large.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 mm/memory_hotplug.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f5ea9b7..5fcd29e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1063,6 +1063,8 @@ int __mem_online_node(int nid)
 	pg_data_t *pgdat;
 	int ret;

+	VM_BUG_ON(nid >= nr_node_ids);
+
 	pgdat = hotadd_new_pgdat(nid, 0);
 	if (!pgdat)
 		return -ENOMEM;
--

-- 
1.8.2.1


Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 23/25] mm/page_alloc: make pr_err() in page_outside_zone_boundaries() more useful

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 mm/page_alloc.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a54baa9..20304cb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -253,8 +253,11 @@ static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 	} while (zone_span_seqretry(zone, seq));

 	if (ret)
-		pr_err("page %lu outside zone [ %lu - %lu ]\n",
-			pfn, start_pfn, start_pfn + sp);
+		pr_err("page with pfn %05lx outside zone %s with pfn range {%05lx-%05lx} in node %d with pfn range {%05lx-%05lx}\n",
+			pfn, zone->name, start_pfn, start_pfn + sp,
+			zone->zone_pgdat->node_id,
+			zone->zone_pgdat->node_start_pfn,
+			pgdat_end_pfn(zone->zone_pgdat));

 	return ret;
 }
--

-- 
1.8.2.1

Cody P Schafer | 12 Apr 03:13 2013

[RFC PATCH v2 20/25] x86/mm/numa: when dnuma is enabled, use memlayout to handle memory hotplug's physaddr_to_nid.

When a memlayout is tracked (ie: CONFIG_DYNAMIC_NUMA is enabled), rather
than iterate over numa_meminfo, a lookup can be done using memlayout.

Signed-off-by: Cody P Schafer <cody <at> linux.vnet.ibm.com>
---
 arch/x86/mm/numa.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 75819ef..f1609c0 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -28,7 +28,7 @@ struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);

 static struct numa_meminfo numa_meminfo
-#ifndef CONFIG_MEMORY_HOTPLUG
+#if !defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DYNAMIC_NUMA)
 __initdata
 #endif
 ;
@@ -832,7 +832,7 @@ EXPORT_SYMBOL(cpumask_of_node);

 #endif	/* !CONFIG_DEBUG_PER_CPU_MAPS */

-#ifdef CONFIG_MEMORY_HOTPLUG
+#if defined(CONFIG_MEMORY_HOTPLUG) && !defined(CONFIG_DYNAMIC_NUMA)
 int memory_add_physaddr_to_nid(u64 start)
 {
 	struct numa_meminfo *mi = &numa_meminfo;
(Continue reading)

