Wu Fengguang | 1 Dec 11:14

[PATCH 00/12] Balancing the scan rate of major caches

Hi all,

This patch balances the aging rates of active_list/inactive_list/slab.

It started out as an effort to enable the adaptive read-ahead to handle a large
number of concurrent readers. Then I found it involves much more stuff, and
deserves a standalone patchset to address the balancing problem as a whole.

The whole picture of balancing:

- In each node, inactive_list scan rates are synced with each other.
  It is done in the direct/kswapd reclaim path.

- In each zone, active_list scan rate always follows that of inactive_list.

- Slab cache scan rates always follow that of the current node.
  If the shrinkers are not NUMA aware, they will effectively sync scan rates
  with that of the most scanned node.

The patches can be grouped as follows:

- balancing stuff
vm-kswapd-incmin.patch
mm-balance-zone-aging-supporting-facilities.patch
mm-balance-zone-aging-in-direct-reclaim.patch
mm-balance-zone-aging-in-kswapd-reclaim.patch
mm-balance-slab-aging.patch
mm-balance-active-inactive-list-aging.patch

- pure code cleanups

Wu Fengguang | 1 Dec 11:18

[PATCH 01/12] vm: kswapd incmin

Explicitly teach kswapd about the incremental min logic instead of just scanning
all zones under the first low zone. This should keep more even pressure applied
on the zones.

Signed-off-by: Nick Piggin <npiggin <at> suse.de>
Signed-off-by: Wu Fengguang <wfg <at> mail.ustc.edu.cn>
---

 mm/vmscan.c |  111 ++++++++++++++++++++----------------------------------------
 1 files changed, 37 insertions(+), 74 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1310,101 +1310,65 @@ loop_again:
 	}

 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
-		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
+		int first_low_zone = 0;
+
+		all_zones_ok = 1;
+		sc.nr_scanned = 0;
+		sc.nr_reclaimed = 0;
+		sc.priority = priority;
+		sc.swap_cluster_max = nr_pages ? nr_pages : SWAP_CLUSTER_MAX;

 		/* The swap token gets in the way of swapout... */
 		if (!priority)
 			disable_swap_token();

Andrew Morton | 1 Dec 11:33

Re: [PATCH 01/12] vm: kswapd incmin

Wu Fengguang <wfg <at> mail.ustc.edu.cn> wrote:
>
> Explicitly teach kswapd about the incremental min logic instead of just scanning
>  all zones under the first low zone. This should keep more even pressure applied
>  on the zones.

I spat this back a while ago.  See the changelog (below) for the logic
which you're removing.

This change appears to go back to performing reclaim in the highmem->lowmem
direction.  Page reclaim might go all lumpy again.

Shouldn't first_low_zone be initialised to ZONE_HIGHMEM (or pgdat->nr_zones
- 1) rather than to 0, or something?  I don't understand why we're passing
zero as the classzone_idx into zone_watermark_ok() in the first go around
the loop.

And this bit, which Nick didn't reply to (wimp!).  I think it's a bug.

Looking at it, I am confused.

 In the first loop:

 			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
 				struct zone *zone = pgdat->node_zones + i;
 	...
 				if (!zone_watermark_ok(zone, order,
 						zone->pages_high, 0, 0)) {
 					end_zone = i;
 					goto scan;

Wu Fengguang | 1 Dec 12:40

Re: [PATCH 01/12] vm: kswapd incmin

On Thu, Dec 01, 2005 at 02:33:30AM -0800, Andrew Morton wrote:
> I spat this back a while ago.  See the changelog (below) for the logic
> which you're removing.
>
> This change appears to go back to performing reclaim in the highmem->lowmem
> direction.  Page reclaim might go all lumpy again.
> 
> Shouldn't first_low_zone be initialised to ZONE_HIGHMEM (or pgdat->nr_zones
> - 1) rather than to 0, or something?  I don't understand why we're passing
> zero as the classzone_idx into zone_watermark_ok() in the first go around
> the loop.

Sorry to note that I'm mainly taking its zone-range --> zones-under-watermark
cleanups. The scan order is reverted back to DMA->HighMem in
mm-balance-zone-aging-in-kswapd-reclaim.patch, and the first_low_zone logic is
also replaced with a quite different one there.

My thinking is that the overall reclaim-for-watermark should be weakened and
just do minimal watermark-safeguard work, so that it will not be a major force
of imbalance.

Assume there are three zones. The dynamics go something like:

HighMem exhausted --> reclaim from it --> become more aged --> reclaim the
other two zones for aging

DMA reclaimed --> age leaps ahead --> reclaim Normal zone for aging, while
HighMem is being reclaimed for watermark

In the kswapd path, if there are N rounds of reclaim-for-watermark with

Wu Fengguang | 1 Dec 11:18

[PATCH 04/12] mm: balance zone aging in kswapd reclaim path

The kswapd reclaim has had one single goal:
	reclaim from zones to make their watermarks ok.

Now add another weak goal (it will not set all_zones_ok=0):
	reclaim from the least aged zone to help balance the aging rates.

Two major aspects of this algorithm:
- reclaim the least aged zone unless it catches up with the most aged zone
- reclaim for weaker watermark by calling watermark_ok() with classzone_idx=0

That guarantees reclaims-for-aging to be more than reclaims-for-watermark if
there is ever a big imbalance, thus eliminating the chance of growing gaps.
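
The selection rule described above can be sketched roughly as follows. This is
only an illustration, not the patch's code: struct zone_sketch, its page_age
field (treated as a plain counter where larger means more aged) and
pick_zone_for_aging() are hypothetical names, and counter wraparound is ignored.

```c
#include <stddef.h>

/* Hypothetical stand-in for per-zone state; not the kernel's struct zone. */
struct zone_sketch {
	const char *name;
	unsigned long page_age;	/* assumed: normalized age, larger = more aged */
};

/*
 * Return the least aged zone if it lags behind the most aged one,
 * or NULL when the zones are already in sync (nothing to balance).
 */
static struct zone_sketch *
pick_zone_for_aging(struct zone_sketch *zones, size_t nr_zones)
{
	struct zone_sketch *youngest = NULL, *oldest = NULL;
	size_t i;

	for (i = 0; i < nr_zones; i++) {
		struct zone_sketch *z = &zones[i];

		if (!youngest || z->page_age < youngest->page_age)
			youngest = z;
		if (!oldest || z->page_age > oldest->page_age)
			oldest = z;
	}

	/* No zones at all, or the youngest has caught up with the oldest. */
	if (!youngest || youngest->page_age >= oldest->page_age)
		return NULL;
	return youngest;
}
```

Reclaiming from the returned zone would be the "weak" goal; it would not by
itself mark the node as unbalanced.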

Signed-off-by: Wu Fengguang <wfg <at> mail.ustc.edu.cn>
---

 mm/vmscan.c |   39 ++++++++++++++++++++++++++++++---------
 1 files changed, 30 insertions(+), 9 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1356,6 +1356,8 @@ static int balance_pgdat(pg_data_t *pgda
 	int total_scanned, total_reclaimed;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc;
+	struct zone *youngest_zone = NULL;
+	struct zone *oldest_zone = NULL;

 loop_again:
 	total_scanned = 0;

Wu Fengguang | 1 Dec 11:18

[PATCH 05/12] mm: balance slab aging

The current slab shrinking code is way too fragile.
Let it manage aging pace by itself, and provide a simple and robust interface.

The design considerations:
- use the same syncing facilities as that of the zones
- keep the age of slabs in line with that of the largest zone
  this in effect makes aging rate of slabs follow that of the most aged node.

- reserve a minimal number of unused slabs
  the size of reservation depends on vm pressure

- shrink more slab caches only when vm pressure is high
  the old logic, `mmap pages found' - `shrink more caches' - `avoid swapping',
  does not sound quite logical, so the code is removed.

- let sc->nr_scanned record the exact number of cold pages scanned
  it is no longer used by the slab cache shrinking algorithm, but good for other
  algorithms (e.g. the active_list/inactive_list balancing).

Signed-off-by: Wu Fengguang <wfg <at> mail.ustc.edu.cn>
---

 include/linux/mm.h |    4 +
 mm/vmscan.c        |  118 +++++++++++++++++++++++------------------------------
 2 files changed, 55 insertions(+), 67 deletions(-)

--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -798,7 +798,9 @@ struct shrinker {
 	shrinker_t		shrinker;

Wu Fengguang | 1 Dec 11:18

[PATCH 03/12] mm: balance zone aging in direct reclaim path

Add 10 extra priorities to the direct page reclaim path, which makes 10 rounds of
balancing effort (reclaim only from the least aged local/headless zone) before
falling back to the reclaim-all scheme.

Ten rounds should be enough to get enough free pages in normal cases, which
prevents unnecessarily disturbing remote nodes. If we further restrict the first
round of page allocation to local zones, we might get what the early zone
reclaim patch wants: memory affinity/locality.

Signed-off-by: Wu Fengguang <wfg <at> mail.ustc.edu.cn>
---

 mm/vmscan.c |   31 ++++++++++++++++++++++++++++---
 1 files changed, 28 insertions(+), 3 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1186,6 +1186,7 @@ static void
 shrink_caches(struct zone **zones, struct scan_control *sc)
 {
 	int i;
+	struct zone *z = NULL;

 	for (i = 0; zones[i] != NULL; i++) {
 		struct zone *zone = zones[i];
@@ -1200,11 +1201,34 @@ shrink_caches(struct zone **zones, struc
 		if (zone->prev_priority > sc->priority)
 			zone->prev_priority = sc->priority;

-		if (zone->all_unreclaimable && sc->priority != DEF_PRIORITY)

Wu Fengguang | 1 Dec 11:18

[PATCH 07/12] mm: remove unnecessary variable and loop

shrink_cache() and refill_inactive_zone() do not need loops.

Simplify them to scan one chunk at a time.

Signed-off-by: Wu Fengguang <wfg <at> mail.ustc.edu.cn>
---

 mm/vmscan.c |   92 ++++++++++++++++++++++++++++--------------------------------
 1 files changed, 43 insertions(+), 49 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -899,63 +899,58 @@ static void shrink_cache(struct zone *zo
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
-	int max_scan = sc->nr_to_scan;
+	struct page *page;
+	int nr_taken;
+	int nr_scan;
+	int nr_freed;

 	pagevec_init(&pvec, 1);

 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
-	while (max_scan > 0) {
-		struct page *page;
-		int nr_taken;
-		int nr_scan;

Wu Fengguang | 1 Dec 11:18

[PATCH 02/12] mm: supporting variables and functions for balanced zone aging

The zone aging rates are currently imbalanced, the gap can be as large as 3
times, which can severely damage read-ahead requests and shorten their
effective life time.

This patch adds three variables in struct zone
	- aging_total
	- aging_milestone
	- page_age
to keep track of page aging rate, and keep it in sync at page reclaim time.

The aging_total is just a per-zone counterpart to the per-cpu
pgscan_{kswapd,direct}_{zone name}. But it is not directly comparable between
zones, so the aging_milestone/page_age are maintained based on aging_total.

The page_age is a normalized value that can be directly compared between zones
with the helper macro pages_more_aged(). The goal of the balancing logic is to
keep this normalized value in sync between zones.
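
To illustrate why a raw scan counter is not comparable between zones while a
normalized one is, here is a minimal sketch. It is not the patch's
implementation: struct zone_age_sketch, AGE_SHIFT, zone_page_age() and
zone_more_aged() are hypothetical, the age is simply scanned pages divided by
zone size in fixed point, and the aging_milestone bookkeeping is left out.

```c
#include <stdint.h>

#define AGE_SHIFT 12	/* assumed fixed-point precision, chosen for the example */

/* Hypothetical per-zone counters; not the fields added by the patch. */
struct zone_age_sketch {
	uint64_t aging_total;	/* pages scanned so far */
	uint64_t nr_pages;	/* zone size in pages, must be non-zero */
};

/* Scanned pages relative to zone size, as a fixed-point fraction. */
static uint64_t zone_page_age(const struct zone_age_sketch *z)
{
	return (z->aging_total << AGE_SHIFT) / z->nr_pages;
}

/* Non-zero if zone a has been scanned proportionally more than zone b. */
static int zone_more_aged(const struct zone_age_sketch *a,
			  const struct zone_age_sketch *b)
{
	return zone_page_age(a) > zone_page_age(b);
}
```

With a 2048-page zone that has scanned 1024 pages and a 65536-page zone that
has scanned 16384 pages, the smaller zone is the more aged one even though its
raw counter is sixteen times lower.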

One can check the balanced aging progress by running:
                        tar c / | cat > /dev/null &
                        watch -n1 'grep "age " /proc/zoneinfo'

Signed-off-by: Wu Fengguang <wfg <at> mail.ustc.edu.cn>
---

 include/linux/mmzone.h |   14 ++++++++++++++
 mm/page_alloc.c        |   11 +++++++++++
 mm/vmscan.c            |   39 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 64 insertions(+)


Andrew Morton | 1 Dec 11:37

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Wu Fengguang <wfg <at> mail.ustc.edu.cn> wrote:
>
>  The zone aging rates are currently imbalanced,

ZONE_DMA is out of whack.  It shouldn't be, and I'm not aware of anyone
getting in and working out why.  I certainly wouldn't want to go and add
all this stuff without having a good understanding of _why_ it's out of
whack.  Perhaps it's just some silly bug, like the thing I pointed at in
the previous email.

> the gap can be as large as 3 times,

What's the testcase?

> which can severely damage read-ahead requests and shorten their
>  effective life time.

Have you any performance numbers for this?

Wu Fengguang | 1 Dec 13:11

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Thu, Dec 01, 2005 at 02:37:14AM -0800, Andrew Morton wrote:
> Wu Fengguang <wfg <at> mail.ustc.edu.cn> wrote:
> >
> >  The zone aging rates are currently imbalanced,
> 
> ZONE_DMA is out of whack.  It shouldn't be, and I'm not aware of anyone
> getting in and working out why.  I certainly wouldn't want to go and add
> all this stuff without having a good understanding of _why_ it's out of
> whack.  Perhaps it's just some silly bug, like the thing I pointed at in
> the previous email.

Yep, my rule is that if ever the DMA zone is reclaimed for watermark, it will
be running wild ;) So I leave it out by setting classzone_idx=0, and let the
age balancing code catch it up. This scheme works fine: tested to be OK from
64M to 2G memory.

> > the gap can be as large as 3 times,
> 
> What's the testcase?
> 
> > which can severely damage read-ahead requests and shorten their
> >  effective life time.
> 
> Have you any performance numbers for this?

That was months ago. If I remember right, the number of concurrent readers the
adaptive read-ahead code can handle without much thrashing was raised from ~100
to 800 with the balancing work.

This is my original announcement back then:

Marcelo Tosatti | 1 Dec 23:28

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Hi Andrew,

On Thu, Dec 01, 2005 at 02:37:14AM -0800, Andrew Morton wrote:
> Wu Fengguang <wfg <at> mail.ustc.edu.cn> wrote:
> >
> >  The zone aging rates are currently imbalanced,
> 
> ZONE_DMA is out of whack.  It shouldn't be, and I'm not aware of anyone
> getting in and working out why.  I certainly wouldn't want to go and add
> all this stuff without having a good understanding of _why_ it's out of
> whack.  Perhaps it's just some silly bug, like the thing I pointed at in
> the previous email.

I think that the problem is caused by the interaction between 
the way reclaiming is quantified and parallel allocators.

The zones have different sizes, and each zone reclaim iteration
scans the same number of pages. It is unfair.

On top of that, kswapd is likely to block while doing its job, 
which means that allocators have a chance to run.

It seems that scaling the number of isolated pages to zone 
size fixes the unbalancing problem, making the Normal zone
be _more_ scanned than DMA. Which is expected since the
lower zone protection logic decreases allocation pressure
from DMA sending it straight to the Normal zone (therefore 
zeroing lower_zone_protection should make the scanning 
proportionally equal).


Andrew Morton | 2 Dec 00:03

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Marcelo Tosatti <marcelo.tosatti <at> cyclades.com> wrote:
>
> Hi Andrew,
> 
> On Thu, Dec 01, 2005 at 02:37:14AM -0800, Andrew Morton wrote:
> > Wu Fengguang <wfg <at> mail.ustc.edu.cn> wrote:
> > >
> > >  The zone aging rates are currently imbalanced,
> > 
> > ZONE_DMA is out of whack.  It shouldn't be, and I'm not aware of anyone
> > getting in and working out why.  I certainly wouldn't want to go and add
> > all this stuff without having a good understanding of _why_ it's out of
> > whack.  Perhaps it's just some silly bug, like the thing I pointed at in
> > the previous email.
> 
> I think that the problem is caused by the interaction between 
> the way reclaiming is quantified and parallel allocators.

Could be.  But what about the bug which I think is there?  That'll cause
overscanning of the DMA zone.

> The zones have different sizes, and each zone reclaim iteration
> scans the same number of pages. It is unfair.

Nope.  See how shrink_zone() bases nr_active and nr_inactive on
zone->nr_active and zone->nr_inactive.  These calculations are intended to
cause the number of scanned pages in each zone to be

	(zone->nr_active + zone->nr_inactive) >> sc->priority.


Wu Fengguang | 2 Dec 02:19

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Thu, Dec 01, 2005 at 03:03:49PM -0800, Andrew Morton wrote:
> > > ZONE_DMA is out of whack.  It shouldn't be, and I'm not aware of anyone
> > > getting in and working out why.  I certainly wouldn't want to go and add
> > > all this stuff without having a good understanding of _why_ it's out of
> > > whack.  Perhaps it's just some silly bug, like the thing I pointed at in
> > > the previous email.
> > 
> > I think that the problem is caused by the interaction between 
> > the way reclaiming is quantified and parallel allocators.
> 
> Could be.  But what about the bug which I think is there?  That'll cause
> overscanning of the DMA zone.

Take for example these numbers:
--------------------------------------------------------------------------------
active/inactive sizes on 2.6.14-1-k7-smp:
43/1000         = 116 / 2645
819/1000        = 54023 / 65881

active/inactive scan rates:
dma      480/1000       = 31364 / (58377 + 6963)
normal   985/1000       = 719219 / (645051 + 84579)
high     0/1000         = 0 / (0 + 0)

             total       used       free     shared    buffers     cached
Mem:           503        497          6          0          0        328
-/+ buffers/cache:        168        335
Swap:          127          2        125
--------------------------------------------------------------------------------


Andrew Morton | 2 Dec 02:30

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Wu Fengguang <wfg <at> mail.ustc.edu.cn> wrote:
>
>      850         sc->nr_to_reclaim = sc->swap_cluster_max;
>      851         
>      852         while (nr_active || nr_inactive) {
>                          //...
>      860                 if (nr_inactive) {
>      861                         sc->nr_to_scan = min(nr_inactive,
>      862                                         (unsigned long)sc->swap_cluster_max);
>      863                         nr_inactive -= sc->nr_to_scan;
>      864                         shrink_cache(zone, sc);
>      865                         if (sc->nr_to_reclaim <= 0)
>      866                                 break;
>      867                 }
>      868         }
> 
>  Line 843 is the core of the scan balancing logic:
> 
>  priority                12      11      10
> 
>  On each call nr_scan_inactive is increased by:
>  DMA(2k pages)           +1      +2      +3
>  Normal(64k pages)      +17      +33     +65 
> 
>  Round it up to SWAP_CLUSTER_MAX=32, we get (scan batches/accumulate rounds):
>  DMA                     1/32    1/16    2/11
>  Normal                  2/2     2/1     3/1
>  DMA:Normal ratio        1:32    1:32    2:33
> 
>  This keeps the scan rate roughly balanced(i.e. 1:32) in low vm pressure.
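
The increments and batch counts in the quoted table can be reproduced with a
short sketch. It assumes the per-call increment is (zone pages >> priority) + 1
and a batch size of 32 (the SWAP_CLUSTER_MAX value given in the quoted text);
the function names and BATCH_SKETCH are hypothetical, not kernel APIs.

```c
#define BATCH_SKETCH 32UL	/* SWAP_CLUSTER_MAX as stated in the quoted text */

/* Pages added to the pending-scan counter on each call (assumed formula). */
static unsigned long scan_increment(unsigned long zone_pages, int priority)
{
	return (zone_pages >> priority) + 1;
}

/* Number of calls before the pending counter reaches one full batch. */
static unsigned long rounds_to_batch(unsigned long inc)
{
	return (BATCH_SKETCH + inc - 1) / inc;
}

/* Batches scanned once the pending counter is used, rounded up. */
static unsigned long batches_per_flush(unsigned long inc)
{
	unsigned long pending = rounds_to_batch(inc) * inc;

	return (pending + BATCH_SKETCH - 1) / BATCH_SKETCH;
}
```

For the 2k-page DMA zone at priority 10 this gives an increment of 3, 11 rounds
to fill a batch and 2 batches, matching the 2/11 entry; for the 64k-page Normal
zone it gives an increment of 65 and 3 batches in a single round.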

Wu Fengguang | 2 Dec 03:04

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Thu, Dec 01, 2005 at 05:30:15PM -0800, Andrew Morton wrote:
> >  But lines 865-866 together with line 846 make most shrink_zone() invocations
> >  only run one batch of scan. The numbers become:
> 
> True.  Need to go into a huddle with the changelogs, but I have a feeling
> that lines 865 and 866 aren't very important.  What happens if we remove
> them?

Maybe the answer is: can we accept to free 15M memory at one time for a 64G zone?
(Or can we simply increase the DEF_PRIORITY?)

btw, maybe it's time to lower the low_mem_reserve.
There should be no need to keep ~50M free memory with the balancing patch.

Regards,
Wu

Andrea Arcangeli | 2 Dec 03:18

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Fri, Dec 02, 2005 at 10:04:07AM +0800, Wu Fengguang wrote:
> btw, maybe it's time to lower the low_mem_reserve.
> There should be no need to keep ~50M free memory with the balancing patch.

low_mem_reserve is independent from shrink_cache, because shrink_cache can't
free unfreeable pinned memory.

If you want to remove low_mem_reserve you'd better start by adding
migration of memory across the zones with pte updates etc... That would
at least mitigate the effect of anonymous memory w/o swap. But
low_mem_reserve is still needed for all other kind of allocations like
kmalloc or pci_alloc_consistent (i.e. not relocatable) etc...

Wu Fengguang | 2 Dec 03:37

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Fri, Dec 02, 2005 at 03:18:11AM +0100, Andrea Arcangeli wrote:
> On Fri, Dec 02, 2005 at 10:04:07AM +0800, Wu Fengguang wrote:
> > btw, maybe it's time to lower the low_mem_reserve.
> > There should be no need to keep ~50M free memory with the balancing patch.
> 
> low_mem_reserve is independent from shrink_cache, because shrink_cache can't
> free unfreeable pinned memory.
> 
> If you want to remove low_mem_reserve you'd better start by adding
> migration of memory across the zones with pte updates etc... That would
> at least mitigate the effect of anonymous memory w/o swap. But
> low_mem_reserve is still needed for all other kind of allocations like
> kmalloc or pci_alloc_consistent (i.e. not relocatable) etc...

Thanks for the clarification, I was worrying too much ;)

Regards,
Wu

Andrea Arcangeli | 2 Dec 03:52

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Fri, Dec 02, 2005 at 10:37:27AM +0800, Wu Fengguang wrote:
> Thanks for the clarification, I was worrying too much ;)

You're welcome. I'm also not concerned because the cost is linear with
the amount of memory (and the cost has a high bound, that is the size
of the lower zones, so it's not like the struct page that is a
percentage of ram guaranteed to be lost) so it's generally not
noticeable at runtime, and it's most important in the big systems (where
in turn the cost is higher).

Andrew Morton | 2 Dec 05:45

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Andrea Arcangeli <andrea <at> suse.de> wrote:
>
> low_mem_reserve

I've a suspicion that the addition of the dma32 zone might have
broken this.

Wu Fengguang | 2 Dec 07:38

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Thu, Dec 01, 2005 at 08:45:49PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea <at> suse.de> wrote:
> >
> > low_mem_reserve
> 
> I've a suspicion that the addition of the dma32 zone might have
> broken this.

And there is a danger of (the last zone != the largest zone). This breaks my
assumption. Either we should remove the two lines in shrink_zone():

>      865                         if (sc->nr_to_reclaim <= 0)
>      866                                 break;

Or explicitly add more weight to the balancing efforts with
mm-add-weight-to-reclaim-for-aging.patch below.

Thanks,
Wu

Subject: mm: add more weight to reclaim for aging
Cc: Marcelo Tosatti <marcelo.tosatti <at> cyclades.com>, Magnus Damm <magnus.damm <at> gmail.com>
Cc: Nick Piggin <npiggin <at> suse.de>, Andrea Arcangeli <andrea <at> suse.de>

Let HighMem = the last zone, we get in normal cases:
- HighMem zone is the largest zone
- HighMem zone is mainly reclaimed for watermark, other zones are almost always
  reclaimed for aging
- While HighMem is reclaimed N times for watermark, other zones have N+1 chances
  to reclaim for aging

Nick Piggin | 2 Dec 03:27

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Wu Fengguang wrote:
> On Thu, Dec 01, 2005 at 05:30:15PM -0800, Andrew Morton wrote:
> 
>>> But lines 865-866 together with line 846 make most shrink_zone() invocations
>>> only run one batch of scan. The numbers become:
>>
>>True.  Need to go into a huddle with the changelogs, but I have a feeling
>>that lines 865 and 866 aren't very important.  What happens if we remove
>>them?
> 
> 
> Maybe the answer is: can we accept to free 15M memory at one time for a 64G zone?
> (Or can we simply increase the DEF_PRIORITY?)
> 

0.02% of the memory? Why not? I think you should be more worried
about what happens when the priority winds up.
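
The figures above can be checked with a little arithmetic. The sketch below
assumes 4 KiB pages and a scan target of (zone pages >> priority) at
priority 12, the highest priority value in the table quoted earlier in the
thread; scan_target_bytes() and PAGE_SHIFT_SKETCH are hypothetical names.

```c
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12	/* assumed: 4 KiB pages */

/* Bytes covered by one scan pass of (zone pages >> priority). */
static uint64_t scan_target_bytes(uint64_t zone_bytes, int priority)
{
	uint64_t pages = zone_bytes >> PAGE_SHIFT_SKETCH;

	return (pages >> priority) << PAGE_SHIFT_SKETCH;
}
```

Under these assumptions a 64 GiB zone yields 16 MiB per pass, which is 1/4096
of the zone, or about 0.02%.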

I think your proposal to synch reclaim rates between zones is fine
when all pages have similar properties, but could behave strangely
when you do have different requirements on different zones.

> btw, maybe it's time to lower the low_mem_reserve.
> There should be no need to keep ~50M free memory with the balancing patch.
> 

min_free_kbytes? This number really isn't anything to do with balancing
and more to do with the amount of reserve kept for things like GFP_ATOMIC
and recursive allocations. Let's not lower it ;)


Andrea Arcangeli | 2 Dec 03:36

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Fri, Dec 02, 2005 at 01:27:06PM +1100, Nick Piggin wrote:
> min_free_kbytes? This number really isn't anything to do with balancing
> and more to do with the amount of reserve kept for things like GFP_ATOMIC
> and recursive allocations. Let's not lower it ;)

Agreed. Or at the very least that should be discussed in a separate
thread, it has no relation with shrink_cache changes or anything else
related to zone aging IMHO.

Wu Fengguang | 2 Dec 03:43

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Fri, Dec 02, 2005 at 01:27:06PM +1100, Nick Piggin wrote:
> Wu Fengguang wrote:
> >On Thu, Dec 01, 2005 at 05:30:15PM -0800, Andrew Morton wrote:
> >
> >>>But lines 865-866 together with line 846 make most shrink_zone() 
> >>>invocations
> >>>only run one batch of scan. The numbers become:
> >>
> >>True.  Need to go into a huddle with the changelogs, but I have a feeling
> >>that lines 865 and 866 aren't very important.  What happens if we remove
> >>them?
> >
> >
> >Maybe the answer is: can we accept to free 15M memory at one time for a 
> >64G zone?
> >(Or can we simply increase the DEF_PRIORITY?)
> >
> 
> 0.02% of the memory? Why not? I think you should be more worried
> about what happens when the priority winds up.

Yes, sounds reasonable.

> I think your proposal to synch reclaim rates between zones is fine
> when all pages have similar properties, but could behave strangely
> when you do have different requirements on different zones.

Thanks.
That requirement might be addressed by disabling the feature on specific zones,
or attaching a shrinker.seeks-like ratio to them, or something else...

Andrew Morton | 2 Dec 06:49

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Wu Fengguang <wfg <at> mail.ustc.edu.cn> wrote:
>
>      865                         if (sc->nr_to_reclaim <= 0)
>      866                                 break;
>      867                 }
>      868         }
> 
>  Line 843 is the core of the scan balancing logic:
> 
>  priority                12      11      10
> 
>  On each call nr_scan_inactive is increased by:
>  DMA(2k pages)           +1      +2      +3
>  Normal(64k pages)      +17      +33     +65 
> 
>  Round it up to SWAP_CLUSTER_MAX=32, we get (scan batches/accumulate rounds):
>  DMA                     1/32    1/16    2/11
>  Normal                  2/2     2/1     3/1
>  DMA:Normal ratio        1:32    1:32    2:33
> 
>  This keeps the scan rate roughly balanced(i.e. 1:32) in low vm pressure.
> 
>  But lines 865-866 together with line 846 make most shrink_zone() invocations
>  only run one batch of scan.

Yes, this seems to be the problem.  Sigh.  By the time 2.6.8 came around I
just didn't have time to do the amount of testing which any page reclaim
tweak necessitates.
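
A rough model of how bailing out after one batch's worth of reclaimed pages
skews the scan ratio is sketched below. It is a toy simulation, not the
shrink_zone() code: pages_scanned(), its parameters and the fixed reclaim
percentage are all assumptions made for illustration.

```c
#define SCAN_BATCH 32UL	/* batch size, SWAP_CLUSTER_MAX in the quoted text */

/*
 * Scan up to `target` pages in batches. `reclaim_pct` is the assumed share
 * of scanned pages that get reclaimed. With `bail_out` set, stop as soon as
 * SCAN_BATCH pages have been reclaimed, as the early break does.
 */
static unsigned long pages_scanned(unsigned long target,
				   unsigned long reclaim_pct, int bail_out)
{
	unsigned long scanned = 0, reclaimed = 0;

	while (scanned < target) {
		unsigned long n = target - scanned;

		if (n > SCAN_BATCH)
			n = SCAN_BATCH;
		scanned += n;
		reclaimed += n * reclaim_pct / 100;
		if (bail_out && reclaimed >= SCAN_BATCH)
			break;
	}
	return scanned;
}
```

With illustrative targets of 100 pages for a large zone and 10 pages for a
zone a tenth of its size, and every scanned page reclaimable, the bail-out
stops the large zone after 32 pages while the small zone still scans all 10,
so the intended 10:1 ratio shrinks to 3.2:1.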

From: Andrew Morton <akpm <at> osdl.org>

Wu Fengguang | 2 Dec 08:18

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Thu, Dec 01, 2005 at 09:49:31PM -0800, Andrew Morton wrote:
> From: Andrew Morton <akpm <at> osdl.org>
> 
> Revert a patch which went into 2.6.8-rc1.  The changelog for that patch was:
> 
>   The shrink_zone() logic can, under some circumstances, cause far too many
>   pages to be reclaimed.  Say, we're scanning at high priority and suddenly
>   hit a large number of reclaimable pages on the LRU.
> 
>   Change things so we bale out when SWAP_CLUSTER_MAX pages have been
>   reclaimed.
> 
> Problem is, this change caused significant imbalance in inter-zone scan
> balancing by truncating scans of larger zones.
> 
> Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL.  The zone
> balancing algorithm would require that if we're scanning 100 pages of
> ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL.  But this logic will
> cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are
> reclaimed.  Thus effectively causing smaller zones to be scanned relatively
> harder than large ones.
> 
> Now I need to remember what the workload was which caused me to write this
> patch originally, then fix it up in a different way...

Maybe it's a situation like this:

__|____|________|________________|________________________________|________________________________________________________________|________________________________________________________________________________________________________________________________|________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 ----------------------------------------------------------------------------------------------------------------------------------
        _: pinned chunk

Andrew Morton | 2 Dec 08:27

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Wu Fengguang <wfg <at> mail.ustc.edu.cn> wrote:
>
>  First we run into a large range of pinned chunks, which lowered the scan
>  priority.  And then there are plenty of reclaimable chunks, bomb...

It doesn't have to be that complex - the unreclaimable pages could be
referenced, or under writeback or even simply dirty.

Wu Fengguang | 10 Dec 12:59

[BUG 2.6.15-rc5] EXT3-fs error and soft lockup detected

Hello,

I got this message when exiting qemu:

[11266.262154] EXT3-fs error (device hda): ext3_free_blocks_sb: bit already cleared for block 318015
[11266.276897] Aborting journal on device hda.
[11266.283815] EXT3-fs error (device hda): ext3_free_blocks_sb: bit already cleared for block 318016
[11266.293567] EXT3-fs error (device hda): ext3_free_blocks_sb: bit already cleared for block 318017
[11266.303451] EXT3-fs error (device hda): ext3_free_blocks_sb: bit already cleared for block 318018
[11266.313478] EXT3-fs error (device hda): ext3_free_blocks_sb: bit already cleared for block 318019
[11266.323347] EXT3-fs error (device hda): ext3_free_blocks_sb: bit already cleared for block 318020
[11266.333543] EXT3-fs error (device hda) in ext3_reserve_inode_write: Journal has aborted
[11266.342839] EXT3-fs error (device hda) in ext3_reserve_inode_write: Journal has aborted
[11266.351870] EXT3-fs error (device hda) in ext3_orphan_del: Journal has aborted
[11266.360540] EXT3-fs error (device hda) in ext3_truncate: Journal has aborted
[11266.412056] __journal_remove_journal_head: freeing b_committed_data
[11266.421834] ext3_abort called.
[11266.425341] EXT3-fs error (device hda): ext3_journal_start_sb: Detected aborted journal
[11266.433776] Remounting filesystem read-only
[11269.300117] md: stopping all md devices.
[11269.304244] md: md0 switched to read-only mode.
[11280.827080] BUG: soft lockup detected on CPU#0!
[11280.831556]
[11280.833147] Pid: 4045, comm:               reboot
[11280.837736] EIP: 0060:[<c010eee0>] CPU: 0
[11280.841809] EIP is at delay_pit+0x20/0x30
[11280.845681]  EFLAGS: 00000207    Not tainted  (2.6.15-rc5)
[11280.850828] EAX: 0021fb30 EBX: 00000263 ECX: 01062560 EDX: c0382ce0
[11280.856882] ESI: c039e694 EDI: 00000001 EBP: f76bfe5c DS: 007b ES: 007b
[11280.863077] CR0: 8005003b CR2: b7eed1e0 CR3: 376b6000 CR4: 00000690

Marcelo Tosatti | 2 Dec 16:13

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Thu, Dec 01, 2005 at 09:49:31PM -0800, Andrew Morton wrote:
> Wu Fengguang <wfg <at> mail.ustc.edu.cn> wrote:
> >
> >      865                         if (sc->nr_to_reclaim <= 0)
> >      866                                 break;
> >      867                 }
> >      868         }
> > 
> >  Line 843 is the core of the scan balancing logic:
> > 
> >  priority                12      11      10
> > 
> >  On each call nr_scan_inactive is increased by:
> >  DMA(2k pages)           +1      +2      +3
> >  Normal(64k pages)      +17      +33     +65 
> > 
> >  Rounding up to SWAP_CLUSTER_MAX=32, we get (scan batches/accumulation rounds):
> >  DMA                     1/32    1/16    2/11
> >  Normal                  2/2     2/1     3/1
> >  DMA:Normal ratio        1:32    1:32    2:33
> > 
> >  This keeps the scan rate roughly balanced (i.e. 1:32) under low vm pressure.
> > 
> >  But lines 865-866 together with line 846 make most shrink_zone() invocations
> >  run only one batch of scanning.
> 
> Yes, this seems to be the problem.  Sigh.  By the time 2.6.8 came around I
> just didn't have time to do the amount of testing which any page reclaim
> tweak necessitates.

(Continue reading)
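
The per-call increments in Wu's table above are consistent with an increment of (nr_inactive >> priority) + 1 on each shrink_zone() call. The following is only a sketch under that assumption; the helper names are hypothetical and are not kernel functions, and the zone sizes are the 2k and 64k page figures from the table.

```c
#include <assert.h>

#define SWAP_CLUSTER_MAX 32UL

/*
 * Pages added to the pending inactive scan count by one call,
 * assuming the increment is (nr_inactive >> priority) + 1.
 */
unsigned long scan_increment(unsigned long nr_inactive, int priority)
{
	assert(priority >= 0);
	return (nr_inactive >> priority) + 1;
}

/*
 * Number of calls needed before the accumulated count reaches one
 * SWAP_CLUSTER_MAX-sized batch (hypothetical helper, illustration only).
 */
unsigned long calls_per_batch(unsigned long nr_inactive, int priority)
{
	unsigned long inc = scan_increment(nr_inactive, priority);

	/* Ceiling division: rounds needed to accumulate one full batch. */
	return (SWAP_CLUSTER_MAX + inc - 1) / inc;
}
```

For a 2048-page zone this gives increments of 1, 2 and 3 at priorities 12, 11 and 10, and 32, 16 and 11 calls per batch, which matches the DMA row of the table. A 65536-page zone gives increments of 17, 33 and 65.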

Andrew Morton | 2 Dec 22:39

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Marcelo Tosatti <marcelo.tosatti <at> cyclades.com> wrote:
>
> 
> It all makes sense to me (Wu's description of the problem and your patch), 
> but it is still no good with respect to fair scanning.

Not so.  On a 4G x86 box doing a simple 8GB write this patch took the
highmem/normal scanning ratio from 0.7 to 3.5.  On that setup the highmem
zone has 3.6x as many pages as the normal zone, so it's bang-on-target.

There's not a lot of point in jumping straight into the complex stresstests
without having first tested the simple stuff.

> Moreover the patch hurts 
> interactivity _badly_, not sure why (ssh into the box with the FFSB testcase 
> takes more than one minute to log in, while vanilla takes a few dozen seconds). 

Well, we know that the revert reintroduces an overscanning problem.

How are you invoking FFSB?  Exactly?  On what sort of machine, with how
much memory?

> Follows an interesting part of "diff -u 2614-vanilla.vmstat 2614-akpm.vmstat"
> (they were not retrieved at the exact same point in the benchmark run, but 
> that should not matter much):
> 
> -slabs_scanned 37632
> -kswapd_steal 731859
> -kswapd_inodesteal 1363
> -pageoutrun 26573
(Continue reading)

Marcelo Tosatti | 3 Dec 01:26

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Fri, Dec 02, 2005 at 01:39:17PM -0800, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti <at> cyclades.com> wrote:
> >
> > 
> > It all makes sense to me (Wu's description of the problem and your patch), 
> > but it is still no good with respect to fair scanning.
> 
> Not so.  On a 4G x86 box doing a simple 8GB write this patch took the
> highmem/normal scanning ratio from 0.7 to 3.5.  On that setup the highmem
> zone has 3.6x as many pages as the normal zone, so it's bang-on-target.

Humpf!  What are the pgalloc dma/normal/highmem numbers under such a test?

Does this machine need bounce buffers for disk I/O?

> There's not a lot of point in jumping straight into the complex stresstests
> without having first tested the simple stuff.

It's not really a complex stresstest, though yours is simpler. There are 10 
threads operating on 20 files. You can reproduce the load using the 
following FFSB profile (I remake the filesystem each time, results are 
pretty stable):

num_filesystems=1
num_threadgroups=1
directio=0
time=300

[filesystem0]
location=/mnt/hda4/
(Continue reading)

Wu Fengguang | 4 Dec 07:06

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Fri, Dec 02, 2005 at 10:26:14PM -0200, Marcelo Tosatti wrote:
> It seems very fragile (Wu's patches attempt to address that) in general: you
> tweak it here and watch it go nuts there.

The patch still has problems, and it can lead to more page allocations in
remote nodes.

For NUMA systems, basically HPC applications want locality, and file servers
want cache consistency. Worse still, the two types of applications can coexist
on a single system. The general solution may be classifying pages into two types:

local  pages: mostly locally accessed, and low latency is the first priority
global pages: for consistent file caching

Reclaims from global pages should be balanced globally to form a seamless
single global cache. We can allocate special zones to hold the global pages,
and keep the reclaims from them in sync. Nick, are you working on this?

Thanks,
Wu

Marcelo Tosatti | 2 Dec 02:26

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging


On Thu, Dec 01, 2005 at 03:03:49PM -0800, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti <at> cyclades.com> wrote:
> >
> > Hi Andrew,
> > 
> > On Thu, Dec 01, 2005 at 02:37:14AM -0800, Andrew Morton wrote:
> > > Wu Fengguang <wfg <at> mail.ustc.edu.cn> wrote:
> > > >
> > > >  The zone aging rates are currently imbalanced,
> > > 
> > > ZONE_DMA is out of whack.  It shouldn't be, and I'm not aware of anyone
> > > getting in and working out why.  I certainly wouldn't want to go and add
> > > all this stuff without having a good understanding of _why_ it's out of
> > > whack.  Perhaps it's just some silly bug, like the thing I pointed at in
> > > the previous email.
> > 
> > I think that the problem is caused by the interaction between 
> > the way reclaiming is quantified and parallel allocators.
> 
> Could be.  But what about the bug which I think is there?  That'll cause
> overscanning of the DMA zone. 

There were about 12MB of inactive pages in the DMA zone. Your hypothesis 
was that there were no LRU pages to be scanned in the DMA zone?

> > The zones have different sizes, and each zone reclaim iteration
> > scans the same number of pages. It is unfair.
> 
> Nope.  See how shrink_zone() bases nr_active and nr_inactive on
(Continue reading)

Andrew Morton | 2 Dec 04:40

Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Marcelo Tosatti <marcelo.tosatti <at> cyclades.com> wrote:
>
>  > Could be.  But what about the bug which I think is there?  That'll cause
>  > overscanning of the DMA zone. 
> 
>  There were about 12MB of inactive pages in the DMA zone. Your hypothesis 
>  was that there were no LRU pages to be scanned in the DMA zone?

No, my hypothesis was that balance_pgdat() had a bug.  Looking at it again,
I don't see it any more..

Wu Fengguang | 1 Dec 11:18

[PATCH 08/12] mm: remove swap_cluster_max from scan_control

The use of sc.swap_cluster_max is weird and redundant.

The callers should just set sc.priority/sc.nr_to_reclaim, and let
shrink_zone() decide the proper loop parameters.

Signed-off-by: Wu Fengguang <wfg <at> mail.ustc.edu.cn>
---

 mm/vmscan.c |   15 ++++-----------
 1 files changed, 4 insertions(+), 11 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -76,12 +76,6 @@ struct scan_control {

 	/* Can pages be swapped as part of reclaim? */
 	int may_swap;
-
-	/* This context's SWAP_CLUSTER_MAX. If freeing memory for
-	 * suspend, we effectively ignore SWAP_CLUSTER_MAX.
-	 * In this context, it doesn't matter that we scan the
-	 * whole list at once. */
-	int swap_cluster_max;
 };

 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -1125,7 +1119,6 @@ shrink_zone(struct zone *zone, struct sc
 	nr_inactive &= ~(SWAP_CLUSTER_MAX - 1);

 	sc->nr_to_scan = SWAP_CLUSTER_MAX;
(Continue reading)

Wu Fengguang | 1 Dec 11:18

[PATCH 10/12] mm: merge sc.may_writepage and sc.may_swap into sc.flags

Turn bool values into flags to make struct scan_control more compact.

Signed-off-by: Wu Fengguang <wfg <at> mail.ustc.edu.cn>
---

 mm/vmscan.c |   22 ++++++++++------------
 1 files changed, 10 insertions(+), 12 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -72,12 +72,12 @@ struct scan_control {
 	/* This context's GFP mask */
 	gfp_t gfp_mask;

-	int may_writepage;
-
-	/* Can pages be swapped as part of reclaim? */
-	int may_swap;
+	unsigned long flags;
 };

+#define SC_MAY_WRITEPAGE	0x1
+#define SC_MAY_SWAP		0x2	/* Can pages be swapped as part of reclaim? */
+
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))

 #ifdef ARCH_HAS_PREFETCH
@@ -487,7 +487,7 @@ static int shrink_list(struct list_head 
 		 * Try to allocate it some swap space here.
 		 */
(Continue reading)
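
The conversion described in [PATCH 10/12] above, from separate int fields to bits in one flags word, can be sketched in isolation as follows. The SC_MAY_WRITEPAGE and SC_MAY_SWAP values are taken from the patch; the struct name and the three helper functions are hypothetical and exist only for illustration.

```c
#define SC_MAY_WRITEPAGE	0x1
#define SC_MAY_SWAP		0x2	/* Can pages be swapped as part of reclaim? */

/* Cut-down stand-in for struct scan_control; only the flags word is kept. */
struct scan_control_sketch {
	unsigned long flags;
};

/* Turn a capability on, e.g. in place of "sc.may_swap = 1". */
void sc_set(struct scan_control_sketch *sc, unsigned long flag)
{
	sc->flags |= flag;
}

/* Turn a capability off, e.g. in place of "sc.may_writepage = 0". */
void sc_clear(struct scan_control_sketch *sc, unsigned long flag)
{
	sc->flags &= ~flag;
}

/* Test a capability, e.g. in place of "if (sc->may_swap)". */
int sc_test(const struct scan_control_sketch *sc, unsigned long flag)
{
	return (sc->flags & flag) != 0;
}
```

Two int fields become one unsigned long, and further booleans can be added without growing the struct. The patch itself may well open-code the bit operations instead of using helpers like these.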

Wu Fengguang | 1 Dec 11:18

[PATCH 09/12] mm: accumulate sc.nr_scanned/sc.nr_reclaimed

Now that there's no need to keep track of nr_scanned/nr_reclaimed for every
single round of shrink_zone(), remove the total_scanned/total_reclaimed and
let nr_scanned/nr_reclaimed accumulate between rounds.

Signed-off-by: Wu Fengguang <wfg <at> mail.ustc.edu.cn>
---

 mm/vmscan.c |   36 ++++++++++++++----------------------
 1 files changed, 14 insertions(+), 22 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1229,7 +1229,6 @@ int try_to_free_pages(struct zone **zone
 {
 	int priority;
 	int ret = 0;
-	int total_scanned = 0, total_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc;
 	int i;
@@ -1239,6 +1238,8 @@ int try_to_free_pages(struct zone **zone
 	sc.gfp_mask = gfp_mask;
 	sc.may_writepage = 0;
 	sc.may_swap = 1;
+	sc.nr_scanned = 0;
+	sc.nr_reclaimed = 0;

 	inc_page_state(allocstall);

@@ -1254,8 +1255,6 @@ int try_to_free_pages(struct zone **zone
(Continue reading)

Wu Fengguang | 1 Dec 11:18

[PATCH 06/12] mm: balance active/inactive list scan rates

shrink_zone() has two major design goals:
1) let active/inactive lists have equal scan rates
2) do the scans in small chunks

But the implementation has some problems:
- reluctant to scan small zones
  the callers often have to dip into low priority to free memory.

- the balance is quite rough
  the break statement in the loop breaks it.

- may scan few pages in one batch
  refill_inactive_zone can be called twice to scan 32 and 1 pages.

The new design:
1) keep perfect balance
   let active_list follow inactive_list in scan rate

2) always scan in SWAP_CLUSTER_MAX sized chunks
   simple and efficient

3) will scan at least one chunk
   the expected behavior from the callers

The perfect balance may or may not yield better performance, though it
a) is a more understandable and dependable behavior
b) together with inter-zone balancing, makes the zoned memories consistent

The atomic reclaim_in_progress is there to prevent most concurrent reclaims.
If concurrent reclaims do happen, there will be no fatal errors.
(Continue reading)

Peter Zijlstra | 1 Dec 12:39

Re: [PATCH 06/12] mm: balance active/inactive list scan rates

On Thu, 2005-12-01 at 18:18 +0800, Wu Fengguang wrote:
> plain text document attachment
> (mm-balance-active-inactive-list-aging.patch)
> shrink_zone() has two major design goals:
> 1) let active/inactive lists have equal scan rates
> 2) do the scans in small chunks
> 

> The new design:
> 1) keep perfect balance
>    let active_list follow inactive_list in scan rate
> 
> 2) always scan in SWAP_CLUSTER_MAX sized chunks
>    simple and efficient
> 
> 3) will scan at least one chunk
>    the expected behavior from the callers
> 
> The perfect balance may or may not yield better performance, though it
> a) is a more understandable and dependable behavior
> b) together with inter-zone balancing, makes the zoned memories consistent

Nice, this patch effectively separates zone balancing from
active/inactive balancing. I was thinking about doing this this morning
in order to nicely abstract out all the page-replacement code.

Thanks!

Wu Fengguang | 1 Dec 11:18

[PATCH 11/12] mm: add page reclaim debug traces

Show the detailed steps of direct/kswapd page reclaim.

To enable the printk traces:
# echo y > /debug/debug_page_reclaim

Sample lines:

reclaim zone3 from kswapd for watermark, prio 12, scan-reclaimed 32-32, age 2626, active to scan 6542,
hot+cold+free pages 8842+283558+352
reclaim zone2 from kswapd for aging, prio 12, scan-reclaimed 32-32, age 2626, active to scan 8018,
hot+cold+free pages 1693+200036+10360
reclaim zone3 from kswapd for watermark, prio 12, scan-reclaimed 64-64, age 2627, active to scan 7564,
hot+cold+free pages 8842+283526+384
reclaim zone2 from kswapd for aging, prio 12, scan-reclaimed 32-32, age 2627, active to scan 8296,
hot+cold+free pages 1693+200018+10360
reclaim zone3 from kswapd for watermark, prio 12, scan-reclaimed 64-63, age 2628, active to scan 8587,
hot+cold+free pages 8843+283495+416
reclaim zone2 from kswapd for aging, prio 12, scan-reclaimed 32-32, age 2628, active to scan 8574,
hot+cold+free pages 1693+200014+10392
reclaim zone3 from kswapd for watermark, prio 12, scan-reclaimed 64-63, age 2628, active to scan 9610,
hot+cold+free pages 8844+283465+448
reclaim zone2 from kswapd for aging, prio 12, scan-reclaimed 32-32, age 2628, active to scan 8852,
hot+cold+free pages 1693+199996+10424
reclaim zone3 from kswapd for watermark, prio 12, scan-reclaimed 64-64, age 2629, active to scan 10633,
hot+cold+free pages 8844+283433+480
reclaim zone2 from kswapd for aging, prio 12, scan-reclaimed 32-32, age 2629, active to scan 9130,
hot+cold+free pages 1693+199992+10456
reclaim zone3 from kswapd for watermark, prio 12, scan-reclaimed 64-64, age 2630, active to scan 11656,
hot+cold+free pages 8844+283401+512
reclaim zone2 from kswapd for aging, prio 12, scan-reclaimed 32-32, age 2630, active to scan 9408,
(Continue reading)

Wu Fengguang | 1 Dec 11:18

[PATCH 12/12] mm: fix minor scan count bugs

- in isolate_lru_pages(): the scan count is reported one too high. Fix it.
- in shrink_cache(): 0 pages taken does not mean 0 pages scanned. Fix it.

Signed-off-by: Wu Fengguang <wfg <at> mail.ustc.edu.cn>
---

 mm/vmscan.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -920,7 +920,8 @@ static int isolate_lru_pages(int nr_to_s
 	struct page *page;
 	int scan = 0;

-	while (scan++ < nr_to_scan && !list_empty(src)) {
+	while (scan < nr_to_scan && !list_empty(src)) {
+		scan++;
 		page = lru_to_page(src);
 		prefetchw_prev_lru_page(page, src, flags);

@@ -967,14 +968,15 @@ static void shrink_cache(struct zone *zo
 	update_zone_age(zone, nr_scan);
 	spin_unlock_irq(&zone->lru_lock);

-	if (nr_taken == 0)
-		return;
-
 	sc->nr_scanned += nr_scan;
 	if (current_is_kswapd())
(Continue reading)
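
The off-by-one fixed in isolate_lru_pages() by [PATCH 12/12] above comes from the post-increment in the loop condition: scan++ still runs on the final test that ends the loop. A standalone sketch of the two loop shapes, with the LRU list replaced by a plain element count (the function names and the list_len parameter are hypothetical):

```c
/*
 * Old loop shape: the post-increment is evaluated even on the test
 * that terminates the loop, so the result is always one too high.
 */
int count_scans_old(int nr_to_scan, int list_len)
{
	int scan = 0;

	while (scan++ < nr_to_scan && list_len > 0)
		list_len--;	/* stands in for taking one page off the list */

	return scan;
}

/*
 * New loop shape: scan is incremented only when a page is actually
 * examined, so the result equals the number of loop iterations.
 */
int count_scans_new(int nr_to_scan, int list_len)
{
	int scan = 0;

	while (scan < nr_to_scan && list_len > 0) {
		scan++;
		list_len--;
	}

	return scan;
}
```

With nr_to_scan = 32 and a list of 100 entries the old shape yields 33 and the new one 32; with an empty list the old shape yields 1 and the new one 0.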

