KOSAKI Motohiro | 20 Aug 13:03

[RFC][PATCH 0/2] Quicklist is slighly problematic.

Hi Christoph,

Thank you for explaining your quicklist plan at OLS.

So, I have summarized the quicklist issues.
If you have a little time, could you please read this mail and the patches?
And, if possible, could you please tell me what you think?

--------------------------------------------------------------------

Currently, quicklist stores some pages on each CPU as a cache.
(Each CPU can hold up to node_free_pages/16 pages.)

It is used as a page table cache.
So exit() grows the cache, while fork() consumes it.

So, if apache-type middleware (one parent, many children model) runs,
one CPU handles fork() while the other CPUs do the middleware work and exit().

At that point, one CPU has no page table cache at all,
while the others hold the maximum cache.

	QList_max = (#ofCPUs - 1) x Free / 16
	=> QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1)

So, how much memory does quicklist consume in the worst case?
It is proportional to the number of CPUs, because it is a per-CPU cache but the cache size calculation doesn't take #ofCPUs into account.
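
As an illustration of the point above, here is a minimal user-space sketch (not kernel code) that compares the per-CPU limit as described here with a limit divided by the number of CPUs on the node, which is what patch 2/2 later in this thread does in max_pages(). The helper names are hypothetical; the divisor 16 is taken from "node_free_pages/16" above.

```c
#include <assert.h>

/* Taken from "node_free_pages/16" above; the max_pages() diff quoted
 * later in the thread calls this FRACTION_OF_NODE_MEM. */
#define FRACTION_OF_NODE_MEM 16UL

/* Hypothetical helper: per-CPU limit that ignores the CPU count. */
static unsigned long limit_per_cpu_old(unsigned long node_free_pages,
                                       unsigned long min_pages)
{
    unsigned long max = node_free_pages / FRACTION_OF_NODE_MEM;

    return max > min_pages ? max : min_pages;
}

/* Hypothetical helper: the same limit shared among the CPUs of the node. */
static unsigned long limit_per_cpu_new(unsigned long node_free_pages,
                                       unsigned long cpus_on_node,
                                       unsigned long min_pages)
{
    unsigned long max;

    assert(cpus_on_node > 0);
    max = node_free_pages / FRACTION_OF_NODE_MEM / cpus_on_node;

    return max > min_pages ? max : min_pages;
}

/* Worst case from the text: all CPUs of the node but one hold a full cache. */
static unsigned long worst_case_pages(unsigned long per_cpu_limit,
                                      unsigned long cpus_on_node)
{
    return (cpus_on_node - 1) * per_cpu_limit;
}
```

With 160000 free pages on a node and 4 CPUs, the undivided limit is 10000 pages per CPU (30000 pages in the worst case), while the divided limit is 2500 pages per CPU (7500 pages in the worst case).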

	Above calculation mean

KOSAKI Motohiro | 20 Aug 13:07

[RFC][PATCH 1/2] Show quicklist at meminfo

Quicklist can now consume several GB of memory.
So, if the end user can't see how much memory it is using, they may mistake it for a memory leak.

After this patch is applied, /proc/meminfo shows the following.

% cat /proc/meminfo

MemTotal:        7701504 kB
MemFree:         5159040 kB
Buffers:          112960 kB
Cached:           337536 kB
SwapCached:            0 kB
Active:           218944 kB
Inactive:         350848 kB
Active(anon):     120832 kB
Inactive(anon):        0 kB
Active(file):      98112 kB
Inactive(file):   350848 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       2031488 kB
SwapFree:        2031488 kB
Dirty:               320 kB
Writeback:             0 kB
AnonPages:        119488 kB
Mapped:            38528 kB
Slab:            1595712 kB
SReclaimable:      23744 kB
SUnreclaim:      1571968 kB
PageTables:        14336 kB

Andrew Morton | 20 Aug 20:35

Re: [RFC][PATCH 1/2] Show quicklist at meminfo

On Wed, 20 Aug 2008 20:07:06 +0900
KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com> wrote:

> Now, Quicklist can spent several GB memory.
> So, if end user can't hou much spent memory, he misunderstand to memory leak happend.
> 
> 
> after this patch applied, /proc/meminfo output following.
> 
> % cat /proc/meminfo
> 
> MemTotal:        7701504 kB
> MemFree:         5159040 kB
> Buffers:          112960 kB
> Cached:           337536 kB
> SwapCached:            0 kB
> Active:           218944 kB
> Inactive:         350848 kB
> Active(anon):     120832 kB
> Inactive(anon):        0 kB
> Active(file):      98112 kB
> Inactive(file):   350848 kB
> Unevictable:           0 kB
> Mlocked:               0 kB
> SwapTotal:       2031488 kB
> SwapFree:        2031488 kB
> Dirty:               320 kB
> Writeback:             0 kB
> AnonPages:        119488 kB
> Mapped:            38528 kB

KOSAKI Motohiro | 21 Aug 09:36

Re: [RFC][PATCH 1/2] Show quicklist at meminfo

> quicklist_total_size() is racy against cpu hotplug.  That's OK for
> /proc/meminfo purposes (occasional transient inaccuracy?), but will it
> crash?  Not in the current implementation of per_cpu() afaict, but it
> might crash if we ever teach cpu hotunplug to free up the percpu
> resources.

First, quicklist doesn't handle cpu hotplug at all.
That is a separate quicklist problem.

Next, I think it doesn't cause a crash, but I haven't tested it yet.
So, I'll run cpu hotplug/unplug tests today.

I'll report the results tomorrow.

> I see no cpu hotplug handling in the quicklist code.  Do we leak all
> the hot-unplugged CPU's pages?

Yes.

Thanks!

KOSAKI Motohiro | 22 Aug 03:05

Re: [RFC][PATCH 1/2] Show quicklist at meminfo

> > quicklist_total_size() is racy against cpu hotplug.  That's OK for
> > /proc/meminfo purposes (occasional transient inaccuracy?), but will it
> > crash?  Not in the current implementation of per_cpu() afaict, but it
> > might crash if we ever teach cpu hotunplug to free up the percpu
> > resources.
> 
> First, Quicklist doesn't concern to cpu hotplug at all.
> it is another quicklist problem.
> 
> Next, I think it doesn't cause crash. but I haven't any test.
> So, I'll test cpu hotplug/unplug testing today.
> 
> I'll report result tommorow.

OK.
I ran a continuous cpu hotplug/unplug workload for over 12 hours,
and the system did not crash.

So, I believe my patch is cpu unplug safe.

test method
--------------------------------------------------------------
1. Open 7 terminals and run the following script on each console.

CPU=cpuXXX; while true; do echo 0 > /sys/devices/system/cpu/$CPU/online; echo 1 > /sys/devices/system/cpu/$CPU/online; done

2. Open another console and run the following command.

watch -n 1 cat /proc/meminfo

Andrew Morton | 22 Aug 06:28

Re: [RFC][PATCH 1/2] Show quicklist at meminfo

On Fri, 22 Aug 2008 10:05:45 +0900 KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com> wrote:

> > > quicklist_total_size() is racy against cpu hotplug.  That's OK for
> > > /proc/meminfo purposes (occasional transient inaccuracy?), but will it
> > > crash?  Not in the current implementation of per_cpu() afaict, but it
> > > might crash if we ever teach cpu hotunplug to free up the percpu
> > > resources.
> > 
> > First, Quicklist doesn't concern to cpu hotplug at all.
> > it is another quicklist problem.
> > 
> > Next, I think it doesn't cause crash. but I haven't any test.
> > So, I'll test cpu hotplug/unplug testing today.
> > 
> > I'll report result tommorow.
> 
> OK.
> I ran cpu hotplug/unplug coutinuous workload over 12H.
> then, system crash doesn't happend.
> 
> So, I believe my patch is cpu unplug safe.

err, which patch?

I presently have:

mm-show-quicklist-memory-usage-in-proc-meminfo.patch
mm-show-quicklist-memory-usage-in-proc-meminfo-fix.patch
mm-quicklist-shouldnt-be-proportional-to-number-of-cpus.patch
mm-quicklist-shouldnt-be-proportional-to-number-of-cpus-fix.patch

Robin Holt | 22 Aug 15:23

Re: [RFC][PATCH 1/2] Show quicklist at meminfo

Christoph,

Could we maybe add a per_cpu off-node quicklist and just always free
that in check_pgt_cache?  That would get us back the freeing of off-node
page tables.
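
One way to picture this suggestion is the user-space sketch below. It is purely hypothetical: the struct and function names are invented for illustration, and it is not the patch set Christoph refers to in his reply (which is not shown in this thread). It assumes each page records its home node, keeps two lists per CPU, and has the check_pgt_cache() stand-in always drain the off-node list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page descriptor: only what the sketch needs. */
struct demo_page {
    struct demo_page *next;
    int node;                       /* node the page's memory lives on */
};

/* Hypothetical per-CPU quicklist with a separate off-node list. */
struct demo_quicklist {
    struct demo_page *local;        /* pages from this CPU's own node */
    struct demo_page *off_node;     /* pages that belong to another node */
    size_t nr_local;
    size_t nr_off_node;
};

/* Free path: sort the page by whether it is local to the CPU's node. */
static void demo_quicklist_free(struct demo_quicklist *q, int cpu_node,
                                struct demo_page *p)
{
    if (p->node == cpu_node) {
        p->next = q->local;
        q->local = p;
        q->nr_local++;
    } else {
        p->next = q->off_node;
        q->off_node = p;
        q->nr_off_node++;
    }
}

/* Stand-in for check_pgt_cache(): always drain the off-node list.
 * 'release' may be NULL; returns the number of pages drained. */
static size_t demo_check_pgt_cache(struct demo_quicklist *q,
                                   void (*release)(struct demo_page *))
{
    size_t freed = 0;

    while (q->off_node) {
        struct demo_page *p = q->off_node;

        q->off_node = p->next;
        q->nr_off_node--;
        if (release)
            release(p);
        freed++;
    }
    return freed;
}
```

In this sketch the local list is left untouched by the drain, so only off-node page tables go back to the allocator on every check.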

Thanks,
Robin

On Thu, Aug 21, 2008 at 09:28:47PM -0700, Andrew Morton wrote:
> On Fri, 22 Aug 2008 10:05:45 +0900 KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com> wrote:
> 
> > > > quicklist_total_size() is racy against cpu hotplug.  That's OK for
> > > > /proc/meminfo purposes (occasional transient inaccuracy?), but will it
> > > > crash?  Not in the current implementation of per_cpu() afaict, but it
> > > > might crash if we ever teach cpu hotunplug to free up the percpu
> > > > resources.
> > > 
> > > First, Quicklist doesn't concern to cpu hotplug at all.
> > > it is another quicklist problem.
> > > 
> > > Next, I think it doesn't cause crash. but I haven't any test.
> > > So, I'll test cpu hotplug/unplug testing today.
> > > 
> > > I'll report result tommorow.
> > 
> > OK.
> > I ran cpu hotplug/unplug coutinuous workload over 12H.
> > then, system crash doesn't happend.
> > 

Christoph Lameter | 22 Aug 15:56

Re: [RFC][PATCH 1/2] Show quicklist at meminfo

Robin Holt wrote:
> 
> Could we maybe add a per_cpu off-node quicklist and just always free
> that in check_pgt_cache?  That would get us back the freeing of off-node
> page tables.

Yes that is what I suggested and if you check your email from last year then
you will find an internal discussion and patches for such an approach.

KOSAKI Motohiro | 23 Aug 10:24

Re: [RFC][PATCH 1/2] Show quicklist at meminfo

> > OK.
> > I ran cpu hotplug/unplug coutinuous workload over 12H.
> > then, system crash doesn't happend.
> > 
> > So, I believe my patch is cpu unplug safe.
> 
> err, which patch?
> 
> I presently have:
> 
> mm-show-quicklist-memory-usage-in-proc-meminfo.patch
> mm-show-quicklist-memory-usage-in-proc-meminfo-fix.patch
> mm-quicklist-shouldnt-be-proportional-to-number-of-cpus.patch
> mm-quicklist-shouldnt-be-proportional-to-number-of-cpus-fix.patch
> 
> Is that what you have?
> 
> I'll consolidate them into two patches and will append them here.  Please check.

Andrew, thank you for your attention.

I tested on

mm-show-quicklist-memory-usage-in-proc-meminfo.patch
mm-show-quicklist-memory-usage-in-proc-meminfo-fix.patch

and 

http://marc.info/?l=linux-mm&m=121931317407295&w=2 


Andrew Morton | 24 Aug 07:29

Re: [RFC][PATCH 1/2] Show quicklist at meminfo

On Sat, 23 Aug 2008 17:24:31 +0900 KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com> wrote:

> > > OK.
> > > I ran cpu hotplug/unplug coutinuous workload over 12H.
> > > then, system crash doesn't happend.
> > > 
> > > So, I believe my patch is cpu unplug safe.
> > 
> > err, which patch?
> > 
> > I presently have:
> > 
> > mm-show-quicklist-memory-usage-in-proc-meminfo.patch
> > mm-show-quicklist-memory-usage-in-proc-meminfo-fix.patch
> > mm-quicklist-shouldnt-be-proportional-to-number-of-cpus.patch
> > mm-quicklist-shouldnt-be-proportional-to-number-of-cpus-fix.patch
> > 
> > Is that what you have?
> > 
> > I'll consolidate them into two patches and will append them here.  Please check.
> 
> Andrew, Thank you for your attention.
> 
> I test on
> 
> mm-show-quicklist-memory-usage-in-proc-meminfo.patch
> mm-show-quicklist-memory-usage-in-proc-meminfo-fix.patch
> 
> and 
> 

KOSAKI Motohiro | 20 Aug 13:08

[RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

When a test program which does task migration runs, my 8GB box spends 800MB of memory
on quicklist. This is not a memory leak, but it doesn't seem good.

% cat /proc/meminfo

MemTotal:        7701568 kB
MemFree:         4724672 kB
(snip)
Quicklists:       844800 kB

because

- My machine spec is:
	number of NUMA nodes: 2
	number of CPUs:       8 (4 CPUs x 2 nodes)
	total mem:            8GB (4GB x 2 nodes)
	free mem:             about 5GB

- Maximum quicklist usage is as follows:

	 Number of CPUs per node            2     4     8    16
	 ==============================  ======================
	 QList_max / (Free + QList_max)   5.9%   16%   30%   48%

- Then, (4.7GB + QList_max) x 16% ~= 880MB (equivalently, QList_max ~= 4.7GB x 3/16).
  So, quicklist can use more than 800MB.
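
The figures above can be checked with a small user-space sketch. It only evaluates the two formulas from the cover letter, QList_max = (#ofCPUs - 1) x Free / 16 and QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1), with #ofCPUs read as CPUs per node as in the table. The function names are hypothetical.

```c
#include <assert.h>

/* QList_max / (Free + QList_max) = (n - 1) / (16 + n - 1),
 * where n is the number of CPUs per node. */
static double qlist_fraction(unsigned int cpus_per_node)
{
    assert(cpus_per_node >= 1);
    return (double)(cpus_per_node - 1) / (16.0 + cpus_per_node - 1);
}

/* QList_max = (n - 1) x Free / 16, with Free in kB as shown by MemFree.
 * Dividing first keeps the intermediate value small. */
static unsigned long qlist_max_kb(unsigned long free_kb,
                                  unsigned int cpus_per_node)
{
    assert(cpus_per_node >= 1);
    return free_kb / 16 * (cpus_per_node - 1);
}
```

For n = 2, 4, 8 and 16 the fraction evaluates to about 5.9%, 15.8%, 30.4% and 48.4%, close to the table's figures. For MemFree = 4724672 kB and 4 CPUs per node, qlist_max_kb() gives 885876 kB, against the 844800 kB actually reported in the Quicklists line.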

So, if a machine with the following spec runs that program

   CPUs: 64 (8cpu x 8node)

Christoph Lameter | 20 Aug 17:27

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

Looks good.

Acked-by: Christoph Lameter <cl <at> linux-foundation.org>

Andrew Morton | 21 Aug 08:46

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

On Wed, 20 Aug 2008 20:08:13 +0900 KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com> wrote:

> +	num_cpus_per_node = cpus_weight_nr(node_to_cpumask(node));

sparc64 allmodconfig:

mm/quicklist.c: In function `max_pages':
mm/quicklist.c:44: error: invalid lvalue in unary `&'

we seem to have made a spectacular mess of cpumasks lately.

David Miller | 21 Aug 09:13

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

From: Andrew Morton <akpm <at> linux-foundation.org>
Date: Wed, 20 Aug 2008 23:46:15 -0700

> On Wed, 20 Aug 2008 20:08:13 +0900 KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com> wrote:
> 
> > +	num_cpus_per_node = cpus_weight_nr(node_to_cpumask(node));
> 
> sparc64 allmodconfig:
> 
> mm/quicklist.c: In function `max_pages':
> mm/quicklist.c:44: error: invalid lvalue in unary `&'
> 
> we seem to have a made a spectacular mess of cpumasks lately.

It should explode similarly on x86, since it also defines node_to_cpumask()
as an inline function.

IA64 seems to be one of the few platforms to define this as a macro
evaluating to the node-to-cpumask array entry, so it's clear what
platform Motohiro-san did build testing on :-)
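
The difference between the two definitions can be reproduced in user space. The sketch below assumes, as the compiler error suggests, that cpus_weight_nr() is a macro that takes the address of its argument; all demo_* names and the 64-bit mask type are stand-ins invented for illustration, not the kernel's definitions.

```c
#include <assert.h>

/* Stand-in for cpumask_t: a single 64-bit word. */
typedef struct { unsigned long long bits; } demo_cpumask_t;

/* Counts set bits; takes a pointer to the mask. */
static int demo_weight_ptr(const demo_cpumask_t *mask)
{
    unsigned long long b = mask->bits;
    int n = 0;

    while (b) {
        b &= b - 1;     /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* The wrapper takes the address of its argument, so the argument
 * must be an lvalue. */
#define demo_cpus_weight(mask) demo_weight_ptr(&(mask))

static demo_cpumask_t demo_node_map[2] = { { 0x0fULL }, { 0xf0ULL } };

/* "Inline function" flavour: returns the mask by value (an rvalue). */
static inline demo_cpumask_t demo_node_to_cpumask(int node)
{
    return demo_node_map[node];
}

/* "Macro" flavour: expands to the array entry (an lvalue). */
#define demo_node_to_cpumask_macro(node) (demo_node_map[(node)])

static int cpus_on_node(int node)
{
    /* demo_cpus_weight(demo_node_to_cpumask(node)) would not compile:
     * it expands to '&' applied to a function call result. */
    demo_cpumask_t mask = demo_node_to_cpumask(node);   /* copy to a local */

    return demo_cpus_weight(mask);
}

static int cpus_on_node_macro(int node)
{
    /* Fine: the macro expands to an array entry, which is an lvalue. */
    return demo_cpus_weight(demo_node_to_cpumask_macro(node));
}
```

Copying the mask into a local variable first makes the by-value flavour usable with the address-taking wrapper, at the cost of a cpumask-sized variable on the stack.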

KOSAKI Motohiro | 21 Aug 09:18

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

>> sparc64 allmodconfig:
>>
>> mm/quicklist.c: In function `max_pages':
>> mm/quicklist.c:44: error: invalid lvalue in unary `&'
>>
>> we seem to have a made a spectacular mess of cpumasks lately.
>
> It should explode similarly on x86, since it also defines node_to_cpumask()
> as an inline function.
>
> IA64 seems to be one of the few platforms to define this as a macro
> evaluating to the node-to-cpumask array entry, so it's clear what
> platform Motohiro-san did build testing on :-)

Thank you for the good advice.
I don't have a sparc64 machine, but I can borrow an x86 machine.
So, I'll test on x86 today.

Andrew Morton | 21 Aug 09:27

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

On Thu, 21 Aug 2008 00:13:22 -0700 (PDT) David Miller <davem <at> davemloft.net> wrote:

> From: Andrew Morton <akpm <at> linux-foundation.org>
> Date: Wed, 20 Aug 2008 23:46:15 -0700
> 
> > On Wed, 20 Aug 2008 20:08:13 +0900 KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com> wrote:
> > 
> > > +	num_cpus_per_node = cpus_weight_nr(node_to_cpumask(node));
> > 
> > sparc64 allmodconfig:
> > 
> > mm/quicklist.c: In function `max_pages':
> > mm/quicklist.c:44: error: invalid lvalue in unary `&'
> > 
> > we seem to have a made a spectacular mess of cpumasks lately.
> 
> It should explode similarly on x86, since it also defines node_to_cpumask()
> as an inline function.
> 
> IA64 seems to be one of the few platforms to define this as a macro
> evaluating to the node-to-cpumask array entry, so it's clear what
> platform Motohiro-san did build testing on :-)

Seems to compile OK on x86_32, x86_64, ia64 and powerpc for some reason.

This seems to fix things on sparc64:

--- a/mm/quicklist.c~mm-quicklist-shouldnt-be-proportional-to-number-of-cpus-fix
+++ a/mm/quicklist.c
@@ -28,7 +28,7 @@ static unsigned long max_pages(unsigned 

KOSAKI Motohiro | 21 Aug 09:31

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

>> IA64 seems to be one of the few platforms to define this as a macro
>> evaluating to the node-to-cpumask array entry, so it's clear what
>> platform Motohiro-san did build testing on :-)
>
> Seems to compile OK on x86_32, x86_64, ia64 and powerpc for some reason.
>
> This seems to fix things on sparc64:
>
> --- a/mm/quicklist.c~mm-quicklist-shouldnt-be-proportional-to-number-of-cpus-fix
> +++ a/mm/quicklist.c
> @@ -28,7 +28,7 @@ static unsigned long max_pages(unsigned
>        unsigned long node_free_pages, max;
>        int node = numa_node_id();
>        struct zone *zones = NODE_DATA(node)->node_zones;
> -       int num_cpus_per_node;
> +       cpumask_t node_cpumask;
>
>        node_free_pages =
>  #ifdef CONFIG_ZONE_DMA
> @@ -41,8 +41,8 @@ static unsigned long max_pages(unsigned
>
>        max = node_free_pages / FRACTION_OF_NODE_MEM;
>
> -       num_cpus_per_node = cpus_weight_nr(node_to_cpumask(node));
> -       max /= num_cpus_per_node;
> +       node_cpumask = node_to_cpumask(node);
> +       max /= cpus_weight_nr(node_cpumask);
>
>        return max(max, min_pages);
>  }

Peter Zijlstra | 21 Aug 11:32

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

On Thu, 2008-08-21 at 00:27 -0700, Andrew Morton wrote:
> On Thu, 21 Aug 2008 00:13:22 -0700 (PDT) David Miller <davem <at> davemloft.net> wrote:
> 
> > From: Andrew Morton <akpm <at> linux-foundation.org>
> > Date: Wed, 20 Aug 2008 23:46:15 -0700
> > 
> > > On Wed, 20 Aug 2008 20:08:13 +0900 KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com> wrote:
> > > 
> > > > +	num_cpus_per_node = cpus_weight_nr(node_to_cpumask(node));
> > > 
> > > sparc64 allmodconfig:
> > > 
> > > mm/quicklist.c: In function `max_pages':
> > > mm/quicklist.c:44: error: invalid lvalue in unary `&'
> > > 
> > > we seem to have a made a spectacular mess of cpumasks lately.
> > 
> > It should explode similarly on x86, since it also defines node_to_cpumask()
> > as an inline function.
> > 
> > IA64 seems to be one of the few platforms to define this as a macro
> > evaluating to the node-to-cpumask array entry, so it's clear what
> > platform Motohiro-san did build testing on :-)
> 
> Seems to compile OK on x86_32, x86_64, ia64 and powerpc for some reason.
> 
> This seems to fix things on sparc64:
> 
> --- a/mm/quicklist.c~mm-quicklist-shouldnt-be-proportional-to-number-of-cpus-fix
> +++ a/mm/quicklist.c

KOSAKI Motohiro | 21 Aug 12:04

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

Hi Peter,

Thank you for pointing that out!

> > @@ -41,8 +41,8 @@ static unsigned long max_pages(unsigned 
> >  
> >  	max = node_free_pages / FRACTION_OF_NODE_MEM;
> >  
> > -	num_cpus_per_node = cpus_weight_nr(node_to_cpumask(node));
> > -	max /= num_cpus_per_node;
> > +	node_cpumask = node_to_cpumask(node);
> > +	max /= cpus_weight_nr(node_cpumask);
> >  
> >  	return max(max, min_pages);
> >  }
> 
> humm, I thought we wanted to keep cpumask_t stuff away from our stack -
> since on insanely large SGI boxen (/me looks at mike) the thing becomes
> 512 bytes.

Hm, interesting.
I think the following patch addresses your point, right?

But I worry about whether it works on sparc64...

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com>

---
 mm/quicklist.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

David Miller | 21 Aug 12:09

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

From: KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com>
Date: Thu, 21 Aug 2008 19:04:28 +0900

> but I worry about it works on sparc64...

It should.

KOSAKI Motohiro | 21 Aug 12:13

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

> From: KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com>
> Date: Thu, 21 Aug 2008 19:04:28 +0900
> 
> > but I worry about it works on sparc64...
> 
> It should.

Could you please confirm it?

David Miller | 21 Aug 12:26

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

From: KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com>
Date: Thu, 21 Aug 2008 19:13:55 +0900

> > From: KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com>
> > Date: Thu, 21 Aug 2008 19:04:28 +0900
> > 
> > > but I worry about it works on sparc64...
> > 
> > It should.
> 
> Could you please confirm it?

davem <at> sunset:~/src/GIT/net-2.6$ patch -p1 <diff
patching file mm/quicklist.c
davem <at> sunset:~/src/GIT/net-2.6$ make mm/quicklist.o
  CHK     include/linux/version.h
  CHK     include/linux/utsrelease.h
  CALL    scripts/checksyscalls.sh
  CC      mm/quicklist.o
davem <at> sunset:~/src/GIT/net-2.6$

KOSAKI Motohiro | 21 Aug 12:22

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs


Sorry, the following patch is crap.
Please forget it.

I'll respin it soon.

> 
> ---
>  mm/quicklist.c |    9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> Index: b/mm/quicklist.c
> ===================================================================
> --- a/mm/quicklist.c
> +++ b/mm/quicklist.c
> @@ -26,7 +26,10 @@ DEFINE_PER_CPU(struct quicklist, quickli
>  static unsigned long max_pages(unsigned long min_pages)
>  {
>  	unsigned long node_free_pages, max;
> -	struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
> +	int node = numa_node_id();
> +	struct zone *zones = NODE_DATA(node)->node_zones;
> +	int num_cpus_on_node;
> +	node_to_cpumask_ptr(cpumask_on_node, node);
>  
>  	node_free_pages =
>  #ifdef CONFIG_ZONE_DMA
> @@ -38,6 +41,10 @@ static unsigned long max_pages(unsigned 
>  		zone_page_state(&zones[ZONE_NORMAL], NR_FREE_PAGES);
>  

KOSAKI Motohiro | 21 Aug 14:02

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

> 
> Sorry, following patch is crap.
> please forget it.
> 
> I'll respin it soon.

Ah, it's OK after all.
It is not crap.

The generic-arch node_to_cpumask_ptr() creates a local cpumask_t variable:

#define node_to_cpumask_ptr(v, node)                                    \
                cpumask_t _##v = node_to_cpumask(node);                 \
                const cpumask_t *v = &_##v

But the gcc optimizer can eliminate it,
so it doesn't consume any stack.
checkstack.pl doesn't report any quicklist-related function.

% objdump -d vmlinux | ./scripts/checkstack.pl
0xa000000100647a86 sn2_global_tlb_purge [vmlinux]:      2176
0xa000000100264e86 read_kcore [vmlinux]:                1360
0xa0000001001042a6 crash_save_cpu [vmlinux]:            1152
0xa0000001007869e6 e1000_check_options [vmlinux]:       1152
0xa00000010021b9c6 __mpage_writepage [vmlinux]:         1136
0xa00000010034e9c6 fat_alloc_clusters [vmlinux]:        1136
0xa0000001009c29c6 efi_uart_console_only [vmlinux]:     1136
0xa00000010034afa6 fat_add_entries [vmlinux]:           1088
0xa00000010034d186 fat_free_clusters [vmlinux]:         1088
0xa00000010051f396 tg3_get_estats [vmlinux]:            1072

Mike Travis | 25 Aug 20:48

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

KOSAKI Motohiro wrote:
> Hi Peter,
> 
> Thank you good point out!
> 
>>> @@ -41,8 +41,8 @@ static unsigned long max_pages(unsigned 
>>>  
>>>  	max = node_free_pages / FRACTION_OF_NODE_MEM;
>>>  
>>> -	num_cpus_per_node = cpus_weight_nr(node_to_cpumask(node));
>>> -	max /= num_cpus_per_node;
>>> +	node_cpumask = node_to_cpumask(node);
>>> +	max /= cpus_weight_nr(node_cpumask);
>>>  
>>>  	return max(max, min_pages);
>>>  }
>> humm, I thought we wanted to keep cpumask_t stuff away from our stack -
>> since on insanely large SGI boxen (/me looks at mike) the thing becomes
>> 512 bytes.
> 
> Hm, interesting.
> I think following patch fill your point, right?
> 
> but I worry about it works on sparc64...
> 
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com>
> 
> ---
>  mm/quicklist.c |    9 ++++++++-

KOSAKI Motohiro | 26 Aug 01:33

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

> > +	int node = numa_node_id();
> > +	struct zone *zones = NODE_DATA(node)->node_zones;
> > +	int num_cpus_on_node;
> > +	node_to_cpumask_ptr(cpumask_on_node, node);
> >  
> >  	node_free_pages =
> >  #ifdef CONFIG_ZONE_DMA
> > @@ -38,6 +41,10 @@ static unsigned long max_pages(unsigned 
> >  		zone_page_state(&zones[ZONE_NORMAL], NR_FREE_PAGES);
> >  
> >  	max = node_free_pages / FRACTION_OF_NODE_MEM;
> > +
> > +	num_cpus_on_node = cpus_weight_nr(*cpumask_on_node);
> > +	max /= num_cpus_on_node;
> > +
> >  	return max(max, min_pages);
> 
> Exactly!  And (many thanks to them!) the sparc maintainers have
> implemented a similar internal function definition for node_to_cpumask_ptr().

May I have your Ack?

Mike Travis | 26 Aug 22:35

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

KOSAKI Motohiro wrote:
>>> +	int node = numa_node_id();
>>> +	struct zone *zones = NODE_DATA(node)->node_zones;
>>> +	int num_cpus_on_node;
>>> +	node_to_cpumask_ptr(cpumask_on_node, node);
>>>  
>>>  	node_free_pages =
>>>  #ifdef CONFIG_ZONE_DMA
>>> @@ -38,6 +41,10 @@ static unsigned long max_pages(unsigned 
>>>  		zone_page_state(&zones[ZONE_NORMAL], NR_FREE_PAGES);
>>>  
>>>  	max = node_free_pages / FRACTION_OF_NODE_MEM;
>>> +
>>> +	num_cpus_on_node = cpus_weight_nr(*cpumask_on_node);
>>> +	max /= num_cpus_on_node;
>>> +
>>>  	return max(max, min_pages);
>> Exactly!  And (many thanks to them!) the sparc maintainers have
>> implemented a similar internal function definition for node_to_cpumask_ptr().
> 
> Can I think get your Ack?
> 

Based on code review, sure.  I'll also give it a try on one of my
test machines as soon as I can.

Mike

Mike Travis | 25 Aug 20:44

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

Peter Zijlstra wrote:
> On Thu, 2008-08-21 at 00:27 -0700, Andrew Morton wrote:
>> On Thu, 21 Aug 2008 00:13:22 -0700 (PDT) David Miller <davem <at> davemloft.net> wrote:
>>
>>> From: Andrew Morton <akpm <at> linux-foundation.org>
>>> Date: Wed, 20 Aug 2008 23:46:15 -0700
>>>
>>>> On Wed, 20 Aug 2008 20:08:13 +0900 KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com> wrote:
>>>>
>>>>> +	num_cpus_per_node = cpus_weight_nr(node_to_cpumask(node));
>>>> sparc64 allmodconfig:
>>>>
>>>> mm/quicklist.c: In function `max_pages':
>>>> mm/quicklist.c:44: error: invalid lvalue in unary `&'
>>>>
>>>> we seem to have a made a spectacular mess of cpumasks lately.
>>> It should explode similarly on x86, since it also defines node_to_cpumask()
>>> as an inline function.
>>>
>>> IA64 seems to be one of the few platforms to define this as a macro
>>> evaluating to the node-to-cpumask array entry, so it's clear what
>>> platform Motohiro-san did build testing on :-)
>> Seems to compile OK on x86_32, x86_64, ia64 and powerpc for some reason.
>>
>> This seems to fix things on sparc64:
>>
>> --- a/mm/quicklist.c~mm-quicklist-shouldnt-be-proportional-to-number-of-cpus-fix
>> +++ a/mm/quicklist.c
>> @@ -28,7 +28,7 @@ static unsigned long max_pages(unsigned 
>>  	unsigned long node_free_pages, max;

Mike Travis | 25 Aug 20:40

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

David Miller wrote:
> From: Andrew Morton <akpm <at> linux-foundation.org>
> Date: Wed, 20 Aug 2008 23:46:15 -0700
> 
>> On Wed, 20 Aug 2008 20:08:13 +0900 KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com> wrote:
>>
>>> +	num_cpus_per_node = cpus_weight_nr(node_to_cpumask(node));

I think the more correct usage would be:

	{
		node_to_cpumask_ptr(v, node);
		num_cpus_per_node = cpus_weight_nr(*v);
		max /= num_cpus_per_node;

		return max(max, min_pages);
	}

which should load 'v' with a pointer to the node_to_cpumask_map[node] entry
[and avoid using stack space for the cpumask_t variable on those arches
that define a node_to_cpumask_map (or similar)].  Otherwise a local cpumask_t
variable '_v' is created, with 'v' pointing to it, and thus 'v' can be used
directly as an arg to the cpu_xxx ops.
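
The two behaviours described above can be sketched in user space as follows. The generic variant is modelled on the definition quoted elsewhere in this thread (a local copy '_v' plus a pointer to it); the map-backed variant is an assumption about what an arch with a node_to_cpumask_map might do. All demo_* names, the DEMO_ARCH_HAS_NODE_MAP switch and the 64-bit mask type are hypothetical.

```c
#include <assert.h>

/* Stand-in for cpumask_t: a single 64-bit word. */
typedef struct { unsigned long long bits; } demo_cpumask_t;

static demo_cpumask_t demo_node_to_cpumask_map[2] = {
    { 0x0fULL }, { 0xf0ULL }
};

/* By-value accessor, as on arches without a directly usable map. */
static demo_cpumask_t demo_node_to_cpumask(int node)
{
    return demo_node_to_cpumask_map[node];
}

#ifdef DEMO_ARCH_HAS_NODE_MAP
/* Assumed arch variant: 'v' points straight into the map, no local copy. */
#define demo_node_to_cpumask_ptr(v, node) \
        const demo_cpumask_t *v = &demo_node_to_cpumask_map[(node)]
#else
/* Generic variant: a local copy '_v' and a pointer 'v' to it. */
#define demo_node_to_cpumask_ptr(v, node) \
        demo_cpumask_t _##v = demo_node_to_cpumask(node); \
        const demo_cpumask_t *v = &_##v
#endif

/* Counts set bits in the mask. */
static int demo_weight(const demo_cpumask_t *mask)
{
    unsigned long long b = mask->bits;
    int n = 0;

    while (b) {
        b &= b - 1;     /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Divides the limit by the CPUs on the node, then applies the floor. */
static unsigned long demo_max_pages(unsigned long max,
                                    unsigned long min_pages, int node)
{
    demo_node_to_cpumask_ptr(v, node);  /* declares 'v' (and '_v' if generic) */
    int cpus = demo_weight(v);

    assert(cpus > 0);
    max /= (unsigned long)cpus;

    return max > min_pages ? max : min_pages;
}
```

The caller is written once against 'v' and compiles with either definition; only the map-backed variant is guaranteed to avoid the on-stack copy.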

Thanks,
Mike

>> sparc64 allmodconfig:
>>
>> mm/quicklist.c: In function `max_pages':

KOSAKI Motohiro | 26 Aug 01:31

Re: [RFC][PATCH 2/2] quicklist shouldn't be proportional to # of CPUs

Hi Mike, 

> >>> +	num_cpus_per_node = cpus_weight_nr(node_to_cpumask(node));
> 
> I think the more correct usage would be:
> 
> 	{
> 		node_to_cpumask_ptr(v, node);
> 		num_cpus_per_node = cpus_weight_nr(*v);
> 		max /= num_cpus_per_node;
> 
> 		return max(max, min_pages);
> 	}
> 
> which should load 'v' with a pointer to the node_to_cpumask_map[node] entry
> [and avoid using stack space for the cpumask_t variable for those arch's
> that define a node_to_cpumask_map (or similar).]  Otherwise a local cpumask_t
> variable '_v' is created to which 'v' is pointing to and thus can be used
> directly as an arg to the cpu_xxx ops.

Thank you for your attention.
Please see my latest patch (http://marc.info/?l=linux-mm&m=121966459713193&w=2).
It does that.

Christoph Lameter | 20 Aug 16:10

Re: [RFC][PATCH 0/2] Quicklist is slighly problematic.

KOSAKI Motohiro wrote:
> Hi Cristoph,
> 
> Thank you for explain your quicklist plan at OLS.
> 
> So, I made summary to issue of quicklist.
> if you have a bit time, Could you please read this mail and patches?
> And, if possible, Could you please tell me your feeling?

I believe what I said at the OLS was that quicklists are fundamentally crappy
and should be replaced by something that works (Guess that is what you meant
by "plan"?). Quicklists were generalized from the IA64 arch code.

Good fixup but I would think that some more radical rework is needed.

Maybe some of this needs to vanish into the TLB handling logic?

Then I have thought for a while that the main reason that quicklists exist is
the performance problems in the page allocator. If you can make the single
page alloc / free pass competitive in performance with quicklists then we
could get rid of all uses.

KOSAKI Motohiro | 20 Aug 16:49

Re: [RFC][PATCH 0/2] Quicklist is slighly problematic.

Hi

Thank you for the very quick response.

>> Thank you for explain your quicklist plan at OLS.
>>
>> So, I made summary to issue of quicklist.
>> if you have a bit time, Could you please read this mail and patches?
>> And, if possible, Could you please tell me your feeling?
>
> I believe what I said at the OLS was that quicklists are fundamentally crappy
> and should be replaced by something that works (Guess that is what you meant
> by "plan"?). Quicklists were generalized from the IA64 arch code.

Unfortunately, multiple ia64 customers of my company are suffering from
quicklist right now,
because quicklist works well for HPC-like applications, but business
server applications behave very differently.
IOW, quicklist works well in the best case, but it doesn't consider the worst case.

So, if possible, I'd like to make a short-term solution.
I believe nobody opposes reducing quicklist. It is definitely too fat.

> Good fixup but I would think that some more radical rework is needed.
> Maybe some of this needs to vanish into the TLB handling logic?

What do you think is wrong with the TLB handling?
Is it purely a performance issue?

> Then I have thought for awhile that the main reason that quicklists exist are

Christoph Lameter | 20 Aug 17:26

Re: [RFC][PATCH 0/2] Quicklist is slighly problematic.

KOSAKI Motohiro wrote:

> So, if possible, I'd like to make short term solution.
> I believe nobody oppose quicklist reducing. it is defenitly too fat.

Correct.

>> Good fixup but I would think that some more radical rework is needed.
>> Maybe some of this needs to vanish into the TLB handling logic?
> 
> What do you think wrong TLB handing?
> pure performance issue?

The generic TLB code could be made to do the allocation, the batching
and freeing of the pages. That would remove the need for quicklists for some uses.

>
> Do you have any page allocator enhancement plan?
> Can I help it?

A simple approach would be to use the queueing method used in quicklists in
the page allocator hotpath. But the devil is in the details .... There are
numerous checks for the type of page that are done by the page allocator and
not for the quicklists. Somehow we need to work around these.

Robin Holt | 21 Aug 04:13

Re: [RFC][PATCH 0/2] Quicklist is slighly problematic.

On Wed, Aug 20, 2008 at 09:10:47AM -0500, Christoph Lameter wrote:
> KOSAKI Motohiro wrote:
> > Hi Cristoph,
> > 
> > Thank you for explain your quicklist plan at OLS.
> > 
> > So, I made summary to issue of quicklist.
> > if you have a bit time, Could you please read this mail and patches?
> > And, if possible, Could you please tell me your feeling?
> 
> I believe what I said at the OLS was that quicklists are fundamentally crappy
> and should be replaced by something that works (Guess that is what you meant
> by "plan"?). Quicklists were generalized from the IA64 arch code.
> 
> Good fixup but I would think that some more radical rework is needed.
> 
> Maybe some of this needs to vanish into the TLB handling logic?
> 
> Then I have thought for awhile that the main reason that quicklists exist are
> the performance problems in the page allocator. If you can make the single
> page alloc / free pass competitive in performance with quicklists then we
> could get rid of all uses.

It is more than the free/alloc cycle, the quicklist saves us from
having to zero the page.  In a sparsely filled page table, it saves time
and cache footprint.  In a heavily used page table, you end up with a
near wash.

One problem I see is somebody got rid of the node awareness.  We used
to not put pages onto a quicklist when they were being released from a

Robin Holt | 21 Aug 04:16

Re: [RFC][PATCH 0/2] Quicklist is slighly problematic.

On Wed, Aug 20, 2008 at 09:13:32PM -0500, Robin Holt wrote:
> On Wed, Aug 20, 2008 at 09:10:47AM -0500, Christoph Lameter wrote:
> > KOSAKI Motohiro wrote:
> > > Hi Cristoph,
> > > 
> > > Thank you for explain your quicklist plan at OLS.
> > > 
> > > So, I made summary to issue of quicklist.
> > > if you have a bit time, Could you please read this mail and patches?
> > > And, if possible, Could you please tell me your feeling?
> > 
> > I believe what I said at the OLS was that quicklists are fundamentally crappy
> > and should be replaced by something that works (Guess that is what you meant
> > by "plan"?). Quicklists were generalized from the IA64 arch code.
> > 
> > Good fixup but I would think that some more radical rework is needed.
> > 
> > Maybe some of this needs to vanish into the TLB handling logic?
> > 
> > Then I have thought for awhile that the main reason that quicklists exist are
> > the performance problems in the page allocator. If you can make the single
> > page alloc / free pass competitive in performance with quicklists then we
> > could get rid of all uses.
> 
> It is more than the free/alloc cycle, the quicklist saves us from
> having to zero the page.  In a sparsely filled page table, it saves time
> and cache footprint.  In a heavily used page table, you end up with a
> near wash.
> 
> One problem I see is somebody got rid of the node awareness.  We used
[...]

David Miller | 21 Aug 05:08

Re: [RFC][PATCH 0/2] Quicklist is slighly problematic.

From: Robin Holt <holt <at> sgi.com>
Date: Wed, 20 Aug 2008 21:13:32 -0500

> One problem I see is somebody got rid of the node awareness.  We used
> to not put pages onto a quicklist when they were being released from a
> different node than the cpu is on.  Not sure where that went.  It was
> done because of the trap page problem described here.

NUMA awareness is one of the reasons I keep thinking about dropping
quicklist usage on sparc64.

Using SLAB/SLUB for the page table bits with appropriate constructor
and destructor bits ought to be able to approximate the gains
from avoiding the initialization for cached objects.
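
The constructor idea can be sketched in user-space C. This is only an
illustration under stated assumptions, not kernel code: obj_cache,
cache_alloc, cache_free and zero_ctor are hypothetical names, malloc()
stands in for the page allocator, and there is no locking or per-CPU
handling. What it shows is that the constructor runs only when memory
first enters the cache; an object freed back in its constructed
(all-zero) state is handed out again without being re-initialized.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define OBJ_SIZE 4096  /* assumption: one page-sized object */

/* Hypothetical cache; the free list is threaded through the objects. */
struct obj_cache {
    void *free_head;            /* most recently freed object, or NULL */
    void (*ctor)(void *);       /* runs once, when memory enters the cache */
    unsigned long ctor_calls;   /* counts how often the constructor ran */
};

/* Constructor for a page-table-like object: all zeroes. */
void zero_ctor(void *obj)
{
    memset(obj, 0, OBJ_SIZE);
}

void *cache_alloc(struct obj_cache *c)
{
    void *obj = c->free_head;

    assert(c->ctor != NULL);
    if (obj) {
        /* Fast path: pop a cached object and restore the one word
         * that was borrowed for the list link. No full memset. */
        c->free_head = *(void **)obj;
        *(void **)obj = NULL;
        return obj;
    }

    /* Slow path: fresh memory must be constructed once. */
    obj = malloc(OBJ_SIZE);
    if (obj) {
        c->ctor(obj);
        c->ctor_calls++;
    }
    return obj;
}

/* The caller must hand the object back in its constructed state. */
void cache_free(struct obj_cache *c, void *obj)
{
    *(void **)obj = c->free_head;
    c->free_head = obj;
}
```

An alloc, free, alloc sequence on this cache returns the same object
twice while the constructor runs only once, which is the saving being
discussed.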

Christoph Lameter | 21 Aug 15:10

Re: [RFC][PATCH 0/2] Quicklist is slighly problematic.

David Miller wrote:

> Using SLAB/SLUB for the page table bits with appropriate constructor
> and destructor bits ought to be able to approximate the gains
> from avoiding the initialization for cached objects.

It's a bit strange to use the small object allocator for page-sized
allocations. Plus there is this tie-in with the TLB flushing logic. So I think
this would be cleaner if it were moved into asm-generic/tlb.h or so.

Andrew Morton | 20 Aug 20:31

Re: [RFC][PATCH 0/2] Quicklist is slighly problematic.

On Wed, 20 Aug 2008 20:05:51 +0900
KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com> wrote:

> Hi Cristoph,
> 
> Thank you for explain your quicklist plan at OLS.
> 
> So, I made summary to issue of quicklist.
> if you have a bit time, Could you please read this mail and patches?
> And, if possible, Could you please tell me your feeling?
> 
> 
> --------------------------------------------------------------------
> 
> Now, Quicklist store some page in each CPU as cache.
> (Each CPU has node_free_pages/16 pages)
> 
> and it is used for page table cache.
> Then, exit() increase cache, the other hand fork() spent it.
> 
> So, if apache type (one parent and many child model) middleware run,
> One CPU process fork(), Other CPU process the middleware work and exit().
> 
> At that time, One CPU don't have page table cache at all,
> Others have maximum caches.
> 
> 	QList_max = (#ofCPUs - 1) x Free / 16
> 	=> QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1)
> 
> So, How much quicklist spent memory at maximum case?
[...]

Robin Holt | 21 Aug 04:42

Re: [RFC][PATCH 0/2] Quicklist is slighly problematic.

> OK, that's a fatal bug and it's present in 2.6.25.x and 2.6.26.x.  A
> serious issue.
> 
> The patches do apply to both stable kernels and I have tagged them for
> backporting into them.  They're nice and small, but I didn't get a
> really solid yes-this-is-what-we-should-do from Christoph?
> 
> 
> This (from [patch 2/2]): "(Although its patch applied, quicklist can
> waste 64GB on 1TB server (= 1TB / 16), it is still too much??)" is a
> bit of a worry.  Yes, 64GB is too much!  But at least this is now only
> a performance issue rather than a stability issue, yes?

That 64GB is not quite correct.  That assumes all 1TB is free.  The
quicklists are trimmed down as the nodes undergo allocations.  The
problem I see right now is that page tables allocated on one node and
freed on a cpu on a different node could be placed early enough on the
quicklist that it will not be freed until the other node gets under
memory pressure.
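
As a rough illustration of the arithmetic (a sketch, not code from the
patches): the divisor 16 comes from the node_free_pages/16 limit
described in the original mail, and the function names are
hypothetical. The first function follows the worst-case formula
QList_max = (#ofCPUs - 1) x Free / 16; the second is the Free / 16
bound quoted for the patched kernel. Both depend on how much memory is
currently free, not on how much is installed.

```c
#include <assert.h>
#include <stdint.h>

#define QUICKLIST_DIVISOR 16  /* from node_free_pages/16 in the thread */

/* Worst case from the original mail: every CPU but one holds a full
 * cache of free/16. */
uint64_t quicklist_max_unscaled(uint64_t free_bytes, unsigned int ncpus)
{
    assert(ncpus >= 1);
    return (uint64_t)(ncpus - 1) * (free_bytes / QUICKLIST_DIVISOR);
}

/* Bound quoted for the patched kernel: free/16 in total. */
uint64_t quicklist_max_scaled(uint64_t free_bytes)
{
    return free_bytes / QUICKLIST_DIVISOR;
}
```

With 1 TiB entirely free the second function gives 64 GiB, the figure
quoted above; with only 256 GiB free it gives 16 GiB, so the bound
shrinks as the nodes allocate memory.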

Could you give the following a try?  It hasn't even been compiled.  I
think this, in addition to your cpus-per-node change, is the right thing
to do.

Thanks,
Robin

Index: ia64-cleanups/include/linux/quicklist.h
===================================================================
--- ia64-cleanups.orig/include/linux/quicklist.h	2008-08-20 21:35:10.000000000 -0500
[...]

Christoph Lameter | 21 Aug 15:07

Re: [RFC][PATCH 0/2] Quicklist is slighly problematic.

Robin Holt wrote:
>
> Index: ia64-cleanups/include/linux/quicklist.h
> ===================================================================
> --- ia64-cleanups.orig/include/linux/quicklist.h	2008-08-20 21:35:10.000000000 -0500
> +++ ia64-cleanups/include/linux/quicklist.h	2008-08-20 21:38:00.891943270 -0500
> @@ -66,6 +66,15 @@ static inline void __quicklist_free(int 
>  
>  static inline void quicklist_free(int nr, void (*dtor)(void *), void *pp)
>  {
> +#ifdef CONFIG_NUMA
> +	unsigned long nid = page_to_nid(virt_to_page(pp));
> +
> +	if (unlikely(nid != numa_node_id())) {
> +		free_page((unsigned long)pp);
> +		return;
> +	}
> +#endif
> +
>  	__quicklist_free(nr, dtor, pp, virt_to_page(pp));
>  }
>  

We removed this code because it frees a page before the TLB flush has been
performed. This code segment was the reason that quicklists were not accepted
for x86.

Robin Holt | 21 Aug 15:14

Re: [RFC][PATCH 0/2] Quicklist is slighly problematic.

On Thu, Aug 21, 2008 at 08:07:43AM -0500, Christoph Lameter wrote:
> Robin Holt wrote:
> >
> > Index: ia64-cleanups/include/linux/quicklist.h
> > ===================================================================
> > --- ia64-cleanups.orig/include/linux/quicklist.h	2008-08-20 21:35:10.000000000 -0500
> > +++ ia64-cleanups/include/linux/quicklist.h	2008-08-20 21:38:00.891943270 -0500
> > @@ -66,6 +66,15 @@ static inline void __quicklist_free(int 
> >  
> >  static inline void quicklist_free(int nr, void (*dtor)(void *), void *pp)
> >  {
> > +#ifdef CONFIG_NUMA
> > +	unsigned long nid = page_to_nid(virt_to_page(pp));
> > +
> > +	if (unlikely(nid != numa_node_id())) {
> > +		free_page((unsigned long)pp);
> > +		return;
> > +	}
> > +#endif
> > +
> >  	__quicklist_free(nr, dtor, pp, virt_to_page(pp));
> >  }
> >  
> 
> We removed this code because it frees a page before the TLB flush has been
> performed. This code segment was the reason that quicklists were not accepted
> for x86.

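How could we do this?  It was a _HUGE_ problem on altix boxes.  When you
started a job with a large number of MPI ranks, they would all start
from the shepherd process on a single node and the children would
migrate to a different cpu.  Unless subsequent jobs used enough memory
to flush those remote quicklists, we would end up with a depleted node
that never reclaimed.
[...]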

Christoph Lameter | 21 Aug 15:18

Re: [RFC][PATCH 0/2] Quicklist is slighly problematic.

Robin Holt wrote:

>> We removed this code because it frees a page before the TLB flush has been
>> performed. This code segment was the reason that quicklists were not accepted
>> for x86.
> 
> How could we do this?  It was a _HUGE_ problem on altix boxes.  When you
> started a job with a large number of MPI ranks, they would all start
> from the shepherd process on a single node and the children would
> migrate to a different cpu.  Unless subsequent jobs used enough memory
> to flush those remote quicklists, we would end up with a depleted node
> that never reclaimed.

Well I tried to get the quicklist stuff resolved at SGI multiple times last
year when the early free before flush was discovered but there did not seem to
be much interest at that point, so we dropped it.

In order to make this work correctly we would need to create a list of remote
pages. These remote pages would then be freed after the TLB flush.
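
One way to picture the "list of remote pages" is sketched below in
user-space C. It is a minimal illustration, not the kernel
implementation: remote_list, remote_list_defer and remote_list_drain
are hypothetical names, the release callback stands in for
free_page(), and the caller is assumed to invoke the drain only after
its TLB flush has completed. As a further assumption of this sketch,
the list link is stored in the first word of the page itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Link stored in the first word of each deferred page. */
struct deferred_page {
    struct deferred_page *next;
};

/* Hypothetical per-CPU list of pages that belong to another node. */
struct remote_list {
    struct deferred_page *head;
    size_t count;
};

/* Instead of freeing a remote page immediately, queue it. */
void remote_list_defer(struct remote_list *rl, void *page)
{
    struct deferred_page *p = page;

    assert(rl != NULL && page != NULL);
    p->next = rl->head;
    rl->head = p;
    rl->count++;
}

/* Call only after the TLB flush: release every queued page.
 * Returns the number of pages released. */
size_t remote_list_drain(struct remote_list *rl, void (*release)(void *))
{
    size_t freed = 0;

    assert(rl != NULL && release != NULL);
    while (rl->head) {
        struct deferred_page *p = rl->head;

        rl->head = p->next;   /* unlink before the page is released */
        release(p);
        freed++;
    }
    rl->count = 0;
    return freed;
}
```

The ordering is the whole point: nothing reaches the release callback
until the drain is called, so a page cannot be reused while a stale
TLB entry might still refer to it.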

Robin Holt | 21 Aug 15:45

Re: [RFC][PATCH 0/2] Quicklist is slighly problematic.

On Thu, Aug 21, 2008 at 08:18:24AM -0500, Christoph Lameter wrote:
> Robin Holt wrote:
> 
> >> We removed this code because it frees a page before the TLB flush has been
> >> performed. This code segment was the reason that quicklists were not accepted
> >> for x86.
> > 
> > How could we do this?  It was a _HUGE_ problem on altix boxes.  When you
> > started a job with a large number of MPI ranks, they would all start
> > from the shepherd process on a single node and the children would
> > migrate to a different cpu.  Unless subsequent jobs used enough memory
> > to flush those remote quicklists, we would end up with a depleted node
> > that never reclaimed.
> 
> Well I tried to get the quicklist stuff resolved at SGI multiple times last
> year when the early free before flush was discovered but there did not seem to
> be much interest at that point, so we dropped it.

Well, now that you dope slap me, I vaguely remember this.  I also seem
to recall being very busy with other stuff and convincing myself that a
proper resolution would magically appear.  Argh.

Sorry,
Robin
