Glauber Costa | 30 May 11:48 2012

[PATCH v3 0/6] per cgroup /proc/stat statistics

Hi

This is a rework of my v2 series, taking into account Paul
Turner's comments. The series is the same except for the patch
that collects nr_switches, which is now completely reworked.

It is out of schedule(), relegated to a scheduler hook. The only
difference from Paul's suggestion is that I added a new hook, instead of
messing with the put_prev_task + pick_next_task pairs. In the patch that
adds it, I'll try to argue for my solution.

Hope this is acceptable, and please let me know of any other concerns.

v2:
* completely reworked nr_switches gathering
* separated per-se sleep_start to be more clear about it

Glauber Costa (6):
  measure exec_clock for rt sched entities
  account guest time per-cgroup as well.
  expose fine-grained per-cpu data for cpuacct stats
  add a new scheduler hook for context switch
  Also record sleep start for a task group
  expose per-taskgroup schedstats in cgroup

 include/linux/sched.h |    1 +
 kernel/sched/core.c   |  166 +++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/fair.c   |   42 ++++++++++++-
 kernel/sched/rt.c     |   20 ++++++
 kernel/sched/sched.h  |    6 ++

Glauber Costa | 30 May 11:48 2012

[PATCH v3 3/6] expose fine-grained per-cpu data for cpuacct stats

The cpuacct cgroup already exposes user and system numbers in a per-cgroup
fashion. But they are a summation over the whole group, not per-cpu figures.
Also, they are a coarse-grained version of the stats usually shown in places
like /proc/stat.

I want to have enough cgroup data to emulate the /proc/stat interface. To
achieve that, I am creating a new file "stat_percpu" that displays the
fine-grained per-cpu data. The original data is left alone.

The format of this file resembles the one found in the usual cgroup stat
files. But of course, the fields are repeated, one per cpu, and prefixed
with the cpu number.

Therefore, we'll have something like:

  cpu0.user X
  cpu0.system Y
  ...
  cpu1.user X1
  cpu1.system Y1
  ...

Signed-off-by: Glauber Costa <glommer@...>
CC: Peter Zijlstra <a.p.zijlstra@...>
CC: Paul Turner <pjt@...>
---
 kernel/sched/core.c |   33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
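The patch body is truncated in this archive. Judging from the calls quoted
in the replies below, the fill helper is presumably something like the
following sketch (the "cpuN.field" key formatting and the
cputime64_to_clock_t() conversion are my assumptions, not the verbatim
patch):

static inline void do_fill_cb(struct cgroup_map_cb *cb, struct cpuacct *ca,
			      char *str, int cpu, int index)
{
	char name[24];
	struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu);

	/* Build one "cpuN.field" key per cpu and field, then emit it. */
	snprintf(name, sizeof(name), "cpu%d.%s", cpu, str);
	cb->fill(cb, name, cputime64_to_clock_t(kcpustat->cpustat[index]));
}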

Peter Zijlstra | 30 May 12:34 2012

Re: [PATCH v3 3/6] expose fine-grained per-cpu data for cpuacct stats

On Wed, 2012-05-30 at 13:48 +0400, Glauber Costa wrote:

> +static int cpuacct_stats_percpu_show(struct cgroup *cgrp, struct cftype *cft,
> +				     struct cgroup_map_cb *cb)
> +{
> +	struct cpuacct *ca = cgroup_ca(cgrp);
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		do_fill_cb(cb, ca, "user", cpu, CPUTIME_USER);
> +		do_fill_cb(cb, ca, "nice", cpu, CPUTIME_NICE);
> +		do_fill_cb(cb, ca, "system", cpu, CPUTIME_SYSTEM);
> +		do_fill_cb(cb, ca, "irq", cpu, CPUTIME_IRQ);
> +		do_fill_cb(cb, ca, "softirq", cpu, CPUTIME_SOFTIRQ);
> +		do_fill_cb(cb, ca, "guest", cpu, CPUTIME_GUEST);
> +		do_fill_cb(cb, ca, "guest_nice", cpu, CPUTIME_GUEST_NICE);
> +	}
> +
> +	return 0;
> +}

Uhm, hotplug anyone?
Glauber Costa | 30 May 12:34 2012

Re: [PATCH v3 3/6] expose fine-grained per-cpu data for cpuacct stats

On 05/30/2012 02:34 PM, Peter Zijlstra wrote:
> On Wed, 2012-05-30 at 13:48 +0400, Glauber Costa wrote:
>
>> +static int cpuacct_stats_percpu_show(struct cgroup *cgrp, struct cftype *cft,
>> +				     struct cgroup_map_cb *cb)
>> +{
>> +	struct cpuacct *ca = cgroup_ca(cgrp);
>> +	int cpu;
>> +
>> +	for_each_online_cpu(cpu) {
>> +		do_fill_cb(cb, ca, "user", cpu, CPUTIME_USER);
>> +		do_fill_cb(cb, ca, "nice", cpu, CPUTIME_NICE);
>> +		do_fill_cb(cb, ca, "system", cpu, CPUTIME_SYSTEM);
>> +		do_fill_cb(cb, ca, "irq", cpu, CPUTIME_IRQ);
>> +		do_fill_cb(cb, ca, "softirq", cpu, CPUTIME_SOFTIRQ);
>> +		do_fill_cb(cb, ca, "guest", cpu, CPUTIME_GUEST);
>> +		do_fill_cb(cb, ca, "guest_nice", cpu, CPUTIME_GUEST_NICE);
>> +	}
>> +
>> +	return 0;
>> +}
>
> Uhm, hotplug anyone?
What's with hotplug ?

If you mean we should accumulate that on hotplug, I don't see why. We 
certainly don't do that for the tick-based counters. Or do you mean I am 
missing a get_online_cpus() lock ?

Hmm, I don't see it being taken in other loops like that

Peter Zijlstra | 30 May 12:43 2012

Re: [PATCH v3 3/6] expose fine-grained per-cpu data for cpuacct stats

On Wed, 2012-05-30 at 14:34 +0400, Glauber Costa wrote:
> On 05/30/2012 02:34 PM, Peter Zijlstra wrote:
> > On Wed, 2012-05-30 at 13:48 +0400, Glauber Costa wrote:
> >
> >> +static int cpuacct_stats_percpu_show(struct cgroup *cgrp, struct cftype *cft,
> >> +				     struct cgroup_map_cb *cb)
> >> +{
> >> +	struct cpuacct *ca = cgroup_ca(cgrp);
> >> +	int cpu;
> >> +
> >> +	for_each_online_cpu(cpu) {
> >> +		do_fill_cb(cb, ca, "user", cpu, CPUTIME_USER);
> >> +		do_fill_cb(cb, ca, "nice", cpu, CPUTIME_NICE);
> >> +		do_fill_cb(cb, ca, "system", cpu, CPUTIME_SYSTEM);
> >> +		do_fill_cb(cb, ca, "irq", cpu, CPUTIME_IRQ);
> >> +		do_fill_cb(cb, ca, "softirq", cpu, CPUTIME_SOFTIRQ);
> >> +		do_fill_cb(cb, ca, "guest", cpu, CPUTIME_GUEST);
> >> +		do_fill_cb(cb, ca, "guest_nice", cpu, CPUTIME_GUEST_NICE);
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >
> > Uhm, hotplug anyone?
> What's with hotplug ?

Who's to say all cpus are online at this point? If you don't want to
make the files come and go with hotplug, one should use
for_each_possible_cpu().
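A minimal sketch of the change being suggested (same body as the quoted
loop, just a different iterator):

	for_each_possible_cpu(cpu) {
		do_fill_cb(cb, ca, "user", cpu, CPUTIME_USER);
		/* ... remaining fields exactly as in the quoted hunk ... */
	}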


Glauber Costa | 30 May 12:44 2012

Re: [PATCH v3 3/6] expose fine-grained per-cpu data for cpuacct stats

On 05/30/2012 02:43 PM, Peter Zijlstra wrote:
> On Wed, 2012-05-30 at 14:34 +0400, Glauber Costa wrote:
>> On 05/30/2012 02:34 PM, Peter Zijlstra wrote:
>>> On Wed, 2012-05-30 at 13:48 +0400, Glauber Costa wrote:
>>>
>>>> +static int cpuacct_stats_percpu_show(struct cgroup *cgrp, struct cftype *cft,
>>>> +				     struct cgroup_map_cb *cb)
>>>> +{
>>>> +	struct cpuacct *ca = cgroup_ca(cgrp);
>>>> +	int cpu;
>>>> +
>>>> +	for_each_online_cpu(cpu) {
>>>> +		do_fill_cb(cb, ca, "user", cpu, CPUTIME_USER);
>>>> +		do_fill_cb(cb, ca, "nice", cpu, CPUTIME_NICE);
>>>> +		do_fill_cb(cb, ca, "system", cpu, CPUTIME_SYSTEM);
>>>> +		do_fill_cb(cb, ca, "irq", cpu, CPUTIME_IRQ);
>>>> +		do_fill_cb(cb, ca, "softirq", cpu, CPUTIME_SOFTIRQ);
>>>> +		do_fill_cb(cb, ca, "guest", cpu, CPUTIME_GUEST);
>>>> +		do_fill_cb(cb, ca, "guest_nice", cpu, CPUTIME_GUEST_NICE);
>>>> +	}
>>>> +
>>>> +	return 0;
>>>> +}
>>>
>>> Uhm, hotplug anyone?
>> What's with hotplug ?
>
>
> Who's to say all cpus are online at this point? If you don't want to
> make the files come and go with hotplug, one should use

Peter Zijlstra | 30 May 13:24 2012

Re: [PATCH v3 3/6] expose fine-grained per-cpu data for cpuacct stats

On Wed, 2012-05-30 at 14:44 +0400, Glauber Costa wrote:
> But I don't really parse what you mean by "make the files go away". We 
> have just one file, with multiple entries. The entries are in key:value 
> form, so I don't see a huge problem in having some entries disappear.
> 
> As a matter of fact, this is exactly what happens in /proc/stat when you 
> offline a cpu: the corresponding line will stop showing up.

Oh, I thought you had a file per cpu.

If you have content per cpu, this all might become a problem with 4096
cpus; that's bound to overflow the page of output space?
Paul Turner | 30 May 13:24 2012

Re: [PATCH v3 3/6] expose fine-grained per-cpu data for cpuacct stats

On Wed, May 30, 2012 at 2:48 AM, Glauber Costa <glommer@parallels.com> wrote:
> The cpuacct cgroup already exposes user and system numbers in a per-cgroup
> fashion. But they are a summation over the whole group, not per-cpu figures.
> Also, they are a coarse-grained version of the stats usually shown in places
> like /proc/stat.
>
> I want to have enough cgroup data to emulate the /proc/stat interface. To
> achieve that, I am creating a new file "stat_percpu" that displays the
> fine-grained per-cpu data. The original data is left alone.
>
> The format of this file resembles the one found in the usual cgroup stat
> files. But of course, the fields are repeated, one per cpu, and prefixed
> with the cpu number.
>
> Therefore, we'll have something like:
>
>  cpu0.user X
>  cpu0.system Y
>  ...
>  cpu1.user X1
>  cpu1.system Y1
>  ...
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> CC: Paul Turner <pjt@google.com>
> ---
>  kernel/sched/core.c |   33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
>

Glauber Costa | 30 May 14:20 2012

Re: [PATCH v3 3/6] expose fine-grained per-cpu data for cpuacct stats

On 05/30/2012 03:24 PM, Paul Turner wrote:
>> +static int cpuacct_stats_percpu_show(struct cgroup *cgrp, struct cftype *cft,
>> >  +                                    struct cgroup_map_cb *cb)
>> >  +{
>> >  +       struct cpuacct *ca = cgroup_ca(cgrp);
>> >  +       int cpu;
>> >  +
>> >  +       for_each_online_cpu(cpu) {
>> >  +               do_fill_cb(cb, ca, "user", cpu, CPUTIME_USER);
>> >  +               do_fill_cb(cb, ca, "nice", cpu, CPUTIME_NICE);
>> >  +               do_fill_cb(cb, ca, "system", cpu, CPUTIME_SYSTEM);
>> >  +               do_fill_cb(cb, ca, "irq", cpu, CPUTIME_IRQ);
>> >  +               do_fill_cb(cb, ca, "softirq", cpu, CPUTIME_SOFTIRQ);
>> >  +               do_fill_cb(cb, ca, "guest", cpu, CPUTIME_GUEST);
>> >  +               do_fill_cb(cb, ca, "guest_nice", cpu, CPUTIME_GUEST_NICE);
>> >  +       }
>> >  +
> I don't know if there's much that can be trivially done about it but I
> suspect these are a bit of a memory allocation time-bomb on a many-CPU
> machine.  The cgroup:seq_file mating (via read_map) treats everything
> as/one/  record.  This means that seq_printf is going to end up
> eventually allocating a buffer that can fit_everything_  (as well as
> every power-of-2 on the way there).  Adding insult to injury is that
> that the backing buffer is kmalloc() not vmalloc().
>
> 200+ bytes per-cpu above really is not unreasonable (46 bytes just for
> the text, plus a byte per base 10 digit we end up reporting), but that
> then  leaves us looking at order-12/13 allocations just to print this
> thing when there are O(many) cpus.
>
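For reference, the behaviour Paul is describing comes from read_map()
funnelling the whole map through one seq_file record. A paraphrase of the
relevant buffer-growth logic in fs/seq_file.c of that era (simplified, not
the literal code):

	/*
	 * When ->show() overflows the buffer, seq_file throws the buffer
	 * away, doubles the allocation with kmalloc() (not vmalloc()),
	 * and regenerates the whole record from scratch.
	 */
	if (m->count == m->size) {		/* record overflowed */
		kfree(m->buf);
		m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
		m->count = 0;			/* ...and ->show() runs again */
	}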

Paul Turner | 30 May 14:48 2012

Re: [PATCH v3 3/6] expose fine-grained per-cpu data for cpuacct stats

On Wed, May 30, 2012 at 5:20 AM, Glauber Costa <glommer@...> wrote:
> On 05/30/2012 03:24 PM, Paul Turner wrote:
>>>
>>> +static int cpuacct_stats_percpu_show(struct cgroup *cgrp, struct cftype
>>> *cft,
>>> >  +                                    struct cgroup_map_cb *cb)
>>> >  +{
>>> >  +       struct cpuacct *ca = cgroup_ca(cgrp);
>>> >  +       int cpu;
>>> >  +
>>> >  +       for_each_online_cpu(cpu) {
>>> >  +               do_fill_cb(cb, ca, "user", cpu, CPUTIME_USER);
>>> >  +               do_fill_cb(cb, ca, "nice", cpu, CPUTIME_NICE);
>>> >  +               do_fill_cb(cb, ca, "system", cpu, CPUTIME_SYSTEM);
>>> >  +               do_fill_cb(cb, ca, "irq", cpu, CPUTIME_IRQ);
>>> >  +               do_fill_cb(cb, ca, "softirq", cpu, CPUTIME_SOFTIRQ);
>>> >  +               do_fill_cb(cb, ca, "guest", cpu, CPUTIME_GUEST);
>>> >  +               do_fill_cb(cb, ca, "guest_nice", cpu,
>>> > CPUTIME_GUEST_NICE);
>>> >  +       }
>>> >  +
>>
>> I don't know if there's much that can be trivially done about it but I
>> suspect these are a bit of a memory allocation time-bomb on a many-CPU
>> machine.  The cgroup:seq_file mating (via read_map) treats everything
>> as /one/ record.  This means that seq_printf is going to end up
>> eventually allocating a buffer that can fit _everything_ (as well as
>> every power-of-2 on the way there).  Adding insult to injury is
>> that the backing buffer is kmalloc() not vmalloc().

Glauber Costa | 30 May 14:52 2012

Re: [PATCH v3 3/6] expose fine-grained per-cpu data for cpuacct stats

On 05/30/2012 04:48 PM, Paul Turner wrote:
> a) the information in /proc/stat is actually much denser since it's
> "cpu VAL VAL VAL VAL" as opposed to "cpuX.FIELD VAL"

Easily fixable here. Less descriptive, but we can use a header line
with the descriptions, much like /proc/slabinfo does, and we still
have an extensible interface that is dense at the same time.
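Hypothetically, something along these lines (an illustrative layout, not
taken from any actual patch):

	cpu user nice system irq softirq guest guest_nice
	cpu0 X X X X X X X
	cpu1 X1 X1 X1 X1 X1 X1 X1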

> b) If it became a problem the /proc/stat case is actually fairly
> trivially fixable by defining each cpu as a record and "everything
> else" as a magic im-out-of-cpus value.
>
>> >
>> >  Now, if you guys are okay with a file per-cpu, I can do it as well.
>> >  It pollutes the filesystem, but at least protects against the fact that this
>> >  is kmalloc-backed.
>> >
> As I prefaced, I'm not sure there's much that can be trivially done
> about it.  This is really a fundamental limitation of how read_map()
> works.
>
> What we really need is a proper seq_file exposed through cftypes.
That can be done.

Glauber Costa | 30 May 15:26 2012

Re: [PATCH v3 3/6] expose fine-grained per-cpu data for cpuacct stats

On 05/30/2012 04:48 PM, Paul Turner wrote:
>> >  Now, if you guys are okay with a file per-cpu, I can do it as well.
>> >  It pollutes the filesystem, but at least protects against the fact that this
>> >  is kmalloc-backed.
>> >
> As I prefaced, I'm not sure there's much that can be trivially done
> about it.  This is really a fundamental limitation of how read_map()
> works.
>
> What we really need is a proper seq_file exposed through cftypes.

Tejun, would you be okay with an interface that somehow exports the raw 
seq_file in cgroups?

This way we could call s_show for each cpu and get around the memory 
usage problem, I presume.
Glauber Costa | 30 May 11:48 2012

[PATCH v3 2/6] account guest time per-cgroup as well.

We already track multiple tick statistics per-cgroup, using
the task_group_account_field facility. This patch accounts
guest_time in that manner as well.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Paul Turner <pjt@google.com>
---
 kernel/sched/core.c |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 39eb601..220d416 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2717,8 +2717,6 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 static void account_guest_time(struct task_struct *p, cputime_t cputime,
 			       cputime_t cputime_scaled)
 {
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
-
 	/* Add guest time to process. */
 	p->utime += cputime;
 	p->utimescaled += cputime_scaled;
@@ -2727,11 +2725,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,

 	/* Add guest time to cpustat. */
 	if (TASK_NICE(p) > 0) {
-		cpustat[CPUTIME_NICE] += (__force u64) cputime;
-		cpustat[CPUTIME_GUEST_NICE] += (__force u64) cputime;
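The rest of the hunk is truncated in this archive. Given the changelog and
the task_group_account_field() helper this file already uses for the tick
statistics, the replacement lines presumably look something like this
(a sketch, not the verbatim patch):

+		task_group_account_field(p, CPUTIME_NICE, (__force u64) cputime);
+		task_group_account_field(p, CPUTIME_GUEST_NICE, (__force u64) cputime);
 	} else {
-		cpustat[CPUTIME_USER] += (__force u64) cputime;
-		cpustat[CPUTIME_GUEST] += (__force u64) cputime;
+		task_group_account_field(p, CPUTIME_USER, (__force u64) cputime);
+		task_group_account_field(p, CPUTIME_GUEST, (__force u64) cputime);
 	}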

Peter Zijlstra | 30 May 12:32 2012

Re: [PATCH v3 2/6] account guest time per-cgroup as well.

On Wed, 2012-05-30 at 13:48 +0400, Glauber Costa wrote:
> We already track multiple tick statistics per-cgroup, using
> the task_group_account_field facility. This patch accounts
> guest_time in that manner as well.

So this leaves IOWAIT and IDLE the only fields not using
task_group_account_field(), right?
Glauber Costa | 30 May 12:36 2012

Re: [PATCH v3 2/6] account guest time per-cgroup as well.

On 05/30/2012 02:32 PM, Peter Zijlstra wrote:
> On Wed, 2012-05-30 at 13:48 +0400, Glauber Costa wrote:
>> >  We already track multiple tick statistics per-cgroup, using
>> >  the task_group_account_field facility. This patch accounts
>> >  guest_time in that manner as well.
> So this leaves IOWAIT and IDLE the only fields not using
> task_group_account_field(), right?

Yes, because they are essentially global, and their meaning is 
ill-defined from within a cgroup.

If you look further out in the patchset, I intend to export idle from 
cpu, instead of cpuacct, because something that can be used as an idle 
value is already computed from the schedstats anyway, so I'm just using it.

iowait will be left blank for now. Paul and I agreed last time we 
talked that it is not hugely important to have iowait values per-cgroup.
Paul Turner | 30 May 12:46 2012

Re: [PATCH v3 2/6] account guest time per-cgroup as well.

On Wed, May 30, 2012 at 3:36 AM, Glauber Costa <glommer@parallels.com> wrote:
> On 05/30/2012 02:32 PM, Peter Zijlstra wrote:
>>
>> On Wed, 2012-05-30 at 13:48 +0400, Glauber Costa wrote:
>>>
>>> >  We already track multiple tick statistics per-cgroup, using
>>> >  the task_group_account_field facility. This patch accounts
>>> >  guest_time in that manner as well.
>>
>> So this leaves IOWAIT and IDLE the only fields not using
>> task_group_account_field(), right?
>
>
> Yes, because they are essentially global, and their meaning is ill-defined
> from within a cgroup.
>
> If you look further out in the patchset, I intend to export idle from cpu,
> instead of cpuacct, because something that can be used as idle value is
> already computed anyway from the schedstats, so I'm just using it.
>
> iowait will be left blank for now. Paul and I agreed last time we talked
> that it is not hugely important to have iowait values per-cgroup.

Stronger:  it lacks a definition you can sanely measure without atomic
counters everywhere (similarly for group-idle).


Glauber Costa | 30 May 11:48 2012

[PATCH v3 5/6] Also record sleep start for a task group

When we're dealing with a task group, instead of a task, also record
the start of its sleep time. Since the test against TASK_UNINTERRUPTIBLE
does not really make sense and lacks an obvious analogue, we always
record it as sleep_start, never block_start.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Paul Turner <pjt@google.com>
---
 kernel/sched/fair.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c26fe38..d932559 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1182,7 +1182,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 				se->statistics.sleep_start = rq_of(cfs_rq)->clock;
 			if (tsk->state & TASK_UNINTERRUPTIBLE)
 				se->statistics.block_start = rq_of(cfs_rq)->clock;
-		}
+		} else
+			se->statistics.sleep_start = rq_of(cfs_rq)->clock;
 #endif
 	}

--

-- 
1.7.10.2


Paul Turner | 30 May 13:35 2012

Re: [PATCH v3 5/6] Also record sleep start for a task group

On Wed, May 30, 2012 at 2:48 AM, Glauber Costa <glommer@...> wrote:
> When we're dealing with a task group, instead of a task, also record
> the start of its sleep time. Since the test against TASK_UNINTERRUPTIBLE
> does not really make sense and lacks an obvious analogue, we always
> record it as sleep_start, never block_start.
>
> Signed-off-by: Glauber Costa <glommer@...>
> CC: Peter Zijlstra <a.p.zijlstra@...>
> CC: Paul Turner <pjt@...>
> ---
>  kernel/sched/fair.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c26fe38..d932559 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1182,7 +1182,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>                                se->statistics.sleep_start = rq_of(cfs_rq)->clock;
>                        if (tsk->state & TASK_UNINTERRUPTIBLE)
>                                se->statistics.block_start = rq_of(cfs_rq)->clock;
> -               }
> +               } else
> +                       se->statistics.sleep_start = rq_of(cfs_rq)->clock;

You can't sanely account sleep on a group entity.

Suppose you have 2 sleepers on 1 cpu: you account 1s/s of idle
Suppose you have 2 sleepers now on 2 cpus: you account 2s/s of idle


Glauber Costa | 30 May 14:24 2012

Re: [PATCH v3 5/6] Also record sleep start for a task group

On 05/30/2012 03:35 PM, Paul Turner wrote:
> On Wed, May 30, 2012 at 2:48 AM, Glauber Costa <glommer@...> wrote:
>> When we're dealing with a task group, instead of a task, also record
>> the start of its sleep time. Since the test against TASK_UNINTERRUPTIBLE
>> does not really make sense and lacks an obvious analogue, we always
>> record it as sleep_start, never block_start.
>>
>> Signed-off-by: Glauber Costa <glommer@...>
>> CC: Peter Zijlstra <a.p.zijlstra@...>
>> CC: Paul Turner <pjt@...>
>> ---
>>   kernel/sched/fair.c |    3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c26fe38..d932559 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1182,7 +1182,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>                                 se->statistics.sleep_start = rq_of(cfs_rq)->clock;
>>                         if (tsk->state & TASK_UNINTERRUPTIBLE)
>>                                 se->statistics.block_start = rq_of(cfs_rq)->clock;
>> -               }
>> +               } else
>> +                       se->statistics.sleep_start = rq_of(cfs_rq)->clock;
>
> You can't sanely account sleep on a group entity.
>
> Suppose you have 2 sleepers on 1 cpu: you account 1s/s of idle
> Suppose you have 2 sleepers now on 2 cpus: you account 2s/s of idle

Peter Zijlstra | 30 May 14:44 2012

Re: [PATCH v3 5/6] Also record sleep start for a task group

On Wed, 2012-05-30 at 16:24 +0400, Glauber Costa wrote:
> sleep_start is not for iowait. This is for idle. And I know no other way 
> to collect idle time per cgroup, other than the time during which it was 
> out of the runqueue.
> 
> Now, what you say about the sleepers doesn't make that much sense for idle 
> because this information is per-cpu as well.
> 
> When the se is being dequeued, it means none of its children is running 
> on that runqueue. That's idle. 

But does that mean the cgroup is idle? It's impossible to reconstruct
the machine state from this per-cpu data if your definition of
cgroup-idle is the time when _all_ cpus are idle.

Glauber Costa | 30 May 14:44 2012

Re: [PATCH v3 5/6] Also record sleep start for a task group

On 05/30/2012 04:44 PM, Peter Zijlstra wrote:
> On Wed, 2012-05-30 at 16:24 +0400, Glauber Costa wrote:
>> sleep_start is not for iowait. This is for idle. And I know no other way
>> to collect idle time per cgroup, other than the time during which it was
>> out of the runqueue.
>>
>> Now, what you say about the sleepers doesn't make that much sense for idle
>> because this information is per-cpu as well.
>>
>> When the se is being dequeued, it means none of its children is running
>> on that runqueue. That's idle.
>
> But does that mean the cgroup is idle? It's impossible to reconstruct
> the machine state from this per-cpu data if your definition of
> cgroup-idle is the time when _all_ cpus are idle.
>
It is idle for that runqueue, aka cpu. The cgroup itself is idle when 
all cpus are idle. And yes, then you have 2s per sec of idle in a 2-way 
system.

That's pretty much how a physical box works as well.

Glauber Costa | 30 May 11:48 2012

[PATCH v3 1/6] measure exec_clock for rt sched entities

For simetry with the cfs tasks, measure exec_clock for the rt
sched entities (rt_se).

This can be used in a number of fashions. For instance, to
compute total cpu usage in a cgroup that is generated by
rt tasks.

Signed-off-by: Glauber Costa <glommer@...>
CC: Peter Zijlstra <a.p.zijlstra@...>
CC: Paul Turner <pjt@...>
---
 kernel/sched/rt.c    |    5 +++++
 kernel/sched/sched.h |    1 +
 2 files changed, 6 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index c5565c3..30ee4e2 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -919,6 +919,11 @@ static void update_curr_rt(struct rq *rq)

 	sched_rt_avg_update(rq, delta_exec);

+	for_each_sched_rt_entity(rt_se) {
+		rt_rq = rt_rq_of_se(rt_se);
+		schedstat_add(rt_rq, exec_clock, delta_exec);
+	}
+
 	if (!rt_bandwidth_enabled())
 		return;

Peter Zijlstra | 30 May 12:29 2012

Re: [PATCH v3 1/6] measure exec_clock for rt sched entities

On Wed, 2012-05-30 at 13:48 +0400, Glauber Costa wrote:
> For simetry with the cfs tasks, measure exec_clock for the rt
> sched entities (rt_se).

Symmetry methinks.. anyway, where is the symmetry? fair.c:update_curr()
doesn't do the for_each_sched_entity() thing.

> This can be used in a number of fashions. For instance, to
> compute total cpu usage in a cgroup that is generated by
> rt tasks.
> 
> Signed-off-by: Glauber Costa <glommer@...>
> CC: Peter Zijlstra <a.p.zijlstra@...>
> CC: Paul Turner <pjt@...>
> ---
>  kernel/sched/rt.c    |    5 +++++
>  kernel/sched/sched.h |    1 +
>  2 files changed, 6 insertions(+)
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index c5565c3..30ee4e2 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -919,6 +919,11 @@ static void update_curr_rt(struct rq *rq)
>  
>  	sched_rt_avg_update(rq, delta_exec);
>  
> +	for_each_sched_rt_entity(rt_se) {
> +		rt_rq = rt_rq_of_se(rt_se);
> +		schedstat_add(rt_rq, exec_clock, delta_exec);

Glauber Costa | 30 May 12:32 2012

Re: [PATCH v3 1/6] measure exec_clock for rt sched entities

On 05/30/2012 02:29 PM, Peter Zijlstra wrote:
> On Wed, 2012-05-30 at 13:48 +0400, Glauber Costa wrote:
>> For simetry with the cfs tasks, measure exec_clock for the rt
>> sched entities (rt_se).
>
> Symmetry methinks..

=p bad me
> anyway, where is the symmetry?, fair.c:update_curr()
> doesn't do the for_each_sched_entity() thing.

It does implicitly, because fair.c:update_curr() is called from
within enqueue_entity(), which is called for_each_sched_entity in
enqueue_task_fair().

>
>> This can be used in a number of fashions. For instance, to
>> compute total cpu usage in a cgroup that is generated by
>> rt tasks.
>>
>> Signed-off-by: Glauber Costa <glommer@...>
>> CC: Peter Zijlstra <a.p.zijlstra@...>
>> CC: Paul Turner <pjt@...>
>> ---
>>   kernel/sched/rt.c    |    5 +++++
>>   kernel/sched/sched.h |    1 +
>>   2 files changed, 6 insertions(+)
>>
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index c5565c3..30ee4e2 100644

Peter Zijlstra | 30 May 12:42 2012

Re: [PATCH v3 1/6] measure exec_clock for rt sched entities

On Wed, 2012-05-30 at 14:32 +0400, Glauber Costa wrote:
> >> +    for_each_sched_rt_entity(rt_se) {
> >> +            rt_rq = rt_rq_of_se(rt_se);
> >> +            schedstat_add(rt_rq, exec_clock, delta_exec);
> >> +    }
> >> +
> >>      if (!rt_bandwidth_enabled())
> >>              return;
> >
> > See, this just makes me sad.. you now have a double
> > for_each_sched_rt_entity() loop.
> 
> The way I read the rt.c code, it is called from enqueue_task_rt only
> once. 

Ah, what I meant was, right after that !rt_bandwidth_enabled() muck we
do another for_each_sched_rt_entity() walk.
Glauber Costa | 30 May 12:42 2012

Re: [PATCH v3 1/6] measure exec_clock for rt sched entities

On 05/30/2012 02:42 PM, Peter Zijlstra wrote:
> On Wed, 2012-05-30 at 14:32 +0400, Glauber Costa wrote:
>>>> +    for_each_sched_rt_entity(rt_se) {
>>>> +            rt_rq = rt_rq_of_se(rt_se);
>>>> +            schedstat_add(rt_rq, exec_clock, delta_exec);
>>>> +    }
>>>> +
>>>>       if (!rt_bandwidth_enabled())
>>>>               return;
>>>
>>> See, this just makes me sad.. you now have a double
>>> for_each_sched_rt_entity() loop.
>>
>> The way I read the rt.c code, it is called from enqueue_task_rt only
>> once.
>
> Ah, what I meant was, right after that !rt_bandwidth_enabled() muck we
> do another for_each_sched_rt_entity() walk.
I guess I can fold it there...
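For illustration, the fold might look roughly like this (a sketch against
that era's update_curr_rt(); note the !rt_bandwidth_enabled() early return
would have to move inside the loop so exec_clock is still accumulated):

	for_each_sched_rt_entity(rt_se) {
		rt_rq = rt_rq_of_se(rt_se);

		schedstat_add(rt_rq, exec_clock, delta_exec);

		if (!rt_bandwidth_enabled())
			continue;

		if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {
			raw_spin_lock(&rt_rq->rt_runtime_lock);
			rt_rq->rt_time += delta_exec;
			if (sched_rt_runtime_exceeded(rt_rq))
				resched_task(curr);
			raw_spin_unlock(&rt_rq->rt_runtime_lock);
		}
	}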

Paul Turner | 30 May 13:00 2012

Re: [PATCH v3 1/6] measure exec_clock for rt sched entities

On Wed, May 30, 2012 at 3:42 AM, Glauber Costa <glommer@...> wrote:
> On 05/30/2012 02:42 PM, Peter Zijlstra wrote:
>>
>> On Wed, 2012-05-30 at 14:32 +0400, Glauber Costa wrote:
>>>>>
>>>>> +    for_each_sched_rt_entity(rt_se) {
>>>>> +            rt_rq = rt_rq_of_se(rt_se);
>>>>> +            schedstat_add(rt_rq, exec_clock, delta_exec);
>>>>> +    }
>>>>> +
>>>>>      if (!rt_bandwidth_enabled())
>>>>>              return;
>>>>
>>>>
>>>> See, this just makes me sad.. you now have a double
>>>> for_each_sched_rt_entity() loop.
>>>
>>>
>>> The way I read the rt.c code, it is called from enqueue_task_rt only
>>> once.
>>
>>
>> Ah, what I meant was, right after that !rt_bandwidth_enabled() muck we
>> do another for_each_sched_rt_entity() walk.
>
> I guess I can fold it there...
>

Does this even need to be hierarchical?  While it's natural for it to
be in the CFS case, it feels forced here.

Glauber Costa | 30 May 14:09 2012

Re: [PATCH v3 1/6] measure exec_clock for rt sched entities

On 05/30/2012 03:00 PM, Paul Turner wrote:
> Does this even need to be hierarchical?  While it's natural for it to
> be in the CFS case, it feels forced here.
>
> You could instead make this rt_rq->local_exec_clock  charging only to
> the parenting rt_rq and post-aggregate when you want to report.  The
> only thing you'd need to be careful of is also accounting children
> somewhere on the parent on destruction (reaped_exec_clock?).
>
> Harking back to symmetry, local_exec_clock is also a potentially
> useful stat on the CFS side of things since it allows you to usefully
> disambiguate versus your children (common case where this is useful is
> calculating usage of threads in the root cgroup); so it wouldn't need
> to be unique to rt_rq.

I can try this approach.
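A sketch of what that could look like (local_exec_clock and
reaped_exec_clock are Paul's hypothetical names; the read-side walk over
tg->children is my assumption):

	/* Tick time: charge only the immediate rt_rq, no hierarchy walk. */
	schedstat_add(rt_rq_of_se(rt_se), local_exec_clock, delta_exec);

	/* Read time: post-aggregate over the subtree (depth is bounded by
	 * the cgroup hierarchy, so the recursion stays shallow). */
	static u64 rt_exec_clock(struct task_group *tg, int cpu)
	{
		struct task_group *child;
		u64 sum = tg->rt_rq[cpu]->local_exec_clock +
			  tg->rt_rq[cpu]->reaped_exec_clock;

		list_for_each_entry_rcu(child, &tg->children, siblings)
			sum += rt_exec_clock(child, cpu);
		return sum;
	}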
Glauber Costa | 30 May 11:48 2012

[PATCH v3 6/6] expose per-taskgroup schedstats in cgroup

This patch aims at exposing stat information per-cgroup, such as:
 * idle time,
 * iowait time,
 * steal time,
 * # context switches
and friends. The ultimate goal is to be able to present a per-container view of
/proc/stat inside a container. With this patch, everything that is needed to do
that is in place, except for number of tasks.

For most of the data, I achieve that by hooking into the schedstats framework,
so although the overhead of that is prone to discussion, I am not adding anything
new, but reusing what's already there instead. The exception is that the data is
now computed and stored in non-task se's as well, instead of only inside
entity_is_task() branches. However, I expect this to be minimal compared to the
alternative of adding new hierarchy walks; the existing walks are kept intact.

The format of the new file added is the same as the one recently
introduced for cpuacct:

  cpu0.idle X
  cpu0.steal Y
  ...
  cpu1.idle X1
  cpu1.steal Y1
  ...

Signed-off-by: Glauber Costa <glommer@...>
CC: Peter Zijlstra <a.p.zijlstra@...>
CC: Paul Turner <pjt@...>
---

Peter Zijlstra | 30 May 13:22 2012

Re: [PATCH v3 6/6] expose per-taskgroup schedstats in cgroup

On Wed, 2012-05-30 at 13:48 +0400, Glauber Costa wrote:
> +static u64 tg_idle(struct task_group *tg, int cpu)
> +{
> +       u64 val;
> +
> +       if (tg != &root_task_group) {
> +               val = cfs_read_sleep(tg->se[cpu]);
> +               /* If we have rt tasks running, we're not really idle */
> +               val -= rt_rq(exec_clock, tg, cpu);
> +       } else
> +               /*
> +                * There are many errors here that we are accumulating.
> +                * However, we only provide this in the interest of having
> +                * a consistent interface for all cgroups. Everybody
> +                * probing the root cgroup should be getting its figures
> +                * from system-wide files as /proc/stat. That would be faster
> +                * to begin with...
> +                *
> +                * Ditto for steal.
> +                */
> +               val = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE] * TICK_NSEC;

You just violated 2 coding style rules in one go :-)

If one side of the if-else has braces, the other side should have too.
If a block is multi-line (regardless of multi-stmt) it should have
braces.

/me hands you a bucket full of {}.
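i.e., the style-conforming shape would be (same logic as the quoted hunk):

	if (tg != &root_task_group) {
		val = cfs_read_sleep(tg->se[cpu]);
		/* If we have rt tasks running, we're not really idle */
		val -= rt_rq(exec_clock, tg, cpu);
	} else {
		/* (big comment from the quoted hunk elided) */
		val = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE] * TICK_NSEC;
	}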


Glauber Costa | 30 May 11:48 2012

[PATCH v3 4/6] add a new scheduler hook for context switch

To be able to count the number of switches per-cgroup, and following
Paul's hint that it would be better to do it in fair.c and rt.c,
I am introducing a write-side walk through a new scheduler hook,
called at every context switch away from a task (prev).

The read side is greatly simplified, and as I'll show, the performance impact
does not seem huge. First, aside from the function call, this walk is
O(depth), and the depth is not likely to be large (if it is, the performance
impact is indeed bad - but one can argue this is a fair punishment).

Also, this walk is likely to be cache-hot, since it is at most the very
same loop done by put_prev_task, except that it loops depth - 1 times instead
of depth. This is especially important so as not to hurt tasks in the root
cgroup, which will pay just a branch.

I am introducing a new hook because the existing ones didn't seem
appropriate. The main possibilities would be put_prev_task and
pick_next_task.

With pick_next_task, there are two main problems:

1) First, the loop is only cache-hot in pick_next_task if tasks actually
belong to the same class and group as prev. Depending on the workload,
this may well be unlikely.

2) It is vulnerable to accounting errors when exiting to idle.
Consider two groups A and B with the following pattern: A exits to idle,
but after that, B is scheduled. B, on the other hand, always gets back
to itself when it yields to idle. This means that the to-idle transition
in A is never noted.
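The hook itself is truncated in this archive. Its rough shape, as described
above, would be something like this sketch (the hook name, the nr_switches
field, and the wiring are my guesses, not the verbatim patch):

	/* Hypothetical new sched_class method, called once per context
	 * switch for the task being switched away from (prev).
	 */
	struct sched_class {
		/* ... existing hooks ... */
		void (*switch_from)(struct rq *rq, struct task_struct *prev);
	};

	static void switch_from_fair(struct rq *rq, struct task_struct *prev)
	{
		struct sched_entity *se = &prev->se;

		/*
		 * O(depth) and cache-hot right after put_prev_task(); the
		 * task's own se is skipped, so this loops depth - 1 times
		 * and tasks in the root cgroup pay just this one branch.
		 */
		for_each_sched_entity(se) {
			if (!entity_is_task(se))
				cfs_rq_of(se)->nr_switches++;
		}
	}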

Peter Zijlstra | 30 May 13:20 2012

Re: [PATCH v3 4/6] add a new scheduler hook for context switch

On Wed, 2012-05-30 at 13:48 +0400, Glauber Costa wrote:
> To be able to count the number of switches per-cgroup, and following
> Paul's hint that it would be better to do it in fair.c and rt.c,
> I am introducing a write-side walk through a new scheduler hook,
> called at every context switch away from a task (prev).
> 
> The read side is greatly simplified, and as I'll show, the performance impact
> does not seem huge. First, aside from the function call, this walk is
> O(depth), and the depth is not likely to be large (if it is, the performance
> impact is indeed bad - but one can argue this is a fair punishment).
> 
> Also, this walk is likely to be cache-hot, since it is at most the very
> same loop done by put_prev_task, except that it loops depth - 1 times instead
> of depth. This is especially important so as not to hurt tasks in the root
> cgroup, which will pay just a branch.

/me cries a little.. I was hoping to fix put_prev_task.. see:

  https://lkml.org/lkml/2012/2/16/487

(I've actually got a 4 patch split out of that if anybody cares)

It's just one of those things stuck behind the -ENOTIME tree :/

The plan is to 'merge' put_prev_task and pick_next_task into one and
avoid a lot of the up-down walking.

You just added a constraint for always having to walk the entire thing
up -- cgroups is too damn expensive already, we should be trimming this
nonsense not bloating it.
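For context, the merged shape that link describes is roughly the following
(illustrative only; a variant of this later went upstream, and the real
version avoids the full up-walk by putting prev only up to the common
ancestor):

	static struct task_struct *
	pick_next_task_fair(struct rq *rq, struct task_struct *prev)
	{
		struct cfs_rq *cfs_rq = &rq->cfs;
		struct sched_entity *se;

		if (!cfs_rq->nr_running)
			return NULL;

		/* Simplified: put prev and pick next in one top-down pass. */
		put_prev_task(rq, prev);
		do {
			se = pick_next_entity(cfs_rq);
			set_next_entity(cfs_rq, se);
			cfs_rq = group_cfs_rq(se);
		} while (cfs_rq);

		return task_of(se);
	}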

Peter Zijlstra | 30 May 13:40 2012

Re: [PATCH v3 4/6] add a new scheduler hook for context switch

On Wed, 2012-05-30 at 13:20 +0200, Peter Zijlstra wrote:
> 
> $ perf stat -e cycles --repeat 100 perf bench sched messaging -p -g 100
> 
>     48,826,146,710 cycles  #    2.470 GHz  ( +-  0.17% )
>        2.149005270 seconds time elapsed    ( +-  0.12% ) 

A repeat count of 250 dropped it down to:

        2.144582157 seconds time elapsed    ( +-  0.08% )

Also, if you're poking at the context switch path, something like:

$ taskset 1 perf stat -e cycles --repeat 10 perf bench sched pipe -l 100000

gives a good number and is usually more stable than hackbench.
Glauber Costa | 30 May 14:08 2012

Re: [PATCH v3 4/6] add a new scheduler hook for context switch

On 05/30/2012 03:40 PM, Peter Zijlstra wrote:
> On Wed, 2012-05-30 at 13:20 +0200, Peter Zijlstra wrote:
>>
>> $ perf stat -e cycles --repeat 100 perf bench sched messaging -p -g 100
>>
>>      48,826,146,710 cycles  #    2.470 GHz  ( +-  0.17% )
>>         2.149005270 seconds time elapsed    ( +-  0.12% )
>
> A repeat count of 250 dropped it down to:
>
>          2.144582157 seconds time elapsed    ( +-  0.08% )
>
> Also, if you're poking at the context switch path, something like:
>
> $ taskset 1 perf stat -e cycles --repeat 10 perf bench sched pipe -l 100000
>
> gives a good number and is usually more stable than hackbench.
thanks for the pointers

Glauber Costa | 30 May 14:07 2012

Re: [PATCH v3 4/6] add a new scheduler hook for context switch

On 05/30/2012 03:20 PM, Peter Zijlstra wrote:
> /me cries a little.. I was hoping to fix put_prev_task.. see:
>
>    https://lkml.org/lkml/2012/2/16/487
>
> (I've actually got a 4 patch split out of that if anybody cares)
>
> Its just one of those things stuck behind the -ENOTIME tree :/
>
> The plan is to 'merge' put_prev_task and pick_next_task into one and
> avoid a lot of the up-down walking.
>
> You just added a constraint for always having to walk the entire thing
> up -- cgroups is too damn expensive already, we should be trimming this
> nonsense not bloating it.

if they are merged, then I don't need a new hook.

Do you have plans to do it? I'd be happy to continue the work if you 
lack the time, since I am directly interested in the functionality.

