Julian Coleman | 23 Feb 2012 12:02

Pausing/resuming CPU's in DDB

Hi,

On my 8-way E3500, I almost always see some of the CPU's fail to pause when
entering DDB, and fail to resume when leaving.  This makes it hard to obtain
CPU-specific information for some CPU's.  Martin suggested that a loop around
the pause/resume code might help here, and the attached patch works for me.
Does anyone see any problem with it?

Thanks,

J

-- 
  My other computer also runs NetBSD    /        Sailing at Newbiggin
        http://www.netbsd.org/        /   http://www.newbigginsailingclub.org/
Index: ipifuncs.c
===================================================================
RCS file: /cvsroot/src/sys/arch/sparc64/sparc64/ipifuncs.c,v
retrieving revision 1.44
diff -u -r1.44 ipifuncs.c
--- ipifuncs.c	12 Feb 2012 16:34:10 -0000	1.44
+++ ipifuncs.c	22 Feb 2012 08:09:09 -0000
@@ -325,17 +325,21 @@
 void
 mp_pause_cpus(void)
 {
+	int i = 3;
 	sparc64_cpuset_t cpuset;
(Continue reading)

Takeshi Nakayama | 23 Feb 2012 14:57

Re: Pausing/resuming CPU's in DDB

>>> Julian Coleman <jdc <at> coris.org.uk> wrote

> On my 8-way E3500, I almost always see some of the CPU's fail to pause when
> entering DDB, and fail to resume when leaving.  This makes it hard to obtain
> CPU-specific information for some CPU's.  Martin suggested that a loop around
> the pause/resume code might help here, and the attached patch works for me.
> Does anyone see any problem with it?

How about increasing the retry count in sparc64_send_ipi()?

Ours is now 1000, but FreeBSD is 5000, OpenBSD is 10000.

Thanks,

-- Takeshi Nakayama

Martin Husemann | 23 Feb 2012 15:03

Re: Pausing/resuming CPU's in DDB

On Thu, Feb 23, 2012 at 10:57:47PM +0900, Takeshi Nakayama wrote:
> How about increasing the retry count in sparc64_send_ipi()?
> 
> Ours is now 1000, but FreeBSD is 5000, OpenBSD is 10000.

Do both, or scale the retry count on cpu speed ...

If it only fails for ddb enter/exit the loop is fine, IMHO - but as we have
seen other reports of ipi sending failure during normal operation, we should
add the instrumentation Eduardo suggested and find out where we are blocked
out that long (but this is mostly orthogonal to the topic at hand).

Martin
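
One way to read "scale the retry count on cpu speed" is sketched below. This is only an illustration of the idea, not code from the NetBSD tree: the helper name model_ipi_retries and its parameters (a baseline count tuned on a reference clock, and the current CPU clock) are assumptions made up for this example.

```c
/*
 * Hypothetical helper (not in NetBSD): scale a baseline retry count,
 * tuned on a CPU running at ref_mhz, to a CPU running at cpu_mhz.
 * A faster CPU spins through the retry loop more quickly, so it needs
 * more iterations to wait for the same wall-clock time.  The count is
 * never scaled below the baseline.
 */
static int
model_ipi_retries(int base_retries, int ref_mhz, int cpu_mhz)
{
	long scaled;

	if (ref_mhz <= 0 || cpu_mhz <= ref_mhz)
		return base_retries;

	scaled = (long)base_retries * cpu_mhz / ref_mhz;
	return (int)scaled;
}
```

With a baseline of 1000 retries tuned at 250 MHz, a 400 MHz CPU would get 1600 retries under this scheme.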

Eduardo Horvath | 23 Feb 2012 18:09

Re: Pausing/resuming CPU's in DDB

On Thu, 23 Feb 2012, Martin Husemann wrote:

> On Thu, Feb 23, 2012 at 10:57:47PM +0900, Takeshi Nakayama wrote:
> > How about increasing the retry count in sparc64_send_ipi()?
> > 
> > Ours is now 1000, but FreeBSD is 5000, OpenBSD is 10000.
> 
> Do both, or scale the retry count on cpu speed ...

Looking at the code....

Increasing the number of retries in sparc64_send_ipi() is unlikely to help 
the situation since you should only be able to exit that routine one of 
two ways, if it thinks sending the IPI was successful or through this 
code:

        if (panicstr == NULL)
                panic("cpu%d: ipi_send: couldn't send ipi to UPAID %u"
                        " (tried %d times)", cpu_number(), upaid, i);

Are you getting a panic?  If not, then increasing the loop count won't 
help.
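
The argument above can be illustrated with a small user-space model of a bounded send loop. It is a sketch under assumptions, not the sparc64_send_ipi() source: model_send_ipi, try_dispatch_fn and busy_for_n are hypothetical names, and the panic is represented by a return value so the model can be exercised safely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one hardware dispatch attempt. */
typedef bool (*try_dispatch_fn)(unsigned upaid, void *arg);

/* The two ways out of the loop: success, or the point where the kernel would panic. */
enum ipi_result { IPI_SENT, IPI_WOULD_PANIC };

/*
 * Try the dispatch up to max_retries times.  Either an attempt is
 * accepted (IPI_SENT) or the retries run out (IPI_WOULD_PANIC).
 * There is no third, silent failure path, so a larger max_retries can
 * only turn a would-be panic into a success.
 */
static enum ipi_result
model_send_ipi(unsigned upaid, int max_retries,
    try_dispatch_fn try_dispatch, void *arg, int *attempts)
{
	int i;

	for (i = 0; i < max_retries; i++) {
		if (try_dispatch(upaid, arg)) {
			*attempts = i + 1;
			return IPI_SENT;
		}
	}
	*attempts = i;
	return IPI_WOULD_PANIC;
}

/* Example target: refuses the dispatch until *arg attempts have failed. */
static bool
busy_for_n(unsigned upaid, void *arg)
{
	int *remaining = arg;

	(void)upaid;
	if (*remaining > 0) {
		(*remaining)--;
		return false;
	}
	return true;
}
```

In this model a target that stays busy for 1500 attempts exhausts a limit of 1000 but is reached under a limit of 10000. A CPU that accepts the dispatch and then never pauses is a different failure, which the retry count cannot influence.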

> If it only fails for ddb enter/exit the loop is fine, IMHO - but as we have
> seen other reports of ipi sending failure during normal operation, we should
> add the instrumentation Eduardo suggested and find out where we are blocked
> out that long (but this is mostly orthogonal to the topic at hand).

Also, instead of always sending the IPI to all the cpus I would recommend 
updating the cpuset by removing the processors that have halted for the 
next iteration of your patch.

Julian Coleman | 11 Mar 2012 01:05

Re: Pausing/resuming CPU's in DDB

Hi,

> Increasing the number of retries in sparc64_send_ipi() is unlikely to help 
> the situation since you should only be able to exit that routine one of 
> two ways, if it thinks sending the IPI was successful or through this 
> code:
> 
>         if (panicstr == NULL)
>                 panic("cpu%d: ipi_send: couldn't send ipi to UPAID %u"
>                         " (tried %d times)", cpu_number(), upaid, i);
> 
> Are you getting a panic?  If not, then increasing the loop count won't 
> help.

No panic, but I did see the "RED State Exception".  I have increased the
number of retries in sparc64_send_ipi to 10000, and the E3500 survived a
10h30 build.sh -j 16 (base, X, 2 kernels) with all filesystems on NFS.

However, even with retries set to 10000, it still sometimes fails to pause or
to resume all the CPU's, so the loops in mp_pause_cpus() and mp_resume_cpus()
still seem to be necessary.

> Also, instead of always sending the IPI to all the cpus I would recommend 
> updating the cpuset by removing the processors that have halted for the 
> next iteration of your patch.

Next iteration of the patch attached.

Thanks,

(Continue reading)
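
Since the attached patch is truncated here, the following is only a guess at the shape of the approach discussed in the thread: a bounded retry loop that re-sends the pause IPI solely to the CPUs that have not yet paused. It is a user-space model, not the actual change to mp_pause_cpus(). The type cpuset_model_t, the fake_machine structure and both functions are hypothetical; the real code works on sparc64_cpuset_t, and the first version of the patch shows a retry counter initialised to 3.

```c
#include <stdint.h>

/* Hypothetical: one bit per CPU.  The real kernel type is sparc64_cpuset_t. */
typedef uint32_t cpuset_model_t;

/* Hypothetical fake machine, used only to exercise the loop below. */
struct fake_machine {
	cpuset_model_t paused;       /* CPUs that have acknowledged the pause */
	cpuset_model_t deaf_once;    /* CPUs that will miss the next IPI */
	cpuset_model_t last_targets; /* targets of the most recent send */
	int sends;                   /* number of send rounds so far */
};

/* Deliver a pause IPI to every target that is not currently "deaf". */
static void
fake_send_pause(struct fake_machine *m, cpuset_model_t targets)
{
	m->paused |= targets & ~m->deaf_once;
	m->deaf_once = 0;	/* assumption: a missed CPU hears the next attempt */
	m->last_targets = targets;
	m->sends++;
}

/*
 * Send the pause IPI up to `tries` times.  After each round, drop the
 * CPUs that have paused from the target set, so later rounds only
 * address the stragglers.  Returns the set of CPUs still not paused.
 */
static cpuset_model_t
model_pause_cpus(struct fake_machine *m, cpuset_model_t others, int tries)
{
	cpuset_model_t remaining = others;

	while (remaining != 0 && tries-- > 0) {
		fake_send_pause(m, remaining);
		remaining &= ~m->paused;
	}
	return remaining;
}
```

On a modelled 8-way machine where CPUs 2 and 3 miss the first IPI, the second round targets only those two CPUs, and the loop stops as soon as the remaining set is empty.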

matthew green | 24 Feb 2012 00:55

re: Pausing/resuming CPU's in DDB


> If it only fails for ddb enter/exit the loop is fine, IMHO - but as we have
> seen other reports of ipi sending failure during normal operation, we should
> add the instrumentation Eduardo suggested and find out where we are blocked
> out that long (but this is mostly orthogonal to the topic at hand).

FWIW, to debug some sparc smp issues i was seeing i added an NMI IPI
that made the remote cpus log their %pc, if they weren't answering
normal IPIs.  i never tried porting this code to sparc64 in any sense
(and it isn't committed to sparc port anyway) because level 15 is no
longer "NMI", as i recall anyway.

but the feature is simple to implement and use, if it works.

.mrg.

Eduardo Horvath | 24 Feb 2012 01:54

re: Pausing/resuming CPU's in DDB

On Fri, 24 Feb 2012, matthew green wrote:

> FWIW, to debug some sparc smp issues i was seeing i added an NMI IPI
> that made the remote cpus log their %pc, if they weren't answering
> normal IPIs.  i never tried porting this code to sparc64 in any sense
> (and it isn't committed to sparc port anyway) because level 15 is no
> longer "NMI", as i recall anyway.
> 
> but the feature is simple to implement and use, if it works.

I don't think there's any way to generate a "NMI" IPI on V9 machines.  
Everything (including IPIs) comes in as an interrupt_vector and then is 
given a priority in software.  

Hm... Maybe what's going on is that the SMP code is turning off the 
interrupt enable bit in the pstate register rather than just increasing 
the IPL in the level register...  that would prevent any interrupts from 
being received and queued for later dispatch.

Eduardo
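
The distinction drawn above, between clearing the interrupt enable bit and merely raising the interrupt level, can be shown with a toy model. This is a deliberately simplified sketch and not a description of the real hardware or of the NetBSD trap code: struct model_cpu and both functions are hypothetical, and details such as the interrupt receive registers are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified CPU state. */
struct model_cpu {
	bool ie;          /* models the interrupt enable bit in pstate */
	int pil;          /* models the interrupt level register, 0..15 */
	uint16_t pending; /* interrupts queued in software, one bit per level */
	int handled;      /* number of queued interrupts dispatched so far */
};

/*
 * An incoming interrupt vector.  With the enable bit off, the vector
 * trap is not taken, so nothing is queued and the sender sees no
 * progress.  With the enable bit on, software queues the interrupt at
 * the chosen level, even if that level is currently masked.
 */
static bool
model_interrupt_vector(struct model_cpu *c, int level)
{
	if (!c->ie)
		return false;
	c->pending |= (uint16_t)(1u << level);
	return true;
}

/* Dispatch every queued interrupt whose level is above the current level. */
static void
model_dispatch(struct model_cpu *c)
{
	int level;

	for (level = 15; level > c->pil; level--) {
		if (c->ie && (c->pending & (1u << level))) {
			c->pending &= (uint16_t)~(1u << level);
			c->handled++;
		}
	}
}
```

In the model, a CPU with the level raised to 15 still accepts and queues the IPI and runs it once the level drops, while a CPU with the enable bit cleared never accepts it at all.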

Eduardo Horvath | 24 Feb 2012 02:01

re: Pausing/resuming CPU's in DDB

On Fri, 24 Feb 2012, Eduardo Horvath wrote:

> On Fri, 24 Feb 2012, matthew green wrote:
> 
> > FWIW, to debug some sparc smp issues i was seeing i added an NMI IPI
> > that made the remote cpus log their %pc, if they weren't answering
> > normal IPIs.  i never tried porting this code to sparc64 in any sense
> > (and it isn't committed to sparc port anyway) because level 15 is no
> > longer "NMI", as i recall anyway.
> > 
> > but the feature is simple to implement and use, if it works.
> 
> I don't think there's any way to generate a "NMI" IPI on V9 machines.  
> Everything (including IPIs) comes in as an interrupt_vector and then is 
> given a priority in software.  
> 
> Hm... Maybe what's going on is that the SMP code is turning off the 
> interrupt enable bit in the pstate register rather than just increasing 
> the IPL in the level register...  that would prevent any interrupts from 
> being received and queued for later dispatch.

Hm...  interrupt_vector jumps straight to the IPI routine...  and we have 
a "sparc64_generic_xcall" which could do pretty much anything...  all with 
the interrupt enable bit in the pstate turned off...

Eduardo

Volkmar Seifert | 23 Mar 2012 08:59

Re: Pausing/resuming CPU's in DDB

>> How about increasing the retry count in sparc64_send_ipi()?
>>
>> Ours is now 1000, but FreeBSD is 5000, OpenBSD is 10000.
>
> Do both, or scale the retry count on cpu speed ...

We tried using various much higher values already when I posted my problem
about

"/netbsd: panic: cpu0: ipi_send: couldn't send ipi to UPAID2"

(that's the subject of the emails and the error occurring on my machine)

And as it is today, the machine is as unstable as ever, even though it
looked like a good work-around at first.
I can't even log in to X without having to fear the above-mentioned
kernel panic.

Increasing the value has, IMHO, no real effect other than delaying the
panic for a few more loop cycles.

> If it only fails for ddb enter/exit the loop is fine, IMHO - but as we
> have seen other reports of ipi sending failure during normal operation, we
> should add the instrumentation Eduardo suggested and find out where we are
> blocked out that long (but this is mostly orthogonal to the topic at
> hand).

I'd be happy to provide testing ground, as my machine very reliably
crashes into this panic in this routine, no matter how high the count.
Sadly, my own knowledge in this particular field is rather limited, so I
(Continue reading)

matthew green | 24 Feb 2012 00:52

re: Pausing/resuming CPU's in DDB


> On my 8-way E3500, I almost always see some of the CPU's fail to pause when
> entering DDB, and fail to resume when leaving.  This makes it hard to obtain
> CPU-specific information for some CPU's.  Martin suggested that a loop around
> the pause/resume code might help here, and the attached patch works for me.
> Does anyone see any problem with it?

why does this work?  hmmm.  i'd much rather figure out the root cause
and fix that, than put this workaround in place...

but i don't object.

.mrg.

