Toru Nishimura | 28 Apr 2012 08:49

occational system lock up in 6.0_BETA and 6.99.5

Hi,

I need help from people who know the powerpc kernel.  My sandpoint
NAS is known to lock up under certain heavy loads.

1. dump(8) or fsck_ffs(8) with a -X/-x filesystem snapshot, WAPBL enabled.
2. Building GCC 4.5 for powerpc, at the genautomata step.

Just now I hit another lockup.  The system still responds to a
DDB break-in.

- KUROBOX with 64MB RAM plus 512MB swap space.
- It had been running "build.sh tools" for 7h18m.
- The system is under severe VM thrashing.  Most of the time the
offending compile process is continuously paging out and in.

$ top
31 processes: 30 sleeping, 1 on CPU
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Memory: 29M Act, 15M Inact, 4K Wired, 2612K Exec, 6568K File, 464K Free
Swap: 512M Total, 455M Used, 57M Free

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
11089 nisimura  85    0   439M   35M biowait   38:49  0.00%  0.00% genautomata
    0 root     221    0     0K 4972K pgdaemon   8:20  0.00%  0.00% [system]
29191 nisimura  43    0  4316K 1572K CPU        0:38  0.00%  0.00% top
  817 root      85    0  8432K  452K select     0:35  0.00%  0.00% telnetd
10498 nisimura  85    0    12M  364K wait       0:09  0.00%  0.00% nbgmake
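
For a sense of scale, the figures in the top(1) output above can be turned into a few ratios. This is only a rough sketch: the numbers are copied from the report, and the function names are made up for illustration.

```python
# Figures copied from the report above; the helper names are illustrative only.
RAM_MB = 64           # KUROBOX physical memory
SWAP_TOTAL_MB = 512   # "Swap: 512M Total"
SWAP_USED_MB = 455    # "455M Used"
PROC_SIZE_MB = 439    # genautomata SIZE column
PROC_RES_MB = 35      # genautomata RES column


def swap_used_percent(used_mb, total_mb):
    """Share of swap space currently in use, as a percentage."""
    return 100.0 * used_mb / total_mb


def oversubscription(size_mb, ram_mb):
    """How many times larger the process image is than physical RAM."""
    return size_mb / ram_mb


def resident_percent(res_mb, size_mb):
    """Share of the process image that is resident in RAM, as a percentage."""
    return 100.0 * res_mb / size_mb


if __name__ == "__main__":
    print(f"swap in use:        {swap_used_percent(SWAP_USED_MB, SWAP_TOTAL_MB):.1f}%")
    print(f"process size / RAM: {oversubscription(PROC_SIZE_MB, RAM_MB):.2f}x")
    print(f"resident share:     {resident_percent(PROC_RES_MB, PROC_SIZE_MB):.1f}%")
```

With a process image several times the size of physical memory and only a small part of it resident, continuous paging is what one would expect.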

D'Arcy Cain | 28 Apr 2012 18:52

Re: occational system lock up in 6.0_BETA and 6.99.5

On 04/28/12 02:49, Toru Nishimura wrote:
> $ /store/bin/build-sandpoint tools
> ^C^C^C ..... serial console echoes back ^C but gets no go.

This sounds a lot like my i386 system at the moment.  Every morning
I have to do a hard boot.  I tried without ACPI and same result.

> ~# ... sending BREAK does work.

Me too. http://www.vex.net/~darcy/20120428_4307.jpg

-- 
D'Arcy J.M. Cain <darcy <at> NetBSD.org>
http://www.NetBSD.org/ IM:darcy <at> Vex.Net

Toru Nishimura | 29 Apr 2012 00:12

Re: occational system lock up in 6.0_BETA and 6.99.5

Hi,

> This sounds a lot like my i386 system at the moment.  Every morning
> I have to do a hard boot.  I tried without ACPI and same result.
> 
>> ~# ... sending BREAK does work.
> 
> Me too. http://www.vex.net/~darcy/20120428_4307.jpg

Well, the i386 case is rather a surprise to me.  The lockups are related to
the scheduler itself, not to a particular device or memory activity.

Toru Nishimura / ALKYL Technology

Bernd Ernesti | 2 May 2012 07:55

Re: occational system lock up in 6.0_BETA and 6.99.5

Hi,

On Sat, Apr 28, 2012 at 03:49:13PM +0900, Toru Nishimura wrote:
> Hi,
> 
> I need a help from powerpc kernel knowledgefuls.  My sandpoint
> NAS is known to make system lock ups under certain heavy
> loads.
> 
> 1. dump(8) or fsck_ffs(8) with -X/-x filesys snapshot. WAPBL enabled.
> 2. doing powerpc GCC4.5 compilation for genautomata

Like D'Arcy, I had a lockup on i386 too.
In my case it is 6.0_BETA.

I do not have a backtrace because I was under X11 when I compiled
a package and the system locked up while checking the existing
files in /usr/pkg. xosview showed around 40MB/s i/o at that time.

Bernd

Tom Ivar Helbekkmo | 2 May 2012 09:36

Re: occational system lock up in 6.0_BETA and 6.99.5

Bernd Ernesti <netbsd <at> lists.veego.de> writes:

> As D'Arcy I had a lock up on i386 too.

NetBSD/amd64 is hanging itself up during heavy loads, too - there are
several reports on this currently being discussed on port-amd64.  In my
case, at least, it's obviously related to heavy disk I/O, as everything
else continues to work, but every process that tries to access disk gets
stuck in biowait.  I'm running a very current amd64-current.

-tih
-- 
"The market" is a bunch of 28-year-olds who don't know anything. --Paul Krugman

Thor Lancelot Simon | 3 May 2012 14:34

Re: occational system lock up in 6.0_BETA and 6.99.5

On Wed, May 02, 2012 at 09:36:27AM +0200, Tom Ivar Helbekkmo wrote:
> Bernd Ernesti <netbsd <at> lists.veego.de> writes:
> 
> > As D'Arcy I had a lock up on i386 too.
> 
> NetBSD/amd64 is hanging itself up during heavy loads, too - there are
> several reports on this currently being discussed on port-amd64.  In my
> case, at least, it's obviously related to heavy disk I/O, as everything
> else continues to work, but every process that tries to access disk gets
> stuck in biowait.  I'm running a very current amd64-current.

Make it more -current; a fix for one problem of this kind was checked in
over the weekend.

Thor

Tom Ivar Helbekkmo | 3 May 2012 20:11

Re: occational system lock up in 6.0_BETA and 6.99.5

Thor Lancelot Simon <tls <at> panix.com> writes:

> Make it more -current; a fix for one problem of this kind was checked in
> over the weekend.

I am already current as per May 1st, so I guess the one that's biting me
is a different one.  I'm now running with SMP disabled, to see if it'll
stay up like that.

-tih
-- 
"The market" is a bunch of 28-year-olds who don't know anything. --Paul Krugman

D'Arcy Cain | 4 May 2012 01:19

Re: occasional system lock up in 6.0_BETA and 6.99.5

On 12-05-03 02:11 PM, Tom Ivar Helbekkmo wrote:
> Thor Lancelot Simon<tls <at> panix.com>  writes:
>
>> Make it more -current; a fix for one problem of this kind was checked in
>> over the weekend.
>
> I am already current as per May 1st, so I guess the one that's biting me
> is a different one.  I'm now running with SMP disabled, to see if it'll
> stay up like that.

Same here.  I am running from -current as of May 2.  Haven't tried
disabling SMP yet.  Tried disabling ACPI but same thing.

OK, I had to fix the typo in the subject.  :-)

-- 
D'Arcy J.M. Cain
System Administrator, Vex.Net
http://www.Vex.Net/ IM:darcy <at> Vex.Net

Toru Nishimura | 8 May 2012 05:24

Re: occasional system lock up in 6.0_BETA and 6.99.5

Hi,

The DDB backtrace shows the following:

db> bt
0x0060bd90: at comintr+0x590
0x0060bde0: at pic_handle_intr+0x198
0x0060be20: at trapstart+0x684
0x0060bef0: at sched_curcpu_runnable_p+0x2c <<< HERE <<<
0x0060bf00: at idle_loop+0xe8
0x0060bf20: at cpu_lwp_bootstrap+0xc
saved LR(0x7ffffd) is invalid.

Here is the objdump listing of the offending code:

00181800 <sched_curcpu_runnable_p>:
  181800:       7c 08 02 a6     mflr    r0
  181804:       94 21 ff f0     stwu    r1,-16(r1)
  181808:       93 e1 00 0c     stw     r31,12(r1)
  18180c:       90 01 00 14     stw     r0,20(r1)
  181810:       48 00 6a 95     bl      1882a4 <kpreempt_disable>
  181814:       7d 30 42 a6     mfsprg  r9,0
  181818:       81 29 00 30     lwz     r9,48(r9)
  18181c:       80 69 00 1c     lwz     r3,28(r9)
  181820:       7c 63 00 34     cntlzw  r3,r3
  181824:       54 63 d9 7e     rlwinm  r3,r3,27,5,31
  181828:       68 7f 00 01     xori    r31,r3,1
  18182c:       48 00 71 c5     bl      1889f0 <kpreempt_enable> <<< L@@K <<<
  181830:       80 01 00 14     lwz     r0,20(r1)
  181834:       7f e3 fb 78     mr      r3,r31
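
The offset in the backtrace and the marked instruction can be cross-checked by hand. The sketch below (Python, with hypothetical helper names; it is not kernel code) adds the +0x2c offset to the function's start address from the listing, and emulates the cntlzw / rlwinm / xori sequence to show that it reduces a 32-bit word to a 0/1 "is non-zero" flag. What the word loaded at 28(r9) actually represents is not stated in the listing, so it is simply called `word` here.

```python
MASK32 = 0xFFFFFFFF


def frame_address(func_start, offset):
    """Resolve a 'symbol+offset' backtrace entry to an absolute address."""
    return func_start + offset


def cntlzw(x):
    """Count leading zero bits of a 32-bit word (returns 32 for x == 0)."""
    return 32 - (x & MASK32).bit_length()


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    n %= 32
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def ppc_mask(mb, me):
    """Mask with ones from bit mb through bit me, where bit 0 is the MSB.

    Only the non-wrapping case (mb <= me) is handled in this sketch.
    """
    return (MASK32 >> mb) & (MASK32 << (31 - me)) & MASK32


def rlwinm(x, sh, mb, me):
    """Rotate left word immediate, then AND with mask."""
    return rotl32(x, sh) & ppc_mask(mb, me)


def nonzero_flag(word):
    """Emulate: cntlzw r3,r3 ; rlwinm r3,r3,27,5,31 ; xori r31,r3,1."""
    r3 = cntlzw(word)           # 32 only when word == 0, otherwise 0..31
    r3 = rlwinm(r3, 27, 5, 31)  # same as r3 >> 5: 1 when word == 0, else 0
    return r3 ^ 1               # invert: 1 when word != 0


if __name__ == "__main__":
    print(hex(frame_address(0x181800, 0x2C)))
    for word in (0, 1, 2, 0x80000000):
        print(word, nonzero_flag(word))
```

Under these assumptions the frame resolves to 0x18182c, which is the `bl kpreempt_enable` line marked in the listing.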

D'Arcy Cain | 10 May 2012 07:59

Re: occasional system lock up in 6.0_BETA and 6.99.5

On 12-05-03 02:11 PM, Tom Ivar Helbekkmo wrote:
> I am already current as per May 1st, so I guess the one that's biting me
> is a different one.  I'm now running with SMP disabled, to see if it'll
> stay up like that.

Did it?

Meanwhile, I am still hanging.  For those that missed it on the other
list, here is the backtrace from the kernel debugger after a hang:

http://www.vex.net/~darcy/20120428_4307.jpg

-- 
D'Arcy J.M. Cain <darcy <at> NetBSD.org>
http://www.NetBSD.org/ IM:darcy <at> Vex.Net

Tom Ivar Helbekkmo | 10 May 2012 08:44

Re: occasional system lock up in 6.0_BETA and 6.99.5

D'Arcy Cain <darcy <at> NetBSD.org> writes:

> On 12-05-03 02:11 PM, Tom Ivar Helbekkmo wrote:
>> I am already current as per May 1st, so I guess the one that's biting me
>> is a different one.  I'm now running with SMP disabled, to see if it'll
>> stay up like that.
>
> Did it?

I never completed that test.  Martin Husemann suggested the problem
might be with "direct map" code, and on his suggestion I disabled that
by ifdef'ing out the relevant defines near the end of
sys/arch/amd64/include/types.h.  I used to have one or two hangs or
crashes per day, and it's now been a week with not a single incident.

So, I guess the problem biting me was irrelevant to port-i386.

-tih
-- 
"The market" is a bunch of 28-year-olds who don't know anything. --Paul Krugman

matthew green | 10 May 2012 08:52

re: occasional system lock up in 6.0_BETA and 6.99.5


> On 12-05-03 02:11 PM, Tom Ivar Helbekkmo wrote:
> > I am already current as per May 1st, so I guess the one that's biting me
> > is a different one.  I'm now running with SMP disabled, to see if it'll
> > stay up like that.
> 
> Did it?
> 
> Meanwhile, I am still hanging.  For those that missed it on the other
> list, here is the backtrace from the kernel debugger after a hang:
> 
> http://www.vex.net/~darcy/20120428_4307.jpg

FWIW, this backtrace basically shows you broke into ddb from the
idle world.  ie, it's relatively info-less.

what does "ps" in ddb show?  etc.  look for processes you expect
to be running and see what their bt is.

thanks.

.mrg.

D'Arcy Cain | 10 May 2012 18:15

Re: occasional system lock up in 6.0_BETA and 6.99.5

On 12-05-10 02:52 AM, matthew green wrote:
>> Meanwhile, I am still hanging.  For those that missed it on the other
>> list, here is the backtrace from the kernel debugger after a hang:
>>
>> http://www.vex.net/~darcy/20120428_4307.jpg
>
> FWIW, this backtrace basically shows you broke into ddb from the
> idle world.  ie, it's relatively info-less.

I thought it might be but I am no expert on reading them.

> what does "ps" in ddb show?  etc.  look for processes you expect
> to be running and see what their bt is.

I will do this the next time it hangs, probably by tomorrow morning.
Any other suggestions while I am in the debugger?

-- 
D'Arcy J.M. Cain <darcy <at> NetBSD.org>
http://www.NetBSD.org/ IM:darcy <at> Vex.Net

Bernd Ernesti | 10 May 2012 23:17

Re: occasional system lock up in 6.0_BETA and 6.99.5

[..]

It would be very nice if not everyone thinks it is a good
idea to add everyone to the cc: or to: line.
It is enough to send it to the mailing lists and not to
everyone else who might be interested to get tons of
duplicated mails.

Bernd

P.S. I expect NOT to get a personal reply to this mail.

Thor Lancelot Simon | 11 May 2012 02:47

Re: occasional system lock up in 6.0_BETA and 6.99.5

On Thu, May 10, 2012 at 11:17:56PM +0200, Bernd Ernesti wrote:
> [..]
> 
> It would be very nice if not everyone thinks it is a good
> idea to add everyone to the cc: or to: line.
> It is enough to send it to the mailing lists and not to
> everyone else who might be interested to get tons of
> duplicated mails.
> 
> Bernd
> 
> P.S. I expect to get NOT a personal reply to this mail.

If you don't want a personal response, you could try setting Reply-To:
to where you do want the response to go...

-- 
Thor Lancelot Simon	                                     tls <at> panix.com
  "The liberties...lose much of their value whenever those who have greater
   private means are permitted to use their advantages to control the course
   of public debate."					-John Rawls

D'Arcy Cain | 24 May 2012 14:29

Re: occasional system lock up in 6.0_BETA and 6.99.5

On 12-05-10 12:15 PM, D'Arcy Cain wrote:
> On 12-05-10 02:52 AM, matthew green wrote:
>> what does "ps" in ddb show? etc. look for processes you expect
>> to be running and see what their bt is.
>
> I will do this the next time it hangs, probably by tomorrow morning.
> Any other suggestions while I am in the debugger?

I haven't done this because the latest kernel that I built on May 5
has been stable.  Whatever the issue was seems to have been fixed.

-- 
D'Arcy J.M. Cain <darcy <at> NetBSD.org>
http://www.NetBSD.org/ IM:darcy <at> Vex.Net

Tom Ivar Helbekkmo | 8 May 2012 13:22

Re: occational system lock up in 6.0_BETA and 6.99.5

I wrote:

> Bernd Ernesti <netbsd <at> lists.veego.de> writes:
>
>> As D'Arcy I had a lock up on i386 too.
>
> NetBSD/amd64 is hanging itself up during heavy loads, too - there are
> several reports on this currently being discussed on port-amd64.

It turns out the problem biting my amd64 box was specific to that port.

-tih
-- 
"The market" is a bunch of 28-year-olds who don't know anything. --Paul Krugman

D'Arcy Cain | 8 May 2012 18:28

Re: occational system lock up in 6.0_BETA and 6.99.5

On 12-05-08 07:22 AM, Tom Ivar Helbekkmo wrote:
> It turns out the problem biting my amd64 box was specific to that port.

Yes, the discussion has branched to the i386 mailing list for my issue.

-- 
D'Arcy J.M. Cain <darcy <at> NetBSD.org>
http://www.NetBSD.org/ IM:darcy <at> Vex.Net

Toru Nishimura | 3 May 2012 05:47

Re: occational system lock up in 6.0_BETA and 6.99.5

More about system lockup under heavy load.

> ~# ... DDB break in works ...
> 
> db> bt
> 0x0060bd90: at comintr+0x590
> 0x0060bde0: at pic_handle_intr+0x198
> 0x0060be20: at trapstart+0x684
> 0x0060bef0: at sched_curcpu_runnable_p+0x2c <<< THIS <<<
> 0x0060bf00: at idle_loop+0xe8
> 0x0060bf20: at setfunc_trampoline+0x8
> saved LR(0x7ffffd) is invalid.
> ...
>
> The last operation
> I did is invoking "ps xa" and DDB shows no ps process.

It's now quite obvious that this is the "scheduler stuck" case.  The DDB
traceback is the same as in the GCC 4.5 genautomata lockup case.

db> bt
0x0060bd90: at comintr+0x590
0x0060bde0: at pic_handle_intr+0x198
0x0060be20: at trapstart+0x684
0x0060bef0: at sched_curcpu_runnable_p+0x2c <<< HERE <<<
0x0060bf00: at idle_loop+0xe8
0x0060bf20: at cpu_lwp_bootstrap+0xc
saved LR(0x7ffffd) is invalid.
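
To compare the two backtraces frame by frame rather than by eye, something like the following could be used. It is a sketch only: the regular expression is an assumption based on the lines shown in this thread, not a description of DDB's output format in general, and the names are illustrative.

```python
import re

# Matches lines such as "0x0060bd90: at comintr+0x590" (format assumed from this thread).
FRAME_RE = re.compile(r"0x[0-9a-f]+: at (\w+)\+0x([0-9a-f]+)")

QUOTED_BT = """\
> 0x0060bd90: at comintr+0x590
> 0x0060bde0: at pic_handle_intr+0x198
> 0x0060be20: at trapstart+0x684
> 0x0060bef0: at sched_curcpu_runnable_p+0x2c <<< THIS <<<
> 0x0060bf00: at idle_loop+0xe8
> 0x0060bf20: at setfunc_trampoline+0x8
> saved LR(0x7ffffd) is invalid.
"""

NEW_BT = """\
0x0060bd90: at comintr+0x590
0x0060bde0: at pic_handle_intr+0x198
0x0060be20: at trapstart+0x684
0x0060bef0: at sched_curcpu_runnable_p+0x2c <<< HERE <<<
0x0060bf00: at idle_loop+0xe8
0x0060bf20: at cpu_lwp_bootstrap+0xc
saved LR(0x7ffffd) is invalid.
"""


def parse_bt(text):
    """Return a list of (symbol, offset) tuples, one per recognised frame line."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group(1), int(match.group(2), 16)))
    return frames


def differing_frames(bt_a, bt_b):
    """Indexes of frames whose (symbol, offset) differ; assumes equal depth."""
    return [i for i, (a, b) in enumerate(zip(bt_a, bt_b)) if a != b]


if __name__ == "__main__":
    quoted, new = parse_bt(QUOTED_BT), parse_bt(NEW_BT)
    for i in differing_frames(quoted, new):
        print(i, quoted[i], new[i])
```

For the two traces in this message, only the outermost frame differs (setfunc_trampoline+0x8 versus cpu_lwp_bootstrap+0xc); the frames from idle_loop inward are identical.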

Toru Nishimura / ALKYL Technology

Toru Nishimura | 3 May 2012 08:33

Re: occational system lock up in 6.0_BETA and 6.99.5

Some more rumbling on system lockup under heavy disk I/O.

I found that the dump lockup happens with the combination of WAPBL and
a dump -X snapshot.

- A standard dump without the -X snapshot works both with and without
WAPBL.
- A snapshot dump of a filesystem without WAPBL (/sbin/mount -o update,nolog)
works.

Amusingly, re-enabling WAPBL (-o update,log) during a dump -X is fine, but
the next invocation of dump -X locks up some seconds later.
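
The observations so far can be written down as a small table, which makes the failing combination easy to pick out. This is just a restatement of the results reported above in Python; the names are hypothetical, and the "re-enable WAPBL during a running dump -X" case is left out because it does not fit a simple on/off grid.

```python
# Outcomes as reported in this message, keyed by (wapbl_enabled, snapshot_dump).
# Names and structure are illustrative; nothing here talks to a real system.
OBSERVED = {
    (False, False): "ok",      # plain dump, filesystem mounted with nolog
    (True, False): "ok",       # plain dump, WAPBL enabled
    (False, True): "ok",       # dump -X, filesystem mounted with nolog
    (True, True): "lockup",    # dump -X, WAPBL enabled
}


def locking_combinations(results):
    """Return the (wapbl_enabled, snapshot_dump) pairs that ended in a lockup."""
    return sorted(combo for combo, outcome in results.items() if outcome == "lockup")


if __name__ == "__main__":
    for wapbl, snapshot in locking_combinations(OBSERVED):
        print(f"lockup with WAPBL={wapbl}, snapshot={snapshot}")
```

Neither WAPBL nor the snapshot triggers the lockup on its own in these tests; only the two together do.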

Toru Nishimura / ALKYL Technology

Ignatios Souvatzis | 3 May 2012 09:18

Re: occational system lock up in 6.0_BETA and 6.99.5

On Thu, May 03, 2012 at 03:33:36PM +0900, Toru Nishimura wrote:
> Some more rumbling on system lockup under heavy disk I/O.
> 
> I found that dump lockup happens on the combination of WAPBL and
> dump -X snapshot.
> 
> - Standard dump operation without -X snapshot does work either in WAPBL
> or no WAPBL case.
> - Snapshot dump of WAPBL-less (/sbin/mount -o update,nolog) filesys works.
> 
> Amusingly "reenabling WAPBL (-o update,log) during dump -X" is found Ok.
> And next invocation of dump -X ends up with lockup some seconds later.

I think I've seen this but with rsync, so I moved the backup script back
to snapshot-less rsync. This was with sparc64/5.99.mumble, and on a
production setup, so I couldn't experiment.

	-is

