Daniel Krambrock | 7 Jan 12:12 2011

Heartbeat dies with SIGXCPU, pacemaker ping RA syntax error

hi there,

we have a 12-node cluster for managing KVM-based virtual machines.
the nodes run Fedora 12 with pacemaker
(pacemaker-1.0.7-1.fc12.x86_64) and heartbeat
(heartbeat-3.0.0-0.7.0daab7da36a8.hg.fc12.x86_64).

heartbeat crashed with SIGXCPU:

Jan  2 01:21:11 node09 heartbeat: [31328]: WARN: Managed HBREAD process
25702 killed by signal 24 [SIGXCPU - CPU limit exceeded].
Jan  2 01:21:11 node09 heartbeat: [31328]: ERROR: Managed HBREAD process
25702 dumped core
Jan  2 01:21:11 node09 heartbeat: [31328]: ERROR: HBREAD process died.
Beginning communications restart process for comm channel 0.
Jan  2 01:21:11 node09 heartbeat: [31328]: WARN: Managed HBWRITE process
25701 killed by signal 9 [SIGKILL - Kill, unblockable].
Jan  2 01:21:11 node09 heartbeat: [31328]: ERROR: Both comm processes
for channel 0 have died.  Restarting.
Jan  2 01:21:11 node09 heartbeat: [31328]: info: glib: UDP multicast
heartbeat started for group 239.0.0.4 port 694 interface br_vlan1040
(ttl=1 loop=0)
Jan  2 01:21:11 node09 heartbeat: [31328]: info: Communications restart
succeeded.
Jan  2 01:21:12 node09 heartbeat: [22135]: info: Stack hogger failed
0xffffffff
Jan  2 01:21:12 node09 heartbeat: [22136]: info: Stack hogger failed
0xffffffff
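
For context on the log above: SIGXCPU is what the kernel sends when a process uses more CPU time than its soft RLIMIT_CPU limit allows. The following is only a minimal sketch of that mechanism in a throwaway child shell. The 1-second limit is an arbitrary demo value and says nothing about which limit heartbeat itself sets.

```shell
#!/bin/sh
# Sketch: a process that exceeds its soft CPU-time limit is killed by SIGXCPU.
# Assumption: the 1-second limit below is a demo value, not a heartbeat setting.
status=0
# In the child: disable core files, set only the soft CPU limit, then spin.
sh -c 'ulimit -c 0; ulimit -S -t 1; while :; do :; done' 2>/dev/null || status=$?

# A shell reports "killed by signal N" as exit status 128+N.
signal=$((status - 128))
signame=$(kill -l "$signal")
echo "child exit status: ${status} (signal ${signal}: ${signame})"
```

On Linux, the limits of an already running process can be read from /proc/<pid>/limits (the "Max cpu time" row), which may help to check what a live HBREAD process is actually running with.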

we figured out that if debug mode is turned on, heartbeat is setting a

Igor Chudov | 7 Jan 13:44 2011

Re: Heartbeat dies with SIGXCPU, pacemaker ping RA syntax error

I have the same problem (on Ubuntu).

Very interested in an answer.

i


Daniel Krambrock | 7 Jan 18:41 2011

Re: Heartbeat dies with SIGXCPU, pacemaker ping RA syntax error

hi there,

i think we found the reason for the syntax error in the ping RA:
the heartbeat crash produced a core dump
in /var/lib/heartbeat/cores/root, which is the working directory of
the ping RA. the ping RA uses an unquoted * symbol:

score=`expr $active * $OCF_RESKEY_multiplier`

since there is now a core dump in the working directory, the shell
expands the * to the file names in that directory, and expr fails
with a syntax error.
a patch for that would be:

--- ping        2011-01-07 18:31:35.000000000 +0100
+++ ping.new    2011-01-07 18:32:50.000000000 +0100
@@ -241,7 +241,7 @@
            *) ocf_log err "Unexpected result for '$p_exe $p_args $OCF_RESKEY_options $host' $rc: $p_out";;
        esac
     done
-    score=`expr $active * $OCF_RESKEY_multiplier`
+    score=`expr $active \* $OCF_RESKEY_multiplier`
     attrd_updater -n $OCF_RESKEY_name -v $score -d $OCF_RESKEY_dampen
 }
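
To make the effect of the unquoted * easier to see, here is a small sketch that reproduces it in a scratch directory. The variables active and multiplier are stand-ins for $active and $OCF_RESKEY_multiplier, and core.25702 is just an example file name; none of this is taken from the RA itself.

```shell
#!/bin/sh
# Sketch: reproduce the unquoted-* failure in a scratch directory.
# "active" and "multiplier" are stand-ins for the RA's variables.
active=3
multiplier=100

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Empty directory: the glob matches nothing, so the shell passes a literal *
# to expr and the unquoted form happens to work.
unquoted_empty=$(expr $active * $multiplier 2>/dev/null)

# Simulate a core dump appearing in the working directory.
touch core.25702

# Now * expands to "core.25702", so expr receives "3 core.25702 100"
# and fails with a syntax error (non-zero exit status, no result).
unquoted_status=0
unquoted_with_file=$(expr $active * $multiplier 2>/dev/null) || unquoted_status=$?

# Escaping the * keeps it literal regardless of the directory contents.
escaped=$(expr $active \* $multiplier)

echo "empty dir, unquoted: ${unquoted_empty}"
echo "core file, unquoted: '${unquoted_with_file}' (expr exit status ${unquoted_status})"
echo "core file, escaped:  ${escaped}"

cd / && rm -rf "$workdir"
```

An alternative that avoids the problem entirely would be POSIX arithmetic expansion, for example score=$((active * multiplier)), since no pathname expansion happens inside $(( )). Whether that suits the RA's portability requirements is a separate question.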

besides that, we still have the SIGXCPU problem.

bests

daniel

Igor Chudov | 7 Jan 18:55 2011

Re: Heartbeat dies with SIGXCPU, pacemaker ping RA syntax error

Very interested in the SIGXCPU problem.

I cannot deploy my solution with it.

i


