Keith Browne | 4 May 2012 06:00

Timers becoming unscheduled

We're using timers to get some loose real-time behavior in a couple of 
applications.  We've found that over the long term, our timers silently fail; 
they stop calling the function registered at timer creation.  This is happening 
both with timers that are scheduled to recur with a fixed repeat interval and 
with timers without a repeat interval--in the latter case, we are rescheduling 
the timer in the callback function itself.  In all cases, we're setting :thread 
t when calling sb-ext:make-timer, so as to have the callbacks execute in their 
own threads.
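
For readers following along, here is a minimal sketch of the two setups described above. It assumes an SBCL build with thread support; TICK, START-RECURRING, MAKE-SELF-RESCHEDULING-TIMER, and the intervals are hypothetical names and values chosen for illustration, not code from our application.

```lisp
;; Illustrative only: TICK stands in for the real callback.
(defvar *ticks* 0)
(defvar *ticks-lock* (sb-thread:make-mutex :name "ticks"))

(defun tick ()
  ;; Callbacks run in their own threads (:thread t), so guard shared state.
  (sb-thread:with-mutex (*ticks-lock*)
    (incf *ticks*)))

;; Style 1: a timer scheduled once with a fixed repeat interval.
(defvar *recurring-timer*
  (sb-ext:make-timer #'tick :name "recurring" :thread t))

(defun start-recurring (&optional (interval 2))
  (sb-ext:schedule-timer *recurring-timer* interval
                         :repeat-interval interval))

;; Style 2: a one-shot timer whose callback reschedules the timer itself.
(defun make-self-rescheduling-timer (function delay)
  (let (timer)
    (setf timer
          (sb-ext:make-timer
           (lambda ()
             (funcall function)
             ;; No repeat interval: schedule the next run by hand.
             (sb-ext:schedule-timer timer delay))
           :name "self-rescheduling"
           :thread t))
    timer))
```

The second style is started with something like (sb-ext:schedule-timer (make-self-rescheduling-timer #'tick 2) 2).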

The failure we're seeing is happening when we're connecting to an SBCL instance 
using Emacs and SLIME.  We've run some tests in which we run SBCL directly from 
the shell, and we haven't seen a failure in that case.  The problem may be 
related to signal handling when SLIME/SWANK is running.

When we're just running a handful of timers that keep rescheduling themselves 
every few seconds, we can run our application for days or weeks before we see a 
failure.  When one timer fails, though, it seems to take all the rest of the 
timers in the image with it--also suggestive of a problem in signal handling. 
We've run test cases with hundreds or even thousands of timers turning over at 
once, and in those cases we usually see a failure within minutes, or within 
an hour or two.

Some sample code demonstrating the problem with recurring timers is available 
at http://www.deepsky.com/~tuxedo/exercise-timers.tar.gz.

Is this a known problem with the timer implementation in SBCL?  In order to be 
certain that our application won't stop executing timer-related callbacks, 
should we have a dedicated thread that sleeps and periodically wakes up to 
reschedule all the timers?  Or an external process that tickles our Lisp 
program for the same purpose?
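
The dedicated-thread idea could be sketched roughly as follows. This is an assumption-laden illustration, not a tested workaround: WATCHED, MAKE-WATCHED-TIMER, STALE-P, REVIVE, and START-WATCHDOG are hypothetical names, and whether rescheduling actually revives timers after the failure depends on the root cause, which has not been established. The sketch records a heartbeat in each callback and reschedules any timer whose heartbeat has gone stale.

```lisp
;; Hypothetical watchdog sketch; assumes SBCL with thread support.
(defstruct watched
  timer                              ; the sb-ext timer object
  interval                           ; repeat interval in seconds
  (last-fire (get-universal-time)))  ; heartbeat, updated by the callback

(defun make-watched-timer (function interval)
  "Create and schedule a recurring timer that records when it last fired."
  (let ((w (make-watched :interval interval)))
    (setf (watched-timer w)
          (sb-ext:make-timer
           (lambda ()
             (setf (watched-last-fire w) (get-universal-time))
             (funcall function))
           :thread t))
    (sb-ext:schedule-timer (watched-timer w) interval
                           :repeat-interval interval)
    w))

(defun stale-p (w now &optional (slack 3))
  "True if W has not fired for more than SLACK intervals as of NOW."
  (> (- now (watched-last-fire w))
     (* slack (watched-interval w))))

(defun revive (w)
  "Unschedule and reschedule W's timer, resetting its heartbeat."
  (sb-ext:unschedule-timer (watched-timer w))
  (setf (watched-last-fire w) (get-universal-time))
  (sb-ext:schedule-timer (watched-timer w) (watched-interval w)
                         :repeat-interval (watched-interval w)))

(defun start-watchdog (watched-list &key (period 10))
  "Start a thread that wakes every PERIOD seconds and revives stale timers."
  (sb-thread:make-thread
   (lambda ()
     (loop
       (sleep period)
       (let ((now (get-universal-time)))
         (dolist (w watched-list)
           (when (stale-p w now)
             (revive w))))))
   :name "timer-watchdog"))
```

The heartbeat is used because, as far as we can tell, checking sb-ext:timer-scheduled-p alone may not detect a recurring timer that has silently stopped firing.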

Nikodemus Siivola | 4 May 2012 13:10

Re: Timers becoming unscheduled

On 4 May 2012 07:00, Keith Browne <tuxedo <at> deepsky.com> wrote:

> The failure we're seeing is happening when we're connecting to an SBCL instance
> using Emacs and SLIME.  We've run some tests in which we run SBCL directly from
> the shell, and we haven't seen a failure in that case.  The problem may be
> related to signal handling when SLIME/SWANK is running.
>
> When we're just running a handful of timers that keep rescheduling themselves
> every few seconds, we can run our application for days or weeks before we see a
> failure.  When one timer fails, though, it seems to take all the rest of the
> timers in the image with it--also suggestive of a problem in signal handling.
> We've run test cases with hundreds or even thousands of timers turning over at
> once, and in those cases we usually see a failure within minutes, or within
> an hour or two.
>
> Some sample code demonstrating the problem with recurring timers is available
> at http://www.deepsky.com/~tuxedo/exercise-timers.tar.gz.
>
> Is this a known problem with the timer implementation in SBCL?  In order to be
> certain that our application won't stop executing timer-related callbacks,
> should we have a dedicated thread that sleeps and periodically wakes up to
> reschedule all the timers?  Or an external process that tickles our Lisp
> program for the same purpose?

No, this is not a known problem -- or at least it doesn't ring any
bells for me. I have your test running here, and will try to
reproduce.

A few extra questions:


Nikodemus Siivola | 4 May 2012 13:56

Re: Timers becoming unscheduled

On 4 May 2012 14:10, Nikodemus Siivola <nikodemus <at> random-state.net> wrote:

> No, this is not a known problem -- or at least it doesn't ring any
> bells for me. I have your test running here, and will try to
> reproduce.

OK, I can reproduce this. I'm almost out of time for today, but I
should have a couple of hours tomorrow.

Cheers,

 -- nikodemus

Keith Browne | 4 May 2012 17:17

Re: Timers becoming unscheduled

On Fri, 4 May 2012, Nikodemus Siivola wrote:

> No, this is not a known problem -- or at least it doesn't ring any
> bells for me. I have your test running here, and will try to
> reproduce.
>
> A few extra questions:
>
> - SBCL version?

We're running 1.0.56 right now, but we've seen the same problem on a few 
earlier versions.

> - Output from uname -a?

The two machines I can reach handily for now are:

Linux lowry 3.0.0-17-generic #30-Ubuntu SMP Thu Mar 8 20:45:39 UTC 2012 
x86_64 x86_64 x86_64 GNU/Linux

Linux coruscant 2.6.32-5-686 #1 SMP Mon Oct 3 04:15:24 UTC 2011 i686

> - How exactly are you running the test? You say you connect to SBCL:
> do you mean just M-x slime, or do you have SBCL already running with a
> swank server that you connect to?

We've been running under M-x slime.

> OK, I can reproduce this. I'm almost out of time for today, but I
> should have a couple of hours tomorrow.

Nikodemus Siivola | 5 May 2012 09:48

Re: Timers becoming unscheduled

On 4 May 2012 18:17, Keith Browne <tuxedo <at> deepsky.com> wrote:
> On Fri, 4 May 2012, Nikodemus Siivola wrote:
>
>> No, this is not a known problem -- or at least it doesn't ring any
>> bells for me. I have your test running here, and will try to
>> reproduce.
>>
>> A few extra questions:
>>
>> - SBCL version?
>
>
> We're running 1.0.56 right now, but we've seen the same problem on a few
> earlier versions.
>
>
>> - Output from uname -a?
>
>
> The two machines I can reach handily for now are:
>
> Linux lowry 3.0.0-17-generic #30-Ubuntu SMP Thu Mar 8 20:45:39 UTC 2012
> x86_64 x86_64 x86_64 GNU/Linux
>
> Linux coruscant 2.6.32-5-686 #1 SMP Mon Oct 3 04:15:24 UTC 2011 i686
>
>
>> - How exactly are you running the test? You say you connect to SBCL:
>> do you mean just M-x slime, or do you have SBCL already running with a
>> swank server that you connect to?

