Milan Jovanovic | 25 Jul 21:35

process-enable issue

Hi, I have problems with multi-threading on Linux; I think it's the same issue as "http://trac.clozure.com/openmcl/ticket/297".
At first I got "Unable to enable process #<PROCESS ...have been trying for 1 seconds" and an inferior-lisp segmentation fault after 2-3 hours of running (this was on SUSE LINUX 10.0 X86-64 2.6.13-15-smp).

After Gary Byers' suggestion that it may be a Linux kernel bug, I tried on SUSE Server 10 (x86_64), kernel 2.6.24. After more than a day of running with no errors I saw one more "Unable to enable process #<PROCESS ...have been trying for 1 seconds", but this time with no segmentation fault.
So I'm asking: is it a problem/bug whenever this happens, or only when a segmentation fault follows?

By the way, I tried the code on SBCL to be sure that it's not something in the code itself, and it has been running for a couple of days with no problems.

Thanks
Best, Milan

_______________________________________________
Openmcl-devel mailing list
Openmcl-devel <at> clozure.com
http://clozure.com/mailman/listinfo/openmcl-devel
Gary Byers | 25 Jul 22:25

Re: process-enable issue

In the original bug report, the backtrace for what was thread #35 showed

  (2AAAAD619B18) : 0 (PROCESS-ENABLE #<PROCESS Worker thread(38) [Active] #x300043A1C8ED> [...]) 405
  (2AAAAD619B68) : 1 (%PROCESS-RUN-FUNCTION '(:NAME "Worker thread") #<COMPILED-LEXICAL-CLOSURE (:INTERNAL ACL2::RUN-THREAD) #x300043A1CD7F> NIL) 1373
  (2AAAAD619C58) : 2 (PROCESS-RUN-FUNCTION "Worker thread" #<COMPILED-LEXICAL-CLOSURE (:INTERNAL ACL2::RUN-THREAD) #x300043A1CD7F> [...]) 213

and :proc showed

38 :    Worker thread  [Active] 
35 :    Worker thread  [semaphore wait]  (Requesting terminal input)
14 :    Worker thread  [semaphore wait] 
1 : -> listener     [Active] 
0 :    Initial      [Active]

In other words, thread 35 created thread 38 and was waiting for it
to signal a semaphore that would indicate that it's reset itself
and is ready to be enabled (given a function to run).  :PROC shows
that thread 38 is already running, which doesn't make much sense.
The Linux kernel that David Rager was running was one that allegedly
had just fixed a bug which could cause the wrong thread to be
awakened via FUTEX_WAIT, and it seemed plausible that that bug hadn't
really been fixed there.  The case that failed reliably for David
on the machine that David was using worked reliably for me, similar
cases seemed to work for others, and blaming this on something at
the OS level makes more sense than anything else that I can think
of.  (Another fuzzy explanation is that malloc() - when called
from two threads at the same time - returned the same block of
memory to both callers because of a locking problem, so two
threads wound up sharing the same "pointer to semaphore".)

There's a separate issue in that PROCESS-ENABLE waits for the target
thread to indicate that it's "ready" with a timeout of 1 second. 
That's usually long enough, but it's entirely arbitrary (how long
it actually takes depends on the load on and the whims of the
scheduler.)  Taking longer than a second might indicate that the
newly-created thread isn't getting enough CPU time to signal its
readiness to run.  The whole notion of having a timeout for
something that can take an indeterminate amount of time is
questionable, so it probably makes sense to not use a one-second
timeout in PROCESS-ENABLE by default, at the very least.
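
The timeout problem described above can be illustrated with a small sketch. It is written in Python rather than Lisp and is only an analogy, not CCL's code: `start_worker`, `demo`, and the 0.3 s / 0.1 s figures are all hypothetical. A worker that is merely slow to signal readiness makes a timed wait fail, while an untimed wait on the same semaphore succeeds a moment later.

```python
import threading
import time


def start_worker(startup_delay, ready):
    """Hypothetical worker that needs `startup_delay` seconds before it is ready."""
    def body():
        time.sleep(startup_delay)  # stands in for scheduler latency under load
        ready.release()            # signal "I have reset myself and can be enabled"

    worker = threading.Thread(target=body, daemon=True)
    worker.start()
    return worker


def demo():
    ready = threading.Semaphore(0)
    worker = start_worker(0.3, ready)

    # Timed wait: gives up after 0.1 s even though nothing is actually wrong.
    timed_out = not ready.acquire(timeout=0.1)

    # Untimed wait: simply blocks until the worker gets around to signalling.
    became_ready = ready.acquire()

    worker.join()
    return timed_out, became_ready


if __name__ == "__main__":
    print(demo())
```

In this sketch the timed wait reports a failure purely because the worker was slow, which is the situation the arbitrary one-second limit can produce on a loaded machine.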

Can you tell whether it was the first case (where PROCESS-ENABLE
was waiting to enable a thread that - somehow - seems to have
already been enabled) or the second (the one-second timeout is
too short, and quite possibly the entire idea of a timeout is
misguided)?

In the former case, the thread being enabled would be on the
list returned by (ALL-PROCESSES) or in the output displayed
by :PROC, and in the latter case it wouldn't.

On Fri, 25 Jul 2008, Milan Jovanovic wrote:

> Hi, I have problems with multi-threading on Linux; I think it's the same
> issue as "http://trac.clozure.com/openmcl/ticket/297".
> At first I got "Unable to enable process #<PROCESS ...have been trying for 1
> seconds" and an inferior-lisp segmentation fault after 2-3 hours of running
> (this was on SUSE LINUX 10.0 X86-64 2.6.13-15-smp).
>
> After Gary Byers' suggestion that it may be a Linux kernel bug, I tried on
> SUSE Server 10 (x86_64), kernel 2.6.24. After more than a day of running
> with no errors I saw one more "Unable to enable process #<PROCESS ...have
> been trying for 1 seconds", but this time with no segmentation fault.
> So I'm asking: is it a problem/bug whenever this happens, or only when a
> segmentation fault follows?
>
> By the way, I tried the code on SBCL to be sure that it's not something in
> the code itself, and it has been running for a couple of days with no problems.
>
> Thanks
> Best, Milan
>
David Rager | 26 Jul 00:13

Re: process-enable issue

In the case described, when I did (:y 35) and typed :go (or whatever made the Lisp system ignore the warning), IIRC, it all worked.  Therefore, IIRC, it's probably the latter, where one second isn't enough (or something new is occurring that keeps threads from being swapped in as often).

The thing that may indicate that it's not an OS problem is that this just started happening when I upgraded to the RC version of CCL (RC 1.2?).  I can ask our IT department whether there was an OS change during this period, if you would find that relevant.  RC 1.2 fixed another OpenMCL problem (which I was quite pleased about), so it wasn't like I could just keep using the old OpenMCL.

At least now our group is no longer the only group seeing and reporting this behavior.

Gary Byers | 26 Jul 00:46

Re: process-enable issue

My mistake: if you just do:

? (make-process "foo")

the process will run a little bit of code, add itself to the list of
all processes, then signal a semaphore and wait to be preset and
enabled.
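
The startup sequence just described (register, signal, then wait to be preset and enabled) can be sketched as follows. This is a Python analogy under stated assumptions, not CCL's implementation: the `Process` class, its `preset` and `enable` methods, and the `all_processes` list are hypothetical stand-ins for the Lisp operations of the same names.

```python
import threading

all_processes = []                 # stand-in for the list of all processes
registry_lock = threading.Lock()


class Process:
    """Hypothetical model of a freshly made process."""

    def __init__(self, name):
        self.name = name
        self.ready = threading.Semaphore(0)  # signalled once startup is done
        self.enabled = threading.Event()     # set by enable()
        self.function = None                 # filled in by preset()
        self.result = None
        self.thread = threading.Thread(target=self._startup, daemon=True)
        self.thread.start()

    def _startup(self):
        with registry_lock:
            all_processes.append(self)       # add itself to the process list
        self.ready.release()                 # signal "initialized"
        self.enabled.wait()                  # block until preset and enabled
        self.result = self.function()

    def preset(self, function):
        self.function = function

    def enable(self):
        self.ready.acquire()                 # wait for the startup signal
        self.enabled.set()


def demo():
    p = Process("foo")
    p.preset(lambda: 42)
    p.enable()
    with registry_lock:
        listed = p in all_processes
    p.thread.join()
    return listed, p.result
```

The point of the sketch is that the new thread is already on the process list before anyone enables it, so its presence there says nothing about whether it has been enabled.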

In David's case, the creating thread's wait had timed out, but by the
time he did :proc, interrupted the waiting thread, and printed a
backtrace, the thread was initialized and ready to go, and its
whostate was "Active".  That's a change in how whostates are
implemented; in 1.1, the newly-reset thread would have reported itself
as "Reset" instead of "Active", and the former's more accurate.  The
thread isn't really "Active" - it's still waiting to be preset and
enabled - and I started postulating that the thread had somehow been
enabled due to very low-level wires getting crossed somewhere.

So, there are two bugs here:

1) the whole idea of a timeout in PROCESS-ENABLE is wrong (since we
don't generally know how long it'll take for the target thread to
get ready to run), and we should just wait indefinitely.

2) a newly-created or newly-reset thread should not have a whostate of
"Active"; that's an unintentional change which can cause at least one
person (the person who made the change) to get very confused.
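
Both points can be shown together in a short sketch. Again this is a Python analogy rather than the actual fix: the `Worker` class and its `whostate` attribute are hypothetical, and the state names are simply the two strings mentioned above. The enabling side waits with no timeout, and the thread reports "Reset" until it has really been enabled.

```python
import threading


class Worker:
    """Hypothetical thread wrapper that tracks a whostate string."""

    def __init__(self):
        self.whostate = "Reset"              # created, but not yet enabled
        self.ready = threading.Semaphore(0)
        self.enabled = threading.Event()
        self.running = threading.Event()
        self.finish = threading.Event()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        self.ready.release()                 # initialized, but still idle
        self.enabled.wait()                  # whostate stays "Reset" while waiting
        self.whostate = "Active"             # only now is real work under way
        self.running.set()
        self.finish.wait()

    def enable(self):
        self.ready.acquire()                 # no timeout: wait as long as it takes
        self.enabled.set()


def demo():
    w = Worker()
    before = w.whostate                      # observed before enabling
    w.enable()
    w.running.wait()
    during = w.whostate                      # observed while the worker runs
    w.finish.set()
    w.thread.join()
    return before, during
```

With the state reported this way, a listing taken after a timed-out wait would show the thread as "Reset", which does not suggest that it had somehow been enabled already.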

Sorry; will fix.

David Rager | 26 Jul 02:24

Re: process-enable issue

Great!  I'll update my image of CCL and give it a try in a few days or when you say it's fixed.  Thanks!

David Rager | 4 Aug 22:43

Re: process-enable issue

I didn't test it extensively, but the updated sources and build seem to be working.  Thanks!

Milan Jovanovic | 26 Jul 13:07

Re: process-enable issue

I think it's the case where process-enable tries to enable a process that is already running.
If I understand the manual correctly, process-run-function uses process-enable when creating a process. In my case I create a fixed number of worker processes at the start of the program, and I get these messages/errors after hours of running (when no more processes are being created), so why is process-enable involved then?
If I've got this completely wrong ... sorry :)

