31 Dec 2011 04:28
31 Dec 2011 05:13
Re: Changeset 0306c5a64775
Roderic Morris <roderyc <at> gmail.com>
2011-12-31 04:13:16 GMT
2011-12-31 04:13:16 GMT
If I'm looking at the right revision, it's to allow the use of wait() without triggering deadlock detection -Roderic On Fri, Dec 30, 2011 at 10:28 PM, Robert Ransom <rransom.8774 <at> gmail.com> wrote: > What is changeset 0306c5a64775 intended to do? > > > Robert Ransom >
31 Dec 2011 12:01
Re: Changeset 0306c5a64775
Robert Ransom <rransom.8774 <at> gmail.com>
2011-12-31 11:01:46 GMT
2011-12-31 11:01:46 GMT
On 2011-12-31, Roderic Morris <roderyc <at> gmail.com> wrote: > If I'm looking at the right revision, it's to allow the use of wait() > without triggering deadlock detection It might do that, but it breaks get-address-info from the net-addresses package (which (ab)uses external-event UIDs in the same way). See attached for a script that hangs S48. (I was expecting the call to get-address-info to dislodge another of the debug-message calls, but it just wedged instead.) Robert Ransom
31 Dec 2011 16:32
Re: Changeset 0306c5a64775
Roderic Morris <roderyc <at> gmail.com>
2011-12-31 15:32:42 GMT
2011-12-31 15:32:42 GMT
Right, I noted that and had a similar example when I submitted the patch. The external-events package doesn't work as advertised, but I couldn't figure out how to fix it. It's really the only mechanism that can be used to do both library calls correctly though. -Roderic On Sat, Dec 31, 2011 at 6:01 AM, Robert Ransom <rransom.8774 <at> gmail.com> wrote: > On 2011-12-31, Roderic Morris <roderyc <at> gmail.com> wrote: >> If I'm looking at the right revision, it's to allow the use of wait() >> without triggering deadlock detection > > It might do that, but it breaks get-address-info from the > net-addresses package (which (ab)uses external-event UIDs in the same > way). See attached for a script that hangs S48. (I was expecting the > call to get-address-info to dislodge another of the debug-message > calls, but it just wedged instead.) > > > Robert Ransom
1 Jan 2012 22:41
Re: Changeset 0306c5a64775
Robert Ransom <rransom.8774 <at> gmail.com>
2012-01-01 21:41:41 GMT
2012-01-01 21:41:41 GMT
On 2011-12-31, Roderic Morris <roderyc <at> gmail.com> wrote: > Right, I noted that and had a similar example when I submitted the patch. > > The external-events package doesn't work as advertised, but I couldn't > figure out how to fix it. It's really the only mechanism that can be > used to do both library calls correctly though. See attached for a patch that fixes one clear bug in the external-events package. That doesn't fix the problem with wait-for-child-process. In order to fix the general problem of incorrect deadlock detection, I think we need: 1. a way to mark thread queues as ‘blocking on an external event’ for deadlock-detection purposes 2. a way to mark condvars and placeholders as ‘blocking on an external event’ (getaddrinfo and wait-for-child-process should be using placeholders; the posix-signals package should be using condvars) 3. support for registering long-term handlers to handle every occurrence of an external event UID 4. an as-general-as-possible ‘external asynchronous result’ system to handle placeholders which should be filled in from C For piece 1 (and 2), we could try iterating through a list of weak references to thread queues to determine whether any threads are blocked on one of the thread queues, but we might need to optimize that data structure if many of those thread queues turn out to not have any threads on them.(Continue reading)
12 Jun 2012 20:32
Re: Changeset 0306c5a64775
Michael Sperber <sperber <at> deinprogramm.de>
2012-06-12 18:32:57 GMT
2012-06-12 18:32:57 GMT
Robert Ransom <rransom.8774 <at> gmail.com> writes: > See attached for a patch that fixes one clear bug in the > external-events package. That doesn't fix the problem with > wait-for-child-process. > > In order to fix the general problem of incorrect deadlock detection, I > think we need: > > 1. a way to mark thread queues as ‘blocking on an external event’ for > deadlock-detection purposes > 2. a way to mark condvars and placeholders as ‘blocking on an external > event’ (getaddrinfo and wait-for-child-process should be using > placeholders; the posix-signals package should be using condvars) > 3. support for registering long-term handlers to handle every > occurrence of an external event UID > 4. an as-general-as-possible ‘external asynchronous result’ system to > handle placeholders which should be filled in from C So I finally fixed (I hope) these issues - it took a whole bunch of changes, along the lines of what Robert suggested: - Threads can, when blocking on an external event or something similar, explicitly say that the blocking does not contribute to deadlock. - `wait-for-child-process' can now use a very simple implementation using placeholders. - The race with external-event uids is fixed by allocating the uids in(Continue reading)
14 Jun 2012 14:53
Re: Changeset 0306c5a64775
Roderic Morris <roderyc <at> gmail.com>
2012-06-14 12:53:19 GMT
2012-06-14 12:53:19 GMT
Great! Related tests pass in scsh as well. Code looks good as well from what I've been able to check. I'm glad to see the ball moving on this, and I'm trying to do the same with my scheme48 based projects.
One issue that you may or may not consider a release blocker though: scheme48 is almost unusably slow and resource intensive on OSX. It seems to be to happen when loading code, for instance ,opening posix takes at least a minute of 100% usage of a cpu core. Opening packages that don't use the FFI also seem to be unusually slow, though not nearly so. Even starting scheme48 has a noticeable pause. I believe it's related to the change to an llvm based compiler in the latest OSX releases. scheme48 1.8 doesn't seem to be affected though.
-Roderic
On Tuesday, June 12, 2012 at 2:32 PM, Michael Sperber wrote:
Robert Ransom <rransom.8774 <at> gmail.com> writes:See attached for a patch that fixes one clear bug in theexternal-events package. That doesn't fix the problem withwait-for-child-process.In order to fix the general problem of incorrect deadlock detection, Ithink we need:1. a way to mark thread queues as ‘blocking on an external event’ fordeadlock-detection purposes2. a way to mark condvars and placeholders as ‘blocking on an externalevent’ (getaddrinfo and wait-for-child-process should be usingplaceholders; the posix-signals package should be using condvars)3. support for registering long-term handlers to handle everyoccurrence of an external event UID4. an as-general-as-possible ‘external asynchronous result’ system tohandle placeholders which should be filled in from CSo I finally fixed (I hope) these issues - it took a whole bunch ofchanges, along the lines of what Robert suggested:- Threads can, when blocking on an external event or something similar,explicitly say that the blocking does not contribute to deadlock.- `wait-for-child-process' can now use a very simple implementationusing placeholders.- The race with external-event uids is fixed by allocating the uids inScheme rather than in C. It should now be possible to build ahigher-level abstraction for this with placeholders, but I haven'tdone that yet.While the tests all succeed, and I've mulled over the changes quite awhile, are quite sensitive: Review and further testing much appreciated!(Especially as this issue was the only one blocking the release.)Thanks to Robert and Roderic for analyzing the problems and for thepatches - extremely helpful!--Cheers =8-} MikeFriede, Völkerverständigung und überhaupt blabla
14 Jun 2012 15:47
Re: Changeset 0306c5a64775
Michael Sperber <sperber <at> deinprogramm.de>
2012-06-14 13:47:15 GMT
2012-06-14 13:47:15 GMT
Roderic Morris <roderyc <at> gmail.com> writes: > Great! Related tests pass in scsh as well. Code looks good as well > from what I've been able to check. I'm glad to see the ball moving on > this, and I'm trying to do the same with my scheme48 based projects. Thanks for trying this out! > One issue that you may or may not consider a release blocker though: > scheme48 is almost unusably slow and resource intensive on OSX. It > seems to be to happen when loading code, for instance ,opening posix > takes at least a minute of 100% usage of a cpu core. Opening packages > that don't use the FFI also seem to be unusually slow, though not > nearly so. Even starting scheme48 has a noticeable pause. I believe > it's related to the change to an llvm based compiler in the latest OSX > releases. scheme48 1.8 doesn't seem to be affected though. Yes, there's a bug in the LLVM shipped with Mac OS X / XCode these days: It contains a buggy prerelease of LLVM. I haven't found a way to work around it in Scheme 48, and it really does not seem worth it: I downloaded the clang+llvm binaries from the LLVM site, and those work just fine. -- -- Cheers =8-} Mike Friede, Völkerverständigung und überhaupt blabla
14 Jun 2012 18:49
Re: Changeset 0306c5a64775
Marcus Crestani <crestani <at> informatik.uni-tuebingen.de>
2012-06-14 16:49:20 GMT
2012-06-14 16:49:20 GMT
>>>>>"MS" == Michael Sperber <sperber <at> deinprogramm.de> writes:
MS> Yes, there's a bug in the LLVM shipped with Mac OS X / XCode these
MS> days: It contains a buggy prerelease of LLVM. I haven't found a way
MS> to work around it in Scheme 48, and it really does not seem worth
MS> it: I downloaded the clang+llvm binaries from the LLVM site, and
MS> those work just fine.
You can work around by turning off the compiler's optimization ("-O0").
Only "-O" with >0 causes the problem for me.
--
--
Marcus
14 Jun 2012 21:58
Re: Changeset 0306c5a64775
Roderic Morris <roderyc <at> gmail.com>
2012-06-14 19:58:29 GMT
2012-06-14 19:58:29 GMT
Perfect, both of those methods worked. Thanks.
-Roderic
On Thursday, June 14, 2012 at 12:49 PM, Marcus Crestani wrote:
"MS" == Michael Sperber <sperber <at> deinprogramm.de> writes:MS> Yes, there's a bug in the LLVM shipped with Mac OS X / XCode theseMS> days: It contains a buggy prerelease of LLVM. I haven't found a wayMS> to work around it in Scheme 48, and it really does not seem worthMS> it: I downloaded the clang+llvm binaries from the LLVM site, andMS> those work just fine.You can work around by turning off the compiler's optimization ("-O0").Only "-O" with >0 causes the problem for me.--Marcus
5 Jan 2012 18:15
Re: Changeset 0306c5a64775
Michael Sperber <sperber <at> deinprogramm.de>
2012-01-05 17:15:27 GMT
2012-01-05 17:15:27 GMT
Robert Ransom <rransom.8774 <at> gmail.com> writes: > On 2011-12-31, Roderic Morris <roderyc <at> gmail.com> wrote: >> If I'm looking at the right revision, it's to allow the use of wait() >> without triggering deadlock detection > > It might do that, but it breaks get-address-info from the > net-addresses package (which (ab)uses external-event UIDs in the same > way). Not quite: In `net-addresses', the external event is indeed triggered from the external code, which is not the case with the waitpid stuff, which just trampolines into the external events from Scheme. So I believe the solution is to do with the wait code as with the getaddrinfo case, namely to wait in a separate thread (or revert to the old implementation if threads are not available.) I'll look into doing this over the weekend. Meanwhile, thanks for the analysis and the patches (which look good) - I've pushed those. -- -- Cheers =8-} Mike Friede, Völkerverständigung und überhaupt blabla
6 Jan 2012 07:20
Re: Changeset 0306c5a64775
Robert Ransom <rransom.8774 <at> gmail.com>
2012-01-06 06:20:29 GMT
2012-01-06 06:20:29 GMT
On 2012-01-05, Michael Sperber <sperber <at> deinprogramm.de> wrote: > > Robert Ransom <rransom.8774 <at> gmail.com> writes: > >> On 2011-12-31, Roderic Morris <roderyc <at> gmail.com> wrote: >>> If I'm looking at the right revision, it's to allow the use of wait() >>> without triggering deadlock detection >> >> It might do that, but it breaks get-address-info from the >> net-addresses package (which (ab)uses external-event UIDs in the same >> way). > > Not quite: In `net-addresses', the external event is indeed triggered > from the external code, which is not the case with the waitpid stuff, > which just trampolines into the external events from Scheme. So I > believe the solution is to do with the wait code as with the getaddrinfo > case, namely to wait in a separate thread (or revert to the old > implementation if threads are not available.) I'll look into doing this > over the weekend. When I wrote that, I suspected that part of the problem with wait-for-child-process was that it dynamically allocated a new external-event UID for each event it wants to wait for, rather than using one external-event UID for a whole class of events to be waited for. The PreScheme code to handle external events seems to me to be designed for the latter case. getaddrinfo also uses a new UID for each event, rather than one UID for the whole class of getaddrinfo-completion events; I still suspect that that is why multiple simultaneous calls to wait-for-child-process broke a later call to getaddrinfo. See attached for a bundle which fixes wait-for-child-process. I'm no longer convinced that changeset 0306c5a64775 was wrong, but I think my bundle uses a cleaner approach, and it works with the current external-event and interrupt systems. Robert Ransom
6 Jan 2012 07:36
Re: Changeset 0306c5a64775
Robert Ransom <rransom.8774 <at> gmail.com>
2012-01-06 06:36:36 GMT
2012-01-06 06:36:36 GMT
On 2012-01-06, Robert Ransom <rransom.8774 <at> gmail.com> wrote: > See attached for a bundle which fixes wait-for-child-process. Oops. You want changeset fdd24fd71a02 (and its ancestors). See attached for a bundle containing only the good commits. Robert Ransom
6 Jan 2012 15:53
Re: Changeset 0306c5a64775
Michael Sperber <sperber <at> deinprogramm.de>
2012-01-06 14:53:28 GMT
2012-01-06 14:53:28 GMT
Thanks for looking into this! Robert Ransom <rransom.8774 <at> gmail.com> writes: > See attached for a bundle which fixes wait-for-child-process. I'm no > longer convinced that changeset 0306c5a64775 was wrong, but I think my > bundle uses a cleaner approach, and it works with the current > external-event and interrupt systems. I'm somewhat suspicious of the changes in the bundle, as it's not clear to me why deadlock should get erroneously signalled before the changes: After all, there's is special provision to *not* signal deadlock in the root scheduler - it calls `waiting-for-external-events?', and if that returns #t, no deadlock is assumed. (And I still think the right fix is to handle wait the same as getaddrinfo.) You've looked at the code in depth - did you consider this? > When I wrote that, I suspected that part of the problem with > wait-for-child-process was that it dynamically allocated a new > external-event UID for each event it wants to wait for, rather than > using one external-event UID for a whole class of events to be waited > for. The PreScheme code to handle external events seems to me to be > designed for the latter case. [...] > getaddrinfo also uses a new UID for each event, rather than one UID > for the whole class of getaddrinfo-completion events; I still suspect > that that is why multiple simultaneous calls to wait-for-child-process > broke a later call to getaddrinfo. It was designed for both cases, but in fact the getaddrinfo code was the original motivation for implementing it. It's needed there because there may be multiple simultaneous active calls to getaddrinfo from different threads, and they all need to be notified separately of completion. So, if that doesn't work correctly, there's a bug. -- -- Cheers =8-} Mike Friede, Völkerverständigung und überhaupt blabla
7 Jan 2012 20:05
Re: Changeset 0306c5a64775
Robert Ransom <rransom.8774 <at> gmail.com>
2012-01-07 19:05:39 GMT
2012-01-07 19:05:39 GMT
On 2012-01-06, Michael Sperber <sperber <at> deinprogramm.de> wrote: > > Thanks for looking into this! > > Robert Ransom <rransom.8774 <at> gmail.com> writes: > >> See attached for a bundle which fixes wait-for-child-process. I'm no >> longer convinced that changeset 0306c5a64775 was wrong, but I think my >> bundle uses a cleaner approach, and it works with the current >> external-event and interrupt systems. > > I'm somewhat suspicious of the changes in the bundle, as it's not clear > to me why deadlock should get erroneously signalled before the changes: > After all, there's is special provision to *not* signal deadlock in the > root scheduler - it calls `waiting-for-external-events?', and if that > returns #t, no deadlock is assumed. (And I still think the right fix is > to handle wait the same as getaddrinfo.) > > You've looked at the code in depth - did you consider this? Before Roderic's patch, waiting-for-external-events? didn't know that threads which were waiting on the process-id's placeholder were waiting for an external event, just as it still doesn't know that threads which are waiting on a signal queue are waiting for an external event. My branch provides a way to inform waiting-for-external-events? that threads blocked on a given synchronization object will be alerted when an external event of some sort occurs; this should be used for signal queues, too. Regarding getaddrinfo, I think the Right Thing is to make the Scheme interface to getaddrinfo fill in a placeholder, rather than using the external event system directly as it does now. Ideally, the external asynchronous result code would be written in such a way that both getaddrinfo and wait-for-child-process could use it. >> When I wrote that, I suspected that part of the problem with >> wait-for-child-process was that it dynamically allocated a new >> external-event UID for each event it wants to wait for, rather than >> using one external-event UID for a whole class of events to be waited >> for. The PreScheme code to handle external events seems to me to be >> designed for the latter case. [...] > >> getaddrinfo also uses a new UID for each event, rather than one UID >> for the whole class of getaddrinfo-completion events; I still suspect >> that that is why multiple simultaneous calls to wait-for-child-process >> broke a later call to getaddrinfo. > > It was designed for both cases, but in fact the getaddrinfo code was the > original motivation for implementing it. It's needed there because > there may be multiple simultaneous active calls to getaddrinfo from > different threads, and they all need to be notified separately of > completion. > > So, if that doesn't work correctly, there's a bug. There probably is a bug. It looks like the same bug in how interrupts are handled (when multiple interrupts of the same type arrive at about the same time, some seem to get 'lost' or delayed) that causes my test case for the POSIX signals package to fail. (That test case has been disabled since I wrote it because I couldn't figure out why it wasn't working or how to fix it.) Someone may have to put 'bread crumbs' into the VM and RTS in order to figure out exactly what is going wrong. Robert Ransom
7 Jan 2012 20:15
Re: Changeset 0306c5a64775
Michael Sperber <sperber <at> deinprogramm.de>
2012-01-07 19:15:07 GMT
2012-01-07 19:15:07 GMT
Robert Ransom <rransom.8774 <at> gmail.com> writes: > Before Roderic's patch, waiting-for-external-events? didn't know that > threads which were waiting on the process-id's placeholder were > waiting for an external event, just as it still doesn't know that > threads which are waiting on a signal queue are waiting for an > external event. Why would it have to know that? And, even it were necessary to know that, it would be easier to look at the condvar's waiters instead of marking the queues, no? I still feel I'm missing something important in your argument ... > Regarding getaddrinfo, I think the Right Thing is to make the Scheme > interface to getaddrinfo fill in a placeholder, rather than using the > external event system directly as it does now. OK - it still seems to me the hard part is communicating the information from C to Scheme somehow. But at this point I really just want to fix the deadlock issue and the potential bug before the release (which I really, really want to happen soon). After that, we can make bigger changes. -- -- Cheers =8-} Mike Friede, Völkerverständigung und überhaupt blabla
7 Jan 2012 20:41
Re: Changeset 0306c5a64775
Robert Ransom <rransom.8774 <at> gmail.com>
2012-01-07 19:41:27 GMT
2012-01-07 19:41:27 GMT
On 2012-01-07, Michael Sperber <sperber <at> deinprogramm.de> wrote: > > Robert Ransom <rransom.8774 <at> gmail.com> writes: > >> Before Roderic's patch, waiting-for-external-events? didn't know that >> threads which were waiting on the process-id's placeholder were >> waiting for an external event, just as it still doesn't know that >> threads which are waiting on a signal queue are waiting for an >> external event. > > Why would it have to know that? In the process-id case, threads waiting on a process ID's placeholder are awakened by process-terminated-children, which is called by the os-signal interrupt handler. Nothing waits on an external-event UID. In the signal-queue case, threads waiting on a signal queue's ready-for-read condvar are awakened by the os-signal interrupt handler pushing a signal object into the signal queue. Nothing waits on an external-event UID. If the deadlock-detection code isn't told that those synchronization objects will be alerted by (an interrupt handler triggered by) an external event of some sort, then it will falsely report that a deadlock has occurred. > And, even it were necessary to know > that, it would be easier to look at the condvar's waiters instead of > marking the queues, no? No, because we also need to look at placeholders' waiters, and the simplest way to handle both condvars and placeholders (as well as other synchronization ‘primitives’ which may be added in the future) is to look at their underlying thread queues. > I still feel I'm missing something important in your argument ... > >> Regarding getaddrinfo, I think the Right Thing is to make the Scheme >> interface to getaddrinfo fill in a placeholder, rather than using the >> external event system directly as it does now. > > OK - it still seems to me the hard part is communicating the information > from C to Scheme somehow. > > But at this point I really just want to fix the deadlock issue and the > potential bug before the release (which I really, really want to happen > soon). After that, we can make bigger changes. The interrupt system has been slightly broken for as long as I nave used Scheme 48, and I haven't figured out why. I think not using interrupts in a way that will definitely trigger that brokenness (i.e. not flooding it with multiple interrupts of the same type at once) is sufficient for the release. Robert Ransom
8 Jan 2012 01:09
Re: Changeset 0306c5a64775
Will Noble <will <at> cow9.org>
2012-01-08 00:09:15 GMT
2012-01-08 00:09:15 GMT
On Sat, Jan 07, 2012 at 07:41:27PM +0000, Robert Ransom wrote: > On 2012-01-07, Michael Sperber <sperber <at> deinprogramm.de> wrote: > > > > Robert Ransom <rransom.8774 <at> gmail.com> writes: > > > >> Before Roderic's patch, waiting-for-external-events? didn't know that > >> threads which were waiting on the process-id's placeholder were > >> waiting for an external event, just as it still doesn't know that > >> threads which are waiting on a signal queue are waiting for an > >> external event. > > > > Why would it have to know that? > > In the process-id case, threads waiting on a process ID's placeholder > are awakened by process-terminated-children, which is called by the > os-signal interrupt handler. Nothing waits on an external-event UID. I have a slightly different take on this. Deadlock detection occurs in ROOT-WAIT, which, besides consulting WAITING-FOR-EXTERNAL-EVENTS? also looks at threads blocking on I/O and (obviously) threads waiting for an alarm. Each of these three cases corresponds to a type of VM interrupt. So my idea of what ROOT-WAIT does here is to say that, if we have reason to believe a VM interrupt might wake a thread, then we don't need to declare deadlock. What's missing from this picture is OS signals. The VM recognizes OS signals as a type of interrupt, but ROOT-WAIT doesn't check if anything is waiting on one. I assumed the reason for this is because, while the run-time system maintains queues for the other cases, POSIX maintains sole responsibility for signal-waiters, and RTS doesn't call POSIX. This suggests to me that the RTS should at least keep track of the number of threads waiting for a signal, and POSIX could increment/decrement that number. Then there could be a function WAITING-ON-OS-SIGNAL? that ROOT-WAIT could check. Basically, I agree what Robert says, except that I don't think WAITING-FOR-EXTERNAL-EVENTS? should be extended to deal with things other than external events (in the narrow sense of the mechanism called "external events", the one with UIDs). Instead we should just ask "are there any threads that could wake up if we get an OS signal?" before calling deadlock. Best, Will
8 Jan 2012 07:29
Re: Changeset 0306c5a64775
Roderic Morris <roderyc <at> gmail.com>
2012-01-08 06:29:54 GMT
2012-01-08 06:29:54 GMT
I actually quite like the approach Robert's patches take. It has the advantage of expedience in this case, as it's already implemented and works well. Moreover, it provides a much more pleasant and useful interface (and one available in scheme, instead of through the FFI) than the UID approach, which seems to be broken in the general case. Will might be right in his thinking, but there's still the matter of library writers, outside of the libraries that are included in s48, that need to avoid deadlock when waiting on say, work being done in a separate pthread. If the UID approach doesn't work in general, why not give them something like what Robert's written? -Roderic On Sat, Jan 7, 2012 at 7:09 PM, Will Noble <will <at> cow9.org> wrote: > On Sat, Jan 07, 2012 at 07:41:27PM +0000, Robert Ransom wrote: >> On 2012-01-07, Michael Sperber <sperber <at> deinprogramm.de> wrote: >> > >> > Robert Ransom <rransom.8774 <at> gmail.com> writes: >> > >> >> Before Roderic's patch, waiting-for-external-events? didn't know that >> >> threads which were waiting on the process-id's placeholder were >> >> waiting for an external event, just as it still doesn't know that >> >> threads which are waiting on a signal queue are waiting for an >> >> external event. >> > >> > Why would it have to know that? >> >> In the process-id case, threads waiting on a process ID's placeholder >> are awakened by process-terminated-children, which is called by the >> os-signal interrupt handler. Nothing waits on an external-event UID. > > I have a slightly different take on this. Deadlock detection occurs in > ROOT-WAIT, which, besides consulting WAITING-FOR-EXTERNAL-EVENTS? also looks > at threads blocking on I/O and (obviously) threads waiting for an alarm. > Each of these three cases corresponds to a type of VM interrupt. So my idea > of what ROOT-WAIT does here is to say that, if we have reason to believe a > VM interrupt might wake a thread, then we don't need to declare deadlock. > > What's missing from this picture is OS signals. The VM recognizes OS signals > as a type of interrupt, but ROOT-WAIT doesn't check if anything is waiting > on one. I assumed the reason for this is because, while the run-time system > maintains queues for the other cases, POSIX maintains sole responsibility > for signal-waiters, and RTS doesn't call POSIX. > > This suggests to me that the RTS should at least keep track of the number of > threads waiting for a signal, and POSIX could increment/decrement that > number. Then there could be a function WAITING-ON-OS-SIGNAL? that ROOT-WAIT > could check. > > Basically, I agree what Robert says, except that I don't think > WAITING-FOR-EXTERNAL-EVENTS? should be extended to deal with things other > than external events (in the narrow sense of the mechanism called "external > events", the one with UIDs). Instead we should just ask "are there any > threads that could wake up if we get an OS signal?" before calling deadlock. > > Best, > Will
8 Jan 2012 13:05
Re: Changeset 0306c5a64775
Michael Sperber <sperber <at> deinprogramm.de>
2012-01-08 12:05:45 GMT
2012-01-08 12:05:45 GMT
Will Noble <will <at> cow9.org> writes: > What's missing from this picture is OS signals. The VM recognizes OS signals > as a type of interrupt, but ROOT-WAIT doesn't check if anything is waiting > on one. I assumed the reason for this is because, while the run-time system > maintains queues for the other cases, POSIX maintains sole responsibility > for signal-waiters, and RTS doesn't call POSIX. > > This suggests to me that the RTS should at least keep track of the number of > threads waiting for a signal, and POSIX could increment/decrement that > number. Then there could be a function WAITING-ON-OS-SIGNAL? that ROOT-WAIT > could check. Yup. I just remembered this from the deep past when I looked at the signal docs in the manual - where this problem is documented. I definitely want to fix this before the release. -- -- Cheers =8-} Mike Friede, Völkerverständigung und überhaupt blabla
10 Jan 2012 11:11
Re: Changeset 0306c5a64775
Robert Ransom <rransom.8774 <at> gmail.com>
2012-01-10 10:11:26 GMT
2012-01-10 10:11:26 GMT
On 2012-01-08, Will Noble <will <at> cow9.org> wrote: > I have a slightly different take on this. Deadlock detection occurs in > ROOT-WAIT, which, besides consulting WAITING-FOR-EXTERNAL-EVENTS? also looks > at threads blocking on I/O and (obviously) threads waiting for an alarm. > Each of these three cases corresponds to a type of VM interrupt. So my idea > of what ROOT-WAIT does here is to say that, if we have reason to believe a > VM interrupt might wake a thread, then we don't need to declare deadlock. > > What's missing from this picture is OS signals. The VM recognizes OS signals > as a type of interrupt, but ROOT-WAIT doesn't check if anything is waiting > on one. I assumed the reason for this is because, while the run-time system > maintains queues for the other cases, POSIX maintains sole responsibility > for signal-waiters, and RTS doesn't call POSIX. > > This suggests to me that the RTS should at least keep track of the number of > threads waiting for a signal, and POSIX could increment/decrement that > number. Then there could be a function WAITING-ON-OS-SIGNAL? that ROOT-WAIT > could check. Would a counter be sufficient for this? I see some use of weak pointers in scheme/posix/signal.scm, so I really have no idea whether we could safely use that optimization (i.e. look at a counter instead of the thread queues themselves), even if we didn't need a more general solution. Robert Ransom
11 Jan 2012 03:54
Re: Changeset 0306c5a64775
Will Noble <will <at> cow9.org>
2012-01-11 02:54:30 GMT
2012-01-11 02:54:30 GMT
On Tue, Jan 10, 2012 at 10:11:26AM +0000, Robert Ransom wrote: > On 2012-01-08, Will Noble <will <at> cow9.org> wrote: > > This suggests to me that the RTS should at least keep track of the number of > > threads waiting for a signal, and POSIX could increment/decrement that > > number. Then there could be a function WAITING-ON-OS-SIGNAL? that ROOT-WAIT > > could check. > > Would a counter be sufficient for this? I see some use of weak > pointers in scheme/posix/signal.scm, so I really have no idea whether > we could safely use that optimization (i.e. look at a counter instead > of the thread queues themselves), even if we didn't need a more > general solution. Yes. A sleeping thread can't wait for a signal so a counter works just as well as looking at the thread queue. It's less optimization than convenience. With the approach I outlined in the previous message you might equally well register the actual thread queues with the RTS and have ROOT-WAIT go through them to see if a thread somewhere is waiting for a signal, but a counter does the same thing. Regarding the weak pointer usage in signal.scm: those map signal numbers to signal representations. If, say, you have a signal 5 in the system, the mapping returns the existing object when asked for signal 5. However, if everyone else loses their reference to that object, you don't want the abstract concept of "5" to keep it around forever without being garbage-collected. It's not something that would break the counter shortcut. Best, Will
14 Jan 2012 13:43
Re: Changeset 0306c5a64775
Robert Ransom <rransom.8774 <at> gmail.com>
2012-01-14 12:43:53 GMT
2012-01-14 12:43:53 GMT
On 2012-01-11, Will Noble <will <at> cow9.org> wrote: > On Tue, Jan 10, 2012 at 10:11:26AM +0000, Robert Ransom wrote: >> On 2012-01-08, Will Noble <will <at> cow9.org> wrote: >> > This suggests to me that the RTS should at least keep track of the >> > number of >> > threads waiting for a signal, and POSIX could increment/decrement that >> > number. Then there could be a function WAITING-ON-OS-SIGNAL? that >> > ROOT-WAIT >> > could check. >> >> Would a counter be sufficient for this? I see some use of weak >> pointers in scheme/posix/signal.scm, so I really have no idea whether >> we could safely use that optimization (i.e. look at a counter instead >> of the thread queues themselves), even if we didn't need a more >> general solution. > > Yes. A sleeping thread can't wait for a signal so a counter works just as > well as looking at the thread queue. It's less optimization than > convenience. With the approach I outlined in the previous message you might > equally well register the actual thread queues with the RTS and have > ROOT-WAIT go through them to see if a thread somewhere is waiting for a > signal, but a counter does the same thing. ??? A thread which is waiting for a signal is sleeping. In any case, I don't see how this is relevant to whether we can use a counter (which will break deadlock detection permanently if one of the signal system's thread queues is ever GCed while a thread is blocked on it) to determine whether any threads are waiting for an OS signal. > Regarding the weak pointer usage in signal.scm: those map signal numbers to > signal representations. If, say, you have a signal 5 in the system, the > mapping returns the existing object when asked for signal 5. However, if > everyone else loses their reference to that object, you don't want the > abstract concept of "5" to keep it around forever without being > garbage-collected. It's not something that would break the counter shortcut. Signal queues are weakly held too. Fortunately, the (signal-queue-signals queue) call in find-next-signal keeps a reference to the signal queue around when a thread is blocked on the queue, so we would get away with the counter optimization (but only by a lucky accident). Robert Ransom
14 Jan 2012 23:55
Re: Changeset 0306c5a64775
Will Noble <will <at> cow9.org>
2012-01-14 22:55:56 GMT
2012-01-14 22:55:56 GMT
On Sat, Jan 14, 2012 at 12:43:53PM +0000, Robert Ransom wrote: > ??? A thread which is waiting for a signal is sleeping. In any case, > I don't see how this is relevant to whether we can use a counter > (which will break deadlock detection permanently if one of the signal > system's thread queues is ever GCed while a thread is blocked on it) > to determine whether any threads are waiting for an OS signal. I misunderstood your issue with the counter and ended up explaining why we couldn't double-count, which is incredibly obvious. I hope you were not offended ;). On the other point, it doesn't look to me like a thread queue with a thread in it waiting for a signal could be garbage-collected... > Signal queues are weakly held too. Fortunately, the > (signal-queue-signals queue) call in find-next-signal keeps a > reference to the signal queue around when a thread is blocked on the > queue, so we would get away with the counter optimization (but only by > a lucky accident). ...although I don't understand how this prevents it. The mechanism I see is the signal mapper (from OS signal number to list of thread queues, for all known signals). It strongly references every non-empty thread queue. So a way to know if any thread is waiting for a signal without keeping a counter is to iterate over the mapper looking for a non-empty queue. (Also evidently I was talking about a different usage of weak pointers in signal.scm than the one you referred to, so basically my entire post explained the wrong things. Sorry!) I don't think we really have a disagreement here. To summarize my recommendation, I like the narrow solution where the deadlock detection code checks if any thread is waiting on an OS signal; currently it doesn't take this into account at all. POSIX has the relevant information, but we need it to call the RTS instead of the other way around to avoid making the RTS a client of POSIX. We could accomplish this a number of ways, including using a counter. Best, Will
8 Jan 2012 13:03
Re: Changeset 0306c5a64775
Michael Sperber <sperber <at> deinprogramm.de>
2012-01-08 12:03:51 GMT
2012-01-08 12:03:51 GMT
Robert Ransom <rransom.8774 <at> gmail.com> writes: > In the process-id case, threads waiting on a process ID's placeholder > are awakened by process-terminated-children, which is called by the > os-signal interrupt handler. Nothing waits on an external-event UID. Ah, OK, but that's a bug (even documented in the manual) with the signal code, which also needs fixing. As soon as that's fixed, the problem goes away, right? > If the deadlock-detection code isn't told that those synchronization > objects will be alerted by (an interrupt handler triggered by) an > external event of some sort, then it will falsely report that a > deadlock has occurred. That also makes sense, but if we're talking about "external event of some sort", rather than *the* external events (i.e. the specific things in scheme/rts/external-event.scm) then the fix should also happen outside of external-event.scm, deeper down in the thread system, is my current thinking. Does that make sense? (I'm trying to think this through until I find enough time to actually work on the fix. I really appreciate your patience.) -- -- Cheers =8-} Mike Friede, Völkerverständigung und überhaupt blabla
10 Jan 2012 10:53
Re: Changeset 0306c5a64775
Robert Ransom <rransom.8774 <at> gmail.com>
2012-01-10 09:53:34 GMT
2012-01-10 09:53:34 GMT
On 2012-01-08, Michael Sperber <sperber <at> deinprogramm.de> wrote: > > Robert Ransom <rransom.8774 <at> gmail.com> writes: > >> In the process-id case, threads waiting on a process ID's placeholder >> are awakened by process-terminated-children, which is called by the >> os-signal interrupt handler. Nothing waits on an external-event UID. > > Ah, OK, but that's a bug (even documented in the manual) with the signal > code, which also needs fixing. As soon as that's fixed, the problem > goes away, right? process-id placeholders are separate from signal queues, and their deadlock-detection issues need to be fixed separately. (We can't just use the fix for deadlock detection when threads are waiting on signal queues to handle the case of process-id placeholders, because something needs to listen for SIGCHLD all the time, even when no threads are currently waiting on process-id placeholders.) >> If the deadlock-detection code isn't told that those synchronization >> objects will be alerted by (an interrupt handler triggered by) an >> external event of some sort, then it will falsely report that a >> deadlock has occurred. > > That also makes sense, but if we're talking about "external event of > some sort", rather than *the* external events (i.e. the specific > things in scheme/rts/external-event.scm) then the fix should also > happen outside of external-event.scm, deeper down in the thread system, > is my current thinking. Does that make sense? It makes sense to keep that mechanism separate from the external-events package. See my (revised) wait-for-child-process-deadlock-fix branch (attached, and pushed using hg-git to https://git.torproject.org/user/rransom/scheme48.git) for that. Robert Ransom
11 Jan 2012 07:37
Re: Changeset 0306c5a64775
Robert Ransom <rransom.8774 <at> gmail.com>
2012-01-11 06:37:13 GMT
2012-01-11 06:37:13 GMT
On 2012-01-10, Robert Ransom <rransom.8774 <at> gmail.com> wrote: > On 2012-01-08, Michael Sperber <sperber <at> deinprogramm.de> wrote: >> That also makes sense, but if we're talking about "external event of >> some sort", rather than *the* external events (i.e. the specific >> things in scheme/rts/external-event.scm) then the fix should also >> happen outside of external-event.scm, deeper down in the thread system, >> is my current thinking. Does that make sense? > > It makes sense to keep that mechanism separate from the > external-events package. See my (revised) > wait-for-child-process-deadlock-fix branch (attached, and pushed using > hg-git to https://git.torproject.org/user/rransom/scheme48.git) for > that. I've found some bugs in this branch; I'll send a revised branch soon. Robert Ransom
11 Jan 2012 09:46
Re: Changeset 0306c5a64775
Robert Ransom <rransom.8774 <at> gmail.com>
2012-01-11 08:46:54 GMT
2012-01-11 08:46:54 GMT
On 2012-01-11, Robert Ransom <rransom.8774 <at> gmail.com> wrote: > On 2012-01-10, Robert Ransom <rransom.8774 <at> gmail.com> wrote: >> On 2012-01-08, Michael Sperber <sperber <at> deinprogramm.de> wrote: > >>> That also makes sense, but if we're talking about "external event of >>> some sort", rather than *the* external events (i.e. the specific >>> things in scheme/rts/external-event.scm) then the fix should also >>> happen outside of external-event.scm, deeper down in the thread system, >>> is my current thinking. Does that make sense? >> >> It makes sense to keep that mechanism separate from the >> external-events package. See my (revised) >> wait-for-child-process-deadlock-fix branch (attached, and pushed using >> hg-git to https://git.torproject.org/user/rransom/scheme48.git) for >> that. > > I've found some bugs in this branch; I'll send a revised branch soon. See attached. Robert Ransom
10 Jan 2012 11:08
Re: Changeset 0306c5a64775
Robert Ransom <rransom.8774 <at> gmail.com>
2012-01-10 10:08:56 GMT
2012-01-10 10:08:56 GMT
On 2012-01-07, Michael Sperber <sperber <at> deinprogramm.de> wrote: > > Robert Ransom <rransom.8774 <at> gmail.com> writes: >> Regarding getaddrinfo, I think the Right Thing is to make the Scheme >> interface to getaddrinfo fill in a placeholder, rather than using the >> external event system directly as it does now. > > OK - it still seems to me the hard part is communicating the information > from C to Scheme somehow. The external-events system is broken, both as used by getaddrinfo and when used with long-lived UIDs. When a UID is allocated from C for an event, then passed to a Scheme thread which waits for that UID once, there is a race condition -- if C signals an event with that UID before Scheme starts waiting for it, the event will be dropped on the floor, and the Scheme thread will not be awakened unless and until some other event is allocated the same UID and is signaled. When a long-lived UID is allocated for a class of events, and a Scheme thread waits for that UID repeatedly, there is a different race condition -- if C signals that UID between the time that Scheme checks for events previously queued from C and the time that it waits for a new event, the event signal will be dropped on the floor. See my event-waiters branch (attached and pushed to hg-git) for a synchronization primitive which can be used to fix the latter race condition, provided that the event-waiter is created before C starts signaling events of a particular class. To fix the race condition in getaddrinfo-style usage of external events, we need a way to allocate some sort of event identifier from Scheme, pass the identifier to C, and pass the result back to Scheme to be put in a placeholder. Robert Ransom
RSS Feed