Vladimir Kotal | 23 Jun 2011 19:56

dtrace script for CLOSE_WAIT leaks ?


Hi all,

Does anyone have a DTrace script for observing connection/socket leaks,
i.e. cases where an application never calls shutdown()/close() on a socket
which it previously accept()'ed? Ideally with IP address/port pairs printed
in the output.

v.
Brian Utterback | 23 Jun 2011 20:37

Re: dtrace script for CLOSE_WAIT leaks ?

On 06/23/11 13:56, Vladimir Kotal wrote:
> 
> Hi all,
> 
> Does anyone have a dtrace script for observing connection/socket leaks ?
> I.e. when application does not do shutdown()/close() on a socket which
> it previously accept()'ed ? Ideally with IP address/port pairs printed
> in the output.

Not sure what you would be looking for. You could trace connections as
they transition to CLOSE_WAIT, but that isn't what you want, is it? The
problem is not when they transition into CLOSE_WAIT, but when they fail
to transition out of CLOSE_WAIT. How could you trace that?

-- 
blu

Always code as if the guy who ends up maintaining your code will be a
violent psychopath who knows where you live. - Martin Golding
-----------------------------------------------------------------------|
Brian Utterback - Solaris RPE, Oracle Corporation.
Ph:603-262-3916, Em:brian.utterback@...
Brian Ruthven | 23 Jun 2011 22:48

Re: dtrace script for CLOSE_WAIT leaks ?

On 23/06/2011 18:56, Vladimir Kotal wrote:
>
> Hi all,
>
> Does anyone have a dtrace script for observing connection/socket leaks 
> ? I.e. when application does not do shutdown()/close() on a socket 
> which it previously accept()'ed ? Ideally with IP address/port pairs 
> printed in the output.
>

I've looked to do this kind of thing with dtrace in the past, and 
concluded that it couldn't be done (which could well be incorrect!). The 
problem with leaked stuff is that whilst capturing data, you can pair up 
and eliminate the symmetric events, but the stuff you're interested in 
is the bit left over.

It's fairly trivial to pair up the opens and closes, something like this 
(pseudo-d-code):

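/* pseudo-probe: a connection enters CLOSE_WAIT */
transition-to-close-wait:entry
{
     array[port] = 1;     /* remember this connection as pending */
}

/* pseudo-probe: the connection is torn down or the socket is closed */
delete-tcp-or-close-socket:entry
/array[port]/
{
     array[port] = 0;     /* paired up; assigning 0 frees the element in D */
}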


Brian Utterback | 24 Jun 2011 15:12

Re: dtrace script for CLOSE_WAIT leaks ?

Assuming you could find a way to dump the array, doesn't this just give
you a list of ports whose connections are currently in CLOSE_WAIT?
Wouldn't netstat give you the same info?

Instead of setting the array value to 1, you could set it to the value
of walltimestamp. That way when you dumped it out, you would have the
time it went into CLOSE_WAIT, which would give you an indication of
which ones were in the state the longest. I wonder if you could get an
aggregation to work here? Hmm.
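
The bookkeeping suggested above can be modelled outside of DTrace. The
following is only a sketch in Python, not a D script: the class name
CloseWaitTracker and its methods are hypothetical, and the two event hooks
stand in for whatever probes would report a connection entering and leaving
CLOSE_WAIT. It stores the entry time instead of 1, so the dump can be sorted
by how long each connection has been stuck.

```python
import time


class CloseWaitTracker:
    """Remember when each connection entered CLOSE_WAIT (hypothetical helper)."""

    def __init__(self):
        # key: (remote address, remote port), value: time of entry in seconds
        self._entered = {}

    def enter_close_wait(self, conn, now=None):
        # Store the entry time rather than a plain flag.
        self._entered[conn] = time.time() if now is None else now

    def leave_close_wait(self, conn):
        # Paired event: the connection is no longer a leak candidate.
        self._entered.pop(conn, None)

    def oldest(self, now=None):
        # Whatever is left over, longest-stuck connection first.
        now = time.time() if now is None else now
        ages = [(now - t, conn) for conn, t in self._entered.items()]
        return sorted(ages, reverse=True)


if __name__ == "__main__":
    tracker = CloseWaitTracker()
    tracker.enter_close_wait(("192.0.2.10", 40001), now=100)
    tracker.enter_close_wait(("192.0.2.11", 40002), now=130)
    tracker.leave_close_wait(("192.0.2.11", 40002))
    for age, conn in tracker.oldest(now=160):
        print(conn, "in CLOSE_WAIT for", age, "seconds")
```

Unlike a netstat snapshot, the ages make it possible to tell a connection
that has just entered CLOSE_WAIT from one that has been sitting there for
hours.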

On 06/23/11 16:48, Brian Ruthven wrote:
> On 23/06/2011 18:56, Vladimir Kotal wrote:
>>
>> Hi all,
>>
>> Does anyone have a dtrace script for observing connection/socket leaks
>> ? I.e. when application does not do shutdown()/close() on a socket
>> which it previously accept()'ed ? Ideally with IP address/port pairs
>> printed in the output.
>>
> 
> I've looked to do this kind of thing with dtrace in the past, and
> concluded that it couldn't be done (which could well be incorrect!). The
> problem with leaked stuff is that whilst capturing data, you can pair up
> and eliminate the symmetric events, but the stuff you're interested in
> is the bit left over.
> 
> It's fairly trivial to pair up the opens and closes, something like this
> (pseudo-d-code):
> 

James Carlson | 24 Jun 2011 15:37

Re: dtrace script for CLOSE_WAIT leaks ?

Brian Utterback wrote:
> Assuming you could find a way to dump the array, doesn't this just give
> you a list of port whose connections are currently in CLOSE_WAIT?
> Wouldn't netstat give you the same info?
> 
> Instead of setting the array value to 1, you could set it to the value
> of walltimestamp. That way when you dumped it out, you would have the
> time it went into CLOSE_WAIT, which would give you an indication of
> which ones were in the state the longest. I wonder if you could get an
> aggregation to work here? Hmm.

In about 99 and 44/100ths percent of the cases I've looked at in the
past, what appears to be a "leak" is actually something exacerbated by
the OS.

What I usually see is that the application opens a socket (via socket()
or accept()), does some work, and then closes the socket normally.
Unbeknownst to the application, part of that "work" involved a fork(),
perhaps buried in a library somewhere.  (The free fork() given out to
users of syslog() employing LOG_CONS was once a possible cause, but
there are others.)

The fork() logic duplicates all of the open file descriptors, and the
code calling fork() in this case doesn't "know" that there are
descriptors that it shouldn't be copying so it can't easily close them
afterwards.  It's the new process -- possibly completely unknown to the
main application -- that's still holding the socket open, allowing it to
slip into CLOSE_WAIT state.
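
The effect described above can be reproduced on a loopback connection. This
is a minimal sketch in Python (used here for brevity; the underlying calls
are the usual fork()/close()/shutdown()), assuming a POSIX system. The
function name demo_inherited_fd and the sleeping child, which stands in for
the unknown helper process, are illustrative only.

```python
import os
import signal
import socket
import time


def demo_inherited_fd():
    """Return (eof_while_child_alive, eof_after_child_exit) as seen by the peer."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # any free loopback port
    listener.listen(1)
    peer = socket.create_connection(listener.getsockname())
    conn, _ = listener.accept()

    pid = os.fork()
    if pid == 0:
        # Child: inherited a copy of conn without knowing about it.
        time.sleep(30)
        os._exit(0)

    peer.shutdown(socket.SHUT_WR)        # peer sends FIN; conn enters CLOSE_WAIT
    conn.close()                         # parent closes "normally"

    peer.settimeout(0.5)
    try:
        eof_while_child_alive = peer.recv(1) == b""
    except socket.timeout:
        # No FIN came back: the child's copy keeps the socket open.
        eof_while_child_alive = False

    os.kill(pid, signal.SIGKILL)         # the last reference goes away with the child
    os.waitpid(pid, 0)
    peer.settimeout(5)
    eof_after_child_exit = peer.recv(1) == b""

    peer.close()
    listener.close()
    return eof_while_child_alive, eof_after_child_exit


if __name__ == "__main__":
    print(demo_inherited_fd())
```

The parent's close() only drops one reference. The connection is torn down,
and the peer sees EOF, only once the child's copy of the descriptor is gone
as well.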

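For that reason, I think any CLOSE_WAIT diagnostic function should at
least track the fork() descriptor duplication and allow you to trace
back to the application that "leaked" descriptors by way of creating new
processes.

(Would be nice to have something like z/OS's FCTLCLOFORK or the
sometimes-discussed Linux FD_DONTINHERIT flag.)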

Nico Williams | 24 Jun 2011 17:50

Re: dtrace script for CLOSE_WAIT leaks ?

On Fri, Jun 24, 2011 at 8:37 AM, James Carlson <carlsonj@...> wrote:
> (Would be nice to have something like z/OS's FCTLCLOFORK or the
> sometimes-discussed Linux FD_DONTINHERIT flag.)

+1
Brian Utterback | 24 Jun 2011 18:10

Re: dtrace script for CLOSE_WAIT leaks ?

On 06/24/11 09:37, James Carlson wrote:

> The fork() logic duplicates all of the open file descriptors, and the
> code calling fork() in this case doesn't "know" that there are
> descriptors that it shouldn't be copying so it can't easily close them
> afterwards.  It's the new process -- possibly completely unknown to the
> main application -- that's still holding the socket open, allowing it to
> slip into CLOSE_WAIT state.
> 
> For that reason, I think any CLOSE_WAIT diagnostic function should at
> least track the fork() descriptor duplication and allow you to trace
> back to the application that "leaked" descriptors by way of creating new
> processes.
> 
> (Would be nice to have something like z/OS's FCTLCLOFORK or the
> sometimes-discussed Linux FD_DONTINHERIT flag.)
> 

The most common cause of CLOSE_WAIT connections in my experience has
been middleware code that indeed forked a child that inherited the FD
when it was not intended, but which also forked a child that was
intended to get the FD. So at least in that case the FD_DONTINHERIT and
FCTLCLOFORK would not have helped.

Perhaps we should recommend the use of shutdown instead of close?
-- 
blu

Always code as if the guy who ends up maintaining your code will be a
violent psychopath who knows where you live. - Martin Golding

Rao Shoaib | 24 Jun 2011 19:03

Re: dtrace script for CLOSE_WAIT leaks ?

Brian Utterback wrote:
> On 06/24/11 09:37, James Carlson wrote:
>
>> The fork() logic duplicates all of the open file descriptors, and the
>> code calling fork() in this case doesn't "know" that there are
>> descriptors that it shouldn't be copying so it can't easily close them
>> afterwards.  It's the new process -- possibly completely unknown to the
>> main application -- that's still holding the socket open, allowing it to
>> slip into CLOSE_WAIT state.
>>
>> For that reason, I think any CLOSE_WAIT diagnostic function should at
>> least track the fork() descriptor duplication and allow you to trace
>> back to the application that "leaked" descriptors by way of creating new
>> processes.
>>
>> (Would be nice to have something like z/OS's FCTLCLOFORK or the
>> sometimes-discussed Linux FD_DONTINHERIT flag.)
>
> The most common cause of CLOSE_WAIT connections in my experience has
> been middleware code that indeed forked a child that inherited the FD
> when it was not intended, but which also forked a child that was
> intended to get the FD. So at least in that case the FD_DONTINHERIT and
> FCTLCLOFORK would not have helped.
>
> Perhaps we should recommend the use of shutdown instead of close?

You really need both, as close() is still needed to release the file
descriptor.
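
A small sketch of the shutdown-plus-close combination, again in Python on a
loopback connection and assuming a POSIX system; demo_shutdown_then_close is
a hypothetical name and the sleeping child stands in for a process that
inherited the descriptor. shutdown() acts on the connection itself, so the
FIN goes out even though another process still holds a copy, while close()
is what gives the descriptor back to the calling process.

```python
import os
import signal
import socket
import time


def demo_shutdown_then_close():
    """Return (peer_saw_eof_while_child_alive, parent_fd_released)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # any free loopback port
    listener.listen(1)
    peer = socket.create_connection(listener.getsockname())
    conn, _ = listener.accept()

    pid = os.fork()
    if pid == 0:
        # Child: keeps its inherited copy of conn open.
        time.sleep(30)
        os._exit(0)

    peer.shutdown(socket.SHUT_WR)        # peer sends FIN; conn enters CLOSE_WAIT
    conn.shutdown(socket.SHUT_RDWR)      # tears down the connection for all holders
    conn.close()                         # releases the parent's descriptor

    peer.settimeout(5)
    peer_saw_eof = peer.recv(1) == b""   # FIN arrives although the child is alive
    parent_fd_released = conn.fileno() == -1

    os.kill(pid, signal.SIGKILL)         # clean up the stand-in child
    os.waitpid(pid, 0)
    peer.close()
    listener.close()
    return peer_saw_eof, parent_fd_released


if __name__ == "__main__":
    print(demo_shutdown_then_close())
```

The child's copy of the descriptor is still open after this and still counts
against that process's descriptor limit, but the connection no longer lingers
in CLOSE_WAIT.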

Rao.


