H.K. Jerry Chu | 22 Mar 2006 01:04

small change in the connected mode draft

Hi folks,

Allison Mankin, the transport area AD who brought up some
concern during IESG review of the connected draft regarding
simultaneous retransmissions at different layers, has suggested
and Vivek agreed to the following change to section 7.1
"A Cautionary Note on IPoIB-RC".

The revised section reads like this:

	The RC mode of InfiniBand guarantees in-order delivery of
	packets. Every message transmitted over the RC connection is
	broken into physical MTU sized packets by the RC connection. If
	any packet is lost, it is retransmitted until the complete
	message is exchanged. Therefore, there is a possibility of an
	upper transport layer experiencing a timeout, while the RC layer
	is still in the process of transferring the complete message.
	TCP will view the timeout as an indicator of congestion and
	enter slow-start thereby affecting throughput drastically
	[RFC2581]. Other upper layer protocols might insert
	retransmissions into the fabric adding to the already existing
	congestion.

	The applicability of Infiniband reliability is on a fabric
	with short latencies (not wide area).  Therefore, the RC timer
	values should be short compared with the starting minimum
	time values used by the upper end-to-end transports.  In
	addition, because the RC mode does not have measurement
	based reliable transmission, its use over fabrics with long
	latency or very dynamic latency may be a concern for
	congestion-aware traffic traversing those fabrics.
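As a rough illustration of the "RC timer values should be short compared with the upper transport's starting timeout" guidance, the sketch below estimates how long an RC queue pair might keep retransmitting and compares that with an upper-layer timeout. It relies on the InfiniBand Local ACK Timeout encoding (4.096 usec times 2^value, with 0 meaning no timeout). The function names, the "every attempt times out" model, and the 1-second and 3-second upper-layer timeouts are assumptions made for the example, not values taken from the draft.

```python
import math

# Base unit of the InfiniBand Local ACK Timeout, in microseconds.
IB_ACK_TIMEOUT_BASE_US = 4.096


def local_ack_timeout_s(exponent: int) -> float:
    """Local ACK Timeout in seconds for a 5-bit encoded value.

    An encoded value of 0 means the timer is disabled (wait forever).
    """
    if not 0 <= exponent <= 31:
        raise ValueError("Local ACK Timeout is a 5-bit field (0..31)")
    if exponent == 0:
        return math.inf
    return IB_ACK_TIMEOUT_BASE_US * (2 ** exponent) / 1e6


def worst_case_rc_retry_s(exponent: int, retry_count: int) -> float:
    """Simplified estimate: the first send plus every retry times out once.

    Real adapters may wait longer than the nominal timeout per attempt,
    so treat this as an optimistic figure (assumption of this sketch).
    """
    if not 0 <= retry_count <= 7:
        raise ValueError("retry count is a 3-bit field (0..7)")
    return local_ack_timeout_s(exponent) * (retry_count + 1)


def rc_finishes_before(upper_timeout_s: float, exponent: int, retry_count: int) -> bool:
    """True if RC retransmission ends before the upper layer would time out."""
    return worst_case_rc_retry_s(exponent, retry_count) < upper_timeout_s


if __name__ == "__main__":
    # Hypothetical settings: compare two ACK timeout encodings with 7 retries.
    for exponent, upper in ((14, 1.0), (18, 3.0)):
        total = worst_case_rc_retry_s(exponent, 7)
        print(f"exponent={exponent}: RC window ~{total:.3f}s, "
              f"inside {upper}s upper-layer timeout: "
              f"{rc_finishes_before(upper, exponent, 7)}")
```

With these illustrative inputs, a small timeout encoding keeps the whole RC retry window well under the upper-layer timeout, while a larger one lets RC keep retrying long after the upper layer has already timed out, which is the overlap the cautionary note describes.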

Michael Krause | 23 Mar 2006 21:57

Re: small change in the connected mode draft

At 04:04 PM 3/21/2006, H.K. Jerry Chu wrote:
> Hi folks,
>
> Allison Mankin, the transport area AD who brought up some
> concern during IESG review of the connected draft regarding
> simultaneous retransmissions at different layers, has suggested
> and Vivek agreed to the following change to section 7.1
> "A Cautionary Note on IPoIB-RC".
>
> The revised section reads like this:
>
>         The RC mode of InfiniBand guarantees in-order delivery of
>         packets. Every message transmitted over the RC connection is
>         broken into physical MTU sized packets by the RC connection. If
>         any packet is lost, it is retransmitted until the complete
>         message is exchanged. Therefore, there is a possibility of an
>         upper transport layer experiencing a timeout, while the RC layer
>         is still in the process of transferring the complete message.
>         TCP will view the timeout as an indicator of congestion and
>         enter slow-start thereby affecting throughput drastically
>         [RFC2581]. Other upper layer protocols might insert
>         retransmissions into the fabric adding to the already existing
>         congestion.
>
>         The applicability of Infiniband reliability is on a fabric
>         with short latencies (not wide area).  Therefore, the RC timer
>         values should be short compared with the starting minimum
>         time values used by the upper end-to-end transports.  In
>         addition, because the RC mode does not have measurement
>         based reliable transmission, its use over fabrics with long
>         latency or very dynamic latency may be a concern for congestion-
>         aware traffic traversing those fabrics.
>
> If you have any comment/issue on the proposed change please post them
> to the list before COB this Friday (3/24/06). I'd like to move the
> draft past IESG review so we can wrap up the WG asap.
>
> BTW, please be informed that my email address will change after this
> Friday. My new address is hkjerry.chu <at> gmail.com.

What about RNR (Receiver Not Ready), which can go into an infinite timer
state? It should also be noted that IB does not mandate timer ranges
that necessarily correspond to TCP timers. In fact, under real IB
congestion, or under a port / VL arbitration policy that places such
traffic in a best-effort QoS slot, it is quite possible to see delays
that any ULP such as TCP would interpret as congestion.

Is this really an issue, given that the applications continue to operate,
albeit at a slower rate?
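To make the RNR point concrete, here is a minimal sketch of how long a requester could stall on Receiver Not Ready NAKs. It uses the verbs convention that an RNR retry count of 7 means "retry indefinitely". The function names, the "timer times retries" model, and the 3-second upper-layer timeout are assumptions of the example rather than anything stated in the draft or mandated by IB.

```python
import math

# In the verbs API, an RNR retry count of 7 means "retry forever".
RNR_RETRY_INFINITE = 7


def max_rnr_stall_s(rnr_timer_ms: float, rnr_retry: int) -> float:
    """Rough stall estimate: one RNR timer interval per permitted retry.

    rnr_timer_ms is the decoded RNR NAK timer in milliseconds; the encoded
    field is not decoded here to keep the sketch small.
    """
    if not 0 <= rnr_retry <= 7:
        raise ValueError("RNR retry count is a 3-bit field (0..7)")
    if rnr_retry == RNR_RETRY_INFINITE:
        return math.inf
    return rnr_timer_ms * rnr_retry / 1000.0


def upper_layer_may_time_out(rnr_timer_ms: float, rnr_retry: int,
                             upper_timeout_s: float) -> bool:
    """True if the RNR stall could last as long as the upper-layer timeout."""
    return max_rnr_stall_s(rnr_timer_ms, rnr_retry) >= upper_timeout_s


if __name__ == "__main__":
    # Hypothetical inputs: a 3 s upper-layer timeout, two RNR configurations.
    print(upper_layer_may_time_out(10.24, 3, 3.0))   # short, bounded stall
    print(upper_layer_may_time_out(0.64, 7, 3.0))    # unbounded retries
```

With an unbounded retry count, no choice of RC timer value keeps the stall below an upper-layer timeout, so the "short RC timers" advice alone does not cover the RNR case.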

Mike
_______________________________________________
IPoverIB mailing list
IPoverIB <at> ietf.org
https://www1.ietf.org/mailman/listinfo/ipoverib
