AS2 Reliability and an issue for comment.
2006-03-07 17:11:14 GMT
There are several residual Internet-Drafts under review by the Ediint vendor and user community that are responding to industry requests for standardization. Most of these Internet-Drafts are intended as informational RFCs rather than “official” Ediint chartered efforts. Examples include drafts on Compression, the Features header, Certificate Exchange messages (CEM), filename transmission, multipart payload support, and Reliability for AS2. The question here concerns a small but important point in the AS2 reliability draft.
AS2 has an option for either synchronous or asynchronous MDNs. The issue for comment is concerned with synchronous MDN mode.
Most vendors have attempted to provide for some recovery from network and/or server failures, and also to protect their customers from resource exhaustion. When synchronous MDNs are used to transfer large amounts of business data with compression, digital signatures, and encryption applied to that data, heavily loaded systems can take a large amount of time to produce the MDN to send back in the HTTP response. The HTTP connection then needs to be held open for an unpredictable amount of time, using resources on both sides.
Now, because it is possible for an AS2 application to become “hung” on the server side, software engineers often build in a “timer” that closes a connection after some period of time. Unfortunately, the timeout can occur before the HTTP requester (client) has received the protocol’s HTTP response. In addition, sometimes various HTTP intermediaries (tunnels/proxies/gateways/etc) may time out a connection along the path from client to final HTTP server based on “inactivity,” and again prevent the completion of the HTTP protocol.
These exceptional conditions may be tied to an exception handler that retries the HTTP request with its large payload. More often than not, this retry of a large payload to an ever more heavily loaded server is a recipe for further failure (and retry). Because AS2 payloads are growing from the tens to hundreds of megabytes, and the AS2 traffic on existing servers is growing, the “timeout/retry” spiral has become an operational difficulty for AS2 systems that needs consideration.
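To illustrate why the spiral feeds on itself, here is a toy Python model. It is only a sketch: the function name, the linear load model, and every number are assumptions made for illustration, not measurements of any AS2 product. The key assumption is that work abandoned after a timeout still burdens the server when the retry arrives.

```python
def simulate_retries(base_seconds, timeout_seconds, load_per_retry, max_attempts):
    """Toy model of the timeout/retry spiral (all parameters are hypothetical).

    Returns a list of (attempt, processing_seconds, succeeded) tuples.
    """
    history = []
    load = 0.0  # extra load caused by earlier, abandoned attempts
    for attempt in range(1, max_attempts + 1):
        # Assumption: processing time grows linearly with server load.
        processing = base_seconds * (1.0 + load)
        succeeded = processing <= timeout_seconds
        history.append((attempt, processing, succeeded))
        if succeeded:
            break
        # Assumption: the timed-out request keeps consuming server resources.
        load += load_per_retry
    return history


# A request that already needs longer than the timeout never succeeds by retrying.
for attempt, seconds, ok in simulate_retries(120, 100, 0.5, 3):
    print(attempt, seconds, "ok" if ok else "timed out")
```

Under these assumptions each retry takes longer than the one before it, so a fixed timer that was too short for the first attempt stays too short for all of them.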
The AS2 specification does have a built-in solution for this problem—asynchronous MDN mode. However, users have indicated an interest in whether there is anything else that might be done to address the timeout problem and make AS2 in synchronous MDN mode more reliable.
One direction is to try to make the timeout interval value “flexible” and adapt it intelligently. While both transmission time and payload size are known to the sender, the receiver load (often the most critical factor) is not known. So it becomes difficult to arrive at an intelligent solution that will not sometimes be wrong, which tends not to satisfy AS2 end users.
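A rough Python sketch of such an adaptive timeout shows where the guesswork enters. The formula, the function name, and the rates below are all assumptions for illustration; in particular, `load_factor` stands for the receiver load, which the sender has no way to observe.

```python
def estimate_timeout(payload_bytes, link_bytes_per_s, processing_bytes_per_s, load_factor):
    """Hypothetical timeout estimate in seconds for a synchronous MDN exchange.

    load_factor is the sender's guess at how much slower the receiver runs
    under its current load (1.0 means an idle receiver).
    """
    transfer_s = payload_bytes / link_bytes_per_s
    processing_s = payload_bytes / processing_bytes_per_s * load_factor
    return transfer_s + processing_s


payload = 100_000_000  # a 100 MB payload
print(estimate_timeout(payload, 10_000_000, 5_000_000, load_factor=1.0))
print(estimate_timeout(payload, 10_000_000, 5_000_000, load_factor=4.0))
```

With identical payload and link figures, the two guesses at receiver load give estimates that differ by a factor of three, which is the difficulty described above.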
Another direction might be to prohibit timeouts. This solution would remove protections against tying up resources (both on sender and receiver sides) in the really exceptional situations of a hung or dead thread/process that did not clean up with an appropriate HTTP status code (5xx range). Again there would be resistance to the adoption of this solution by developers and engineers.
Another direction might be to prohibit retries when using synchronous MDNs. This direction effectively gives up on AS2 reliability. When the specific error condition is recoverable (server down, connection refused, transient network error, server temporarily busy), then retry can be a reasonable way to enhance automation and reduce the need for operational intervention and special manual handling.
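The recoverable conditions listed above could be captured in a small retry policy. The following Python sketch is illustrative only: the `Failure` names, `should_retry`, and the attempt limit are hypothetical, and whether a response timeout belongs in the recoverable set is exactly the open question, so it is left out here purely for the sake of the example.

```python
from enum import Enum, auto


class Failure(Enum):
    SERVER_DOWN = auto()
    CONNECTION_REFUSED = auto()
    TRANSIENT_NETWORK_ERROR = auto()
    SERVER_BUSY = auto()
    RESPONSE_TIMEOUT = auto()  # the synchronous-MDN timeout case


# The recoverable conditions named in the text; RESPONSE_TIMEOUT is
# excluded as an assumption of this sketch, not as a recommendation.
RECOVERABLE = frozenset({
    Failure.SERVER_DOWN,
    Failure.CONNECTION_REFUSED,
    Failure.TRANSIENT_NETWORK_ERROR,
    Failure.SERVER_BUSY,
})


def should_retry(failure, attempt, max_attempts=3):
    """Retry only recoverable failures, and only up to max_attempts tries."""
    return failure in RECOVERABLE and attempt < max_attempts


print(should_retry(Failure.CONNECTION_REFUSED, attempt=1))
print(should_retry(Failure.RESPONSE_TIMEOUT, attempt=1))
```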
If the basic problem of the “timeout/retry” spiral is that there is no way to tell intermediaries or the client that forward progress is being made on completing the HTTP response, then one remaining direction is to provide a forward progress indicator. The HTTP protocol does have an option for providing this feature that takes advantage of the HTTP “100 Continue” response status. In other words, an HTTP server can be configured to send a sequence of “100 Continue” replies, and an HTTP 1.1 client is effectively instructed to keep waiting for a final reply in the success range (“2xx”) or possibly the failure range (“5xx”). [The 3xx and 4xx cases are ignored here for simplicity; these statuses should be given as an initial HTTP response IMO.] This solution does not magically create resources when they are falling short, but it does at least potentially avoid the “timeout/retry” spiral.
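To make the client side concrete, here is a minimal Python sketch, not taken from the reliability draft, of a reader that skips interim 100 responses until a final status arrives. `read_final_status` is a hypothetical helper that works on any file-like byte stream; it ignores response bodies and other 1xx codes, and a real HTTP client library would normally do this work for you.

```python
import io


def read_final_status(stream):
    """Return (final_status, interim_count) from a raw HTTP/1.1 response stream."""
    interim = 0
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("connection closed before a final response")
        parts = line.decode("iso-8859-1").split(None, 2)
        if len(parts) < 2 or not parts[0].startswith("HTTP/"):
            raise ValueError(f"malformed status line: {line!r}")
        status = int(parts[1])
        # Skip any header lines up to the blank line that ends this response head.
        while True:
            header = stream.readline()
            if header in (b"\r\n", b"\n", b""):
                break
        if status == 100:
            interim += 1  # forward progress: keep waiting for the final reply
            continue
        return status, interim


# Three progress indicators followed by the real response.
raw = b"HTTP/1.1 100 Continue\r\n\r\n" * 3 + b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"
print(read_final_status(io.BytesIO(raw)))
```

Each interim reply is also traffic on the connection, which is what would keep inactivity timers in intermediaries from firing.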
Recommending that AS2 reliability make use of this “keep alive” or “forward progress” indicator would mark a change in current operational modes. It is to be expected that this capability would be marked by a special feature value (or an AS2-version number if the Features header is not approved) to allow a smooth transition to interoperability. Also, how frequently to send “100 Continue” replies, and how to react to a stretch of time without them, are issues needing consensus from the participants on this list. All of this assumes that people support the direction proposed here, so stakeholders should let their views be known!
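As a starting point for that discussion, one possible client rule is "give up once the gap since the last sign of progress exceeds some agreed maximum". The Python sketch below is only an illustration of that rule: `stall_deadline`, the 15-second gap, and the arrival times are all made-up values, not proposals.

```python
from typing import Iterable, Optional


def stall_deadline(start: float, arrivals: Iterable[float], max_gap: float) -> Optional[float]:
    """Return the time at which the client would give up, or None if it never stalls.

    arrivals holds the arrival time of each interim reply in order, with the
    final response as the last entry. All times are seconds on one clock.
    """
    last = start
    for t in arrivals:
        if t - last > max_gap:
            return last + max_gap  # the watchdog fires before this reply arrives
        last = t
    return None


# Replies every 10 s or less: the client keeps waiting until the final response.
print(stall_deadline(0.0, [10.0, 20.0, 30.0, 35.0], max_gap=15.0))
# A 30 s silence after the first reply: the client gives up.
print(stall_deadline(0.0, [10.0, 40.0], max_gap=15.0))
```

Whatever values the list settles on, the server's sending interval has to be comfortably shorter than the client's maximum gap, and shorter than the inactivity timers of any intermediaries on the path.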