Wax, Edward | 15 Aug 2012 15:44

RabbitMQ application failover/recovery in HA cluster

I am implementing a .NET WCF ActiveDuplex service with RabbitMQ under the covers for messaging purposes.   I am using the asynchronous message pattern of publish/subscribe within the ActiveDuplex service, which includes callbacks to the originating client. 

 

There are 2 parts to the equation:  (a) our implementation of the WCF ActiveDuplex interface and (b) the RabbitMQ failover/recovery strategy.  I was hoping you could help with the RabbitMQ part of the equation. 

 

Our previous implementation of RabbitMQ consisted of a Pacemaker active/passive HA cluster with two nodes, and a SAN disk for shared storage. We would connect to the cluster virtual IP for all RabbitMQ transactions, and the HA features of Pacemaker would manage resource failures automatically with the net effect that consumers could complete a RabbitMQ session without interruption.

 

We recently moved to a RabbitMQ Cluster as recommended in the Clustering Guide at http://www.rabbitmq.com/clustering.html , where multiple RabbitMQ servers are arranged in an Active/Active fashion.  In doing so we gain the benefits of a true HA environment but we have lost the ability to seamlessly recover from RabbitMQ failures, previously provided by the Pacemaker infrastructure (the single virtual IP).   I’ve reviewed your .NET documentation and see there are references to the various shutdown protocols.  Do you have more specific documentation that would elaborate on these and provide examples on how we can manage our client connections in this HA environment (e.g. if I receive a ModelShutdown event, what steps do I need to go through in order to "transfer" a RabbitMQ session to a new connection).  Or are there other "management" APIs that should be used instead?

 

I’ve also been looking at the book RabbitMQ In Action, and its section on application failure and recovery (6.2) makes it sound almost as simple as wrapping the consumer/producer code in a try/catch block.

 

Any thoughts would be appreciated.

Emile Joubert | 16 Aug 2012 13:05

Re: RabbitMQ application failover/recovery in HA cluster

Hi Edward,

On 15/08/12 14:44, Wax, Edward wrote:
> our client connections in this HA environment (e.g. if I receive a
> ModelShutdown event, what steps do I need to go through in order to
> "transfer" a RabbitMQ session to a new connection).  Or are there other
> "management" APIs that should be used instead?

One possible reason for receiving a ModelShutdown event is that the
broker raised a channel exception, but this event is also sent when the
channel/Model closes normally. After a node failure you will not need to
re-establish a channel/Model on its own: the containing connection will
have been lost as well and must be re-established too.

The more relevant events in a clustered environment are
IConnection.ConnectionShutdown and IBasicConsumer.HandleBasicCancel.
Upon receipt of the former, the client could attempt to reconnect to
another node in the cluster. If a subscriber to a mirrored queue
receives a cancel notification, it could attempt to resubscribe, and it
should be aware that redeliveries are especially likely.
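
As a rough illustration of the two reactions described above, here is a
library-neutral sketch in Python rather than .NET. Everything in it is
an assumption made for illustration: connect_with_failover,
NodeUnavailable and DedupingHandler are hypothetical names, the
`connect` callable stands in for whatever the real client does to open
a connection to one host, and the duplicate check assumes the publisher
sets a unique message id on every message. It is not the client
library's API.

```python
import itertools
import time


class NodeUnavailable(Exception):
    """Raised by the (hypothetical) connect callable when a node is unreachable."""


def connect_with_failover(nodes, connect, max_attempts=6, delay=0.0):
    """Try the cluster nodes in round-robin order until one accepts a connection.

    Intended to be called from a connection-shutdown handler. Returns a
    (node, connection) pair, or raises NodeUnavailable once max_attempts
    connection attempts have failed.
    """
    last_error = None
    for _, node in zip(range(max_attempts), itertools.cycle(nodes)):
        try:
            return node, connect(node)
        except NodeUnavailable as exc:
            last_error = exc
            time.sleep(delay)  # back off before trying the next node
    raise NodeUnavailable(
        "no node reachable after %d attempts" % max_attempts
    ) from last_error


class DedupingHandler:
    """Wraps a message handler so that a redelivered message is processed once.

    Assumes every message carries a unique id set by the publisher. The set
    of seen ids grows without bound here; a real service would expire it.
    """

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()

    def __call__(self, message_id, body):
        if message_id in self._seen:
            return False  # duplicate caused by redelivery, skip it
        self._handler(body)
        self._seen.add(message_id)  # record only after successful handling
        return True


if __name__ == "__main__":
    down = {"rabbit1"}  # pretend the first node has failed

    def fake_connect(node):
        if node in down:
            raise NodeUnavailable(node)
        return "connection to " + node

    print(connect_with_failover(["rabbit1", "rabbit2"], fake_connect))
```

After a successful reconnect the channel/Model, any declarations and
the subscription would all need to be set up again on the new
connection, since none of them survive the old one.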

> I’ve also been looking the book, /RabbitMQ In Action/, and a section on
> application failure and recovery (6.2) sounds almost as simple as
> wrapping the consumer/producer code in a try/catch block.

That section deals with recovery when connected to a plain cluster,
rather than an HA cluster with mirrored queues. The assumption there is
that queues could be lost, while in your case a mirrored queue will not
be lost as long as at least one node that mirrors it remains running.

-Emile

