Raghav | 20 Jun 2012 18:28
Picon

Swapping Providers

Hi Steve,


Below is the scenario I want to achieve.


                      Warm-Standby
Master  ------------------------------------------------ DR-Master
    |
    |    Replication built on  Slony 2.1.
    |
   V
Slave 

Given is actually setup of production, master/slave replication with log archive shipping using slony 2.1. And, Master with DR setup on same data center as DR-Master with Warm Standby configuration running fine with minimal lag.

Q: What I want to achieve ?
A:  Swap the Masters. Means, now slave should receive events from DR-Master once it get promoted to Master as shown below

Master   DR Master
    |            |
   X           |    Replication built on  Slony 2.1.
    |            |
   V            |
Slave <------|

I believed, the DR-Master will have the same _slonyschema except the primary hostname. So, I have changed on both DR-Master & Slave of sl_path of primary hostname as fix.
Assuming that, I have tested on my local box with below steps.

1. Stop slony between master,slave.
2. Take the _slonyschema dump of Master (In case any problem).
3. Promote the DR-Master.
4. update the _slonyschem.sl_path of column pa_conninfo with primary hostname on DR-Master & Slave.
5. Start slony daemons on DR-Master & Slave.

Above steps won't work all the time, I did many times, very few times I could able to send events from DR-master to Slave with proper numbering of .SQL files.(log archives).
I would request to recommend the correct procedure for swapping providers.

-- 
Regards
Raghav

_______________________________________________
Slony1-general mailing list
Slony1-general@...
http://lists.slony.info/mailman/listinfo/slony1-general
Steve Singer | 22 Jun 2012 00:02

Re: Swapping Providers

On 12-06-20 12:28 PM, Raghav wrote:
> Hi Steve,
>
> Below is the scenario I want to achieve.
>
>
>                        Warm-Standby
> Master  ------------------------------------------------ DR-Master
>      |
>      |    Replication built on  Slony 2.1.
>      |
>     V
> Slave
>
> Given is actually setup of production, master/slave replication with log
> archive shipping using slony 2.1. And, Master with DR setup on same data
> center as DR-Master with Warm Standby configuration running fine with
> minimal lag.
>
> Q: What I want to achieve ?
> A:  Swap the Masters. Means, now slave should receive events from
> DR-Master once it get promoted to Master as shown below

Have you read 'Controlled Switchover' 
http://www.slony.info/documentation/2.1/failover.html

I think the MOVE SET command is what you want.

You setup 'DR Master' as a subscriber to your sets just like 'slave' 
was.  You then MOVE SET to make 'DR Master' the new origin.  'slave' 
will receive updates from 'DR Master' instead of 'Master' , assuming you 
have a complete path network in place.

>
>             Master   DR Master
>
>                  |            |
>                 X           |    Replication built on  Slony 2.1.
>
>                  |            |
>
>                 V            |
>
>             Slave <------|
>
>
> I believed, the DR-Master will have the same _slonyschema except the
> primary hostname. So, I have changed on both DR-Master & Slave of
> sl_path of primary hostname as fix.
> Assuming that, I have tested on my local box with below steps.
>
> 1. Stop slony between master,slave.
> 2. Take the _slonyschema dump of Master (In case any problem).
> 3. Promote the DR-Master.
> 4. update the _slonyschem.sl_path of column pa_conninfo with primary
> hostname on DR-Master & Slave.
> 5. Start slony daemons on DR-Master & Slave.
>
> Above steps won't work all the time, I did many times, very few times I
> could able to send events from DR-master to Slave with proper numbering
> of .SQL files.(log archives).
> I would request to recommend the correct procedure for swapping providers.
>
> --
> Regards
> Raghav
> Blog: htt://raghavt.blogspot.com/ <http://raghavt.blogspot.com/>
>
>
>
> _______________________________________________
> Slony1-general mailing list
> Slony1-general@...
> http://lists.slony.info/mailman/listinfo/slony1-general
Raghav | 22 Jun 2012 05:27
Picon

Re: Swapping Providers


Q: What I want to achieve ?
A:  Swap the Masters. Means, now slave should receive events from
DR-Master once it get promoted to Master as shown below

Have you read 'Controlled Switchover' http://www.slony.info/documentation/2.1/failover.html

I think the MOVE SET command is what you want.

You setup 'DR Master' as a subscriber to your sets just like 'slave' was.  You then MOVE SET to make 'DR Master' the new origin.  'slave' will receive updates from 'DR Master' instead of 'Master' , assuming you have a complete path network in place.

Thank you Steve. Make sense to me.

Sorry to say, my question was how to avoid resubscribing DR-Master to slave and then promoting it to master. Since, complete information of master is available on dr-master.

Due to my curiosity level :)  I tested again with different approach. Since, DR Master is Streaming replication READ-ONLY server it will have same _slonyschema of all what's there on Master, just the connection string changes needed. So, I thought changing the connection string should fix.

I read this part of documentation and implemented below steps not for once, couple of times, this time it never failed.

1. Stop slon on Master/Slave
2. change store paths with function on DR-master.
aquent_cdb=# select _myrep.storepath(1,2,'host=dr-master.com dbname=master user=postgres port=5432',10);   ///Note: previsouly it was from master.com
 storepath
------------
 5000000017
(1 row)
4. Repeat the step 3 on slave to change the provider info.
5. Start slon on DR-Master/slave.

Above steps, went fine and all write operations on dr-master(newly promoted as master) started publishing on slave. 
Correct me if this approach is doable. 

--Raghav
_______________________________________________
Slony1-general mailing list
Slony1-general@...
http://lists.slony.info/mailman/listinfo/slony1-general
Steve Singer | 22 Jun 2012 15:00

Re: Swapping Providers

On 12-06-21 11:27 PM, Raghav wrote:
>
> Thank you Steve. Make sense to me.
>
> Sorry to say, my question was how to avoid resubscribing DR-Master to
> slave and then promoting it to master. Since, complete information of
> master is available on dr-master.
>
> Due to my curiosity level :)  I tested again with different approach.
> Since, DR Master is Streaming replication READ-ONLY server it will have
> same _slonyschema of all what's there on Master, just the connection
> string changes needed. So, I thought changing the connection string
> should fix.

You can use the built-in streaming replication for what you describe but 
if you want unplanned failover it should be the SYNCHRONOUS replication.

asynchronous replication will work fine for a controlled switchover 
(stop the slons, let your 'DR Master' get caught up with streaming 
replication' then bring it up as a read-write node and restart the slons).

However if you have an unplanned failure of your 'master' node the 
'slave' node might be further ahead via slony than your 'dr master' node 
via async streaming replication.  If this happens and you promote the 
'DR master' then you might have a very bad day.

>
> I read this part of documentation and implemented below steps not for
> once, couple of times, this time it never failed.
> http://slony.info/documentation/2.1/stmtstorepath.html
>
> 1. Stop slon on Master/Slave
> 2. change store paths with function on DR-master.
> aquent_cdb=# select _myrep.storepath(1,2,'host=*dr-master.com
> <http://dr-master.com>* dbname=master user=postgres port=5432',10);
> ///Note: previsouly it was from master.com <http://master.com>
>   storepath
> ------------
>   5000000017
> (1 row)
> 4. Repeat the step 3 on slave to change the provider info.
> 5. Start slon on DR-Master/slave.
>
> Above steps, went fine and all write operations on dr-master(newly
> promoted as master) started publishing on slave.
> Correct me if this approach is doable.
>
> --Raghav
Raghav | 22 Jun 2012 16:37
Picon

Re: Swapping Providers


You can use the built-in streaming replication for what you describe but if you want unplanned failover it should be the SYNCHRONOUS replication.

asynchronous replication will work fine for a controlled switchover (stop the slons, let your 'DR Master' get caught up with streaming replication' then bring it up as a read-write node and restart the slons).

However if you have an unplanned failure of your 'master' node the 'slave' node might be further ahead via slony than your 'dr master' node via async streaming replication.  If this happens and you promote the 'DR master' then you might have a very bad day.



Got it. Thanks Steve for all  your help.
I could able to fix this by storepath() function and as you said, I have taken care that 'slave' node never pass master events ahead.

--Raghav
_______________________________________________
Slony1-general mailing list
Slony1-general@...
http://lists.slony.info/mailman/listinfo/slony1-general
Jan Wieck | 24 Jun 2012 15:14
Picon
Favicon

Re: Swapping Providers

On 6/21/2012 11:27 PM, Raghav wrote:
>
>         Q: What I want to achieve ?
>         A:  Swap the Masters. Means, now slave should receive events from
>         DR-Master once it get promoted to Master as shown below
>
>
>     Have you read 'Controlled Switchover'
>     http://www.slony.info/__documentation/2.1/failover.__html
>     <http://www.slony.info/documentation/2.1/failover.html>
>
>     I think the MOVE SET command is what you want.
>
>     You setup 'DR Master' as a subscriber to your sets just like 'slave'
>     was.  You then MOVE SET to make 'DR Master' the new origin.  'slave'
>     will receive updates from 'DR Master' instead of 'Master' , assuming
>     you have a complete path network in place.
>
> Thank you Steve. Make sense to me.
>
> Sorry to say, my question was how to avoid resubscribing DR-Master to
> slave and then promoting it to master. Since, complete information of
> master is available on dr-master.
>
> Due to my curiosity level :)  I tested again with different approach.
> Since, DR Master is Streaming replication READ-ONLY server it will have
> same _slonyschema of all what's there on Master, just the connection
> string changes needed. So, I thought changing the connection string
> should fix.

Using streaming replication for the DR Master could present a problem. 
Since it is in the same data center the following may seem unlikely. But 
what can happen is that the DR Master is further behind in replication 
than the Slony Slave at the moment, disaster strikes and the Master 
becomes unavailable. We never know ahead what will cause the Master to 
go down and how fast it will be.

What if the death is affecting the network interface of Master first. At 
first it is just losing a few packets and some of them are packets from 
the WAL sender to the DR Master. It will take several seconds for the 
TCP/IP protocol to detect that and retransmit. Time enough for several 
more transactions to commit and Slony to replicate them. And before the 
DR Master can catch up, the motherboard finally fails with a puff of smoke.

What will happen with your below steps in this situation is that the 
Slave has some changes, that the DR Master doesn't have. The DR Master 
(now Master) will generate new SYNC events and the first (few) will have 
the same event number as ones, that the Slave had replicated from the 
old Master. They will be ignored by the Slave. So at the end the DR 
Master will be missing some changes made to the old Master and the Slave 
will be missing other changes that had been made against the DR Master.

Jan

>
> I read this part of documentation and implemented below steps not for
> once, couple of times, this time it never failed.
> http://slony.info/documentation/2.1/stmtstorepath.html
>
> 1. Stop slon on Master/Slave
> 2. change store paths with function on DR-master.
> aquent_cdb=# select _myrep.storepath(1,2,'host=*dr-master.com
> <http://dr-master.com>* dbname=master user=postgres port=5432',10);
> ///Note: previsouly it was from master.com <http://master.com>
>   storepath
> ------------
>   5000000017
> (1 row)
> 4. Repeat the step 3 on slave to change the provider info.
> 5. Start slon on DR-Master/slave.
>
> Above steps, went fine and all write operations on dr-master(newly
> promoted as master) started publishing on slave.
> Correct me if this approach is doable.
>
> --Raghav
>
>
> _______________________________________________
> Slony1-general mailing list
> Slony1-general@...
> http://lists.slony.info/mailman/listinfo/slony1-general
>

--

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
Raghav | 25 Jun 2012 09:36
Picon

Re: Swapping Providers


Using streaming replication for the DR Master could present a problem. Since it is in the same data center the following may seem unlikely. But what can happen is that the DR Master is further behind in replication than the Slony Slave at the moment, disaster strikes and the Master becomes unavailable. We never know ahead what will cause the Master to go down and how fast it will be.
 
What if the death is affecting the network interface of Master first. At first it is just losing a few packets and some of them are packets from the WAL sender to the DR Master. It will take several seconds for the TCP/IP protocol to detect that and retransmit. Time enough for several more transactions to commit and Slony to replicate them. And before the DR Master can catch up, the motherboard finally fails with a puff of smoke.

What will happen with your below steps in this situation is that the Slave has some changes, that the DR Master doesn't have. The DR Master (now Master) will generate new SYNC events and the first (few) will have the same event number as ones, that the Slave had replicated from the old Master. They will be ignored by the Slave. So at the end the DR Master will be missing some changes made to the old Master and the Slave will be missing other changes that had been made against the DR Master.


Great Analysis Jan... I thought all of those what you mentioned. All you mentioned are very true.
But, Thumb rule of all the thoughts is that, Master should be on top in Events numbering, Slave events should be always lower. For example if Master current events are in sequence of say '5000000023', then slave should be < 5000000023. If your Slave events tops the Master Event Seq number, then all your efforts go waste, even after promoting DR-Master as master.

I sticked to this rule only as Steve mentioned. I tried to maintain my DR-Master (Streaming Replication) to be topper in Events and did not allow Slave to cross master event number.

Simple test case would be, create master/slave/dr_master on one port with the rule I mentioned above.

1. Setup replication with two databases master/slave with one table replication and maintain SYNC.
2. Now Stop Slon on master & slave.
3. Here, assume we are promoting DR-master as Master. So,create new database as dr-master with template=master on same port, bcoz, its going to be the same as Master with Events when you stopped Slony in Step 2.

4. Now change the sl_path with store path() function on DR-Master & Slave. Since, now DR-master would be the provider.

dr_master=# select _myrep.storepath(1,2,'host=127.0.0.1 dbname=dr_master user=postgres port=5432',10);
 storepath
------------
 5000000021
(1 row)

On Slave 

slave1=# select _myrep.storepath(1,2,'host=127.0.0.1 dbname=dr_master user=postgres port=5432',10);
 storepath
------------
 5000000012
(1 row)

See above, my Dr-Master events number is greater than slave.

5. Now start slon process for dr-master/slave and you should see syncs.

-bash-4.1$ psql -d dr_master -c "select max(ev_seqno) from _myrep.sl_event;"
    max
------------
 5000000025
(1 row)

-bash-4.1$ psql -d slave1 -c "select max(ev_seqno) from _myrep.sl_event;"
    max
------------
 5000000023
(1 row)

Here onwards any DML's work smoothly.

--Raghav
_______________________________________________
Slony1-general mailing list
Slony1-general@...
http://lists.slony.info/mailman/listinfo/slony1-general
Jan Wieck | 25 Jun 2012 14:45
Picon
Favicon

Re: Swapping Providers

On 6/25/2012 3:36 AM, Raghav wrote:
>
>     Using streaming replication for the DR Master could present a
>     problem. Since it is in the same data center the following may seem
>     unlikely. But what can happen is that the DR Master is further
>     behind in replication than the Slony Slave at the moment, disaster
>     strikes and the Master becomes unavailable. We never know ahead what
>     will cause the Master to go down and how fast it will be.
>
>     What if the death is affecting the network interface of Master
>     first. At first it is just losing a few packets and some of them are
>     packets from the WAL sender to the DR Master. It will take several
>     seconds for the TCP/IP protocol to detect that and retransmit. Time
>     enough for several more transactions to commit and Slony to
>     replicate them. And before the DR Master can catch up, the
>     motherboard finally fails with a puff of smoke.
>
>     What will happen with your below steps in this situation is that the
>     Slave has some changes, that the DR Master doesn't have. The DR
>     Master (now Master) will generate new SYNC events and the first
>     (few) will have the same event number as ones, that the Slave had
>     replicated from the old Master. They will be ignored by the Slave.
>     So at the end the DR Master will be missing some changes made to the
>     old Master and the Slave will be missing other changes that had been
>     made against the DR Master.
>
>
> Great Analysis Jan... I thought all of those what you mentioned. All you
> mentioned are very true.
> But, Thumb rule of all the thoughts is that, Master should be on top in
> Events numbering, Slave events should be always lower. For example if
> Master current events are in sequence of say '5000000023', then slave
> should be < 5000000023. If your Slave events tops the Master Event Seq
> number, then all your efforts go waste, even after promoting DR-Master
> as master.
>
> I sticked to this rule only as Steve mentioned. I tried to maintain my
> DR-Master (Streaming Replication) to be topper in Events and did not
> allow Slave to cross master event number.
>
> Simple test case would be, create master/slave/dr_master on one port
> with the rule I mentioned above.
>
> 1. Setup replication with two databases master/slave with one table
> replication and maintain SYNC.
> 2. Now Stop Slon on master & slave.
> 3. Here, assume we are promoting DR-master as Master. So,create new
> database as dr-master with template=master on same port, bcoz, its going
> to be the same as Master with Events when you stopped Slony in Step 2.
>
> 4. Now change the sl_path with store path() function on DR-Master &
> Slave. Since, now DR-master would be the provider.
>
> dr_master=# select _myrep.storepath(1,2,'host=127.0.0.1 dbname=dr_master
> user=postgres port=5432',10);
>   storepath
> ------------
>   5000000021
> (1 row)
>
> On Slave
>
> slave1=# select _myrep.storepath(1,2,'host=127.0.0.1 dbname=dr_master
> user=postgres port=5432',10);
>   storepath
> ------------
>   5000000012
> (1 row)
>
> See above, my Dr-Master events number is greater than slave.

This analysis is flawed. The two event numbers are from different 
origins and therefore, don't compare to each other. The combination of 
ev_origin,ev_seqno can never be higher on any node, than it is on the 
origin itself.

Your disaster recovery plan assumes, that streaming replication will 
ALLWAYS be faster than Slony replication. But that is only true for 
synchronous streaming replication. If you use asynchronous streaming 
replication, then one tiny network glitch and your DR-master will be 
several seconds behind while the Slony replica may be not. If the master 
blows up in that moment, your plan fails.

To simulate this problem, Steve and I were pointing out, do the following:

1. Create your setup as before.
2. Stop the streaming replication (simulating the network communication 
problem)
3. Update a row on the master and wait for the SYNC to replicate.
4. Stop the slon processes. DO NOT let the streaming replica catch up 
with the now DEAD master. Assume the master and all its data, including 
WAL, have become unavailable.
5. Promote DR-master and do the two store path commands.
6. Start slon processes.
7. Update another row on the new master.
8. Compare table content on master and slave.

You can detect the problem before step 5 by comparing the ev_seqno with 
ev_origin=old-master on the DR-master and slave. Whichever is higher 
should be promoted to master. In the unlikely case that it is the Slony 
slave, you will have to rebuild the DR-master from scratch, though.

Jan

--

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
Raghav | 25 Jun 2012 18:45
Picon

Re: Swapping Providers

This analysis is flawed. The two event numbers are from different origins and therefore, don't compare to each other. The combination of ev_origin,ev_seqno can never be higher on any node, than it is on the origin itself.

Thanks Jan. For correcting me, let me recheck thoroughly from my side why I concluded on the seqno.
 
Your disaster recovery plan assumes, that streaming replication will ALLWAYS be faster than Slony replication. But that is only true for synchronous streaming replication. If you use asynchronous streaming replication, then one tiny network glitch and your DR-master will be several seconds behind while the Slony replica may be not. If the master blows up in that moment, your plan fails.

Yes, very much true and completely agreed. Aim is to not to have any lag between Master & DR-Master. 
As you said, in any case if Master & DR-Master are not in Sync due to network or any lag then my whole plan collapses. So, I made my goal clear here, that I need to STOP slony between Master & Slave long before promoting DR-master as Master, to make sure nothing is left on Master to update on DR-master.  
 
To simulate this problem, Steve and I were pointing out, do the following:

1. Create your setup as before.
2. Stop the streaming replication (simulating the network communication problem)
3. Update a row on the master and wait for the SYNC to replicate.
4. Stop the slon processes. DO NOT let the streaming replica catch up with the now DEAD master. Assume the master and all its data, including WAL, have become unavailable.
5. Promote DR-master and do the two store path commands.
6. Start slon processes.
7. Update another row on the new master.
8. Compare table content on master and slave.

You can detect the problem before step 5 by comparing the ev_seqno with ev_origin=old-master on the DR-master and slave. Whichever is higher should be promoted to master. In the unlikely case that it is the Slony slave, you will have to rebuild the DR-master from scratch, though.


Exactly, these are the steps I followed and succeded, as pointed I never simulated Step 2,3 in my testing, because If MASTER and newly promoted DR-Master are not same in any case then the success is nowhere near by surroundings :) .
 
I will surely retake the test with the steps mentioned on my two VM's and update my finding. 

Once again, Thank you very much to you & Steve for helping me in this scenario.
  
--
Regards
Raghav

_______________________________________________
Slony1-general mailing list
Slony1-general@...
http://lists.slony.info/mailman/listinfo/slony1-general
Raghav | 25 Jun 2012 18:55
Picon

Re: Swapping Providers

Also, one more point on step 3, there cannot be any DML's until this operation is done successfully, because, in step 2 you are stopping SR and doing DML's on Step 3, which is clearly know fact that your DR-master dont have any effect made in Step 3.


--Raghav
_______________________________________________
Slony1-general mailing list
Slony1-general@...
http://lists.slony.info/mailman/listinfo/slony1-general
Raghav | 1 Jul 2012 14:32
Picon

Re: Swapping Providers

Sorry for late on this, its due to actual Production swap implementation. Though I tested as suggested, I didnt get time to post my reply. 
After doing vigorous testing by taking into consideration Steve & Jan points, we succeded it without any problem.

This analysis is flawed. The two event numbers are from different origins and therefore, don't compare to each other. The combination of ev_origin,ev_seqno can never be higher on any node, than it is on the origin itself.

Yes, my analysis has flaw. Its no where related to SEQ.No...which I analyzed and every node seq.no number is independed by itself. 

To simulate this problem, Steve and I were pointing out, do the following:
1. Create your setup as before.
2. Stop the streaming replication (simulating the network communication problem)
3. Update a row on the master and wait for the SYNC to replicate.
4. Stop the slon processes. DO NOT let the streaming replica catch up with the now DEAD master. Assume the master and all its data, including WAL, have become unavailable.
5. Promote DR-master and do the two store path commands.
6. Start slon processes.
7. Update another row on the new master.
8. Compare table content on master and slave.
You can detect the problem before step 5 by comparing the ev_seqno with ev_origin=old-master on the DR-master and slave. Whichever is higher should be promoted to master. In the unlikely case that it is the Slony slave, you will have to rebuild the DR-master from scratch, though.

Yep, I could able to detect before Step 5. 

In short of our actual production swap, we stopped application(NO DML's) and Slony. Checked DR-master(which is warm standby) is complete sync with Master(primary). Promoted DR-master, did changes to store_path on both the ends. Started slon on both the nodes. Sync catch-up well. We have taken complete care of NOT happening any Step 2 or 3 though .. :)

Thanks once again.

--Raghav

_______________________________________________
Slony1-general mailing list
Slony1-general@...
http://lists.slony.info/mailman/listinfo/slony1-general

Gmane