Peter Wilkinson | 21 Jun 2009 12:38

Replication - thoughts, patches, thinking out loud.

Hi,
I've been thinking about replication of Durus for a while and being  
bored decided to try out a theory I've come up with.

Firstly the issue that I've been stuck on is how to get a Durus  
replication setup going such that the db is rarely more than a few  
seconds out of date. Using rsync is nice and easy but has some fairly  
well known issues (is the replica really the same as the master? how  
to deal with packing? how often do you run it?)

My hope was to have a simple setup that would allow for master->slave  
replication and deal with changing masters due to packing.

Attached are 2 files, a changed shelf.py and a new replicate.py which  
work with the standard Durus 3.8 tar.gz sources.

The changes to shelf.py are fairly simple and basically boil down to
having a hash value calculated for each transaction that is committed
and tacking that on at the end of the transaction, i.e. the format for
a transaction looks like: 8 bytes length, x bytes transaction data, 20
bytes hash. The 20 bytes are included in the transaction length.

This hash is calculated on the previous hash value concatenated with
the new transaction data. My thinking is that by calculating a hash
through the file like this, a slave can be compared to the master by
just comparing the last 20 bytes of the slave with the 20 bytes at the
same index in the master file; any change in any previous transaction
would show up as a different hash.
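
A minimal sketch of the record layout and hash chain described above. The
attached shelf.py is not reproduced here, so several details are assumptions:
SHA-1 is used only because it yields 20 bytes, the length is packed as a
big-endian unsigned 64-bit integer, and the chain starts from 20 zero bytes.
The names append_record and replica_matches are hypothetical.

```python
import hashlib
import struct

HASH_LEN = 20                  # SHA-1 digest size; the hash function is an assumption
SEED = b"\x00" * HASH_LEN      # hypothetical starting value for the chain


def append_record(log: bytearray, prev_hash: bytes, data: bytes) -> bytes:
    """Append one record: 8-byte length, data, 20-byte chained hash."""
    digest = hashlib.sha1(prev_hash + data).digest()
    # The stored length covers the data plus the trailing hash.
    log += struct.pack(">Q", len(data) + HASH_LEN) + data + digest
    return digest


def replica_matches(master: bytes, replica: bytes) -> bool:
    """Compare the replica's last 20 bytes with the master at the same offset."""
    if len(replica) < HASH_LEN or len(replica) > len(master):
        return False
    end = len(replica)
    return replica[end - HASH_LEN:end] == master[end - HASH_LEN:end]
```

Because each hash depends on the one before it, a replica that differs in
any earlier transaction ends with a different 20 bytes, so only those bytes
need to be read from each side.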

By being able to safely compare the 2 files the replicate script is  
(Continue reading)

Binger David | 26 Jun 2009 15:00

Re: Replication - thoughts, patches, thinking out loud.

It seems like your replication strategy works unnecessarily hard to
track transaction boundaries.  If the master fails and the slave has a
partial transaction, then the new server process would need to truncate
the partial transaction at startup, just as it would if the same
condition happened without replication involved.
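
A rough sketch of truncating a partial trailing transaction at startup.
This is not Durus's actual recovery code: it assumes a hypothetical file
made only of records with an 8-byte big-endian length prefix, and the
`start` parameter stands in for whatever header the real file has. It also
reads the whole file into memory, which a real implementation would avoid.

```python
import struct

LEN_SIZE = 8  # assumed width of the length prefix


def last_complete_offset(buf: bytes, start: int = 0) -> int:
    """Return the offset just past the last complete length-prefixed record."""
    pos = start
    while pos + LEN_SIZE <= len(buf):
        (length,) = struct.unpack(">Q", buf[pos:pos + LEN_SIZE])
        end = pos + LEN_SIZE + length
        if end > len(buf):
            break  # this record was cut off part-way through
        pos = end
    return pos


def truncate_partial(path: str, start: int = 0) -> int:
    """Cut the file back to its last complete record; return the new size."""
    with open(path, "r+b") as f:
        good = last_complete_offset(f.read(), start)
        f.truncate(good)
    return good
```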

The careful rsync strategy that I think I've posted here earlier can
easily run every minute, and it recognizes when packs happen.  If you
need more frequent updates, I think you can use the same inode-checking
strategy along with a remote "tail -f" to get the job done.  Is that
not right?
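
One way the inode check plus "tail"-style following could look, shown on a
local file for simplicity. The function name read_new_bytes and its return
shape are hypothetical, and a remote setup would need some transport that
is not sketched here.

```python
import os


def read_new_bytes(path: str, last_inode: int, offset: int):
    """Return (inode, new_offset, data, replaced).

    If the inode changed or the file shrank, the file was replaced (for
    example by a pack), so reading restarts from offset 0.
    """
    with open(path, "rb") as f:
        st = os.fstat(f.fileno())
        replaced = st.st_ino != last_inode or st.st_size < offset
        if replaced:
            offset = 0
        f.seek(offset)
        data = f.read()
    return st.st_ino, offset + len(data), data, replaced
```

The caller keeps the returned inode and offset between runs; when
`replaced` is true, the slave copy should be discarded rather than
appended to.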

I think what you've done is cool, I'm just not sure if it is cool
enough to change the file format.  Am I overlooking something?

Peter Wilkinson | 2 Jul 2009 14:06

Re: Replication - thoughts, patches, thinking out loud.

Hi David,

Thanks for having a look and sorry for the slow response.

Underlying a lot of my ongoing experimentation is growing databases,
many GBs so far and continuing to grow. Getting foolproof fast
replication running against big databases is a priority. I currently
use full rsyncs but have been trying to think of ways to be more
efficient.

One of my hats in my day job is as a sysadmin and we use rsync heavily  
on busy machines and I've grown to be wary of the storm of IO it can  
generate on lots of data. Getting rsync to not read all of the source  
and destination requires the append option that has been spoken about  
before which leads to the issue of ensuring that the destination file  
is a strict subset of the source one.

This is where my experimentation started: how can we know that the
destination is that strict subset? My first thought was to just make a
unique header for each file and compare those, but then I realised that
to append anything to the slave the data has to come from the same
point in the master, which is hard to know after the appending has been
interrupted. E.g. the slave server is getting an update and is offline
for 10 minutes; from where in the master does the data get read?

The two options to this issue, as I see it, are to run a full rsync
for each replication run so that the slave state can be anything at  
all and it will be cleaned up or to keep track of some structure in  
the slave and master and compare where the slave is at with the master  
and therefore be able to append cleanly. I was very pleasantly  
(Continue reading)

Neil Schemenauer | 7 Jul 2009 18:57

Re: Replication - thoughts, patches, thinking out loud.

Peter Wilkinson <pfw <at> thirdfloor.com.au> wrote:
> Underlying a lot of my ongoing experimentation is growing databases,
> many GBs so far and continuing to grow. Getting foolproof fast
> replication running against big databases is a priority. I currently
> use full rsyncs but have been trying to think of ways to be more
> efficient.

I'm using the following script in combination with rsync.  It makes
the synchronization much faster since rsync will only check the
filename and mtime for most chunks.

    http://python.ca/nas/python/split_durus_fs.py
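
The general idea of splitting, sketched without reference to the script at
the URL above (its actual behaviour may differ). The chunk naming, the
default chunk size and the function name split_into_chunks are all
assumptions. Full chunks are written once and then left alone, so their
size and mtime stay fixed and rsync's default quick check can skip them.

```python
import os


def split_into_chunks(src: str, dest_dir: str, chunk_size: int = 1 << 26) -> list:
    """Write src as numbered chunk files; return the names of chunks (re)written."""
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    size = os.path.getsize(src)
    count = (size + chunk_size - 1) // chunk_size
    with open(src, "rb") as f:
        for index in range(count):
            name = "chunk.%08d" % index
            path = os.path.join(dest_dir, name)
            # A chunk already on disk at full size is assumed unchanged,
            # because the source file only grows between packs.
            if os.path.exists(path) and os.path.getsize(path) == chunk_size:
                continue
            f.seek(index * chunk_size)
            with open(path, "wb") as out:
                out.write(f.read(chunk_size))
            written.append(name)
    return written
```

This sketch does not handle a pack; after one, the existing chunks would
have to be thrown away and rebuilt.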

The test for a packed DB is the only weakness, AFAIK. I had
requested a pack counter be added to the storage file header but
David did not go for it. The inode check I use should be very safe.
While Durus packs the DB, both the old and new file are present,
guaranteeing that they have different inode numbers. The only way
this check could be fooled is if there were multiple packs done
between script runs and by chance the file ended up with original
inode number.

Regards,

  Neil

Binger David | 7 Jul 2009 19:46

Re: Replication - thoughts, patches, thinking out loud.


On Jul 7, 2009, at 12:57 PM, Neil Schemenauer wrote:
> I'm using the following script in combination with rsync.  It makes
> the synchronization much faster since rsync will only check the
> filename and mtime for most chunks.

I don't understand why splitting would be better than using
the --append flag on rsync.

If, in addition to the inode number, you watch for changes in the
ctime of the file and/or the .prepack file, I think you could
avoid the possibility of an inode number being the same in a
twice-packed file.

Neil Schemenauer | 7 Jul 2009 20:35

Re: Replication - thoughts, patches, thinking out loud.

On Tue, Jul 07, 2009 at 01:46:33PM -0400, Binger David wrote:
> I don't understand why splitting would be better than using
> the --append flag on rsync.

The --append flag does not handle a packed database. I suppose you
could create a script that detected a pack and called rsync without
the --append flag in that case.
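
Such a wrapper could choose its rsync arguments along these lines. This is
a sketch only: the function name rsync_command is hypothetical, the
`packed` flag is assumed to come from some separate pack detection, and
nothing is executed here.

```python
def rsync_command(src: str, dest: str, packed: bool) -> list:
    """Build an rsync argument list; --append is dropped after a pack."""
    cmd = ["rsync", "--times"]
    if not packed:
        # Safe only while the destination is a prefix of the source.
        cmd.append("--append")
    cmd += [src, dest]
    return cmd
```

The resulting list can be handed to subprocess.run() as is.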

> If, in addition to the inode number, you watch for changes in the
> ctime of the file and/or the .prepack file, I think you could
> avoid the possibility of an inode number being the same in a
> twice-packed  file.

I don't see how that is bullet-proof either. The ctime changes at
least as fast as mtime. My solution works 100% as long as the
pack interval is longer than the time between splits.

Regards,

  Neil

Binger David | 8 Jul 2009 06:31

Re: Replication - thoughts, patches, thinking out loud.


On Jul 7, 2009, at 2:35 PM, Neil Schemenauer wrote:

> On Tue, Jul 07, 2009 at 01:46:33PM -0400, Binger David wrote:
>> I don't understand why splitting would be better than using
>> the --append flag on rsync.
>
> The --append flag does not handle a packed database. I suppose you
> could create a script that detected a pack and called rsync without
> the --append flag in that case.

I think that is what the script I posted here does.
It checks the time and inode on the .prepack file instead of the
database file itself.  When those change, you know that
a pack has been completed.
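
A sketch of that check, not the posted script itself. It assumes "time"
means the modification time, and the names prepack_signature and
pack_completed are hypothetical.

```python
import os


def prepack_signature(prepack_path: str):
    """Return (inode, mtime_ns) of the .prepack file, or None if it is absent."""
    try:
        st = os.stat(prepack_path)
    except FileNotFoundError:
        return None
    return (st.st_ino, st.st_mtime_ns)


def pack_completed(previous, prepack_path: str) -> bool:
    """True if the .prepack file's signature differs from the one saved earlier."""
    return prepack_signature(prepack_path) != previous
```

The caller stores the signature after each run and compares it on the
next one; any difference means the next transfer should not append.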

>
>> If, in addition to the inode number, you watch for changes in the
>> ctime of the file and/or the .prepack file, I think you could
>> avoid the possibility of an inode number being the same in a
>> twice-packed  file.
>
> I don't see how that is bullet-proof either. The ctime changes at
> least as fast as mtime. My solution works 100% as long as the
> pack interval is longer than the time between splits.

I think that is why my script uses the .prepack file, and
I think it works for any pack interval.

Binger David | 2 Jul 2009 15:52

Re: Replication - thoughts, patches, thinking out loud.


On Jul 2, 2009, at 8:06 AM, Peter Wilkinson wrote:

> Hi David,
>
> Thanks for having a look and sorry for the slow response.
>
> Underlying a lot of my ongoing experimentation is growing databases,
> many GBs so far and continuing to grow. Getting foolproof fast
> replication running against big databases is a priority. I currently
> use full rsyncs but have been trying to think of ways to be more
> efficient.
>
> One of my hats in my day job is as a sysadmin and we use rsync  
> heavily on busy machines and I've grown to be wary of the storm of  
> IO it can generate on lots of data. Getting rsync to not read all of  
> the source and destination requires the append option that has been  
> spoken about before which leads to the issue of ensuring that the  
> destination file is a strict subset of the source one.

Agreed, and we are dealing with that by monitoring for changes in the
stat of the prepack file on the remote machine, and by using
--append-verify instead of --append, always or occasionally.
This does not seem to generate excessive traffic.

#!/usr/bin/env python
"""
Backup remote Durus database.
This assumes that you have configured your system so that
(Continue reading)

