Daniel Phillips | 28 Jul 07:46

Two kinds of atomic commit

Here I describe two slightly different methods that Tux3 will use to
implement atomic commit of data and metadata.  Both methods combine 
logical and physical forward logging to perform atomic commits
efficiently.  The discussion below assumes we are updating a btree leaf,
for example to make room for more inode data or to add pointers to a
file index.  The same techniques apply equally well to all structures
in the filesystem.

1) The Update a Clone Method

Instead of directly modifying a disk block corresponding to a btree
leaf, we allocate a new block for the leaf and copy the contents of
the original block to the new block, only in the buffer cache (no copy
is performed on disk).  We can now alter the new block at will and
flush it out to disk without affecting the on-disk btree at all.  But
have not yet linked the new block into the btree.  We could accomplish
that by performing a similar clone update recursively up to the root of
the tree, which creates a new tree root.  The whole chain of new blocks
would then be flushed out to disk and a pointer to the new root stored
at some predictable location so it can be found later.  This is the
"phase tree" method that I invented for Tux2, and is commonly called
"copy on write" these days.  It could also be called the "sue me" method
because Netapp likes to sue those such as Sun who implement it.

Fortunately, there is a better way to do it that I invented recently and
which I have never heard of anyone using before.  We modify only the
leaf node of the btree by cloning and record the pointer to the clone
only in the cached btree index block, not on disk.  To be able to
reconstruct the cached version of the index node after a crash, we log a
logical change record to the disk that says "write this leaf into that
(Continue reading)

Matthew Dillon | 28 Jul 18:58

Re: Two kinds of atomic commit

:1) The Update a Clone Method
:
:2) The Update in Place Method
:
:Daniel

    Hmm.  Those seem fairly complex.  How would you deal with incidental
    operations on the B-Tree such as doing a split?  A single insert
    could wind up requiring a split of every internal node on the way
    down to the leaf.  I guess you could clone the blocks, but there are
    a ton of parent and child linkages that have to be changed when you
    split a node.  It sounds pretty expensive.

    Maybe implementing UNDOs for incidental B-Tree operations is the ticket.
    Incidental meaning those operations (such as splits) which lead up to
    an insertion or deletion but do not actually perform the insertion
    or deletion.   On crash recovery you would undo partially completed
    B-Tree operations and then REDO the related logical operations.

    Having to allocate new blocks for B-Tree deletions could create a big
    issue when the filesystem becomes full or near-full.

					-Matt
					Matthew Dillon 
					<dillon <at> backplane.com>
Matthew Dillon | 28 Jul 18:58

Re: Two kinds of atomic commit

:1) The Update a Clone Method
:
:2) The Update in Place Method
:
:Daniel

    Hmm.  Those seem fairly complex.  How would you deal with incidental
    operations on the B-Tree such as doing a split?  A single insert
    could wind up requiring a split of every internal node on the way
    down to the leaf.  I guess you could clone the blocks, but there are
    a ton of parent and child linkages that have to be changed when you
    split a node.  It sounds pretty expensive.

    Maybe implementing UNDOs for incidental B-Tree operations is the ticket.
    Incidental meaning those operations (such as splits) which lead up to
    an insertion or deletion but do not actually perform the insertion
    or deletion.   On crash recovery you would undo partially completed
    B-Tree operations and then REDO the related logical operations.

    Having to allocate new blocks for B-Tree deletions could create a big
    issue when the filesystem becomes full or near-full.

					-Matt
					Matthew Dillon 
					<dillon <at> backplane.com>

Daniel Phillips | 28 Jul 21:52

Re: Two kinds of atomic commit

On Monday 28 July 2008 09:58, Matthew Dillon wrote:
> :1) The Update a Clone Method
> :
> :2) The Update in Place Method
> :
> :Daniel
> 
>     Hmm.  Those seem fairly complex.

Then I need to rewrite the post so it seems as simple as it is :-)

>     How would you deal with incidental 
>     operations on the B-Tree such as doing a split?  A single insert
>     could wind up requiring a split of every internal node on the way
>     down to the leaf.  I guess you could clone the blocks, but there are
>     a ton of parent and child linkages that have to be changed when you
>     split a node.  It sounds pretty expensive.

In this case a log transaction is created containing all of the split
nodes as physical updates and possibly some merged nodes from the free
tree.  Btree splitting can always be committed atomically and
independently of any other activity on the filesystem.  It is nicely
bounded by the btree depth.  Allocations that do not require splitting
the free tree are logged logically, leaving the affected free tree
blocks dirty in cache.  So the allocation updates come "for free",
being tucked into the same commit block that governs the split btree
leaves.

>     Maybe implementing UNDOs for incidental B-Tree operations is the ticket.
>     Incidental meaning those operations (such as splits) which lead up to
(Continue reading)


Gmane