28 Jul 07:46
Two kinds of atomic commit
Daniel Phillips <phillips <at> phunq.net>
2008-07-28 05:46:45 GMT
2008-07-28 05:46:45 GMT
Here I describe two slightly different methods that Tux3 will use to implement atomic commit of data and metadata. Both methods combine logical and physical forward logging to perform atomic commits efficiently. The discussion below assumes we are updating a btree leaf, for example to make room for more inode data or to add pointers to a file index. The same techniques apply equally well to all structures in the filesystem. 1) The Update a Clone Method Instead of directly modifying a disk block corresponding to a btree leaf, we allocate a new block for the leaf and copy the contents of the original block to the new block, only in the buffer cache (no copy is performed on disk). We can now alter the new block at will and flush it out to disk without affecting the on-disk btree at all. But have not yet linked the new block into the btree. We could accomplish that by performing a similar clone update recursively up to the root of the tree, which creates a new tree root. The whole chain of new blocks would then be flushed out to disk and a pointer to the new root stored at some predictable location so it can be found later. This is the "phase tree" method that I invented for Tux2, and is commonly called "copy on write" these days. It could also be called the "sue me" method because Netapp likes to sue those such as Sun who implement it. Fortunately, there is a better way to do it that I invented recently and which I have never heard of anyone using before. We modify only the leaf node of the btree by cloning and record the pointer to the clone only in the cached btree index block, not on disk. To be able to reconstruct the cached version of the index node after a crash, we log a logical change record to the disk that says "write this leaf into that(Continue reading)
> How would you deal with incidental
> operations on the B-Tree such as doing a split? A single insert
> could wind up requiring a split of every internal node on the way
> down to the leaf. I guess you could clone the blocks, but there are
> a ton of parent and child linkages that have to be changed when you
> split a node. It sounds pretty expensive.
In this case a log transaction is created containing all of the split
nodes as physical updates and possibly some merged nodes from the free
tree. Btree splitting can always be committed atomically and
independently of any other activity on the filesystem. It is nicely
bounded by the btree depth. Allocations that do not require splitting
the free tree are logged logically, leaving the affected free tree
blocks dirty in cache. So the allocation updates come "for free",
being tucked into the same commit block that governs the split btree
leaves.
> Maybe implementing UNDOs for incidental B-Tree operations is the ticket.
> Incidental meaning those operations (such as splits) which lead up to
RSS Feed