Liu Bo | 14 Apr 13:14 2013

[PATCH v2 0/2] Online data deduplication

This is the second attempt at online data deduplication.

NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data!

Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]

This patch set is also related to "Content based storage" in project ideas[2].

For more implementation details, please refer to PATCH 1.

PATCH 2 fixes a hang that occurs when deduplication is on.

* a bit-by-bit comparison callback.
* an IOCTL for enabling deduplication.

All comments are welcome!

* To avoid enlarging the file extent item's size, add another index key used for
  freeing dedup extents.
* Freeing a dedup extent now works the same way as deleting a checksum.
* Add support for an alternative deduplication blocksize larger than PAGESIZE.
* Add a mount option to set the deduplication blocksize.
* Add support for writes that are smaller than the deduplication blocksize.

HOW to turn deduplication on:


Liu Bo | 14 Apr 13:14 2013

[PATCH v2 2/2] Btrfs: skip merge part for delayed data refs

When data deduplication is on, we can hang in the merge part because it
needs to verify every queued delayed data ref related to this disk
offset.

And in the case of delayed data refs, we usually don't have too many
refs to merge.

So it's safe to skip merging for data refs.

Signed-off-by: Liu Bo < <at>>
 fs/btrfs/delayed-ref.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index b7a0641..34670c8 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -316,6 +316,13 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
 	struct rb_node *node;
 	u64 seq = 0;

+	/*
+	 * We don't have too many refs to merge in the case of delayed data
+	 * refs.
+	 */
+	if (head->is_data)
+		return;

Liu Bo | 14 Apr 13:14 2013

[PATCH v2 1/2] Btrfs: online data deduplication

(NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.)

This introduces the online data deduplication feature for btrfs.

(1) WHY do we need deduplication?
    To improve our storage efficiency.

(2) WHAT is deduplication?
    There are two key design choices in practical deduplication implementations:
    *  When the data is deduplicated
       (inband vs background)
    *  The granularity of the deduplication.
       (block level vs file level)

    For btrfs, we choose
    *  inband (synchronous)
    *  block level

    We choose them for the same reasons zfs does:
    a)  To get an immediate benefit.
    b)  To remove redundant parts within a file.

    So we have an inband, block level data deduplication here.

(3) HOW does deduplication work?
    This makes full use of file extent back reference, the same way as
    IOCTL_CLONE, which lets us easily store multiple copies of a set of
    data as a single copy along with an index of references to the copy.

    Here we have