Christopher Siden | 29 Sep 00:17 2012

3236 zio nop-write

Implemented by George Wilson.
See bug report for full description: https://www.illumos.org/issues/3236
Webrev here: http://cr.illumos.org/~webrev/csiden/illumos-3236/

Delphix leans on this pretty heavily for certain workloads, and we
have been testing it extensively. There are a few new test cases for
the ZFS test suite to go along with it that haven't been pushed yet.

Chris

Richard Laager | 29 Sep 01:04 2012

Re: 3236 zio nop-write

On Fri, 2012-09-28 at 15:17 -0700, Christopher Siden wrote:
> See bug report for full description: https://www.illumos.org/issues/3236

From the bug report:
>> This functionality is only enabled if:
>> 1) The old and new blocks are checksummed using the same algorithm.

>> 2) That algorithm is cryptographically secure (e.g. sha256)

Are there real-world users that care about dedup=verify? If so, they're
not going to like this behavior.

How hard would verification be to implement? If it's not too bad, then
nop-write could apply with the default checksum algorithm (fletcher4). I
realize it'd add to the write latency, but only when the checksum
matches. And does this hit the ZIL before or after the nop-write
decision is made? If before, it's probably a non-issue in the
real world.

>> 3) Compression is enabled on that block.

Why is compression a requirement? (I'm sorry if I'm missing something
obvious.)

-- 
Richard

Matthew Ahrens | 29 Sep 01:10 2012

Re: 3236 zio nop-write

On Fri, Sep 28, 2012 at 4:04 PM, Richard Laager <rlaager <at> wiktel.com> wrote:
> On Fri, 2012-09-28 at 15:17 -0700, Christopher Siden wrote:
> > See bug report for full description: https://www.illumos.org/issues/3236
>
> From the bug report:
> >> This functionality is only enabled if:
> >> 1) The old and new blocks are checksummed using the same algorithm.
> >> 2) That algorithm is cryptographically secure (e.g. sha256)
>
> Are there real-world users that care about dedup=verify? If so, they're
> not going to like this behavior.

I assert no.

> How hard would verification be to implement?

It would be nontrivial to implement without reading from disk in syncing
context, which can lead to performance pathologies.

> >> 3) Compression is enabled on that block.
>
> Why is compression a requirement? (I'm sorry if I'm missing something
> obvious.)

Compression is not needed to make this work, but it tells us that the
consumer is OK with us allocating less space than they might otherwise
expect.  Without compression, one might reasonably expect that zero-filling
a file will "tick provision" it.
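
Roughly, the gate amounts to a few comparisons.  The following is a
self-contained sketch only, not the actual zio_nop_write() code:
blkprops_t and the enum values are invented here for illustration, the
real checks operate on block pointers, and the write is only actually
elided when the old and new block checksums also match.

#include <stdio.h>

typedef enum { CKSUM_FLETCHER4, CKSUM_SHA256 } cksum_alg_t;
typedef enum { COMPRESS_OFF, COMPRESS_LZJB } compress_alg_t;

typedef struct blkprops {
    cksum_alg_t    bp_checksum;
    compress_alg_t bp_compress;
} blkprops_t;

static int
nop_write_eligible(const blkprops_t *oldbp, const blkprops_t *newbp)
{
    /* 1) old and new blocks are checksummed with the same algorithm */
    if (oldbp->bp_checksum != newbp->bp_checksum)
        return (0);
    /* 2) that algorithm is cryptographically secure (e.g. sha256) */
    if (newbp->bp_checksum != CKSUM_SHA256)
        return (0);
    /* 3) compression is enabled on the block */
    if (newbp->bp_compress == COMPRESS_OFF)
        return (0);
    return (1);
}

int
main(void)
{
    blkprops_t oldbp = { CKSUM_SHA256, COMPRESS_LZJB };
    blkprops_t newbp = { CKSUM_SHA256, COMPRESS_LZJB };

    printf("nop-write eligible: %s\n",
        nop_write_eligible(&oldbp, &newbp) ? "yes" : "no");
    return (0);
}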

--matt
Justin T. Gibbs | 29 Sep 01:50 2012

Re: 3236 zio nop-write

On Sep 28, 2012, at 5:10 PM, Matthew Ahrens <mahrens <at> delphix.com> wrote:

> Compression is not needed to make this work, but it tells us that the
> consumer is OK with us allocating less space than they might otherwise
> expect.  Without compression, one might reasonably expect that
> zero-filling a file will "tick provision" it.

You mean in terms of disassociating the file from blocks shared with a snapshot?  The sparse file case doesn't seem to be changed by this patch (see explicit test of BP_IS_HOLE(bp_orig) in zio_nop_write()).

I've heard of folks explicitly rewriting files to manually "un-dedup" a volume after dedup is disabled, but never to divorce files from a snapshot.

--
Justin
Matthew Ahrens | 29 Sep 01:56 2012

Re: 3236 zio nop-write

On Fri, Sep 28, 2012 at 4:50 PM, Justin T. Gibbs <gibbs <at> scsiguy.com> wrote:
> On Sep 28, 2012, at 5:10 PM, Matthew Ahrens <mahrens <at> delphix.com> wrote:
> > Compression is not needed to make this work, but it tells us that the
> > consumer is OK with us allocating less space than they might otherwise
> > expect.  Without compression, one might reasonably expect that
> > zero-filling a file will "tick provision" it.

^ should be "thick" provision

> You mean in terms of disassociating the file from blocks shared with a
> snapshot?  The sparse file case doesn't seem to be changed by this patch
> (see explicit test of BP_IS_HOLE(bp_orig) in zio_nop_write()).

That's right.  Maybe this is an overabundance of caution.  Certainly the refreservation would be a better way of accomplishing this type of thick provisioning.

--matt
Christopher Siden | 10 Nov 01:30 2012

Re: 3236 zio nop-write

I let this slip. I've updated the webrev with the changes rebased onto the latest revision (nothing actually changed). I'm going to rebuild, retest, and then submit for RTI.



Chris

Garrett D'Amore | 10 Nov 04:55 2012

Re: 3236 zio nop-write


On Nov 9, 2012, at 4:30 PM, Christopher Siden <christopher.siden <at> delphix.com> wrote:

>>> Are there real-world users that care about dedup=verify? If so, they're
>>> not going to like this behavior.
>>
>> I assert no.

So, thinking about this problem some more, here are Wikipedia's birthday-problem calculations.  For SHA-256, the relevant row in the table is the 256-bit (64-hex-digit) one.  (Roughly, anyway: I think there has been some evidence that SHA-256 -- indeed the entire SHA family -- is not perfectly smooth.)

Hex    Bits  Hash space     Number of hashed elements such that P(at least one collision) = p
chars        (2^bits)       p=10^-18   p=10^-15   p=10^-12   p=10^-9    p=10^-6    p=0.1%     p=1%       p=25%      p=50%      p=75%
  8     32   4.3 x 10^9     2          2          2          2.9        93         2.9x10^3   9.3x10^3   5.0x10^4   7.7x10^4   1.1x10^5
 16     64   1.8 x 10^19    6.1        1.9x10^2   6.1x10^3   1.9x10^5   6.1x10^6   1.9x10^8   6.1x10^8   3.3x10^9   5.1x10^9   7.2x10^9
 32    128   3.4 x 10^38    2.6x10^10  8.2x10^11  2.6x10^13  8.2x10^14  2.6x10^16  8.3x10^17  2.6x10^18  1.4x10^19  2.2x10^19  3.1x10^19
 64    256   1.2 x 10^77    4.8x10^29  1.5x10^31  4.8x10^32  1.5x10^34  4.8x10^35  1.5x10^37  4.8x10^37  2.6x10^38  4.0x10^38  5.7x10^38
(96)  (384)  (3.9 x 10^115) 8.9x10^48  2.8x10^50  8.9x10^51  2.8x10^53  8.9x10^54  2.8x10^56  8.9x10^56  4.8x10^57  7.4x10^57  1.0x10^58
128    512   1.3 x 10^154   1.6x10^68  5.2x10^69  1.6x10^71  5.2x10^72  1.6x10^74  5.2x10^75  1.6x10^76  8.8x10^76  1.4x10^77  1.9x10^77

So that's 4.8 x 10^29 items (blocks, say) before there is even a distant possibility of a collision.  (I can live with p = 10^-18.  :-)  That is more than 245 million yottabytes of data before the probability of a SHA-256 collision approaches anything worth worrying about.  To put that more clearly, a *yottabyte* is about a septillion bytes, or a trillion terabytes.  Think about that -- nearly a quarter of a billion trillion terabyte drives would be needed to hold this much data.  We're talking about truly vast amounts of data before we have to worry about a collision.  I now believe it is unlikely that any SHA-256 collision will be found in my lifetime, unless there is a deeper flaw found in the SHA algorithm itself.  I'm much more worried personally about natural extinction-level events ending all life on earth than I am about a SHA-256 collision corrupting my data.  (Not that I'm particularly worried about such events.)
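
For anyone who wants to re-derive that 4.8 x 10^29 figure, here is a small
stand-alone C program using the usual birthday approximation
p ~= n^2 / 2^(b+1) for an ideal b-bit hash.  The 512-byte block size used
to convert blocks into bytes is an assumption chosen to line up with the
yottabyte figure above, not anything taken from ZFS.

#include <stdio.h>
#include <math.h>

int
main(void)
{
    double bits = 256.0;   /* SHA-256 output size */
    double p = 1e-18;      /* collision probability we can live with */

    /* invert p ~= n^2 / 2^(bits+1):  n ~= sqrt(p) * 2^((bits+1)/2) */
    double nblocks = sqrt(p) * exp2((bits + 1.0) / 2.0);
    double bytes = nblocks * 512.0;        /* assumed 512-byte blocks */
    double yottabytes = bytes / 1e24;

    printf("blocks before P(collision) reaches 1e-18: %.1e\n", nblocks);
    printf("at 512 bytes per block: %.1e bytes (%.0f million yottabytes)\n",
        bytes, yottabytes / 1e6);
    return (0);
}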

(Notably, until I went through the math on this, I always had a feeling that not using verify was a recipe for eventual disaster.  And it *could*, in theory, happen.  If it ever does, it will be an earth-shattering event, and we'll probably want to special-case that particular collision in code.  But now I feel that bothering to verify is stupid.  You're far, far more likely to see in-core corruption of the data during the verification step than to see a hash collision in the first place.)

All of which is a very, very roundabout way of saying that I think I finally agree wholeheartedly with Matt here.  There is *zero* benefit (to a very, very close approximation of zero) in bothering with verification, even for dedup, when using a 256-bit uniformly distributed hash.

- Garrett

Jim Klimov | 10 Nov 13:55 2012

Re: 3236 zio nop-write

On 2012-11-10 04:55, Garrett D'Amore wrote:
> unlikely that any SHA-256 collision will be found in my life time,
> unless there is a deeper flaw found in the SHA algorithm itself.  I'm

Then again, I found, and George Wilson (I believe) fixed, a bug
in ZFS itself which caused problems with unverified writes on
a deduped pool. So even if hash collisions are mathematically
unlikely, there are situations which look just like one, and
dedup verification is not "too paranoid", even if only to catch
ZFS bugs.

To recap the "anecdotal reference": my userdata got corrupted
somehow, so that some blocks from several files could not be
recovered by raidz2 over 6 disks. (I still have little solid
idea of what went wrong, and I think I have some of those
corrupted files left dormant on my old box, so I could try to
examine their sectors manually some day.)

While the block-pointer tree walk (like scrub or normal file
I/O) did detect an error - a block did not match its checksum -
the entry in the DDT for this checksum still existed and was
not invalidated by scrub's findings.

Further unverified writes into the pool just incremented the DDT
counter and new files (restored from backup) remained corrupted.

Writes with verification did find that the on-disk blocks
differ (they still did not care that one of them actually
mismatched its checksum), decremented the DDT counter for the
"bad" block, and allocated a new unique block for the valid
data. I believe that for such collided entries, the recovery
blocks with valid data are allocated as normal unique blocks
and are never deduped.

Ultimately, there was a panic (the cause of which was what George
fixed in https://www.illumos.org/issues/2649 ), because ZFS did
not do proper housekeeping for such unique blocks: they were
still regarded as deduped, but not entered into the DDT. So
further writes into the "dedup=verify" file, which tried to
release the unique block from the DDT (where it was not
present), panicked.

My other related reported bugs on this subject were
https://www.illumos.org/issues/1981
https://www.illumos.org/issues/2024

My fix was, in the end, the essence of my 2024: do not panic
when metadata corruption of this type is already present, and
just release the "non-DDT" deduped blocks when asked, at the
risk of "leaks" in the block tree. It ranks somewhere between a
cosmetic fix and a usability one (the storage system no longer
panics in this case). I don't think it was integrated, or,
after the discussions, that it should be, but the patch is on
the issue tracker for those in need.

The corrupted DDT entries were never addressed, to my knowledge,
so it is still possible to have DDT point to unreadable garbage.
Most of these "corrupted" blocks on my system were flushed when
the DDT counter went to 0, after all copies of the file (block)
were "recovered" from backup or other sources, or deleted.

Garrett D'Amore | 10 Nov 17:55 2012

Re: 3236 zio nop-write

It sounds to me, after that long story, that ZFS scrub (or some tool like it) should validate that the
DDT checksum actually matches the data.  This would be an extra level of paranoia.  If it doesn't match,
the correct response would be (IMO) to fix the checksum, and then you wouldn't need to verify each and
every time you write.
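
Concretely, such a pass could look something like the following toy model.
Everything in it (ddt_entry_t, toy_cksum() and so on) is invented for
illustration and is not the illumos DDT code; a real pass would of course
read the referenced blocks back from disk during the scrub.

#include <stdio.h>
#include <stdint.h>

typedef struct ddt_entry {
    uint64_t   dde_cksum;   /* checksum recorded in the DDT */
    const char *dde_data;   /* data the entry points at */
} ddt_entry_t;

static uint64_t
toy_cksum(const char *data)
{
    uint64_t h = 0;

    /* trivial stand-in for sha256; any deterministic hash will do here */
    for (; *data != '\0'; data++)
        h = h * 31 + (uint8_t)*data;
    return (h);
}

static void
ddt_scrub_repair(ddt_entry_t *table, int nentries)
{
    for (int i = 0; i < nentries; i++) {
        uint64_t actual = toy_cksum(table[i].dde_data);

        /* entry's recorded checksum no longer matches the data */
        if (actual != table[i].dde_cksum) {
            printf("entry %d: stale checksum, repairing\n", i);
            table[i].dde_cksum = actual;
        }
    }
}

int
main(void)
{
    ddt_entry_t ddt[] = {
        { toy_cksum("intact block"), "intact block" },
        { 0xdeadbeef, "block whose DDT entry went stale" },
    };

    ddt_scrub_repair(ddt, 2);
    return (0);
}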

Of course, this is only an issue with some bizarre bug or data corruption, not due to a problem with SHA
collisions.  (There have been many dedup bugs over the years, and dedup is one of the more commonly
mis-deployed technologies.  I usually am hesitant to recommend it unless there is compelling evidence
that the data set it will be applied to will achieve substantial benefit from it and the hardware is
suitably beefy -- in terms of RAM or by using a pure SSD pool -- to avoid the performance penalties too often
encountered with dedup.)

	- Garrett


Christopher Siden | 13 Nov 19:02 2012

Re: 3236 zio nop-write

@Garrett, should I count you as a reviewer on this?

Garrett D'Amore | 13 Nov 20:15 2012

Re: 3236 zio nop-write


On Nov 13, 2012, at 10:02 AM, Christopher Siden <christopher.siden <at> delphix.com> wrote:

> @Garrett, should I count you as a reviewer on this?

Probably not.  I didn't actually do any real review of the code except a cursory glance.  If you *need* additional reviewers, I will allocate some time to do so, though.

- Garrett


