Paul Kraus | 14 Sep 18:50 2011

zfs destroy snapshot runs out of memory bug

    I know there was (is ?) a bug where a zfs destroy of a large
snapshot would run a system out of kernel memory, but searching the
list archives and on defects.opensolaris.org I cannot find it. Could
someone here explain the failure mechanism in language a Sys Admin (I
am NOT a developer) could understand. I am running Solaris 10 with
zpool 22 and I am looking for both understanding of the underlying
problem and a way to estimate the amount of kernel memory necessary to
destroy a given snapshot (based on information gathered from zfs, zdb,
and any other necessary commands).

Thanks in advance, and sorry to bring this up again. I am almost
certain I saw mention here that this bug is fixed in Solaris 11
Express and Nexenta (Oracle Support is telling me the bug is fixed in
zpool 26 which is included with Solaris 10U10, but because of our use
of ACLs I don't think I can go there, and upgrading the zpool won't
help with legacy snapshots).

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Designer: Frankenstein, A New Musical
(http://www.facebook.com/event.php?eid=123170297765140)
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
Richard Elling | 14 Sep 20:30 2011

Re: zfs destroy snapshot runs out of memory bug

On Sep 14, 2011, at 9:50 AM, Paul Kraus wrote:

> I know there was (is ?) a bug where a zfs destroy of a large
> snapshot would run a system out of kernel memory, but searching the
> list archives and on defects.opensolaris.org I cannot find it. Could
> someone here explain the failure mechanism in language a Sys Admin (I
> am NOT a developer) could understand. I am running Solaris 10 with
> zpool 22 and I am looking for both understanding of the underlying
> problem and a way to estimate the amount of kernel memory necessary to
> destroy a given snapshot (based on information gathered from zfs, zdb,
> and any other necessary commands).

I don't recall a bug with that description. However, there are several bugs that
relate to how the internals work that were fixed last summer and led to the
on-disk format change to version 26 (Improved snapshot deletion performance).
Look for details in http://src.illumos.org/source/history/illumos-gate/usr/src/uts/common/fs/zfs/
during the May-July 2010 timeframe. Methinks the most important change was
6948890 snapshot deletion can induce pathologically long spa_sync() times
spa_sync() is called when the transaction group is sync'ed to permanent storage.
-- richard


> Thanks in advance, and sorry to bring this up again. I am almost
> certain I saw mention here that this bug is fixed in Solaris 11
> Express and Nexenta (Oracle Support is telling me the bug is fixed in
> zpool 26 which is included with Solaris 10U10, but because of our use
> of ACLs I don't think I can go there, and upgrading the zpool won't
> help with legacy snapshots).

Sorry, I haven't run Solaris 10 in the past 6 years :-) can't help you there.
But I can say that NexentaStor has this bug fix in 3.0.5. For NexentaStor 3.1+
releases, zpool version is 28.
 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss <at> opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Paul Kraus | 14 Sep 21:07 2011

Re: zfs destroy snapshot runs out of memory bug

On Wed, Sep 14, 2011 at 2:30 PM, Richard Elling
<richard.elling <at> gmail.com> wrote:

> I don't recall a bug with that description. However, there are several bugs that
> relate to how the internals work that were fixed last summer and led to the
> on-disk format change to version 26 (Improved snapshot deletion performance).
> Look for details in http://src.illumos.org/source/history/illumos-gate/usr/src/uts/common/fs/zfs/
> during the May-July 2010 timeframe. Methinks the most important change was
> 6948890 snapshot deletion can induce pathologically long spa_sync() times
> spa_sync() is called when the transaction group is sync'ed to permanent storage.

I looked through that list, and found the following that looked applicable:
6948911 snapshot deletion can induce unsatisfiable allocations in txg sync
6948890 snapshot deletion can induce pathologically long spa_sync() times

But all I get at bugs.opensolaris.org is a Service Temporarily
Unavailable message (and have for at least the past few weeks). The
MOS lookup of the 6948890 bug yields the title and not much else, no
details. I can't even find the 6948911 bug in MOS.

MOS == My Oracle Support

Thanks for the pointers, I just wish I could find more data that will
lead me to either:
A) a mechanism to estimate the RAM needed to destroy a pre-26 snapshot
-or-
B) indication that there is no way to do A.

From watching the system try to import this pool, it looks like it is
still building a kernel structure in RAM when the system runs out of
RAM. It has not committed anything to disk.

Richard Elling | 14 Sep 22:31 2011

Re: zfs destroy snapshot runs out of memory bug

Question below…

On Sep 14, 2011, at 12:07 PM, Paul Kraus wrote:

> On Wed, Sep 14, 2011 at 2:30 PM, Richard Elling
> <richard.elling <at> gmail.com> wrote:
> 
>> I don't recall a bug with that description. However, there are several bugs that
>> relate to how the internals work that were fixed last summer and led to the
>> on-disk format change to version 26 (Improved snapshot deletion performance).
>> Look for details in http://src.illumos.org/source/history/illumos-gate/usr/src/uts/common/fs/zfs/
>> during the May-July 2010 timeframe. Methinks the most important change was
>> 6948890 snapshot deletion can induce pathologically long spa_sync() times
>> spa_sync() is called when the transaction group is sync'ed to permanent storage.
> 
> I looked through that list, and found the following that looked applicable:
> 6948911 snapshot deletion can induce unsatisfiable allocations in txg sync
> 6948890 snapshot deletion can induce pathologically long spa_sync() times
> 
> But all I get at bugs.opensolaris.org is a Service Temporarily
> Unavailable message (and have for at least the past few weeks). The
> MOS lookup of the 6948890 bug yields the title and not much else, no
> details. I can't even find the 6948911 bug in MOS.
> 
> MOS == My Oracle Support
> 
> Thanks for the pointers, I just wish I could find more data that will
> lead me to either:
> A) a mechanism to estimate the RAM needed to destroy a pre-26 snapshot
> -or-
> B) indication that there is no way to do A.
> 
> From watching the system try to import this pool, it looks like it is
> still building a kernel structure in RAM when the system runs out of
> RAM. It has not committed anything to disk.

Did you experience a severe memory shortfall? 
(Do you know how to determine that condition?)
 -- richard
Paul Kraus | 14 Sep 23:36 2011

Re: zfs destroy snapshot runs out of memory bug

On Wed, Sep 14, 2011 at 4:31 PM, Richard Elling
<richard.elling <at> gmail.com> wrote:

>> From watching the system try to import this pool, it looks like it is
>> still building a kernel structure in RAM when the system runs out of
>> RAM. It has not committed anything to disk.
>
> Did you experience a severe memory shortfall?
> (Do you know how to determine that condition?)

T2000 with 32 GB RAM

zpool that hangs the machine by running it out of kernel memory when
trying to import the zpool

zpool has an "incomplete" snapshot from a zfs recv that it is trying
to destroy on import

I *can* import the zpool readonly

    So the answer is yes to the severe memory shortfall. One of the
many things I did to instrument this system was as simple as running
vmstat 10 on the console :-) The last instance before the system hung
showed a scan rate of 900,000 ! In one case I watched as it hung (it
has done this many times as I have troubleshot with Oracle Support)
and did not see *any* user level processes that would account for the
memory shortfall. I have logs of system freemem showing the memory
exhaustion. Oracle Support has confirmed (from a core dump) that it is
some combination of the two bugs you mentioned (plus they created a
new Bug ID for this specific problem).
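
For anyone who would rather catch this condition from a saved log than by watching the console, here is a minimal Python sketch that pulls the free-memory and scan-rate columns out of vmstat output. It assumes the usual Solaris vmstat layout, in which the column header line contains "free" and "sr". The sample rows and the 100,000 alarm threshold are invented for illustration; only the 900,000 scan rate comes from the message above.

```python
# Sketch: extract (free KB, scan rate) from Solaris-style vmstat output.
# SAMPLE and SR_ALARM are illustrative assumptions, not measured data.

SR_ALARM = 100000  # hypothetical scan-rate threshold


def parse_vmstat(text):
    """Yield (free_kb, scan_rate) for each data row."""
    free_idx = sr_idx = None
    for line in text.splitlines():
        fields = line.split()
        # The column header names the fields; locate "free" and "sr" by name.
        if "free" in fields and "sr" in fields:
            free_idx, sr_idx = fields.index("free"), fields.index("sr")
            continue
        # Skip anything before the header and any line that is too short.
        if free_idx is None or len(fields) <= max(free_idx, sr_idx):
            continue
        try:
            yield int(fields[free_idx]), int(fields[sr_idx])
        except ValueError:
            continue  # non-numeric line, e.g. a repeated header


def alarms(text, sr_alarm=SR_ALARM):
    """Return the samples whose scan rate is at or above the threshold."""
    return [(free, sr) for free, sr in parse_vmstat(text) if sr >= sr_alarm]


SAMPLE = """\
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 9000000 2000000 1 41 19 1 3 0 0 0 4 0 0 480 1120 1300 4 14 82
 0 0 0 8000000 300000 1 41 19 1 3 0 1500 0 4 0 0 480 1120 1300 4 14 82
 3 0 0 7000000 30000 1 41 19 1 3 0 900000 0 4 0 0 480 1120 1300 4 14 82
"""

if __name__ == "__main__":
    for free_kb, sr in alarms(SAMPLE):
        print("scan rate %d with %d KB free" % (sr, free_kb))
```

Feeding it the captured output of "vmstat 10" (for example from a console log) gives a quick list of the samples where the page scanner was running flat out.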

    I have asked multiple times if the incomplete snapshot could be
corrupt in a way that would cause this (early on they led us to
believe the incomplete snapshot was 7 TB when it should be about 2.5
TB), but have not gotten anything substantive back (just a one line,
"The snapshot is not corrupt.").

    What I am looking for is a way to estimate the kernel memory
necessary to destroy a given snapshot so that I can see if any of the
snapshots on my production server (M4000 with 16 GB) will run the
machine out of memory.

Daniel Carosone | 15 Sep 07:20 2011

Re: zfs destroy snapshot runs out of memory bug

On Wed, Sep 14, 2011 at 05:36:53PM -0400, Paul Kraus wrote:
> T2000 with 32 GB RAM
> 
> zpool that hangs the machine by running it out of kernel memory when
> trying to import the zpool
> 
> zpool has an "incomplete" snapshot from a zfs recv that it is trying
> to destroy on import
> 
> I *can* import the zpool readonly

Can you import it booting from a newer kernel (say liveDVD), and allow
that to complete the deletion? Or does this not help until the pool is
upgraded past the on-disk format in question, for which it must first
be imported writable?  

If you can import it read-only, would it be faster to just send it
somewhere else?  Is there a new-enough snapshot near the current data?

--
Dan.

Paul Kraus | 15 Sep 14:17 2011

Re: zfs destroy snapshot runs out of memory bug

On Thu, Sep 15, 2011 at 1:20 AM, Daniel Carosone <dan <at> geek.com.au> wrote:

> Can you import it booting from a newer kernel (say liveDVD), and allow
> that to complete the deletion?

I have not tried anything newer than the latest patched 5.10.

> Or does this not help until the pool is
> upgraded past the on-disk format in question, for which it must first
> be imported writable?

Support is telling me that no matter what, due to the on disk format,
it will take more RAM to destroy the incomplete snapshot... and I
can't do that with the pool imported read-only, and when I try to
import it read-write the import operation tries to destroy the
incomplete snapshot and runs the machine out of memory.

> If you can import it read-only, would it be faster to just send it
> somewhere else?  Is there a new-enough snapshot near the current data?

Support has given us that as Option B, which would be viable for the
backup server, if we had a spare 20+ TB of storage just sitting
around. Copying off is NOT an option for production due to the outage
window _and_ the lack of a spare 20+ TB of storage :-(

Jim Klimov | 30 Oct 22:13 2011

Re: zfs destroy snapshot runs out of memory bug

>      I know there was (is ?) a bug where a zfs destroy of a large
> snapshot would run a system out of kernel memory, but searching the
> list archives and on defects.opensolaris.org I cannot find it. Could
> someone here explain the failure mechanism in language a Sys Admin (I
> am NOT a developer) could understand. I am running Solaris 10 with
> zpool 22 and I am looking for both understanding of the underlying
> problem and a way to estimate the amount of kernel memory necessary to
> destroy a given snapshot (based on information gathered from zfs, zdb,
> and any other necessary commands).
>
> Thanks in advance, and sorry to bring this up again. I am almost
> certain I saw mention here that this bug is fixed in Solaris 11
> Express and Nexenta (Oracle Support is telling me the bug is fixed in
> zpool 26 which is included with Solaris 10U10, but because of our use
> of ACLs I don't think I can go there, and upgrading the zpool won't
> help with legacy snapshots).

Sorry, I am late.

Still, as I recently posted, I have had a similar bug with oi_148a
installed this spring, and it seems that box is still having it.
I am trying to upgrade to oi_151a, but it has hung so far and I'm
waiting for someone to get to my home and reset it.

Symptoms are like what you've described, including the huge scanrate
just before the system dies (becomes unresponsive). Also if you try
running with "vmstat 1" you can see that in the last few seconds of
uptime the system would go from several hundred MB free (or even
over a GB of free RAM) down to under 32 MB very quickly - consuming
hundreds of MB per second.

Unlike your system, my pool started with ZFSv28 (oi_148a), so any
bugfixes and on-disk layout fixes relevant for ZFSv26 patches are
in place already.

According to my research (lost along with the Jive Forums, so I'll
repeat it here) it seems that (MY SPECULATION FOLLOWS):
1) some kernel module (probably related to ZFS) takes hold of more
and more RAM;
2) since it is kernel memory, it can not be swapped out;
3) since all RAM is depleted but there are requests for RAM allocation,
the kernel scans all allocated memory to find candidates for swapping
out (hence the high scanrate).
4) Since all RAM is now consumed by a BADLY DESIGNED kernel module
which can not be swapped out, the system dies in a high-scanrate
agony, because there is no RAM available to do anything. It can be
"pinged" for a while, but not much more.
I stress that the module is BADLY DESIGNED as it is in my current
running version of the OS (I don't know yet if it was fixed in
oi_151a), because it is probably trying to build the full ZFS
tree in its addressable memory - regardless of whether it can fit
there. IMHO the module should try to process the pool in smaller
chunks, or allow swapping out, if hardware constraints like
insufficient RAM force it to.

While debugging my system, I removed the /etc/zfs/zpool.cache file
and imported the pool without using a cachefile, so I could at least
boot the system and do some postmortems. Further on I made an SMF
service importing the pool following a configured timeout, so that
I could automate the import-reboot cycles as well as intervene to
abort a delayed pool import attempt and run some ZDB diags instead.

I found that walking the pool with "zdb" has a similar pattern of
RAM consumption (no surprise - the algorithms must have something
in common with live ZFS code), however, as a userspace process it
could be swapped out to disk. In my case ZDB consumed up to 20-30 GB
of swap and ran for about 20 hours to analyze my pool - successfully.
A "zpool import" attempt halted the 8 GB system in 1 to 3 hours.

However, with ZDB analysis I managed to find some counter of free
blocks - those which belonged to a killed dataset. Seems that at
first they are quickly marked for deletion (i.e. are not referenced
by any dataset, but are still in the ZFS block tree), and then
during pool's current uptime or further import attempts, these
blocks are actually walked and excluded from the ZFS tree.
In my case I saw that between reboots and import attempts this
counter went down by some 3 million blocks every uptime, and
after a couple of stressful weeks the destroyed dataset was gone
and the pool just worked on and on.

So if you still have this problem, try running ZDB to see if
deferred-free count is decreasing between pool import attempts:

# time zdb -bsvL -e <POOL-GUID-NUMBER>
...
976K 114G 113G 172G 180K 1.01 1.56 deferred free
...
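
To compare two such runs without eyeballing the output, a small Python sketch like the following could extract the "deferred free" row and report the difference. It assumes the row has the column order Blocks, LSIZE, PSIZE, ASIZE, avg, comp, %Total, Type, and that the K/M/G/T suffixes are powers of 1024; both are assumptions about zdb's output format. The AFTER sample is hypothetical; only the BEFORE row is taken from the message above.

```python
# Sketch: track the zdb "deferred free" row between two import attempts.
# Column order and 1024-based suffixes are assumptions, not verified facts.

SUFFIXES = {"K": 2 ** 10, "M": 2 ** 20, "G": 2 ** 30, "T": 2 ** 40}


def parse_count(token):
    """Convert abbreviated numbers such as '976K' or '1.01' to a float."""
    suffix = token[-1].upper()
    if suffix in SUFFIXES:
        return float(token[:-1]) * SUFFIXES[suffix]
    return float(token)


def deferred_free(zdb_output):
    """Return (block_count, asize_bytes) from the 'deferred free' row, or None."""
    for line in zdb_output.splitlines():
        fields = line.split()
        if len(fields) >= 9 and fields[-2:] == ["deferred", "free"]:
            return parse_count(fields[0]), parse_count(fields[3])
    return None


def blocks_freed(before, after):
    """Number of deferred-free blocks released between two zdb runs."""
    return deferred_free(before)[0] - deferred_free(after)[0]


BEFORE = """\
...
976K 114G 113G 172G 180K 1.01 1.56 deferred free
...
"""

# Hypothetical later run, for illustration only.
AFTER = """\
...
900K 105G 104G 158G 180K 1.01 1.44 deferred free
...
"""

if __name__ == "__main__":
    print("blocks freed between runs: %d" % blocks_freed(BEFORE, AFTER))
```

A positive result between two import attempts would suggest the pool is making progress; zero would match the no-progress behaviour reported for older pool versions.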

In order to facilitate the process of rebooting, I made a simple
watchdog which forcibly soft-resets the OS (with a uadmin call)
if fast memory exhaustion is detected. This is based on vmstat
code, and includes an SMF service to run it. Since RAM usage is
only updated once per second in kernel probes, the watchdog
program might not catch the problem soon enough to react.

http://thumper.cos.ru/~jim/freeram-watchdog-20110610-v0.11.tgz

Note that it WILL crash your system in case of RAM depletion,
without syncs or service shutdowns. Since the RAM depletion
happens quickly, it might not even have enough time to reset
your OS. In your case with T2000 you might be better off with
a hardware watchdog instead (if it doesn't "ping" the driver
for too long, the BMC would reset the box).
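
The decision such a watchdog has to make can be sketched in a few lines of Python. This is not the code from the tarball above, only an illustration of the idea under stated assumptions: free memory is sampled once per interval, and the floor (64 MB) and drop-rate (100 MB/s) thresholds are hypothetical values picked to match the "hundreds of MB per second" pattern described earlier. The function only returns a decision; it does not reset anything.

```python
# Sketch: decide when free memory is collapsing fast enough to justify a reset.
# Thresholds are hypothetical; the real watchdog may use different logic.


def should_reset(free_mb_samples, floor_mb=64, max_drop_mb_per_s=100, interval_s=1):
    """Return True when the latest samples suggest runaway memory exhaustion.

    free_mb_samples: free RAM in MB, oldest first, one sample per interval.
    """
    if not free_mb_samples:
        return False
    latest = free_mb_samples[-1]
    # Already below the floor: act now.
    if latest < floor_mb:
        return True
    if len(free_mb_samples) >= 2:
        drop_per_s = (free_mb_samples[-2] - latest) / interval_s
        # Where free memory would be one interval from now at this rate.
        projected = latest - drop_per_s * interval_s
        return drop_per_s >= max_drop_mb_per_s and projected < floor_mb
    return False


if __name__ == "__main__":
    print(should_reset([1200, 1190, 1185]))  # slow drift
    print(should_reset([1100, 700, 300]))    # falling 400 MB/s
```

Projecting one interval ahead is a deliberate choice here: with once-per-second samples and a drop of hundreds of MB per second, waiting until the floor is actually crossed may leave no time to react.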

//Jim
Jim Klimov | 30 Oct 22:37 2011

Re: zfs destroy snapshot runs out of memory bug

2011-10-31 1:13, Jim Klimov wrote:
> Sorry, I am late.
...

If my memory and GoogleCache don't fail me too much, I ended
up with the following incantations for pool-import attempts:

:; echo zfs_vdev_max_pending/W0t5 | mdb -kw
:; echo "aok/W 1" | mdb -kw
:; echo "zfs_recover/W 1" | mdb -kw
:; echo zfs_resilver_delay/W0t0 | mdb -kw
:; echo zfs_resilver_min_time_ms/W0t20000 | mdb -kw
:; echo zfs_txg_synctime/W0t1 | mdb -kw
### These intend to boost zfs self-repair priorities and
### allow self-repair somehow. Voodoo magic ;)

:; /root/freeram-watchdog.i386 &
:; time zpool import -o altroot=/pool -o cachefile=none 1601233584937321596
### This starts the watchdog (to have some on-screen logs)
### and imports the pool by GUID without cache file usage.

:; df -k
:; zfs list
:; zpool list
:; zpool status
### Just in case the import succeeds, these commands
### are cached by the terminal ;)

//Jim

Paul Kraus | 31 Oct 13:28 2011

Re: zfs destroy snapshot runs out of memory bug

On Sun, Oct 30, 2011 at 5:13 PM, Jim Klimov <jimklimov <at> cos.ru> wrote:
>>     I know there was (is ?) a bug where a zfs destroy of a large
>> snapshot would run a system out of kernel memory, but searching the

> Symptoms are like what you've described, including the huge scanrate
> just before the system dies (becomes unresponsive). Also if you try running
> with "vmstat 1" you can see that in the last few seconds of
> uptime the system would go from several hundred free MBs (or even
> over a GB free RAM) down to under 32Mb very quickly - consuming
> hundreds of MBs per second.

    Those are the traditional symptoms of a Solaris kernel memory bug :-)

> Unlike your system, my pool started with ZFSv28 (oi_148a), so any
> bugfixes and on-disk layout fixes relevant for ZFSv26 patches are
> in place already.

    Ahhh, but jumping to the end...

> In my case I saw that between reboots and import attempts this
> counter went down by some 3 million blocks every uptime, and
> after a couple of stressful weeks the destroyed dataset was gone
> and the pool just worked on and on.

    So your pool does have the fix. With zpool 22 NO PROGRESS is made
at all with each boot-import-hang cycle. I have an mdb command that I
got from Oracle support to determine the size of the snapshot that is
being destroyed. The bug in 22 is that a snapshot destroy is committed
as a single TXG. In 26 this is fixed (I assume there are on-disk
checkpoints to permit a snapshot to be destroyed in multiple TXGs).
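
The difference between the two behaviours can be illustrated with a toy model in Python. This is not ZFS code and says nothing about the real data structures; it only assumes that each block to be freed costs one in-memory record until its transaction group commits, and that a group that does not fit in memory commits nothing. The block counts and the budget are arbitrary illustration values.

```python
# Toy model: destroy committed as one TXG versus spread over several TXGs.
# "budget" is the number of block records that fit in memory (an assumption).


def destroy(total_blocks, budget, per_txg=None):
    """Return (blocks_freed, txgs_committed).

    per_txg=None models the single-TXG behaviour: everything in one batch.
    Otherwise at most per_txg blocks are freed per transaction group.
    """
    freed = txgs = 0
    while freed < total_blocks:
        remaining = total_blocks - freed
        batch = remaining if per_txg is None else min(per_txg, remaining)
        if batch > budget:
            break  # out of memory before the commit; this batch is lost
        freed += batch  # committed, so this progress survives a reboot
        txgs += 1
    return freed, txgs


if __name__ == "__main__":
    total = 10 * 10 ** 6
    print(destroy(total, budget=4 * 10 ** 6))                       # one TXG
    print(destroy(total, budget=4 * 10 ** 6, per_txg=3 * 10 ** 6))  # chunked
```

In the single-batch case nothing is ever committed when the batch exceeds the budget, which matches the no-progress cycle described above; in the chunked case each committed batch is permanent progress.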

    How big is / was the snapshot and dataset? I am dealing with a 7
TB dataset and a 2.5 TB snapshot on a system with 32 GB RAM. (Oracle
has provided a loaner system with 128 GB RAM and it took 75 GB of RAM
to destroy the problem snapshot.) I had not yet posted a summary as we
are still working through the overall problem (we tripped over this on
the replica, now we are working on it on the production copy).
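
As a very rough back-of-the-envelope check, the one measurement above (75 GB of RAM for a 2.5 TB snapshot) works out to about 30 GB of RAM per TB of snapshot. The Python sketch below simply scales that single data point linearly. Linear scaling is an assumption, not something confirmed in this thread; the real cost probably depends on the number of blocks rather than on size alone, so treat the output as an order-of-magnitude guess at best.

```python
# Naive linear extrapolation from ONE observation; not a validated model.

OBSERVED_RAM_GB = 75.0       # RAM used to destroy the problem snapshot
OBSERVED_SNAPSHOT_TB = 2.5   # size of that snapshot


def naive_ram_estimate_gb(snapshot_tb):
    """RAM (GB) a snapshot of this size might need, scaling linearly."""
    return snapshot_tb * OBSERVED_RAM_GB / OBSERVED_SNAPSHOT_TB


def naive_max_snapshot_tb(ram_gb):
    """Largest snapshot (TB) that might fit in ram_gb, scaling linearly."""
    return ram_gb * OBSERVED_SNAPSHOT_TB / OBSERVED_RAM_GB


if __name__ == "__main__":
    print("RAM per TB: %.1f GB" % naive_ram_estimate_gb(1.0))
    print("Fits in 16 GB: %.2f TB" % naive_max_snapshot_tb(16.0))
```

Note that not all installed RAM is available for this purpose, so the second figure is optimistic even if the linear assumption happened to hold.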

Jim Klimov | 31 Oct 14:07 2011

Re: zfs destroy snapshot runs out of memory bug

2011-10-31 16:28, Paul Kraus wrote:
>      How big is / was the snapshot and dataset ? I am dealing with a 7
> TB dataset and a 2.5 TB snapshot on a system with 32 GB RAM.

I had a smaller-scale problem, with datasets and snapshots sized
several hundred GB, but on a system with 8 GB of RAM. So
proportionally it seems similar ;)

I have deduped data on the system, which adds to the strain of
dataset removal. The plan was to save some archive data there,
with few to no removals planned. But during testing of different
dataset layout hierarchies, things got out of hand ;)

I've also had an approx. 4 TB dataset to destroy (a volume where
I kept another pool), but armed with the knowledge of how things
are expected to fail, I did its cleanup in small steps, with very
few (perhaps no?) hangs while evacuating the data to the toplevel
pool (which contained this volume).

> Oracle has provided a loaner system with 128 GB RAM and it took 75 GB of RAM
> to destroy the problem snapshot). I had not yet posted a summary as we
> are still working through the overall problem (we tripped over this on
> the replica, now we are working on it on the production copy).

Good for you ;)
Does Oracle loan such systems free to support their own foul-ups?
Or do you have to pay a lease anyway? ;)
Paul Kraus | 31 Oct 14:41 2011

Re: zfs destroy snapshot runs out of memory bug

On Mon, Oct 31, 2011 at 9:07 AM, Jim Klimov <jimklimov <at> cos.ru> wrote:
> 2011-10-31 16:28, Paul Kraus wrote:

>> Oracle has provided a loaner system with 128 GB RAM and it took 75 GB of RAM
>> to destroy the problem snapshot). I had not yet posted a summary as we
>> are still working through the overall problem (we tripped over this on
>> the replica, now we are working on it on the production copy).
>
> Good for you ;)
> Does Oracle loan such systems free to support their own foul-ups?
> Or do you have to pay a lease anyway? ;)

    If you are paying for a support contract, _demand_ what is needed
to fix the problem. If you are not paying for support, well, then you
are on your own (as I believe the license says).

    Maybe I've been in this business longer than many of the folks
here, but I both expect software to have bugs and I do NOT expect
commercial software vendors to provide fixes for free.
