Re: zfs destroy snapshot runs out of memory bug
Jim Klimov <jimklimov <at> cos.ru>
2011-10-30 21:13:17 GMT
> I know there was (is ?) a bug where a zfs destroy of a large
> snapshot would run a system out of kernel memory, but searching the
> list archives and on defects.opensolaris.org I cannot find it. Could
> someone here explain the failure mechanism in language a Sys Admin (I
> am NOT a developer) could understand. I am running Solaris 10 with
> zpool 22 and I am looking for both understanding of the underlying
> problem and a way to estimate the amount of kernel memory necessary to
> destroy a given snapshot (based on information gathered from zfs, zdb,
> and any other necessary commands).
> Thanks in advance, and sorry to bring this up again. I am almost
> certain I saw mention here that this bug is fixed in Solaris 11
> Express and Nexenta (Oracle Support is telling me the bug is fixed in
> zpool 26 which is included with Solaris 10U10, but because of our use
> of ACLs I don't think I can go there, and upgrading the zpool won't
> help with legacy snapshots).
Sorry, I am late.
Still, as I recently posted, I have had a similar bug with oi_148a
installed this spring, and it seems that box is still having it.
I am trying to upgrade to oi_151a, but it has hung so far and I'm
waiting for someone to get to my home and reset it.
Symptoms are like what you've described, including the huge scanrate
just before the system dies (becomes unresponsive). Also, if you run
"vmstat 1" you can see that in the last few seconds of uptime the
system goes from several hundred MB of free RAM (or even over a GB)
down to under 32 MB very quickly - consuming hundreds of MB per
second.
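To watch for this pattern without staring at the terminal, the vmstat
output can be reduced to free-memory and scanrate samples. Below is a
minimal sketch in Python. It assumes the usual Solaris vmstat column
order (r b w swap free re mf pi po fr de sr ...) with "free" reported
in KB; the function names are made up for illustration.

```python
def parse_vmstat_line(line):
    """Return (free_kb, scan_rate) from one vmstat data line.

    Returns None for header lines. The column positions are an
    assumption: r b w swap free re mf pi po fr de sr ...
    """
    fields = line.split()
    if len(fields) < 12 or not all(f.isdigit() for f in fields[:12]):
        return None
    return int(fields[4]), int(fields[11])


def free_drop_per_sample(lines):
    """KB of free memory lost between consecutive samples.

    A positive number means free memory is shrinking.
    """
    samples = [s for s in map(parse_vmstat_line, lines) if s is not None]
    return [prev[0] - cur[0] for prev, cur in zip(samples, samples[1:])]
```

Feeding it the lines from "vmstat 1" gives one number per second; a
run of drops in the hundreds of thousands of KB is the depletion
described above.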
Unlike your system, my pool started with ZFSv28 (oi_148a), so any
bugfixes and on-disk layout fixes relevant for ZFSv26 patches are
in place already.
According to my research (which was flushed out along with the Jive
Forums, so I'll repeat it here) it seems that (MY SPECULATION FOLLOWS):
1) some kernel module (probably related to ZFS) takes hold of more
and more RAM;
2) since it is kernel memory, it cannot be swapped out;
3) since all RAM is depleted but there are still requests for RAM
allocation, the kernel scans all allocated memory to find candidates
for swapping out (hence the high scanrate);
4) since all RAM is now consumed by a BADLY DESIGNED kernel module
which cannot be swapped out, the system dies in a high-scanrate
agony, because there is no RAM available to do anything. It can be
"pinged" for a while, but not much more.
I stress that the module is BADLY DESIGNED as it is in my currently
running version of the OS (I don't know yet if it was fixed in
oi_151a), because it is probably trying to build the full ZFS
tree in its addressable memory - regardless of whether it can fit
there. IMHO the module should process the pool in smaller
chunks, or allow swapping out, if hardware constraints like
insufficient RAM force it to.
While debugging my system, I removed the /etc/zfs/zpool.cache file
and imported the pool without using a cachefile, so I could at least
boot the system and do some postmortems. Further on I made an SMF
service importing the pool following a configured timeout, so that
I could automate the import-reboot cycles as well as intervene to
abort a delayed pool import attempt and run some ZDB diags instead.
I found that walking the pool with "zdb" has a similar pattern of
RAM consumption (no surprise - the algorithms must have something
in common with live ZFS code), however, as a userspace process it
could be swapped out to disk. In my case ZDB consumed up to 20-30 GB
of swap and ran for about 20 hours to analyze my pool - successfully.
A "zpool import" attempt halted the 8 GB system in 1 to 3 hours.
However, with ZDB analysis I managed to find a counter of "deferred
free" blocks - those which belonged to a killed dataset. It seems that
at first they are quickly marked for deletion (i.e. are not referenced
by any dataset, but are still in the ZFS block tree), and then
during the pool's current uptime or further import attempts, these
blocks are actually walked and excluded from the ZFS tree.
In my case I saw that between reboots and import attempts this
counter went down by some 3 million blocks every uptime, and
after a couple of stressful weeks the destroyed dataset was gone
and the pool just worked on and on.
So if you still have this problem, try running ZDB to see whether the
deferred-free count is decreasing between pool import attempts:
# time zdb -bsvL -e <POOL-GUID-NUMBER>
976K 114G 113G 172G 180K 1.01 1.56 deferred free
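Comparing that row between two runs can be scripted. Here is a rough
sketch in Python that pulls the block count from the "deferred free"
row and estimates how many more import cycles are needed at the
observed rate. It assumes the first column of that row is the block
count and that the K/M/G suffixes are powers of 1024; the function
names are hypothetical.

```python
import math

SUFFIXES = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def to_number(text):
    """Convert an abbreviated figure such as '976K' or '1.01' to a float."""
    if text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def deferred_free_blocks(zdb_output):
    """Return the block count from the 'deferred free' row, or None."""
    for line in zdb_output.splitlines():
        fields = line.split()
        if fields[-2:] == ["deferred", "free"]:
            return to_number(fields[0])
    return None


def imports_remaining(blocks_now, blocks_before):
    """Estimate the import cycles still needed at the observed rate."""
    freed = blocks_before - blocks_now
    if freed <= 0:
        return None  # no progress seen, so no estimate is possible
    return math.ceil(blocks_now / freed)
```

For example, if one run reported 9 million deferred-free blocks and
the next reported 6 million, imports_remaining(6_000_000, 9_000_000)
suggests about two more cycles. The abbreviated figures are rounded,
so treat the result as a rough guide only.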
To make the reboot cycle easier, I wrote a simple
watchdog which forcibly soft-resets the OS (with a uadmin call)
if fast memory exhaustion is detected. It is based on the vmstat
code, and includes an SMF service to run it. Since RAM usage is
only updated once per second in the kernel probes, the watchdog
program might not catch the problem soon enough to react.
Note that the watchdog WILL crash your system in case of RAM
depletion, without syncs or service shutdowns. Since the RAM depletion
happens quickly, it might not even have enough time to reset
your OS. In your case, with the T2000, you might be better off with
a hardware watchdog instead (if the OS doesn't "ping" the driver
for too long, the BMC will reset the box).
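The trigger logic of such a software watchdog can be sketched roughly
as follows. This is only an illustration in Python, not the actual
vmstat-based program; the thresholds and the function name are made-up
assumptions, and the reset itself (the uadmin call) is left out.

```python
def should_reset(free_kb_samples, floor_kb=65536, max_drop_kb=102400):
    """Decide whether to force a reset from per-second free-RAM samples (KB).

    Triggers when the latest sample is already under floor_kb, or when
    free memory fell by more than max_drop_kb in the last interval and
    one more interval at that rate would go through the floor.
    Both thresholds are hypothetical and need tuning per machine.
    """
    if not free_kb_samples:
        return False
    latest = free_kb_samples[-1]
    if latest < floor_kb:
        return True
    if len(free_kb_samples) >= 2:
        drop = free_kb_samples[-2] - latest
        if drop > max_drop_kb and latest - drop < floor_kb:
            return True
    return False
```

Projecting one interval ahead is a way to compensate for the
once-per-second update: waiting until free memory is actually under
the floor may leave no time to react.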