Wayne | 5 Jan 17:10 2011
Picon

JVM OOM

I am still struggling with the JVM. We just had a hard OOM crash of a region
server after only running for 36 hours. Any help would be greatly
appreciated. Do we need to restart nodes every 24 hours under load?  GC
Pauses are something we are trying to plan for, but full out OOM crashes are
a new problem.

The message below seems to be where it starts going bad. It is followed by
no less than 63 Concurrent Mode Failure errors over a 16 minute period.

*GC locker: Trying a full collection because scavenge failed*

Lastly here is the end (after the 63 CMF errors).

Heap
 par new generation   total 1887488K, used 303212K [0x00000005fae00000,
0x000000067ae00000, 0x000000067ae00000)
  eden space 1677824K,  18% used [0x00000005fae00000, 0x000000060d61b078,
0x0000000661480000)
  from space 209664K,   0% used [0x000000066e140000, 0x000000066e140000,
0x000000067ae00000)
  to   space 209664K,   0% used [0x0000000661480000, 0x0000000661480000,
0x000000066e140000)
 concurrent mark-sweep generation total 6291456K, used 2440155K
[0x000000067ae00000, 0x00000007fae00000, 0x00000007fae00000)
 concurrent-mark-sweep perm gen total 31704K, used 18999K
[0x00000007fae00000, 0x00000007fccf6000, 0x0000000800000000)

Here again are our custom settings in case there are some suggestions out
there. Are we making it worse with these settings? What should we try next?

(Continue reading)

Stack | 5 Jan 17:45 2011
Picon

Re: JVM OOM

What was the server carrying?  How many regions?  What kinda of
loading was on the cluster?  We should not be OOME'ing.  Do you have
the heap dump lying around (We dump heap on OOME... its named *.hprof
or something.  If you have it, want to put it somewhere for me to pull
it so I can take a look?).  Any chance of a errant big cells?  Lots of
them?  What JVM version?

St.Ack

On Wed, Jan 5, 2011 at 8:10 AM, Wayne <wav100@...> wrote:
> I am still struggling with the JVM. We just had a hard OOM crash of a region
> server after only running for 36 hours. Any help would be greatly
> appreciated. Do we need to restart nodes every 24 hours under load?  GC
> Pauses are something we are trying to plan for, but full out OOM crashes are
> a new problem.
>
> The message below seems to be where it starts going bad. It is followed by
> no less than 63 Concurrent Mode Failure errors over a 16 minute period.
>
> *GC locker: Trying a full collection because scavenge failed*
>
> Lastly here is the end (after the 63 CMF errors).
>
> Heap
>  par new generation   total 1887488K, used 303212K [0x00000005fae00000,
> 0x000000067ae00000, 0x000000067ae00000)
>  eden space 1677824K,  18% used [0x00000005fae00000, 0x000000060d61b078,
> 0x0000000661480000)
>  from space 209664K,   0% used [0x000000066e140000, 0x000000066e140000,
> 0x000000067ae00000)
(Continue reading)

Wayne | 5 Jan 18:13 2011
Picon

Re: JVM OOM

It was carrying ~9k writes/sec and has been for the last 24+ hours. There
are 500+ regions on that node. I could not find the heap dump (location?)
but we do have some errant big rows that have crashed before. When we query
those big rows thrift has been crashing. Maybe major compaction kicked in
for those rows (see last log entry below)? There are 30 million columns with
all small cell values but the 30 million is definitely too much.

Here are some errors from the hadoop log. It looks like it kept getting
stuck on something which may point to the data being too big? The error
below occured 12 times in a row.

org.apache.hadoop.ipc.RemoteException: java.io.IOException:
blk_2176268114489978801_654636 is already commited, storedBlock == null.

Here is the entry from the HBase log.

15:26:44,946 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
java.io.IOException: Broken pipe
15:26:44,946 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
Aborting region server serverName=sacdnb08.dataraker.net,60020,1294089592450,
load=(requests=0, regions=552, usedHeap=7977, maxHeap=7987): Uncaught
exception in service thread regionserver60020.compactor
java.lang.OutOfMemoryError: Java heap space

Thanks.

On Wed, Jan 5, 2011 at 11:45 AM, Stack <stack@...> wrote:

> What was the server carrying?  How many regions?  What kinda of
> loading was on the cluster?  We should not be OOME'ing.  Do you have
(Continue reading)

Wayne | 5 Jan 20:24 2011
Picon

Re: JVM OOM

Pretty sure this is compaction. The same node OOME again along with another
node after starting compaction. Like cass* .6 I guess hbase can not handle a
row bigger than it can hold in memory. I always read a lot about big cells
being a problem, but this problem is big rows.

Thanks.

On Wed, Jan 5, 2011 at 12:13 PM, Wayne <wav100@...> wrote:

> It was carrying ~9k writes/sec and has been for the last 24+ hours. There
> are 500+ regions on that node. I could not find the heap dump (location?)
> but we do have some errant big rows that have crashed before. When we query
> those big rows thrift has been crashing. Maybe major compaction kicked in
> for those rows (see last log entry below)? There are 30 million columns with
> all small cell values but the 30 million is definitely too much.
>
> Here are some errors from the hadoop log. It looks like it kept getting
> stuck on something which may point to the data being too big? The error
> below occured 12 times in a row.
>
> org.apache.hadoop.ipc.RemoteException: java.io.IOException:
> blk_2176268114489978801_654636 is already commited, storedBlock == null.
>
> Here is the entry from the HBase log.
>
> 15:26:44,946 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
> java.io.IOException: Broken pipe
> 15:26:44,946 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> Aborting region server serverName=sacdnb08.dataraker.net,60020,1294089592450,
> load=(requests=0, regions=552, usedHeap=7977, maxHeap=7987): Uncaught
(Continue reading)

Stack | 5 Jan 21:36 2011
Picon

Re: JVM OOM

That could be it.  Do you know names of participating files?  The
should be printed in the log.  Can you save aside that Store, tar it
up and put it somewhere we can pull (if data is ok for others to see)?
 If we are OOME'ing on big rows, thats something I know some fellas
for sure would be interested in fixing.

(There is actually a 'limit' on number of kvs to get at a time when
scanning unused by the compaction code -- perhaps thats all it would
take to fix this issue?)

Its highly unlikely but I'll ask anyways, was the OOME in same
location both times?  (OOME stack trace may be in the .out file rather
than the .log file).

Location of the hprof is usually where the program was launched from
(check $HBASE_HOME dir).

St.Ack

On Wed, Jan 5, 2011 at 11:24 AM, Wayne <wav100@...> wrote:
> Pretty sure this is compaction. The same node OOME again along with another
> node after starting compaction. Like cass* .6 I guess hbase can not handle a
> row bigger than it can hold in memory. I always read a lot about big cells
> being a problem, but this problem is big rows.
>
> Thanks.
>
> On Wed, Jan 5, 2011 at 12:13 PM, Wayne <wav100@...> wrote:
>
>> It was carrying ~9k writes/sec and has been for the last 24+ hours. There
(Continue reading)

Wayne | 5 Jan 22:08 2011
Picon

Re: JVM OOM

Assuming it was big rows I dropped the table and have started loading it
again (I could not share the data anyway). We will see it again eventually
if it is not the large rows. We have changed our data model to be "taller"
so the 30 million column row is now gone for good.

Below is the contents of the .out file from the first OOM. I found the hprof
files, but I have no public place to put them. At 10GB I can not even get
them to my laptop to look at either.

Thanks.

[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor2]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor3]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor1]
[Unloading class sun.reflect.GeneratedConstructorAccessor10]
[Unloading class sun.reflect.GeneratedConstructorAccessor21]
[Unloading class sun.reflect.GeneratedConstructorAccessor23]
[Unloading class sun.reflect.GeneratedMethodAccessor17]
[Unloading class sun.reflect.GeneratedMethodAccessor1]
[Unloading class sun.reflect.GeneratedMethodAccessor19]
[Unloading class sun.reflect.GeneratedMethodAccessor18]
[Unloading class sun.reflect.GeneratedMethodAccessor16]
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid17874.hprof ...
Exception in thread "pool-1-thread-6" java.lang.OutOfMemoryError: Java
heap space
	at org.apache.hadoop.hbase.client.Put.readFields(Put.java:500)
Heap dump file created [9689854890 bytes in 151.877 secs]
	at org.apache.hadoop.hbase.client.MultiPut.readFields(MultiPut.java:111)
	at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:461)
(Continue reading)

Stack | 5 Jan 22:33 2011
Picon

Re: JVM OOM

k.  Thanks for stack trace.  Its unrelated to compactions but that
don't mean compaction not responsible.

I'm filing an issue for wide rows causing OOME (HBASE-3421).  Maybe
its not the case but we should check it out at least because I know of
others who have schemas where the row width can grow unbounded.

Thanks,
St.Ack

On Wed, Jan 5, 2011 at 1:08 PM, Wayne <wav100@...> wrote:
> Assuming it was big rows I dropped the table and have started loading it
> again (I could not share the data anyway). We will see it again eventually
> if it is not the large rows. We have changed our data model to be "taller"
> so the 30 million column row is now gone for good.
>
> Below is the contents of the .out file from the first OOM. I found the hprof
> files, but I have no public place to put them. At 10GB I can not even get
> them to my laptop to look at either.
>
> Thanks.
>
> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor2]
> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor3]
> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor1]
> [Unloading class sun.reflect.GeneratedConstructorAccessor10]
> [Unloading class sun.reflect.GeneratedConstructorAccessor21]
> [Unloading class sun.reflect.GeneratedConstructorAccessor23]
> [Unloading class sun.reflect.GeneratedMethodAccessor17]
> [Unloading class sun.reflect.GeneratedMethodAccessor1]
(Continue reading)

Stack | 6 Jan 00:16 2011
Picon

Re: JVM OOM

Wayne:

In case you didn't see it, one of the lads had a suggestion over in
https://issues.apache.org/jira/browse/HBASE-3421.

St.Ack

On Wed, Jan 5, 2011 at 1:33 PM, Stack <stack@...> wrote:
> k.  Thanks for stack trace.  Its unrelated to compactions but that
> don't mean compaction not responsible.
>
> I'm filing an issue for wide rows causing OOME (HBASE-3421).  Maybe
> its not the case but we should check it out at least because I know of
> others who have schemas where the row width can grow unbounded.
>
> Thanks,
> St.Ack
>
>
> On Wed, Jan 5, 2011 at 1:08 PM, Wayne <wav100@...> wrote:
>> Assuming it was big rows I dropped the table and have started loading it
>> again (I could not share the data anyway). We will see it again eventually
>> if it is not the large rows. We have changed our data model to be "taller"
>> so the 30 million column row is now gone for good.
>>
>> Below is the contents of the .out file from the first OOM. I found the hprof
>> files, but I have no public place to put them. At 10GB I can not even get
>> them to my laptop to look at either.
>>
>> Thanks.
(Continue reading)


Gmane