anil gupta | 5 Mar 20:48 2012

Bulk loading a CSV file into HBase

Hi All,

I am getting a "Bad line at offset" error in the stderr logs of tasks while
testing bulk loading of a CSV file into HBase. I am using cdh3u2. Import of
a TSV file works fine.

Here is the command I ran:
sudo -u hdfs hadoop jar /usr/lib/hbase/hbase-0.90.4-cdh3u2.jar importtsv
-Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:city  testload  /temp/csv
-Dimporttsv.skip.bad.lines=true '-Dimporttsv.separator=,'

Job Stdout logs:
[root <at> ihub-namenode1 ihub]# sudo -u hdfs hadoop jar
/usr/lib/hbase/hbase-0.90.4-cdh3u2.jar importtsv
-Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:city  testload  /temp/csv
-Dimporttsv.skip.bad.lines=true '-Dimporttsv.separator=,'
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client
environment:zookeeper.version=3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client environment:host.name
=ihub-namenode1
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client
environment:java.version=1.6.0_20
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client
environment:java.vendor=Sun Microsystems Inc.
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client
environment:java.home=/usr/java/jdk1.6.0_20/jre
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client
environment:java.class.path=/usr/lib/hadoop-0.20/conf:/usr/java/jdk1.6.0_20/jre//lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2-cdh3u2.jar:/usr/lib/hadoop-0.20/lib/ant-contrib-1.0b3.jar:/usr/lib/hadoop-0.20/lib/aspectjrt-1.6.5.jar:/usr/lib/hadoop-0.20/lib/aspectjtools-1.6.5.jar:/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.4.jar:/usr/lib/hadoop-0.20/lib/commons-daemon-1.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/guava-r06.jar:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2-cdh3u2.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jackson-mapper-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.26.cloudera.1.jar:/usr/lib/hadoop-0.20/lib/jetty-servlet-tester-6.1.26.cloudera.1.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.26.cloudera.1.jar:/usr/lib/hadoop-0.20/lib/jsch-0.1.42.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-20081211.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/zookeeper.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar:/usr/lib/hadoop/lib:/usr/lib/hbase/lib:/usr/lib/sqoop/lib:/etc/hbase/conf
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client
environment:java.library.path=/usr/java/jdk1.6.0_20/jre/lib/amd64/server:/usr/java/jdk1.6.0_20/jre/lib/amd64:/usr/java/jdk1.6.0_20/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib

Stack | 5 Mar 23:58 2012

Re: Bulk loading a CSV file into HBase

On Mon, Mar 5, 2012 at 11:48 AM, anil gupta <anilgupta84@...> wrote:
> I am getting a "Bad line at offset" error in the stderr logs of tasks while
> testing bulk loading of a CSV file into HBase. I am using cdh3u2. Import of
> a TSV file works fine.
>

It's either the encoding of your TSV and CSV files, or a problem with the
parsing code in the importtsv tool.  Can you figure out which it is?  Can
you add a bit of debug output for the next time you run the job?

Thanks,
St.Ack

anil gupta | 6 Mar 01:54 2012

Re: Bulk loading a CSV file into HBase

Hi St.Ack,

Thanks for the response. Both the TSV and CSV are UTF-8 files. Could you
please let me know how to run bulk loading in debug mode? I don't know of
any Hadoop option which can run a job in debug mode.

Thanks,
Anil


-- 
Thanks & Regards,
Anil Gupta
(Continue reading)

Shrijeet Paliwal | 6 Mar 02:06 2012

Re: Bulk loading a CSV file into HBase

Anil,
Stack meant adding debug statements yourself in the tool.

-Shrijeet


anil gupta | 8 Mar 08:59 2012

Re: Bulk loading a CSV file into HBase

Hi Stack,

I decompiled the ImportTsv class and added some sysout statements in main()
to figure out the problem. Please find the modified class here:
http://pastebin.com/sKQcMXe4
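
For reference, the debug prints were roughly of the following shape (a
sketch inferred from the "Command line Arguments::" and "OtherArguments==>"
echoes in the log below; the pastebin above is the authoritative version):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.util.GenericOptionsParser;

    // Echo the raw args, then what GenericOptionsParser leaves behind,
    // then the separator it actually put into the job conf.
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      for (String arg : args) {
        System.out.println("Command line Arguments::" + arg);
      }
      String[] otherArgs =
          new GenericOptionsParser(conf, args).getRemainingArgs();
      for (String arg : otherArgs) {
        System.out.println("OtherArguments==>" + arg);
      }
      System.out.println("SEPARATOR as per jobconf:"
          + conf.get("importtsv.separator"));
      // ... rest of ImportTsv.main() unchanged ...
    }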

With Keshav's help, I got to know that the CSV import works fine when I
provide "-Dimporttsv.separator=," as the first command-line parameter after
specifying the class name.

Here is the command and console log of the successful import of the CSV file:
sudo -u hdfs hadoop jar /usr/lib/hadoop/importdata.jar
com.intuit.ihub.hbase.poc.ImportData -Dimporttsv.separator=,
-Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:city  testload  /temp/csv
-Dimporttsv.skip.bad.lines=true
Command line Arguments::-Dimporttsv.separator=,
Command line
Arguments::-Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:city
Command line Arguments::testload
Command line Arguments::/temp/csv
Command line Arguments::-Dimporttsv.skip.bad.lines=true
OtherArguments==>testload
OtherArguments==>/temp/csv
OtherArguments==>-D
OtherArguments==>importtsv.skip.bad.lines=true
SEPARATOR as per jobconf:,

Stack | 8 Mar 18:12 2012

Re: Bulk loading a CSV file into HBase

On Wed, Mar 7, 2012 at 11:59 PM, anil gupta <anilgupt@...> wrote:
> I tried to analyze the problem and, as per my analysis, there is a problem
> with "String[] otherArgs = new GenericOptionsParser(conf,
> args).getRemainingArgs();" on line #102. Let me know your views.
>

So, it's just where you put the option on the command line?  If it's at
the end, my guess is the arg is presumed to be for the program.  If it's
before the program name, then it's for GenericOptionsParser to digest.
That's sort of how it is expected to work, I'd say.  It's confusing,
though.  Can we do anything in the usage text for the importtsv tool to
make it so others don't have this issue?

Thanks,
St.Ack
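
To make the placement rule concrete, here is the same importtsv invocation
both ways (jar and paths shortened from the full commands earlier in the
thread):

    # Consumed by GenericOptionsParser: every -D appears before the first
    # positional argument (the table name), so all three land in the job conf.
    hadoop jar hbase-0.90.4-cdh3u2.jar importtsv \
        -Dimporttsv.separator=, \
        -Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:city \
        -Dimporttsv.skip.bad.lines=true \
        testload /temp/csv

    # Silently ignored: any -D after the table name is handed to the
    # program as a plain argument (the failure mode in this thread).
    hadoop jar hbase-0.90.4-cdh3u2.jar importtsv \
        -Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:city \
        testload /temp/csv -Dimporttsv.separator=,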

anil gupta | 8 Mar 20:14 2012

Re: Bulk loading a CSV file into HBase

Hi Stack,

Yes, the separator argument is sensitive to its position in the command.
Currently, it needs to be specified right after the program name. This is
not mentioned in the docs.

I have two suggestions for fixing this so that others don't run into the
same problem:

1. Update the HBase bulk load documentation to specify that the separator
argument must come right after the program name.
2. Fix the problem in the code itself by handling the separator argument
explicitly. (Still, I am wondering why only the separator value is not set
in the job conf automatically when it is not provided next to the program
name?) A rough sketch of this idea is shown below.
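
A hypothetical sketch of suggestion 2: pre-scan the raw args for the
separator before GenericOptionsParser runs, so its position no longer
matters. The class and helper names here are illustrative, not code from
ImportTsv:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;

    public class SeparatorPreScan {
      // Pull -Dimporttsv.separator out of args and set it on the conf
      // directly; return the remaining args for normal parsing.
      static String[] extractSeparator(Configuration conf, String[] args) {
        final String prefix = "-Dimporttsv.separator=";
        List<String> kept = new ArrayList<String>();
        for (String arg : args) {
          if (arg.startsWith(prefix)) {
            conf.set("importtsv.separator", arg.substring(prefix.length()));
          } else {
            kept.add(arg);
          }
        }
        return kept.toArray(new String[kept.size()]);
      }
    }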

What's your take?

Thanks,
Anil


Stack | 8 Mar 20:27 2012

Re: Bulk loading a CSV file into HBase

On Thu, Mar 8, 2012 at 11:14 AM, anil gupta <anilgupt@...> wrote:
> 1. Update the HBase bulk load documentation to specify that the separator
> argument must come right after the program name.

This would help.

> 2. Fix the problem in the code itself by handling the separator argument
> explicitly. (Still, I am wondering why only the separator value is not set
> in the job conf automatically when it is not provided next to the program name?)
>

This is probably too late, IIRC.  I haven't looked at the code, but
GenericOptionsParser has probably already been run by the time the
application starts to process args.  Duplicating what GOP does in the
application is probably not the way to go either?

St.Ack

Shrijeet Paliwal | 8 Mar 21:06 2012

Re: Bulk loading a CSV file into HBase

GenericOptionsParser stops parsing the arguments as soon as the first
non-option is encountered (refer to:
http://commons.apache.org/cli/api-1.2/org/apache/commons/cli/Parser.html#parse(org.apache.commons.cli.Options,
java.lang.String[], boolean))

So in this case, as soon as the parser sees the table name argument, it
ignores all other properties specified with the -D option. Note that it not
only ignores the separator; it also ignored the importtsv.skip.bad.lines
option in your failed run.
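
For illustration, here is a small self-contained demo of that behavior
against commons-cli 1.2, with the -D option registered the same way
GenericOptionsParser registers it (the class name is mine):

    import java.util.Arrays;
    import org.apache.commons.cli.CommandLine;
    import org.apache.commons.cli.GnuParser;
    import org.apache.commons.cli.OptionBuilder;
    import org.apache.commons.cli.Options;

    public class StopAtNonOptionDemo {
      public static void main(String[] args) throws Exception {
        Options opts = new Options();
        // Same shape as the -D option GenericOptionsParser registers.
        opts.addOption(OptionBuilder.withArgName("property=value")
            .hasArg().create('D'));

        String[] argv = {"-Dimporttsv.separator=,", "testload", "/temp/csv",
                         "-Dimporttsv.skip.bad.lines=true"};

        // stopAtNonOption=true (what GenericOptionsParser passes): parsing
        // stops at "testload", so the trailing -D comes back as a plain
        // argument instead of a property.
        CommandLine cl = new GnuParser().parse(opts, argv, true);
        System.out.println(Arrays.toString(cl.getOptionValues('D')));
        // -> [importtsv.separator=,]
        System.out.println(cl.getArgList());
        // -> [testload, /temp/csv, -D, importtsv.skip.bad.lines=true]
      }
    }

Note how the split "-D" / "importtsv.skip.bad.lines=true" pair matches the
OtherArguments==> lines in Anil's debug output above.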


anil gupta | 8 Mar 22:42 2012

Re: Bulk loading a CSV file into HBase

Yeah, after digging further into the code: line #374 in
GenericOptionsParser.java, "commandLine = parser.parse(opts, args, true);",
is the culprit. Nice find, Shrijeet. That answers my question. :)

Stack:
Could you please tell me the meaning of "IIRC"? Updating the document is
good, but as per the behavior of parse(), every other -D option will also
be ignored if the table name is followed by any -D option.
Duplicating the GOP functionality does not seem to be a good idea. Maybe,
instead of invoking "parser.parse(opts, args, true);", we can somehow
invoke "parser.parse(opts, args, false);" and then all will be good. I
haven't looked at the API to know whether that is possible. This is just
food for thought.

Thanks,
Anil


Laxman | 9 Mar 09:20 2012

RE: Bulk loading a CSV file into HBase

Hi Anil,

> instead of invoking "parser.parse(opts, args, true);", we can somehow
> invoke "parser.parse(opts, args, false);" and then all will be good. I
> haven't looked at the API to know whether that is possible.

Changing to parser.parse(opts, args, false) solves this problem.
I think we need to consider the following before going for this change.

This involves a behavior change in legacy Hadoop code. Directly changing
from true to false may cause a behavioral compatibility issue.

Also, setting it to false may not be correct all the time.

Case #1: Java
"java -Dprop1=val1 <Class> arg1 arg2" is different from "java <Class> arg1
arg2 -Dprop1=val1".

In this case it looks like parser.parse(opts, args, true) is correct.

Case #2: Linux
"ls -l /home" is the same as "ls /home -l".

In this case it looks like parser.parse(opts, args, false) is correct.

>> This is probably too late IIRC
I hope Stack also meant the same point here.
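
For what it's worth, flipping the boolean in the commons-cli demo earlier
in the thread shows the Case #2 behavior (same opts and argv as before):

    // stopAtNonOption=false: "testload" and "/temp/csv" are collected as
    // plain args and parsing continues, so the trailing -D is still
    // recognized (the "ls /home -l" style of Case #2).
    CommandLine cl2 = new GnuParser().parse(opts, argv, false);
    System.out.println(Arrays.toString(cl2.getOptionValues('D')));
    // -> [importtsv.separator=,, importtsv.skip.bad.lines=true]
    System.out.println(cl2.getArgList());
    // -> [testload, /temp/csv]

Note that with false, a token that looks like an unknown option (say,
"-foo") raises an UnrecognizedOptionException instead of being passed
through, which is part of the compatibility concern above.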


anil gupta | 9 Mar 17:29 2012

Re: Bulk loading a CSV file into HBase

Hi Laxman,

As per your last email, updating the doc seems to be the easy and right
approach.

Thanks,
Anil Gupta


Harsh J | 28 May 14:36 2012

Re: Bulk loading a CSV file into HBase

Anil,

Sorry for the late bump, but just for your reference, this is caused by:
https://issues.apache.org/jira/browse/HADOOP-7995


