Brendan W. McAdams | 1 Dec 2011 01:34

Re: Re: Mongo-hadoop: map input records is wrong.

Artem,


I haven't seen behavior like this on my end, but it is certainly possible you have found a bug.

A few questions that can help narrow it down:

- What distribution and version of Hadoop are you using?

- Is your MongoDB sharded or unsharded?

Because you are using a query (per your settings), I suspect that may be the culprit: queries aren't snapshotted, so the contents of the cursor can shift as you iterate through it. The query itself can of course also account for the discrepancy in numbers, since it restricts which documents are read.
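
To illustrate the point about unsnapshotted queries, here is a toy Python model. It is not MongoDB or mongo-hadoop code: the storage list and the names `scan`, `relocate`, and `delete` are all made up for this sketch. It walks a list in storage order while a concurrent change happens mid-scan, and shows how the number of documents seen can end up above or below the true count.

```python
def scan(storage, mutate_after, mutate):
    """Walk `storage` by position, like a cursor with no snapshot.

    `mutate` is called once, right after `mutate_after` documents were read,
    to stand in for a concurrent write (hypothetical model, not a real API).
    """
    seen = []
    pos = 0
    reads = 0
    while pos < len(storage):
        doc = storage[pos]
        pos += 1
        if doc is None:  # empty slot left behind by a move or a delete
            continue
        seen.append(doc)
        reads += 1
        if reads == mutate_after:
            mutate(storage)
    return seen


def relocate(doc_id):
    """Concurrent update that moves a document to the end of storage."""
    def mutate(storage):
        storage[storage.index(doc_id)] = None
        storage.append(doc_id)
    return mutate


def delete(doc_id):
    """Concurrent delete of a document the scan has not reached yet."""
    def mutate(storage):
        storage[storage.index(doc_id)] = None
    return mutate


docs = ["a", "b", "c", "d"]
grown = scan(list(docs), 2, relocate("a"))   # "a" is read a second time
shrunk = scan(list(docs), 2, delete("d"))    # "d" is never read
print(len(docs), len(grown), len(shrunk))
```

In this model the scan over four documents reports five records in one run and three in the other, which is the kind of drift a long-running job over a busy collection could show.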

On Wed, Nov 30, 2011 at 9:10 PM, yankov <artem.yankov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
Oh, I've just noticed a response.
I can't see the number of docs processed in the job output. Is there a
special place where I can check it?
Also, I haven't changed any special settings for the adapter.

Here's the job configuration:


fs.s3n.impl     org.apache.hadoop.fs.s3native.NativeS3FileSystem
mapred.task.cache.levels        2
hadoop.tmp.dir  /mnt/hadoop
hadoop.native.lib       true
map.sort.class  org.apache.hadoop.util.QuickSort
dfs.namenode.decommission.nodes.per.interval    5
dfs.https.need.client.auth      false
mongo.job.reducer       com.mongodb.hadoop.examples.leaderboard.LeaderboardReducer
ipc.client.idlethreshold        4000
dfs.datanode.data.dir.perm      755
mapred.system.dir       ${hadoop.tmp.dir}/mapred/system
mapred.job.tracker.persist.jobstatus.hours      0
dfs.datanode.address    0.0.0.0:50010
dfs.namenode.logging.level      info
dfs.block.access.token.enable   false
io.skip.checksum.errors false
mongo.output.uri        mongodb://host/db/collection
fs.default.name hdfs://x.x.x.x:50001
mapred.cluster.reduce.memory.mb -1
mapred.reducer.new-api  true
mapred.child.tmp        ./tmp
fs.har.impl.disable.cache       true
dfs.safemode.threshold.pct      0.999f
mapred.skip.reduce.max.skip.groups      0
dfs.namenode.handler.count      10
dfs.blockreport.initialDelay    0
mapred.heartbeats.in.second     100
mapred.tasktracker.dns.nameserver       default
io.sort.factor  10
mapred.task.timeout     4200000
mapred.max.tracker.failures     4
hadoop.rpc.socket.factory.class.default org.apache.hadoop.net.StandardSocketFactory
mapred.job.tracker.jobhistory.lru.cache.size    5
fs.hdfs.impl    org.apache.hadoop.hdfs.DistributedFileSystem
mapred.queue.default.acl-administer-jobs        *
mapred.output.key.class com.mongodb.hadoop.io.BSONWritable
dfs.block.access.key.update.interval    600
mapred.skip.map.auto.incr.proc.count    true
mongo.input.skip        0
mapreduce.job.complete.cancel.delegation.tokens true
io.mapfile.bloom.size   1048576
mapreduce.reduce.shuffle.connect.timeout        180000
dfs.safemode.extension  30000
mapred.jobtracker.blacklist.fault-timeout-window        180
tasktracker.http.threads        80
mapred.job.shuffle.merge.percent        0.66
mapreduce.inputformat.class     com.mongodb.hadoop.MongoInputFormat
fs.ftp.impl     org.apache.hadoop.fs.ftp.FTPFileSystem
user.name       root
mapred.output.compress  true
io.bytes.per.checksum   512
mapred.healthChecker.script.timeout     600000
mongo.job.input.format  com.mongodb.hadoop.MongoInputFormat
topology.node.switch.mapping.impl       org.apache.hadoop.net.ScriptBasedMapping
dfs.https.server.keystore.resource      ssl-server.xml
mapred.reduce.slowstart.completed.maps  0.05
mongo.input.split.read_from_shards      false
mapred.reduce.max.attempts      4
fs.ramfs.impl   org.apache.hadoop.fs.InMemoryFileSystem
dfs.block.access.token.lifetime 600
dfs.name.edits.dir      ${dfs.name.dir}
mapred.skip.map.max.skip.records        0
mapred.cluster.map.memory.mb    -1
hadoop.security.group.mapping   org.apache.hadoop.security.ShellBasedUnixGroupsMapping
mongo.job.output.format com.mongodb.hadoop.MongoOutputFormat
mapred.job.tracker.persist.jobstatus.dir        /jobtracker/jobsInfo
mapred.jar      hdfs://x.x.x.x:50001/mnt/hadoop/mapred/staging/root/.staging/job_201111282214_20169/job.jar
dfs.block.size  67108864
fs.s3.buffer.dir        ${hadoop.tmp.dir}/s3
job.end.retry.attempts  0
fs.file.impl    org.apache.hadoop.fs.LocalFileSystem
mapred.local.dir.minspacestart  0
mapred.output.compression.type  BLOCK
dfs.datanode.ipc.address        0.0.0.0:50020
dfs.permissions true
topology.script.number.args     100
io.mapfile.bloom.error.rate     0.005
mapred.cluster.max.reduce.memory.mb     -1
mapred.max.tracker.blacklists   4
mapred.task.profile.maps        0-2
mongo.input.limit       0
dfs.datanode.https.address      0.0.0.0:50475
mapred.userlog.retain.hours     24
dfs.secondary.http.address      0.0.0.0:50090
dfs.replication.max     512
mapred.job.tracker.persist.jobstatus.active     false
hadoop.security.authorization   false
local.cache.size        10737418240
dfs.namenode.delegation.token.renew-interval    86400000
mapred.min.split.size   0
mapred.map.tasks        1
mapred.child.java.opts  -Xmx200m
mapreduce.job.counters.limit    120
mapred.output.value.class       org.apache.hadoop.io.DoubleWritable
dfs.https.client.keystore.resource      ssl-client.xml
mapred.job.queue.name   default
mongo.job.name  samsung#4ddd98a8c47eed304e0002ff
dfs.https.address       0.0.0.0:50470
mapred.job.tracker.retiredjobs.cache.size       1000
dfs.balance.bandwidthPerSec     1048576
ipc.server.listen.queue.size    128
mapred.inmem.merge.threshold    1000
job.end.retry.interval  30000
mapred.skip.attempts.to.start.skipping  2
fs.checkpoint.dir       ${hadoop.tmp.dir}/dfs/namesecondary
mapred.reduce.tasks     1
mongo.job.output.value  org.apache.hadoop.io.DoubleWritable
mapred.merge.recordsBeforeProgress      10000
mapred.userlog.limit.kb 0
mapred.job.reduce.memory.mb     -1
dfs.max.objects 0
webinterface.private.actions    false
io.sort.spill.percent   0.80
mapred.job.shuffle.input.buffer.percent 0.70
mongo.job.background    false
mapred.job.name samsung#4ddd98a8c47eed304e0002ff
dfs.datanode.dns.nameserver     default
mapred.map.tasks.speculative.execution  true
hadoop.util.hash.type   murmur
dfs.blockreport.intervalMsec    3600000
mapred.map.max.attempts 4
mapreduce.job.acl-view-job
dfs.client.block.write.retries  3
mapred.job.tracker.handler.count        10
mapreduce.reduce.shuffle.read.timeout   180000
mapred.tasktracker.expiry.interval      600000
dfs.https.enable        false
mapred.jobtracker.maxtasks.per.job      -1
mapred.jobtracker.job.history.block.size        3145728
keep.failed.task.files  false
mapreduce.outputformat.class    com.mongodb.hadoop.MongoOutputFormat
dfs.datanode.failed.volumes.tolerated   0
ipc.client.tcpnodelay   false
mapred.task.profile.reduces     0-2
mapred.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec
io.map.index.skip       0
mapred.working.dir      hdfs://x.x.x.x:50001/user/root
ipc.server.tcpnodelay   false
mapred.jobtracker.blacklist.fault-bucket-width  15
dfs.namenode.delegation.key.update-interval     86400000
mapred.used.genericoptionsparser        true
mapred.mapper.new-api   true
mapred.job.map.memory.mb        -1
dfs.default.chunk.view.size     32768
hadoop.logfile.size     10000000
mapred.reduce.tasks.speculative.execution       true
mapreduce.job.dir       hdfs://x.x.x.x:50001/mnt/hadoop/mapred/staging/root/.staging/job_201111282214_20169
mapreduce.tasktracker.outofband.heartbeat       false
mapreduce.reduce.input.limit    -1
mongo.job.mapper.output.value   org.apache.hadoop.io.DoubleWritable
dfs.datanode.du.reserved        0
mongo.input.split.read_shard_chunks     false
hadoop.security.authentication  simple
fs.checkpoint.period    3600
dfs.web.ugi     webuser,webgroup
mapred.job.reuse.jvm.num.tasks  1
mapred.jobtracker.completeuserjobs.maximum      100
dfs.df.interval 60000
dfs.data.dir    ${hadoop.tmp.dir}/dfs/data
mapred.task.tracker.task-controller     org.apache.hadoop.mapred.DefaultTaskController
mongo.job.verbose       true
fs.s3.maxRetries        4
dfs.datanode.dns.interface      default
mapred.cluster.max.map.memory.mb        -1
mapred.map.child.java.opts      -Xmx4000m
dfs.support.append      false
mapreduce.job.acl-modify-job
dfs.permissions.supergroup      supergroup
mapred.local.dir        ${hadoop.tmp.dir}/mapred/local
fs.hftp.impl    org.apache.hadoop.hdfs.HftpFileSystem
fs.trash.interval       0
fs.s3.sleepTimeSeconds  10
dfs.replication.min     1
mapred.submit.replication       10
fs.har.impl     org.apache.hadoop.fs.HarFileSystem
mapred.map.output.compression.codec     org.apache.hadoop.io.compress.DefaultCodec
mapred.tasktracker.dns.interface        default
dfs.namenode.decommission.interval      30
dfs.http.address        0.0.0.0:50070
dfs.heartbeat.interval  3
mapred.job.tracker      hdfs://x.x.x.x:50002
mapreduce.job.submithost        x.x.x.x
io.seqfile.sorter.recordlimit   1000000
dfs.name.dir    ${hadoop.tmp.dir}/dfs/name
mapred.line.input.format.linespermap    1
mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.JobQueueTaskScheduler
dfs.datanode.http.address       0.0.0.0:50075
mapred.local.dir.minspacekill   0
dfs.replication.interval        3
io.sort.record.percent  0.05
mapreduce.reduce.class  com.mongodb.hadoop.examples.leaderboard.LeaderboardReducer
fs.kfs.impl     org.apache.hadoop.fs.kfs.KosmosFileSystem
mapred.temp.dir ${hadoop.tmp.dir}/mapred/temp
mapred.tasktracker.reduce.tasks.maximum 3
dfs.replication 3
fs.checkpoint.edits.dir ${fs.checkpoint.dir}
mapred.tasktracker.tasks.sleeptime-before-sigkill       5000
mapred.job.reduce.input.buffer.percent  0.0
mongo.input.query       {"user_id":{"$in":[{"$oid":"some_id_here"},{"$oid":"another_id"}]}}
mapred.tasktracker.indexcache.mb        10
mapreduce.job.split.metainfo.maxsize    10000000
hadoop.logfile.count    10
mapred.skip.reduce.auto.incr.proc.count true
mapreduce.job.submithostaddress 10.6.91.183
mongo.job.mapper        com.mongodb.hadoop.examples.leaderboard.LeaderboardMapper
io.seqfile.compress.blocksize   1000000
fs.s3.block.size        67108864
mapred.tasktracker.taskmemorymanager.monitoring-interval        5000
mongo.job.output.key    com.mongodb.hadoop.io.BSONWritable
mapred.queue.default.state      RUNNING
mapred.acls.enabled     false
mapreduce.jobtracker.staging.root.dir   ${hadoop.tmp.dir}/mapred/staging
mapred.queue.names      default
dfs.access.time.precision       3600000
fs.hsftp.impl   org.apache.hadoop.hdfs.HsftpFileSystem
mapred.task.tracker.http.address        0.0.0.0:50060
mapreduce.combine.class com.mongodb.hadoop.examples.leaderboard.LeaderboardReducer
mapred.reduce.parallel.copies   5
io.seqfile.lazydecompress       true
io.sort.mb      1000
ipc.client.connection.maxidletime       10000
mapred.compress.map.output      false
hadoop.security.uid.cache.secs  14400
mapred.task.tracker.report.address      127.0.0.1:0
mongo.job.combiner      com.mongodb.hadoop.examples.leaderboard.LeaderboardReducer
mapred.healthChecker.interval   60000
ipc.client.kill.max     10
ipc.client.connect.max.retries  10
mapreduce.map.class     com.mongodb.hadoop.examples.leaderboard.LeaderboardMapper
fs.s3.impl      org.apache.hadoop.fs.s3.S3FileSystem
mapred.user.jobconf.limit       5242880
mapred.job.tracker.http.address 0.0.0.0:50030
io.file.buffer.size     4096
mapred.jobtracker.restart.recover       false
io.serializations       org.apache.hadoop.io.serializer.WritableSerialization
dfs.datanode.handler.count      3
mapred.reduce.copy.backoff      300
mapred.task.profile     false
dfs.replication.considerLoad    true
jobclient.output.filter FAILED
dfs.namenode.delegation.token.max-lifetime      604800000
mapred.tasktracker.map.tasks.maximum    3
io.compression.codecs   org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec
fs.checkpoint.size      67108864
mongo.input.uri mongodb://host/db/collection
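
Most of the dump above is general Hadoop configuration; only the mongo.* keys configure the adapter. A small Python sketch for pulling those out of a dump like this one, assuming each setting sits on a single line as whitespace-separated key and value (wrapped entries would have to be rejoined first). The helper name `adapter_settings` and the shortened sample are made up for illustration.

```python
def adapter_settings(dump, prefix="mongo."):
    """Return the keys starting with `prefix` from a key/value text dump."""
    settings = {}
    for line in dump.splitlines():
        parts = line.split(None, 1)  # key, then the rest of the line
        if not parts or not parts[0].startswith(prefix):
            continue
        # A key with nothing after it is kept with an empty value.
        settings[parts[0]] = parts[1].strip() if len(parts) == 2 else ""
    return settings


# Shortened sample copied from the configuration above.
sample = """\
mapred.map.tasks        1
mongo.input.skip        0
mongo.input.limit       0
mongo.input.split.read_from_shards      false
mongo.input.split.read_shard_chunks     false
io.sort.mb      1000
"""

for key, value in sorted(adapter_settings(sample).items()):
    print(key, "=", value)
```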




On Nov 27, 10:54 pm, Eliot Horowitz <el...-Ot75HdpNzd8AvxtiuMwx3w@public.gmane.org> wrote:
> How is the adapter configured?
> What is the number of docs processed?
>
> On Tue, Nov 22, 2011 at 8:54 PM, yankov <artem.yan...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > I'm using mongo-hadoop to pull data from mongo and do mapreduce in
> > hadoop.
> > It looks like the map input records count is always slightly different
> > from the real number of records in the collection.
>
> > I see it especially on a big collection (>300k records), where the number
> > is totally wrong, which leads to wrong calculations.
>
> > Any ideas why this might be happening?
>
> > --
> > You received this message because you are subscribed to the Google Groups "mongodb-user" group.
> > To post to this group, send email to mongodb-user-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
> > To unsubscribe from this group, send email to mongodb-user+unsubscribe <at> googlegroups.com.
> > For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.

yankov | 1 Dec 2011 01:42

Re: Mongo-hadoop: map input records is wrong.


Hey Brendan,
Thanks for the answer!

Hadoop version 0.20.203.0, r1099333
Our MongoDB is sharded, but I set mongo.input.split.read_shard_chunks
and mongo.input.split.read_from_shards to false.
Is there anything I can do to confirm your suspicion?

On Nov 30, 4:34 pm, "Brendan W. McAdams" <bren...@...> wrote:
> Artem,
>
> I haven't seen behavior like this on my end but it is certainly possible
> you have found a bug.
>
> A few questions that can help narrow it down:
>
> - What Distribution and version of Hadoop are you using?
>
> - Is your MongoDB Sharded or unsharded?
>
> Because you are using a query (per your settings above) I suspect that may
> be the culprit; queries aren't snapshotted so it's possible for the
> contents of the cursor to shift as you go through.  The query of course can
> also account for the discrepancy in numbers.


Brendan W. McAdams | 1 Dec 2011 01:47

Re: Re: Mongo-hadoop: map input records is wrong.

What happens if you turn the sharding settings on? What about running without a query?

How long does the job take to run? Is there heavy write load and/or chunk migration occurring during this period?
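
One way to organize these experiments is to write each run down as a set of overrides and pass them as `-D key=value` generic options (the job configuration reports mapred.used.genericoptionsparser as true). A minimal Python sketch, assuming "sharding settings" means the two mongo.input.split.* flags from the configuration and that leaving mongo.input.query unset means no filter. The variant names and the `d_args` helper are made up for illustration, and whether both flags should be flipped together is a judgment call.

```python
# Query string copied from the posted configuration (placeholder ids included).
QUERY = '{"user_id":{"$in":[{"$oid":"some_id_here"},{"$oid":"another_id"}]}}'

BASELINE = {
    "mongo.input.split.read_shard_chunks": "false",
    "mongo.input.split.read_from_shards": "false",
    "mongo.input.query": QUERY,
}


def without(settings, key):
    """Copy of `settings` with `key` left out."""
    return {k: v for k, v in settings.items() if k != key}


VARIANTS = {
    "baseline": BASELINE,
    # Assumption: both sharding flags are switched on together.
    "sharding_on": {
        **BASELINE,
        "mongo.input.split.read_shard_chunks": "true",
        "mongo.input.split.read_from_shards": "true",
    },
    # Assumption: omitting the key runs the job without a query.
    "no_query": without(BASELINE, "mongo.input.query"),
}


def d_args(settings):
    """Render settings as a flat list of -D key=value arguments."""
    args = []
    for key in sorted(settings):
        args += ["-D", f"{key}={settings[key]}"]
    return args


for name, settings in VARIANTS.items():
    print(name, d_args(settings))
```

Comparing the map input records counter across the three runs, ideally while writes are quiet, would show whether the query or the split settings drive the difference.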

On Dec 1, 2011 12:42 AM, "yankov" <artem.yankov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

Hey Brendan,
Thanks for the answer!

Hadoop version 0.20.203.0, r1099333
Our MongoDB is sharded, but I set mongo.input.split.read_shard_chunks
and mongo.input.split.read_from_shards to false.
Is there anything I can do to confirm your suspicion?


On Nov 30, 4:34 pm, "Brendan W. McAdams" <bren...-Ot75HdpNzd8AvxtiuMwx3w@public.gmane.org> wrote:
> Artem,
>
> I haven't seen behavior like this on my end but it is certainly possible
> you have found a bug.
>
> A few questions that can help narrow it down:
>
> - What Distribution and version of Hadoop are you using?
>
> - Is your MongoDB Sharded or unsharded?
>
> Because you are using a query (per your settings above) I suspect that may
> be the culprit; queries aren't snapshotted so it's possible for the
> contents of the cursor to shift as you go through.  The query of course can
> also account for the discrepancy in numbers.
>
>
>
>
>
>
>
> On Wed, Nov 30, 2011 at 9:10 PM, yankov <artem.yan...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > Oh, I've just noticed a response.
> > I can't see a number of docs processed in job output. Is there a
> > special place I can check it out?
> > Also I haven't change any special settings for adapter.
>
> > Here's the job configuration:
>
> > fs.s3n.impl     org.apache.hadoop.fs.s3native.NativeS3FileSystem
> > mapred.task.cache.levels        2
> > hadoop.tmp.dir  /mnt/hadoop
> > hadoop.native.lib       true
> > map.sort.class  org.apache.hadoop.util.QuickSort
> > dfs.namenode.decommission.nodes.per.interval    5
> > dfs.https.need.client.auth      false
> > mongo.job.reducer
> > com.mongodb.hadoop.examples.leaderboard.LeaderboardReducer
> > ipc.client.idlethreshold        4000
> > dfs.datanode.data.dir.perm      755
> > mapred.system.dir       ${hadoop.tmp.dir}/mapred/system
> > mapred.job.tracker.persist.jobstatus.hours      0
> > dfs.datanode.address    0.0.0.0:50010
> > dfs.namenode.logging.level      info
> > dfs.block.access.token.enable   false
> > io.skip.checksum.errors false
> > mongo.output.uri        mongodb://host/db/collection
> > fs.default.name hdfs://x.x.x.x:50001
> > mapred.cluster.reduce.memory.mb -1
> > mapred.reducer.new-api  true
> > mapred.child.tmp        ./tmp
> > fs.har.impl.disable.cache       true
> > dfs.safemode.threshold.pct      0.999f
> > mapred.skip.reduce.max.skip.groups      0
> > dfs.namenode.handler.count      10
> > dfs.blockreport.initialDelay    0
> > mapred.heartbeats.in.second     100
> > mapred.tasktracker.dns.nameserver       default
> > io.sort.factor  10
> > mapred.task.timeout     4200000
> > mapred.max.tracker.failures     4
> > hadoop.rpc.socket.factory.class.default
> > org.apache.hadoop.net.StandardSocketFactory
> > mapred.job.tracker.jobhistory.lru.cache.size    5
> > fs.hdfs.impl    org.apache.hadoop.hdfs.DistributedFileSystem
> > mapred.queue.default.acl-administer-jobs        *
> > mapred.output.key.class com.mongodb.hadoop.io.BSONWritable
> > dfs.block.access.key.update.interval    600
> > mapred.skip.map.auto.incr.proc.count    true
> > mongo.input.skip        0
> > mapreduce.job.complete.cancel.delegation.tokens true
> > io.mapfile.bloom.size   1048576
> > mapreduce.reduce.shuffle.connect.timeout        180000
> > dfs.safemode.extension  30000
> > mapred.jobtracker.blacklist.fault-timeout-window        180
> > tasktracker.http.threads        80
> > mapred.job.shuffle.merge.percent        0.66
> > mapreduce.inputformat.class     com.mongodb.hadoop.MongoInputFormat
> > fs.ftp.impl     org.apache.hadoop.fs.ftp.FTPFileSystem
> > user.name       root
> > mapred.output.compress  true
> > io.bytes.per.checksum   512
> > mapred.healthChecker.script.timeout     600000
> > mongo.job.input.format  com.mongodb.hadoop.MongoInputFormat
> > topology.node.switch.mapping.impl
> > org.apache.hadoop.net.ScriptBasedMapping
> > dfs.https.server.keystore.resource      ssl-server.xml
> > mapred.reduce.slowstart.completed.maps  0.05
> > mongo.input.split.read_from_shards      false
> > mapred.reduce.max.attempts      4
> > fs.ramfs.impl   org.apache.hadoop.fs.InMemoryFileSystem
> > dfs.block.access.token.lifetime 600
> > dfs.name.edits.dir      ${dfs.name.dir}
> > mapred.skip.map.max.skip.records        0
> > mapred.cluster.map.memory.mb    -1
> > hadoop.security.group.mapping
> > org.apache.hadoop.security.ShellBasedUnixGroupsMapping
> > mongo.job.output.format com.mongodb.hadoop.MongoOutputFormat
> > mapred.job.tracker.persist.jobstatus.dir        /jobtracker/jobsInfo
> > mapred.jar
> >  hdfs://x.x.x.x:50001/mnt/hadoop/mapred/staging/root/.staging/job_2011112822 14_20169/job.jar
> > dfs.block.size  67108864
> > fs.s3.buffer.dir        ${hadoop.tmp.dir}/s3
> > job.end.retry.attempts  0
> > fs.file.impl    org.apache.hadoop.fs.LocalFileSystem
> > mapred.local.dir.minspacestart  0
> > mapred.output.compression.type  BLOCK
> > dfs.datanode.ipc.address        0.0.0.0:50020
> > dfs.permissions true
> > topology.script.number.args     100
> > io.mapfile.bloom.error.rate     0.005
> > mapred.cluster.max.reduce.memory.mb     -1
> > mapred.max.tracker.blacklists   4
> > mapred.task.profile.maps        0-2
> > mongo.input.limit       0
> > dfs.datanode.https.address      0.0.0.0:50475
> > mapred.userlog.retain.hours     24
> > dfs.secondary.http.address      0.0.0.0:50090
> > dfs.replication.max     512
> > mapred.job.tracker.persist.jobstatus.active     false
> > hadoop.security.authorization   false
> > local.cache.size        10737418240
> > dfs.namenode.delegation.token.renew-interval    86400000
> > mapred.min.split.size   0
> > mapred.map.tasks        1
> > mapred.child.java.opts  -Xmx200m
> > mapreduce.job.counters.limit    120
> > mapred.output.value.class       org.apache.hadoop.io.DoubleWritable
> > dfs.https.client.keystore.resource      ssl-client.xml
> > mapred.job.queue.name   default
> > mongo.job.name  samsung#4ddd98a8c47eed304e0002ff
> > dfs.https.address       0.0.0.0:50470
> > mapred.job.tracker.retiredjobs.cache.size       1000
> > dfs.balance.bandwidthPerSec     1048576
> > ipc.server.listen.queue.size    128
> > mapred.inmem.merge.threshold    1000
> > job.end.retry.interval  30000
> > mapred.skip.attempts.to.start.skipping  2
> > fs.checkpoint.dir       ${hadoop.tmp.dir}/dfs/namesecondary
> > mapred.reduce.tasks     1
> > mongo.job.output.value  org.apache.hadoop.io.DoubleWritable
> > mapred.merge.recordsBeforeProgress      10000
> > mapred.userlog.limit.kb 0
> > mapred.job.reduce.memory.mb     -1
> > dfs.max.objects 0
> > webinterface.private.actions    false
> > io.sort.spill.percent   0.80
> > mapred.job.shuffle.input.buffer.percent 0.70
> > mongo.job.background    false
> > mapred.job.name samsung#4ddd98a8c47eed304e0002ff
> > dfs.datanode.dns.nameserver     default
> > mapred.map.tasks.speculative.execution  true
> > hadoop.util.hash.type   murmur
> > dfs.blockreport.intervalMsec    3600000
> > mapred.map.max.attempts 4
> > mapreduce.job.acl-view-job
> > dfs.client.block.write.retries  3
> > mapred.job.tracker.handler.count        10
> > mapreduce.reduce.shuffle.read.timeout   180000
> > mapred.tasktracker.expiry.interval      600000
> > dfs.https.enable        false
> > mapred.jobtracker.maxtasks.per.job      -1
> > mapred.jobtracker.job.history.block.size        3145728
> > keep.failed.task.files  false
> > mapreduce.outputformat.class    com.mongodb.hadoop.MongoOutputFormat
> > dfs.datanode.failed.volumes.tolerated   0
> > ipc.client.tcpnodelay   false
> > mapred.task.profile.reduces     0-2
> > mapred.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec
> > io.map.index.skip       0
> > mapred.working.dir      hdfs://x.x.x.x:50001/user/root
> > ipc.server.tcpnodelay   false
> > mapred.jobtracker.blacklist.fault-bucket-width  15
> > dfs.namenode.delegation.key.update-interval     86400000
> > mapred.used.genericoptionsparser        true
> > mapred.mapper.new-api   true
> > mapred.job.map.memory.mb        -1
> > dfs.default.chunk.view.size     32768
> > hadoop.logfile.size     10000000
> > mapred.reduce.tasks.speculative.execution       true
> > mapreduce.job.dir       hdfs://x.x.x.x:50001/mnt/hadoop/mapred/staging/root/.staging/job_201111282214_20169
> > mapreduce.tasktracker.outofband.heartbeat       false
> > mapreduce.reduce.input.limit    -1
> > mongo.job.mapper.output.value   org.apache.hadoop.io.DoubleWritable
> > dfs.datanode.du.reserved        0
> > mongo.input.split.read_shard_chunks     false
> > hadoop.security.authentication  simple
> > fs.checkpoint.period    3600
> > dfs.web.ugi     webuser,webgroup
> > mapred.job.reuse.jvm.num.tasks  1
> > mapred.jobtracker.completeuserjobs.maximum      100
> > dfs.df.interval 60000
> > dfs.data.dir    ${hadoop.tmp.dir}/dfs/data
> > mapred.task.tracker.task-controller     org.apache.hadoop.mapred.DefaultTaskController
> > mongo.job.verbose       true
> > fs.s3.maxRetries        4
> > dfs.datanode.dns.interface      default
> > mapred.cluster.max.map.memory.mb        -1
> > mapred.map.child.java.opts      -Xmx4000m
> > dfs.support.append      false
> > mapreduce.job.acl-modify-job
> > dfs.permissions.supergroup      supergroup
> > mapred.local.dir        ${hadoop.tmp.dir}/mapred/local
> > fs.hftp.impl    org.apache.hadoop.hdfs.HftpFileSystem
> > fs.trash.interval       0
> > fs.s3.sleepTimeSeconds  10
> > dfs.replication.min     1
> > mapred.submit.replication       10
> > fs.har.impl     org.apache.hadoop.fs.HarFileSystem
> > mapred.map.output.compression.codec     org.apache.hadoop.io.compress.DefaultCodec
> > mapred.tasktracker.dns.interface        default
> > dfs.namenode.decommission.interval      30
> > dfs.http.address        0.0.0.0:50070
> > dfs.heartbeat.interval  3
> > mapred.job.tracker      hdfs://x.x.x.x:50002
> > mapreduce.job.submithost        x.x.x.x
> > io.seqfile.sorter.recordlimit   1000000
> > dfs.name.dir    ${hadoop.tmp.dir}/dfs/name
> > mapred.line.input.format.linespermap    1
> > mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.JobQueueTaskScheduler
> > dfs.datanode.http.address       0.0.0.0:50075
> > mapred.local.dir.minspacekill   0
> > dfs.replication.interval        3
> > io.sort.record.percent  0.05
> > mapreduce.reduce.class  com.mongodb.hadoop.examples.leaderboard.LeaderboardReducer
> > fs.kfs.impl     org.apache.hadoop.fs.kfs.KosmosFileSystem
> > mapred.temp.dir ${hadoop.tmp.dir}/mapred/temp
> > mapred.tasktracker.reduce.tasks.maximum 3
> > dfs.replication 3
> > fs.checkpoint.edits.dir ${fs.checkpoint.dir}
> > mapred.tasktracker.tasks.sleeptime-before-sigkill       5000
> > mapred.job.reduce.input.buffer.percent  0.0
> > mongo.input.query       {"user_id":{"$in":[{"$oid":"some_id_here"},{"$oid":"another_id"}]}}
> > mapred.tasktracker.indexcache.mb        10
> > mapreduce.job.split.metainfo.maxsize    10000000
> > hadoop.logfile.count    10
> > mapred.skip.reduce.auto.incr.proc.count true
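
A dump this long is hard to scan for the handful of settings that matter to the adapter. The following is a minimal sketch, not part of mongo-hadoop, that pulls the `mongo.*` keys out of a pasted dump. It assumes each setting is a key followed by whitespace and a value, that mail quoting adds leading `>` markers, and that a long value may have wrapped onto the next line. The function name `parse_config_dump` and the wrap heuristic are illustrative assumptions; a key with a genuinely empty value followed directly by another bare key would be misread.

```python
import json
import re

# Leading mail-quote markers such as "> > ".
QUOTE_PREFIX = re.compile(r"^(?:>\s?)+")
# Rough shape of a Hadoop property name (assumption: letters, digits, '.', '_', '-').
KEY = re.compile(r"^[A-Za-z][A-Za-z0-9_.\-]*$")


def parse_config_dump(text):
    """Parse 'key value' lines from a quoted config dump into a dict.

    Lines that do not start with something key-like, or that follow a key
    which has no value yet, are treated as the wrapped remainder of the
    previous value.
    """
    settings = {}
    last_key = None
    for raw in text.splitlines():
        line = QUOTE_PREFIX.sub("", raw).strip()
        if not line:
            continue
        parts = line.split(None, 1)
        starts_with_key = KEY.match(parts[0]) is not None
        awaiting_value = last_key is not None and settings[last_key] == ""
        if len(parts) == 2 and starts_with_key:
            last_key = parts[0]
            settings[last_key] = parts[1].strip()
        elif len(parts) == 1 and starts_with_key and not awaiting_value:
            last_key = parts[0]
            settings[last_key] = ""
        elif last_key is not None:
            # Continuation of a wrapped value.
            settings[last_key] += line
    return settings


sample = """\
> > mongo.input.limit       0
> > mapred.output.compression.codec
> > org.apache.hadoop.io.compress.DefaultCodec
> > mapreduce.job.acl-view-job
> > mongo.input.split.read_shard_chunks     false
> > mongo.input.query       {"user_id":{"$in":[{"$oid":"some_id_here"},
> > {"$oid":"another_id"}]}}
> > mongo.job.verbose       true
"""

settings = parse_config_dump(sample)
mongo_settings = {k: v for k, v in settings.items() if k.startswith("mongo.")}
for key in sorted(mongo_settings):
    print(key, "=", mongo_settings[key])

# The re-joined query should be valid JSON again.
query = json.loads(settings["mongo.input.query"])
print(len(query["user_id"]["$in"]), "ids in the $in clause")
```

Filtering on the `mongo.` prefix leaves the input query, the limit/skip values and the split settings, which are the ones discussed in this thread.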

--
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To post to this group, send email to mongodb-user-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To unsubscribe from this group, send email to mongodb-user+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.

yankov | 1 Dec 2011 19:40

Re: Mongo-hadoop: map input records is wrong.

I haven't tried running it without a query; the collections are too big.
I couldn't get it to work with the sharding settings enabled: it just
hangs and doesn't receive any data back from mongo.

Jobs take different amounts of time to run depending on the size of the
input: from 2 minutes to 1-2 hours.
What I've also noticed from the logs is that when the mapper finishes, after
the last pContext.write, it hangs and waits for some time (it can be 20
minutes or more) before starting to flush the map output.

On Nov 30, 4:47 pm, "Brendan W. McAdams" <bren...@...> wrote:
> What happens if you set the sharding settings on? What about running
> without a query?
>
> How long does the job take to run? Is there heavy write and/or chunk
> migration occurring during this period?
> On Dec 1, 2011 12:42 AM, "yankov" <artem.yan...@...> wrote:
>
> > Hey Brendan,
> > Thanks for the answer!
>
> > Hadoop version 0.20.203.0, r1099333
> > Our MongoDB is sharded. But I set mongo.input.split.read_shard_chunks
> > and mongo.input.split.read_from_shards to false.
> > Is there anything I can do to confirm your suspicion?


yankov | 1 Dec 2011 19:43

Re: Mongo-hadoop: map input records is wrong.

Ah, and we have probably 10 writes per second max during this period.
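
To make Brendan's earlier point about unsnapshotted queries concrete, here is a toy model of a positional scan over storage that changes underneath it. It is only a sketch under stated assumptions: the "collection" is a plain Python list, an update is modelled as relocating one document to the end of storage, and `scan_with_moves` is a made-up helper, not anything from MongoDB or the mongo-hadoop adapter. Whether this effect explains the record counts in this thread is not confirmed. The last lines simply multiply the figures given above (about 10 writes per second, jobs of 2 minutes up to 2 hours).

```python
from collections import Counter


def scan_with_moves(doc_ids, moves):
    """Scan storage by position while documents are relocated underneath.

    doc_ids: storage order at the start of the scan.
    moves:   maps a scan step to the id of the document that is moved to
             the end of storage just before that step's read.
    Returns the ids in the order the scan returned them.
    """
    storage = list(doc_ids)
    returned = []
    position = 0
    while position < len(storage):
        moved = moves.get(position)
        if moved is not None:
            # Relocation: the document leaves its slot and is appended.
            storage.remove(moved)
            storage.append(moved)
        returned.append(storage[position])
        position += 1
    return returned


doc_ids = ["a", "b", "c", "d", "e"]
# "a" has already been read when it is relocated just before the third read.
returned = scan_with_moves(doc_ids, {2: "a"})
counts = Counter(returned)
duplicates = sorted(d for d, n in counts.items() if n > 1)
missed = sorted(set(doc_ids) - set(returned))
print("returned:", returned)
print("duplicates:", duplicates, "missed:", missed)

# Rough upper bound on writes during one job, using the numbers in this thread.
writes_per_second = 10
estimated_writes = {minutes: writes_per_second * minutes * 60 for minutes in (2, 120)}
print("writes per job (minutes -> writes):", estimated_writes)
```

In this toy run the scan returns "a" twice and never returns "c", even though the total number of reads still equals the number of documents. So a matching total would not rule out drift, and a mismatching one would need inserts or deletes in the picture as well.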

