1 Dec 2011 01:34
Re: Re: Mongo-hadoop: map input records is wrong.
Artem,
I haven't seen behavior like this on my end, but it is certainly possible you have found a bug.
A few questions that can help narrow it down:
- What distribution and version of Hadoop are you using?
- Is your MongoDB sharded or unsharded?
Because you are using a query (mongo.input.query in the settings you posted), I suspect that may be the culprit: queries aren't snapshotted, so the contents of the cursor can shift as you iterate through it. The query itself can of course also account for the discrepancy in numbers, since only the documents it matches are read as map input, not the whole collection.
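To illustrate the two effects described above, here is a minimal pure-Python sketch. It is not mongo-hadoop code and does not talk to MongoDB or Hadoop: the slot-based "collection", the `scan` helper, the `relocate_doc_zero` writer and the sample ids are all hypothetical. It models a scan without a snapshot, where a document that is relocated while the scan is in progress can be returned twice, and shows that a filter shaped like the `$in` query lowers the count relative to the whole collection.

```python
# Toy model only: a "collection" is a list of storage slots holding documents.
# All names here are illustrative assumptions, not MongoDB or mongo-hadoop APIs.

def scan(slots, matches, on_step=None):
    """Walk the slots in order, like a cursor that is not snapshotted.

    on_step(i, slots) may mutate the slots mid-scan, standing in for a
    concurrent writer.
    """
    seen = []
    i = 0
    while i < len(slots):  # length is re-read, so appended slots are visited
        if on_step is not None:
            on_step(i, slots)
        doc = slots[i]
        if doc is not None and matches(doc):
            seen.append(doc["_id"])
        i += 1
    return seen


def make_collection():
    users = ["u1", "u2", "u3"]
    return [{"_id": n, "user_id": users[n % 3]} for n in range(9)]


def in_query(doc):
    # Stand-in for a filter like {"user_id": {"$in": [...]}}
    return doc["user_id"] in {"u1", "u2"}


def relocate_doc_zero(i, slots):
    # When the scan reaches slot 5, a writer moves document 0 to the end of
    # storage; its old slot is left empty.
    if i == 5 and slots[0] is not None:
        slots.append(slots[0])
        slots[0] = None


total = len(make_collection())
stable = scan(make_collection(), in_query)
shifting = scan(make_collection(), in_query, relocate_doc_zero)

print("documents in collection:", total)
print("records seen with query, no writes:", len(stable))
print("records seen with query, doc moved mid-scan:", len(shifting))
```

With these toy inputs the three counts are 9, 6 and 7: the filter alone explains why the scan sees fewer records than the collection holds, and the mid-scan move makes document 0 show up a second time. Whether either effect is what happens in the actual job is something only the real data can confirm.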
On Wed, Nov 30, 2011 at 9:10 PM, yankov <artem.yankov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
Oh, I've just noticed a response.
I can't see the number of docs processed in the job output. Is there a
special place I can check it?
Also, I haven't changed any special settings for the adapter.
Here's the job configuration:
fs.s3n.impl org.apache.hadoop.fs.s3native.NativeS3FileSystem
mapred.task.cache.levels 2
hadoop.tmp.dir /mnt/hadoop
hadoop.native.lib true
map.sort.class org.apache.hadoop.util.QuickSort
dfs.namenode.decommission.nodes.per.interval 5
dfs.https.need.client.auth false
mongo.job.reducer com.mongodb.hadoop.examples.leaderboard.LeaderboardReducer
ipc.client.idlethreshold 4000
dfs.datanode.data.dir.perm 755
mapred.system.dir ${hadoop.tmp.dir}/mapred/system
mapred.job.tracker.persist.jobstatus.hours 0
dfs.datanode.address 0.0.0.0:50010
dfs.namenode.logging.level info
dfs.block.access.token.enable false
io.skip.checksum.errors false
mongo.output.uri mongodb://host/db/collection
fs.default.name hdfs://x.x.x.x:50001
mapred.cluster.reduce.memory.mb -1
mapred.reducer.new-api true
mapred.child.tmp ./tmp
fs.har.impl.disable.cache true
dfs.safemode.threshold.pct 0.999f
mapred.skip.reduce.max.skip.groups 0
dfs.namenode.handler.count 10
dfs.blockreport.initialDelay 0
mapred.heartbeats.in.second 100
mapred.tasktracker.dns.nameserver default
io.sort.factor 10
mapred.task.timeout 4200000
mapred.max.tracker.failures 4
hadoop.rpc.socket.factory.class.default org.apache.hadoop.net.StandardSocketFactory
mapred.job.tracker.jobhistory.lru.cache.size 5
fs.hdfs.impl org.apache.hadoop.hdfs.DistributedFileSystem
mapred.queue.default.acl-administer-jobs *
mapred.output.key.class com.mongodb.hadoop.io.BSONWritable
dfs.block.access.key.update.interval 600
mapred.skip.map.auto.incr.proc.count true
mongo.input.skip 0
mapreduce.job.complete.cancel.delegation.tokens true
io.mapfile.bloom.size 1048576
mapreduce.reduce.shuffle.connect.timeout 180000
dfs.safemode.extension 30000
mapred.jobtracker.blacklist.fault-timeout-window 180
tasktracker.http.threads 80
mapred.job.shuffle.merge.percent 0.66
mapreduce.inputformat.class com.mongodb.hadoop.MongoInputFormat
fs.ftp.impl org.apache.hadoop.fs.ftp.FTPFileSystem
user.name root
mapred.output.compress true
io.bytes.per.checksum 512
mapred.healthChecker.script.timeout 600000
mongo.job.input.format com.mongodb.hadoop.MongoInputFormat
topology.node.switch.mapping.impl org.apache.hadoop.net.ScriptBasedMapping
dfs.https.server.keystore.resource ssl-server.xml
mapred.reduce.slowstart.completed.maps 0.05
mongo.input.split.read_from_shards false
mapred.reduce.max.attempts 4
fs.ramfs.impl org.apache.hadoop.fs.InMemoryFileSystem
dfs.block.access.token.lifetime 600
dfs.name.edits.dir ${dfs.name.dir}
mapred.skip.map.max.skip.records 0
mapred.cluster.map.memory.mb -1
hadoop.security.group.mapping org.apache.hadoop.security.ShellBasedUnixGroupsMapping
mongo.job.output.format com.mongodb.hadoop.MongoOutputFormat
mapred.job.tracker.persist.jobstatus.dir /jobtracker/jobsInfo
mapred.jar hdfs://x.x.x.x:50001/mnt/hadoop/mapred/staging/root/.staging/job_201111282214_20169/job.jar
dfs.block.size 67108864
fs.s3.buffer.dir ${hadoop.tmp.dir}/s3
job.end.retry.attempts 0
fs.file.impl org.apache.hadoop.fs.LocalFileSystem
mapred.local.dir.minspacestart 0
mapred.output.compression.type BLOCK
dfs.datanode.ipc.address 0.0.0.0:50020
dfs.permissions true
topology.script.number.args 100
io.mapfile.bloom.error.rate 0.005
mapred.cluster.max.reduce.memory.mb -1
mapred.max.tracker.blacklists 4
mapred.task.profile.maps 0-2
mongo.input.limit 0
dfs.datanode.https.address 0.0.0.0:50475
mapred.userlog.retain.hours 24
dfs.secondary.http.address 0.0.0.0:50090
dfs.replication.max 512
mapred.job.tracker.persist.jobstatus.active false
hadoop.security.authorization false
local.cache.size 10737418240
dfs.namenode.delegation.token.renew-interval 86400000
mapred.min.split.size 0
mapred.map.tasks 1
mapred.child.java.opts -Xmx200m
mapreduce.job.counters.limit 120
mapred.output.value.class org.apache.hadoop.io.DoubleWritable
dfs.https.client.keystore.resource ssl-client.xml
mapred.job.queue.name default
mongo.job.name samsung#4ddd98a8c47eed304e0002ff
dfs.https.address 0.0.0.0:50470
mapred.job.tracker.retiredjobs.cache.size 1000
dfs.balance.bandwidthPerSec 1048576
ipc.server.listen.queue.size 128
mapred.inmem.merge.threshold 1000
job.end.retry.interval 30000
mapred.skip.attempts.to.start.skipping 2
fs.checkpoint.dir ${hadoop.tmp.dir}/dfs/namesecondary
mapred.reduce.tasks 1
mongo.job.output.value org.apache.hadoop.io.DoubleWritable
mapred.merge.recordsBeforeProgress 10000
mapred.userlog.limit.kb 0
mapred.job.reduce.memory.mb -1
dfs.max.objects 0
webinterface.private.actions false
io.sort.spill.percent 0.80
mapred.job.shuffle.input.buffer.percent 0.70
mongo.job.background false
mapred.job.name samsung#4ddd98a8c47eed304e0002ff
dfs.datanode.dns.nameserver default
mapred.map.tasks.speculative.execution true
hadoop.util.hash.type murmur
dfs.blockreport.intervalMsec 3600000
mapred.map.max.attempts 4
mapreduce.job.acl-view-job
dfs.client.block.write.retries 3
mapred.job.tracker.handler.count 10
mapreduce.reduce.shuffle.read.timeout 180000
mapred.tasktracker.expiry.interval 600000
dfs.https.enable false
mapred.jobtracker.maxtasks.per.job -1
mapred.jobtracker.job.history.block.size 3145728
keep.failed.task.files false
mapreduce.outputformat.class com.mongodb.hadoop.MongoOutputFormat
dfs.datanode.failed.volumes.tolerated 0
ipc.client.tcpnodelay false
mapred.task.profile.reduces 0-2
mapred.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec
io.map.index.skip 0
mapred.working.dir hdfs://x.x.x.x:50001/user/root
ipc.server.tcpnodelay false
mapred.jobtracker.blacklist.fault-bucket-width 15
dfs.namenode.delegation.key.update-interval 86400000
mapred.used.genericoptionsparser true
mapred.mapper.new-api true
mapred.job.map.memory.mb -1
dfs.default.chunk.view.size 32768
hadoop.logfile.size 10000000
mapred.reduce.tasks.speculative.execution true
mapreduce.job.dir hdfs://x.x.x.x:50001/mnt/hadoop/mapred/staging/root/.staging/job_201111282214_20169
mapreduce.tasktracker.outofband.heartbeat false
mapreduce.reduce.input.limit -1
mongo.job.mapper.output.value org.apache.hadoop.io.DoubleWritable
dfs.datanode.du.reserved 0
mongo.input.split.read_shard_chunks false
hadoop.security.authentication simple
fs.checkpoint.period 3600
dfs.web.ugi webuser,webgroup
mapred.job.reuse.jvm.num.tasks 1
mapred.jobtracker.completeuserjobs.maximum 100
dfs.df.interval 60000
dfs.data.dir ${hadoop.tmp.dir}/dfs/data
mapred.task.tracker.task-controller org.apache.hadoop.mapred.DefaultTaskController
mongo.job.verbose true
fs.s3.maxRetries 4
dfs.datanode.dns.interface default
mapred.cluster.max.map.memory.mb -1
mapred.map.child.java.opts -Xmx4000m
dfs.support.append false
mapreduce.job.acl-modify-job
dfs.permissions.supergroup supergroup
mapred.local.dir ${hadoop.tmp.dir}/mapred/local
fs.hftp.impl org.apache.hadoop.hdfs.HftpFileSystem
fs.trash.interval 0
fs.s3.sleepTimeSeconds 10
dfs.replication.min 1
mapred.submit.replication 10
fs.har.impl org.apache.hadoop.fs.HarFileSystem
mapred.map.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec
mapred.tasktracker.dns.interface default
dfs.namenode.decommission.interval 30
dfs.http.address 0.0.0.0:50070
dfs.heartbeat.interval 3
mapred.job.tracker hdfs://x.x.x.x:50002
mapreduce.job.submithost x.x.x.x
io.seqfile.sorter.recordlimit 1000000
dfs.name.dir ${hadoop.tmp.dir}/dfs/name
mapred.line.input.format.linespermap 1
mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.JobQueueTaskScheduler
dfs.datanode.http.address 0.0.0.0:50075
mapred.local.dir.minspacekill 0
dfs.replication.interval 3
io.sort.record.percent 0.05
mapreduce.reduce.class com.mongodb.hadoop.examples.leaderboard.LeaderboardReducer
fs.kfs.impl org.apache.hadoop.fs.kfs.KosmosFileSystem
mapred.temp.dir ${hadoop.tmp.dir}/mapred/temp
mapred.tasktracker.reduce.tasks.maximum 3
dfs.replication 3
fs.checkpoint.edits.dir ${fs.checkpoint.dir}
mapred.tasktracker.tasks.sleeptime-before-sigkill 5000
mapred.job.reduce.input.buffer.percent 0.0
mongo.input.query {"user_id":{"$in":[{"$oid":"some_id_here"},{"$oid":"another_id"}]}}
mapred.tasktracker.indexcache.mb 10
mapreduce.job.split.metainfo.maxsize 10000000
hadoop.logfile.count 10
mapred.skip.reduce.auto.incr.proc.count true
mapreduce.job.submithostaddress 10.6.91.183
mongo.job.mapper com.mongodb.hadoop.examples.leaderboard.LeaderboardMapper
io.seqfile.compress.blocksize 1000000
fs.s3.block.size 67108864
mapred.tasktracker.taskmemorymanager.monitoring-interval 5000
mongo.job.output.key com.mongodb.hadoop.io.BSONWritable
mapred.queue.default.state RUNNING
mapred.acls.enabled false
mapreduce.jobtracker.staging.root.dir ${hadoop.tmp.dir}/mapred/staging
mapred.queue.names default
dfs.access.time.precision 3600000
fs.hsftp.impl org.apache.hadoop.hdfs.HsftpFileSystem
mapred.task.tracker.http.address 0.0.0.0:50060
mapreduce.combine.class com.mongodb.hadoop.examples.leaderboard.LeaderboardReducer
mapred.reduce.parallel.copies 5
io.seqfile.lazydecompress true
io.sort.mb 1000
ipc.client.connection.maxidletime 10000
mapred.compress.map.output false
hadoop.security.uid.cache.secs 14400
mapred.task.tracker.report.address 127.0.0.1:0
mongo.job.combiner com.mongodb.hadoop.examples.leaderboard.LeaderboardReducer
mapred.healthChecker.interval 60000
ipc.client.kill.max 10
ipc.client.connect.max.retries 10
mapreduce.map.class com.mongodb.hadoop.examples.leaderboard.LeaderboardMapper
fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem
mapred.user.jobconf.limit 5242880
mapred.job.tracker.http.address 0.0.0.0:50030
io.file.buffer.size 4096
mapred.jobtracker.restart.recover false
io.serializations org.apache.hadoop.io.serializer.WritableSerialization
dfs.datanode.handler.count 3
mapred.reduce.copy.backoff 300
mapred.task.profile false
dfs.replication.considerLoad true
jobclient.output.filter FAILED
dfs.namenode.delegation.token.max-lifetime 604800000
mapred.tasktracker.map.tasks.maximum 3
io.compression.codecs org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec
fs.checkpoint.size 67108864
mongo.input.uri mongodb://host/db/collection
On Nov 27, 10:54 pm, Eliot Horowitz <el...-Ot75HdpNzd8AvxtiuMwx3w@public.gmane.org> wrote:
> How is the adapter configured?
> What is the number of docs processed?
>
> On Tue, Nov 22, 2011 at 8:54 PM, yankov <artem.yan...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > I'm using mongo-hadoop to pull data from mongo and do mapreduce in
> > hadoop.
> > Looks like map input records is always slightly different comparing to
> > the real number of records in the collection.
>
> > Especially I can see it on a big collection >300k records when number
> > is totally wrong which leads to wrong calculations.
>
> > Any ideas why it can be happening?
>
> > --
> > You received this message because you are subscribed to the Google Groups "mongodb-user" group.
> > To post to this group, send email to mongodb-user-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
> > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com.
> > For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.