Mapper
Mapper maps input key/value pairs to a set of intermediate key/value pairs.
The Hadoop MapReduce framework spawns one map task for each InputSplit
generated by the InputFormat for the job.
- Overall, Mapper implementations are passed the Job for the job via the Job.setMapperClass(Class) method.
- The framework then calls map(WritableComparable, Writable, Context) for each key/value pair in the InputSplit for that task.
- Applications can then override the cleanup(Context) method to perform any required cleanup.
- Output pairs do not need to be of the same types as input pairs.
- A given input pair may map to zero or many output pairs.
- Output pairs are collected with calls to context.write(WritableComparable, Writable).
- Applications can use the Counter to report their statistics.
- All intermediate values associated with a given output key are subsequently grouped by the framework and passed to the Reducer(s) to determine the final output.
- Users can control the grouping by specifying a Comparator via Job.setGroupingComparatorClass(Class).
- The Mapper outputs are sorted and then partitioned per Reducer.
- The total number of partitions is the same as the number of reduce tasks for the job.
- Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
- Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps cut down the amount of data transferred from the Mapper to the Reducer.
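To make the map and partitioning contract above concrete, here is a minimal plain-Java sketch, with no Hadoop dependencies: the class name and the emit list standing in for Context.write are illustrative, but the key-routing formula is the same one used by Hadoop's default HashPartitioner.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapAndPartitionSketch {
    // Word-count style map step: one input line may emit zero or many
    // (word, 1) pairs. The returned list stands in for Context.write.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Same formula as Hadoop's default HashPartitioner: mask off the sign
    // bit, then take the key's hash modulo the number of reduce tasks.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (Map.Entry<String, Integer> pair : map("to be or not to be")) {
            System.out.println(pair.getKey() + "\t" + pair.getValue()
                    + "\t-> reducer " + partition(pair.getKey(), reducers));
        }
    }
}
```

Every pair with the same key lands on the same reducer, which is what makes the grouping step possible; a custom Partitioner replaces only the routing formula.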
The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format. Applications can control if, and how, the intermediate outputs are to be compressed and the CompressionCodec to be used via the Configuration.
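For example, map output compression can be switched on through job configuration properties; a sketch (property names as in current Hadoop releases, the codec choice is illustrative):

```xml
<!-- mapred-site.xml, or set per job on the Configuration -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```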
How Many Maps?
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very CPU-light map tasks.
- Task setup takes a while, so it is best if the maps take at least a minute to execute.
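As a back-of-the-envelope sketch of the one-map-per-block rule (pure Java; the class name and the 10 TB / 128 MB figures are illustrative):

```java
public class MapCountEstimate {
    // One map task per input split, and by default one split per block,
    // so the map count is the input size divided by the split size,
    // rounded up.
    static long estimateMaps(long totalInputBytes, long splitSizeBytes) {
        return (totalInputBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long tenTb = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB of input
        long blockSize = 128L * 1024 * 1024;          // 128 MB blocks
        System.out.println(estimateMaps(tenTb, blockSize)); // 81920
    }
}
```

So 10 TB of input at a 128 MB block size yields 81,920 map tasks, which is why very large jobs rarely need the per-node parallelism tuned upward.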