Reducer
Reducer reduces a set of intermediate values which share a key to a smaller set of values.
The number of reduces for the job is set by the user via Job.setNumReduceTasks(int).
Overall, Reducer implementations are passed the Job for the job via the Job.setReducerClass(Class) method and can override the setup(Context) method to initialize themselves. The framework then calls the reduce(WritableComparable, Iterable<Writable>, Context) method for each <key, (collection of values)> pair in the grouped inputs. Applications can then override the cleanup(Context) method to perform any required cleanup.
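The setup/reduce/cleanup calling pattern can be modeled in plain Java without a Hadoop dependency. The sketch below is illustrative, not the Hadoop API: ReduceModel and its methods are made-up names that show the framework calling a sum-style reduce function once per grouped key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceModel {
    // Simplified stand-in for reduce(WritableComparable, Iterable<Writable>, Context):
    // this reducer sums the values that share a key (word-count style).
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Models the framework side: group the map outputs by key (a TreeMap stands
    // in for the sort phase), then call reduce(...) once for each
    // <key, (collection of values)> pair in the grouped inputs.
    static Map<String, Integer> runReduce(List<Map.Entry<String, Integer>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapOutputs) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            out.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return out;
    }
}
```

In a real job the framework performs the grouping and supplies the values as an Iterable; the reducer only implements the per-key logic.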
Reducer has 3 primary phases: shuffle, sort and reduce.
Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
Sort
The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.
The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.
Secondary Sort
If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via
Job.setSortComparatorClass(Class). Since Job.setGroupingComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values.
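In driver code this typically looks like the fragment below. The two Job setter methods are the real API; FullKeyComparator and FirstFieldComparator are hypothetical comparator classes an application would supply.

```java
// Hypothetical secondary-sort configuration:
// FullKeyComparator orders by (natural key, secondary field);
// FirstFieldComparator groups records by the natural key only,
// so all values for one natural key reach a single reduce(...) call in sorted order.
job.setSortComparatorClass(FullKeyComparator.class);
job.setGroupingComparatorClass(FirstFieldComparator.class);
```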
Reduce
In this phase the reduce(WritableComparable, Iterable<Writable>, Context) method is called for each <key, (collection of values)> pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via Context.write(WritableComparable, Writable). Applications can use the Counter to report their statistics.
The output of the Reducer is not sorted.
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
- With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish.
- With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
- Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.
- The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative-tasks and failed tasks.
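As a worked example of the heuristic above (the node and container counts are made-up numbers, and ReduceCount is an illustrative helper, not a Hadoop class):

```java
public class ReduceCount {
    // Heuristic from the text: reduces = factor * (nodes * containersPerNode),
    // with factor 0.95 (one wave of reduces) or 1.75 (two waves, better load balancing).
    static int reduces(double factor, int nodes, int containersPerNode) {
        return (int) (factor * nodes * containersPerNode);
    }
}
```

With 10 nodes and 8 containers per node, the 0.95 factor gives 76 reduces (a single wave) and 1.75 gives 140 (faster nodes run a second wave); the result would be passed to Job.setNumReduceTasks(int).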
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by FileOutputFormat.setOutputPath(Job, Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
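In the driver, a map-only job is configured as in the fragment below; Job.setNumReduceTasks(int) and FileOutputFormat.setOutputPath(Job, Path) are the real methods, and the output path shown is purely illustrative.

```java
job.setNumReduceTasks(0);  // map-only job: map outputs go straight to the FileSystem, unsorted
FileOutputFormat.setOutputPath(job, new Path("/user/out"));  // illustrative output path
```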
Partitioner
Partitioner partitions the key space.
Partitioner controls the partitioning of the keys of the intermediate map-outputs.
- The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job.
- Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.
HashPartitioner is the default Partitioner.
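Hash partitioning can be sketched in plain Java as below. This is a standalone model of the usual hashCode-modulo approach (masking with Integer.MAX_VALUE keeps the result non-negative), not the Hadoop class itself; HashPartitionModel is an illustrative name.

```java
public class HashPartitionModel {
    // Derive the partition from the key's hash: mask off the sign bit so that
    // negative hashCodes still map into [0, numReduceTasks).
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the partition depends only on the key, every record with a given key lands in the same partition and therefore reaches the same reduce task.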
Counter
- Counter is a facility for MapReduce applications to report their statistics.
- Mapper and Reducer implementations can use the Counter to report statistics.
- Hadoop MapReduce comes bundled with a library of generally useful mappers, reducers, and partitioners.
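Inside a Mapper or Reducer method, incrementing a counter looks like the fragment below; the group and counter names ("MyApp", "MALFORMED") are arbitrary strings chosen by the application.

```java
// Inside map(...) or reduce(...): count records this task skipped.
context.getCounter("MyApp", "MALFORMED").increment(1);
```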