Job Configuration

Job is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution.

  • Job represents a MapReduce job configuration.
  • The framework tries to faithfully execute the job as described by Job; however:

    • Some configuration parameters may have been marked as final by administrators (see Final Parameters) and hence cannot be altered.
    • While some job parameters are straightforward to set (e.g. Job.setNumReduceTasks(int)), other parameters interact subtly with the rest of the framework and/or job configuration and are more complex to set (e.g. Configuration.set(JobContext.NUM_MAPS, int)).
  • Job is typically used to specify the Mapper, combiner (if any), Partitioner, Reducer, InputFormat, and OutputFormat implementations. FileInputFormat indicates the set of input files (FileInputFormat.setInputPaths(Job, Path…)/FileInputFormat.addInputPath(Job, Path) and FileInputFormat.setInputPaths(Job, String…)/FileInputFormat.addInputPaths(Job, String)), and FileOutputFormat indicates where the output files should be written (FileOutputFormat.setOutputPath(Job, Path)). The driver sketch after this list illustrates these calls.

  • Optionally, Job is used to specify other advanced facets of the job such as the Comparator to be used, files to be put in the DistributedCache, whether intermediate and/or job outputs are to be compressed (and how), whether job tasks can be executed in a speculative manner (setMapSpeculativeExecution(boolean)/setReduceSpeculativeExecution(boolean)), the maximum number of attempts per task (setMaxMapAttempts(int)/setMaxReduceAttempts(int)), etc.

  • Of course, users can use Configuration.set(String, String)/Configuration.get(String) to set/get arbitrary parameters needed by their applications. However, use the DistributedCache for large amounts of (read-only) data; see the second sketch below.
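Below is a minimal driver sketch tying these calls together. It follows the shape of the classic WordCount example; the class names, the parameter values, and the configuration key wordcount.case.sensitive are illustrative assumptions, not fixed by this page:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

  // Minimal Mapper: emit (token, 1) for every whitespace-separated token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Minimal Reducer, also usable as the combiner: sum the counts per token.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Arbitrary application-level parameter via Configuration.set(String, String);
    // the key is hypothetical.
    conf.set("wordcount.case.sensitive", "false");

    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // Core implementations: Mapper, combiner, Reducer.
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Straightforward parameters: reducer count, retries, speculation.
    job.setNumReduceTasks(2);
    job.setMaxMapAttempts(4);
    job.setMaxReduceAttempts(4);
    job.setMapSpeculativeExecution(false);
    job.setReduceSpeculativeExecution(false);

    // Compress the final job output with gzip.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

    // Input files and output directory.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```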

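And a minimal sketch of the DistributedCache pattern mentioned above, assuming a YARN-era (Hadoop 2.x+) cluster where a URI fragment becomes a symlink in the task working directory; the path /apps/wordcount/stopwords.txt and the class names are hypothetical. The driver ships a read-only file with the job, and each task loads its localized copy once in Mapper.setup():

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class StopWordExample {

  // Driver side: ship a read-only HDFS file with the job. The "#stopwords"
  // fragment makes the localized copy visible as a symlink named "stopwords"
  // in each task's working directory. (The path is hypothetical.)
  public static void configure(Job job) throws Exception {
    job.addCacheFile(new URI("/apps/wordcount/stopwords.txt#stopwords"));
  }

  // Task side: read the cached file once per task in setup(), not per record.
  public static class StopWordMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Set<String> stopWords = new HashSet<>();
    private final Text word = new Text();

    @Override
    protected void setup(Context context) throws IOException {
      try (BufferedReader in = new BufferedReader(new FileReader("stopwords"))) {
        String line;
        while ((line = in.readLine()) != null) {
          stopWords.add(line.trim());
        }
      }
    }

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty() && !stopWords.contains(token)) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }
}
```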