How to use Hadoop MultipleOutputs

Just like MultipleInputs, Hadoop also supports MultipleOutputs. With it, a single MapReduce job can write different data, in different formats, to separate outputs.

This feature is easy to use. As before, I will mainly use Java code to demonstrate it; hopefully the code explains itself 🙂

Note: I wrote and ran the following code using Hadoop 1.0.3, but it should work in 0.20.205 as well.

1. MultipleOutputs class

First of all, import the MultipleOutputs class:
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

2. Introducing MultipleOutputs.addNamedOutput

There are 5 parameters for this method:

Job job
         pass in the Hadoop Job you created
String namedOutput
         a unique name for this output; the output files for
         it will be named namedOutput-r-XXXXX
Class<? extends OutputFormat> outputFormatClass
         if you have a custom output format, pass it in;
         for plain text output, use Hadoop's
         TextOutputFormat.class
Class<?> keyClass
         the class of the output key; if you don't output a key,
         use NullWritable.class
Class<?> valueClass
         the class of the output value; if you have a custom
         value class, use it here; if the value is text,
         use Text.class
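
The file naming mentioned under namedOutput can be modelled in plain Java. This is only a sketch with no Hadoop dependency: the class and method names (PartFileNames, partFile) are made up for illustration, and it assumes the usual layout of output name, task type (m for map, r for reduce) and a five-digit partition number.

```java
// Illustrative model of how named-output part files are named.
// PartFileNames and partFile are hypothetical names, not Hadoop APIs.
final class PartFileNames {

    // Builds "<namedOutput>-<m|r>-<5-digit partition>", e.g. fruit-r-00000.
    static String partFile(String namedOutput, char taskType, int partition) {
        return String.format("%s-%c-%05d", namedOutput, taskType, partition);
    }

    public static void main(String[] args) {
        System.out.println(partFile("fruit", 'r', 0));
        System.out.println(partFile("color", 'r', 12));
    }
}
```

Each reduce task has its own partition number, so a job with several reducers produces several files per named output.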

3. Code

What I want to do here is split the columns of a given input so that different columns go to different outputs.

Sample Data:

1 APPLE RED
2 ORANGE BLACK
3 BANANA GREEN

Here I want to separate the fruit column and the color column.
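
To make the goal concrete, here is a small plain-Java sketch of the split itself, outside of Hadoop. It assumes each record is tab-separated (id, fruit, color), which matches the split("\t") call in the reducer; the class name ColumnSplitDemo and its column method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone model of the column split; no Hadoop involved.
final class ColumnSplitDemo {

    // Returns the column at the given index from each tab-separated record.
    static List<String> column(List<String> records, int index) {
        List<String> out = new ArrayList<String>();
        for (String record : records) {
            out.add(record.split("\t")[index]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumption: the sample data uses tabs between columns
        List<String> records = Arrays.asList(
                "1\tAPPLE\tRED", "2\tORANGE\tBLACK", "3\tBANANA\tGREEN");
        System.out.println("fruit: " + column(records, 1));
        System.out.println("color: " + column(records, 2));
    }
}
```

In the real job, the two lists correspond to the two named outputs rather than to in-memory collections.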

3.1 Set up the driver for this MapReduce job:

public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
   Path inputDir = new Path(args[0]);
   Path outputDir = new Path(args[1]);

   Configuration conf = new Configuration();

   Job job = new Job(conf);
   job.setJarByClass(MultipleOutputsTest.class);
   job.setJobName("MultipleOutputs Test");

   job.setMapOutputKeyClass(Text.class);
   job.setMapOutputValueClass(Text.class);

   job.setMapperClass(myMapper.class);
   job.setReducerClass(myReducer.class);

   FileInputFormat.setInputPaths(job, inputDir);
   FileOutputFormat.setOutputPath(job, outputDir);

   // Register one named output per column; both write text values with no key
   MultipleOutputs.addNamedOutput(job, fruitOutputName, TextOutputFormat.class, NullWritable.class, Text.class);
   MultipleOutputs.addNamedOutput(job, colorOutputName, TextOutputFormat.class, NullWritable.class, Text.class);

   // Exit with a non-zero status if the job fails
   System.exit(job.waitForCompletion(true) ? 0 : 1);
}

fruitOutputName and colorOutputName are strings I defined; they are “fruit” and “color” respectively, so for the fruit output the file names will be fruit-r-000XX.

3.2 Reducer

The next important part is the reducer. For single output, we use context.write(KEY, VALUE), but here it’s different.

public static class myReducer extends Reducer<Text, Text, NullWritable, Text> {
    // Writes to the named outputs registered in the driver
    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    public void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            // Each value is a tab-separated record: id, fruit, color
            String[] items = value.toString().split("\t");

            mos.write(fruitOutputName, NullWritable.get(), new Text(items[1]));
            mos.write(colorOutputName, NullWritable.get(), new Text(items[2]));
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Closing flushes and closes every named output
        mos.close();
    }
}

Please pay attention to the setup and cleanup methods: you will get errors if you don’t initialize the MultipleOutputs object in setup or don’t close it in cleanup.
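
One way to see why the close in cleanup matters is a plain-Java analogy with a buffered writer. This is not Hadoop's implementation, only an assumed model of the same lifecycle (open once, write many records, close once); the class name CloseDemo and the method writeRecords are made up for the example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;

// Analogy only: a buffered writer keeps small writes in memory until it
// is flushed or closed, so skipping close() can leave the sink empty.
final class CloseDemo {

    // Writes the records and returns what has reached the underlying sink.
    static String writeRecords(String[] records, boolean close) {
        StringWriter sink = new StringWriter();
        BufferedWriter writer = new BufferedWriter(sink);   // like setup
        try {
            for (String record : records) {
                writer.write(record);
                writer.newLine();
            }
            if (close) {
                writer.close();                              // like cleanup
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        String[] fruit = {"APPLE", "ORANGE", "BANANA"};
        System.out.println("without close: " + writeRecords(fruit, false).length() + " chars");
        System.out.println("with close:    " + writeRecords(fruit, true).length() + " chars");
    }
}
```

Without the close, the few buffered records never reach the sink, which mirrors the empty or incomplete files you can see when mos.close() is skipped.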

3.3 Output

As expected, the output for the sample input will be:

fruit-r-00000:
APPLE
ORANGE
BANANA

color-r-00000:
RED
BLACK
GREEN

Here is the example code I used. Download

15 thoughts on “How to use Hadoop MultipleOutputs”

  1. Amit

    Nice post. But how do I write these files to separate directories, like fruit/fruit-r-00000 and color/color-r-00000?

    1. purplechun Post author

      In order to do that, you need to override a MultipleOutputs method. I can update the blog later with that information.

  2. Ajay

    To get color/color-r-00000, try:

    mos.write(colorOutputName, NullWritable.get(), new Text(items[2]), colorOutputName + "/" + colorOutputName);

  3. Sunil

    Can someone give me an example of how to use the outputs of 2 mapper functions in one reducer function? My task is to read data from two input files, each handled by a separate mapper function, and then use the results of both to come up with a solution. Any help would be appreciated.
    Thanks.

  4. Andrew

    Hi,
    thank you for this nice example, but I would like to ask about one thing. You wrote that I will get a file fruit-r-00000 that consists of 3 words (apple, orange, banana) and a second file color-r-00000 that also consists of 3 words (red, black, green). Unfortunately, in my case I get 3 files for fruit (fruit-r-00080, fruit-r-00081 and fruit-r-00082), and each of these files contains only one “fruit word”. Similarly, I get 3 files for color (color-r-00080, color-r-00081, color-r-00082), and again each file contains only one “color word”. Sure, at the end I can merge these files (using “hdfs dfs -getmerge /path/to/files/color* /path/to/destiny/path”, and likewise for fruit), but I would like to know where the problem might be. I use “Configuration conf = getConf();” instead of “Configuration conf = new Configuration();” in the driver, but the rest of the code is the same as you present, and I don’t think this difference causes it. The job finishes without problems, errors or warnings. I would really appreciate your help or any advice. Thanks… Best, Andrew

    1. purplechun Post author

      There are three files because 3 reducers are working on the job and each one writes its own output. If you only want 1 output, put this line in your driver: job.setNumReduceTasks(1);

  5. Naveen Kumar B V

    Hi,

    I have a map-only job and cannot control the number of mappers, as it depends on the number of input splits. Can you please let me know if there is any way to customize the name of the output file? I’m trying to:

    1. Generate just one output file from a mapper and
    2. Customize the name of the output file to remove -m-0000 completely.

    Thanks.
    Naveen Kumar B.V

  6. Abdullah Khan

    Hi, I want to calculate the frequency of each word in a text file and, at the same time, the total number of words. The frequencies are stored in the output path defined by FileOutputFormat.setOutputPath. Now I want to store the total number of words in another text file. How can I do that?
    Thanks.

  7. Rahul

    Can you please tell me how to generate two output files from a mapper: one in the format that I pass on to the reducer, and the other in the format that goes to the final output?

