Just like MultipleInputs, Hadoop also supports MultipleOutputs. With it, a single MapReduce job can write different data, in different formats, to separate outputs.
This feature is easy to use. As before, I will mainly use Java code to demonstrate it, and I hope the code explains itself 🙂
Note: I wrote and ran the following code with Hadoop 1.0.3, but it should also work with 0.20.205.
1. MultipleOutputs class
First of all, import the MultipleOutputs class:
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
2. The MultipleOutputs.addNamedOutput method
This method takes 5 parameters:
Job job: the Hadoop Job you created.
String namedOutput: a unique name for this output; its files will be named namedOutput-r-XXXXX.
Class<? extends OutputFormat> outputFormatClass: the output format. Pass your custom output format if you have one; for plain text output, use Hadoop's TextOutputFormat.class.
Class<?> keyClass: the class of the output key. If you don't output a key, use NullWritable.class.
Class<?> valueClass: the class of the output value. Use your custom value class if you have one; for text values, use Text.class.
3. Code
What I want to do here is split the columns of a given input so that different columns go to different outputs.
Sample Data:
1 APPLE RED
2 ORANGE BLACK
3 BANANA GREEN
Here I want to separate the fruit column and the color column.
3.1 Set up the driver for this MapReduce job:
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
    Path inputDir = new Path(args[0]);
    Path outputDir = new Path(args[1]);

    Configuration conf = new Configuration();
    Job job = new Job(conf);
    job.setJarByClass(MultipleOutputsTest.class);
    job.setJobName("MultipleOutputs Test");

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setMapperClass(myMapper.class);
    job.setReducerClass(myReducer.class);

    FileInputFormat.setInputPaths(job, inputDir);
    FileOutputFormat.setOutputPath(job, outputDir);

    MultipleOutputs.addNamedOutput(job, fruitOutputName, TextOutputFormat.class, NullWritable.class, Text.class);
    MultipleOutputs.addNamedOutput(job, colorOutputName, TextOutputFormat.class, NullWritable.class, Text.class);

    job.waitForCompletion(true);
}
fruitOutputName and colorOutputName are strings I defined as "fruit" and "color" respectively, so the files for the fruit output will be named fruit-r-000XX.
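The file-name pattern can be illustrated without a cluster. The sketch below is plain Java, not Hadoop code: the class NamedOutputFileName and its fileName method are hypothetical helpers that only mimic the name-r-NNNNN pattern described above, assuming the number is the reducer's partition index zero-padded to five digits.

```java
import java.util.Locale;

// Hypothetical helper (not part of Hadoop): mimics the "<name>-r-<NNNNN>" pattern.
final class NamedOutputFileName {

    // Builds the file name for a named output written by the given reducer partition.
    static String fileName(String namedOutput, int partition) {
        return String.format(Locale.ROOT, "%s-r-%05d", namedOutput, partition);
    }

    public static void main(String[] args) {
        System.out.println(fileName("fruit", 0));  // fruit-r-00000
        System.out.println(fileName("color", 12)); // color-r-00012
    }
}
```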
3.2 Reducer
The next important part is the reducer. For a single output, we use context.write(KEY, VALUE), but here it's different.
public static class myReducer extends Reducer<Text, Text, Text, Writable> {
    // Type parameters match the reducer's output key/value types.
    MultipleOutputs<Text, Writable> mos;

    @Override
    public void setup(Context context) {
        // Create the MultipleOutputs once per reduce task.
        mos = new MultipleOutputs<Text, Writable>(context);
    }

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            String str = value.toString();
            String[] items = str.split("\t");
            // Column 1 goes to the fruit output, column 2 to the color output.
            mos.write(fruitOutputName, NullWritable.get(), new Text(items[1]));
            mos.write(colorOutputName, NullWritable.get(), new Text(items[2]));
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Close the named outputs when the task finishes.
        mos.close();
    }
}
Please pay attention to the setup and cleanup functions: you will get errors if you don't initialize the MultipleOutputs object in setup or don't close it in cleanup.
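The per-record logic inside reduce can be tried on its own before running a job. This is a minimal sketch in plain Java, assuming the input columns are tab-separated as the split("\t") call implies. ColumnSplitDemo and column are names made up for this illustration, and the returned lists merely stand in for the two named outputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the reducer's column-splitting logic (no Hadoop involved).
final class ColumnSplitDemo {

    // Returns the given column (0-based) of each tab-separated record.
    static List<String> column(List<String> records, int index) {
        List<String> out = new ArrayList<String>();
        for (String record : records) {
            String[] items = record.split("\t");
            out.add(items[index]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumption: the sample data is tab-separated.
        List<String> records = Arrays.asList(
                "1\tAPPLE\tRED",
                "2\tORANGE\tBLACK",
                "3\tBANANA\tGREEN");
        System.out.println("fruit: " + column(records, 1));
        System.out.println("color: " + column(records, 2));
    }
}
```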
3.3 Output
As we expected, the output of the sample inputs will be:
fruit-r-00000:
APPLE
ORANGE
BANANA
color-r-00000:
RED
BLACK
GREEN
Here is the example code I used. Download
Nice post. But how do I write these files to separate directories, like fruit/fruit-r-00000 and color/color-r-0000?
In order to do that, you need to override a MultipleOutputs method. I can update the blog later with that information.
To get color/color-r-0000, try:
mos.write(colorOutputName, NullWritable.get(), new Text(items[2]), colorOutputName + "/" + colorOutputName);
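The extra fourth argument in the call above is a base output path, relative to the job's output directory. As a rough illustration in plain Java (not Hadoop code), the sketch below shows which relative file path that call should produce, assuming the framework appends the usual -r-NNNNN suffix to the base path. SubdirOutputPath, basePath and relativeFile are hypothetical names used only here.

```java
import java.util.Locale;

// Hypothetical helper (not part of Hadoop): shows the path a base output path leads to.
final class SubdirOutputPath {

    // Same expression as in the comment above: "<name>/<name>".
    static String basePath(String namedOutput) {
        return namedOutput + "/" + namedOutput;
    }

    // Assumption: the "-r-<5 digit partition>" suffix is appended to the base path.
    static String relativeFile(String namedOutput, int partition) {
        return String.format(Locale.ROOT, "%s-r-%05d", basePath(namedOutput), partition);
    }

    public static void main(String[] args) {
        System.out.println(relativeFile("color", 0)); // color/color-r-00000
        System.out.println(relativeFile("fruit", 0)); // fruit/fruit-r-00000
    }
}
```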
Can someone give me an example of how to use the outputs of two mapper functions in one reducer function? My task is to read data from two input files, each handled by a separate mapper function, and then use the results of both to come up with a solution. Any help would be appreciated.
Thanks.
I have a post about MultipleInputs: http://www.lichun.cc/blog/2012/05/hadoop-multipleinputs-usage/, which should solve your problem.
Hi,
Thank you for this nice example, but I would like to ask you about one more thing. You wrote that I will get a file fruit-r-00000 that contains 3 words (apple, orange, banana) and a second file color-r-00000 that also contains 3 words (red, black, green). Unfortunately, in my case I get 3 files for fruit (fruit-r-00080, fruit-r-00081 and fruit-r-00082), and each of them contains only one fruit word. Likewise, I get 3 files for color (color-r-00080, color-r-00081, color-r-00082), and again each file contains only one color word. Sure, at the end I can merge these files (using "hdfs dfs -getmerge /path/to/files/color* /path/to/destiny/path", and similarly for fruit), but I would like to know where the problem might be. I use "Configuration conf = getConf();" instead of "Configuration conf = new Configuration();" in the driver, but the rest of the code is the same as yours, and I don't think this difference causes it. The job finishes without problems, errors or warnings. I would really appreciate your help or any advice. Thanks. Best, Andrew
There are three files because 3 reducers are working on the job and each one writes its own output. If you only want 1 output, put this line in your driver: job.setNumReduceTasks(1);
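To see why the number of reducers determines the number of files: each key is assigned to one reducer partition, and every reducer that writes to a named output creates its own file. The sketch below mimics the formula of Hadoop's default hash partitioner, (hash & Integer.MAX_VALUE) % numReduceTasks, in plain Java. It uses String.hashCode() as a stand-in for the key's hash, which is an assumption (Hadoop Text keys hash differently), so the partition numbers will not match a real job. PartitionDemo and partition are hypothetical names.

```java
// Hypothetical illustration (not Hadoop code) of hash partitioning across reducers.
final class PartitionDemo {

    // Same shape as the default hash partitioner formula; String.hashCode() is a stand-in.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"1", "2", "3"};
        for (String key : keys) {
            // With 3 reducers the keys can spread over several partitions (several files);
            // with 1 reducer every key lands in partition 0 (a single file per named output).
            System.out.println("key " + key
                    + " -> 3 reducers: " + partition(key, 3)
                    + ", 1 reducer: " + partition(key, 1));
        }
    }
}
```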
Hi,
I have a map-only job and cannot control the number of mappers, as it depends on the number of input splits. Can you please let me know if there is any way to customize the name of the output file? I'm trying to
1. Generate just one output file from a mapper and
2. Customize the name of the output file to remove -m-0000 completely.
Thanks.
Naveen Kumar B.V
Thank you Chun,
Very nice explanation; moreover, the code demonstration is self-explanatory 🙂
Good to know! thanks 🙂
Hi, I want to calculate the frequency of the words in a text file and, at the same time, the total number of words. The frequencies are stored in the output path defined by FileOutputFormat.setOutputPath. Now I want to store the total number of words in another text file. How can I do that?
Thanks.
Can you please tell me how to generate two output files from the mapper: one in the format that I pass to the reducer, and the other in the format that goes to the final output?
Thanks great it solved
Great doc.
I'd like a further enhancement to this: I want to add a header to each type of file.
Hi,
Great document for beginners.
I want to know how to write an MRUnit test for this code.