If the input to my job is the fileset [a, b, c, d], is the input to the sort strictly [map(a.0), map(a.1), map(b.0), map(b.1), map(c.0), map(c.1), map(d.0), map(d.1)]?
My motivation: I have a series of files (which will of course be broken up into blocks) whose rows are [key, value], where key and value are each a simple string. I wish to concatenate the values per key in the reducer, in the order they appear in the input, even though there is no explicit order-defining field.
Any advice much appreciated; this is proving to be a difficult query to Google for.
Example
Input format
A First
A Another
A Third
B First
C First
C Another
Desired output
A First,Another,Third
B First
C First,Another
To reiterate, I'm uncertain if I can rely on getting First-Third in the correct order given files are being stored in separate blocks.
No, you have no guarantee that the values will be in that order using the standard data flow in Hadoop (i.e. the standard sorter, partitioner and grouper). The only thing that is guaranteed is the order of the keys (A, B, C).
To achieve what you want, you have to write your own sorter and include the values (First, Second, Third) in the key, so the new keys will be:
"A First"
"A Second"
The problem in this case is that these keys will end up in different partitions: it is very likely that the standard hash partitioner will send "A First" to one partition and "A Second" to another. To avoid this, you should also plug in your own partitioner that uses only the first part of the key (i.e. A) to do the partitioning.
You should also define the grouper; otherwise "A First" and "A Second" will not be passed together to the same reduce call.
So the output of your map function should be :
"A First" First
"A Second" Second
In other words, the values output by the mapper should be left as they are. Otherwise you won't be able to get the values in the reducer.
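The interplay of partitioner, sorter and grouper described above can be simulated locally. What follows is only a rough Python sketch, not Hadoop code: the names partition and shuffle are made up for illustration, CRC32 stands in for the real hash partitioner, and the composite key is modeled as a (natural key, value) tuple. The sort is a plain tuple sort, so values come out alphabetically here; a real job would plug in whatever comparator encodes the order it needs.

```python
import zlib
from itertools import groupby

def partition(text, num_partitions):
    # Deterministic stand-in for a hash partitioner (assumption: CRC32).
    return zlib.crc32(text.encode("utf-8")) % num_partitions

def shuffle(records, num_partitions, partition_on):
    """Simulate partition -> sort -> group for composite keys (natural, value)."""
    partitions = [[] for _ in range(num_partitions)]
    for composite_key in records:
        index = partition(partition_on(composite_key), num_partitions)
        partitions[index].append(composite_key)
    reduce_calls = []
    for part in partitions:
        part.sort()  # the "sorter": natural key first, then value
        # The "grouper": one reduce call per natural key within a partition.
        for natural, group in groupby(part, key=lambda k: k[0]):
            reduce_calls.append((natural, [value for _, value in group]))
    return reduce_calls

records = [("A", "First"), ("A", "Second"), ("A", "Third"), ("B", "First")]

# Partitioning on the natural key only: all "A" records reach one reduce call.
by_natural = shuffle(records, 4, partition_on=lambda k: k[0])

# Partitioning on the whole composite key: "A" records may be scattered
# over several partitions and therefore over several reduce calls.
by_composite = shuffle(records, 4, partition_on=lambda k: k[0] + " " + k[1])
```

Comparing by_natural with by_composite shows why the custom partitioner matters: only the first variant is guaranteed to hand every value for A to a single reduce call.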
One solution to this issue is to make use of TextInputFormat's byte offset in the file as part of a composite key, and use a secondary sort to make sure the values are sent to the reducer in order. That way the reducer sees input partitioned by the key you want, in the order it appeared in the file. If you have multiple input files, this approach will not work, because the byte counter resets with each new file.
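To make the byte-offset idea concrete, here is a small local simulation in Python. It is a sketch under assumptions, not Hadoop code: read_with_offsets and concatenate_in_input_order are hypothetical helpers, rows are assumed to be "key value" separated by a single space as in the question's example, and everything runs over one in-memory file.

```python
from itertools import groupby

def read_with_offsets(data):
    """Yield (byte_offset, line) pairs, similar in spirit to TextInputFormat keys."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)

def concatenate_in_input_order(data):
    # Map: build records (key, offset, value); key and offset form the composite key.
    mapped = []
    for offset, line in read_with_offsets(data):
        key, value = line.split(" ", 1)
        mapped.append((key, offset, value))
    # Secondary sort: by key first, then numerically by byte offset.
    mapped.sort(key=lambda rec: (rec[0], rec[1]))
    # Reduce: group on the natural key only and join the values.
    return [
        key + " " + ",".join(value for _, _, value in group)
        for key, group in groupby(mapped, key=lambda rec: rec[0])
    ]

sample = b"A First\nA Another\nA Third\nB First\nC First\nC Another\n"
result = concatenate_in_input_order(sample)
# result == ["A First,Another,Third", "B First", "C First,Another"]
```

Because the offset is compared as a number, the values for each key come back in file order even though nothing in the rows themselves defines an order.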
With the streaming API you'll need to pass the following to the job so that you actually get the byte offsets as the key:
-inputformat TextInputFormat -D stream.map.input.ignoreKey=false
By default the PipeMapper won't give you keys when the input format is TextInputFormat, even if you explicitly set the TextInputFormat flag, so you need to set the additional ignoreKey flag.
If your mapper emits a key made up of multiple fields, be sure to set the following flags so that the output is partitioned on the first key field and sorted on the first and second fields before it reaches the reducer:
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-D stream.num.map.output.key.fields=2
-D mapred.text.key.partitioner.options="-k1,1"
-D mapred.output.key.comparator.class="org.apache.hadoop.mapred.lib.KeyFieldBasedComparator"
-D mapreduce.partition.keycomparator.options="-k1 -k2n"
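For illustration, a streaming mapper and reducer matching these flags might look roughly like the Python sketch below. Several things are assumptions rather than facts from this answer: that the mapper receives each record as "offset<TAB>original line" once ignoreKey is false, that rows are "key value" separated by a space, and the function names map_lines and reduce_lines. The sorted() call merely imitates what the framework's shuffle would do.

```python
def map_lines(lines):
    """Mapper: turn 'offset<TAB>key value' into 'key<TAB>offset<TAB>value'."""
    for line in lines:
        offset, _, row = line.rstrip("\n").partition("\t")
        key, _, value = row.partition(" ")
        yield f"{key}\t{offset}\t{value}"

def reduce_lines(lines):
    """Reducer: join values of consecutive lines that share the first field."""
    current_key, values = None, []
    for line in lines:
        key, _offset, value = line.rstrip("\n").split("\t", 2)
        if key != current_key:
            if current_key is not None:
                yield f"{current_key}\t{','.join(values)}"
            current_key, values = key, []
        values.append(value)
    if current_key is not None:
        yield f"{current_key}\t{','.join(values)}"

# Assumed mapper input: byte offset, a tab, then the original row.
mapper_input = ["0\tA First\n", "8\tA Another\n", "18\tA Third\n", "26\tB First\n"]
mapped = list(map_lines(mapper_input))

# Stand-in for the shuffle: sort on field 1, then numerically on field 2.
shuffled = sorted(mapped, key=lambda l: (l.split("\t")[0], int(l.split("\t")[1])))
reduced = list(reduce_lines(shuffled))
# reduced == ["A\tFirst,Another,Third", "B\tFirst"]
```

In an actual streaming job each function would be fed sys.stdin and its results printed to standard output; the reducer relies on all lines for one key arriving consecutively, which is what partitioning on the first field provides.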