Hadoop sort input order - Stack Overflow

If the input to my job is the fileset [a, b, c, d], is the input to the sort strictly [map(a.0), map(a.1), map(b.0), map(b.1), map(c.0), map(c.1), map(d.0), map(d.1)]?

My motivation is that I have a series of files (which will of course be broken up into blocks) whose rows are [key, value], where key and value are each a simple string. I wish to concatenate these values together in the reducer, per key, in the order in which they appear in the input, even though there is no explicit order-defining field.

Any advice much appreciated; this is proving to be a difficult query to Google for.

Example

Input format

A First
A Another
A Third
B First
C First
C Another

Desired output

A First,Another,Third
B First
C First,Another
To reiterate, I'm uncertain whether I can rely on getting First through Third in the correct order, given that the files are stored in separate blocks.
No. With the standard data flow in Hadoop (i.e. the standard sorter, partitioner, and grouper) you have no guarantee that the values will arrive in that order. The only thing that is guaranteed is the order of the keys (A, B, C).
To achieve what you want, you have to write your own sorter and include the values (First, Second, Third) in the key, so the new keys will be:
  "A First"
  "A Second"
The problem in this case is that these keys can end up in different partitions (the standard hash partitioner will very likely send "A First" to one partition and "A Second" to another). To avoid this, you should also plug in your own partitioner that uses only the first part of the key (i.e. A) to do the partitioning.
You should also define the grouper; otherwise "A First" and "A Second" will not be passed together to the same reduce call.
So the output of your map function should be:
 "A First"    First
 "A Second"   Second
In other words, the values output by the mapper should be left as they are. Otherwise you won't be able to get the values in the reducer.
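To make the three roles concrete, here is a minimal pure-Python simulation of the shuffle described above. It is not Hadoop code: `partition`, `sort_key`, `shuffle`, and the `RANK` table are hypothetical names invented for this sketch, and `RANK` merely stands in for whatever ordering a custom sorter would encode. It partitions on the first part of the composite key, sorts on the whole composite key, and groups on the first part again.

```python
from itertools import groupby

# Hypothetical ordering for the value part of the key; stands in for the
# logic a custom sorter would implement.
RANK = {"First": 0, "Second": 1, "Third": 2}


def partition(composite_key, num_partitions):
    """Partitioner: look only at the natural key (the first part)."""
    natural, _ = composite_key
    # Deterministic stand-in for a hash of the natural key.
    return sum(natural.encode()) % num_partitions


def sort_key(composite_key):
    """Sorter: order by natural key first, then by the value's rank."""
    natural, value = composite_key
    return (natural, RANK[value])


def shuffle(map_output, num_partitions=2):
    """Simulate partition -> sort -> group -> reduce for (key, value) pairs."""
    partitions = [[] for _ in range(num_partitions)]
    for composite_key, value in map_output:
        index = partition(composite_key, num_partitions)
        partitions[index].append((composite_key, value))

    results = {}
    for records in partitions:
        records.sort(key=lambda kv: sort_key(kv[0]))
        # Grouper: one reduce call per natural key, not per composite key.
        for natural, group in groupby(records, key=lambda kv: kv[0][0]):
            results[natural] = ",".join(value for _, value in group)
    return results


# Mapper output arrives in arbitrary order; values are left as they are.
map_output = [
    (("A", "Third"), "Third"),
    (("B", "First"), "First"),
    (("A", "First"), "First"),
    (("A", "Second"), "Second"),
]
result = shuffle(map_output)
print(result)
```

If `partition` looked at the whole composite key instead, "A First" and "A Second" could land in different partitions and would never meet in one reduce call, which is the failure mode the answer warns about.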
You mention the standard sorter/partitioner/grouper. Is there an alternative built-in configuration of these that could allow this ordering to happen?
– tarnfeld
Jul 13, 2012 at 10:24
The order of the input values won't necessarily be lexicographical (I'll clarify in my example). I'd think about making the keys a tuple of (base_key, #), but I'm not sure how to compute the # if I'm unable to rely on the order of the input to the sort (which will only consider the key).
– icio
Jul 13, 2012 at 10:29
@tarnfeld: in his particular example, yes, the standard sorter would do the job, but that wouldn't be a general solution. A counterexample is the one you've given. So you need a special sorter.
– Razvan
Jul 13, 2012 at 10:32
One solution to this issue is to use TextInputFormat's byte offset in the file as part of a composite key, and to use a secondary sort to make sure the values are sent to the reducer in order. That way the reducer sees input partitioned by the key you want, in the order it appeared in the file. If you have multiple input files, this approach will not work, as each new file resets the byte counter.
With the streaming API you'll need to pass -inputformat TextInputFormat -D stream.map.input.ignoreKey=false to the job so that you actually get the byte offsets as the key. (By default the PipeMapper won't give you keys if the input format is TextInputFormat, even if you explicitly set the TextInputFormat flag, so you need to set the additional ignoreKey flag.)
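A streaming mapper for this approach might look like the sketch below. It assumes that, with ignoreKey set to false, each line on standard input has the form byte offset, tab, original row, and that the row separates key and value with a single space, as in the question's example. The function name `map_lines` is made up for illustration; a real script would feed it `sys.stdin` and print each result.

```python
def map_lines(lines):
    """Turn 'offset<TAB>key value' lines into 'key<TAB>offset<TAB>value'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # The byte offset supplied by TextInputFormat comes first.
        offset, _, row = line.partition("\t")
        # Assumption: the row is 'key value', split on the first space.
        key, _, value = row.partition(" ")
        # Emit the natural key and the offset as the two key fields.
        yield f"{key}\t{offset}\t{value}"


# "A First\n" is 8 bytes long, so the second row starts at offset 8.
sample = ["0\tA First\n", "8\tA Another\n"]
mapped = list(map_lines(sample))
for record in mapped:
    print(record)
```

Emitting the offset as its own tab-separated field is what lets the job treat it as the second key field.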
If you're emitting multiple key fields from a mapper, be sure to set the following flags so that your output is partitioned on the first key field and sorted on the first and second before it reaches the reducer:
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-D stream.num.map.output.key.fields=2
-D mapred.text.key.partitioner.options="-k1,1"
-D mapred.output.key.comparator.class="org.apache.hadoop.mapred.lib.KeyFieldBasedComparator"
-D mapreduce.partition.keycomparator.options="-k1 -k2n"
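With those flags in place, the reducer should receive lines already sorted by key and then numerically by offset, so it only has to concatenate. The sketch below assumes tab-separated input of the form key, offset, value; `reduce_lines` is a hypothetical name, and the sample input is hand-built from the question's example rather than taken from a real job.

```python
from itertools import groupby


def reduce_lines(lines):
    """Concatenate values per key, trusting the incoming sort order."""
    rows = (line.rstrip("\n").split("\t", 2) for line in lines if line.strip())
    # Consecutive rows with the same first field belong to the same key.
    for key, group in groupby(rows, key=lambda row: row[0]):
        values = ",".join(value for _, _, value in group)
        yield f"{key}\t{values}"


# Assumed reducer input: sorted on field 1, then numerically on field 2.
sorted_input = [
    "A\t0\tFirst\n",
    "A\t8\tAnother\n",
    "A\t18\tThird\n",
    "B\t26\tFirst\n",
    "C\t34\tFirst\n",
    "C\t42\tAnother\n",
]
output = list(reduce_lines(sorted_input))
for record in output:
    print(record)
```

The numeric flag on the second field matters here: compared as strings, offset "18" would sort before "8", and Third would come out ahead of Another.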
Is there a typo in your example? Should it be mapreduce.partition.key.comparator.options="-k1 -k2n"?
– Nick Gerner
Sep 27, 2013 at 19:59
@Tom Hennigan I tried running it as you proposed, but it returns an error of illegal option -k3nr. This is my config: -D mapreduce.partition.keycomparator.options='-k1,2 -k3nr -k4nr'. When I run with double quotes (") like you did, the sorting just doesn't work. Do you have any idea why?
– refaelos
Dec 21, 2016 at 19:12