Hadoop-Streaming (Using Python) Takeaways

There are a couple of resources online (http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/) that introduce how to write Python MapReduce code using the Hadoop Streaming API. The key idea is to pass data between the map and reduce stages through stdin and stdout. The mapper transforms the input data into key-value strings (separated by '\t'). Hadoop Streaming then reads the mapper output, sorts it, and distributes it to reducer nodes based on the keys. The reducer takes the sorted mapper output and aggregates the results by key.

The simplest example is word count. Suppose we have a text file called “input” containing the line below, and we want to count the word frequencies –
I am a student you are a doctor he is a teacher we are good friends now hadoop hive pig yarn spark scala Capital one facebook amex google amazon linkedin

The mapper code creates the key-value mapping. Here the mapping can be as simple as “word\t1” (the word, a tab, and the count 1).

import sys

# Read lines from stdin and emit "word\t1" for every word.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print word + '\t' + str(1)
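
Before wiring the mapper into Hadoop, you can sanity-check it on its own by piping some text into it. This is just a minimal local test; it assumes mapper.py is in the current directory and that “python” points to a Python 2 interpreter:

echo "I am a student" | python mapper.py
# expected output (tab-separated):
# I	1
# am	1
# a	1
# student	1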

The reducer code goes through each key-value pair and sums up the word frequencies.

import sys

curr_word = None
curr_count = 0

# Input arrives sorted by key, so identical words are adjacent.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.split('\t', 1)
    if word == curr_word:
        curr_count += int(count)
    else:
        # A new word starts: emit the previous word's total first.
        if curr_word is not None:
            print curr_word + '\t' + str(curr_count)
        curr_word = word
        curr_count = int(count)

# Emit the last word.
if curr_word is not None:
    print curr_word + '\t' + str(curr_count)
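
Because Hadoop Streaming only feeds sorted mapper output into the reducer, the whole job can be simulated locally with a Unix pipe before submitting it to the cluster. This sketch assumes the file “input”, mapper.py, and reducer.py all sit in the current directory:

cat input | python mapper.py | sort -k1,1 | python reducer.py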

The Hadoop Streaming command line is:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.10.0.jar \
-file mapper.py \
-mapper "python mapper.py" \
-file reducer.py \
-reducer "python reducer.py" \
-input /qle/input \
-output /qle/output
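
Once the job finishes, the results land in the -output directory as part files. You can inspect them with hadoop fs; the exact part file name (e.g. part-00000) depends on the number of reducers:

hadoop fs -ls /qle/output
hadoop fs -cat /qle/output/part-00000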

Here are some takeaways to avoid common errors:
1) You need to locate the Hadoop Streaming jar file; its path depends on your Hadoop distribution (you can google it).
2) The input file needs to be uploaded to HDFS first (using hadoop fs -put local_path hdfs_path); see the sketch after this list.
3) We don’t need to upload mapper.py and reducer.py to HDFS, as long as we pass them with the -file arguments.
4) If we do upload mapper.py and reducer.py to HDFS, we don’t need the -file arguments.
5) mapper.py and reducer.py should both be executable (using chmod +x).
6) If you choose to use a shebang (#!/usr/bin/env python), you need to make sure “/usr/bin/env” can find a Python interpreter on the Hadoop nodes (not just locally). Based on my experience, in both Cloudera and MapR that is not the case, and you will get an error like code=2, No such file or directory. With a working shebang, you can simply write: -mapper mapper.py \ -reducer reducer.py \
7) If you don’t want to waste time finding the interpreter on the Hadoop nodes, you can delete the shebang and just use -mapper "python mapper.py" \ -reducer "python reducer.py" \. That works fine too.
8) You need to remove the output directory in HDFS before running another job with the same output directory name.
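
To make items 2), 5) and 8) concrete, here is a minimal sequence of commands I would run before submitting the job; the local and HDFS paths are just examples matching the command line above:

# make the scripts executable locally
chmod +x mapper.py reducer.py

# upload the input file to HDFS
hadoop fs -mkdir -p /qle/input
hadoop fs -put input /qle/input

# remove an existing output directory with the same name, if any
hadoop fs -rm -r /qle/output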
