Tuesday, January 18, 2011

Python UDFs from PIG Scripts

This is one handy feature.

The UDFs are run on the JVM by Jython, which still implements Python 2.5. So, all your fancy Python 3 code won't be of any use here!

Let's assume that you want to read a file (a huge one) or a set of files running into TBs. In fact, let's assume that you've got a file or a set of files that contain the first names, ages and countries of origin of every person on earth, and you want to do some analytics on that.

The file named "data" is something like this:

Deepak 22 India
Chaitanya 19 India
Sachin 36 India
Barack 50 USA
...
and so on, for all 6 billion people on earth..

If you're using PIG, the first thing you may do is this:

records = LOAD 'data' AS (first_name:chararray, age:int, country:chararray);
DUMP records;

Now, PIG's default loader, PigStorage, splits each line on tabs - for other delimiters you pass one explicitly, e.g. PigStorage(' ') for spaces. But to do the same splitting yourself with a UDF, you can do something like this:

REGISTER 'udf.py' USING jython AS udf;
records = LOAD 'data' AS (input_line:chararray);
schema_records = FOREACH records GENERATE udf.split_into_fields(input_line);
DUMP schema_records; 
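Note that each row of schema_records holds a single tuple field (named t in the output schema below). If you want the three fields back at the top level as separate columns - an extra step, not strictly needed for the DUMP - you can FLATTEN the tuple:

flat_records = FOREACH schema_records GENERATE FLATTEN(t);
DUMP flat_records;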

The UDF file, udf.py, will look like this:

# Filename - udf.py
@outputSchema("t:(first_name:chararray, age:int, country:chararray)")
def split_into_fields(input_line):
    if input_line is not None:
        fields = input_line.split()
        first_name = fields[0]
        age = int(fields[1])
        country = fields[2]
        return (first_name, age, country)
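Since the UDF is plain Python, you can sanity-check it locally before registering it with PIG. Here's a quick standalone test using a made-up input line; the outputSchema stub stands in for the decorator that the Pig/Jython runtime normally provides:

```python
# Standalone sanity check for the UDF logic (no Pig/Jython needed).
# In Pig, @outputSchema comes from the Jython runtime; stub it out here.
def outputSchema(schema):
    def decorator(func):
        return func
    return decorator

@outputSchema("t:(first_name:chararray, age:int, country:chararray)")
def split_into_fields(input_line):
    if input_line is not None:
        fields = input_line.split()
        first_name = fields[0]
        age = int(fields[1])
        country = fields[2]
        return (first_name, age, country)

print(split_into_fields("Deepak 22 India"))  # ('Deepak', 22, 'India')
```

Note that a None input line falls through and returns None, which PIG treats as a null tuple.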

This is really a fantastic feature. The fact that you can now write your UDFs in Python opens up PIG for a larger class of programmers. For those unfortunate people who're not very comfortable with Java (like me), it's a big boon.

PS - Chaitanya is my younger brother.
