Data-Scoop!!!: Workings on APACHE PIG in HADOOP

Previously we have seen how to write our first hadoop program now lets execute out first PIG program in hadoop

Learning PIG.....

Apache Pig is an open-source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs to be executed on hadoop clusters

Pig enables developers to create query execution routines for analyzing large, distributed data sets without having to do low-level work in MapReduce, much like the way the ApacheHive data warehouse software provides a SQL-like interface for Hadoop that doesn't require direct MapReduce programming,
The key parts of Pig are a compiler and a scripting language known as Pig Latin. Pig Latin is a data-flow language geared toward parallel processing. Managers of the Apache software foundation's Pig project position the language as being part way between declarative SQL and the procedural JAVA approach used in MapReduce applications. Proponents say, for example, that data joins are easier to create with Pig Latin than with Java. However, through the use of user-defined functions (UDFs), Pig Latin applications can be extended to include custom processing tasks written in Java as well as languages such as JAVASCRIPT and Python.
Apache Pig grew out of work at Yahoo Research and was first formally described in a paper published in 2008. Pig is intended to handle all kinds of data, including structured and unstructured information and relational and nested data. That omnivorous view of data likely had a hand in the decision to name the environment for the common barnyard animal. It also extends to Pig's take on application frameworks; while the technology is primarily associated with Hadoop, it is said to be capable of being used with other frameworks as well.

Objective :
We are going to read in a baseball statistics file. We are going to compute the highest runs by a player for each year. This file has all the statistics from 1871–2011 and it contains over 90,000 rows. Once we have the highest runs we will extend the script to translate a player id field into the first and last names of the players.

For free flow and continue along with the blog data can be downloaded from the following link.
http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip

Like our previous blog run the hortonworks from the virtual box and after running it open the following link to do APACHE PIG , URL:http://127.0.0.1:8000

Login Details :
Login : hue
password : 1111
You get to the hue screen as shown below and go to the file browser.

Once you have opened hue screen, navigate to file browser and upload the two csv files.

Once the files are uploaded click on the PIG icon on the top left corner of your screen to go to the PIG script page.

We need to write the following code and save it.

batting = load 'Batting.csv' using
PigStorage(',');
raw_runs = FILTER batting BY $1>0;
runs = FOREACH raw_runs GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
DUMP join_data;The explanation of above code is as follows:-

We load data using a comma delimiter.
Then we filter the first row of data.
Iteration for batting data object.
We should group the runs of each player by the year field.
We then join the runs data of highest scoring player to obtain player ID.

Once the script is ready you hit the Execute button to start the job and it will show the job running status

Job Status
To access this page you can either go to the job id which is displayed in bottom of the page when it says that the job is running successfully or go to Query History on the top left besides my scripts.

Once it is success you will get the following screen

Output will be obtained like below

Conclusion & Learning:
By this we have completed our task of executing the Pig script and obtaining the result of which player has highest runs from the year 1871 to 2011.

Data-Scoop!!!

Friday, December 18, 2015

Workings on APACHE PIG in HADOOP

2 comments :