Data-Scoop!!!: August 2015

Sunday, August 9, 2015

Airline Route :

In this post we look at how Python helps us extract information from CSV files.

Note: CSV (comma-separated values) is a way of expressing structured data in flat text files.

Python comes with a csv module, which provides an easy way to work with CSV files.

Scenario:

In the example below we use a freely available airline data set, published in CSV format on the OpenFlights data page, to find out which airports each country has.

We will now import and open the CSV file in Python with the help of the commands below:

Output :

Explanation:

We open the CSV file (airports.dat) in Python and read the data row by row. Since the country name is in the fourth column, we pull out row[3] and append it to a list, country = []. Note that row is a temporary list holding the fields of the current line.
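The original listing is not reproduced here, so what follows is only a minimal sketch of the step just described. It assumes the country name sits at index 3 of each row; the three sample rows and the helper name read_countries are made up for illustration and stand in for the real airports.dat.

```python
import csv
import io

# Hypothetical sample rows in the airports.dat layout
# (ID, name, city, country, ...); not taken from the real file.
SAMPLE = io.StringIO(
    '1,"Goroka","Goroka","Papua New Guinea","GKA","AYGA",-6.08,145.39\n'
    '2,"Sydney Intl","Sydney","Australia","SYD","YSSY",-33.95,151.18\n'
    '3,"Perth Intl","Perth","Australia","PER","YPPH",-31.94,115.97\n'
)

def read_countries(csv_file):
    """Return the country (fourth column, row[3]) of every row."""
    countries = []
    for row in csv.reader(csv_file):
        countries.append(row[3])
    return countries

countries = read_countries(SAMPLE)
print(countries)
```

With the real data, an open file object for airports.dat would be passed in place of SAMPLE.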

Now let's write a program that fetches the airport names for a specific country. In our example we want to see the names of the airports in “Australia”.

Output :

Explanation:

First we create an empty dictionary, airport = {}.

Next, each row from the CSV file is read into a variable line (which is a list). Counting from zero, index 3 of that list holds the country name and index 1 the airport name. We then check whether the dictionary already contains the country name as a key. If it does, we append the airport name to that key's list (the if part); if not, we create a new key for the country with the airport name as its first value (the else part).

Since we only wanted to see the names of the airports in “Australia”, we printed the values stored under the dictionary key “Australia”.

Other countries in the raw data file can be checked in the same way.
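The grouping described in this explanation can be sketched as follows. This is an illustration under assumptions, not the original listing: it takes the OpenFlights column order to have the airport name at index 1 and the country at index 3, and both the sample rows and the function name airports_by_country are hypothetical.

```python
import csv
import io

# Hypothetical rows in the airports.dat layout; not real data.
SAMPLE = io.StringIO(
    '1,"Goroka","Goroka","Papua New Guinea","GKA","AYGA",-6.08,145.39\n'
    '2,"Sydney Intl","Sydney","Australia","SYD","YSSY",-33.95,151.18\n'
    '3,"Perth Intl","Perth","Australia","PER","YPPH",-31.94,115.97\n'
)

def airports_by_country(csv_file):
    """Build a dictionary mapping each country to a list of airport names."""
    airports = {}
    for line in csv.reader(csv_file):
        country = line[3]
        name = line[1]
        if country in airports:
            # Country already seen: add this airport to its list
            airports[country].append(name)
        else:
            # First airport for this country: create the key
            airports[country] = [name]
    return airports

airports = airports_by_country(SAMPLE)
print(airports["Australia"])
```

Any other country present in the data can be looked up by changing the key.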

Airline Route Histogram

To see how far the flights in the data set travel, we plot a histogram showing the distribution of distances over all flight routes.

To do this we follow these steps:

· Read the airports file (airports.dat) and build a dictionary mapping the unique airport ID to the geographical coordinates (latitude and longitude). This allows you to look up the location of each airport by its ID.

· Read the routes file (routes.dat) and get the IDs of the source and destination airports. Look up the latitude and longitude based on the ID. Using those coordinates, calculate the length of the route and append it to a list of all route lengths.

Requirement : Calculating geographic distances is a bit tricky because the earth is a sphere. The distance we measure is the "great circle distance", the shortest path between two points along the surface.

To calculate the "great circle distance", we import a module named geo_distance; its pre-built function geo_distance.distance() calculates the length of each airline route.

Let's now have a quick look at how the function geo_distance.distance() works:

Output:
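The geo_distance module itself is not shown in this post. As a rough stand-in, a great-circle distance can be sketched with the haversine formula. The function name great_circle_km and the mean earth radius of 6371 km are assumptions of this sketch, and it is not necessarily what geo_distance.distance() does internally.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean earth radius; an assumption of this sketch

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km; inputs are in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

# Half-way round the equator should be half the circumference: pi * radius
half_equator = great_circle_km(0.0, 0.0, 0.0, 180.0)
print(round(half_equator, 1))
```

The arguments follow the same order as the call in the final code: source latitude, source longitude, destination latitude, destination longitude.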

The final code:

import csv
import numpy as np
import matplotlib.pyplot as plt
import geo_distance  # helper module providing distance(); must be importable

# Map each airport ID to its latitude and longitude
latitudes = {}
longitudes = {}
with open("airports.dat") as f:
    for row in csv.reader(f):
        airport_id = row[0]
        latitudes[airport_id] = float(row[6])
        longitudes[airport_id] = float(row[7])

# Compute the length of every route whose two airports are both known
distances = []
with open("routes.dat") as f:
    for row in csv.reader(f):
        source_airport = row[3]
        dest_airport = row[5]
        if source_airport in latitudes and dest_airport in latitudes:
            source_lat = latitudes[source_airport]
            source_long = longitudes[source_airport]
            dest_lat = latitudes[dest_airport]
            dest_long = longitudes[dest_airport]
            distances.append(geo_distance.distance(source_lat,source_long,dest_lat,dest_long))

# Plot the distribution of route lengths in 100 bins
plt.hist(distances, 100, facecolor='r')
plt.xlabel("Distance (km)")
plt.ylabel("Number of flights")
plt.show()

Explanation :

First we read the airports file (airports.dat) and build dictionaries mapping each unique airport ID to its geographical coordinates (latitude and longitude).

Then we read the routes file (routes.dat), get the IDs of the source and destination airports, and look up the latitude and longitude for each ID. Using these coordinates, we calculate the length of the route and append it to the list distances = [], which collects all route lengths.

Finally, a histogram of the route lengths is plotted to show the distribution of flight distances.

Output :

Charts in Python (Visualization of DATA in Python )

As a continuation of the previous post on the voting data analysis, we will create some graphs and charts based on the results we got.

Note: The library used is matplotlib, which helps in creating graphical charts based on our data.

The modules to be imported are numpy and pyplot.

A brief note on these two modules :

· pyplot is one way to plot graph data with Matplotlib. It is modelled on the way charting works in a popular commercial program, MATLAB.

· import matplotlib.pyplot as plt

· NumPy is a module providing lots of numeric functions for Python.

· import numpy as np

Bar Chart :

We will be using the library matplotlib to create some graphical charts based on our data.

· Inline Charts:

%matplotlib inline

... the command above tells IPython that we want charts to be created "inline" inside our notebook and not in a separate window.

The code used to create the chart is as follows:

Explanation :

We captured the radish varieties and their counts in two lists:

names = []

votes = []

Then we created a range of indexes for the X values in the graph, one entry for each item in the "counts" dictionary (i.e. len(counts)), numbered 0, 1, 2, 3, etc. This spreads the bars evenly across the X axis of the plot.

np.arange is a NumPy function like the range() function in Python, only the result it produces is a "NumPy array".

plt.bar() creates a bar graph, using the "x" values as the X axis positions and the values in the votes array (i.e. the vote counts) as the height of each bar.
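The steps above can be sketched as follows. The vote counts are invented for illustration, and split_counts and draw_bar_chart are hypothetical helper names. NumPy and matplotlib are imported inside draw_bar_chart so that the data preparation can be tried without them.

```python
# Hypothetical vote counts, invented for this sketch
counts = {"April cross": 72, "Champion": 76, "Red king": 56}

def split_counts(counts):
    """Split a {name: votes} dictionary into two parallel lists."""
    names = []
    votes = []
    for radish in counts:
        names.append(radish)
        votes.append(counts[radish])
    return names, votes

def draw_bar_chart(names, votes):
    """Draw one bar per variety; needs NumPy and matplotlib installed."""
    import numpy as np
    import matplotlib.pyplot as plt
    x = np.arange(len(names))          # bar positions 0, 1, 2, ...
    plt.bar(x, votes)                  # bar heights are the vote counts
    plt.xticks(x, names, rotation=90)  # label each bar with its variety
    plt.show()

names, votes = split_counts(counts)
print(names, votes)
```

Calling draw_bar_chart(names, votes) afterwards displays the chart.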

The Output :

Final Code for Bar Chart :

import matplotlib.pyplot as plt
import numpy as np
# Create an empty dictionary for associating radish names
# with vote counts
counts = {}
fraud=[]
# Create an empty list with the names of everyone who voted
voted = []
# Clean up (munge) a string so it's easy to match against other strings
def clean_string(s):
    return s.strip().capitalize().replace("  "," ")
# Check if someone has voted already and return True or False
def has_already_voted(name):
    if name in voted:
        fraud.append(name)
        return True
    return False
# Count a vote for the radish variety named 'radish'
def count_vote(radish):
    if not radish in counts:
        # First vote for this variety
        counts[radish] = 1
    else:
        # Increment the radish count
        counts[radish] = counts[radish] + 1
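# Return every variety tied for the highest vote count
def max_voted_radish(stats):
    top = max(stats.values())
    return [key for key, val in stats.items() if val == top]
# Return every variety tied for the lowest vote count
def min_voted_radish(stats):
    low = min(stats.values())
    return [key for key, val in stats.items() if val == low]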

for line in open("radishsurvey.txt"):
    line = line.strip()
    name, vote = line.split(" - ")
    name = clean_string(name)
    vote = clean_string(vote)

    if not has_already_voted(name):
        count_vote(vote)
    voted.append(name)
         
names = []
votes = []
# Split the dictionary of name:votes into two lists, one for names and one for vote count
for radish in counts:
    names.append(radish)
    votes.append(counts[radish])
# Index of the most and the least voted variety
mxpos = votes.index(max(votes))
mnpos = votes.index(min(votes))
# The X axis can just be numbered 0,1,2,3...
x = np.arange(len(counts))

# Bars are centered on their x positions (the default since matplotlib 2.0)
plt.bar(x, votes)
plt.xticks(x, names, rotation=90)
plt.yticks(np.arange(0,max(votes)+20,10))
plt.ylabel('Votes')
plt.xlabel('Radish variety')
plt.title('Leader Board')
# Point an arrow at the top of the tallest bar
plt.annotate('max vote '+str(max(votes)), xy=(mxpos, max(votes)), xytext=(mxpos+1.5, max(votes)+5),
            arrowprops=dict(facecolor='red', shrink=0.05),
            )

Output:

Pie Chart :

Code :

import matplotlib.pyplot as plt
import numpy as np
from pylab import *
# Create an empty dictionary for associating radish names
# with vote counts
counts = {}
fraud=[]
# Create an empty list with the names of everyone who voted
voted = []
# Clean up (munge) a string so it's easy to match against other strings
def clean_string(s):
    return s.strip().capitalize().replace("  "," ")
# Check if someone has voted already and return True or False
def has_already_voted(name):
    if name in voted:
        fraud.append(name)
        return True
    return False
# Count a vote for the radish variety named 'radish'
def count_vote(radish):
    if not radish in counts:
        # First vote for this variety
        counts[radish] = 1
    else:
        # Increment the radish count
        counts[radish] = counts[radish] + 1
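# Return every variety tied for the highest vote count
def max_voted_radish(stats):
    top = max(stats.values())
    return [key for key, val in stats.items() if val == top]
# Return every variety tied for the lowest vote count
def min_voted_radish(stats):
    low = min(stats.values())
    return [key for key, val in stats.items() if val == low]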

for line in open("radishsurvey.txt"):
    line = line.strip()
    name, vote = line.split(" - ")
    name = clean_string(name)
    vote = clean_string(vote)

    if not has_already_voted(name):
        count_vote(vote)
    voted.append(name)
         
names = []
votes = []
# Split the dictionary of name:votes into two lists, one for names and one for vote count
for radish in counts:
    names.append(radish)
    votes.append(counts[radish])
vts=[(float(x)/float(sum(votes)))*100.0 for x in votes]
sizes = vts
cs=cm.Set1(np.arange(40)/40.)
expl=[]
for i in range(len(vts)):
    if vts[i]==max(vts):
        expl.append(0.1)
    else:
        expl.append(0)

plt.pie(sizes, explode=expl, labels=names, colors=cs,
        autopct='%1.1f%%', shadow=True, startangle=90)
# Set aspect ratio to be equal so that pie is drawn as a circle.
plt.axis('equal')

plt.show()

Output :

Python In Action --->

The Vote Counting Problem

In a world where every moment is being captured to build a story out of an overwhelming amount of information, Python, the programming language created by Guido van Rossum, helps us do this with ease.

Let us look, in a nutshell, at how Python helps us extract and present information.

Below is a real-life example using a data set called radishsurvey.txt.

Scenario : We are trying to find out about people's preferred varieties of radish.

Introduction :

The file is a text file containing 300 lines of raw survey data, where each line consists of a name, a hyphen, then a radish variety.

Note: The raw data, downloaded in txt format, is saved in the default directory, which is created when the Anaconda suite is installed on the machine.

Targeting Result :

Our target is to get some useful information out of this raw data. But what information are we looking for? Let's say :

1. What's the most popular radish variety?

2. What are the least popular?

3. Did anyone vote twice?

What we need to know -

Well, first things first: this is text data, which means we need to know how to work with strings.

Array :

· First we process the raw data and store two pieces of information, as strings, in two different variables: name and vote.

The code for this does the following:

· Strip each line read from the text file

· Split the line by " - " and store the parts in two variables, name and vote

· Finally, print the result.

Now let's turn these two variables into lists and store the data there for future use.
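The steps above, including storing the values in two lists, can be sketched as follows. The original listing is not reproduced here, so this is only an illustration: the sample lines and the names in them are invented, and the real program would loop over open("radishsurvey.txt") instead.

```python
# Invented sample lines in the "name - vote" layout described above
sample_lines = [
    "Jane Doe - Red King\n",
    "John Roe - April Cross\n",
]

names = []
votes = []
for line in sample_lines:
    line = line.strip()               # drop the trailing newline
    name, vote = line.split(" - ")    # split into voter and variety
    names.append(name)
    votes.append(vote)
    print(name + " voted for " + vote)
```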

Output:

Perform Checking for Duplication :

Let's quickly perform some operations on these lists to check whether there is any duplication.

Output:

From the list we can infer that there are fraudulent voters or duplicate values, taking the example of “Red Kings”, which is repeated three times.

So there is definitely something wrong.

Let's investigate in some other ways as well !

Using Dictionary :

Output:

Here we simply created a blank dictionary named counts = {}. Then we read the two things, the name and the voted radish variety, and stored only the count of each variety in the dictionary.

Here too the output exhibits some flaws!
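A minimal sketch of a dictionary-based count is given below. The sample votes are invented, and they illustrate only one kind of flaw that uncleaned data can produce: the same variety written with different case or spacing ends up under separate keys. Whether this is exactly what the original output showed is an assumption.

```python
# Invented raw votes: the same variety written three different ways
raw_votes = ["Red King", "red king", "Red  King", "Champion"]

counts = {}
for vote in raw_votes:
    if vote not in counts:
        # First vote for this exact spelling
        counts[vote] = 1
    else:
        # Increment the count for this spelling
        counts[vote] = counts[vote] + 1

print(counts)
```

Each spelling is counted once here, so the dictionary holds four keys even though only two varieties were voted for.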

Cleaning the data :

Let's run a similar check on the person's name (this time more carefully: we remove extra spaces from the name and normalize its capitalization).

Output :

Aha! So Phoebe barwell and Procopio zito are the frauds…

So how did we actually find them?

· We created an empty list named voted.

· Then we went through each line of the text and took the voter's name from the line. We cleaned it up using the capitalize() method, which upper-cases the first letter and lower-cases the rest, and the replace() method, which replaces extra spaces between first name and surname.
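The two steps above can be sketched as follows. The sample lines are invented for illustration; clean_string mirrors the helper used in the full program in this post, while the list name frauds is only illustrative.

```python
def clean_string(s):
    """Trim the ends, normalize case and collapse double spaces."""
    return s.strip().capitalize().replace("  ", " ")

# Invented sample lines; the second voter appears twice with messy spacing
sample_lines = [
    "Jane Doe - Red King",
    "John  Roe - Champion",
    "  john roe - Red King",
]

voted = []
frauds = []
for line in sample_lines:
    name, vote = line.strip().split(" - ")
    name = clean_string(name)
    if name in voted:
        frauds.append(name)   # seen before: a repeat voter
    else:
        voted.append(name)

print(frauds)
```

Without the cleanup, "John  Roe" and "john roe" would be treated as two different people and the repeat vote would go unnoticed.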

Use Of Function :

Now let's use some user-defined functions to make the code shorter.

Output:

So we got our fresh result.

Conclusion:

Now let's make the program more complete so that it can answer the following questions :

1. What's the most popular radish variety?

2. What are the least popular?

3. Did anyone vote twice?

4. What are the full voting results?

Code :

# Create an empty dictionary for associating radish names
# with vote counts
counts = {}
fraud=[]
# Create an empty list with the names of everyone who voted
voted = []
# Clean up (munge) a string so it's easy to match against other strings
def clean_string(s):
    return s.strip().capitalize().replace("  "," ")
# Check if someone has voted already and return True or False
def has_already_voted(name):
    if name in voted:
        fraud.append(name)
        return True
    return False
# Count a vote for the radish variety named 'radish'
def count_vote(radish):
    if not radish in counts:
        # First vote for this variety
        counts[radish] = 1
    else:
        # Increment the radish count
        counts[radish] = counts[radish] + 1
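# Return every variety tied for the highest vote count
def max_voted_radish(stats):
    top = max(stats.values())
    return [key for key, val in stats.items() if val == top]
# Return every variety tied for the lowest vote count
def min_voted_radish(stats):
    low = min(stats.values())
    return [key for key, val in stats.items() if val == low]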

for line in open("radishsurvey.txt"):
    line = line.strip()
    name, vote = line.split(" - ")
    name = clean_string(name)
    vote = clean_string(vote)

    if not has_already_voted(name):
        count_vote(vote)
    voted.append(name)
if len(fraud)==1:
    print("--------------------------------------")
    print("There is only one fraud :" +fraud[0])
    print("--------------------------------------")
elif len(fraud)>1:
    print("--------------------------------------")
    print("There are total "+ str(len(fraud))+" frauds And they are : ")
    print("--------------------------------------")
    for i in range(len(fraud)):
        print(str(i+1)+" . "+fraud[i])
        
#print(max(counts.iterkeys(), key=lambda k: counts[k]))
x=(max_voted_radish(counts))
print("--------------------------------------")
print("The most popular radish variety")
print("--------------------------------------")
i=1
for elem in x:
    print(str(i)+" . "+elem +" With Total Vote : "+str(counts[elem]))
    i=i+1
y=(min_voted_radish(counts))
print("--------------------------------------")
print("The least popular radish variety")
print("--------------------------------------")
i=1
for elem in y:
    print(str(i)+" . "+elem +" With Total Vote : "+str(counts[elem]))
    i=i+1
print("--------------------------------------")
print("The Leader Board")
print("--------------------------------------")
for name in counts:
    print(name + ": " + str(counts[name]))

Output :
