In this blog we shall try and figure out how Python
helps us in extracting information from CSV Files.
Note: CSV-Comma-separated values is a
way of expressing structured data in flat text files:
Scenario:
In the example stated below we shall try and figure out from a freely available airline data set available at OpenFlights data page in a CSV format comprising the details on the number of airports a city has.
In the example stated below we shall try and figure out from a freely available airline data set available at OpenFlights data page in a CSV format comprising the details on the number of airports a city has.
We will now
import and open the CSV file in Python with the help of the following commands
stated below:
Output :
Explanation:
We open the CSV file (airports.dat) in
python and fetch the data row wise , but as we see the country name is on the
fourth column we have to pull out row[3] and appended it in an array country=[].
** row here temporary array .
Now, let's try to make a program which
will fetch the airport names for some explicitly defined countries .Lets say
,In Our example we want to see the names of the airport present in
“Australia”.
Output :
Explanation:
At first we have created one empty Dictionary
airport={}
Secondly each row from the CSV file is
imported into a variable line ( which is one array) . As we know in our array
the third column contains the country name and the first column contains the
airport name , we further tried to check if the dictionary already contains the
country name as key in it , if not then using if – else we first created one
key as a new country ( else part ). Or if the dictionary already had the
country name as a key in it we would just appended the value to it using if
part .
Since we wanted to just see the names of the airport
situated in “Australia” ,we printed the values assigned to the dictionary key=
“Australia” .
The other countries mentioned in the
raw data file can be checked in similar fashion.
Airline Route Histogram
In order to accomplish the task of
calculating the distance that needs to be traversed in a certain flight
schedule we shall plot a histogram showing the distribution of distances over each flight
schedule.
In order to perform this operation we have
to adhere to the following steps :
·
Read the airports file (airports.dat) and build a dictionary
mapping the unique airport ID to the geographical coordinates (latitude &
longitude.) This allows you to look up the location of each airport by its ID.
·
Read the routes file (routes.dat) and get the IDs of the source
and destination airports. Look up the latitude and longitude based on the ID.
Using those coordinates, calculate the length of the route and append it to a
list of all route lengths.
Requirement : Calculating geographic
distances is a bit tricky because the earth is a sphere. The distance we
measure is the "great circle distance".
In order to calculate "great circle distance".,
we have to import a module named geo_distance” , a pre-built function geo_distance.distance() helps us in
calculating the distance of each airline route.
Let's
now have a quick look how the function geo_distance.distace() works :
Output:
The final code:
import numpy as np import matplotlib.pyplot as plt latitudes = {} longitudes = {} f = open("airports.dat") for row in csv.reader(f): airport_id = row[0] latitudes[airport_id] = float(row[6]) longitudes[airport_id] = float(row[7]) distances = [] f = open("routes.dat") for row in csv.reader(f): source_airport = row[3] dest_airport = row[5] if source_airport in latitudes and dest_airport in latitudes: source_lat = latitudes[source_airport] source_long = longitudes[source_airport] dest_lat = latitudes[dest_airport] dest_long = longitudes[dest_airport] distances.append(geo_distance.distance(source_lat,source_long,dest_lat,dest_long)) plt.hist(distances, 100, facecolor='r') plt.xlabel("Distance (km)") plt.ylabel("Number of flights")
Explanation :
We need to read the airport file
(airports.dat) and build a dictionary mapping the unique airport ID to the
geographical coordinates (latitude & longitude.)
We need to read the routes file
(routes.dat) and get the IDs of the source and destination airports. And then look up for the latitude and
longitude based on the ID . Finally, using these coordinates, calculate the
length of the route and append it to a list distances = []of all route lengths.
At last a histogram is plotted based on the route lengths, to
show the distribution of different flight distances.
Output :