Data Cleaning and Analysis: Priceline Flight Data

Introduction

Justin Chen
3 min readJul 16, 2023

Since I don’t actively use Python in my current job, I have been grabbing Kaggle datasets and performing data analysis within Jupyter Notebooks to practice on my own. This project walks through how I cleaned and analyzed an interesting flight dataset from Priceline.

Cleaning the data

The dataset contained records of flights and relevant information, such as the airline, total travel time, departing and destination airport, number of stops (layovers), and the waiting time spent at each stop. When I read the file, there were also a lot of extra columns.

The first step was to drop all those extra columns and rename some of the remaining ones. Next, I checked the data types and learned that all columns were object datatypes. This wasn’t optimal, as I was interested in calculating descriptive statistics using price and travel time.

Notice the travel time-related columns are in the “XXh XXm” format

To do this, I converted the columns containing price and travel time information into integers. It turned out to be somewhat of a trial-and-error process because I kept discovering edge cases. For example, a ‘$’ in some of the price records.

#Cleaning non-numeric records from the Price column
df['Price']=df['Price'].astype(str)
df['Price']=df['Price'].str.replace('$','x',regex=False)
i = df.index[df['Price'].str.isalpha()]
df.drop(i,inplace=True)

#Now I can convert the Price column to int
df.Price=df.Price.astype(int)

There were also some unexpected characters in the travel time-related columns, which I had to discover and account for through trial and error. I used string replacements and the eval function to convert each travel time into the number of hours, rounded to the hundredth.

#converting duration columns into int (hours)
temp=['Travel time in hours','1st stop wtime','2nd stop wtime','3rd stop wtime']

for x in temp:
df[x] = df[x].fillna(0)
df[x] = df[x].astype(str)

df['Travel time in hours']=df['Travel time in hours'].str.replace("h", '*60').str.replace(' ','+').str.replace('m','*1').apply(eval)
df['1st stop wtime']=df['1st stop wtime'].str.replace("h", '*60').str.replace(' ','+').str.replace('m','*1').apply(eval)
df['2nd stop wtime']=df['2nd stop wtime'].str.replace("h", '*60').str.replace(' ','+').str.replace('m','*1').str.replace('sc','+0').str.replace('jRQBWg','+0').apply(eval)
df['3rd stop wtime']=df['3rd stop wtime'].str.replace("h", '*60').str.replace(' ','+').str.replace('m','*1').apply(eval)
df.head(1)

for x in temp:
df[x]=(df[x]/60).round(2)

The last step of this cleaning process was to drop all rows where either the destination or departure airport name was missing.

#drop rows with null destination/departure airports
i =df.index[df['Departure Airport'].isnull() | df['Destination Airport'].isnull()]
df.drop(i,inplace=True)
This is the format I wanted. Looks good!

Analysis

The analysis I chose to perform for this dataset centered around travel times and prices for specific routes. For a specific route, say, New York City and London, I wanted to know which airlines would provide the least amount of travel time and/or the best price.

By pivoting the data by airline and the “number of stops” field and then plotting it in a scatterplot I get a general idea of which airlines provide a minimized travel time for a fair price when compared to other companies that fly the same route.

#lets start with major airports in London and NYC
airports=['LHR','LCY','LGW','LTN','STN','SEN','JFK','EWR','LGA']
london_nyc=df[df['Departure Airport'].isin(airports) & df['Destination Airport'].isin(airports)].reset_index(drop=True)
london_nyc_sorted=london_nyc.pivot_table(index=['Airline','Stops'],values=['Price','Travel time in hours'],aggfunc=['mean','count']).round(2)
london_nyc_sorted = london_nyc_sorted.loc[london_nyc_sorted[('count','Price')]>=10].sort_values(by=[('mean','Price'),('mean','Travel time in hours')],ascending=False)

#plotting the scatterplot
x=london_nyc_sorted[('mean','Travel time in hours')]
y=london_nyc_sorted[('mean','Price')]
a=london_nyc_sorted.index

fig,ax=plt.subplots()
ax.scatter(x,y)
ax.set_xlabel ('Travel time in hours')
ax.set_ylabel ('Price (USD)')

for i,txt in enumerate(a):
ax.annotate(txt, (x[i],y[i]))

Time is money. You can fly Etihad between NYC and London for an average of under $600, but you’ll be in transit for 3 hours more than the most efficient flights. If you’re looking for the most bang for your buck, consider booking through Emirates, which has similar travel times to British Airways and Qatar but can be $250 cheaper.

Conclusion

Not all Kaggle datasets are ready-to-use, so this project was a good exercise in data cleaning and transformation. Thanks for reading!

Link to Jupyter Notebook

--

--

Justin Chen
Justin Chen

Written by Justin Chen

Showcasing my data analysis projects in a narrative format.

No responses yet