Click above image to access the interactive version of this notebook
05 - 02 Exploratory Data Analysis using Pandas
Based on the concepts that we have built in the last couple of modules, we will explore the imdb dataset but before we do that, let’s first look at ways of loading the datasets as pandas dataframe.
Loading Data
You can read data from a CSV file using the read_csv function. By default, it assumes that the fields are comma-separated. Pandas supports following file formats:
Function | Description |
---|---|
read_csv | Load delimited data from a file, URL, or file-like object. Use comma as default delimiter |
read_table | Load delimited data from a file, URL, or file-like object. Use tab ('\t') as default delimiter |
read_fwf | Read data in fixed-width column format (that is, no delimiters) |
read_clipboard | Version of read_table that reads data from the clipboard. Useful for converting tables from web pages. |
For most of the analysis work, you will be loading the csv, tsv or some delimited files. We will only look at read_csv
function in this example.
First, let’s import the dataset:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')
Lets load the cast, release-dates and titles dataset. This might take sometime.
%%time
cast = pd.read_csv('./sample_datasets/cast.csv', index_col=None)
release_dates = pd.read_csv('./sample_datasets/release_dates.csv', index_col=None,
parse_dates=['date'], infer_datetime_format=True)
titles = pd.read_csv('./sample_datasets/titles.csv', index_col=None)
CPU times: user 4.75 s, sys: 398 ms, total: 5.14 s
Wall time: 5.16 s
Lets look at some of the contents of these dataframes
cast.head()
title | year | name | type | character | n | |
---|---|---|---|---|---|---|
0 | Closet Monster | 2015 | Buffy #1 | actor | Buffy 4 | 31.0 |
1 | Suuri illusioni | 1985 | Homo $ | actor | Guests | 22.0 |
2 | Battle of the Sexes | 2017 | $hutter | actor | Bobby Riggs Fan | 10.0 |
3 | Secret in Their Eyes | 2015 | $hutter | actor | 2002 Dodger Fan | NaN |
4 | Steve Jobs | 2015 | $hutter | actor | 1988 Opera House Patron | NaN |
The
n
in the cast dataframe tells us the rank or the order of the leading roles.
release_dates.head()
title | year | country | date | |
---|---|---|---|---|
0 | #73, Shaanthi Nivaasa | 2007 | India | 2007-06-15 |
1 | #BKKY | 2016 | Cambodia | 2017-10-12 |
2 | #Beings | 2015 | Romania | 2015-01-29 |
3 | #Captured | 2017 | USA | 2017-09-05 |
4 | #Ewankosau saranghaeyo | 2015 | Philippines | 2015-01-21 |
titles.head()
title | year | |
---|---|---|
0 | Gypsy 83 | 2001 |
1 | Challenge of Five Gauntlets | 2018 |
2 | Haunting Shadows | 1919 |
3 | The Cake Eaters | 2007 |
4 | B Qu 32 Hao | 2011 |
Do you know how many movies released since you were born?
my_birth_year = 1990
len(titles[(titles['year']>my_birth_year) & (titles['year']<2017)])
115183
Lets plot how many movies released every year since you were born
titles.groupby('year').size().loc[my_birth_year:2016].plot(title="Number of movies released every year")
<AxesSubplot:title={'center':'Number of movies released every year'}, xlabel='year'>
Pandas provides matplotlib based plotting as a part of dataframe. To plot any dataframe (with plottable data) you can simply use .plot()
.
Ofcourse we can also do this manually using matplotlib’s pyplot as well (which is actually what pandas uses under the cover)
fig, ax = plt.subplots()
ax.plot(titles.groupby('year').size().loc[my_birth_year:2016])
ax.set_xlabel("Year")
ax.set_ylabel("# of Movies")
ax.set_title("Number of movies released every year")
ax.set_xlim(1990, 2016)
(1990.0, 2016.0)
Hmm.. How about the total number of movies released every decade?
This will involve a little math. We know that the titles
dataframe has a year
column which, obviously, has the year when a movie was released.
To count the occurence of something, we can use the value_counts
method.
So now, all we need to do is pass the decade as key to the titles
dataframe and apply the value_counts
method.
Let’s see how to do this
(titles['year'] // 10 * 10).value_counts().sort_index().plot(kind='bar')
<AxesSubplot:>
How many movies did a movie star star in?
movie_star = "Matt Damon"
len(cast[cast['name'] == movie_star])
62
What are the 10 most common name of the roles played by characters?
cast['character'].value_counts().head(10)
Himself 20746
Dancer 12477
Extra 11948
Reporter 8434
Student 7773
Doctor 7669
Party Guest 7245
Policeman 7029
Nurse 6999
Bartender 6802
Name: character, dtype: int64
What are the 10 most common movie names?
titles['title'].value_counts().head(10)
Hamlet 20
Carmen 17
Macbeth 16
Maya 12
Temptation 12
The Outsider 12
The Three Musketeers 11
Othello 11
Honeymoon 11
Freedom 11
Name: title, dtype: int64
Similarly, you can find who has been the most in the movies
cast['name'].value_counts().head(10)
Bess Flowers 835
Herman Hack 702
Sam (II) Harris 667
Harold Miller 624
Lee Phelps 624
Frank O'Connor 613
Franklyn Farnum 570
Tom London 565
Larry Steers 559
Frank Ellis 546
Name: name, dtype: int64
Lets find the years when The Bourne series were released.
titles[titles['title'].str.contains("Bourne")].sort_values('year')
title | year | |
---|---|---|
169218 | The Bourne Identity | 2002 |
244703 | The Bourne Supremacy | 2004 |
74839 | The Bourne Ultimatum | 2007 |
197575 | The Mel Bourne Ultimatum | 2009 |
106112 | The Bourne Legacy | 2012 |
4707 | Jason Bourne | 2016 |
Lets find the movies when Matt Damon was Jason Bourne
matt_jbourne = cast[(cast['title'].str.contains("Bourne")) &
(cast['name'].str.contains("Matt Damon"))]
matt_jbourne
title | year | name | type | character | n | |
---|---|---|---|---|---|---|
522755 | Jason Bourne | 2016 | Matt Damon | actor | Jason Bourne | 1.0 |
522774 | The Bourne Identity | 2002 | Matt Damon | actor | Bourne | 1.0 |
522775 | The Bourne Supremacy | 2004 | Matt Damon | actor | Jason Bourne | 1.0 |
522776 | The Bourne Ultimatum | 2007 | Matt Damon | actor | Jason Bourne | 1.0 |
So, How many movies do you think is released every year starring Matt Damon?
fig, ax = plt.subplots()
ax.plot(cast[cast['name'] == "Matt Damon"].groupby('year').size())
ax.set_xlabel("Year")
ax.set_ylabel("# of Movies")
ax.set_title("Matt Damon movies")
Text(0.5, 1.0, 'Matt Damon movies')
And how about the ranks at which Matt Damon stars in the movies?
matt_movies = cast[cast.name == 'Matt Damon'].sort_values('year')
matt_movies = matt_movies[matt_movies['n'].notnull()]
# For scatter plots, you can simply pass the column
# names for the x and y argument
matt_movies.plot(x='year', y='n', kind='scatter')
<AxesSubplot:xlabel='year', ylabel='n'>
So how many leading roles?
matt_movies = cast[cast.name == 'Matt Damon'].sort_values('year')
matt_movies[matt_movies['n'] == 1]['n'].value_counts()
1.0 21
Name: n, dtype: int64
Lets see how many people were casted in all these Jason Bourne movies
%%time
cast[(cast['title'].str.contains("Bourne"))].groupby(['year', 'title']).size()
CPU times: user 1.31 s, sys: 7.41 ms, total: 1.31 s
Wall time: 1.32 s
year title
2002 The Bourne Identity 56
2004 The Bourne Supremacy 49
2007 The Bourne Ultimatum 70
2009 The Mel Bourne Ultimatum 8
2012 The Bourne Legacy 122
2016 Jason Bourne 259
dtype: int64
Among these casts, how many were actors and actresses?
%%time
cast[(cast['title'].str.contains("Bourne"))].groupby(['year', 'title', 'type']).size()
CPU times: user 1.34 s, sys: 8.34 ms, total: 1.35 s
Wall time: 1.36 s
year title type
2002 The Bourne Identity actor 50
actress 6
2004 The Bourne Supremacy actor 38
actress 11
2007 The Bourne Ultimatum actor 59
actress 11
2009 The Mel Bourne Ultimatum actor 5
actress 3
2012 The Bourne Legacy actor 86
actress 36
2016 Jason Bourne actor 189
actress 70
dtype: int64
Lets plot the above and see the total number of roles based on gender
jason_cast = cast[(cast['title'].str.contains("Bourne"))]
jason_cast_gender = jason_cast[['year', 'type']].groupby(['year', 'type']).size().unstack()
print(jason_cast_gender)
jason_cast_gender.plot()
type actor actress
year
2002 50 6
2004 38 11
2007 59 11
2009 5 3
2012 86 36
2016 189 70
<AxesSubplot:xlabel='year'>
Lets find out the entire cast of the The Bourne Ultimatum and print just the top 10 leads
cast[cast['title'] == "The Bourne Ultimatum"].sort_values(['n']).head(10)
title | year | name | type | character | n | |
---|---|---|---|---|---|---|
522776 | The Bourne Ultimatum | 2007 | Matt Damon | actor | Jason Bourne | 1.0 |
3613634 | The Bourne Ultimatum | 2007 | Julia Stiles | actress | Nicky Parsons | 2.0 |
2195770 | The Bourne Ultimatum | 2007 | David Strathairn | actor | Noah Vosen | 3.0 |
840581 | The Bourne Ultimatum | 2007 | Scott Glenn | actor | Ezra Kramer | 4.0 |
459423 | The Bourne Ultimatum | 2007 | Paddy Considine | actor | Simon Ross | 5.0 |
1872404 | The Bourne Ultimatum | 2007 | Edgar (IV) Ram?rez | actor | Paz | 6.0 |
728791 | The Bourne Ultimatum | 2007 | Albert Finney | actor | Dr. Albert Hirsch | 7.0 |
2552022 | The Bourne Ultimatum | 2007 | Joan Allen | actress | Pam Landy | 8.0 |
790556 | The Bourne Ultimatum | 2007 | Tom Gallop | actor | Tom Cronin | 9.0 |
1112789 | The Bourne Ultimatum | 2007 | Corey Johnson | actor | Wills | 10.0 |
Lets see in what months Matt Damon’s movies are most often released in the USA.
- First, find the year and the title of unique movies starring Matt Damon were released
matt_movies = cast[cast['name'] == "Matt Damon"][['title', 'year']].drop_duplicates()
Now we have to re-index the release_dates
dataframe with title
and year
keys
rel_dts = release_dates.set_index(['title', 'year']).sort_index()
- Now, the ‘month’ part is present in
date
column present inrelease_dates
dataframe so we have to combine thecast
andrelease_dates
dataframes for Matt Damon’s movies by some common index (in our case, we will use the common index astitle
andyear
since it is present in both the dataframes)
matt_movie_releases = matt_movies.join(rel_dts, on=['title', 'year'])
- We only want movies released in the USA
matt_movie_releases = matt_movie_releases[matt_movie_releases['country'] == "USA"]
- Now lets plot the
month
part.
matt_movie_releases['date'].dt.month.value_counts().sort_index().plot(kind='bar')
<AxesSubplot:>
Cool, Now lets see when the Bourne movie series were released in different countries.
Since there are is a huge list of countries, lets just select “USA”, “UK” and “India”
countries = ["USA", "UK", "Australia"]
matt_movie_releases = matt_movies.join(rel_dts, on=['title', 'year'])
matt_movie_countries = matt_movie_releases[matt_movie_releases['country'].str.contains('|'.join(countries))]
matt_movie_countries.set_index(['title', 'country'])[['date']].unstack()
date | |||
---|---|---|---|
country | Australia | UK | USA |
title | |||
All the Pretty Horses | 2001-05-10 | 2001-05-25 | 2000-12-25 |
Chasing Amy | 1997-07-03 | 1997-11-14 | 1997-04-18 |
Che: Part Two | 2009-10-01 | 2009-02-20 | 2009-01-24 |
Confessions of a Dangerous Mind | 2003-07-24 | 2003-03-14 | 2003-01-24 |
Contagion | 2011-10-20 | 2011-10-21 | 2011-09-09 |
Courage Under Fire | 1996-09-12 | 1996-10-04 | 1996-07-12 |
Dogma | 2000-02-03 | 1999-12-26 | 1999-11-12 |
Downsizing | 2017-12-26 | 2018-01-24 | 2017-12-22 |
Elysium | 2013-08-15 | 2013-08-21 | 2013-08-09 |
EuroTrip | 2004-08-12 | 2004-06-25 | 2004-02-20 |
Field of Dreams | 1989-08-31 | 1989-11-24 | 1989-05-05 |
Finding Forrester | 2001-03-29 | 2001-02-23 | 2001-01-12 |
Gake no ue no Ponyo | 2009-08-27 | 2010-02-12 | 2009-08-14 |
Geronimo: An American Legend | 1994-05-26 | 1994-10-14 | 1993-12-10 |
Gerry | NaT | 2003-08-22 | NaT |
Glory Daze | NaT | NaT | 1996-09-27 |
Good Will Hunting | 1998-03-12 | 1998-03-06 | 1998-01-09 |
Green Zone | 2010-03-11 | 2010-03-12 | 2010-03-12 |
Happy Feet Two | 2011-12-26 | 2011-12-02 | 2011-11-18 |
Hereafter | NaT | 2011-01-28 | 2010-10-22 |
Interstellar | 2014-11-06 | 2014-11-07 | 2014-11-07 |
Invictus | 2010-01-21 | 2010-02-05 | 2009-12-11 |
Jason Bourne | 2016-07-28 | 2016-07-27 | 2016-07-29 |
Jay and Silent Bob Strike Back | 2002-01-31 | 2001-11-30 | 2001-08-24 |
Jersey Girl | 2004-08-12 | 2004-06-18 | 2004-03-26 |
Margaret | 2012-06-14 | NaT | NaT |
Mystic Pizza | 1989-08-03 | 1990-01-05 | 1988-10-21 |
Ocean's Eleven | 2002-01-10 | 2002-02-15 | 2001-12-07 |
Ocean's Thirteen | 2007-06-14 | 2007-06-08 | 2007-06-08 |
Ocean's Twelve | 2004-12-09 | 2005-02-04 | 2004-12-10 |
Promised Land | 2013-05-02 | 2013-04-19 | 2013-01-04 |
Rounders | 1999-01-28 | 1998-11-20 | 1998-09-11 |
Saving Private Ryan | 1998-11-19 | 1998-09-11 | 1998-07-24 |
School Ties | 1993-06-10 | NaT | 1992-09-18 |
Spirit: Stallion of the Cimarron | 2002-06-20 | 2002-07-05 | 2002-05-24 |
Stuck on You | 2004-02-12 | 2004-01-02 | 2003-12-12 |
Suburbicon | 2017-10-26 | 2017-11-24 | 2017-10-27 |
Syriana | 2006-02-16 | 2006-03-03 | 2005-12-09 |
The Adjustment Bureau | NaT | 2011-03-04 | 2011-03-04 |
The Bourne Identity | 2002-08-22 | 2002-09-06 | 2002-06-14 |
The Bourne Supremacy | 2004-08-26 | 2004-08-13 | 2004-07-23 |
The Bourne Ultimatum | 2007-08-30 | 2007-08-17 | 2007-08-03 |
The Brothers Grimm | 2005-11-24 | 2005-11-04 | 2005-08-26 |
The Departed | 2006-10-12 | 2006-10-06 | 2006-10-06 |
The Good Mother | 1989-05-04 | 1989-02-17 | 1988-11-04 |
The Good Shepherd | 2007-02-15 | 2007-02-23 | 2006-12-22 |
The Great Wall | 2017-02-16 | 2017-02-17 | 2017-02-17 |
The Informant! | 2009-12-03 | 2009-11-20 | 2009-09-18 |
The Legend of Bagger Vance | 2001-02-08 | 2001-02-23 | 2000-11-03 |
The Majestic | 2002-05-16 | 2002-05-24 | 2001-12-21 |
The Martian | 2015-09-30 | 2015-09-30 | 2015-10-02 |
The Monuments Men | 2014-03-13 | 2014-02-14 | 2014-02-07 |
The Rainmaker | 1998-01-22 | 1998-04-03 | 1997-11-21 |
The Talented Mr. Ripley | 2000-02-24 | 2000-02-25 | 1999-12-25 |
The Zero Theorem | 2014-05-15 | 2014-03-14 | NaT |
Titan A.E. | 2001-01-04 | 2000-07-28 | 2000-06-16 |
True Grit | 2011-01-26 | 2011-02-11 | 2010-12-22 |
We Bought a Zoo | 2011-12-26 | 2012-03-16 | 2011-12-23 |
Youth Without Youth | 2008-11-20 | 2007-12-14 | NaT |
To match the
country
column against all the elements of the list, we use the|
(OR) operator. It is considered as a Regular Expression.
We can also create a Pivot
table to provide the above output.
- The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.
- Think of it as a multi-dimensional GroupBy function
countries = ["USA", "UK", "Australia"]
matt_movie_releases = matt_movies.join(rel_dts, on=['title', 'year'])
matt_movie_countries = matt_movie_releases[matt_movie_releases['country'].str.contains('|'.join(countries))]
matt_movie_countries.pivot(index='title', columns='country', values='date')
country | Australia | UK | USA |
---|---|---|---|
title | |||
All the Pretty Horses | 2001-05-10 | 2001-05-25 | 2000-12-25 |
Chasing Amy | 1997-07-03 | 1997-11-14 | 1997-04-18 |
Che: Part Two | 2009-10-01 | 2009-02-20 | 2009-01-24 |
Confessions of a Dangerous Mind | 2003-07-24 | 2003-03-14 | 2003-01-24 |
Contagion | 2011-10-20 | 2011-10-21 | 2011-09-09 |
Courage Under Fire | 1996-09-12 | 1996-10-04 | 1996-07-12 |
Dogma | 2000-02-03 | 1999-12-26 | 1999-11-12 |
Downsizing | 2017-12-26 | 2018-01-24 | 2017-12-22 |
Elysium | 2013-08-15 | 2013-08-21 | 2013-08-09 |
EuroTrip | 2004-08-12 | 2004-06-25 | 2004-02-20 |
Field of Dreams | 1989-08-31 | 1989-11-24 | 1989-05-05 |
Finding Forrester | 2001-03-29 | 2001-02-23 | 2001-01-12 |
Gake no ue no Ponyo | 2009-08-27 | 2010-02-12 | 2009-08-14 |
Geronimo: An American Legend | 1994-05-26 | 1994-10-14 | 1993-12-10 |
Gerry | NaT | 2003-08-22 | NaT |
Glory Daze | NaT | NaT | 1996-09-27 |
Good Will Hunting | 1998-03-12 | 1998-03-06 | 1998-01-09 |
Green Zone | 2010-03-11 | 2010-03-12 | 2010-03-12 |
Happy Feet Two | 2011-12-26 | 2011-12-02 | 2011-11-18 |
Hereafter | NaT | 2011-01-28 | 2010-10-22 |
Interstellar | 2014-11-06 | 2014-11-07 | 2014-11-07 |
Invictus | 2010-01-21 | 2010-02-05 | 2009-12-11 |
Jason Bourne | 2016-07-28 | 2016-07-27 | 2016-07-29 |
Jay and Silent Bob Strike Back | 2002-01-31 | 2001-11-30 | 2001-08-24 |
Jersey Girl | 2004-08-12 | 2004-06-18 | 2004-03-26 |
Margaret | 2012-06-14 | NaT | NaT |
Mystic Pizza | 1989-08-03 | 1990-01-05 | 1988-10-21 |
Ocean's Eleven | 2002-01-10 | 2002-02-15 | 2001-12-07 |
Ocean's Thirteen | 2007-06-14 | 2007-06-08 | 2007-06-08 |
Ocean's Twelve | 2004-12-09 | 2005-02-04 | 2004-12-10 |
Promised Land | 2013-05-02 | 2013-04-19 | 2013-01-04 |
Rounders | 1999-01-28 | 1998-11-20 | 1998-09-11 |
Saving Private Ryan | 1998-11-19 | 1998-09-11 | 1998-07-24 |
School Ties | 1993-06-10 | NaT | 1992-09-18 |
Spirit: Stallion of the Cimarron | 2002-06-20 | 2002-07-05 | 2002-05-24 |
Stuck on You | 2004-02-12 | 2004-01-02 | 2003-12-12 |
Suburbicon | 2017-10-26 | 2017-11-24 | 2017-10-27 |
Syriana | 2006-02-16 | 2006-03-03 | 2005-12-09 |
The Adjustment Bureau | NaT | 2011-03-04 | 2011-03-04 |
The Bourne Identity | 2002-08-22 | 2002-09-06 | 2002-06-14 |
The Bourne Supremacy | 2004-08-26 | 2004-08-13 | 2004-07-23 |
The Bourne Ultimatum | 2007-08-30 | 2007-08-17 | 2007-08-03 |
The Brothers Grimm | 2005-11-24 | 2005-11-04 | 2005-08-26 |
The Departed | 2006-10-12 | 2006-10-06 | 2006-10-06 |
The Good Mother | 1989-05-04 | 1989-02-17 | 1988-11-04 |
The Good Shepherd | 2007-02-15 | 2007-02-23 | 2006-12-22 |
The Great Wall | 2017-02-16 | 2017-02-17 | 2017-02-17 |
The Informant! | 2009-12-03 | 2009-11-20 | 2009-09-18 |
The Legend of Bagger Vance | 2001-02-08 | 2001-02-23 | 2000-11-03 |
The Majestic | 2002-05-16 | 2002-05-24 | 2001-12-21 |
The Martian | 2015-09-30 | 2015-09-30 | 2015-10-02 |
The Monuments Men | 2014-03-13 | 2014-02-14 | 2014-02-07 |
The Rainmaker | 1998-01-22 | 1998-04-03 | 1997-11-21 |
The Talented Mr. Ripley | 2000-02-24 | 2000-02-25 | 1999-12-25 |
The Zero Theorem | 2014-05-15 | 2014-03-14 | NaT |
Titan A.E. | 2001-01-04 | 2000-07-28 | 2000-06-16 |
True Grit | 2011-01-26 | 2011-02-11 | 2010-12-22 |
We Bought a Zoo | 2011-12-26 | 2012-03-16 | 2011-12-23 |
Youth Without Youth | 2008-11-20 | 2007-12-14 | NaT |
Do you know when are the most Action movies released in the USA?
action_usa = release_dates[(release_dates['title'].str.contains('Action')) &
(release_dates['country'] == "USA")]
action_usa['date'].dt.dayofweek.value_counts().sort_index().plot(kind='bar')
<AxesSubplot:>
In which months are Matt Damon’s movies generally released in the USA?
matt_movies = cast[cast['name'] == 'Matt Damon']
matt_movies_usa = matt_movies.merge(release_dates[release_dates.country == 'USA']).sort_values('date')
matt_movies_usa.date.dt.month.value_counts().sort_index().plot(kind='bar')
<AxesSubplot:>
The above examples should give you some idea about the importance of Pandas and how its high level functions mask the complex computation that is performed on the underlying Numpy arrays.
This is by no means an exhaustive list of all the functions. We have barely scratched the surface.
Remember –
The only way to become a master of something is to be really With It!
So keep practicing and whenever you are stuck:
- Go through the official documentation.
- Enter the object name and Press
<TAB>
or.?
and Jupyter will show you the docstring. - Don’t trust the examples blindly. Run them, modify them, make mistakes and then rectify them.
- Don’t simply copy the StackOverflow or StackExchange or answers from anywhere else. Understand the solution that you find on such sites and then and only then use it in your code.
If you want more tutorials/ cookbooks, take a look at
- [
Hernan Rojas's Learn Pandas
]( https://bitbucket.org/hrojas/learn-pandas ‘hrojas’s Learn Pandas’)