Collab Link

Click above image to access the interactive version of this notebook

05 - 02 Exploratory Data Analysis using Pandas

Based on the concepts that we have built in the last couple of modules, we will explore the imdb dataset but before we do that, let’s first look at ways of loading the datasets as pandas dataframe.

Loading Data

You can read data from a CSV file using the read_csv function. By default, it assumes that the fields are comma-separated. Pandas supports following file formats:

Function Description
read_csv Load delimited data from a file, URL, or file-like object. Use comma as default delimiter
read_table Load delimited data from a file, URL, or file-like object. Use tab ('\t') as default delimiter
read_fwf Read data in fixed-width column format (that is, no delimiters)
read_clipboard Version of read_table that reads data from the clipboard. Useful for converting tables from web pages.

For most of the analysis work, you will be loading the csv, tsv or some delimited files. We will only look at read_csv function in this example.

First, let’s import the dataset:

import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')

Lets load the cast, release-dates and titles dataset. This might take sometime.

%%time
cast = pd.read_csv('./sample_datasets/cast.csv', index_col=None)
release_dates = pd.read_csv('./sample_datasets/release_dates.csv', index_col=None,
                            parse_dates=['date'], infer_datetime_format=True)
titles = pd.read_csv('./sample_datasets/titles.csv', index_col=None)
CPU times: user 4.75 s, sys: 398 ms, total: 5.14 s
Wall time: 5.16 s

Lets look at some of the contents of these dataframes

cast.head()

title year name type character n
0 Closet Monster 2015 Buffy #1 actor Buffy 4 31.0
1 Suuri illusioni 1985 Homo $ actor Guests 22.0
2 Battle of the Sexes 2017 $hutter actor Bobby Riggs Fan 10.0
3 Secret in Their Eyes 2015 $hutter actor 2002 Dodger Fan NaN
4 Steve Jobs 2015 $hutter actor 1988 Opera House Patron NaN

The n in the cast dataframe tells us the rank or the order of the leading roles.

release_dates.head()

title year country date
0 #73, Shaanthi Nivaasa 2007 India 2007-06-15
1 #BKKY 2016 Cambodia 2017-10-12
2 #Beings 2015 Romania 2015-01-29
3 #Captured 2017 USA 2017-09-05
4 #Ewankosau saranghaeyo 2015 Philippines 2015-01-21
titles.head()

title year
0 Gypsy 83 2001
1 Challenge of Five Gauntlets 2018
2 Haunting Shadows 1919
3 The Cake Eaters 2007
4 B Qu 32 Hao 2011

Do you know how many movies released since you were born?

my_birth_year = 1990
len(titles[(titles['year']>my_birth_year) & (titles['year']<2017)])
115183

Lets plot how many movies released every year since you were born

titles.groupby('year').size().loc[my_birth_year:2016].plot(title="Number of movies released every year")
<AxesSubplot:title={'center':'Number of movies released every year'}, xlabel='year'>

png

Pandas provides matplotlib based plotting as a part of dataframe. To plot any dataframe (with plottable data) you can simply use .plot().

Ofcourse we can also do this manually using matplotlib’s pyplot as well (which is actually what pandas uses under the cover)

fig, ax = plt.subplots()
ax.plot(titles.groupby('year').size().loc[my_birth_year:2016])
ax.set_xlabel("Year")
ax.set_ylabel("# of Movies")
ax.set_title("Number of movies released every year")
ax.set_xlim(1990, 2016)
(1990.0, 2016.0)

png

Hmm.. How about the total number of movies released every decade?

This will involve a little math. We know that the titles dataframe has a year column which, obviously, has the year when a movie was released.

To count the occurence of something, we can use the value_counts method.

So now, all we need to do is pass the decade as key to the titles dataframe and apply the value_counts method.

Let’s see how to do this

(titles['year'] // 10 * 10).value_counts().sort_index().plot(kind='bar')
<AxesSubplot:>

png

How many movies did a movie star star in?

movie_star = "Matt Damon"
len(cast[cast['name'] == movie_star])
62

What are the 10 most common name of the roles played by characters?

cast['character'].value_counts().head(10)
Himself        20746
Dancer         12477
Extra          11948
Reporter        8434
Student         7773
Doctor          7669
Party Guest     7245
Policeman       7029
Nurse           6999
Bartender       6802
Name: character, dtype: int64

What are the 10 most common movie names?

titles['title'].value_counts().head(10)
Hamlet                  20
Carmen                  17
Macbeth                 16
Maya                    12
Temptation              12
The Outsider            12
The Three Musketeers    11
Othello                 11
Honeymoon               11
Freedom                 11
Name: title, dtype: int64

Similarly, you can find who has been the most in the movies

cast['name'].value_counts().head(10)
Bess Flowers       835
Herman Hack        702
Sam (II) Harris    667
Harold Miller      624
Lee Phelps         624
Frank O'Connor     613
Franklyn Farnum    570
Tom London         565
Larry Steers       559
Frank Ellis        546
Name: name, dtype: int64

Lets find the years when The Bourne series were released.

titles[titles['title'].str.contains("Bourne")].sort_values('year')

title year
169218 The Bourne Identity 2002
244703 The Bourne Supremacy 2004
74839 The Bourne Ultimatum 2007
197575 The Mel Bourne Ultimatum 2009
106112 The Bourne Legacy 2012
4707 Jason Bourne 2016

Lets find the movies when Matt Damon was Jason Bourne

matt_jbourne = cast[(cast['title'].str.contains("Bourne")) & 
                    (cast['name'].str.contains("Matt Damon"))]
matt_jbourne

title year name type character n
522755 Jason Bourne 2016 Matt Damon actor Jason Bourne 1.0
522774 The Bourne Identity 2002 Matt Damon actor Bourne 1.0
522775 The Bourne Supremacy 2004 Matt Damon actor Jason Bourne 1.0
522776 The Bourne Ultimatum 2007 Matt Damon actor Jason Bourne 1.0

So, How many movies do you think is released every year starring Matt Damon?

fig, ax = plt.subplots()
ax.plot(cast[cast['name'] == "Matt Damon"].groupby('year').size())
ax.set_xlabel("Year")
ax.set_ylabel("# of Movies")
ax.set_title("Matt Damon movies")
Text(0.5, 1.0, 'Matt Damon movies')

png

And how about the ranks at which Matt Damon stars in the movies?

matt_movies = cast[cast.name == 'Matt Damon'].sort_values('year')
matt_movies = matt_movies[matt_movies['n'].notnull()]
# For scatter plots, you can simply pass the column
# names for the x and y argument
matt_movies.plot(x='year', y='n', kind='scatter')
<AxesSubplot:xlabel='year', ylabel='n'>

png

So how many leading roles?

matt_movies = cast[cast.name == 'Matt Damon'].sort_values('year')
matt_movies[matt_movies['n'] == 1]['n'].value_counts()
1.0    21
Name: n, dtype: int64

Lets see how many people were casted in all these Jason Bourne movies

%%time
cast[(cast['title'].str.contains("Bourne"))].groupby(['year', 'title']).size()
CPU times: user 1.31 s, sys: 7.41 ms, total: 1.31 s
Wall time: 1.32 s





year  title                   
2002  The Bourne Identity          56
2004  The Bourne Supremacy         49
2007  The Bourne Ultimatum         70
2009  The Mel Bourne Ultimatum      8
2012  The Bourne Legacy           122
2016  Jason Bourne                259
dtype: int64

Among these casts, how many were actors and actresses?

%%time 
cast[(cast['title'].str.contains("Bourne"))].groupby(['year', 'title', 'type']).size()
CPU times: user 1.34 s, sys: 8.34 ms, total: 1.35 s
Wall time: 1.36 s





year  title                     type   
2002  The Bourne Identity       actor       50
                                actress      6
2004  The Bourne Supremacy      actor       38
                                actress     11
2007  The Bourne Ultimatum      actor       59
                                actress     11
2009  The Mel Bourne Ultimatum  actor        5
                                actress      3
2012  The Bourne Legacy         actor       86
                                actress     36
2016  Jason Bourne              actor      189
                                actress     70
dtype: int64

Lets plot the above and see the total number of roles based on gender

jason_cast = cast[(cast['title'].str.contains("Bourne"))]
jason_cast_gender = jason_cast[['year', 'type']].groupby(['year', 'type']).size().unstack()
print(jason_cast_gender)
jason_cast_gender.plot()
type  actor  actress
year                
2002     50        6
2004     38       11
2007     59       11
2009      5        3
2012     86       36
2016    189       70





<AxesSubplot:xlabel='year'>

png

Lets find out the entire cast of the The Bourne Ultimatum and print just the top 10 leads

cast[cast['title'] == "The Bourne Ultimatum"].sort_values(['n']).head(10)

title year name type character n
522776 The Bourne Ultimatum 2007 Matt Damon actor Jason Bourne 1.0
3613634 The Bourne Ultimatum 2007 Julia Stiles actress Nicky Parsons 2.0
2195770 The Bourne Ultimatum 2007 David Strathairn actor Noah Vosen 3.0
840581 The Bourne Ultimatum 2007 Scott Glenn actor Ezra Kramer 4.0
459423 The Bourne Ultimatum 2007 Paddy Considine actor Simon Ross 5.0
1872404 The Bourne Ultimatum 2007 Edgar (IV) Ram?rez actor Paz 6.0
728791 The Bourne Ultimatum 2007 Albert Finney actor Dr. Albert Hirsch 7.0
2552022 The Bourne Ultimatum 2007 Joan Allen actress Pam Landy 8.0
790556 The Bourne Ultimatum 2007 Tom Gallop actor Tom Cronin 9.0
1112789 The Bourne Ultimatum 2007 Corey Johnson actor Wills 10.0

Lets see in what months Matt Damon’s movies are most often released in the USA.

  • First, find the year and the title of unique movies starring Matt Damon were released
matt_movies = cast[cast['name'] == "Matt Damon"][['title', 'year']].drop_duplicates()

Now we have to re-index the release_dates dataframe with title and year keys

rel_dts = release_dates.set_index(['title', 'year']).sort_index()
  • Now, the ‘month’ part is present in date column present in release_dates dataframe so we have to combine the cast and release_dates dataframes for Matt Damon’s movies by some common index (in our case, we will use the common index as title and year since it is present in both the dataframes)
matt_movie_releases = matt_movies.join(rel_dts, on=['title', 'year'])
  • We only want movies released in the USA
matt_movie_releases = matt_movie_releases[matt_movie_releases['country'] == "USA"]
  • Now lets plot the month part.
matt_movie_releases['date'].dt.month.value_counts().sort_index().plot(kind='bar')
<AxesSubplot:>

png

Cool, Now lets see when the Bourne movie series were released in different countries.

Since there are is a huge list of countries, lets just select “USA”, “UK” and “India”

countries = ["USA", "UK", "Australia"]
matt_movie_releases = matt_movies.join(rel_dts, on=['title', 'year'])
matt_movie_countries = matt_movie_releases[matt_movie_releases['country'].str.contains('|'.join(countries))]
matt_movie_countries.set_index(['title', 'country'])[['date']].unstack()

date
country Australia UK USA
title
All the Pretty Horses 2001-05-10 2001-05-25 2000-12-25
Chasing Amy 1997-07-03 1997-11-14 1997-04-18
Che: Part Two 2009-10-01 2009-02-20 2009-01-24
Confessions of a Dangerous Mind 2003-07-24 2003-03-14 2003-01-24
Contagion 2011-10-20 2011-10-21 2011-09-09
Courage Under Fire 1996-09-12 1996-10-04 1996-07-12
Dogma 2000-02-03 1999-12-26 1999-11-12
Downsizing 2017-12-26 2018-01-24 2017-12-22
Elysium 2013-08-15 2013-08-21 2013-08-09
EuroTrip 2004-08-12 2004-06-25 2004-02-20
Field of Dreams 1989-08-31 1989-11-24 1989-05-05
Finding Forrester 2001-03-29 2001-02-23 2001-01-12
Gake no ue no Ponyo 2009-08-27 2010-02-12 2009-08-14
Geronimo: An American Legend 1994-05-26 1994-10-14 1993-12-10
Gerry NaT 2003-08-22 NaT
Glory Daze NaT NaT 1996-09-27
Good Will Hunting 1998-03-12 1998-03-06 1998-01-09
Green Zone 2010-03-11 2010-03-12 2010-03-12
Happy Feet Two 2011-12-26 2011-12-02 2011-11-18
Hereafter NaT 2011-01-28 2010-10-22
Interstellar 2014-11-06 2014-11-07 2014-11-07
Invictus 2010-01-21 2010-02-05 2009-12-11
Jason Bourne 2016-07-28 2016-07-27 2016-07-29
Jay and Silent Bob Strike Back 2002-01-31 2001-11-30 2001-08-24
Jersey Girl 2004-08-12 2004-06-18 2004-03-26
Margaret 2012-06-14 NaT NaT
Mystic Pizza 1989-08-03 1990-01-05 1988-10-21
Ocean's Eleven 2002-01-10 2002-02-15 2001-12-07
Ocean's Thirteen 2007-06-14 2007-06-08 2007-06-08
Ocean's Twelve 2004-12-09 2005-02-04 2004-12-10
Promised Land 2013-05-02 2013-04-19 2013-01-04
Rounders 1999-01-28 1998-11-20 1998-09-11
Saving Private Ryan 1998-11-19 1998-09-11 1998-07-24
School Ties 1993-06-10 NaT 1992-09-18
Spirit: Stallion of the Cimarron 2002-06-20 2002-07-05 2002-05-24
Stuck on You 2004-02-12 2004-01-02 2003-12-12
Suburbicon 2017-10-26 2017-11-24 2017-10-27
Syriana 2006-02-16 2006-03-03 2005-12-09
The Adjustment Bureau NaT 2011-03-04 2011-03-04
The Bourne Identity 2002-08-22 2002-09-06 2002-06-14
The Bourne Supremacy 2004-08-26 2004-08-13 2004-07-23
The Bourne Ultimatum 2007-08-30 2007-08-17 2007-08-03
The Brothers Grimm 2005-11-24 2005-11-04 2005-08-26
The Departed 2006-10-12 2006-10-06 2006-10-06
The Good Mother 1989-05-04 1989-02-17 1988-11-04
The Good Shepherd 2007-02-15 2007-02-23 2006-12-22
The Great Wall 2017-02-16 2017-02-17 2017-02-17
The Informant! 2009-12-03 2009-11-20 2009-09-18
The Legend of Bagger Vance 2001-02-08 2001-02-23 2000-11-03
The Majestic 2002-05-16 2002-05-24 2001-12-21
The Martian 2015-09-30 2015-09-30 2015-10-02
The Monuments Men 2014-03-13 2014-02-14 2014-02-07
The Rainmaker 1998-01-22 1998-04-03 1997-11-21
The Talented Mr. Ripley 2000-02-24 2000-02-25 1999-12-25
The Zero Theorem 2014-05-15 2014-03-14 NaT
Titan A.E. 2001-01-04 2000-07-28 2000-06-16
True Grit 2011-01-26 2011-02-11 2010-12-22
We Bought a Zoo 2011-12-26 2012-03-16 2011-12-23
Youth Without Youth 2008-11-20 2007-12-14 NaT

To match the country column against all the elements of the list, we use the | (OR) operator. It is considered as a Regular Expression.

We can also create a Pivot table to provide the above output.

  • The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.
  • Think of it as a multi-dimensional GroupBy function
countries = ["USA", "UK", "Australia"]
matt_movie_releases = matt_movies.join(rel_dts, on=['title', 'year'])
matt_movie_countries = matt_movie_releases[matt_movie_releases['country'].str.contains('|'.join(countries))]
matt_movie_countries.pivot(index='title', columns='country', values='date')

country Australia UK USA
title
All the Pretty Horses 2001-05-10 2001-05-25 2000-12-25
Chasing Amy 1997-07-03 1997-11-14 1997-04-18
Che: Part Two 2009-10-01 2009-02-20 2009-01-24
Confessions of a Dangerous Mind 2003-07-24 2003-03-14 2003-01-24
Contagion 2011-10-20 2011-10-21 2011-09-09
Courage Under Fire 1996-09-12 1996-10-04 1996-07-12
Dogma 2000-02-03 1999-12-26 1999-11-12
Downsizing 2017-12-26 2018-01-24 2017-12-22
Elysium 2013-08-15 2013-08-21 2013-08-09
EuroTrip 2004-08-12 2004-06-25 2004-02-20
Field of Dreams 1989-08-31 1989-11-24 1989-05-05
Finding Forrester 2001-03-29 2001-02-23 2001-01-12
Gake no ue no Ponyo 2009-08-27 2010-02-12 2009-08-14
Geronimo: An American Legend 1994-05-26 1994-10-14 1993-12-10
Gerry NaT 2003-08-22 NaT
Glory Daze NaT NaT 1996-09-27
Good Will Hunting 1998-03-12 1998-03-06 1998-01-09
Green Zone 2010-03-11 2010-03-12 2010-03-12
Happy Feet Two 2011-12-26 2011-12-02 2011-11-18
Hereafter NaT 2011-01-28 2010-10-22
Interstellar 2014-11-06 2014-11-07 2014-11-07
Invictus 2010-01-21 2010-02-05 2009-12-11
Jason Bourne 2016-07-28 2016-07-27 2016-07-29
Jay and Silent Bob Strike Back 2002-01-31 2001-11-30 2001-08-24
Jersey Girl 2004-08-12 2004-06-18 2004-03-26
Margaret 2012-06-14 NaT NaT
Mystic Pizza 1989-08-03 1990-01-05 1988-10-21
Ocean's Eleven 2002-01-10 2002-02-15 2001-12-07
Ocean's Thirteen 2007-06-14 2007-06-08 2007-06-08
Ocean's Twelve 2004-12-09 2005-02-04 2004-12-10
Promised Land 2013-05-02 2013-04-19 2013-01-04
Rounders 1999-01-28 1998-11-20 1998-09-11
Saving Private Ryan 1998-11-19 1998-09-11 1998-07-24
School Ties 1993-06-10 NaT 1992-09-18
Spirit: Stallion of the Cimarron 2002-06-20 2002-07-05 2002-05-24
Stuck on You 2004-02-12 2004-01-02 2003-12-12
Suburbicon 2017-10-26 2017-11-24 2017-10-27
Syriana 2006-02-16 2006-03-03 2005-12-09
The Adjustment Bureau NaT 2011-03-04 2011-03-04
The Bourne Identity 2002-08-22 2002-09-06 2002-06-14
The Bourne Supremacy 2004-08-26 2004-08-13 2004-07-23
The Bourne Ultimatum 2007-08-30 2007-08-17 2007-08-03
The Brothers Grimm 2005-11-24 2005-11-04 2005-08-26
The Departed 2006-10-12 2006-10-06 2006-10-06
The Good Mother 1989-05-04 1989-02-17 1988-11-04
The Good Shepherd 2007-02-15 2007-02-23 2006-12-22
The Great Wall 2017-02-16 2017-02-17 2017-02-17
The Informant! 2009-12-03 2009-11-20 2009-09-18
The Legend of Bagger Vance 2001-02-08 2001-02-23 2000-11-03
The Majestic 2002-05-16 2002-05-24 2001-12-21
The Martian 2015-09-30 2015-09-30 2015-10-02
The Monuments Men 2014-03-13 2014-02-14 2014-02-07
The Rainmaker 1998-01-22 1998-04-03 1997-11-21
The Talented Mr. Ripley 2000-02-24 2000-02-25 1999-12-25
The Zero Theorem 2014-05-15 2014-03-14 NaT
Titan A.E. 2001-01-04 2000-07-28 2000-06-16
True Grit 2011-01-26 2011-02-11 2010-12-22
We Bought a Zoo 2011-12-26 2012-03-16 2011-12-23
Youth Without Youth 2008-11-20 2007-12-14 NaT

Do you know when are the most Action movies released in the USA?

action_usa = release_dates[(release_dates['title'].str.contains('Action')) & 
                           (release_dates['country'] == "USA")]
action_usa['date'].dt.dayofweek.value_counts().sort_index().plot(kind='bar')
<AxesSubplot:>

png

In which months are Matt Damon’s movies generally released in the USA?

matt_movies = cast[cast['name'] == 'Matt Damon']
matt_movies_usa = matt_movies.merge(release_dates[release_dates.country == 'USA']).sort_values('date')
matt_movies_usa.date.dt.month.value_counts().sort_index().plot(kind='bar')
<AxesSubplot:>

png

The above examples should give you some idea about the importance of Pandas and how its high level functions mask the complex computation that is performed on the underlying Numpy arrays.

This is by no means an exhaustive list of all the functions. We have barely scratched the surface.

Remember –

The only way to become a master of something is to be really With It!

So keep practicing and whenever you are stuck:

  • Go through the official documentation.
  • Enter the object name and Press <TAB> or .? and Jupyter will show you the docstring.
  • Don’t trust the examples blindly. Run them, modify them, make mistakes and then rectify them.
  • Don’t simply copy the StackOverflow or StackExchange or answers from anywhere else. Understand the solution that you find on such sites and then and only then use it in your code.

If you want more tutorials/ cookbooks, take a look at

Mohit Sharma
Mohit Sharma
Senior Software Development Engineer, DevOps

DevOps engineer with a strong Linux background and over a decade of experience designing, automating and managing mission critical infrastructure deployments by leveraging SRE principles and other DevOps processes. Expert in scripting using python with an emphasis on real-time, high speed data pipelines and distributed computing across networks.