05 - 01 Pandas Series and Dataframes

Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for practical, real-world data analysis in Python. To get started with pandas, you need to get comfortable with its two workhorse data structures: Series and DataFrame.

Series

A pandas Series is a one-dimensional array-like object that holds a sequence of values together with an associated index of labels. In fact, if you check the type of a Series object's values attribute, you will see that it is a numpy.ndarray.

You can assign a name to a pandas Series.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-darkgrid')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8-darkgrid'
In [2]:
ob = pd.Series([8,7,6,5], name='test_data')
print('Name: ',ob.name)
print('Data:\n',ob)
print('Type of Object: ',type(ob))
print('Type of elements:',type(ob.values))
Name:  test_data
Data:
 0    8
1    7
2    6
3    5
Name: test_data, dtype: int64
Type of Object:  <class 'pandas.core.series.Series'>
Type of elements: <class 'numpy.ndarray'>

You can also convert an existing NumPy array to a Series.

In [3]:
# integers from 5 to 8 (reversed)
ob = pd.Series(np.linspace(5, 8, num=4, dtype=int)[::-1])
print(ob)
print(type(ob))
0    8
1    7
2    6
3    5
dtype: int64
<class 'pandas.core.series.Series'>

You can also give the values a custom index and, just as with a NumPy array, access them through that index.

In [4]:
ob = pd.Series([8,7,6,5], index=['a','b','c','d'])
print(ob['b'])
7

A pandas Series behaves much like a fixed-size, ordered dictionary: the mapping from index to value is preserved when array operations are applied to it. For example,

In [5]:
# select all the values greater than 4 and less than 8
print(ob[(ob>4) & (ob<8)])
b    7
c    6
d    5
dtype: int64

This also means that if you have a dictionary, you can easily convert it into a pandas Series.

In [6]:
states_dict = {'State1': 'Alabama', 
               'State2': 'California', 
               'State3': 'New Jersey', 
               'State4': 'New York'}
ob = pd.Series(states_dict)
print(ob)
print(type(ob))
State1       Alabama
State2    California
State3    New Jersey
State4      New York
dtype: object
<class 'pandas.core.series.Series'>

You can also change the index by assigning to it:

In [7]:
ob.index = ['AL','CA','NJ','NY']
print(ob)
AL       Alabama
CA    California
NJ    New Jersey
NY      New York
dtype: object

You can also use the dictionary-style get method to look up a value by label, with a default for missing labels:

In [8]:
ob.get('CA', np.nan)
Out[8]:
'California'
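
As a small sketch of this dictionary-like behaviour (the label 'TX' is a made-up missing key, used only for illustration), get returns the supplied default instead of raising a KeyError, and the in operator tests the index labels rather than the values:

```python
import numpy as np
import pandas as pd

states = pd.Series({'AL': 'Alabama', 'CA': 'California'})

# A label that exists returns its value.
found = states.get('CA', np.nan)

# 'TX' is a hypothetical missing label: get() falls back to the default.
missing = states.get('TX', 'unknown')

# Membership tests look at the index labels, not at the values.
has_label = 'CA' in states
has_value = 'California' in states

print(found, missing, has_label, has_value)
```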

Dataframe

A DataFrame is something like a spreadsheet or an SQL table. It is a two-dimensional labeled data structure whose columns can have different data types. Like Series, DataFrame accepts many different kinds of input, including a dict of Series, arrays, or lists, a list of dicts, a 2-D NumPy array, and another DataFrame.

Compared with other such DataFrame-like structures you may have used before (like R’s data.frame), row-oriented and column-oriented operations in DataFrame are treated roughly symmetrically. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays.

Creating Dataframes from dictionaries

In [9]:
data = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
In [10]:
df = pd.DataFrame(data)
print('Dataframe:\n',df)
print('Type of Object:',type(df))
print('Type of elements:',type(df.values))
Dataframe:
    one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
Type of Object: <class 'pandas.core.frame.DataFrame'>
Type of elements: <class 'numpy.ndarray'>

Another way to construct dataframe from dictionaries is by using DataFrame.from_dict function. DataFrame.from_dict takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like the DataFrame constructor except for the orient parameter which is 'columns' by default, but which can be set to 'index' in order to use the dict keys as row labels.
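
A minimal sketch of the orient parameter, using a made-up dict of dicts (the names records, by_columns, and by_index are only for illustration):

```python
import pandas as pd

records = {'a': {'one': 1.0, 'two': 1.0},
           'b': {'one': 2.0, 'two': 2.0}}

# Default orient='columns': the outer keys become column labels.
by_columns = pd.DataFrame.from_dict(records)

# orient='index': the outer keys become row labels instead.
by_index = pd.DataFrame.from_dict(records, orient='index')

print(by_columns)
print(by_index)
```

With orient='index' the result is the transpose of the default layout.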

Just like Series, you can access index, values and also columns.

In [11]:
print('Index: ',df.index)
print('Columns: ',df.columns)
print('Values of Column one: ',df['one'].values)
print('Values of Column two: ',df['two'].values)
Index:  Index(['a', 'b', 'c', 'd'], dtype='object')
Columns:  Index(['one', 'two'], dtype='object')
Values of Column one:  [  1.   2.   3.  nan]
Values of Column two:  [ 1.  2.  3.  4.]

Creating dataframe from list of dictionaries

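When you build a DataFrame from a list of dictionaries, each dictionary becomes a row and the keys become the columns. A value that is None, or a key that is missing from one of the dictionaries, appears as NaN in the result.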

In [12]:
df2 = pd.DataFrame([{'a': 1, 'b': 2, 'c':3, 'd':None}, 
                    {'a': 2, 'b': 2, 'c': 3, 'd': 4}],
                   index=['one', 'two'])
print('Dataframe: \n',df2)

# Of course, you can also transpose the result:
print('Transposed Dataframe: \n',df2.T)
Dataframe: 
      a  b  c    d
one  1  2  3  NaN
two  2  2  3  4.0
Transposed Dataframe: 
    one  two
a  1.0  2.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

Assigning a column that doesn’t exist will create a new column.

In [23]:
df['three'] = None
print('Added third column: \n',df)

# The del keyword can be used to delete columns:
del df['three']
print('\nDeleted third column: \n',df)
# You can also use df.drop(). We shall see that later
Added third column: 
    one  two three
a  1.0  1.0  None
b  2.0  2.0  None
c  3.0  3.0  None
d  NaN  4.0  None

Deleted third column: 
    one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

Each Index has a number of methods and properties for set logic and answering other common questions about the data it contains.

Method                   Description
append                   Concatenate with additional Index objects, producing a new Index
difference               Compute set difference as an Index
intersection             Compute set intersection
union                    Compute set union
isin                     Compute boolean array indicating whether each value is contained in the passed collection
delete                   Compute new Index with element at index i deleted
drop                     Compute new Index by deleting passed values
insert                   Compute new Index by inserting element at index i
is_monotonic_increasing  Returns True if each element is greater than or equal to the previous element
is_unique                Returns True if the Index has no duplicate values
unique                   Compute the array of unique values in the Index

For example:

In [15]:
print(1 in df.one.values)
print('one' in df.columns)
True
True
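
The set-logic methods can be sketched on two small, made-up Index objects (left and right are illustrative names, not part of the example above):

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c', 'd'])
right = pd.Index(['c', 'd', 'e'])

common = left.intersection(right)    # labels present in both
combined = left.union(right)         # labels present in either
only_left = left.difference(right)   # labels in left but not in right
mask = left.isin(['a', 'd'])         # boolean NumPy array, one entry per label

print(list(common), list(combined), list(only_left), mask.tolist())
print(left.is_unique, left.is_monotonic_increasing)
```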

Reindexing

A critical method on pandas objects is reindex, which means to create a new object with the data conformed to a new index.

In [16]:
data = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
print(df)
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
In [17]:
# Reindex in descending order.
print(df.reindex(['d','c','b','a']))
   one  two
d  NaN  4.0
c  3.0  3.0
b  2.0  2.0
a  1.0  1.0

If you reindex with labels that are not present in the dataframe, the result contains new rows for those labels, filled with NaN.

In [18]:
print(df.reindex(['a','b','c','d','e']))
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
e  NaN  NaN

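Reindexing can also fill the labels it introduces with a value of your choice through fill_value. Only the newly introduced labels are filled: a NaN that was already in the data, like column one at row d in our case, is left untouched.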

In [19]:
df.reindex(['a','b','c','d','e'], fill_value=0)
# df.loc['d', 'one'] stays NaN: fill_value only fills labels added by the reindex
Out[19]:
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
e 0.0 0.0

For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill which forward fills the values:

In [20]:
df.reindex(['a','b','c','d','e'], method='ffill')
Out[20]:
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
e NaN 4.0

The method (interpolation) option accepts two basic kinds of fill:

Method             Description
ffill or pad       Fill (or carry) values forward
bfill or backfill  Fill (or carry) values backward

The reindex method accepts the following arguments:

Argument    Description
index       New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying
method      Interpolation (fill) method, see above table for options.
fill_value  Substitute value to use when introducing missing data by reindexing.
limit       When forward- or backfilling, maximum size gap to fill
level       Match simple Index on level of MultiIndex, otherwise select subset of
copy        If False, do not copy the underlying data when the new index is equivalent to the old index. True by default (i.e. always copy data)
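
As a rough sketch of how method and limit interact, the example below uses a made-up Series with gaps in an integer index (the data and names are assumptions for illustration). With limit=1, each value is carried forward by at most one position:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0], index=[0, 3, 6])

# Forward fill while reindexing, but carry each value at most one step.
filled = s.reindex(range(8), method='ffill', limit=1)

print(filled)
```

Labels 1, 4, and 7 take the preceding value, while labels 2 and 5 are two steps away from the last known value and stay NaN.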

Dropping Entries

The drop method returns a new object with the given labels removed from an axis, so dropping one or more entries only requires a list of the labels to remove.

In [21]:
# Drop row c and row a
df.drop(['c', 'a'])
Out[21]:
one two
b 2.0 2.0
d NaN 4.0
In [24]:
# Drop column two
df.drop(['two'], axis=1)
Out[24]:
one
a 1.0
b 2.0
c 3.0
d NaN
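
A short sketch on a small made-up frame: the index= and columns= keywords of drop are an alternative to the axis argument, and the original object is not modified.

```python
import pandas as pd

frame = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
                     index=['a', 'b', 'c'])

# index= / columns= are equivalent to axis=0 / axis=1.
fewer_rows = frame.drop(index=['a'])
fewer_cols = frame.drop(columns=['two'])

# drop returns a new object; frame itself keeps all rows and columns.
print(frame.shape, fewer_rows.shape, fewer_cols.shape)
```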

Indexing, selection, Sorting and filtering

Series indexing works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers.

In [27]:
print("Dataframe: \n",df)
# Selecting only column `one` for rows 'a' and 'd'
df['one'][['a', 'd']]
Dataframe: 
    one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
Out[27]:
a    1.0
d    NaN
Name: one, dtype: float64
In [28]:
# Slicing column `one` from row 'b' to row 'd'
df['one']['b':'d']
Out[28]:
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

As the command above shows, slicing with labels behaves differently from normal Python slicing in that the endpoint is inclusive.
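
The difference can be sketched side by side, using a made-up Series and a plain Python list holding the same labels:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])
labels = ['a', 'b', 'c', 'd']

by_label = s['b':'d']    # label slice: the endpoint 'd' is included
by_position = labels[1:3]  # plain Python slice: the endpoint 3 is excluded

print(list(by_label.index))
print(by_position)
```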

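For label-based indexing on the rows of a DataFrame, use the special indexing field loc; its counterpart iloc selects by integer position. (The older ix field, which mixed both styles, was deprecated in pandas 0.20 and removed in pandas 1.0.) These fields let you select a subset of the rows and columns from a DataFrame with NumPy-like notation plus axis labels. It is a less verbose way to do the reindexing.

In [31]:
# Position-based selection: rows 0 and 2, column 0
df.iloc[[0, 2], [0]]
Out[31]:
one
a 1.0
c 3.0
In [32]:
df.loc[['a', 'c'], ['one']]
Out[32]:
one
a 1.0
c 3.0
In [33]:
df.loc[df.one > 1]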
Out[33]:
one two
b 2.0 2.0
c 3.0 3.0

There are many ways to select and rearrange the data contained in a pandas object. Some indexing options are listed in the table below:

Indexing Type       Description
df[val]             Select a single column or sequence of columns from the DataFrame. Special-case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion).
df.loc[val]         Select a single row or subset of rows by label.
df.loc[:, val]      Select a single column or subset of columns by label.
df.loc[val1, val2]  Select both rows and columns by label.
df.iloc[...]        Make the same selections by integer position (replaces the removed ix, icol, and irow).
reindex method      Conform one or more axes to new indexes.
xs method           Select a single row or column as a Series by label.
at, iat             Select a single value by row and column label (at) or by integer position (iat); these replace the removed get_value and set_value methods.
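
A brief sketch of the label-based and position-based accessors on a small made-up frame (all names below are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
                     index=['a', 'b', 'c'])

row = frame.loc['b']                            # one row, returned as a Series
subset = frame.loc[frame['one'] > 1, ['two']]   # boolean row filter plus column list
same_subset = frame.iloc[1:, [1]]               # the same cells by integer position

cell_by_label = frame.at['c', 'one']            # single value by label
cell_by_position = frame.iat[0, 1]              # single value by position

print(row, subset, cell_by_label, cell_by_position, sep='\n')
```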

You can sort a DataFrame or Series with its built-in methods. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:

In [36]:
dt = pd.Series(np.random.randint(3, 10, size=7), 
               index=['g','c','a','b','e','d','f'])
print('Original Data: \n', dt, end="\n\n")
print('Sorted by Index: \n',dt.sort_index())
Original Data: 
 g    8
c    7
a    8
b    5
e    8
d    6
f    7
dtype: int64

Sorted by Index: 
 a    8
b    5
c    7
d    6
e    8
f    7
g    8
dtype: int64
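
On a DataFrame, sort_index can sort either axis. A minimal sketch with a made-up frame whose row and column labels are deliberately out of order:

```python
import pandas as pd

frame = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                     index=['y', 'x'], columns=['c', 'a', 'b'])

rows_sorted = frame.sort_index()                        # sort by row labels
cols_sorted = frame.sort_index(axis=1)                  # sort by column labels
cols_desc = frame.sort_index(axis=1, ascending=False)   # descending column order

print(rows_sorted, cols_sorted, cols_desc, sep='\n\n')
```

The values travel with their labels, so only the order of rows or columns changes.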

Data alignment and arithmetic

In arithmetic operations, DataFrame objects automatically align on both the columns and the index (row labels). The resulting object has the union of the column and row labels.

In [38]:
df1 = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
print('df1:\n',df1, end="\n\n")
print('df2:\n',df2, end="\n\n")
print('Sum:\n',df1.add(df2))
df1:
           A         B         C         D
0 -0.362487 -2.112680 -0.806952 -1.108754
1  1.011245 -1.183287  0.143261 -0.648712
2 -0.607728  0.185564 -0.316349 -0.108025
3  1.133124  0.313431  0.480309 -0.435832
4 -0.138228  0.873961  0.354257  0.573475
5 -1.048813  0.907835 -0.858269 -0.193833
6  0.102733 -1.975696  0.538086  0.191236
7  0.438749 -1.705098 -0.642303 -0.843471
8  1.491922 -1.348542  0.520506  0.991557
9 -1.160722 -0.624619 -0.431036  1.218312

df2:
           A         B         C
0 -1.512020  0.487113  1.279393
1  1.291985  1.842381 -1.222940
2 -0.111107 -0.670234 -0.554140
3  1.614198  0.863894 -0.771780
4 -0.600226 -1.377429 -1.372277
5  1.276203  2.589817 -0.938377
6 -0.532945  1.564143 -0.185730

Sum:
           A         B         C   D
0 -1.874507 -1.625566  0.472441 NaN
1  2.303230  0.659094 -1.079679 NaN
2 -0.718835 -0.484670 -0.870488 NaN
3  2.747322  1.177326 -0.291471 NaN
4 -0.738454 -0.503468 -1.018020 NaN
5  0.227390  3.497653 -1.796646 NaN
6 -0.430212 -0.411553  0.352356 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN

Note that in arithmetic operations between differently-indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other:

In [39]:
print('Sum:\n',df1.add(df2, fill_value=0))
Sum:
           A         B         C         D
0 -1.874507 -1.625566  0.472441 -1.108754
1  2.303230  0.659094 -1.079679 -0.648712
2 -0.718835 -0.484670 -0.870488 -0.108025
3  2.747322  1.177326 -0.291471 -0.435832
4 -0.738454 -0.503468 -1.018020  0.573475
5  0.227390  3.497653 -1.796646 -0.193833
6 -0.430212 -0.411553  0.352356  0.191236
7  0.438749 -1.705098 -0.642303 -0.843471
8  1.491922 -1.348542  0.520506  0.991557
9 -1.160722 -0.624619 -0.431036  1.218312

Similarly, you can perform subtraction, multiplication, and division.
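
A small sketch with the sub and mul methods on two made-up frames of different shapes. Note that a cell missing from both operands stays NaN even when fill_value is given:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})
b = pd.DataFrame({'A': [10.0, 20.0, 30.0]})

# Cells missing from one operand are treated as fill_value.
diff = a.sub(b, fill_value=0)
prod = a.mul(b, fill_value=1)

print(diff)
print(prod)
```

Row 2 of column B exists in neither frame, so it is NaN in both results.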

When doing an operation between DataFrame and Series, the default behavior is to align the Series index on the DataFrame columns, thus broadcasting (just like in numpy) row-wise.

In [45]:
print("Dataframe: \n", df1, end="\n\n")
print("Operand (0th row): \n", df1.loc[0], end="\n\n")
print('Subtraction: \n',df1.sub(df1.loc[0]))
Dataframe: 
           A         B         C         D
0 -0.362487 -2.112680 -0.806952 -1.108754
1  1.011245 -1.183287  0.143261 -0.648712
2 -0.607728  0.185564 -0.316349 -0.108025
3  1.133124  0.313431  0.480309 -0.435832
4 -0.138228  0.873961  0.354257  0.573475
5 -1.048813  0.907835 -0.858269 -0.193833
6  0.102733 -1.975696  0.538086  0.191236
7  0.438749 -1.705098 -0.642303 -0.843471
8  1.491922 -1.348542  0.520506  0.991557
9 -1.160722 -0.624619 -0.431036  1.218312

Operand (0th row): 
 A   -0.362487
B   -2.112680
C   -0.806952
D   -1.108754
Name: 0, dtype: float64

Subtraction: 
           A         B         C         D
0  0.000000  0.000000  0.000000  0.000000
1  1.373732  0.929393  0.950213  0.460042
2 -0.245241  2.298244  0.490603  1.000729
3  1.495611  2.426111  1.287261  0.672921
4  0.224258  2.986641  1.161209  1.682229
5 -0.686326  3.020515 -0.051317  0.914921
6  0.465220  0.136984  1.345038  1.299990
7  0.801236  0.407582  0.164649  0.265283
8  1.854409  0.764138  1.327459  2.100311
9 -0.798235  1.488061  0.375916  2.327066

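With time series data, the DataFrame index usually contains dates. Very old pandas versions broadcast column-wise automatically in that case, but that special case was deprecated and later removed; to broadcast column-wise you now pass axis=0 to the arithmetic method, for example df1.sub(df1['A'], axis=0). A date index can be set like this: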

In [48]:
ind1 = pd.date_range('06/1/2017', periods=10)
df1.set_index(ind1)
Out[48]:
A B C D
2017-06-01 -0.362487 -2.112680 -0.806952 -1.108754
2017-06-02 1.011245 -1.183287 0.143261 -0.648712
2017-06-03 -0.607728 0.185564 -0.316349 -0.108025
2017-06-04 1.133124 0.313431 0.480309 -0.435832
2017-06-05 -0.138228 0.873961 0.354257 0.573475
2017-06-06 -1.048813 0.907835 -0.858269 -0.193833
2017-06-07 0.102733 -1.975696 0.538086 0.191236
2017-06-08 0.438749 -1.705098 -0.642303 -0.843471
2017-06-09 1.491922 -1.348542 0.520506 0.991557
2017-06-10 -1.160722 -0.624619 -0.431036 1.218312
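
A minimal sketch of column-wise broadcasting on a date-indexed frame (the frame ts and its values are made up for illustration). Passing axis=0 matches the Series index against the row labels:

```python
import pandas as pd

dates = pd.date_range('2017-06-01', periods=3)
ts = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]},
                  index=dates)

# axis=0 aligns ts['A'] with the row index, so column A is
# subtracted from every column, row by row.
relative = ts.sub(ts['A'], axis=0)

print(relative)
```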

Using Numpy functions on DataFrame

Elementwise NumPy ufuncs such as log, exp, and sqrt, as well as various other NumPy functions, can be used on a DataFrame.

In [49]:
np.abs(df1)
Out[49]:
A B C D
0 0.362487 2.112680 0.806952 1.108754
1 1.011245 1.183287 0.143261 0.648712
2 0.607728 0.185564 0.316349 0.108025
3 1.133124 0.313431 0.480309 0.435832
4 0.138228 0.873961 0.354257 0.573475
5 1.048813 0.907835 0.858269 0.193833
6 0.102733 1.975696 0.538086 0.191236
7 0.438749 1.705098 0.642303 0.843471
8 1.491922 1.348542 0.520506 0.991557
9 1.160722 0.624619 0.431036 1.218312
In [50]:
# Convert to numpy array
np.asarray(df1)
Out[50]:
array([[-0.36248689, -2.11267995, -0.80695205, -1.1087539 ],
       [ 1.01124514, -1.18328675,  0.14326111, -0.64871176],
       [-0.60772766,  0.18556379, -0.31634858, -0.10802532],
       [ 1.13312415,  0.31343103,  0.48030905, -0.43583243],
       [-0.13822844,  0.87396131,  0.35425682,  0.5734751 ],
       [-1.04881286,  0.90783519, -0.85826904, -0.19383257],
       [ 0.10273273, -1.9756959 ,  0.53808642,  0.19123646],
       [ 0.4387492 , -1.7050977 , -0.64230271, -0.84347095],
       [ 1.4919221 , -1.34854196,  0.52050645,  0.99155684],
       [-1.16072188, -0.62461892, -0.43103621,  1.21831203]])

Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this:

In [52]:
def fn(x):
    """
    Get max and min of the columns
    """
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

df1.apply(fn)
Out[52]:
A B C D
min -1.160722 -2.112680 -0.858269 -1.108754
max 1.491922 0.907835 0.538086 1.218312
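
By default apply passes each column to the function; with axis=1 it passes each row instead. A short sketch on a made-up frame, where value_range is a hypothetical helper:

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 5.0], 'B': [4.0, 2.0]})

def value_range(x):
    """Spread between the largest and smallest value of a 1D Series."""
    return x.max() - x.min()

per_column = frame.apply(value_range)          # one result per column
per_row = frame.apply(value_range, axis=1)     # one result per row

print(per_column)
print(per_row)
```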

Element-wise Python functions can be used, too. Suppose you wanted to format the dataframe elements as strings with a precision of three decimal places. You can do this with applymap:

In [53]:
fmt = lambda x: "{:.3f}".format(x)
df1.applymap(fmt)
Out[53]:
A B C D
0 -0.362 -2.113 -0.807 -1.109
1 1.011 -1.183 0.143 -0.649
2 -0.608 0.186 -0.316 -0.108
3 1.133 0.313 0.480 -0.436
4 -0.138 0.874 0.354 0.573
5 -1.049 0.908 -0.858 -0.194
6 0.103 -1.976 0.538 0.191
7 0.439 -1.705 -0.642 -0.843
8 1.492 -1.349 0.521 0.992
9 -1.161 -0.625 -0.431 1.218

The method is named applymap (rather than simply map) because a pandas Series already has a map method for applying an element-wise operation. Since pandas 2.1, DataFrame.map is available as well, and applymap is deprecated in its favor.
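
For comparison, a sketch of the Series map method applying the same kind of formatting function to a made-up Series:

```python
import pandas as pd

values = pd.Series([0.12345, 1.5], index=['x', 'y'])

# map applies the function to each element and keeps the index.
formatted = values.map(lambda x: "{:.3f}".format(x))

print(formatted)
```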
