Data Science for Beginners: Handling Missing Values with Pandas

Missing values can appear as ‘NaN’ (Not a Number), ‘NA’ (Not Available), ‘n/a’, ‘na’, ‘?’, a blank space, an out-of-range value and in many other forms depending on the user(s) filling in the data. In real datasets, missing values are almost unavoidable and they can be caused by several reasons like corrupted data or unrecorded observations.

Image for post

Image for post

Learning to handle missing values in a dataset is very important because most machine learning models cannot handle missing values. In this tutorial, you’ll learn how to handle missing values using Pandas. Let’s get started!

Identifying Missing Values

To handle missing data, we need to first identify them. Pandas identifies some missing value forms by default as NaN values (e.g blank entry, ‘NA’, ‘null’, ‘nan’, ‘n/a’, ‘NULL’) but not some others. To make Pandas recognize other non-default missing value forms (e.g ‘?’, ’na’, ‘Nil’), we can make a list of them and pass them into Pandas’ .read_excel() method (since the dataset is an excel file in this case) as is done below:

import pandas as pd
pd.read_excel('Students_Scores.xlsx')

#Output
             Maths  English  Physics  Chemistry  Biology
Student ID
SA/2020/001   83.0      NaN     55.0       23.0     100.0
SA/2020/002     ?      Nil     76.0       43.0      32.0
SA/2020/003  NaN      NaN        ?       56.0      54.0
SA/2020/004   56.0     90.0        ?       76.0      88.0
SA/2020/005   76.0      Nil     76.0       54.0      43.0
SA/2020/006   32.0     83.0      NaN       21.0      21.0
SA/2020/007      ?     75.0      NaN       97.0      54.0
SA/2020/008     ?     43.0      NaN        NaN       NaN
SA/2020/009   NaN     32.0     43.0        NaN       NaN
SA/2020/010    NaN     12.0      Nil        NaN       NaN</span>

```missing_value_forms=['?','Nil']
df= pd.read_excel('Students_Scores.xlsx',na_values= missing_value_forms, index_col= 'Student ID')
df</span>

This gives an output such that all missing value forms used are now recognized by Pandas:

              Maths  English  Physics  Chemistry  Biology
Student ID
SA/2020/001   83.0      NaN     55.0       23.0     100.0
SA/2020/002    NaN      NaN     76.0       43.0      32.0
SA/2020/003    NaN      NaN      NaN       56.0      54.0
SA/2020/004   56.0     90.0      NaN       76.0      88.0
SA/2020/005   76.0      NaN     76.0       54.0      43.0
SA/2020/006   32.0     83.0      NaN       21.0      21.0
SA/2020/007    NaN     75.0      NaN       97.0      54.0
SA/2020/008    NaN     43.0      NaN        NaN       NaN
SA/2020/009    NaN     32.0     43.0        NaN       NaN
SA/2020/010    NaN     12.0      NaN        NaN       NaN</span>

When Pandas has been made to identify all missing value forms, we can then mark them out. To mark out missing values in a given dataset, Pandas isnull() and notnull() functions come in handy. Pandas isnull() marks all missing values as True while notnull() marks all missing values as False.

df.isnull()

# Output is:
             Maths  English  Physics Chemistry  Biology
Student ID    
SA/2020/001 False    True    False     False     False
SA/2020/002 True    True    False     False     False
SA/2020/003 True    True     True     False     False
SA/2020/004 False   False     True     False     False
SA/2020/005 False    True    False     False     False
SA/2020/006 False   False     True     False     False
SA/2020/007  True   False     True     False     False
SA/2020/008  True   False     True      True      True
SA/2020/009  True   False    False      True      True
SA/2020/010  True   False     True      True      True</span>

To check for missing values in the dataset or the number of missing values per column:

df.isnull().values.any()

# Output is:
True</span>

To check for the number of missing values in each column:

df.isnull().sum()

# Output is:
Maths         6
English       4
Physics       6
Chemistry     3
Biology       3
dtype: int64</span>

Ways of handling missing values

Missing values can be handled by:

1. Deletion

2. Imputation

3. Interpolation

Deletion

Here rows or columns containing missing values are deleted. This is however not a good way to handle missing values especially when there are many missing values in the dataframe as it leads to loss of data.

To delete rows or columns containing missing values, Pandas dropna() function is used:

#to drop rows with missing values
df.dropna(axis=0,inplace=True)

#to drop columns with missing values
df.dropna(axis=1,inplace=True)

**Note: axis=0 for rows, axis=1 for columns**</span>

Imputation

In this method, missing values are replaced by a constant value or they are replaced based on other observations in the dataset. Pandas fillna() function is used for imputing values. Imputation can be done by:

i. Replacing missing values with a predetermined constant value.

To fill all missing values in the dataframe with a constant:

df.fillna(0, inplace=True)
#Output is:
             Maths  English  Physics  Chemistry  Biology
Student ID     
SA/2020/001   83.0      0.0    55.0      23.0      100.0
SA/2020/002    0.0      0.0    76.0      43.0       32.0 SA/2020/003    0.0      0.0     0.0      56.0       54.0
SA/2020/004   56.0     90.0     0.0      76.0       88.0
SA/2020/005   76.0      0.0    76.0      54.0       43.0
SA/2020/006   32.0     83.0     0.0      21.0       21.0
SA/2020/007    0.0     75.0     0.0      97.0       54.0
SA/2020/008    0.0     43.0     0.0       0.0        0.0
SA/2020/009    0.0     32.0    43.0       0.0        0.0
SA/2020/010    0.0     12.0     0.0       0.0        0.0</span>

To fill a particular column in the dataframe with a constant:

df['Biology'].fillna(50, inplace=True)

# Output is:
Student ID
SA/2020/001    100.0
SA/2020/002     32.0
SA/2020/003     54.0
SA/2020/004     88.0
SA/2020/005     43.0
SA/2020/006     21.0
SA/2020/007     54.0
SA/2020/008     50.0
SA/2020/009     50.0
SA/2020/010     50.0
Name: Biology, dtype: float64</span>

ii. replacing missing values in a column with values from previous rows (forward or backward).

Forward fill: this fills missing values with values from previous rows in a forward manner

#forward fill

# Output is:
Student ID
SA/2020/001    83.0
SA/2020/002    83.0
SA/2020/003    83.0
SA/2020/004    56.0
SA/2020/005    76.0
SA/2020/006    32.0
SA/2020/007    32.0
SA/2020/008    32.0
SA/2020/009    32.0
SA/2020/010    32.0
Name: Maths, dtype: float64</span>

Backward fill: this fills missing values using values from the later rows in a backward manner

#backward fill
df['English'].fillna(method='bfill')
# Output is:
Student ID
SA/2020/001    90.0
SA/2020/002    90.0
SA/2020/003    90.0
SA/2020/004    90.0
SA/2020/005    83.0
SA/2020/006    83.0
SA/2020/007    75.0
SA/2020/008    43.0
SA/2020/009    32.0
SA/2020/010    12.0
Name: English, dtype: float64</span>

iii. replacing missing values in a column with the mean, median or mode of that column.

Mean

mean_physics= df['Physics'].mean()
df['Physics'].fillna(int(mean_physics),inplace= True)
# Output is:
Student ID
SA/2020/001    55.0
SA/2020/002    76.0
SA/2020/003    62.0
SA/2020/004    62.0
SA/2020/005    76.0
SA/2020/006    62.0
SA/2020/007    62.0
SA/2020/008    62.0
SA/2020/009    43.0
SA/2020/010    62.0
Name: Physics, dtype: float64</span>

Median

median_english= df['English'].median()
df['English'].fillna(int(median_english),inplace= True)
#Output is:
Student ID
SA/2020/001    59.0
SA/2020/002    59.0
SA/2020/003    59.0
SA/2020/004    90.0
SA/2020/005    59.0
SA/2020/006    83.0
SA/2020/007    75.0
SA/2020/008    43.0
SA/2020/009    32.0
SA/2020/010    12.0
Name: English, dtype: float64</span>

Mode

mode_physics= df['Physics'].mode()
df['Physics'].fillna(int(mode_physics),inplace= True)
#Output is:
Student ID
SA/2020/001    55.0
SA/2020/002    76.0
SA/2020/003    76.0
SA/2020/004    76.0
SA/2020/005    76.0
SA/2020/006    76.0
SA/2020/007    76.0
SA/2020/008    76.0
SA/2020/009    43.0
SA/2020/010    76.0
Name: Physics, dtype: float64</span>

Interpolation

In this method, missing values are handled using Pandas interpolate() function. Interpolation can be done using different methods e.g linear, pad, nearest, quadratic method. Pandas however carries out linear interpolation by default.

#Linear interpolation
df['Chemistry'].fillna(df['Chemistry'].interpolate())
#Output is:
Student ID
SA/2020/001    23.0
SA/2020/002    43.0
SA/2020/003    56.0
SA/2020/004    76.0
SA/2020/005    54.0
SA/2020/006    21.0
SA/2020/007    97.0
SA/2020/008    97.0
SA/2020/009    97.0
SA/2020/010    97.0
Name: Chemistry, dtype: float64</span>

The way you choose to handle the missing values in your dataset will depend on the type and description of the dataset you are working on.

Voila! You are now ready to handle missing values.