Python cheatsheet for data analysis

Reading a file


import pandas as pd
dataFrame = pd.read_csv("./file.csv")

Reading a csv file, pandas assumes that the file will have the first row as header. If the file doesn’t contain header row, it can be read as:

dataFrame = pd.read_csv("./file.csv", header=None)

Peeking into data



     Id   Name           Marks     Percentage
0     1   Scott Summers    175           87.5 
1     2   Peter Parker      99           49.5


describe() works differently for numeric columns and categorical columns. If we apply describe() on Id column here is what it will look like:

dataFrame.Id.describe() OR dataFrame['Id'].describe()


count 2.000000
mean  1.500000
std   0.707107
min   1.000000
25%   1.250000
50%   1.500000
75%   1.750000
max   2.000000
Name: Id, dtype: float64

For numeric field, it shows total values (count), mean of all the values, standard deviation (std), minimum (min), maximum (max), 25, 50 and 75th percentile values to give an idea of the data for a particular attribute.

dataFrame.Name.describe() OR dataFrame['Name'].describe()


count     2
unique    2
top       Scott Summers
freq      1
Name: Name, dtype: object

For categorical fields, it shows total values, unique values and the one occurring maximum times along with the frequency.

Extracting and removing columns in data frame

Extracting numeric columns

import numpy as np
dataFrameNum = dataFrame.select_dtypes(include=[np.number])
print (dataFrameNum.columns)


Index(['Id', 'Marks', 'Percentage'], dtype='object')

Extracting categorical columns

dataFrameNonNum = dataFrame.select_dtypes(exclude=[np.number])
print (dataFrameNonNum.columns)


Index(['Name'], dtype='object')

Dropping Columns

dataFrameDroppedCols = dataFrame.drop(['Id'], axis=1)

Checking correlation between columns

corr = dataFrame.corr()
print (corr)


              Id    Marks    Percentage
Id           1.0     -1.0          -1.0
Marks       -1.0      1.0           1.0
Percentage  -1.0      1.0           1.0

It prints out a matrix showing correlation between every column. As percentage is calculated using marks and total marks, it shows perfect correlation (1.0) between them. Whereas -1.0 shows that there is absolutely no correlation between them.

In case there are many attributes, this matrix can be huge. We can sort the values to see the most correlated and least correlated attributes easily.

print (corr['Marks'].sort_values(ascending=False)[:5], '\n')
print (corr['Marks'].sort_values(ascending=False)[-5:], '\n')

Pivot Table for further correlation exploration

dataFrame.pivot_table(index='Marks', values='Percentage', aggfunc = np.median)

Visualising correlation between columns

import matplotlib.pyplot as plt'ggplot')
plt.rcParams['figure.figsize'] = (10, 6)

plt.hist(dataFrame.Marks, color='blue')

Plotting using data frame

dataFrame.plot(kind='bar', color='blue')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')

Finding and removing outliers

Scatter Plot to find outliers

plt.scatter(x=dataFrame['Marks'], y=dataFrame['Percentage'])
plt.ylabel('Y Label')
plt.xlabel('X Label')
plt.xlim(-100, 600)

Data Frame operations to remove outliers

dataFrame = dataFrame[dataFrame['Percentage'] > 50]

Encoding attribute values


Use value_counts on data frame series to see number of occurrences of each value

print (dataFrame.Marks.value_counts())


175      1
99       1
Name: Marks, dtype: int64

Encoding values

def encode_values(x):
    if x > 50.0:
       return 1
       return 0

dataFrame['enc_percentage'] = dataFrame.Percentage.apply(encode)

Handling null values


In case, the values are following an order (increasing, decreasing, time etc.), interpolation can be used to fill the missing values.

dataFrame['Id'] = dataFrame['Id'].interpolate(method='linear')

Drop null values

Drop columns with all values null

dataFrame.dropna(axis=1, how='all')

Drop columns with any value null

dataFrame.dropna(axis=1, how='any')

Drop rows with all value null

dataFrame.dropna(axis=0, how='all')

Drop rows with any value null

dataFrame.dropna(axis=0, how='any')