Python cheatsheet for data analysis

Reading a file

CSV

import pandas as pd
dataFrame = pd.read_csv("./file.csv")

When reading a CSV file, pandas assumes that the first row is the header. If the file doesn't contain a header row, it can be read as:

dataFrame = pd.read_csv("./file.csv", header=None)
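
You can also supply your own column names when there is no header row (the names below mirror the sample data used throughout this cheatsheet):

dataFrame = pd.read_csv("./file.csv", header=None, names=["Id", "Name", "Marks", "Percentage"])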

Peeking into data

dataFrame.head()

Output:

   Id           Name  Marks  Percentage
0   1  Scott Summers    175        87.5
1   2   Peter Parker     99        49.5
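
head() returns the first five rows by default; pass a count to change that, and use tail() for the last rows:

dataFrame.head(10)
dataFrame.tail()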

describe()

describe() works differently for numeric and categorical columns. Applying describe() to the Id column looks like this:

dataFrame.Id.describe() OR dataFrame['Id'].describe()

Output:

count 2.000000
mean  1.500000
std   0.707107
min   1.000000
25%   1.250000
50%   1.500000
75%   1.750000
max   2.000000
Name: Id, dtype: float64

For a numeric field, it shows the number of values (count), the mean of all values, the standard deviation (std), the minimum (min), the maximum (max), and the 25th, 50th and 75th percentile values, giving a quick picture of the distribution of that attribute.

dataFrame.Name.describe() OR dataFrame['Name'].describe()

Output:

count     2
unique    2
top       Scott Summers
freq      1
Name: Name, dtype: object

For a categorical field, it shows the number of values (count), the number of unique values (unique), the most frequent value (top) and how often it occurs (freq).
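
To get both kinds of summary in one table, ask describe() to include all columns:

dataFrame.describe(include='all')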

Extracting and removing columns in data frame

Extracting numeric columns

import numpy as np
dataFrameNum = dataFrame.select_dtypes(include=[np.number])
print(dataFrameNum.columns)

Output:

Index(['Id', 'Marks', 'Percentage'], dtype='object')

Extracting categorical columns

dataFrameNonNum = dataFrame.select_dtypes(exclude=[np.number])
print(dataFrameNonNum.columns)

Output:

Index(['Name'], dtype='object')

Dropping Columns

dataFrameDroppedCols = dataFrame.drop(['Id'], axis=1)
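
drop() accepts a list and returns a new data frame, so several columns can be removed at once (pass inplace=True to modify the original instead):

dataFrameDroppedCols = dataFrame.drop(['Id', 'Marks'], axis=1)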

Checking correlation between columns

corr = dataFrame.corr(numeric_only=True)  # numeric_only skips non-numeric columns such as Name
print(corr)

Output:

             Id  Marks  Percentage
Id          1.0   -1.0        -1.0
Marks      -1.0    1.0         1.0
Percentage -1.0    1.0         1.0

It prints a matrix showing the correlation between every pair of columns. As Percentage is calculated from Marks and the total marks, the two show a perfect positive correlation (1.0). A value of -1.0 indicates a perfect negative correlation (in this tiny sample, Id happens to increase exactly as Marks decrease); a value near 0 would mean no correlation at all.

In case there are many attributes, this matrix can be huge. We can sort the values to see the most correlated and least correlated attributes easily.

print(corr['Marks'].sort_values(ascending=False)[:5], '\n')
print(corr['Marks'].sort_values(ascending=False)[-5:], '\n')
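
With many attributes it can also help to eyeball the whole matrix as a heatmap; here is a minimal sketch using matplotlib's matshow:

import matplotlib.pyplot as plt

plt.matshow(corr)
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()
plt.show()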

Pivot Table for further correlation exploration

dataFrame.pivot_table(index='Marks', values='Percentage', aggfunc=np.median)
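
aggfunc also accepts a list (string aliases shown here) to compute several statistics at once:

dataFrame.pivot_table(index='Marks', values='Percentage', aggfunc=['median', 'mean'])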

Visualising data distributions

import matplotlib.pyplot as plt

plt.style.use(style='ggplot')
plt.rcParams['figure.figsize'] = (10, 6)

plt.hist(dataFrame.Marks, color='blue')
plt.show()


Plotting using data frame

dataFrame.plot(kind='bar', color='blue')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.xticks(rotation=0)
plt.show()
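
To plot one column against another instead of every numeric column, pass x and y explicitly:

dataFrame.plot(x='Name', y='Marks', kind='bar', color='blue')
plt.show()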

Finding and removing outliers

Scatter Plot to find outliers

plt.scatter(x=dataFrame['Marks'], y=dataFrame['Percentage'])
plt.ylabel('Y Label')
plt.xlabel('X Label')
plt.xlim(-100, 600)
plt.show()

Data Frame operations to remove outliers

dataFrame = dataFrame[dataFrame['Percentage'] > 50]
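
A hard-coded cut-off like this assumes you already know the data; a more generic sketch (the 1st/99th percentile bounds here are an arbitrary choice) trims by percentiles instead:

low, high = dataFrame['Marks'].quantile([0.01, 0.99])
dataFrame = dataFrame[dataFrame['Marks'].between(low, high)]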

Encoding attribute values

value_counts

Use value_counts on a data frame series to see the number of occurrences of each value:

print(dataFrame.Marks.value_counts())

Output:

175      1
99       1
Name: Marks, dtype: int64

Encoding values

def encode_values(x):
    if x > 50.0:
        return 1
    else:
        return 0

dataFrame['enc_percentage'] = dataFrame.Percentage.apply(encode_values)
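
For a simple threshold like this, a vectorised comparison does the same job without a Python-level function call:

dataFrame['enc_percentage'] = (dataFrame.Percentage > 50.0).astype(int)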

Handling null values

Interpolate

If the values follow an order (increasing, decreasing, time-based, etc.), interpolation can be used to fill in the missing values.

dataFrame['Id'] = dataFrame['Id'].interpolate(method='linear')
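
If the values don't follow an order, filling with a constant or a column statistic is a common alternative, for example the column mean:

dataFrame['Marks'] = dataFrame['Marks'].fillna(dataFrame['Marks'].mean())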

Drop null values

Drop columns with all values null

dataFrame.dropna(axis=1, how='all')

Drop columns with any value null

dataFrame.dropna(axis=1, how='any')

Drop rows with all value null

dataFrame.dropna(axis=0, how='all')

Drop rows with any value null

dataFrame.dropna(axis=0, how='any')

Note that dropna returns a new data frame; assign the result (or pass inplace=True) to keep the changes.
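
dropna also takes a thresh argument to keep only rows (or columns) with a minimum number of non-null values:

dataFrame.dropna(axis=0, thresh=2)  # keep rows with at least 2 non-null values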

Spring Core Reference – Cheatsheet

Factory
=======

BeanFactory factory = new XmlBeanFactory(new FileSystemResource(fileName));
factory.getBean(beanName);

ApplicationContext context = new ClassPathXmlApplicationContext(fileName);
context.getBean(beanName);

Property
========

By Value
<bean id="beanName" class="org.abcd.efg.AClass">
    <property name="name" value="Rasesh"/>
</bean>

By Reference
<bean id="beanName" class="org.abcd.efg.AClass">
    <property name="name" ref="anotherBeanName"/>
</bean>

Constructor
==========

<bean id="beanName" class="org.abcd.efg.AClass">
    <constructor-arg index="0" value="Rasesh"/>
    <constructor-arg index="1" ref="anotherBeanName"/>
</bean>

Collections
=========

<bean id="beanName" class="org.abcd.efg.AClass">
    <property name="names">
        <list>
            <ref bean="beanName"/>
            <ref bean="anotherBeanName"/>
        </list>
    </property>
</bean>

Autowiring by tags
===============

ByName
<bean id="beanName" class="org.abcd.efg.AClass" autowire="byName">
</bean>

Here, it will search for beans whose names match the member variable names in AClass and autowire them.

ByType
<bean id="beanName" class="org.abcd.efg.AClass" autowire="byType">
</bean>

Here, it will try to match beans to member variables in AClass by data type. If there are multiple beans of the same type, this won't work.

Constructor
<bean id="beanName" class="org.abcd.efg.AClass" autowire="constructor">
</bean>

This is autowiring by type, but it sets the member variables through the constructor instead of setters.

Bean Scopes
===========

Singleton (default), Prototype

Annotations
==========

@Required
Adding the @Required annotation to a setter makes bean initialization fail if the value is null, instead of a NullPointerException later when the property is referenced.

<bean class="org.springframework.beans.factory.annotation.RequiredAnnotationBeanPostProcessor" />

This bean post processor takes care of enforcing the annotation.

@Autowired
This works the same as autowiring by type, falling back to autowiring by name. If a unique bean is still not found, a qualifier can be used.

<bean id="……">
    <qualifier value="qualifierString"/>
</bean>

This, along with @Qualifier("qualifierString") at the injection point, will be able to autowire the bean.

<bean class="org.springframework.beans.factory.annotation.AutowiredAnnotationBeanPostProcessor" />

<context:annotation-config/> can be used to avoid adding a bean post processor for each annotation.

@Resource
This can be used for dependency injection by name. @Resource(name="beanName") will wire the bean with the specified name.

If no name is specified, the field name is used, which amounts to auto-wiring by name.

@PostConstruct and @PreDestroy can be used to mark the init and destroy methods; call context.registerShutdownHook() so the destroy callbacks run on JVM shutdown.

@Component

This can be used to avoid XML configuration; Spring will create a bean named after the class (with the first letter lowercased). The bean is a singleton by default.

<context:component-scan base-package="org.abcd.efg"/>

@Service, @Controller, @Repository

The component-scan tag will pick these up as well. They are specializations of @Component for the service, presentation (MVC controller) and persistence layers respectively.

MessageSource to get text from property files
=====================================

<bean id="messageSource" class="org.springframework.context.support.ResourceBundleMessageSource">
    <property name="basenames">
        <list>
            <value>mymessages</value>
        </list>
    </property>
</bean>

context.getMessage("greeting", null, // message parameters
        "Default greeting", null // Locale
);

OR

MessageSource messageSource = (MessageSource) context.getBean("messageSource");
messageSource.getMessage("greeting", null, // message parameters
        "Default greeting", null // Locale
);

Parameter example:

drawing.point=Circle: Point is: ({0}, {1})
context.getMessage("drawing.point", new Object[] {"val1", "val2"}, "Default point msg", null)