Python cheatsheet for data analysis

Reading a file


import pandas as pd
dataFrame = pd.read_csv("./file.csv")

When reading a CSV file, pandas assumes that the first row is the header. If the file doesn’t contain a header row, it can be read as:

dataFrame = pd.read_csv("./file.csv", header=None)

Peeking into data



     Id   Name           Marks     Percentage
0     1   Scott Summers    175           87.5 
1     2   Peter Parker      99           49.5


describe() works differently for numeric columns and categorical columns. If we apply describe() on the Id column, here is what it looks like:

dataFrame.Id.describe()  # or: dataFrame['Id'].describe()


count 2.000000
mean  1.500000
std   0.707107
min   1.000000
25%   1.250000
50%   1.500000
75%   1.750000
max   2.000000
Name: Id, dtype: float64

For a numeric field, it shows the number of non-null values (count), the mean of all the values, the standard deviation (std), the minimum (min), the maximum (max), and the 25th, 50th and 75th percentile values, to give an idea of the data for a particular attribute.

dataFrame.Name.describe()  # or: dataFrame['Name'].describe()


count     2
unique    2
top       Scott Summers
freq      1
Name: Name, dtype: object

For categorical fields, it shows the number of values (count), the number of unique values (unique), the most frequent value (top) and its frequency (freq).

Extracting and removing columns in data frame

Extracting numeric columns

import numpy as np
dataFrameNum = dataFrame.select_dtypes(include=[np.number])
print (dataFrameNum.columns)


Index(['Id', 'Marks', 'Percentage'], dtype='object')

Extracting categorical columns

dataFrameNonNum = dataFrame.select_dtypes(exclude=[np.number])
print (dataFrameNonNum.columns)


Index(['Name'], dtype='object')

Dropping Columns

dataFrameDroppedCols = dataFrame.drop(['Id'], axis=1)

Checking correlation between columns

corr = dataFrame.corr(numeric_only=True)  # pandas 1.5+: skip non-numeric columns such as Name
print (corr)


              Id    Marks    Percentage
Id           1.0     -1.0          -1.0
Marks       -1.0      1.0           1.0
Percentage  -1.0      1.0           1.0

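It prints out a matrix showing the correlation between every pair of numeric columns. As Percentage is calculated using Marks and the total marks, it shows perfect correlation (1.0) between them. A value of -1.0 shows a perfect negative correlation (one column goes up as the other goes down, as Id and Marks do in this two-row sample), whereas a value close to 0 would show that there is no linear correlation.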

In case there are many attributes, this matrix can be huge. We can sort the values to see the most correlated and least correlated attributes easily.

print (corr['Marks'].sort_values(ascending=False)[:5], '\n')
print (corr['Marks'].sort_values(ascending=False)[-5:], '\n')

Pivot Table for further correlation exploration

dataFrame.pivot_table(index='Marks', values='Percentage', aggfunc = np.median)

Visualising correlation between columns

import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (10, 6)

plt.hist(dataFrame.Marks, color='blue')

Plotting using data frame

dataFrame.plot(kind='bar', color='blue')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')

Finding and removing outliers

Scatter Plot to find outliers

plt.scatter(x=dataFrame['Marks'], y=dataFrame['Percentage'])
plt.ylabel('Y Label')
plt.xlabel('X Label')
plt.xlim(-100, 600)

Data Frame operations to remove outliers

dataFrame = dataFrame[dataFrame['Percentage'] > 50]

Encoding attribute values


Use value_counts on data frame series to see number of occurrences of each value

print (dataFrame.Marks.value_counts())


175      1
99       1
Name: Marks, dtype: int64

Encoding values

def encode_values(x):
    # 1 for a percentage above 50, 0 otherwise
    if x > 50.0:
        return 1
    return 0

dataFrame['enc_percentage'] = dataFrame.Percentage.apply(encode_values)

Handling null values


If the values follow an order (increasing, decreasing, time etc.), interpolation can be used to fill the missing values.

dataFrame['Id'] = dataFrame['Id'].interpolate(method='linear')

Drop null values

Drop columns with all values null

dataFrame.dropna(axis=1, how='all')

Drop columns with any value null

dataFrame.dropna(axis=1, how='any')

Drop rows with all values null

dataFrame.dropna(axis=0, how='all')

Drop rows with any value null

dataFrame.dropna(axis=0, how='any')

Mockito and Power Mockito – Cheatsheet

@RunWith(PowerMockRunner.class) – Tells JUnit to run this test using PowerMockRunner

@PrepareForTest(A.class) – This is needed when we need to mock static methods of class A

A mock = PowerMockito.mock(A.class) – Creating a mock for class A

PowerMockito.when(mock.mockedMethod()).thenReturn(value) – When mockedMethod is called in the code, then return the value specified here.

PowerMockito.doNothing().when(mock).method() – do nothing when method() is called on mock object

Mockito.verify(mock).someMethod() – Verify that someMethod was called on mock once.

Mockito.verify(mock, times(n)).someMethod() – someMethod called n number of times

Mockito.verify(mock, never()).someMethod() – someMethod was never called

Mockito.verify(mock, atLeastOnce()).someMethod() – self explanatory

Mockito.verify(mock, atLeast(n)).someMethod() – self explanatory

Mockito.verify(mock, atMost(n)).someMethod() – self explanatory


PowerMockito.mockStatic(A.class) – mock all static methods of class A

A.staticMethod(value); – Nothing to be done when staticMethod(value) is called on class A

PowerMockito.doNothing().doThrow(new IllegalStateException()).when(A.class);
A.staticMethod(value); – Do nothing on the first call to staticMethod(value) on class A, then throw IllegalStateException on the next call

//We first have to inform PowerMock that we will now verify
//the invocation of a static method by calling verifyStatic.
//Then we need to inform PowerMock about the method we want to verify.
//This is done by actually invoking the static method.

@Before – annotation for a method that does the set up before starting the test.

InOrder verification

//First we have to let PowerMock know that the verification order is
//going to be important. This is done by calling Mockito.inOrder and passing
//it the mocked object.
InOrder inOrder = Mockito.inOrder(mock);
//Next, we can continue our verification using the inOrder instance
//using the same technique as seen earlier.
inOrder.verify(mock, Mockito.never()).create();


PowerMockito.whenNew(A.class).withArguments(mock, "msg").thenReturn(object)

PowerMockito.verifyNew(A.class).withArguments(mock, "msg")
PowerMockito.verifyNew(A.class, times(n)).withArguments(mock, "msg")

The class that creates the object of A must be listed in @PrepareForTest


Assert.assertSame(objValue, mock.method("somestring123"));
Assert.assertSame(objValue, mock.method("somestring456"));

PowerMockito.when(mock.method(Mockito.argThat(new ArgumentMatcher<Object>() { public boolean matches(Object obj) {…} }))).thenReturn(value); – Use the custom matcher to match the argument and return the value specified.

Mockito.anyString, anyFloat, anyDouble, anyList, and so on

Answer Interface

When thenReturn() is not practical, use Answer interface

PowerMockito.when(mock.method()).then(new Answer<T>() {
    public T answer(InvocationOnMock invocation) { … }
});

PowerMockito.mock(A.class, Answer obj) – This will act as the default answer for all the invocations on this mock object.

Spy – Partial Mocking (some methods) of classes

//Following is the syntax to create a spy using the PowerMockito.spy method.
//Notice that we have to pass an actual instance of the EmployeeService class.
//This is necessary since a spy will only mock few methods of a class and
//invoke the real methods for all methods that are not mocked.
final EmployeeService spy = PowerMockito.spy(new EmployeeService());

//Notice that we have to use the PowerMockito.doNothing().when(spy).createEmployee()
//syntax to stub a method on the spy. This is required because using the
//PowerMockito.when(spy.createEmployee()) syntax will result in calling
//the actual method on the spy.
//Hence, remember when we are using spies,
//always use the doNothing(), doReturn() or the doThrow() syntax only.
PowerMockito.doNothing().when(spy)

Mocking private methods

"createEmployee", employeeMock);

.invoke("createEmployee", employeeMock);

Spring Core Reference – Cheatsheet


BeanFactory factory  = new XmlBeanFactory(new FileSystemResource(fileName));

ApplicationContext context = new ClassPathXmlApplicationContext(fileName);


By Value
<bean id="beanName" class="org.abcd.efg.AClass">
<property name="name" value="Rasesh"/>
</bean>

By Reference
<bean id="beanName" class="org.abcd.efg.AClass">
<property name="name" ref="anotherBeanName"/>
</bean>


<bean id="beanName" class="org.abcd.efg.AClass">
<constructor-arg index="0" value="Rasesh"/>
<constructor-arg index="1" ref="anotherBeanName"/>
</bean>


<bean id="beanName" class="org.abcd.efg.AClass">
<property name="names">
<list>
<ref bean="beanName"/>
<ref bean="anotherBeanName"/>
</list>
</property>
</bean>

Autowiring by tags

<bean id="beanName" class="org.abcd.efg.AClass" autowire="byName">

Here, it will search for the beans with the same name as member variables in AClass and autowire them.

<bean id="beanName" class="org.abcd.efg.AClass" autowire="byType">

Here, it will try and match the data type of beans and member variables in AClass. If there are multiple beans of same data type, this won’t work.

<bean id="beanName" class="org.abcd.efg.AClass" autowire="constructor">

This is autowire by Type, but this will set the member variables through constructor instead of setters.

Bean Scopes

Singleton, Prototype


Adding the @Required annotation to a setter makes the bean fail at initialization if the value has not been set, instead of causing an NPE later when it is referenced.

<bean class="org.springframework.beans.factory.annotation.RequiredAnnotationBeanPostProcessor" />

This bean will take care of the process mentioned above.

@Autowired works the same as auto-wiring by type, falling back to auto-wiring by name. If a match is still not found, a qualifier can be used.

<bean id="……>
<qualifier value="qualifierString"/>


<context:annotation-config/> can be used to avoid adding bean post processor for each annotation.

This along with @Qualifier("qualifierString") will be able to auto wire the bean.
<bean class="org.springframework.beans.factory.annotation.AutowiredAnnotationBeanPostProcessor" />

This can be used for dependency injection by name. @Resource(name="beanName") will wire the bean with the specified name.

If name is not specified, it falls back to the name of the member variable.

@PostConstruct and @PreDestroy can be used to mark the init method and the destroy method, together with context.registerShutdownHook().


This can be used to avoid XML configuration; Spring will create a bean named after the class. This can only handle singleton beans.

<context:component-scan base-package="org.abcd.efg"/>

@Service, @Controller, @Repository

component-scan tag will pick this up as well. These map to MVC data model.

MessageSource to get text from property files

<bean id="messageSource" class="">
<property name="basenames">

context.getMessage("greeting", null, //message parameter
"Default greeting", null //Locale
);


MessageSource messageSource = context.getBean("messageSource");
messageSource.getMessage("greeting", null, //message parameter
"Default greeting", null //Locale
);

parameter ex:

drawing.point=Circle: Point is: ({0}, {1})
context.getMessage("drawing.point", new Object[] {"val1", "val2"}, "Default point msg", null)

Getting started with erlang

This post is more a set of notes than a step-by-step tutorial.

Install Erlang
Download and install Erlang


$ erl
Erlang/OTP 17 [erts-6.0] [source-07b8f44] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

Eshell V6.0  (abort with ^G)

erl is the command to start the Erlang shell.

You can also use this online interpreter:

Expressions in Erlang are terminated by .(dot) followed by whitespace (line break, space etc.)

Usage of arithmetic operators (+, *, /, div, rem and -):

1> 5+5.
2> 5*10.
3> 25/2.
4> 25 div 2.
5> 25 rem 2.
6> 10-5.
  • ‘/’ operator by default performs a floating point operation.
  • For integer division, use ‘div’ operator.
  • For modulo operation, use ‘rem’ operator.

Numbers in other bases

Numbers can be expressed with base other than 10.

7> 2#101.
8> 8#723.
9> 16#abe.


  • Because Erlang is a functional programming language, variables are immutable.
  • Variable names must start with a capital letter or ‘_’.
10> FirstVariable.
* 1: variable 'FirstVariable' is unbound
11> Three.
* 1: variable 'Three' is unbound
12> Three=3.
13> Five=Three+2.
14> Two=2.
15> Five=Three+Two.
16> Five=Five+1.
** exception error: no match of right hand side value 6
17> Three=4.
** exception error: no match of right hand side value 4
  • A variable once assigned a value cannot be changed (see commands 16 and 17).
  • A variable can be assigned the same value any number of times (see commands 13 and 15).
  • ‘=’ operator tries to compare the values on left and right side. If there is a variable on left hand side and it is not bound, it is assigned the value.
  • If left and right side both have value and do not match, it throws an error.
18> 85=45+40.
19> 85=40+40.
** exception error: no match of right hand side value 80

More examples on variables:

20> three=3.
** exception error: no match of right hand side value 3
21> _=3.
22> _.
* 1: variable '_' is unbound
  • ‘three’ is an invalid variable name as it does not start with a capital letter.
  • _ is a special variable to which no value can be bound.

Erase a variable’s value

23> Three.
24> f(Three).
25> Three.
* 1: variable 'Three' is unbound
26> Five.
27> Two.
28> f().
29> Two.
* 1: variable 'Two' is unbound
30> Five.
* 1: variable 'Five' is unbound


Atoms are the reason why variable names cannot start with lower-case letters. Atoms are literals: constants whose value is the same as their name. As the name suggests, an atom cannot be smashed into pieces, nor can it be changed.

Well, ironman is ‘ironman’ and it can’t be ‘Tony Stark’ 🙂

An atom that does not start with a lower-case letter, or that contains anything other than alphanumeric characters, “_” or “@”, must be enclosed in single quotes (‘).

31> iamatom.
32> iamatom=5.
** exception error: no match of right hand side value 5
33> iamatom='iamatom'.
34> iamatom=iamatom.
35> 'I am also an atom'.
'I am also an atom'
  • Atoms generally represent constants and should be used with care as they are not garbage-collected.
  • They are stored in the ‘atom table’, with a memory usage of 4 bytes/atom on a 32-bit system and 8 bytes/atom on a 64-bit system.
  • Reserved atoms: after and andalso band begin bnot bor bsl bsr bxor case catch cond div end fun if let not of or orelse query receive rem try when xor.

Boolean Algebra and comparison operators

Well, what is the use of a language if it can’t distinguish apples from oranges 🙂

1> true and false.
2> true or false.
3> true and true.
4> false or false.
5> not (true and true).
6> true xor false.
7> true xor true.
8> false xor true.
9> false xor false.
10> true andalso true.
11> false andalso true.
12> true orelse false.
  • and, or, not, xor are quite straight forward.
  • andalso is similar to and. The difference is that it does not evaluate another expression if it does not need to.
  • orelse is counterpart of or as andalso is for and.

Equality and Inequality:

13> One=1.
14> Two=2.
15> 2=:= 2.
16> 2=/=2.
17> 5=/=2.
18> One=:=One.
19> One=/=Two.
  • =:= is comparable to ‘==’ in other languages and =/= is comparable to ‘!=’.
  • ‘==’ and ‘/=’ in Erlang compare values while ignoring the difference between integers and floats, so 5==5.0 holds while 5=:=5.0 does not.
20> 5=:=5.0.
21> 5==5.0.
22> 5=/=5.0.
23> 5/=5.0.

Other comparison operators:

24> 1<2.
25> 1>2.
26> 1>=2.
27> 1<=2.
* 1: syntax error before: '<='
  • Oh!! Does this mean there is no ‘less than or equal to’ operator? Here it is with a different symbol:
27> 1=<2.
28> 2=<2.

Well, do you think you have got it all correct? Let’s try this:

29> 7-foo.
** exception error: an error occurred when evaluating an arithmetic expression
in operator -/2
called as 7 - foo
30> 5=:=false.
  • Erlang slapped you hard with the error details when you subtracted foo from 7.
  • But, it is all okay comparing 5 to false!!!! – Well, 5 is not false, right?

If this is not enough, take this:

31> 1<false.
32> 2<false.
33> 54<false.
34> 546734<false.

A total ordering is defined over all data types, in which numbers < atoms. Hence, 1<false. ‘false’ is not a special value. It is just an atom, remember?

Well, we have had enough now. Let’s take a break and we will continue from here.

JDK Tools for your application

In this post, I will talk about various tools that help you analyze your Java application’s behavior when something has gone wrong. This is not an exhaustive list but just the tools that I have used at some point of time and found useful.

1. jmap
This is a tool bundled within JDK and useful for connecting to either running java processes locally/remotely or connecting to a core file.

  • Heap Dump: jmap -dump:format=b,file=/tmp/heap.bin PID
    This command dumps the heap which can be later analyzed using other tools mentioned below to analyze the application’s memory usage.
  • Heap Summary: jmap -heap PID
    This command dumps the current memory usage with the division of heap space (Young Gen/Old Gen/Survivor Space) depending on the Garbage Collector used

2. jstack
This tool is also included in JDK and useful to get details of a running application.

  • Thread details: jstack -l PID
    This command will print the stack trace of all the threads running in the java process. This can be used to get total threads with priority details of those threads.

3. jhat
Bundled within JDK, this tool can be used to analyze the heap dump

  • Heap Dump Analysis: jhat Heap_Dump_File
    This command will analyze the heap dump file, extract all the object details and publish these details to a web server running on port 7000 by default (can be changed using options), where you can see all the object details and even run Object Queries (OQL) against the heap dump.

4. jvisualvm
This tool is also available within JDK and provides UI to connect to running java process and can be used to get a heap dump or used as a profiler.

5. MAT
Memory Analyzer Tool is an Eclipse plugin used for heap dump analysis and very useful for finding memory leaks or checking the heap usage of the application. This tool also provides the details of unreachable objects, which is useful for studying the behavior of the Garbage Collector.

6. GCViewer
When the Java application is started, it can be asked to dump garbage collector logs using options like -XX:+PrintGCDetails -XX:+PrintGCTimeStamps and -Xloggc:logFile to redirect the logs to a file for further analysis.

GCViewer reads the GC logs to provide a summary of how the garbage collector performed.

Design Pattern: Builder Pattern

What is it?

According to Gang of Four, “Separates the construction of a complex object from its representation so that the same construction process can create different representations.” Again, no idea reading it for the first time 🙂

So, stated simply, we try to separate the construction of a complex object from its representation to reuse the building process for creating different representations.

Let’s take an example of subway, where we provide all the details of vegetables and sauces needed (data) and we get the sandwich (complex object). Here, we did not tell how to prepare it (building process). This is what we want to do with object creation process too.

In software world:

To create a sandwich object, you pass all the parameters in the constructor to create the object. This can get worse if we have multiple constructors with different parameters.

Improvement #1:

Remove the parameters from the constructor and set all the properties you need. Something like:

 Sandwich sandwich = new Sandwich();
 sandwich.setHasMayo(true); ...

But now we need to check that all the properties have been set, and we have no way to enforce the order. I mean, the bread type has to be provided before the vegetables and sauces, and that can not be controlled anymore. So, we need a builder that takes all the details and builds the sandwich using the proper steps.

Improvement #2:

class SandwichBuilder{
       Sandwich sandwich;

       public Sandwich getSandwich(){
              return sandwich;
       }

       public void createSandwich(){
              sandwich = new Sandwich();
              //set all the properties of the sandwich here
       }
}

It can be further modified to modularize and make the building process more meaningful with functions doing smaller tasks being called from createSandwich

       public void createSandwich(){
              sandwich = new Sandwich();
              //call the smaller functions here, one per step
       }

So, now we have a builder whose objects can be used to create new sandwiches having the same properties.

Also, this helps when creation of the complex object (sandwich here) changes i.e. we can change the code inside the build function (createSandwich here).

One more advantage here is if we have more than one kind of complex object (different types of sandwich), just create a new builder class to handle that and we can reuse the creation of the complex object de-coupling it from its representation.

But, now we need to create multiple builder classes with similar functions without any relation between them (no common interface implemented by them) and much of code will be replicated in these builders.

Improvement #3

Create an abstract class controlling the structure of the builder which can be used by each builder to use the same procedure with different values (ingredients here):

abstract class SandwichBuilder{
        Sandwich sandwich;

        public Sandwich getSandwich(){
               return sandwich;
        }

        public void createNewSandwich(){
               sandwich=new Sandwich();
        }

        public abstract void addVegetables();
        public abstract void addSauces();
}

So, this class now encapsulates the commonalities between the builders. But who handles the procedure of building the sandwich?

It should not be added here as this is the class that does the work common to all the builders. Here is what we do to mitigate this:

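class SandwichMaker{
      private SandwichBuilder builder;

      public SandwichMaker(SandwichBuilder builder){
             this.builder = builder;
      }

      //the maker decides the order of the steps
      public void buildSandwich(){
             builder.createNewSandwich();
             builder.addVegetables();
             builder.addSauces();
      }

      public Sandwich getSandwich(){
             return builder.getSandwich();
      }
}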

So, now our builder class just handles the properties (ingredients here) used to create the complex object, and the creation steps are handled by the maker. Your code using this will look something like this:

      SandwichMaker maker = new SandwichMaker(new MyFavSandwichBuilder());
      maker.buildSandwich();
      Sandwich sandwich = maker.getSandwich();
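
Putting the pieces of Improvement #3 together, here is a minimal runnable sketch. It is only an illustration under some assumptions: the Sandwich methods addVegetable, addSauce and describe, the ingredient names, and the BuilderDemo class are made up for this example and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

// Product: the complex object being assembled (methods are hypothetical).
class Sandwich {
    private final List<String> vegetables = new ArrayList<>();
    private final List<String> sauces = new ArrayList<>();

    void addVegetable(String vegetable) { vegetables.add(vegetable); }
    void addSauce(String sauce) { sauces.add(sauce); }

    String describe() {
        return "vegetables=" + vegetables + ", sauces=" + sauces;
    }
}

// Builder: owns the product and declares the steps every builder must supply.
abstract class SandwichBuilder {
    protected Sandwich sandwich;

    public Sandwich getSandwich() { return sandwich; }
    public void createNewSandwich() { sandwich = new Sandwich(); }

    public abstract void addVegetables();
    public abstract void addSauces();
}

// ConcreteBuilder: supplies the ingredients (assumed values for the example).
class MyFavSandwichBuilder extends SandwichBuilder {
    @Override public void addVegetables() {
        sandwich.addVegetable("lettuce");
        sandwich.addVegetable("tomato");
    }

    @Override public void addSauces() { sandwich.addSauce("mayo"); }
}

// Director: fixes the order of the steps, independent of the ingredients.
class SandwichMaker {
    private final SandwichBuilder builder;

    SandwichMaker(SandwichBuilder builder) { this.builder = builder; }

    public void buildSandwich() {
        builder.createNewSandwich();
        builder.addVegetables();
        builder.addSauces();
    }

    public Sandwich getSandwich() { return builder.getSandwich(); }
}

class BuilderDemo {
    public static void main(String[] args) {
        SandwichMaker maker = new SandwichMaker(new MyFavSandwichBuilder());
        maker.buildSandwich();
        System.out.println(maker.getSandwich().describe());
        // prints: vegetables=[lettuce, tomato], sauces=[mayo]
    }
}
```

A different kind of sandwich only needs another subclass of SandwichBuilder; SandwichMaker stays untouched because it only knows the order of the steps.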

The Builder Pattern Design:

Director (SandwichMaker) <>——- Builder (SandwichBuilder) <———- ConcreteBuilder (MyFavSandwichBuilder) ——-> Product (Sandwich)

So in Hollywood terms:

Director knows how to build

ConcreteBuilder implements the individual steps that the Director calls to create the Product

Where to use:

  • Need to create a complex object with lot of details embedded into it.
  • Multiple data combination to create multiple types of products using same steps.

Design Pattern: Bridge Pattern


According to Gang of Four, “Decouple an abstraction from its implementation so the two can vary independently”. If you did not understand it, it is pretty normal 🙂 Let’s discuss it in detail.

Let’s understand the definition:

So for any kind of abstraction, we create an interface and implement that interface in a class. Now, our client code uses this interface to call the implemented code. But, every time the interface changes, we need to change the implemented code. How can we decouple the abstraction and its implementation?


We have different types of documents that have print functionality in them and we have created an abstract super class GenericDocument that has this print method.

Now let’s say we get a requirement for a Document to have two different types of printing (ex. remove images while printing). One option is to create a new subclass of that Document and implement print accordingly.

  • But what if all documents may need this new type of Print tomorrow? Do we create a new subclass every time?
  • What if we get multiple different print types? Should we create a new subclass for each type of document?


To have a generic print that can be used by each type of document if needed, we can create a print helper that can handle each type of print request and this helper can be used by each document’s print method without the need of any new sub class creation for each new print type.

GenericDocument <— Book, Research Paper, Design Document

Create a PrintHelper interface that has two implementations: NormalPrint and PrintWithoutImages implementation.

Now, GenericDocument has a PrintHelper passed in the constructor that can be used by each document’s print method, which serves the purpose.

What did this pattern do?

Abstraction (GenericDocument) uses Implementor (PrintHelper)

RefinedAbstraction (Book, Research Paper etc.) uses ConcreteImplementor (NormalPrint, PrintWithoutImages)
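
The mapping above can be sketched in Java roughly as follows. This is a minimal illustration, not a definitive design: the print(text, image) signature, the title field, the "cover.png" image name and the BridgeDemo class are assumptions made up for the example.

```java
// Implementor: how a document gets printed (signature is hypothetical).
interface PrintHelper {
    String print(String text, String image);
}

// ConcreteImplementor: prints the text together with its image.
class NormalPrint implements PrintHelper {
    @Override public String print(String text, String image) {
        return text + " [" + image + "]";
    }
}

// ConcreteImplementor: leaves the image out.
class PrintWithoutImages implements PrintHelper {
    @Override public String print(String text, String image) {
        return text;
    }
}

// Abstraction: holds a PrintHelper instead of hard-coding one print style.
abstract class GenericDocument {
    protected final PrintHelper printHelper;

    GenericDocument(PrintHelper printHelper) { this.printHelper = printHelper; }

    abstract String print();
}

// RefinedAbstraction: one kind of document, delegating the printing.
class Book extends GenericDocument {
    private final String title;

    Book(String title, PrintHelper printHelper) {
        super(printHelper);
        this.title = title;
    }

    @Override String print() {
        return printHelper.print("Book: " + title, "cover.png");
    }
}

class BridgeDemo {
    public static void main(String[] args) {
        System.out.println(new Book("Erlang Notes", new NormalPrint()).print());
        // prints: Book: Erlang Notes [cover.png]
        System.out.println(new Book("Erlang Notes", new PrintWithoutImages()).print());
        // prints: Book: Erlang Notes
    }
}
```

A new print type means one more PrintHelper implementation, and a new document type means one more GenericDocument subclass; neither change forces a change on the other side.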

When to use it?

When you have some abstraction (GenericDocument) and some implementations (Book, Research Paper etc.) already existing, but you need to add some functionality (Print without images) that can be used by all the implementations, you can go ahead and use the bridge pattern.