Python cheatsheet for data analysis

Reading a file

CSV

import pandas as pd
dataFrame = pd.read_csv("./file.csv")

When reading a CSV file, pandas assumes that the first row is the header. If the file doesn't contain a header row, it can be read as:

dataFrame = pd.read_csv("./file.csv", header=None)

Peeking into data

dataFrame.head()

Output:

   Id           Name  Marks  Percentage
0   1  Scott Summers    175        87.5
1   2   Peter Parker     99        49.5
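
head() shows the first five rows by default; it accepts a row count, and tail() is its counterpart for the end of the frame:

dataFrame.head(10)    # first 10 rows instead of the default 5
dataFrame.tail()      # last 5 rows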

describe()

describe() works differently for numeric and categorical columns. If we apply describe() to the Id column, here is what it looks like:

dataFrame.Id.describe() OR dataFrame['Id'].describe()

Output:

count    2.000000
mean     1.500000
std      0.707107
min      1.000000
25%      1.250000
50%      1.500000
75%      1.750000
max      2.000000
Name: Id, dtype: float64

For a numeric field, it shows the number of values (count), the mean, the standard deviation (std), the minimum (min) and maximum (max), and the 25th, 50th and 75th percentile values, to give an idea of the data for that attribute.

dataFrame.Name.describe() OR dataFrame['Name'].describe()

Output:

count                 2
unique                2
top       Scott Summers
freq                  1
Name: Name, dtype: object

For a categorical field, it shows the number of values (count), the number of unique values (unique), the most frequently occurring value (top) and its frequency (freq).
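
To summarize every column at once, numeric and categorical together:

dataFrame.describe(include='all')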

Extracting and removing columns in data frame

Extracting numeric columns

import numpy as np
dataFrameNum = dataFrame.select_dtypes(include=[np.number])
print (dataFrameNum.columns)

Output:

Index(['Id', 'Marks', 'Percentage'], dtype='object')

Extracting categorical columns

dataFrameNonNum = dataFrame.select_dtypes(exclude=[np.number])
print (dataFrameNonNum.columns)

Output:

Index(['Name'], dtype='object')

Dropping Columns

dataFrameDroppedCols = dataFrame.drop(['Id'], axis=1)
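
Here axis=1 means columns are dropped, and drop returns a new data frame rather than modifying the original. A list can name several columns at once:

# Drop multiple columns in one call; the original dataFrame is unchanged
dataFrameDroppedCols = dataFrame.drop(['Id', 'Name'], axis=1)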

Checking correlation between columns

corr = dataFrame.corr(numeric_only=True)  # newer pandas requires numeric_only=True when non-numeric columns (like Name) are present
print (corr)

Output:

             Id  Marks  Percentage
Id          1.0   -1.0        -1.0
Marks      -1.0    1.0         1.0
Percentage -1.0    1.0         1.0

It prints a matrix showing the correlation between every pair of columns. As Percentage is calculated from Marks and total marks, the two show a perfect positive correlation (1.0). A value of -1.0 indicates a perfect negative correlation (here Id happens to decrease as Marks increase), while a value near 0 would indicate no correlation at all.

In case there are many attributes, this matrix can be huge. We can sort the values to see the most correlated and least correlated attributes easily.

print (corr['Marks'].sort_values(ascending=False)[:5], '\n')
print (corr['Marks'].sort_values(ascending=False)[-5:], '\n')

Pivot Table for further correlation exploration

dataFrame.pivot_table(index='Marks', values='Percentage', aggfunc=np.median)
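
The same aggregation can also be written as a groupby, which reads more naturally when there is a single value column:

# Equivalent aggregation using groupby instead of pivot_table
dataFrame.groupby('Marks')['Percentage'].median()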

Visualising correlation between columns

First, a histogram gives a quick look at the distribution of a single column:

import matplotlib.pyplot as plt

plt.style.use(style='ggplot')
plt.rcParams['figure.figsize'] = (10, 6)

plt.hist(dataFrame.Marks, color='blue')
plt.show()
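
To visualise the correlation matrix itself rather than a single column, a minimal matplotlib-only sketch (assuming corr from the step above):

# Render the correlation matrix as a colour-coded grid with labeled axes
plt.matshow(corr)
plt.xticks(range(len(corr.columns)), corr.columns)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()
plt.show()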


Plotting using data frame

dataFrame.plot(kind='bar', color='blue')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.xticks(rotation=0)
plt.show()

Finding and removing outliers

Scatter Plot to find outliers

plt.scatter(x=dataFrame['Marks'], y=dataFrame['Percentage'])
plt.ylabel('Y Label')
plt.xlabel('X Label')
plt.xlim(-100, 600)
plt.show()

Data Frame operations to remove outliers

dataFrame = dataFrame[dataFrame['Percentage'] > 50]
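
The filter above uses a hand-picked threshold. A more general rule of thumb (an assumption here, not part of the original snippet) is to keep values within 1.5 IQR of the quartiles:

# Interquartile-range filter: keep rows whose Marks fall within 1.5*IQR of the quartiles
q1, q3 = dataFrame['Marks'].quantile(0.25), dataFrame['Marks'].quantile(0.75)
iqr = q3 - q1
dataFrame = dataFrame[(dataFrame['Marks'] >= q1 - 1.5 * iqr) &
                      (dataFrame['Marks'] <= q3 + 1.5 * iqr)]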

Encoding attribute values

value_counts

Use value_counts on a data frame series to see the number of occurrences of each value:

print (dataFrame.Marks.value_counts())

Output:

175      1
99       1
Name: Marks, dtype: int64

Encoding values

def encode_values(x):
    if x > 50.0:
        return 1
    else:
        return 0

dataFrame['enc_percentage'] = dataFrame.Percentage.apply(encode_values)
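
The same encoding can be done without apply, using a vectorised comparison:

# Boolean comparison cast to int: 1 where Percentage > 50, else 0
dataFrame['enc_percentage'] = (dataFrame.Percentage > 50.0).astype(int)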

Handling null values

Interpolate

In case the values follow an order (increasing, decreasing, time-based, etc.), interpolation can be used to fill the missing values.

dataFrame['Id'] = dataFrame['Id'].interpolate(method='linear')
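
When the values do not follow an order, fillna can substitute a constant or a statistic instead; a small sketch:

# Fill missing marks with 0, and missing percentages with the column mean
dataFrame['Marks'] = dataFrame['Marks'].fillna(0)
dataFrame['Percentage'] = dataFrame['Percentage'].fillna(dataFrame['Percentage'].mean())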

Drop null values

Drop columns with all values null

dataFrame.dropna(axis=1, how='all')

Drop columns with any value null

dataFrame.dropna(axis=1, how='any')

Drop rows with all value null

dataFrame.dropna(axis=0, how='all')

Drop rows with any value null

dataFrame.dropna(axis=0, how='any')
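
Note that dropna returns a new data frame by default; assign the result back (or pass inplace=True) to keep the change:

dataFrame = dataFrame.dropna(axis=0, how='any')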

JDK Tools for your application

In this post, I will talk about various tools that help you analyze your Java application's behavior when something has gone wrong. This is not an exhaustive list, just the tools that I have used at some point and found useful.

1. jmap
This tool is bundled with the JDK and is useful for connecting either to a running Java process (locally or remotely) or to a core file.

  • Heap Dump: jmap -dump:format=b,file=/tmp/heap.bin PID
    This command dumps the heap, which can later be analyzed using the tools mentioned below to understand the application's memory usage.
  • Heap Summary: jmap -heap PID
    This command prints the current memory usage with the division of heap spaces (Young Gen/Old Gen/Survivor Space), depending on the garbage collector used.

2. jstack
This tool is also included in the JDK and is useful for getting details of a running application.

  • Thread details: jstack -l PID
    This command prints the stack traces of all the threads running in the Java process. It can be used to get the total number of threads along with their priority details.

3. jhat
Bundled with the JDK, this tool can be used to analyze a heap dump.

  • Heap Dump Analysis: jhat Heap_Dump_File
    This command analyzes the heap dump file, extracting all the object details, and publishes them to a web server running on port 7000 by default (this can be changed using options), where you can browse the object details and even run Object Query Language (OQL) queries against the heap dump.
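
    For example, to serve the report on a different port (note that jhat was removed in JDK 9, so this applies to JDK 8 and earlier): jhat -port 8080 Heap_Dump_File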

4. jvisualvm
This tool is also available within the JDK and provides a UI to connect to a running Java process; it can be used to take a heap dump or as a profiler.

5. MAT
Memory Analyzer Tool (MAT) is an Eclipse plugin used for heap dump analysis; it is very useful for finding memory leaks or checking the heap usage of the application. This tool also provides details of unreachable objects, useful for studying the behavior of the garbage collector.

6. GCViewer
When the Java application is started, it can be asked to log garbage collector activity using options like -XX:+PrintGCDetails and -XX:+PrintGCTimeStamps, with -Xloggc:logFile to redirect the logs to a file for further analysis.
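
For example, a full launch command might look like this (myApp.jar is a placeholder; JDK 9+ replaced these flags with unified logging via -Xlog:gc*):

java -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/gc.log -jar myApp.jar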

GCViewer reads the GC logs to provide a summary of how the garbage collector performed.