Introduction to Data Science in Python: Pandas Module (Part 01)


pandas is one of the most widely used modules for analyzing data from CSV, JSON, or Excel files: it converts the data into dataframes, which makes the data easier to inspect and analyze.

In this tutorial, we will see some basic usages of this module.

First things first, let’s import the module:

import pandas as pd

Now, let’s take a look at some basic tips and tricks

Series

We can create a series from a list or a dictionary. For a list, the index column holds the default integer positions ($0,1,2,\dots$); for a dictionary, it holds the keys ($a,b,c$ in the example below).

mylist = ['a','b','c']
mynums = [1,2,3]
mydict = {'a':10,'b':20,'c':30}
labels = mylist  # index labels for the last example
print(pd.Series(data=mylist))
print(pd.Series(data=mydict))
print(pd.Series(mynums,labels))
# equivalent to pd.Series(data=mynums,index=labels)

If we pass two lists to pd.Series, the items of the first list become the data column and the items of the second list become the index column.
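As a small sketch of the behaviour described above, the following builds three series and looks values up by index label. The names `values`, `s_default`, `s_labeled`, and `s_dict` are illustrative and do not come from the original examples.

```python
import pandas as pd

labels = ['a', 'b', 'c']
values = [1, 2, 3]

# A list on its own gets the default integer index 0, 1, 2
s_default = pd.Series(values)

# A second list supplies the index labels
s_labeled = pd.Series(data=values, index=labels)

# A dictionary uses its keys as the index
s_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})

print(list(s_default.index))   # [0, 1, 2]
print(s_labeled['b'])          # 2
print(s_dict['c'])             # 30
```

Once labels are set, values can be read by label instead of by position, which is usually the reason for supplying an index in the first place.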

Dataframes

Creating Dataframes

We can create a dataframe using the pandas.DataFrame() constructor. The following example creates a matrix of size $5 \times 4$ filled with random numbers, then sets the index labels from A to E and the column labels from W to Z.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

Let’s take a look at another example:

import pandas as pd
data = [['X', 10], ['Y', 15], ['Z', 20]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])

The above example creates two columns: the first element of each inner list in data goes under the column Name, and the second element goes under the column Age.
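To check how the inner lists map onto columns, here is a short sketch that inspects the result. The second dataframe, `df_from_dict`, is an illustrative addition that builds the same table column by column from a dictionary of lists.

```python
import pandas as pd

data = [['X', 10], ['Y', 15], ['Z', 20]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

print(df.shape)            # (3, 2): three rows, two columns
print(list(df.columns))    # ['Name', 'Age']
print(list(df['Name']))    # ['X', 'Y', 'Z']

# The same table, built column by column from a dictionary of lists
df_from_dict = pd.DataFrame({'Name': ['X', 'Y', 'Z'], 'Age': [10, 15, 20]})
print(list(df_from_dict['Age']))  # [10, 15, 20]
```

A list of lists is read row by row, while a dictionary of lists is read column by column; which one is more convenient depends on how the data arrives.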

If we want to create a dataframe by reading a CSV file, we can use the following:

df = pd.read_csv('file_name.csv')

You can use a few other options described below:

If you want to import only a subset of columns, you can define which columns to import by using the option usecols
```
 pd.read_csv('file_name.csv', usecols= ['column_name_1','column_name_2'])
```
or the indices of the columns
```
 pd.read_csv('file_name.csv',usecols=[1,2,3])
```
If your CSV file does not have column headers, set header=None
```
 pd.read_csv('file_name.csv', header=None)
```
If you want to use a particular column as index, use the option index_col
```
 pd.read_csv('file_name.csv', index_col='column_name_to_set_index')
```
Import a range of rows from the file. The following example reads data rows $31$ to $45$: skiprows skips data rows $1$ to $30$ while keeping the header line, and nrows then reads the next $15$ rows. Passing dtype=float assumes that every column is numeric.
```
 df = pd.read_csv('file_name.csv', dtype=float, skiprows=range(1, 31), nrows=15)
```
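The read_csv options above can be tried without a file on disk. The sketch below is only an illustration: the CSV content, its column names (`id`, `value`, `label`), and the variable names are all made up, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV content: a header line followed by 50 data rows
lines = ['id,value,label'] + [f'{i},{i * 1.5},row{i}' for i in range(1, 51)]
csv_text = '\n'.join(lines)

# Import only two columns by name
subset = pd.read_csv(io.StringIO(csv_text), usecols=['id', 'value'])

# Use the id column as the index
indexed = pd.read_csv(io.StringIO(csv_text), index_col='id')

# Keep the header (line 0), skip data rows 1 to 30, then read 15 rows
window = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 31), nrows=15)

print(list(subset.columns))      # ['id', 'value']
print(indexed.loc[3, 'label'])   # row3
print(window['id'].tolist())     # the ids 31 through 45
```

Note that nrows counts the rows to read, not the number of the last row, and that an integer skiprows would also skip the header line.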

Dataframe Operations

Get a summary of the dataframe using describe()
```
 df.describe()
```

Get the unique values of a column with unique(), or the number of unique values with nunique()

 df['col_name'].unique()

 df['col_name'].nunique()

Access a particular column
```
 df['target_column_name']
```

Accessing multiple columns

 df[['one_column_name','another_column_name']]

Accessing first $n$ or last $n$ number of rows. head(n) is used to access first $n$ rows and tail(n) is used to access last $n$ rows.
```
 df.head(5)
```
```
 df.tail(15)
```
Count the number of appearances of each value
```
 df['col_name'].value_counts()
```
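To see what these inspection methods return, here is a sketch on a tiny dataframe. The data and the column names `city` and `score` are invented for illustration.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'city': ['Dhaka', 'Reno', 'Dhaka', 'Austin', 'Dhaka'],
                   'score': [70, 85, 90, 60, 75]})

print(df['city'].unique())        # distinct values, in order of first appearance
print(df['city'].nunique())       # 3
print(df['city'].value_counts())  # Dhaka appears 3 times, the others once
print(df.head(2))                 # first two rows
print(df.tail(1))                 # last row
print(df.describe())              # count, mean, std, min, quartiles, max of 'score'
```

By default describe() summarizes only the numeric columns, so `city` does not appear in its output here.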

Creating a new column

 df['new_column_name'] = df['column_name_1'] + df['column_name_2']

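Deleting a column. drop() returns a new dataframe and leaves df unchanged, so assign the result back to keep the change
```
 df = df.drop('target_column_name',axis=1)
```
Deleting a row
```
 df = df.drop('row_index',axis=0)
```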
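A point that is easy to miss: drop() returns a new dataframe rather than modifying the original. The sketch below shows this together with the column-creation step; the dataframe and the names `total`, `without_b`, and `without_r2` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]},
                  index=['r1', 'r2', 'r3'])

# New column computed element-wise from two existing columns
df['total'] = df['a'] + df['b']

# drop() returns a new dataframe; df itself is unchanged
without_b = df.drop('b', axis=1)
without_r2 = df.drop('r2', axis=0)

print(list(df.columns))          # ['a', 'b', 'total']
print(df['total'].tolist())      # [11, 22, 33]
print(list(without_b.columns))   # ['a', 'total']
print(list(without_r2.index))    # ['r1', 'r3']
```

Assigning the result back (`df = df.drop(...)`) is the usual way to make the deletion stick.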
Selecting a row using loc (by index label) or iloc (by integer position)
```
df.loc['row_index']
```
```
df.iloc[row_index]
```
Accessing a single value
```
df.loc['row_index','column_header']
```

Accessing values from selected multiple rows and multiple columns

df.loc[['row_index_1','row_index_2'],['column_header_1','column_header_2']]
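
The difference between label-based and position-based selection is easier to see on concrete values. This sketch reuses the A to E and W to Z labels from the earlier example, but fills the dataframe with `np.arange(20)` instead of random numbers so the results are predictable; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index='A B C D E'.split(),
                  columns='W X Y Z'.split())

row_by_label = df.loc['B']        # label-based selection
row_by_position = df.iloc[1]      # position-based; the same row here
single_value = df.loc['B', 'Y']   # one cell: row label, then column label
block = df.loc[['A', 'C'], ['W', 'Z']]  # selected rows and columns

print(row_by_label.equals(row_by_position))  # True
print(single_value)                          # 6
print(block.values.tolist())                 # [[0, 3], [8, 11]]
```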

Creating a new CSV file

df.to_csv('file_name.csv')

Exporting without the index

df.to_csv('file_name.csv', index=False)

Sometimes you can get a UnicodeEncodeError. To avoid that, set the encoding explicitly

df.to_csv('file_name.csv', encoding='utf-8')

Exporting particular columns only

df.to_csv('file_name.csv',columns=['col_1','col_2'])
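
To preview what these export options produce without writing a file, to_csv() can be called with no path, in which case it returns the CSV text as a string. The dataframe and variable names below are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col_1': [1, 2],
                   'col_2': ['x', 'y'],
                   'col_3': [0.5, 1.5]})

# With no path, to_csv() returns the CSV text instead of writing a file
with_index = df.to_csv()
without_index = df.to_csv(index=False)
two_columns = df.to_csv(index=False, columns=['col_1', 'col_2'])

print(with_index.splitlines()[0])     # ,col_1,col_2,col_3
print(without_index.splitlines()[0])  # col_1,col_2,col_3
print(two_columns.splitlines())       # ['col_1,col_2', '1,x', '2,y']
```

The leading comma in the first header line is the unnamed index column, which is why index=False is often used when the index carries no information.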

Dataframe Visualization (Graphs)

The pandas module offers some direct visualizations (using matplotlib in the background).

bar plot

 df.plot.bar()

or a stacked bar plot

 df.plot.bar(stacked=True) 

histogram
```
 df['column_name'].hist()
```

line plot

 df.plot.line(y='column_name',figsize=(10,3),lw=1)

scatter plot

 df.plot.scatter(x='column_1',y='column_2')

using a colormap

 df.plot.scatter(x='col_1',y='col_2',c='col_3',cmap='coolwarm')

box plot
```
 df.plot.box()
```
density plot
```
 df.plot.density()
```

In this post, I showed only a few basic uses of the pandas module. In the next post, we will take a more detailed look at it.

To find all posts related to data science in Python, check this post:

Collection of Data Science in Python Posts in my Blog.

Have a nice day, cheers!


Shanto Roy
