Introduction to Data Science in Python: Pandas Module (Part 01)
pandas is the most widely used module for analyzing data from CSV, JSON, or Excel files: it loads the data into dataframes and provides tools to analyze it.
In this tutorial, we will see some basic usages of this module.
First things first, let's import the module:
import pandas as pd
Now, let’s take a look at some basic tips and tricks
Series
We can create a series from a list or a dictionary. For a list, the index holds the positions ($0,1,2,\dots$); for a dictionary, the index holds the keys ($a,b,c$ in the example below).
mylist = ['a','b','c']
mynums = [1,2,3]
mydict = {'a':10,'b':20,'c':30}
print(pd.Series(data=mylist))
print(pd.Series(data=mydict))
labels = ['a','b','c']
print(pd.Series(mynums,labels))
# equivalent to pd.Series(data=mynums,index=labels)
If we pass two lists to pd.Series, the items of the first list become the data values and the items of the second list become the index.
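The three ways of building a series described above can be compared side by side. This is a minimal sketch; the variable names from_list, from_two_lists, and from_dict are chosen here only for illustration.

```python
import pandas as pd

# Sample data mirroring the lists used above.
labels = ['a', 'b', 'c']
mynums = [1, 2, 3]

from_list = pd.Series(data=mynums)                      # default index 0, 1, 2
from_two_lists = pd.Series(data=mynums, index=labels)   # index a, b, c
from_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})      # keys become the index

print(list(from_list.index))        # [0, 1, 2]
print(list(from_two_lists.index))   # ['a', 'b', 'c']
print(from_two_lists['b'])          # 2
print(from_dict['c'])               # 30
```

Once a series has labels, values can be looked up by label instead of by position.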
Dataframes
Creating Dataframes
We can create a dataframe using the pandas.DataFrame() constructor. The following example creates a $5 \times 4$ matrix of random numbers, then sets the index labels from A to E and the column labels from W to Z.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
Let's take a look at another example:
import pandas as pd
data = [['X', 10], ['Y', 15], ['Z', 20]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
The above example creates two columns: the first element of each inner list in data goes under the column Name, and the second element goes under the column Age.
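To check what the two constructions above produce, we can inspect the shape, index, and columns of the results. This is a small sketch; the names df_random and df_people and the seed value are assumptions made here so that both frames can coexist and the run is repeatable.

```python
import numpy as np
import pandas as pd

# 5 x 4 frame of random numbers; the seed is arbitrary.
rng = np.random.default_rng(0)
df_random = pd.DataFrame(rng.standard_normal((5, 4)),
                         index='A B C D E'.split(),
                         columns='W X Y Z'.split())

# Each inner list is one row: [Name, Age].
data = [['X', 10], ['Y', 15], ['Z', 20]]
df_people = pd.DataFrame(data, columns=['Name', 'Age'])

print(df_random.shape)             # (5, 4)
print(list(df_random.index))       # ['A', 'B', 'C', 'D', 'E']
print(list(df_people.columns))     # ['Name', 'Age']
print(df_people['Age'].tolist())   # [10, 15, 20]
```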
If we want to create a dataframe by reading a CSV file, we can use the following:
df = pd.read_csv('file_name.csv')
You can use a few other options, described below:
- To import only a subset of columns, list them with the usecols option:
pd.read_csv('file_name.csv', usecols=['column_name_1', 'column_name_2'])
or give the indices of the columns:
pd.read_csv('file_name.csv', usecols=[1, 2, 3])
- If your CSV file does not have column headers, set header=None:
pd.read_csv('file_name.csv', header=None)
- To use a particular column as the index, use the index_col option:
pd.read_csv('file_name.csv', index_col='column_name_to_set_index')
- To import a range of rows from the file, combine skiprows and nrows. The following example reads data rows $31$ to $45$: skiprows=range(1, 31) skips the first $30$ data rows while keeping the header line, and nrows=15 reads the next $15$ rows.
df = pd.read_csv('file_name.csv', skiprows=range(1, 31), nrows=15)
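The read_csv options can be tried without a file on disk by reading from an in-memory string. This is a minimal sketch: the CSV content and its column names (id, name, score) are made up for the illustration. Note that the row-range call passes skiprows=range(1, 31) so the header line is kept, and nrows=15 so reading stops after row 45.

```python
import io
import pandas as pd

# Hypothetical CSV content: a header line followed by 50 data rows.
lines = ['id,name,score'] + [f'{i},item_{i},{i * 2}' for i in range(1, 51)]
csv_text = '\n'.join(lines)

# Only two of the three columns.
subset = pd.read_csv(io.StringIO(csv_text), usecols=['id', 'score'])

# Use the 'id' column as the index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col='id')

# Treat every line as data (no header row); columns are numbered 0, 1, 2.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# Data rows 31 to 45: skip file lines 1..30 (line 0 is the header),
# then read 15 rows.
window = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 31), nrows=15)

print(list(subset.columns))      # ['id', 'score']
print(indexed.loc[3, 'name'])    # item_3
print(no_header.shape)           # (51, 3)
print(window['id'].tolist())     # ids 31 through 45
```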
Dataframe Operations
- Get a summary of the dataframe using describe():
df.describe()
- Get the unique values of a column, or the number of unique values:
df['col_name'].unique()
df['col_name'].nunique()
- Access a particular column:
df['target_column_name']
- Access multiple columns:
df[['one_column_name', 'another_column_name']]
- Access the first $n$ or last $n$ rows. head(n) returns the first $n$ rows and tail(n) returns the last $n$ rows:
df.head(5)
df.tail(15)
- Count the number of appearances of each value in a column:
df['col_name'].value_counts()
- Create a new column:
df['new_column_name'] = df['column_name_1'] + df['column_name_2']
- Delete a column (drop returns a new dataframe, so assign the result):
df = df.drop('target_column_name', axis=1)
- Delete a row:
df = df.drop('row_index', axis=0)
- Select a row using loc (by label) or iloc (by integer position):
df.loc['row_index']
df.iloc[row_position]
- Access a single value:
df.loc['row_index', 'column_header']
- Access values from selected rows and columns:
df.loc[['row_index_1', 'row_index_2'], ['column_header_1', 'column_header_2']]
- Create a new CSV file:
df.to_csv('file_name.csv')
To export without the index:
df.to_csv('file_name.csv', index=False)
Sometimes you can get a UnicodeEncodeError. To avoid that, set the encoding:
df.to_csv('file_name.csv', encoding='utf-8')
To export particular columns only:
df.to_csv('file_name.csv', columns=['col_1', 'col_2'])
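Several of the operations listed above can be tried together on a small dataframe. The sketch below uses made-up data (the column names city, q1, q2 and the row labels r1 to r4 are assumptions for illustration) and writes the CSV to an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Hypothetical data used only for this illustration.
df = pd.DataFrame({'city': ['Oslo', 'Rome', 'Oslo', 'Lima'],
                   'q1': [10, 20, 30, 40],
                   'q2': [1, 2, 3, 4]},
                  index=['r1', 'r2', 'r3', 'r4'])

print(df['city'].unique())                  # the distinct values
print(df['city'].nunique())                 # 3
print(df['city'].value_counts()['Oslo'])    # 2

df['total'] = df['q1'] + df['q2']           # new column from two existing ones

# drop() returns a new dataframe; df itself is unchanged.
without_q2 = df.drop('q2', axis=1)
without_r1 = df.drop('r1', axis=0)

print(df.loc['r2', 'total'])                # 22, selected by label
print(df.iloc[1]['total'])                  # 22, selected by position
print(df.loc[['r1', 'r3'], ['q1', 'total']])

# Export two columns without the index, then read them back.
buffer = io.StringIO()
df.to_csv(buffer, index=False, columns=['city', 'total'])
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))             # ['city', 'total']
```

Because drop() does not modify the dataframe in place by default, df still contains the q2 column and the r1 row after the two calls.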
Dataframe Visualization (Graphs)
The pandas module offers some direct visualizations (using matplotlib in the background).
- Bar plot:
df.plot.bar()
or a stacked bar plot:
df.plot.bar(stacked=True)
- Histogram:
df['column_name'].hist()
- Line plot:
df.plot.line(y='column_name', figsize=(10,3), lw=1)
- Scatter plot:
df.plot.scatter(x='column_1', y='column_2')
or using a colormap:
df.plot.scatter(x='col_1', y='col_2', c='col_3', cmap='coolwarm')
- Box plot:
df.plot.box()
- Density plot:
df.plot.density()
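A few of these plots can be drawn side by side and saved to an image. This is a rough sketch that assumes matplotlib is installed; the random data, the seed, and the output file name pandas_plots.png are made up for the illustration. Each plot is given its own axes through the ax argument so that the plots do not overwrite each other.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data; the seed is arbitrary.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((20, 3)), columns=['col_1', 'col_2', 'col_3'])

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Stacked bar plot of the first five rows.
df.head(5).plot.bar(stacked=True, ax=axes[0])

# Histogram of a single column.
df['col_1'].hist(ax=axes[1])

# Scatter plot, with points colored by a third column.
df.plot.scatter(x='col_1', y='col_2', c='col_3', cmap='coolwarm', ax=axes[2])

fig.tight_layout()
fig.savefig('pandas_plots.png')  # hypothetical output file name
```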
In this post, I just showed a few basic uses of the module pandas. In the next post, we will take a more elaborate look at the module.
For all posts related to data science in Python, check this post:
Collection of Data Science in Python Posts in my Blog.
Have a nice day, cheers!