Introduction to Data Science in Python: Pandas Module (Part 01)


pandas is one of the most widely used Python modules for analyzing data from CSV, JSON, or Excel files. It loads such data into dataframes, which makes the data easier to inspect and analyze.

In this tutorial, we will see some basic uses of this module.

First things first, let’s import the module:

import pandas as pd

Now, let’s take a look at some basic tips and tricks

Series

We can create a series from a list or a dictionary. For a list, the index column holds the default integer index ($0,1,2,\dots$); for a dictionary, the index column holds the keys (in the example below, $a,b,c$).

mylist = ['a','b','c']
mynums = [1,2,3]
mydict = {'a':10,'b':20,'c':30}
print(pd.Series(data=mylist))
print(pd.Series(data=mydict))
print(pd.Series(mynums,mylist))
# same as pd.Series(data=mynums,index=mylist)

If we pass two lists to pd.Series, the items of the first list go into the data column and the items of the second list go into the index column.
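As a quick check of how the index column is filled in each case, here is a minimal sketch. The variable names (values, index_labels, from_list, from_dict, labelled) are made up for illustration.

```python
import pandas as pd

values = [1, 2, 3]               # goes into the data column
index_labels = ['a', 'b', 'c']   # goes into the index column

from_list = pd.Series(values)                        # default index 0, 1, 2
from_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})   # keys become the index
labelled = pd.Series(values, index=index_labels)     # explicit index labels

print(list(from_list.index))   # [0, 1, 2]
print(list(from_dict.index))   # ['a', 'b', 'c']
print(labelled['b'])           # 2
```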

Dataframes

Creating Dataframes

We can create a dataframe using the pandas.DataFrame() constructor. The following example creates a matrix of size $5 \times 4$ filled with random numbers, and then sets the index labels from A to E and the column labels from W to Z.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

Let’s take a look at another example:

import pandas as pd
data = [['X', 10], ['Y', 15], ['Z', 20]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])

The above example creates two columns: the first element of each inner list in data goes under the column Name, and the second element goes under the column Age. Each inner list therefore becomes one row of the dataframe.
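To see what the two constructions above produce, we can inspect the shape and the columns of the results. This is only a sketch; the names random_df and people_df are made up here to keep the two dataframes apart.

```python
import numpy as np
import pandas as pd

# 5 x 4 frame of random numbers with row labels A-E and column labels W-Z
random_df = pd.DataFrame(
    np.random.randn(5, 4),
    index='A B C D E'.split(),
    columns='W X Y Z'.split(),
)
print(random_df.shape)              # (5, 4)

# Each inner list becomes one row of the dataframe
data = [['X', 10], ['Y', 15], ['Z', 20]]
people_df = pd.DataFrame(data, columns=['Name', 'Age'])
print(people_df['Name'].tolist())   # ['X', 'Y', 'Z']
print(people_df['Age'].tolist())    # [10, 15, 20]
```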

If we want to create a dataframe by reading a CSV file, we can use the following:

df = pd.read_csv('file_name.csv')

You can use a few other options described below:

  1. If you want to import only a subset of columns, you can define which columns to import by using the option usecols
     pd.read_csv('file_name.csv', usecols= ['column_name_1','column_name_2'])
    

    or the indices of the columns

     pd.read_csv('file_name.csv',usecols=[1,2,3])
    
  2. If your CSV file does not have column headers, set header=None
     df = pd.read_csv('file_name.csv', header=None)
    
  3. If you want to use a particular column as index, use the option index_col
     pd.read_csv('file_name.csv', index_col='column_name_to_set_index')
    
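  4. Import a range of rows from the file. The following example reads data rows $31$ to $45$: skiprows=range(1,31) skips the first $30$ data rows while keeping the header line, and nrows=15 then reads the next $15$ rows. Note that an integer value such as skiprows=30 would skip the header line as well. dtype=float reads every column as a float.
     df = pd.read_csv('file_name.csv', dtype=float, skiprows=range(1,31), nrows=15)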
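The read_csv options above can be tried without a file on disk. The sketch below feeds read_csv an in-memory string through io.StringIO; the CSV content and the column names (id, name, score) are hypothetical stand-ins for file_name.csv. It uses a range for skiprows so that the header line is kept.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'file_name.csv': a header plus 50 rows
csv_text = "id,name,score\n" + "\n".join(
    f"{i},item{i},{i * 1.5}" for i in range(1, 51)
)

# Only a subset of columns, selected by name
subset = pd.read_csv(io.StringIO(csv_text), usecols=['id', 'score'])

# Use one column as the index
indexed = pd.read_csv(io.StringIO(csv_text), index_col='id')

# Keep the header (line 0), skip data rows 1-30, then read 15 rows: rows 31-45
window = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 31), nrows=15)

# No header line: the columns get integer labels 0, 1, ...
no_header = pd.read_csv(io.StringIO("1,a\n2,b\n"), header=None)

print(list(subset.columns))                          # ['id', 'score']
print(indexed.index.name)                            # id
print(window['id'].iloc[0], window['id'].iloc[-1])   # 31 45
print(list(no_header.columns))                       # [0, 1]
```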
    

Dataframe Operations

  1. Get a summary of the dataframe using describe()
     df.describe()
    
  2. Get the unique values of a column with unique(), or the number of unique values with nunique()
     df['col_name'].unique()
    
     df['col_name'].nunique()
    
  3. Access a particular column
     df['target_column_name']
    
  4. Accessing multiple columns
     df[['one_column_name','another_column_name']]
    
  5. Accessing the first $n$ or last $n$ rows. head(n) returns the first $n$ rows and tail(n) returns the last $n$ rows.
     df.head(5)
    
     df.tail(15)
    
  6. Count the number of appearances of each value
     df['col_name'].value_counts()
    
  7. Creating a new column
     df['new_column_name'] = df['column_name_1'] + df['column_name_2']
    
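  8. Deleting a column. drop() returns a new dataframe and leaves the original unchanged, so assign the result back to keep the change.
     df = df.drop('target_column_name',axis=1)
    
  9. Deleting a row
     df = df.drop('row_index',axis=0)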
    
  10. Selecting a row using loc (by index label) or iloc (by integer position)
    df.loc['row_index']
    
    df.iloc[row_position]
    
  11. Accessing a single value
    df.loc['row_index','column_header']
    
  12. Accessing values from selected multiple rows and multiple columns
    df.loc[['row_index_1','row_index_2'],['column_header_1','column_header_2']]
    
  13. Creating a new CSV file
    df.to_csv('file_name.csv')
    

    Exporting without the index

    df.to_csv('file_name.csv', index=False)
    

    Sometimes you can get a UnicodeEncodeError. To avoid that, specify the encoding:

    df.to_csv('file_name.csv', encoding='utf-8')
    

    Exporting particular columns only

    df.to_csv('file_name.csv',columns=['col_1','col_2'])
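Several of the operations in the list above can be tried on a small dataframe. The following is only a sketch: the data, the row labels (r1 to r4), and the column names (city, q1, q2, total) are made up for illustration. It also relies on to_csv returning the CSV text as a string when no file name is given.

```python
import pandas as pd

# Hypothetical data; row labels and column names are made up for illustration
df = pd.DataFrame(
    {'city': ['Oslo', 'Rome', 'Oslo', 'Lima'],
     'q1': [10, 20, 30, 40],
     'q2': [1, 2, 3, 4]},
    index=['r1', 'r2', 'r3', 'r4'],
)

print(df['city'].unique())         # distinct values, in order of appearance
print(df['city'].nunique())        # 3
print(df['city'].value_counts())   # Oslo appears twice, the others once

df['total'] = df['q1'] + df['q2']  # new column built from two existing ones

# drop() returns a new dataframe; df itself is unchanged
without_q2 = df.drop('q2', axis=1)
without_r1 = df.drop('r1', axis=0)

print(df.loc['r2', 'total'])       # 22 (label-based lookup)
print(df.iloc[1]['total'])         # 22 (position-based lookup of the same row)

# Without a file name, to_csv returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False, columns=['city', 'total'])
print(csv_text.splitlines()[0])    # city,total
```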
    

Dataframe Visualization (Graphs)

The pandas module offers some direct visualizations (using matplotlib in the background).

  1. bar plot
     df.plot.bar()
    

    or a stacked bar plot

     df.plot.bar(stacked=True) 
    
  2. histogram
     df['column_name'].hist()
    
  3. line plot
     df.plot.line(y='column_name',figsize=(10,3),lw=1)
    
  4. scatter plot
     df.plot.scatter(x='column_1',y='column_2')
    

    using a colormap

     df.plot.scatter(x='col_1',y='col_2',c='col_3',cmap='coolwarm')
    
  5. box plot
     df.plot.box()
    
  6. density plot (this one also needs SciPy to be installed)
     df.plot.density()
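Each of the plotting calls above returns a matplotlib Axes object, so the figure can be saved or customized further. Here is a minimal sketch, assuming matplotlib is installed. The data and column names are hypothetical, the Agg backend is chosen only so that the code runs without a display, and the image is written to an in-memory buffer rather than a file.

```python
import io

import matplotlib
matplotlib.use('Agg')  # assumption: render off-screen so no display is needed
import pandas as pd

# Hypothetical numeric data for illustration
df = pd.DataFrame({
    'col_1': [1, 2, 3, 4],
    'col_2': [4, 3, 2, 1],
    'col_3': [0.1, 0.5, 0.9, 0.3],
})

# Stacked bar plot of all columns
bar_ax = df.plot.bar(stacked=True)

# Scatter plot, with point colors taken from col_3
scatter_ax = df.plot.scatter(x='col_1', y='col_2', c='col_3', cmap='coolwarm')

# Save the scatter figure as PNG bytes in memory
buf = io.BytesIO()
scatter_ax.get_figure().savefig(buf, format='png')
print(buf.getbuffer().nbytes > 0)
```

In a notebook, the plots show up inline and the saving step is not needed.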
    

In this post, I showed just a few basic uses of the pandas module. In the next post, we will take a more elaborate look at the module.

To access all posts related to data science in Python, check this post:

Collection of Data Science in Python Posts in my Blog.

Have a nice day, cheers!
