Introduction to Data Science in Python: Pandas Module (Part 02)
In the previous post, we saw some basic uses of the pandas module. In this post, we will take a look at some more useful functionality of the module.
Conditioning
- Show a filtered dataframe with only the values that satisfy a condition. The following example returns the same dataframe, keeping values greater than zero; values that do not satisfy the condition are set to `np.NaN` (null): `df[df>0]`
- Show all rows that satisfy a condition on a particular column: `df[df['col_name']>0]`
- Show only the target column(s) for the rows that satisfy a condition: `df[df['conditioned_col_name']>0]['target_col_name']` or `df[df['conditioned_col_name']>0][['target_col_name_1','target_col_name_2']]`
- Combine multiple conditions with `&` (and) and `|` (or): `df[(df['conditioned_col_name_1']>0) & (df['conditioned_col_name_2']>0)]` and `df[(df['conditioned_col_name_1']>0) | (df['conditioned_col_name_2']>0)]` (a combined example follows this list)
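Putting these together, here is a minimal, self-contained sketch of the selection patterns above on a toy dataframe; the column names (A, B, C) and the random data are invented for illustration and are not from the original post.

```python
import numpy as np
import pandas as pd

# Toy dataframe with invented column names
np.random.seed(42)
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

# Whole-frame condition: entries that fail the test become NaN
print(df[df > 0])

# Rows where a single column satisfies the condition
print(df[df['A'] > 0])

# Same filter, but show only the target column(s)
print(df[df['A'] > 0]['B'])
print(df[df['A'] > 0][['B', 'C']])

# Multiple conditions combined with & (and) and | (or)
print(df[(df['A'] > 0) & (df['B'] > 0)])
print(df[(df['A'] > 0) | (df['B'] > 0)])
```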
Merging, Joining, and Concatenating
- Concatenation: `pd.concat([df1,df2,df3])`
- Merging/joining: `pd.merge(df5,df6,how='inner',on='index')` (here `'index'` is the name of the shared key column), `df1.join(df2)`, or `df1.join(df2, how='outer')`; `join` combines dataframes on their indexes. A short combined example follows this list.
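The following is a small sketch of these three operations on invented dataframes; the frame contents and the key column name `key` are assumptions for illustration (the `on='index'` call above assumes the shared key column is literally named `index`).

```python
import pandas as pd

# Concatenation: stack frames with the same columns vertically
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})
print(pd.concat([df1, df2], ignore_index=True))

# Merging: combine frames on a shared key column ('key' is an invented name)
df5 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'left_val': [1, 2, 3]})
df6 = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'right_val': [4, 5, 6]})
print(pd.merge(df5, df6, how='inner', on='key'))

# Joining: combine frames on their indexes
left = pd.DataFrame({'L': [10, 20]}, index=['a', 'b'])
right = pd.DataFrame({'R': [30, 40]}, index=['a', 'c'])
print(left.join(right))               # default: left join on the index
print(left.join(right, how='outer'))  # keep all index labels from both
```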
Tips and Tricks
- Column slicing: `df.iloc[:,start_index:ending_index].head()` or `df.loc[:,'start_column_name':'ending_column_name'].head()` (several of these tips are combined in the worked examples after this list)
- Add a row total and a column total to a numerical dataframe: `df['col_total'] = df.apply(lambda x: x.sum(), axis=1)` and `df.loc['row_total'] = df.apply(lambda x: x.sum())`
- Read multiple files with the same columns together: `from glob import glob`, then `files = sorted(glob('data/file_name*.csv'))` and `pd.concat((pd.read_csv(file) for file in files), ignore_index=True)`
- Read a dataframe from the clipboard (handy for quickly copying from Excel or Google Sheets): `df = pd.read_clipboard()`
- Read an online JSON file: `df = pd.read_json('target_url')`
- Scrape tables from a webpage: `pd.read_html("target_URL")`
- Apply a function to columns and save the modified data in new columns: `for col in df.columns: df[f'{col}_modified'] = df[col].apply(function_name)`
- Change the location of a newly added column (assuming it currently sits at the end of the dataframe): `cols = list(df.columns)`, `target_loc = 3`, `cols = cols[:target_loc] + ['new_col_name'] + cols[target_loc:-1]`, `df = df[cols]`
- Insert a new column at a particular location: `df.insert(target_loc, 'target_col_name', df['col_name'].apply(function_name))`
- Rename columns: `df = df.rename({'col_1':'new_col_1','col_2':'new_col_2'}, axis='columns')`
- Reverse the column order: `df.loc[:,::-1]`
- Reverse the row order: `df.loc[::-1]`
- Split names (first-last) or locations (city-state) into different columns: `df[['new_col_1','new_col_2']] = df.col_name.str.split(' ',expand=True)`; use `split(',')` if the city-state format is like `Houston,TX`
- Select columns by data type: `df.select_dtypes(include=['number','object','category'])` or `df.select_dtypes(exclude=['object','category'])`
- Count the number of values of each data type if a single column holds mixed types: `df.col_name.apply(type).value_counts()`
- Transpose a dataframe like a matrix: `df.T`
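A few worked examples tie several of these tips together. This first sketch covers column slicing, row/column totals, applying functions into new columns, relocating and inserting columns, and renaming; the month column names and the doubling/tripling functions are invented for illustration.

```python
import pandas as pd

# Small numeric dataframe with invented month columns
df = pd.DataFrame({'jan': [1, 2, 3], 'feb': [4, 5, 6], 'mar': [7, 8, 9]})

# Column slicing by position and by label
print(df.iloc[:, 0:2].head())
print(df.loc[:, 'jan':'feb'].head())

# Per-row totals as a new column, then a per-column total row
df['col_total'] = df.apply(lambda x: x.sum(), axis=1)
df.loc['row_total'] = df.apply(lambda x: x.sum())

# Apply a function to selected columns, saving results in new columns
for col in ['jan', 'feb', 'mar']:
    df[f'{col}_modified'] = df[col].apply(lambda x: x * 2)

# Move the last-added column ('mar_modified', currently at the end) to position 1
cols = list(df.columns)
target_loc = 1
cols = cols[:target_loc] + ['mar_modified'] + cols[target_loc:-1]
df = df[cols]

# Insert a derived column at a specific position, then rename some columns
df.insert(2, 'jan_tripled', df['jan'].apply(lambda x: x * 3))
df = df.rename({'jan': 'january', 'feb': 'february'}, axis='columns')
print(df)
```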
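This second sketch covers reading several same-schema CSV files at once; it writes two tiny files into a temporary directory first so the example is self-contained, and the `file_name*` pattern is just a placeholder.

```python
import tempfile
from glob import glob
from pathlib import Path

import pandas as pd

# Write two tiny CSV files with the same columns into a temporary directory,
# then read them back and combine them into one dataframe.
with tempfile.TemporaryDirectory() as data_dir:
    for i in range(2):
        pd.DataFrame({'id': [i, i + 10], 'value': [1.0, 2.0]}).to_csv(
            Path(data_dir) / f'file_name_{i}.csv', index=False)

    files = sorted(glob(f'{data_dir}/file_name*.csv'))
    combined = pd.concat((pd.read_csv(file) for file in files), ignore_index=True)
    print(combined)
```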
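The last sketch covers string splitting, selecting columns by dtype, counting mixed types in a column, and reversing/transposing; the names, locations, and scores are invented example data.

```python
import pandas as pd

# Invented example data: names, "City,ST" locations, and a mixed-type column
df = pd.DataFrame({
    'name': ['Ada Lovelace', 'Alan Turing'],
    'location': ['Houston,TX', 'Austin,TX'],
    'score': [95, 90],
    'mixed': [1, 'one'],
})

# Split "first last" into two columns, and "City,ST" on the comma
df[['first_name', 'last_name']] = df.name.str.split(' ', expand=True)
df[['city', 'state']] = df.location.str.split(',', expand=True)

# Select columns by data type
print(df.select_dtypes(include=['number']))
print(df.select_dtypes(exclude=['object', 'category']))

# Count how many values of each Python type a mixed column contains
print(df.mixed.apply(type).value_counts())

# Reverse column order, reverse row order, transpose
print(df.loc[:, ::-1])
print(df.loc[::-1])
print(df.T)
```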
I will keep adding useful pandas dataframe functionality to this post.
To access all of the data science in Python related posts, check this post:
Collection of Data Science in Python Posts in my Blog.