Introduction to Data Science in Python: Pandas Module (Part 02)
In the previous post, we looked at some basic uses of the pandas
module. In this post, we will take a look at some more useful functionality of the module.
Conditioning
- Show a filtered dataframe with values that satisfy a condition. The following example returns the same dataframe, keeping only values greater than zero; values that do not satisfy the condition are set to
np.NaN
or null.
df[df>0]
- Show all rows that satisfy a condition on a particular column
df[df['col_name']>0]
- Show only the target column(s) for rows that satisfy a condition
df[df['conditioned_col_name']>0]['target_col_name']
df[df['conditioned_col_name']>0][['target_col_name_1','target_col_name_2']]
- Use multiple conditions with
& (and)
and
| (or)
df[(df['conditioned_col_name_1']>0) & (df['conditioned_col_name_2'] > 0)]
df[(df['conditioned_col_name_1']>0) | (df['conditioned_col_name_2'] > 0)]
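Here is a minimal, self-contained sketch of the filtering patterns above; the dataframe and the column names 'a' and 'b' are made up purely for illustration.
import pandas as pd

# Hypothetical dataframe for illustration
df = pd.DataFrame({'a': [1, -2, 3, -4], 'b': [-1, 2, -3, 4]})

print(df[df > 0])                          # values <= 0 become NaN
print(df[df['a'] > 0])                     # rows where column 'a' is positive
print(df[df['a'] > 0]['b'])                # column 'b' of those rows
print(df[(df['a'] > 0) & (df['b'] > 0)])   # both conditions (use &, not 'and')
print(df[(df['a'] > 0) | (df['b'] > 0)])   # either condition (use |, not 'or')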
Merging, Joining, and Concatenating
- Concatenation
pd.concat([df1,df2,df3])
- Merging/Joining
pd.merge(df5,df6,how='inner',on='index')
df1.join(df2)
df1.join(df2, how='outer')
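Below is a small sketch of these three operations on toy dataframes; the frames and the key column 'key' are hypothetical stand-ins (the merge line above assumes a column literally named 'index').
import pandas as pd

# Hypothetical dataframes for illustration
df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'A': [3, 4]}, index=[2, 3])
print(pd.concat([df1, df2]))                       # rows stacked, index 0..3

df5 = pd.DataFrame({'key': ['x', 'y'], 'left_val': [1, 2]})
df6 = pd.DataFrame({'key': ['y', 'z'], 'right_val': [3, 4]})
print(pd.merge(df5, df6, how='inner', on='key'))   # only the shared key 'y' remains

left = pd.DataFrame({'L': [10, 20]}, index=['a', 'b'])
right = pd.DataFrame({'R': [30, 40]}, index=['b', 'c'])
print(left.join(right))                            # left join on the index
print(left.join(right, how='outer'))               # keeps 'a', 'b', and 'c'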
Tips and Tricks
- Column Slicing
df.iloc[:,start_index:ending_index].head()
or
df.loc[:,'start_column_name':'ending_column_name'].head()
Note that .loc slicing includes 'ending_column_name', while .iloc stops before ending_index.
- Add row total and column total to a numerical dataframe
df['col_total'] = df.apply(lambda x: x.sum(), axis=1)
df.loc['row_total'] = df.apply(lambda x: x.sum())
- Read multiple files with the same columns together
from glob import glob
files = sorted(glob('data/file_name*.csv'))
pd.concat((pd.read_csv(file) for file in files), ignore_index=True)
- Read a dataframe from the clipboard (handy for quickly copying from Excel or Google Sheets)
df = pd.read_clipboard()
- Read an online JSON file
df = pd.read_json('target_url')
- Scrape tables from a webpage
pd.read_html("target_URL")
- Apply functions to columns and save modified data in new columns
for col in df.columns:
    df[f'{col}_modified'] = df[col].apply(function_name)
- Change the location of a newly added column (this assumes 'new_col_name' is currently the last column)
cols = list(df.columns)
target_loc = 3
cols = cols[:target_loc] + ['new_col_name'] + cols[target_loc:-1]
df = df[cols]
- Insert a new column at a particular location
df.insert(target_loc, 'target_col_name', df['col_name'].apply(function_name))
- Rename columns
df = df.rename({'col_1':'new_col_1','col_2':'new_col_2',}, axis='columns')
- Reverse the column order
df.loc[:,::-1]
- Reverse the row order
df.loc[::-1]
- Split names (first-last) or locations (city-state) into separate columns
df[['new_col_1','new_col_2']] = df.col_name.str.split(' ',expand=True)
Use split(',') if the city-state format is like Houston,TX.
- Select columns by data type
df.select_dtypes(include=['number','object','category'])
df.select_dtypes(exclude=['object','category'])
- Count the number of values of each data type if a single column contains mixed types
df.col_name.apply(type).value_counts()
- You can transpose a dataframe like a matrix
df.T
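To make several of the tips above concrete, here is a short, self-contained sketch; the dataframe and all of its column names are hypothetical.
import pandas as pd

# Hypothetical dataframe for illustration
df = pd.DataFrame({'name': ['Ada Lovelace', 'Alan Turing'],
                   'x': [1, 2],
                   'y': [3, 4]})

# Row and column totals on the numeric part
num = df[['x', 'y']].copy()
num['col_total'] = num.apply(lambda row: row.sum(), axis=1)
num.loc['row_total'] = num.apply(lambda col: col.sum())
print(num)

# Insert a derived column at position 1, then rename it
df.insert(1, 'x_doubled', df['x'].apply(lambda v: v * 2))
df = df.rename({'x_doubled': 'x2'}, axis='columns')

# Split 'name' into two columns, select by dtype, reverse, transpose
df[['first', 'last']] = df.name.str.split(' ', expand=True)
print(df.select_dtypes(include=['number']))
print(df.loc[:, ::-1])   # column order reversed
print(df.loc[::-1])      # row order reversed
print(df.T)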
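The multiple-file tip can also be tried end to end; the sketch below first writes two throwaway CSVs to a temporary folder, so the file names and data are purely illustrative.
import os
import tempfile
from glob import glob
import pandas as pd

# Create two small CSVs with the same columns (hypothetical data)
tmp = tempfile.mkdtemp()
pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).to_csv(os.path.join(tmp, 'file_name_01.csv'), index=False)
pd.DataFrame({'a': [5, 6], 'b': [7, 8]}).to_csv(os.path.join(tmp, 'file_name_02.csv'), index=False)

# Read them back together, as in the tip above
files = sorted(glob(os.path.join(tmp, 'file_name*.csv')))
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
print(df)   # four rows, columns 'a' and 'b'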
I will keep adding useful pandas dataframe functionality to this post. For all Data Science in Python related posts, check this post: Collection of Data Science in Python Posts in my Blog.