Introduction to Data Science in Python: Pandas Module (Part 02)
In the previous post, we saw some basic uses of the pandas module. In this post, we will take a look at some more useful functionality of the module.
Conditioning
- Show a filtered dataframe with only the values that satisfy a condition. The following example returns the same dataframe, keeping values greater than zero; values that do not satisfy the condition are set to `np.NaN` (null): `df[df>0]`
- Show all rows that satisfy a condition on a particular column: `df[df['col_name']>0]`
- Show only the target column(s) for the rows that satisfy a condition: `df[df['conditioned_col_name']>0]['target_col_name']` or `df[df['conditioned_col_name']>0][['target_col_name_1','target_col_name_2']]`
- Combine multiple conditions with `&` (and) and `|` (or): `df[(df['conditioned_col_name_1']>0) & (df['conditioned_col_name_2']>0)]` and `df[(df['conditioned_col_name_1']>0) | (df['conditioned_col_name_2']>0)]` (a combined example follows this list)
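Putting these together, here is a minimal, self-contained sketch of the selection patterns above on a toy dataframe; the column names (A, B, C) and the random data are invented for illustration and are not from the original post.

```python
import numpy as np
import pandas as pd

# Toy dataframe with invented column names
np.random.seed(42)
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

# Whole-frame condition: entries that fail the test become NaN
print(df[df > 0])

# Rows where a single column satisfies the condition
print(df[df['A'] > 0])

# Same filter, but show only the target column(s)
print(df[df['A'] > 0]['B'])
print(df[df['A'] > 0][['B', 'C']])

# Multiple conditions combined with & (and) and | (or)
print(df[(df['A'] > 0) & (df['B'] > 0)])
print(df[(df['A'] > 0) | (df['B'] > 0)])
```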
Merging, Joining, and Concatenating
- Concatenation: `pd.concat([df1,df2,df3])`
- Merging/joining: `pd.merge(df5,df6,how='inner',on='index')` (here `'index'` is the name of the shared key column), `df1.join(df2)`, or `df1.join(df2, how='outer')`; `join` combines dataframes on their indexes. A short combined example follows this list.
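The following is a small sketch of these three operations on invented dataframes; the frame contents and the key column name `key` are assumptions for illustration (the `on='index'` call above assumes the shared key column is literally named `index`).

```python
import pandas as pd

# Concatenation: stack frames with the same columns vertically
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})
print(pd.concat([df1, df2], ignore_index=True))

# Merging: combine frames on a shared key column ('key' is an invented name)
df5 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'left_val': [1, 2, 3]})
df6 = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'right_val': [4, 5, 6]})
print(pd.merge(df5, df6, how='inner', on='key'))

# Joining: combine frames on their indexes
left = pd.DataFrame({'L': [10, 20]}, index=['a', 'b'])
right = pd.DataFrame({'R': [30, 40]}, index=['a', 'c'])
print(left.join(right))               # default: left join on the index
print(left.join(right, how='outer'))  # keep all index labels from both
```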
Tips and Tricks
- Column slicing: `df.iloc[:,start_index:ending_index].head()` or `df.loc[:,'start_column_name':'ending_column_name'].head()` (several of these tips are combined in the worked examples after this list)
- Add a row total and a column total to a numerical dataframe: `df['col_total'] = df.apply(lambda x: x.sum(), axis=1)` and `df.loc['row_total'] = df.apply(lambda x: x.sum())`
- Read multiple files with the same columns together: `from glob import glob`, then `files = sorted(glob('data/file_name*.csv'))` and `pd.concat((pd.read_csv(file) for file in files), ignore_index=True)`
- Read a dataframe from the clipboard (handy for quickly copying from Excel or Google Sheets): `df = pd.read_clipboard()`
- Read an online JSON file: `df = pd.read_json('target_url')`
- Scrape tables from a webpage: `pd.read_html("target_URL")`
- Apply a function to columns and save the modified data in new columns: `for col in df.columns: df[f'{col}_modified'] = df[col].apply(function_name)`
- Change the location of a newly added column (assuming it currently sits at the end of the dataframe): `cols = list(df.columns)`, `target_loc = 3`, `cols = cols[:target_loc] + ['new_col_name'] + cols[target_loc:-1]`, `df = df[cols]`
- Insert a new column at a particular location: `df.insert(target_loc, 'target_col_name', df['col_name'].apply(function_name))`
- Rename columns: `df = df.rename({'col_1':'new_col_1','col_2':'new_col_2'}, axis='columns')`
- Reverse the column order: `df.loc[:,::-1]`
- Reverse the row order: `df.loc[::-1]`
- Split names (first-last) or locations (city-state) into different columns: `df[['new_col_1','new_col_2']] = df.col_name.str.split(' ',expand=True)`; use `split(',')` if the city-state format is like `Houston,TX`
- Select columns by data type: `df.select_dtypes(include=['number','object','category'])` or `df.select_dtypes(exclude=['object','category'])`
- Count the number of values of each data type if a single column holds mixed types: `df.col_name.apply(type).value_counts()`
- Transpose a dataframe like a matrix: `df.T`
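A few worked examples tie several of these tips together. This first sketch covers column slicing, row/column totals, applying functions into new columns, relocating and inserting columns, and renaming; the month column names and the doubling/tripling functions are invented for illustration.

```python
import pandas as pd

# Small numeric dataframe with invented month columns
df = pd.DataFrame({'jan': [1, 2, 3], 'feb': [4, 5, 6], 'mar': [7, 8, 9]})

# Column slicing by position and by label
print(df.iloc[:, 0:2].head())
print(df.loc[:, 'jan':'feb'].head())

# Per-row totals as a new column, then a per-column total row
df['col_total'] = df.apply(lambda x: x.sum(), axis=1)
df.loc['row_total'] = df.apply(lambda x: x.sum())

# Apply a function to selected columns, saving results in new columns
for col in ['jan', 'feb', 'mar']:
    df[f'{col}_modified'] = df[col].apply(lambda x: x * 2)

# Move the last-added column ('mar_modified', currently at the end) to position 1
cols = list(df.columns)
target_loc = 1
cols = cols[:target_loc] + ['mar_modified'] + cols[target_loc:-1]
df = df[cols]

# Insert a derived column at a specific position, then rename some columns
df.insert(2, 'jan_tripled', df['jan'].apply(lambda x: x * 3))
df = df.rename({'jan': 'january', 'feb': 'february'}, axis='columns')
print(df)
```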
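This second sketch covers reading several same-schema CSV files at once; it writes two tiny files into a temporary directory first so the example is self-contained, and the `file_name*` pattern is just a placeholder.

```python
import tempfile
from glob import glob
from pathlib import Path

import pandas as pd

# Write two tiny CSV files with the same columns into a temporary directory,
# then read them back and combine them into one dataframe.
with tempfile.TemporaryDirectory() as data_dir:
    for i in range(2):
        pd.DataFrame({'id': [i, i + 10], 'value': [1.0, 2.0]}).to_csv(
            Path(data_dir) / f'file_name_{i}.csv', index=False)

    files = sorted(glob(f'{data_dir}/file_name*.csv'))
    combined = pd.concat((pd.read_csv(file) for file in files), ignore_index=True)
    print(combined)
```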
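The last sketch covers string splitting, selecting columns by dtype, counting mixed types in a column, and reversing/transposing; the names, locations, and scores are invented example data.

```python
import pandas as pd

# Invented example data: names, "City,ST" locations, and a mixed-type column
df = pd.DataFrame({
    'name': ['Ada Lovelace', 'Alan Turing'],
    'location': ['Houston,TX', 'Austin,TX'],
    'score': [95, 90],
    'mixed': [1, 'one'],
})

# Split "first last" into two columns, and "City,ST" on the comma
df[['first_name', 'last_name']] = df.name.str.split(' ', expand=True)
df[['city', 'state']] = df.location.str.split(',', expand=True)

# Select columns by data type
print(df.select_dtypes(include=['number']))
print(df.select_dtypes(exclude=['object', 'category']))

# Count how many values of each Python type a mixed column contains
print(df.mixed.apply(type).value_counts())

# Reverse column order, reverse row order, transpose
print(df.loc[:, ::-1])
print(df.loc[::-1])
print(df.T)
```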
I will keep adding useful pandas dataframe functionality to this post.
To access all of the data science in Python related posts, check this post:
Collection of Data Science in Python Posts in my Blog.