1 minute read

In the previous post, we have seen some basic usages of the pandas module. In this post, we will take a look at some more useful functionalities of the module.

Conditioning

  1. Show a filtered dataframe with values that satisfies the condition. The following example presents the same dataframe with values greater than zero. Values that do not satisfy conditions, are set to np.NaN or Null.
     df[df>0]
    
  2. Show all rows that satisfy condition for a particular column
     df[df['col_name']>0]
    
  3. Show all rows with target column values
     df[df['conditioned_col_name']>0]['target_col_name']
    
     df[df['conditioned_col_name']>0][['target_col_name_1','target_col_name_2']]
    
  4. Use of multiple conditions using and and or
     df[(df['conditioned_col_name_1']>0) & (df['conditioned_col_name_2'] > 0)]
    
     df[(df['conditioned_col_name_1']>0) | (df['conditioned_col_name_2'] > 0)]
    

Merging, Joining, and Concatenating

  1. Concatenation
     pd.concat([df1,df2,df3])
    
  2. Merging/Joining
     pd.merge(df5,df6,how='inner',on='index')
    
     df1.join(df2)
    
     ndf1.join(df2, how='outer')
    

Tips and Tricks

  1. Column Slicing
     df.iloc[:,start_index:ending_index].head()
    

    or

     df.loc[:,'start_column_name':'ending_column_name'].head()
    
  2. Add row total and column total to a numerical dataframe
     df['col_total'] = df.apply(lambda x: x.sum(), axis=1)  
     df.loc['row_total'] = df.apply(lambda x: x.sum())
    
  3. Access multiple files together with same columns
     from glob import glob
     files = sorted(glob(data/file_name*.csv))
     pd.concat((pd.read_csv(file) for file in files),ignore_index=True) 
    
  4. Read dataframe from Clipboard (quickly copying from Excel or Google Sheet)
     df = pd.read_clipboard()
    
  5. Read online JSON file
     df = pd.read_json('target_url')
    
  6. Scrape tables from webpage
     pd.read_html("target_URL")
    
  7. Apply functions to columns and save modified data in new columns
     for col in df.columns:
         df[f'{col}_modified'] = df[col].apply(function_name)
    
  8. Change location of a new column
     cols = list(df.columns)
     target_loc = 3
     cols = [:target_loc] + ['new_col_name'] + cols[target_loc:-1]
     df = df[cols]
    
  9. insert a new column at a particular location
     df.insert(target_loc, 'target_col_name', df['col_name'].apply(function_name))
    
  10. renaming column names
    df = df.rename({'col_1':'new_col_1','col_2':'new_col_2',}, axis='columns')
    
  11. Reversing the column order
    df.loc[:,::-1]
    
  12. Reverse row order
    df.loc[::-1]
    
  13. Split names (first-last) or locations (city-state) in different columns
    df[['new_col_1','new_col_2']] = df.col_name.str.split(' ',expand=True)
    

    use split(',') if the city-state format is like Houston,TX.

  14. Select columns by data type
    df.select_dtypes(include=['number','object','category'])
    
    df.select_dtypes(exclude=['object','category'])
    
  15. Count number of values with different data types if exist in a single column
    df.col_name.apply(type).value_counts()
    
  16. You can transpose a dataframe like a matrix
    df.T
    

I will keep adding useful functionalities of pandas dataframe in this post.

For accessing all data science in python related posts, check this post:

Collection of Data Science in Python Posts in my Blog.

References

  1. 100 pandas tricks to save you time and energy

Leave a comment