Web Scraping: Clicking the ‘Show More’ Button Multiple Times on Medium.com via Selenium
We often need to scrape information from Medium to create a dataset, which means downloading the content of many different blog posts.
However, before scraping a particular blog post, you need its URL, and the best way to collect many URLs from a single page is to search medium.com by keyword and gather all the post URLs from the results.
Back in 2019, this was easier: I just needed to scroll down using Selenium. But when I tried again in 2022, I found that they had changed the page structure again. Now we need to click a Show more button.
From my previous experience, I knew that this is a dynamic site and that the class names change often. According to this Stack Overflow thread, find_element_by_class_name() only accepts a single class name. That’s why they suggested using a CSS selector or an XPath instead.
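To illustrate the CSS selector route, here is a minimal sketch that turns a space-separated class attribute into a compound selector. The helper name classes_to_css_selector and the class names in the example are hypothetical, made up for illustration; they are not taken from Medium's markup.

```python
def classes_to_css_selector(class_attr: str, tag: str = "") -> str:
    """Turn a space-separated class attribute into a compound CSS selector.

    "a b c" becomes ".a.b.c", which matches elements carrying all three classes.
    """
    classes = class_attr.split()
    if not classes:
        raise ValueError("expected at least one class name")
    return tag + "".join("." + name for name in classes)


# Hypothetical class names, for illustration only.
selector = classes_to_css_selector("btn btn-primary load-more", tag="button")
print(selector)
```

The resulting string can be passed to Selenium with By.CSS_SELECTOR. Since the class names on a dynamic site change often, such a selector may still break, which is why I went with XPath below.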
I initially tried the XPath approach, copying the XPath as follows:
- Load a sample Medium search page, e.g., https://medium.com/search?q=security
- Right-click the Show more button and choose Inspect
- Right-click the highlighted element in the pane on the right
- Choose Copy, then Copy XPath (not Copy full XPath)
It looked like this:
//*[@id="root"]/div/div[3]/div/div/main/div/div/div/div/div[2]/div[9]/div/div/button
However, this only clicked the Show more button once, and I wondered why. So I went back to my regular browser, scrolled down to the second Show more button, and copied the XPath again. This time it looked like this:
//*[@id="root"]/div/div[3]/div/div/main/div/div/div/div/div[2]/div[19]/div/div/button
If you look carefully, one value has changed: $9 \rightarrow 19$. I understood that it would continue as $29, 39, 49, \dots$. So I just needed to insert the click count ($1, 2, 3, \dots$) at character index 67 of the XPath string, right before the $9$.
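The string manipulation can be checked on its own, without a browser. The sketch below uses the XPath copied above; the helper name xpath_for_click is mine, introduced only for illustration.

```python
INITIAL_XPATH = "//*[@id='root']/div/div[3]/div/div/main/div/div/div/div/div[2]/div[9]/div/div/button"
INSERT_AT = 67  # index of the "9" in ".../div[9]/..."


def xpath_for_click(n: int) -> str:
    """Return the XPath of the Show more button after n earlier clicks.

    n = 0 gives the original div[9]; n = 1 gives div[19]; n = 2 gives div[29].
    """
    if n == 0:
        return INITIAL_XPATH
    return INITIAL_XPATH[:INSERT_AT] + str(n) + INITIAL_XPATH[INSERT_AT:]


for n in range(3):
    print(xpath_for_click(n))
```

Note that the index 67 is tied to this exact string. If the copied XPath differs in length, the position of the 9 has to be found again.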
The code is adapted from this Stack Overflow thread. You need to change the driver path based on where you put your WebDriver executable.
Also, you can customize max_click_SHOW_MORE based on how many times you want to click the button and load more posts.
Here is the full working code:
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")

# Selenium 4 style: the driver path is passed through a Service object.
# (Selenium 3 took executable_path= and chrome_options= directly.)
driver = webdriver.Chrome(service=Service(r"../chromedriver"), options=options)
driver.get("https://medium.com/search?q=metamask")

# XPath of the first Show more button; index 67 is the "9" in div[9].
initial_XPATH = "//*[@id='root']/div/div[3]/div/div/main/div/div/div/div/div[2]/div[9]/div/div/button"
max_click_SHOW_MORE = 5
count = 1

# The first click uses the XPath exactly as copied.
WebDriverWait(driver, 100).until(EC.visibility_of_element_located((By.XPATH, initial_XPATH))).click()

while count <= max_click_SHOW_MORE:
    try:
        time.sleep(20)  # give the next batch of posts time to load
        # Insert the click count before the "9": div[19], div[29], ...
        new_XPATH = initial_XPATH[:67] + str(count) + initial_XPATH[67:]
        WebDriverWait(driver, 100).until(EC.visibility_of_element_located((By.XPATH, new_XPATH))).click()
        print("Button clicked #", count + 1)
        count += 1
    except TimeoutException:
        # No further button appeared within the timeout, so stop clicking.
        break

time.sleep(20)
driver.quit()
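Because the positional XPath depends on the exact page layout, a different technique is to locate the button by its visible text instead. This is not what the code above does; it is only a sketch, and it assumes the button's text is exactly "Show more" and that it is still rendered as a button element. The helper name button_xpath_by_text is hypothetical.

```python
def button_xpath_by_text(label: str) -> str:
    """Build an XPath matching any <button> whose trimmed text equals label."""
    if "'" in label:
        # Quoting inside XPath string literals is left out of this sketch.
        raise ValueError("label must not contain a single quote")
    return f"//button[normalize-space()='{label}']"


# Assumption: the button text on the search page is exactly "Show more".
SHOW_MORE_XPATH = button_xpath_by_text("Show more")
print(SHOW_MORE_XPATH)
```

If this matches on the live page, the same string could be used for every click in place of new_XPATH, with no index arithmetic. Whether it matches depends on Medium's markup at the time you run it.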
Let me know if you have any questions. Thanks, and have a good day!