We often need to scrape information from Medium to create a dataset, which means downloading the content of many different blog posts.
However, before scraping a particular blog post you need its URL, and the best way to get many URLs from a single page is to search medium.com with keywords and collect all the post URLs from the results.
Back in 2019 this was easier, since I just needed to scroll down using Selenium. But in 2022, when I tried to do it again, I found that they had changed the structure again. Now we need to click on a Show more button.
From my previous experience, I knew that this is a dynamic site and that the class names change often. Based on this Stack Overflow thread, find_element_by_class_name() only accepts a single class name. That’s why they suggested using a CSS selector or the XPath. I initially tried using the XPath, copying it as follows:
- Load a sample medium query page, e.g., https://medium.com/search?q=security
- Right-click on the Show more button
- Choose Inspect element
- Right-click on the highlighted element in the right pane and copy its XPath
It looked like this:
However, this only clicked the Show more button once, and I was wondering why! So I went back to my regular browser, scrolled down to the second Show more button, and copied the XPath again. Now it looked as follows:
If you look carefully, one particular value has changed: $9 \rightarrow 19$. I understood that it would continue as $29, 39, 49, \dots$. So I just needed to insert the click count ($1, 2, 3, \dots$) at index 67 of the XPath string, right in front of the $9$.
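To make the index trick concrete, here is a minimal sketch of the string slicing on its own. The shortened XPath and the names FIRST_XPATH, INDEX_POS, and nth_button_xpath are made up for illustration; with the real XPath copied from the browser, the position works out to 67.

```python
def nth_button_xpath(first_xpath: str, index_pos: int, click_count: int) -> str:
    """Insert click_count right in front of the character at index_pos."""
    return first_xpath[:index_pos] + str(click_count) + first_xpath[index_pos:]


# Hypothetical, shortened XPath; the one copied from the browser is much longer.
FIRST_XPATH = "//main/div[9]/button"

# Position of the digit 9 inside the brackets (67 for the real XPath in this post).
INDEX_POS = FIRST_XPATH.index("[9]") + 1

for n in range(1, 4):
    # Produces the XPath for the button that appears after the n-th click.
    print(nth_button_xpath(FIRST_XPATH, INDEX_POS, n))
```

Each pass through the loop yields the XPath for the next Show more button, so the index goes from 9 to 19, 29, and 39. The full script further down applies the same slicing to the real XPath.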
The full script below is adapted from this Stack Overflow thread. You need to change executable_path based on where you put your webdriver.
Also, you can customize max_click_SHOW_MORE based on how many times you want to click the button and load more posts.
Here is the full working code:
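```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')

# Point executable_path at your own chromedriver binary.
# Note: Selenium 4 deprecates chrome_options/executable_path in favor of
# options=... and service=Service(...).
driver = webdriver.Chrome(chrome_options=options, executable_path=r'../chromedriver')
driver.get("https://medium.com/search?q=metamask")

# XPath of the first Show more button, as copied from the browser.
initial_XPATH = "//*[@id='root']/div/div/div/div/main/div/div/div/div/div/div/div/div/button"
max_click_SHOW_MORE = 5  # how many more times to click Show more
count = 1

# First click: wait until the button is visible, then click it.
WebDriverWait(driver, 100).until(
    EC.visibility_of_element_located((By.XPATH, initial_XPATH))
).click()

while count <= max_click_SHOW_MORE:
    try:
        time.sleep(20)  # give the newly loaded posts time to render
        # Insert the click count at index 67 so the index 9 becomes 19, 29, ...
        # Adjust the position if the XPath you copied is different.
        new_XPATH = initial_XPATH[:67] + str(count) + initial_XPATH[67:]
        WebDriverWait(driver, 100).until(
            EC.visibility_of_element_located((By.XPATH, new_XPATH))
        ).click()
        print("Button clicked #", count + 1)
        count += 1
    except TimeoutException:
        # No further Show more button became visible within the timeout.
        break

time.sleep(20)
driver.quit()
```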
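Once enough posts are loaded, the links can be pulled out of the rendered page, for example from driver.page_source before calling driver.quit(). The following is only a sketch using Python's standard library: the LinkCollector class, the collect_links helper, and the sample HTML are assumptions for illustration. It does not try to tell post links apart from other links, since that depends on Medium's current markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect unique absolute URLs from the href of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links and drop the query string and fragment.
        parts = urlsplit(urljoin(self.base_url, href))
        clean = f"{parts.scheme}://{parts.netloc}{parts.path}"
        if clean not in self.links:
            self.links.append(clean)


def collect_links(html, base_url="https://medium.com"):
    """Return the de-duplicated links found in html, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Made-up HTML standing in for driver.page_source.
sample_html = (
    '<a href="/@someone/a-post-123?source=search">A post</a>'
    '<a href="/@someone/a-post-123">Same post again</a>'
    '<a href="https://medium.com/tag/security">A tag page</a>'
)
print(collect_links(sample_html))
```

In the real script you would pass driver.page_source instead of sample_html and then filter the result down to the links that are actual posts.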
Let me know if you have any questions. Thanks, and have a good day!