The following is a simple example showing how to scrape a web page in Python using the BeautifulSoup library.
The example web page we use is Hacker News (https://news.ycombinator.com/).
The goal of the example is to print just the headlines from the page.
Scraping and parsing a web page requires first understanding the HTML document structure of the target page.
In this case, the relevant portion of the HTML structure is roughly as follows:
<table>
  <tbody>
    <tr>
      ...
      <td>
        <span class="titleline">
          <a href="https://...">Headline</a>
        </span>
      </td>
      ...
Crucially, BeautifulSoup can find elements by many different criteria, regardless of depth of nesting.
One of the most useful techniques is searching by CSS class. In this case, we need to find anchor elements inside spans with the class “titleline”, inside table cells.
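As a minimal sketch of a class-based search, the snippet below parses a small hand-written HTML fragment and uses BeautifulSoup's select() method with a CSS selector. The fragment, its headlines, and its example.com URLs are placeholders modelled on the structure shown above, not content from the live page.

```python
from bs4 import BeautifulSoup

# Hand-written sample modelled on the structure above (placeholder content).
SAMPLE_HTML = """
<table><tbody>
  <tr><td><span class="titleline"><a href="https://example.com/a">First headline</a></span></td></tr>
  <tr><td><span class="other"><a href="https://example.com/x">Not a headline</a></span></td></tr>
  <tr><td><span class="titleline"><a href="https://example.com/b">Second headline</a></span></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "td span.titleline > a": anchors that are direct children of a
# span with class "titleline", located anywhere inside a table cell.
anchors = soup.select("td span.titleline > a")
headlines = [a.get_text(strip=True) for a in anchors]
print(headlines)  # ['First headline', 'Second headline']
```

The child combinator (>) limits the match to anchors sitting directly inside the span, which can matter if a span also contains further nested links.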
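First, we create a BeautifulSoup object and use its find methods to drill down into the parsed HTML elements.
Note that findAll() returns a list of every matching element, while find() returns only the first match, or None if nothing matches. (findAll() is the older camelCase spelling of find_all(); Beautiful Soup 4 accepts both, and find_all() is the PEP 8-style name to prefer in new code.)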
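To make the difference between the two methods concrete, here is a small sketch that runs on an inline HTML string rather than a live page. The markup and the class name "no-such-class" are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup with two matching spans.
html = (
    '<p>'
    '<span class="titleline"><a href="#1">One</a></span>'
    '<span class="titleline"><a href="#2">Two</a></span>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all() collects every match into a list-like result.
all_spans = soup.find_all("span", class_="titleline")

# find() returns only the first match.
first_span = soup.find("span", class_="titleline")

# find() returns None when nothing matches, so check before using the result.
missing = soup.find("span", class_="no-such-class")

print(len(all_spans))                        # 2
print(first_span.find("a").get_text())       # One
print(missing)                               # None
```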
The entire script is below.
scrape-example.py
from bs4 import BeautifulSoup
from urllib import request

url = "https://news.ycombinator.com/"
htmlPage = request.urlopen(url)
soup = BeautifulSoup(htmlPage, "html.parser")

# Find list of elements by CSS class.
spanList = soup.findAll(
    "span", attrs={"class": "titleline"}
)

for span in spanList:
    anchor = span.find("a")
    headline = anchor.text
    print(headline)
Running the script:
$ python scrape-example.py
Example output of headlines:
The subtle art of designing physical controls for cars
WASM will replace containers
Backblaze Drive Stats for 2024
...
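A slightly more defensive variant of the same idea is sketched below. It separates downloading from parsing so the parsing step can be tried on a plain string, and it skips any span that has no anchor instead of failing on it. The function names fetch_html and extract_headlines, the User-Agent string, and the timeout value are all assumptions made for illustration; they are not part of the original script.

```python
from urllib import request

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its body as bytes."""
    # Hypothetical User-Agent value; some sites reject urllib's default one.
    req = request.Request(url, headers={"User-Agent": "scrape-example/0.1"})
    with request.urlopen(req, timeout=timeout) as response:
        return response.read()


def extract_headlines(html):
    """Return the text of the first anchor in each 'titleline' span."""
    soup = BeautifulSoup(html, "html.parser")
    headlines = []
    for span in soup.find_all("span", class_="titleline"):
        anchor = span.find("a")
        if anchor is None:
            # No link inside this span, so there is nothing to print.
            continue
        headlines.append(anchor.get_text(strip=True))
    return headlines


# Offline check on made-up markup: the second span has no anchor.
sample = (
    '<td><span class="titleline"><a href="#">Kept</a></span></td>'
    '<td><span class="titleline">No link here</span></td>'
)
print(extract_headlines(sample))  # ['Kept']

# To run against the live page instead:
# print("\n".join(extract_headlines(fetch_html("https://news.ycombinator.com/"))))
```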