
Web Scraping with BeautifulSoup: Extracting Book Titles

This challenge focuses on using the BeautifulSoup library in Python to parse HTML content and extract specific information. You'll be tasked with retrieving all book titles from a provided HTML snippet, a fundamental skill in web scraping for data collection.

Problem Description

Your goal is to write a Python function that takes an HTML string as input and returns a list of all book titles found within that HTML. The book titles are consistently enclosed within <h2> tags that have a specific class attribute.

What needs to be achieved:

  • Parse the provided HTML string.
  • Identify all <h2> tags whose class attribute includes book-title.
  • Extract the text content from each of these identified tags.
  • Return these extracted text contents as a Python list.
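
The four steps above can be sketched as follows. This is one possible approach, not a reference solution. It assumes the standard-library-backed 'html.parser' so that it runs without lxml (swap in 'lxml' if it is installed), and it returns each tag's text unmodified, since the problem does not say whether surrounding whitespace should be stripped.

```python
from bs4 import BeautifulSoup


def extract_book_titles(html_content: str) -> list[str]:
    # Step 1: parse the HTML string. "html.parser" is an assumption made
    # here to avoid the lxml dependency; "lxml" works the same way.
    soup = BeautifulSoup(html_content, "html.parser")

    # Step 2: find every <h2> whose class list includes "book-title".
    # find_all returns matches in document order.
    headings = soup.find_all("h2", class_="book-title")

    # Steps 3 and 4: pull the text out of each tag and return a list.
    return [heading.get_text() for heading in headings]
```

Because find_all returns an empty result when nothing matches, the empty-list case needs no special handling.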

Key requirements:

  • Use the BeautifulSoup library for parsing HTML.
  • Handle malformed HTML gracefully. BeautifulSoup's parsers tolerate most unclosed or mis-nested tags, so this usually needs no extra error handling.
  • The function should be named extract_book_titles.

Expected behavior:

  • The function should return an empty list if no book titles are found.
  • The order of titles in the returned list should correspond to their order in the HTML.

Edge cases to consider:

  • HTML with no <h2> tags.
  • HTML with <h2> tags but none having the book-title class.
  • HTML where <h2> tags might have multiple classes.
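
To see how class matching behaves on these edge cases, here is a small sketch. The HTML snippet is made up for illustration and 'html.parser' is assumed as the parser. It compares find_all with the CSS-selector method select, both of which match book-title as a whole class name rather than as a substring.

```python
from bs4 import BeautifulSoup

# Illustrative input, not taken from the problem statement.
html = (
    "<h2 class='book-title prominent'>Dune</h2>"   # extra class: should match
    "<h2 class='book-titles'>Not a match</h2>"     # similar name: no match
    "<h2>Plain heading</h2>"                       # no class at all
    "<h3 class='book-title'>Wrong tag</h3>"        # right class, wrong tag
)
soup = BeautifulSoup(html, "html.parser")

# class_ compares against each individual class on the tag.
by_find_all = [h.get_text() for h in soup.find_all("h2", class_="book-title")]

# The equivalent CSS selector: an <h2> carrying the class book-title.
by_selector = [h.get_text() for h in soup.select("h2.book-title")]

print(by_find_all)  # ['Dune']
print(by_selector)  # ['Dune']
```

Either call satisfies the multiple-classes case; which one to use is largely a matter of taste.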

Examples

Example 1:

Input:
html_content = "<html><body><h1>My Library</h1><div class='book'><h2 class='book-title'>The Hitchhiker's Guide to the Galaxy</h2><p>A classic sci-fi comedy.</p></div><div class='book'><h2 class='book-title'>Pride and Prejudice</h2><p>A novel by Jane Austen.</p></div></body></html>"

Output:
["The Hitchhiker's Guide to the Galaxy", 'Pride and Prejudice']

Explanation:
The function correctly identifies the two 'h2' tags with the class 'book-title' and extracts their text content.

Example 2:

Input:
html_content = "<html><body><p>No books here.</p><div><h2>Regular Heading</h2></div></body></html>"

Output:
[]

Explanation:
There are no 'h2' tags with the class 'book-title' in this HTML, so an empty list is returned.

Example 3:

Input:
html_content = "<html><body><h2 class='book-title prominent'>Dune</h2><h2 class='another-class book-title'>Foundation</h2></body></html>"

Output:
['Dune', 'Foundation']

Explanation:
The function correctly identifies 'h2' tags that have 'book-title' as one of their classes, even if other classes are present.

Constraints

  • The input html_content will always be a string.
  • html_content can range from an empty string to a reasonably complex HTML document.
  • You are expected to install the beautifulsoup4 and lxml libraries (if not already installed) using pip: pip install beautifulsoup4 lxml.
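
If lxml might be missing from the environment, one option is to fall back to the parser that ships with Python. The sketch below assumes a helper named make_soup, which is hypothetical and not part of the challenge. BeautifulSoup raises FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html_content: str) -> BeautifulSoup:
    """Hypothetical helper: prefer lxml, fall back to html.parser."""
    try:
        return BeautifulSoup(html_content, "lxml")
    except FeatureNotFound:
        # lxml is not installed; use the parser bundled with Python.
        return BeautifulSoup(html_content, "html.parser")


# Malformed input: the <h2> is never closed. Both parsers recover from this.
soup = make_soup("<h2 class='book-title'>Dune")
titles = [h.get_text() for h in soup.find_all("h2", class_="book-title")]
print(titles)  # ['Dune']
```

The two parsers can build slightly different trees for badly broken markup, so pinning one parser keeps results reproducible.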

Notes

  • Remember to import the BeautifulSoup class from the bs4 module.
  • You will likely need to initialize a BeautifulSoup object with the HTML content and a parser (e.g., 'lxml').
  • Explore the find_all() method of the BeautifulSoup object to locate the desired tags. It accepts a tag name plus attribute filters; because class is a reserved word in Python, the class filter is spelled class_.
  • Consider how to extract just the text content from the found tags.