Web Scraper for E-commerce Product Information
This challenge involves creating a Python web scraper to extract valuable product information from an e-commerce website. You'll practice using popular libraries like requests and BeautifulSoup to navigate HTML and parse data, a fundamental skill for data analysis and automation in many fields.
Problem Description
Your task is to build a Python script that can scrape product details from a given e-commerce product page URL. The script should extract the product's title, current price, and availability status. It must tolerate variations in page structure and must not crash when an expected element is missing.
Key Requirements:
- Fetch HTML Content: Use the requests library to download the HTML content of a provided URL.
- Parse HTML: Utilize BeautifulSoup to parse the downloaded HTML, making it easy to search and extract data.
- Extract Product Title: Locate and extract the main title of the product.
- Extract Product Price: Find and extract the current selling price of the product.
- Extract Availability Status: Determine whether the product is "In Stock" or "Out of Stock," and report "Not Specified" when the status is not explicitly mentioned.
- Handle Data Types: Ensure extracted price is converted to a numerical format (e.g., float) and availability is a clear string.
- Return Structured Data: The scraped information should be returned as a dictionary.
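The first two requirements could be sketched roughly as follows. This is a minimal illustration, not a reference solution: the function names fetch_html and parse_html, the 10-second timeout, and the User-Agent string are all assumptions made for the example.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Return the page HTML as text, or None if the request fails."""
    # A descriptive User-Agent is a courtesy; this value is a placeholder.
    headers = {"User-Agent": "product-scraper-exercise/0.1"}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # Turn 4xx/5xx responses into exceptions so error pages are not parsed.
        response.raise_for_status()
    except requests.RequestException:
        # Covers malformed URLs, connection errors, timeouts, and HTTP errors.
        return None
    return response.text


def parse_html(html):
    """Parse HTML with Python's built-in parser (no extra install needed)."""
    return BeautifulSoup(html, "html.parser")
```

Returning None on failure, rather than raising, is one possible design choice; it lets the caller decide what the result dictionary should contain when a page cannot be fetched.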
Expected Behavior:
Given a valid e-commerce product page URL, the script should return a dictionary containing the extracted title, price, and availability.
Edge Cases to Consider:
- Products with prices displayed in different formats (e.g., with or without currency symbols, using commas as thousands separators).
- Variations in HTML tags and class names used for product information across different product pages on the same or different e-commerce sites.
- Pages where availability information is not clearly stated or is displayed in a non-standard way.
- URLs that might lead to non-product pages or error pages.
- Missing elements on the page.
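For the price-format edge case, one possible approach is a small regular expression that picks out the first number in the text and strips thousands separators. The helper name parse_price is hypothetical, and the sketch assumes a dot as the decimal separator and commas as thousands separators; formats such as "1.299,00" would need different handling.

```python
import re

# Matches a number with optional thousands commas and optional decimals,
# e.g. "199.99", "1,299.00", "50".
_PRICE_RE = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def parse_price(text):
    """Return the first price-like number in text as a float, or None."""
    if not text:
        return None
    match = _PRICE_RE.search(text)
    if match is None:
        return None
    # Remove thousands separators before converting.
    return float(match.group().replace(",", ""))


print(parse_price("$199.99"))         # 199.99
print(parse_price("1,299.00"))        # 1299.0
print(parse_price("Call for price"))  # None
```

Because the function returns None for empty or non-numeric input, a missing price element does not raise an exception.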
Examples
Example 1:
Input URL: A hypothetical URL for a specific product on a well-structured e-commerce site.
# Assuming a mock HTML response for a product page:
# <div class="product-details">
# <h1 class="product-title">Awesome Gadget Pro</h1>
# <span class="price">$199.99</span>
# <div class="stock-status available">In Stock</div>
# </div>
Input URL: "http://example.com/products/awesome-gadget-pro"
Output:
{
"title": "Awesome Gadget Pro",
"price": 199.99,
"availability": "In Stock"
}
Explanation: The scraper successfully identifies the title in the h1 tag with class product-title, the price in the span tag with class price, and the availability from the div with class stock-status. The price is converted to a float.
Example 2:
Input URL: A hypothetical URL for a product that is out of stock.
# Assuming a mock HTML response for an out-of-stock product:
# <div class="product-info">
# <h2 class="name">Super Widget Lite</h2>
# <div class="current-price">
# <span class="original-price">50.00</span>
# <span class="discount">(-10%)</span>
# </div>
# <p class="availability-message">Sorry, this item is currently out of stock.</p>
# </div>
Input URL: "http://example.com/products/super-widget-lite"
Output:
{
"title": "Super Widget Lite",
"price": 50.00,
"availability": "Out of Stock"
}
Explanation: The scraper finds the title in an h2 tag, takes the price from the first span inside the current-price div (ignoring the discount label), and identifies "Out of Stock" from the availability message.
Example 3:
Input URL: A hypothetical URL for a product where availability is not explicitly stated.
# Assuming a mock HTML response for a product without clear availability:
# <div class="item-card">
# <h3 class="item-name">Basic Connector</h3>
# <span class="product-price">£25.50</span>
# <!-- No availability information -->
# </div>
Input URL: "http://example.com/products/basic-connector"
Output:
{
"title": "Basic Connector",
"price": 25.50,
"availability": "Not Specified"
}
Explanation: The scraper extracts the title and price, but since no clear availability indicator is found, it defaults to "Not Specified".
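The availability handling across the three examples can be approximated with simple phrase matching. The function name detect_availability and the phrase lists below are assumptions for illustration; a real site may use different wording.

```python
def detect_availability(text):
    """Map free-form stock text to one of three labels."""
    if not text:
        return "Not Specified"
    lowered = text.lower()
    # Check negative phrases first: "unavailable" contains "available",
    # and "not in stock" contains "in stock".
    negative = ("out of stock", "not in stock", "sold out", "unavailable")
    if any(phrase in lowered for phrase in negative):
        return "Out of Stock"
    if any(phrase in lowered for phrase in ("in stock", "available")):
        return "In Stock"
    return "Not Specified"
```

Passing None (no availability element found) or text that matches no known phrase yields "Not Specified", which mirrors the default used in Example 3.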
Constraints
- You will be provided with valid URLs that point to actual product pages. However, the exact HTML structure can vary.
- The script should be written to handle common HTML structures. You are not expected to build a scraper that works for every possible e-commerce site without modification. Focus on a representative set of common patterns.
- The script should be reasonably efficient and not overload the target server with requests. For this challenge, assume only one URL will be processed at a time.
- The output dictionary keys must be title, price, and availability.
Notes
- You will need to install the requests and beautifulsoup4 libraries: pip install requests beautifulsoup4
- Consider using CSS selectors or element tag names to locate the desired information.
- Regular expressions might be useful for cleaning up price strings.
- Think about how to handle cases where a specific element (e.g., price) might be missing from the HTML. Your script should not crash.
- For this challenge, you can assume you are scraping a single, static HTML page. Dynamic content loaded via JavaScript is out of scope.
- To test your code, you might need to create mock HTML files or use a web scraping sandbox/testing website. You will NOT be submitting code to a live website during evaluation.
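Several of the notes above (CSS selectors, missing elements, testing against mock HTML) can be combined in a small helper that tries a list of candidate selectors in order. The helper name first_text and the selector lists are hypothetical; the selectors are taken from the mock pages in the examples and would need adjusting for other sites.

```python
from bs4 import BeautifulSoup

# Candidate selectors, tried in order. These are assumptions drawn from the
# mock pages in the examples, not a list that fits every site.
TITLE_SELECTORS = ["h1.product-title", "h2.name", "h3.item-name", "h1"]
STOCK_SELECTORS = [".stock-status", ".availability-message"]


def first_text(soup, selectors):
    """Return stripped text of the first selector that matches, else None."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            text = element.get_text(strip=True)
            if text:
                return text
    return None


# Mock page modeled on Example 3 (no availability element).
MOCK_HTML = """
<div class="item-card">
  <h3 class="item-name">Basic Connector</h3>
  <span class="product-price">£25.50</span>
</div>
"""

soup = BeautifulSoup(MOCK_HTML, "html.parser")
title = first_text(soup, TITLE_SELECTORS)
stock_text = first_text(soup, STOCK_SELECTORS)  # None: no element matches
print(title, stock_text)
```

Since first_text returns None instead of raising when nothing matches, the calling code can substitute a default such as "Not Specified" without wrapping every lookup in a try/except.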