Parallel Web Scraper with Crossbeam Channels

This challenge focuses on building a basic parallel web scraper using the crossbeam-channel crate. You'll learn how to manage concurrent tasks, pass data between threads over channels, and collect results efficiently. These are fundamental skills for applications that process large amounts of data or perform I/O-bound work concurrently.

Problem Description

Your task is to implement a web scraper that can fetch content from multiple URLs in parallel. You should use crossbeam channels to send URLs to worker threads and to receive the fetched HTML content back from them.

Requirements:

  1. URL Distribution: Create a channel to send URLs to be scraped.
  2. Worker Threads: Spawn multiple worker threads. Each worker thread will:
    • Receive a URL from the input channel.
    • Make an HTTP GET request to fetch the content of the URL.
    • Send the fetched HTML content (as a String) back to the main thread via another channel.
  3. Result Collection: The main thread should receive the result for every URL (the HTML content or an error message) from the worker threads.
  4. Error Handling: Handle potential errors during HTTP requests (e.g., network issues, invalid URLs). If an error occurs, the worker should report an error message instead of the HTML content.
  5. Termination: Ensure all worker threads gracefully terminate after processing all URLs.
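
One way to wire these requirements together is sketched below. This is a minimal sketch, not the required solution: the names scrape_all and fetch, the channel layout, and the error-string formats are all illustrative assumptions. The fetch helper here is deliberately simple; a version that distinguishes the error cases is sketched under "Edge Cases to Consider" below.

// Assumed dependencies (versions illustrative):
//   crossbeam-channel = "0.5"
//   reqwest = { version = "0.12", features = ["blocking"] }
use std::collections::HashMap;
use std::thread;

use crossbeam_channel::unbounded;

// Simplified fetch; see the refined version under "Edge Cases to Consider".
fn fetch(url: &str) -> Result<String, String> {
    reqwest::blocking::get(url)
        .and_then(|resp| resp.text())
        .map_err(|e| format!("Network Error: {e}"))
}

fn scrape_all(urls: Vec<String>, num_workers: usize) -> HashMap<String, Result<String, String>> {
    // One channel carries URLs to the workers...
    let (url_tx, url_rx) = unbounded::<String>();
    // ...and another carries (url, result) pairs back to the main thread.
    let (res_tx, res_rx) = unbounded::<(String, Result<String, String>)>();

    let handles: Vec<_> = (0..num_workers)
        .map(|_| {
            let url_rx = url_rx.clone();
            let res_tx = res_tx.clone();
            thread::spawn(move || {
                // iter() blocks for the next URL and ends once every
                // sender has been dropped, so the worker exits cleanly.
                for url in url_rx.iter() {
                    let result = fetch(&url);
                    res_tx.send((url, result)).expect("receiver alive");
                }
            })
        })
        .collect();

    // Drop the main thread's copies so only the workers hold them.
    drop(url_rx);
    drop(res_tx);

    for url in urls {
        url_tx.send(url).expect("workers alive");
    }
    // Closing the URL channel is the termination signal for the workers.
    drop(url_tx);

    // iter() on the result channel ends once every worker has finished
    // and dropped its sender, so this collects one entry per URL (an
    // empty URL list simply yields an empty map).
    let results: HashMap<_, _> = res_rx.iter().collect();
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    results
}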

Expected Behavior:

The program should take a list of URLs as input. It should then spawn a specified number of worker threads to process these URLs concurrently. The main thread should collect the results (either the HTML content or an error message) for each URL.
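
For example, a driver built on the scrape_all sketch above (a hypothetical name, not required by the problem) might look like:

fn main() {
    let urls = vec![
        "http://example.com".to_string(),
        "https://www.rust-lang.org".to_string(),
    ];
    // Two workers, as in Example 1 below.
    let results = scrape_all(urls, 2);
    for (url, result) in &results {
        match result {
            Ok(html) => println!("{url}: fetched {} bytes", html.len()),
            Err(msg) => println!("{url}: {msg}"),
        }
    }
}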

Edge Cases to Consider:

  • Empty list of URLs.
  • URLs that return non-HTML content.
  • URLs that result in HTTP errors (e.g., 404 Not Found, server errors).
  • Network connectivity issues.
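
A fetch helper that distinguishes these cases might look like the following sketch; the exact error-message strings are illustrative, chosen to match the examples below:

fn fetch(url: &str) -> Result<String, String> {
    match reqwest::blocking::get(url) {
        // Transport-level failures: DNS resolution, refused connections,
        // timeouts, and similar network issues.
        Err(e) => Err(format!("Network Error: {e}")),
        // The request went through, but the server returned an error
        // status such as 404 Not Found or 500 Internal Server Error.
        Ok(resp) if !resp.status().is_success() => {
            Err(format!("HTTP Error: {}", resp.status()))
        }
        // Otherwise return the body as text. Non-HTML bodies are still
        // returned as-is here; inspect the Content-Type header if the
        // spec is tightened to reject them.
        Ok(resp) => resp.text().map_err(|e| format!("Body Error: {e}")),
    }
}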

Examples

Example 1:

Input:
URLs: ["http://example.com", "https://www.rust-lang.org"]
Number of Workers: 2

Output:
(A HashMap or similar structure mapping URLs to their fetched content or error messages)
Example:
{
    "http://example.com": Ok("<!doctype html>..."),
    "https://www.rust-lang.org": Ok("<html><head>...</head><body>...")
}

Explanation: The program sends each URL to a worker. Worker 1 fetches "http://example.com" and sends its HTML back. Worker 2 fetches "https://www.rust-lang.org" and sends its HTML back. The main thread collects both results.

Example 2:

Input:
URLs: ["http://httpbin.org/status/404", "http://invalid-url-that-does-not-exist.com"]
Number of Workers: 1

Output:
{
    "http://httpbin.org/status/404": Err("HTTP Error: 404 Not Found"),
    "http://invalid-url-that-does-not-exist.com": Err("Network Error: ...")
}

Explanation: The first URL results in a 404 error, which is captured and reported. The second URL fails to resolve or connect, and a network error is reported.

Constraints

  • The number of worker threads will be between 1 and 10.
  • The list of URLs can contain up to 50 URLs.
  • Each URL will be a well-formed HTTP or HTTPS URL, though the request itself may still fail (see Example 2).
  • The solution should be efficient, leveraging concurrency for performance.

Notes

  • You will need to add the crossbeam-channel and reqwest crates to your Cargo.toml.
  • reqwest is a popular HTTP client for Rust. Its blocking API (reqwest::blocking::get() and Response::text()) pairs naturally with thread-based workers.
  • Consider using std::thread::spawn for creating threads.
  • The crossbeam-channel crate provides powerful and flexible channels for inter-thread communication.
  • Think about how to signal to the worker threads that there are no more URLs to process.
  • The main thread needs to wait for all worker threads to finish before exiting.
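
On the last two notes: with crossbeam-channel, dropping the final Sender closes the channel, which ends Receiver::iter() and lets the worker's loop finish. A minimal, self-contained demonstration (illustrative code, not part of the required solution):

use std::thread;

use crossbeam_channel::unbounded;

fn main() {
    let (tx, rx) = unbounded::<String>();
    let worker = thread::spawn(move || {
        // Blocks for each message; the loop ends when the channel closes.
        for item in rx.iter() {
            println!("processing {item}");
        }
        println!("worker done"); // Reached only after tx is dropped.
    });
    tx.send("http://example.com".to_string()).unwrap();
    drop(tx); // Signal: no more work is coming.
    worker.join().unwrap(); // Wait for the worker before exiting.
}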