Building a Simple Data Pipeline for Log Processing
This challenge asks you to implement a basic data pipeline in Python to process log data. Log data is a common source of information in many applications, and pipelines are essential for cleaning, transforming, and loading this data for analysis or storage. This exercise will help you understand the fundamental concepts of data pipelines and how to structure them in Python.
Problem Description
You are tasked with building a data pipeline that reads log entries from a text file, filters them based on a severity level, transforms the entries to include a timestamp, and then writes the processed entries to a new file. The pipeline should consist of three stages: Reading, Processing, and Writing.
What needs to be achieved:
- Reading: Read log entries from an input file. Each line in the file represents a single log entry.
- Processing:
  - Filter log entries based on a specified severity level (e.g., "ERROR", "WARNING"). Only entries with this severity level should be processed further.
  - Add a timestamp to each filtered log entry. The timestamp should be in the format "YYYY-MM-DD HH:MM:SS".
- Writing: Write the processed log entries to an output file.
Key Requirements:
- The pipeline should be modular, with separate functions for reading, processing, and writing.
- The severity level to filter on should be configurable.
- Error handling should be included to gracefully handle file not found errors and invalid log entry formats.
- The pipeline should be efficient enough to handle reasonably sized log files (up to 10MB).
Expected Behavior:
The pipeline should read the input file, filter the log entries based on the specified severity level, add a timestamp to each filtered entry, and write the processed entries to the output file. If the input file is not found, an appropriate error message should be printed. If a log entry is malformed (doesn't contain a severity level), it should be skipped.
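The three stages described above can be sketched as separate functions wired together with generators. This is only one possible structure, and the function names (read_entries, process_entries, write_entries) are illustrative, not required by the challenge:

```python
from datetime import datetime

def read_entries(path):
    """Yield raw log lines from the input file, one at a time."""
    with open(path, "r") as f:
        for line in f:
            stripped = line.strip()
            if stripped:
                yield stripped

def process_entries(entries, severity):
    """Keep entries with the given severity and prefix a processing timestamp."""
    for entry in entries:
        # Malformed entries (no "SEVERITY:" token) are skipped implicitly.
        if f"{severity}:" in entry:
            stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            yield f"{stamp} {entry}"

def write_entries(entries, path):
    """Write processed entries to the output file, one per line."""
    with open(path, "w") as f:
        for entry in entries:
            f.write(entry + "\n")
```

Chaining the stages keeps each one independently testable, e.g. `write_entries(process_entries(read_entries("input.log"), "ERROR"), "output.log")`.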
Edge Cases to Consider:
- Input file does not exist.
- Input file is empty.
- Log entries are malformed (e.g., missing severity level).
- Severity level specified is not found in any log entries.
- Large input file (performance considerations).
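Most of these edge cases reduce to one try/except plus careful filtering. The sketch below (the driver name `run_pipeline` is an assumption, and it omits the timestamp step to focus on error handling) shows how a missing file, an empty file, and malformed or unmatched entries can all be handled without crashing:

```python
def run_pipeline(input_path, output_path, severity):
    """Drive the pipeline end to end, handling a missing input file gracefully."""
    try:
        with open(input_path, "r") as f:
            lines = [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        print("Error: Input file not found.")
        return
    # An empty file, or a severity that matches no entries, simply produces an
    # empty output file; lines without a "SEVERITY:" token are skipped.
    filtered = [line for line in lines if f"{severity}:" in line]
    with open(output_path, "w") as f:
        for entry in filtered:
            f.write(entry + "\n")
```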
Examples
Example 1:
Input (input.log):
2023-10-26 10:00:00 INFO: Application started
2023-10-26 10:01:00 WARNING: Disk space low
2023-10-26 10:02:00 ERROR: Database connection failed
2023-10-26 10:03:00 INFO: User logged in
2023-10-26 10:04:00 ERROR: Network timeout
Output (output.log, severity="ERROR"):
2023-10-26 10:02:00 ERROR: Database connection failed
2023-10-26 10:04:00 ERROR: Network timeout
Explanation: The pipeline reads the input file, filters for entries with severity "ERROR", adds the current timestamp to each filtered entry, and writes the results to the output file. (The added timestamp is not shown in the sample output above because its value depends on when the pipeline runs.)
Example 2:
Input (input.log):
2023-10-26 10:00:00 INFO: Application started
2023-10-26 10:01:00 WARNING: Disk space low
2023-10-26 10:02:00 ERROR: Database connection failed
2023-10-26 10:03:00 INFO: User logged in
Output (output.log, severity="WARNING"):
2023-10-26 10:01:00 WARNING: Disk space low
Explanation: The pipeline filters for entries with severity "WARNING" and adds a timestamp.
Example 3: (Edge Case - File Not Found)
Input: (input.log does not exist)
Severity: "ERROR"
Output: (Prints "Error: Input file not found.")
Explanation: The pipeline handles the case where the input file does not exist.
Constraints
- File Size: The input file should be no larger than 10MB.
- Input Format: Each line in the input file represents a log entry in the format "YYYY-MM-DD HH:MM:SS SEVERITY: Message".
- Timestamp Format: The timestamp added to the processed log entries should be in the format "YYYY-MM-DD HH:MM:SS".
- Severity Level: The severity level to filter on should be a string (e.g., "ERROR", "WARNING", "INFO").
- Performance: The pipeline should complete within a reasonable time (e.g., less than 5 seconds) for a 10MB input file.
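Because the input format is fixed, detecting malformed entries can be a small parsing step. One way to do it (the helper name `parse_severity` is hypothetical) is to split on the first few spaces and check for the "SEVERITY:" token:

```python
def parse_severity(entry):
    """Extract the severity from "YYYY-MM-DD HH:MM:SS SEVERITY: Message".

    Returns None for malformed entries so the caller can skip them.
    """
    parts = entry.split(" ", 3)  # date, time, "SEVERITY:", message
    if len(parts) < 3 or not parts[2].endswith(":"):
        return None
    return parts[2].rstrip(":")
```

Returning None instead of raising keeps the processing loop simple: malformed lines are skipped with a single `if` check.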
Notes
- Consider using Python's built-in file handling capabilities.
- The datetime module can be helpful for generating timestamps.
- Think about how to handle potential errors gracefully.
- Modularity is key – break down the problem into smaller, manageable functions.
- You don't need to implement complex error reporting; simple error messages are sufficient.
- Using the current timestamp is sufficient; you don't need to parse the original timestamp from the log entry.
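Generating a timestamp in the required format is a one-liner with the datetime module mentioned above:

```python
from datetime import datetime

# Format the current time as "YYYY-MM-DD HH:MM:SS".
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
```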