Implementing a Basic Full-Text Search Engine in Python

Full-text search is a crucial feature in many applications, allowing users to quickly find relevant information within a large body of text. This challenge asks you to implement a simplified full-text search engine in Python, focusing on indexing and searching a collection of documents. The goal is to build a system that can efficiently locate documents containing specific search terms.

Problem Description

You are tasked with creating a basic full-text search engine. The engine should consist of two primary components: an indexer and a searcher.

Indexer:

Takes a list of documents (strings) as input.
Creates an inverted index. An inverted index maps each unique word in the documents to a list of document IDs (indices in the original list) where that word appears.
The index should be stored in a suitable data structure (e.g., a dictionary).
Words should be converted to lowercase before indexing.
Punctuation should be removed from words before indexing.

Searcher:

Takes the inverted index and a search query (string) as input.
Converts the search query to lowercase and removes punctuation.
Splits the query into individual words.
Looks up each word in the inverted index.
Returns a set of document IDs that contain all the search query words. This is an AND search.
If a word in the query is not found in the index, the search should return an empty set.

Expected Behavior:

The search engine should accurately identify documents containing all the specified search terms. The results should be returned as a set of document IDs, ensuring uniqueness and order independence.

Edge Cases to Consider:

Empty documents: Handle cases where a document is an empty string.
Empty query: Handle cases where the search query is an empty string.
Documents with only punctuation: Handle documents that contain only punctuation.
Case sensitivity: The search should be case-insensitive.
Punctuation: Punctuation should be removed before indexing and searching.
Words not in the index: Handle search queries containing words not present in any of the documents.

Examples

Example 1:

Input: documents = ["The quick brown fox.", "The slow blue fox.", "The quick red rabbit."]
Input: query = "quick fox"
Output: {0, 1}
Explanation: Document 0 and Document 1 both contain the words "quick" and "fox". Document 2 contains "quick" but not "fox".

Example 2:

Input: documents = ["This is a test.", "This is another test.", "And yet another."]
Input: query = "is test"
Output: {0, 1}
Explanation: Documents 0 and 1 both contain "is" and "test".

Example 3: (Edge Case)

Input: documents = ["The quick brown fox.", "The slow blue fox.", "The quick red rabbit."]
Input: query = "purple elephant"
Output: set()
Explanation: Neither "purple" nor "elephant" appear in any of the documents.

Example 4: (Edge Case - Empty Document)

Input: documents = ["The quick brown fox.", "", "The quick red rabbit."]
Input: query = "quick"
Output: {0, 2}
Explanation: The empty string document is ignored.

Constraints

The number of documents will be between 1 and 1000.
Each document will be a string with a maximum length of 1000 characters.
The search query will be a string with a maximum length of 100 characters.
The words in the documents and the query will consist of alphanumeric characters and spaces.
Performance: The indexing process should complete within 1 second for the given constraints. The search process should complete within 0.5 seconds.

Notes

Consider using Python's built-in string methods for case conversion and punctuation removal.
The inverted index is the core data structure for efficient searching. Choose a data structure that allows for fast lookups.
Focus on clarity and readability in your code.
This is a simplified full-text search engine. More advanced features like stemming, stop word removal, and ranking are not required for this challenge.
The order of the document IDs in the output set is not important.