Python Full-Text Search Engine
You're tasked with building a simplified full-text search engine in Python. This engine will allow users to index a collection of documents and then query them for specific keywords. This is a fundamental component of many modern applications, from website search to document management systems.
Problem Description
Your goal is to implement a Python class, let's call it SimpleSearchEngine, that can perform the following operations:
-
Indexing Documents: The engine should be able to accept a collection of documents, where each document is represented as a string. The engine needs to process these documents to make them searchable. This involves tokenizing the text, cleaning it, and storing it in a way that facilitates efficient searching.
-
Searching for Keywords: Given a query string (a single word or multiple words), the engine should return a list of document IDs (indices) that contain the query terms. The search should be case-insensitive and should ideally handle basic stemming or ignore common "stop words" for better relevance.
Key Requirements:
index(document_id, text)method: To add a document to the search index.document_idwill be an integer identifier, andtextwill be the string content of the document.search(query)method: To search for documents containing thequerystring. It should return a list ofdocument_ids.- Case-Insensitivity: Searches should ignore capitalization.
- Basic Tokenization and Cleaning: You'll need to break down text into individual words (tokens) and remove punctuation.
- Document Identification: Each indexed document should have a unique identifier.
Expected Behavior:
- When a document is indexed, its content is processed and stored.
- When a query is made, the engine efficiently retrieves relevant document IDs.
- If no documents match the query, an empty list should be returned.
Edge Cases to Consider:
- Empty documents: What happens if an empty string is indexed?
- Empty queries: What should
search("")return? - Documents with only punctuation or whitespace.
- Queries with punctuation or extra whitespace.
Examples
Example 1:
engine = SimpleSearchEngine()
engine.index(1, "The quick brown fox jumps over the lazy dog.")
engine.index(2, "A brown dog chased a black cat.")
engine.index(3, "The lazy fox is sleeping.")
query = "fox"
output = engine.search(query)
# Expected Output: [1, 3]
Explanation: Document 1 contains "fox", and Document 3 contains "fox". Document 2 does not.
Example 2:
engine = SimpleSearchEngine()
engine.index(10, "Python is a powerful programming language.")
engine.index(20, "Java is also a popular programming language.")
engine.index(30, "JavaScript is used for web development.")
query = "programming language"
output = engine.search(query)
# Expected Output: [10, 20]
Explanation: Both Document 10 and Document 20 contain the terms "programming" and "language". Document 30 does not contain both.
Example 3: (Case-Insensitivity and Punctuation)
engine = SimpleSearchEngine()
engine.index(5, "Hello, World!")
engine.index(6, "hello world")
query = "WORLD"
output = engine.search(query)
# Expected Output: [5, 6]
Explanation: Both documents contain variations of "world" and are matched by the case-insensitive query.
Constraints
- The maximum number of documents to be indexed is 10,000.
- The maximum length of a document string is 500 characters.
- The query string will consist of alphanumeric characters and spaces, with a maximum length of 100 characters.
- The search operation should ideally be performed in O(N*M) time complexity or better, where N is the number of documents and M is the average number of unique terms per document, for a single-word query. For multi-word queries, performance will depend on how you handle intersections of word occurrences.
Notes
- You will need to decide on a data structure to store your indexed documents. A dictionary mapping terms to lists of document IDs is a common approach.
- Consider using Python's built-in string methods for cleaning and tokenization (e.g.,
.lower(),.split(),.translate()). - For a more robust solution, you might consider stemming (reducing words to their root form, e.g., "running" -> "run") or removing stop words (common words like "the", "a", "is" that don't add much meaning). For this challenge, basic cleaning and case-insensitivity are sufficient, but you can explore these advanced features if you wish.
- Your
searchmethod should handle queries with multiple words by finding documents that contain all the query words.