
Python Full-Text Search Engine

You're tasked with building a simplified full-text search engine in Python. This engine will allow users to index a collection of documents and then query them for specific keywords. This is a fundamental component of many modern applications, from website search to document management systems.

Problem Description

Your goal is to implement a Python class (call it SimpleSearchEngine) that can perform the following operations:

  1. Indexing Documents: The engine should be able to accept a collection of documents, where each document is represented as a string. The engine needs to process these documents to make them searchable. This involves tokenizing the text, cleaning it, and storing it in a way that facilitates efficient searching (a minimal sketch of this cleaning step appears after this list).

  2. Searching for Keywords: Given a query string (a single word or multiple words), the engine should return a list of document IDs (indices) that contain the query terms. The search should be case-insensitive and should ideally handle basic stemming or ignore common "stop words" for better relevance.
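
To make point 1 concrete, here is a minimal sketch of one possible tokenization and cleaning step. The tokenize name is illustrative only and is not part of the required interface:

import string

def tokenize(text):
    # Lowercase the text, strip punctuation, and split on whitespace.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

tokenize("The quick brown fox jumps over the lazy dog.")
# -> ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']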

Key Requirements:

  • index(document_id, text) method: To add a document to the search index. document_id will be an integer identifier, and text will be the string content of the document.
  • search(query) method: To search for documents containing the query string. It should return a list of document_ids (a skeleton showing how index and search fit together is sketched after this list).
  • Case-Insensitivity: Searches should ignore capitalization.
  • Basic Tokenization and Cleaning: You'll need to break down text into individual words (tokens) and remove punctuation.
  • Document Identification: Each indexed document should have a unique identifier.
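
The requirements above can be met with an inverted index. What follows is only a hedged sketch of one possible shape; using a dictionary of sets and returning sorted IDs are design choices here, not requirements:

import string
from collections import defaultdict

class SimpleSearchEngine:
    def __init__(self):
        # Inverted index: term -> set of document IDs containing that term.
        self._index = defaultdict(set)

    def _tokenize(self, text):
        # Lowercase, strip punctuation, and split on whitespace.
        cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
        return cleaned.split()

    def index(self, document_id, text):
        for term in self._tokenize(text):
            self._index[term].add(document_id)

    def search(self, query):
        terms = self._tokenize(query)
        if not terms:
            return []
        # A document matches only if it contains every query term.
        result = None
        for term in terms:
            ids = self._index.get(term, set())
            result = set(ids) if result is None else result & ids
            if not result:
                return []
        return sorted(result)

With this layout, index() is linear in the number of tokens in the document, and search() reduces to a few dictionary lookups plus set intersections.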

Expected Behavior:

  • When a document is indexed, its content is processed and stored.
  • When a query is made, the engine efficiently retrieves relevant document IDs.
  • If no documents match the query, an empty list should be returned.

Edge Cases to Consider:

  • Empty documents: What happens if an empty string is indexed?
  • Empty queries: What should search("") return?
  • Documents with only punctuation or whitespace.
  • Queries with punctuation or extra whitespace.
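
Assuming the SimpleSearchEngine sketch shown earlier, where empty or punctuation-only input simply yields no tokens, these cases could reasonably behave as follows:

engine = SimpleSearchEngine()
engine.index(1, "")             # no tokens are produced, so nothing is stored
engine.index(2, "!!! ...")      # punctuation-only document: also no tokens
engine.index(3, "hello world")

engine.search("")               # -> []  (an empty query has no terms)
engine.search("  hello!  ")     # -> [3] (punctuation and extra whitespace are stripped)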

Examples

Example 1:

engine = SimpleSearchEngine()
engine.index(1, "The quick brown fox jumps over the lazy dog.")
engine.index(2, "A brown dog chased a black cat.")
engine.index(3, "The lazy fox is sleeping.")

query = "fox"
output = engine.search(query)
# Expected Output: [1, 3]

Explanation: Document 1 contains "fox", and Document 3 contains "fox". Document 2 does not.

Example 2:

engine = SimpleSearchEngine()
engine.index(10, "Python is a powerful programming language.")
engine.index(20, "Java is also a popular programming language.")
engine.index(30, "JavaScript is used for web development.")

query = "programming language"
output = engine.search(query)
# Expected Output: [10, 20]

Explanation: Both Document 10 and Document 20 contain both terms, "programming" and "language". Document 30 contains neither, so it is not returned.

Example 3: (Case-Insensitivity and Punctuation)

engine = SimpleSearchEngine()
engine.index(5, "Hello, World!")
engine.index(6, "hello world")

query = "WORLD"
output = engine.search(query)
# Expected Output: [5, 6]

Explanation: Once punctuation is stripped and case is ignored, both documents contain the token "world", so both match the query.

Constraints

  • The maximum number of documents to be indexed is 10,000.
  • The maximum length of a document string is 500 characters.
  • The query string will consist of alphanumeric characters and spaces, with a maximum length of 100 characters.
  • The search operation should ideally run in O(N*M) time or better, where N is the number of documents and M is the average number of unique terms per document, for a single-word query. For multi-word queries, performance will depend on how you handle the intersection of per-term matches.
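
For multi-word queries, cost is dominated by how the per-term ID sets are intersected. One common trick, sketched here under the assumption of a set-based inverted index like the one above, is to intersect the smallest sets first so intermediate results stay small:

def intersect_postings(posting_sets):
    # Intersect the smallest sets first and stop early if the result becomes empty.
    if not posting_sets:
        return set()
    ordered = sorted(posting_sets, key=len)
    result = set(ordered[0])
    for postings in ordered[1:]:
        result &= postings
        if not result:
            break
    return result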

Notes

  • You will need to decide on a data structure to store your indexed documents. A dictionary mapping terms to lists of document IDs is a common approach.
  • Consider using Python's built-in string methods for cleaning and tokenization (e.g., .lower(), .split(), .translate()).
  • For a more robust solution, you might consider stemming (reducing words to their root form, e.g., "running" -> "run") or removing stop words (common words like "the", "a", "is" that don't add much meaning). For this challenge, basic cleaning and case-insensitivity are sufficient, but you can explore these advanced features if you wish.
  • Your search method should handle queries with multiple words by finding documents that contain all the query words.
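
If you want to experiment with the optional stop-word removal mentioned above, one hedged variation of the tokenizer could look like the sketch below. The stop-word set is only an illustrative sample, not a standard list:

import string

STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to"}  # illustrative sample only

def tokenize_without_stop_words(text):
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in cleaned.split() if token not in STOP_WORDS]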