Hone logo
Hone
Problems

Implementing a Basic Full-Text Search Engine in Python

Full-text search is a crucial feature in many applications, allowing users to quickly find relevant information within a large body of text. This challenge asks you to implement a simplified full-text search engine in Python, focusing on indexing and searching a collection of documents. The goal is to build a system that can efficiently locate documents containing specific search terms.

Problem Description

You are tasked with creating a basic full-text search engine. The engine should consist of two primary components: an indexer and a searcher.

Indexer:

  • Takes a list of documents (strings) as input.
  • Creates an inverted index. An inverted index maps each unique word in the documents to a list of document IDs (indices in the original list) where that word appears.
  • The index should be stored in a suitable data structure (e.g., a dictionary).
  • Words should be converted to lowercase before indexing.
  • Punctuation should be removed from words before indexing.

Searcher:

  • Takes the inverted index and a search query (string) as input.
  • Converts the search query to lowercase and removes punctuation.
  • Splits the query into individual words.
  • Looks up each word in the inverted index.
  • Returns a set of document IDs that contain all the search query words. This is an AND search.
  • If a word in the query is not found in the index, the search should return an empty set.

Expected Behavior:

The search engine should accurately identify documents containing all the specified search terms. The results should be returned as a set of document IDs, ensuring uniqueness and order independence.

Edge Cases to Consider:

  • Empty documents: Handle cases where a document is an empty string.
  • Empty query: Handle cases where the search query is an empty string.
  • Documents with only punctuation: Handle documents that contain only punctuation.
  • Case sensitivity: The search should be case-insensitive.
  • Punctuation: Punctuation should be removed before indexing and searching.
  • Words not in the index: Handle search queries containing words not present in any of the documents.

Examples

Example 1:

Input: documents = ["The quick brown fox.", "The slow blue fox.", "The quick red rabbit."]
Input: query = "quick fox"
Output: {0, 1}
Explanation: Document 0 and Document 1 both contain the words "quick" and "fox". Document 2 contains "quick" but not "fox".

Example 2:

Input: documents = ["This is a test.", "This is another test.", "And yet another."]
Input: query = "is test"
Output: {0, 1}
Explanation: Documents 0 and 1 both contain "is" and "test".

Example 3: (Edge Case)

Input: documents = ["The quick brown fox.", "The slow blue fox.", "The quick red rabbit."]
Input: query = "purple elephant"
Output: set()
Explanation: Neither "purple" nor "elephant" appear in any of the documents.

Example 4: (Edge Case - Empty Document)

Input: documents = ["The quick brown fox.", "", "The quick red rabbit."]
Input: query = "quick"
Output: {0, 2}
Explanation: The empty string document is ignored.

Constraints

  • The number of documents will be between 1 and 1000.
  • Each document will be a string with a maximum length of 1000 characters.
  • The search query will be a string with a maximum length of 100 characters.
  • The words in the documents and the query will consist of alphanumeric characters and spaces.
  • Performance: The indexing process should complete within 1 second for the given constraints. The search process should complete within 0.5 seconds.

Notes

  • Consider using Python's built-in string methods for case conversion and punctuation removal.
  • The inverted index is the core data structure for efficient searching. Choose a data structure that allows for fast lookups.
  • Focus on clarity and readability in your code.
  • This is a simplified full-text search engine. More advanced features like stemming, stop word removal, and ranking are not required for this challenge.
  • The order of the document IDs in the output set is not important.
Loading editor...
python