SQL Full-Text Search Implementation
You are tasked with building a simplified full-text search capability for a document management system. This feature will allow users to quickly find documents containing specific keywords or phrases within their content. This is a fundamental feature for many applications, from e-commerce product searches to knowledge base systems.
Problem Description
Implement a SQL query that can efficiently search through a collection of documents to find those that contain a given search query. The search should be case-insensitive and should match documents where all words in the search query appear. The order of the words in the search query does not matter.
Key Requirements:
- Case-Insensitive Search: The search should not distinguish between uppercase and lowercase letters.
- All Word Matching: A document should only be returned if it contains all the individual words present in the search query.
- Order Agnostic: The order of keywords in the search query should not affect the search results.
- Efficiency: The query should be reasonably efficient, especially for a large number of documents.
Expected Behavior:
Given a table named documents with columns id (integer, primary key) and content (text), and a search query string, the query should return the ids of all documents whose content contains all words from the search query, ignoring case.
Edge Cases to Consider:
- Empty Search Query: If the search query is empty or contains only whitespace, what should happen? (Assume no documents are returned).
- Single Word Search Query: The query should correctly handle searching for a single word.
- Punctuation: How should punctuation attached to words be handled? (For this challenge, assume punctuation attached to words should be treated as part of the word, or ideally ignored if it's a common separator). For simplicity in this pseudocode challenge, we will assume words are separated by spaces, and we'll normalize by converting to lowercase.
- Multiple Spaces: Multiple spaces between words in the search query should be treated as a single separator.
Examples
Example 1:
-
Input:
documentstable:id content 101 "The quick brown fox jumps over the lazy dog." 102 "A lazy fox is not a quick fox." 103 "The dog is brown and furry." 104 "Quick, quick, quick!" search_query: "quick brown fox"
-
Output:
id 101 -
Explanation: Document 101 contains "quick", "brown", and "fox". Document 102 contains "fox" and "quick" but not "brown". Document 103 contains "brown" but not "quick" or "fox". Document 104 contains "quick" but not "brown" or "fox".
Example 2:
-
Input:
documentstable: (Same as Example 1)search_query: "Lazy Dog"
-
Output:
id 101 103 -
Explanation: Document 101 contains "Lazy" and "Dog". Document 103 contains "dog" (case-insensitive match) but not "Lazy". This example highlights a misunderstanding of the requirement. Correct explanation: Document 101 contains "lazy" and "dog". Document 103 contains "dog" but not "lazy". So only document 101 should be returned.
Correction to example logic:
- Output:
id 101 - Explanation: Document 101 contains "lazy" and "dog". Document 103 contains "dog" but not "lazy". Document 102 contains "lazy" but not "dog". Document 104 contains neither.
- Output:
Example 3:
-
Input:
documentstable: (Same as Example 1)search_query: " QUICK "
-
Output:
id 101 102 104 -
Explanation: The search query is normalized to "quick". Document 101, 102, and 104 all contain the word "quick" (case-insensitive).
Constraints
- The
documentstable can contain up to 1,000,000 rows. - The
contentcolumn can store up to 10,000 characters per document. - The
search_querystring can be up to 500 characters long. - The query should aim to execute within 5 seconds for typical search queries on the given constraints.
Notes
- You will need to split the
search_queryinto individual words. - You'll need a way to check for the presence of each individual word within the document content, ensuring case-insensitivity.
- Consider how to handle the "all words must match" requirement. This often involves checking for the existence of each word and then ensuring that all such checks are true for a given document.
- While true full-text search engines have sophisticated indexing and stemming capabilities, for this challenge, a simpler approach focusing on exact word matching (after normalization) is sufficient.
- Pseudocode for splitting strings and checking substring presence will be assumed to be available. For example,
SPLIT_STRING(text, delimiter)andCONTAINS(text, substring).