Build an Aho-Corasick Automaton in JavaScript
The Aho-Corasick algorithm is an efficient string-searching algorithm that allows you to find all occurrences of a finite set of keywords within a text in a single pass. This is incredibly useful for tasks like plagiarism detection, spell checking, and pattern matching in large datasets. Your challenge is to implement this algorithm in JavaScript.
Problem Description
You need to create a JavaScript class that represents an Aho-Corasick automaton. This automaton will be built from a given set of keywords. Once built, it should be able to efficiently search a given text and report all occurrences of any of the keywords.
Key Requirements:
AhoCorasickClass: Create a class namedAhoCorasick.- Constructor (
constructor(keywords)):- Accepts an array of strings (
keywords) representing the patterns to search for. - Builds the Aho-Corasick automaton internally based on these keywords.
- Accepts an array of strings (
- Search Method (
search(text)):- Accepts a string (
text) to search within. - Returns an array of objects. Each object should represent an occurrence of a keyword and contain:
keyword: The keyword that was found.index: The starting index in thetextwhere thekeywordwas found.
- Accepts a string (
- Trie Construction: The core of the Aho-Corasick algorithm is a trie (prefix tree) built from the keywords. Each node in the trie should represent a character.
- Failure Links: For each node in the trie, you need to compute its "failure link". The failure link of a node points to the longest proper suffix of the string represented by that node, which is also a prefix of some keyword in the set.
- Output Links (Optional but Recommended for Efficiency): For each node, you should also compute an "output link" (or "dictionary suffix link"). This link points to the nearest ancestor node that corresponds to a complete keyword. This helps in reporting all matches efficiently.
- Efficient Searching: The
searchmethod should traverse thetextand the automaton simultaneously, utilizing the trie structure and failure links to avoid redundant comparisons.
Expected Behavior:
- The automaton should correctly identify all occurrences of all keywords.
- If a keyword appears multiple times, each occurrence should be reported.
- If a keyword is a substring of another keyword, both should be reported if they match. For example, if keywords are "he" and "she", searching "she" should report both "he" (at index 1) and "she" (at index 0).
- The algorithm should be case-sensitive.
Edge Cases to Consider:
- Empty
keywordsarray. - Empty
textstring. - Keywords with shared prefixes.
- Keywords that are substrings of other keywords.
- Keywords that are identical.
- Keywords containing special characters (though the problem assumes standard string characters).
Examples
Example 1:
const keywords = ["he", "she", "his", "hers"];
const text = "ahishers";
Output: [
{ keyword: "his", index: 1 },
{ keyword: "hers", index: 3 },
{ keyword: "she", index: 2 }
]
Explanation:
The text "ahishers" contains:
- "his" starting at index 1.
- "hers" starting at index 3.
- "she" starting at index 2. Note that "she" is found even though it overlaps with "hers". The order of output might vary depending on implementation, but all correct matches should be present.
Example 2:
const keywords = ["a", "ab", "abc"];
const text = "abcabc";
Output: [
{ keyword: "a", index: 0 },
{ keyword: "ab", index: 0 },
{ keyword: "abc", index: 0 },
{ keyword: "a", index: 3 },
{ keyword: "ab", index: 3 },
{ keyword: "abc", index: 3 }
]
Explanation: In "abcabc", each keyword ("a", "ab", "abc") appears twice, starting at index 0 and index 3.
Example 3:
const keywords = ["apple", "app"];
const text = "applepie";
Output: [
{ keyword: "app", index: 0 },
{ keyword: "apple", index: 0 }
]
Explanation: "applepie" contains "app" starting at index 0 and "apple" starting at index 0.
Constraints
- The number of keywords will be between 0 and 1000.
- The length of each keyword will be between 1 and 50 characters.
- The length of the text to search will be between 0 and 100,000 characters.
- Keywords and text will only contain lowercase English letters (a-z).
- The solution should aim for a time complexity of O(N + M + K) where N is the length of the text, M is the total length of all keywords, and K is the number of matches found.
Notes
- This is a challenging algorithm to implement correctly. Focus on understanding the trie construction and the failure link computation.
- Breadth-First Search (BFS) is a common approach to compute failure links after the initial trie is built.
- When a match is found, you might need to follow output links to find shorter keywords that are suffixes of the current match.
- Consider using a sentinel character or a special node to represent the root of your trie.