Hone logo
Hone
Problems

Finding Repeated DNA Sequences

The problem asks us to identify all 10-letter-long sequences that appear more than once in a given DNA sequence. DNA sequences are represented as strings composed of the characters 'A', 'C', 'G', and 'T'. This is a common bioinformatics task, useful for tasks like gene annotation and identifying repetitive elements within genomes.

Problem Description

Given a string s representing a DNA sequence, find all 10-letter-long substrings that occur more than once in s.

Key Requirements:

  • A substring is a contiguous sequence of characters within a string.
  • The length of the substrings we are looking for is exactly 10.
  • We need to return a list of all such 10-letter substrings that appear at least twice.
  • The order of the repeated sequences in the output does not matter.
  • Each repeated sequence should appear only once in the output list, even if it appears more than twice in the input string.

Expected Behavior:

The function should iterate through the input DNA string, consider all possible 10-letter substrings, and keep track of their occurrences. If a substring is encountered for the second time (or more), it should be added to the result list.

Edge Cases to Consider:

  • Short Input String: If the input string s has a length less than 10, there can be no 10-letter substrings, so the output should be an empty list.
  • No Repeated Sequences: If all 10-letter substrings are unique, the output should be an empty list.
  • All Characters Identical: A string like "AAAAAAAAAAAA..." might have many overlapping repeated sequences.

Examples

Example 1:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC", "CCCCCAAAAA"]
Explanation: The 10-letter substrings "AAAAACCCCC" and "CCCCCAAAAA" both appear twice.

Example 2:

Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]
Explanation: The 10-letter substring "AAAAAAAAAA" appears 3 times.

Example 3:

Input: s = "AGCTAGCTAGCT"
Output: []
Explanation: All 10-letter substrings are unique. The only 10-letter substrings are "AGCTAGCTAG" and "GCTAGCTAGC".

Constraints

  • 1 <= s.length <= 10^5
  • s consists only of the characters 'A', 'C', 'G', and 'T'.
  • The solution should aim for an efficient time complexity.

Notes

  • Consider how you can efficiently keep track of the substrings you've already seen.
  • Think about data structures that can help with frequency counting or set membership checks.
  • The problem involves overlapping substrings, so simply splitting the string into non-overlapping chunks won't work. You'll need to consider sliding windows or similar approaches.
Loading editor...
plaintext