Hone logo
Hone
Problems

Implement Log-Structured Merge-Tree (LSM Tree) Compaction in Go

In modern databases and storage systems, particularly those dealing with high write throughput, Log-Structured Merge-Trees (LSM Trees) are a common data structure. A key operation in LSM Trees is compaction, which involves merging sorted data files (SSTables) to maintain performance and reduce storage overhead. This challenge will focus on implementing a simplified version of the compaction process.

Problem Description

Your task is to implement a compaction function in Go that merges multiple sorted input data files into a single sorted output data file. Each input file represents a segment of an LSM Tree, containing key-value pairs sorted by key. The compaction process should combine these segments, eliminate duplicate keys (keeping the latest value for a given key), and produce a new, larger sorted segment.

Key Requirements:

  1. Merge Sorted Inputs: Read data from multiple input files, each containing key-value pairs sorted by key.
  2. Handle Duplicates: If a key appears in multiple input files, the value from the file that was "written" most recently should be preserved. For simplicity, assume files are ordered chronologically by their index in the input slice (i.e., inputs[0] is older than inputs[1], and so on).
  3. Produce Sorted Output: The output file must contain all unique keys from the input files, sorted by key.
  4. Key-Value Format: Assume each line in the input and output files represents a key-value pair, separated by a colon (:). For example: key:value.
  5. File Handling: The function should accept a slice of file paths for the input files and a single file path for the output file.

Expected Behavior:

The function will read all key-value pairs from the provided input files, merge them while respecting the recency of values for duplicate keys, and write the consolidated, sorted data to the specified output file.

Edge Cases:

  • Empty input files.
  • Input files with no data.
  • A single input file.
  • No input files provided.
  • Duplicate keys across multiple input files.

Examples

Example 1:

Input Files: input1.txt:

a:1
b:2
d:4

input2.txt:

b:20
c:3
e:5

Output File (output.txt):

a:1
b:20
c:3
d:4
e:5

Explanation: a:1 is unique. b appears in both; input2.txt is newer, so b:20 is kept. c:3 and e:5 are unique. d:4 is unique.

Example 2:

Input Files: inputA.txt:

apple:red
banana:yellow

inputB.txt:

apple:green
orange:orange

inputC.txt:

banana:green
grape:purple

Output File (output_final.txt):

apple:green
banana:green
grape:purple
orange:orange

Explanation: apple from inputB.txt overwrites apple from inputA.txt. banana from inputC.txt overwrites banana from inputA.txt. orange and grape are unique.

Example 3: (Edge Case - Empty File)

Input Files: empty.txt: (empty file) data.txt:

x:10
y:20

Output File (output_edge.txt):

x:10
y:20

Explanation: The empty file has no effect on the final output.

Constraints

  • The number of input files will be between 0 and 100.
  • Each input file will contain between 0 and 1000 lines.
  • Keys and values will be strings consisting of alphanumeric characters and hyphens.
  • Keys will not be empty. Values can be empty.
  • The total number of key-value pairs across all input files will not exceed 100,000.
  • The function should complete within a reasonable time, assuming typical file I/O speeds. (No strict time limit is imposed, but efficiency is encouraged).

Notes

  • You will need to implement the logic for reading from files, parsing key-value pairs, merging them, and writing to the output file.
  • Consider how to efficiently store and manage the key-value pairs as you process them. A map is a good starting point, but think about how to handle the sorted output requirement.
  • The "recency" of a value is determined by the order of the input files in the provided slice. inputs[0] is older than inputs[1], inputs[1] is older than inputs[2], and so on. The last file in the slice is the most recent.
  • You will need to create temporary files for testing purposes. The os and io/ioutil (or os in newer Go versions) packages will be helpful.
Loading editor...
go