Rustic File Operations: Word Counter
This challenge will guide you through implementing fundamental file input and output operations in Rust. You'll build a program that reads the content of a text file, counts the occurrences of each word, and then writes the results to another file. This is a common task in data processing and text analysis.
Problem Description
Your task is to create a Rust program that performs the following actions:
- Read from a specified input file: The program should accept a file path as input and read its entire content into memory.
- Count word occurrences: After reading the file, you need to process the text to count how many times each unique word appears.
- Words should be case-insensitive. For example, "The" and "the" should be treated as the same word.
- Punctuation attached to words (e.g., "word.", "word,", "word!") should be removed so that only the alphanumeric characters of the word are considered.
- Write to a specified output file: The program should then write the word counts to a new file. The output file should list each unique word along with its count, one word-count pair per line, separated by a colon and a space. The output should be sorted alphabetically by word.
Key Requirements:
- Use Rust's standard library for file operations (std::fs and std::io).
- Handle potential errors during file reading and writing gracefully.
- Implement case-insensitive word comparison.
- Strip common punctuation from the ends of words.
- Sort the output alphabetically by word.
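The case-insensitivity and punctuation requirements can be sketched as one small helper. This is a minimal sketch, not a reference solution: the name normalize_word is hypothetical, and it assumes that every non-alphanumeric character is dropped (so "don't" becomes "dont"), which is one reading of the requirements.

```rust
/// Hypothetical helper: lowercases a whitespace-delimited token and keeps
/// only its alphanumeric characters. Returns `None` when nothing is left,
/// for example when the token is pure punctuation such as "--".
fn normalize_word(token: &str) -> Option<String> {
    let cleaned: String = token
        .chars()
        .filter(|c| c.is_alphanumeric())
        .flat_map(|c| c.to_lowercase())
        .collect();
    if cleaned.is_empty() {
        None
    } else {
        Some(cleaned)
    }
}

fn main() {
    for token in ["The", "the", "word.", "word,", "word!", "--"] {
        println!("{:?} -> {:?}", token, normalize_word(token));
    }
}
```

If you would rather strip punctuation only from the ends of a word and keep interior characters, str::trim_matches with a closure such as |c: char| !c.is_alphanumeric() is an alternative worth considering.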
Expected Behavior:
Given an input file input.txt with the following content:
This is a sample text.
This text is for testing.
Sample text, sample file.
The program should produce an output file output.txt with the following content:
a: 1
file: 1
for: 1
is: 2
sample: 3
testing: 1
text: 3
this: 2
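To check the expected output above without touching the file system, the counting and formatting steps can be tried in memory. The sketch below is only one possible approach: the function names count_words and format_counts are hypothetical, and it uses a BTreeMap instead of a HashMap because a BTreeMap iterates its keys in sorted order, so no separate sort step is needed.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: counts lowercased, alphanumeric-only words.
/// A BTreeMap keeps its keys sorted, so iteration is already alphabetical.
fn count_words(text: &str) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for token in text.split_whitespace() {
        let word: String = token
            .chars()
            .filter(|c| c.is_alphanumeric())
            .flat_map(|c| c.to_lowercase())
            .collect();
        // Tokens made only of punctuation produce an empty word; skip them.
        if !word.is_empty() {
            *counts.entry(word).or_insert(0) += 1;
        }
    }
    counts
}

/// Hypothetical helper: renders one "word: count" pair per line.
fn format_counts(counts: &BTreeMap<String, usize>) -> String {
    counts
        .iter()
        .map(|(word, count)| format!("{}: {}\n", word, count))
        .collect()
}

fn main() {
    let text = "This is a sample text.\nThis text is for testing.\nSample text, sample file.";
    print!("{}", format_counts(&count_words(text)));
}
```

An empty input string produces an empty map and therefore an empty output string, which matches the empty-file edge case.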
Edge Cases to Consider:
- Empty input file: If the input file is empty, the output file should also be empty.
- File not found: The program should handle cases where the input file does not exist.
- Special characters/multiple spaces: Consider how your word splitting and cleaning logic handles multiple spaces between words or other non-alphanumeric characters within the text.
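Two of these edge cases can be explored in isolation. The sketch below assumes you want a readable message rather than a panic when the input file is missing; the function name read_input, the message wording, and the file name no_such_file.txt are all hypothetical.

```rust
use std::fs;
use std::io::ErrorKind;

/// Hypothetical helper: reads the whole file and maps I/O errors to a
/// human-readable message, with a dedicated message for a missing file.
fn read_input(path: &str) -> Result<String, String> {
    fs::read_to_string(path).map_err(|err| match err.kind() {
        ErrorKind::NotFound => format!("input file '{}' does not exist", path),
        _ => format!("could not read '{}': {}", path, err),
    })
}

fn main() {
    // Missing file: report the problem instead of panicking.
    match read_input("no_such_file.txt") {
        Ok(text) => println!("read {} bytes", text.len()),
        Err(message) => eprintln!("error: {}", message),
    }

    // split_whitespace treats any run of spaces, tabs, or newlines as a
    // single separator, so it never yields empty tokens.
    let tokens: Vec<&str> = "a   b\t\nc".split_whitespace().collect();
    println!("{:?}", tokens);
}
```

Because split_whitespace never yields empty tokens, the only empty words you need to guard against are those left over after punctuation has been removed from a token.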
Examples
Example 1:
Input File (input.txt):
Rust is fun. Fun with Rust!
Output File (output.txt):
fun: 2
is: 1
rust: 2
with: 1
Explanation: The program reads the text, splits it into words, converts them to lowercase, removes punctuation ("." and "!"), counts occurrences, and writes the sorted results.
Example 2:
Input File (input.txt):
One, two, three.
One! two?
Output File (output.txt):
one: 2
three: 1
two: 2
Explanation: Punctuation like commas, periods, exclamation marks, and question marks are removed before counting.
Example 3:
Input File (input.txt): (empty file)
Output File (output.txt): (empty file)
Explanation: An empty input file results in an empty output file.
Constraints
- The input text file will be a standard UTF-8 encoded text file.
- Words are considered sequences of alphanumeric characters.
- The maximum file size for testing will not exceed 10MB.
- The program should complete within a reasonable time for files of this size.
Notes
- You'll likely want to use a HashMap from std::collections to store and update word counts efficiently.
- Consider using iterators and methods like split_whitespace, to_lowercase, and filter to process the text.
- For removing punctuation, you can iterate through the characters of a word and keep only the alphanumeric ones.
- Remember to handle the Result types returned by I/O operations.
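Putting these notes together, an end-to-end version might look like the sketch below. Treat it as one possible shape, not the expected answer: the function name run and the temporary file names are hypothetical, and the demo in main writes its own sample input so that it can be run as is. A real program would more likely take both paths from std::env::args.

```rust
use std::collections::HashMap;
use std::env;
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};

/// Hypothetical entry point: reads `input`, counts normalized words, and
/// writes sorted "word: count" lines to `output`. Every I/O error is
/// propagated to the caller with `?` instead of panicking.
fn run(input: &str, output: &str) -> io::Result<()> {
    let text = fs::read_to_string(input)?;

    let mut counts: HashMap<String, usize> = HashMap::new();
    for token in text.split_whitespace() {
        let word = token
            .chars()
            .filter(|c| c.is_alphanumeric())
            .collect::<String>()
            .to_lowercase();
        if !word.is_empty() {
            *counts.entry(word).or_insert(0) += 1;
        }
    }

    // HashMap iteration order is unspecified, so sort by word before writing.
    let mut entries: Vec<(String, usize)> = counts.into_iter().collect();
    entries.sort_by(|a, b| a.0.cmp(&b.0));

    let mut writer = BufWriter::new(File::create(output)?);
    for (word, count) in &entries {
        writeln!(writer, "{}: {}", word, count)?;
    }
    writer.flush()
}

fn main() -> io::Result<()> {
    // Demo only: hypothetical file names inside the system temp directory.
    let input = env::temp_dir().join("word_counter_demo_input.txt");
    let output = env::temp_dir().join("word_counter_demo_output.txt");
    fs::write(&input, "Rust is fun. Fun with Rust!")?;

    run(&input.to_string_lossy(), &output.to_string_lossy())?;
    print!("{}", fs::read_to_string(&output)?);
    Ok(())
}
```

Returning io::Result<()> from run keeps the error-handling decision in one place: the caller can print a message, choose an exit code, or retry, while the counting logic stays free of error-reporting code.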