Implementing a Simplified Raft Consensus Algorithm in Go
Raft is a consensus algorithm used to achieve fault tolerance and data consistency in distributed systems. This challenge asks you to implement a simplified version of the Raft protocol in Go, focusing on the core concepts of leader election, log replication, and state machine consistency. Successfully completing this challenge demonstrates a strong understanding of distributed systems principles.
Problem Description
You are tasked with implementing a basic Raft node in Go. The node should handle leader election, log replication, and state machine updates. The implementation should support a single, shared state machine (represented as a simple integer counter). The goal is to create a system where, after a leader is elected, all nodes agree on the value of the counter.
Key Requirements:
- Node States: Implement the three Raft states: `Leader`, `Follower`, and `Candidate`.
- Leader Election: Implement a basic leader election mechanism using randomized timers and voting. Nodes should transition to the `Candidate` state when they don't receive heartbeats from the current leader within a timeout period. Candidates should increment their term number, request votes from other nodes, and become the leader if they receive a majority of votes.
- Log Replication: The leader should be able to append new entries to its log and replicate them to followers. Replication should be done via `AppendEntries` RPCs.
- State Machine Consistency: The leader should apply committed log entries to its state machine (the integer counter). Followers should apply committed entries received from the leader to their own state machines.
- Heartbeats: The leader should periodically send heartbeats (empty `AppendEntries` RPCs) to followers to maintain its leadership.
- Term Management: Nodes should maintain a `currentTerm` and increment it during leader elections. `AppendEntries` RPCs should include the leader's `currentTerm`, and followers should reject RPCs with an older term.
- Vote Management: Nodes should maintain a `votedFor` field to track which candidate they have voted for in the current term.
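The requirements above can be captured in a handful of types. Here is a minimal sketch; names like `Node` and `LogEntry` are illustrative choices, not mandated by the challenge:

```go
package main

import "fmt"

// NodeState models the three Raft roles.
type NodeState int

const (
	Follower NodeState = iota
	Candidate
	Leader
)

func (s NodeState) String() string {
	return [...]string{"Follower", "Candidate", "Leader"}[s]
}

// LogEntry pairs a command (here, the counter value to set) with
// the term in which the leader received it.
type LogEntry struct {
	Term    int
	Command int
}

// Node holds the state a simplified Raft node needs.
type Node struct {
	id          int
	state       NodeState
	currentTerm int
	votedFor    int // -1 when no vote has been cast this term
	log         []LogEntry
	commitIndex int
	counter     int // the replicated state machine
}

// NewNode returns a node in the initial follower state.
func NewNode(id int) *Node {
	return &Node{id: id, state: Follower, votedFor: -1}
}

func main() {
	n := NewNode(1)
	fmt.Println(n.state) // Follower
}
```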
Expected Behavior:
- Initially, all nodes start as followers.
- If a node doesn't receive heartbeats from the leader within a timeout, it becomes a candidate.
- Candidates request votes from other nodes.
- The candidate with a majority of votes becomes the leader.
- The leader replicates log entries to followers.
- Followers apply committed log entries to their state machine.
- If the leader fails, a new election is triggered.
- All nodes should eventually converge on the same value for the state machine (the integer counter).
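The voting step in the flow above can be sketched as a `RequestVote` handler that grants at most one vote per term and rejects stale terms. Struct and field names here are assumptions, and a complete implementation would also compare log freshness before granting a vote:

```go
package main

import "fmt"

// RequestVoteArgs and RequestVoteReply carry the fields the
// election needs; the exact names are illustrative.
type RequestVoteArgs struct {
	Term        int
	CandidateID int
}

type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}

type Node struct {
	currentTerm int
	votedFor    int // -1 means no vote cast this term
}

// HandleRequestVote rejects candidates with a stale term and grants
// at most one vote per term. (Log-freshness comparison is omitted
// in this simplified sketch.)
func (n *Node) HandleRequestVote(args RequestVoteArgs) RequestVoteReply {
	if args.Term < n.currentTerm {
		return RequestVoteReply{Term: n.currentTerm, VoteGranted: false}
	}
	if args.Term > n.currentTerm {
		// Newer term: adopt it and clear any earlier vote.
		n.currentTerm = args.Term
		n.votedFor = -1
	}
	if n.votedFor == -1 || n.votedFor == args.CandidateID {
		n.votedFor = args.CandidateID
		return RequestVoteReply{Term: n.currentTerm, VoteGranted: true}
	}
	return RequestVoteReply{Term: n.currentTerm, VoteGranted: false}
}

func main() {
	n := &Node{votedFor: -1}
	fmt.Println(n.HandleRequestVote(RequestVoteArgs{Term: 1, CandidateID: 3}).VoteGranted) // true
	fmt.Println(n.HandleRequestVote(RequestVoteArgs{Term: 1, CandidateID: 5}).VoteGranted) // false: vote already cast
}
```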
Edge Cases to Consider:
- Network Partitions: Consider how the system behaves if nodes become isolated from each other.
- Split Brain: Consider how to prevent multiple leaders from being elected during a network partition. (This simplified version doesn't require full split-brain resolution, but awareness is important.)
- Duplicate Log Entries: Followers should handle duplicate log entries gracefully; re-delivering the same `AppendEntries` RPC must not corrupt the log.
- Out-of-Order Log Entries: Followers should reject `AppendEntries` RPCs whose entries don't follow directly from their existing log.
- Election Timeout Variations: Randomized election timeouts are crucial for preventing repeated split votes during elections.
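Duplicates and ordering are both checked on the receiving side of `AppendEntries`, via a consistency check on the entry preceding the new ones. A sketch under assumed field names (`PrevLogIndex`, `PrevLogTerm`, with -1 denoting an empty log):

```go
package main

import "fmt"

type LogEntry struct {
	Term    int
	Command int
}

type AppendEntriesArgs struct {
	Term         int
	PrevLogIndex int // index of the entry preceding Entries; -1 for an empty log
	PrevLogTerm  int
	Entries      []LogEntry
}

type Node struct {
	currentTerm int
	log         []LogEntry
}

// HandleAppendEntries rejects stale terms and out-of-order entries,
// and overwrites any conflicting suffix, so re-delivering the same
// RPC is idempotent.
func (n *Node) HandleAppendEntries(args AppendEntriesArgs) bool {
	if args.Term < n.currentTerm {
		return false // stale leader
	}
	n.currentTerm = args.Term
	// Consistency check: our log must already contain the entry at
	// PrevLogIndex with a matching term, or the RPC is out of order.
	if args.PrevLogIndex >= len(n.log) {
		return false
	}
	if args.PrevLogIndex >= 0 && n.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		return false
	}
	// Truncate past the match point and append; duplicates therefore
	// leave the log unchanged.
	n.log = append(n.log[:args.PrevLogIndex+1], args.Entries...)
	return true
}

func main() {
	n := &Node{}
	ok := n.HandleAppendEntries(AppendEntriesArgs{Term: 1, PrevLogIndex: -1,
		Entries: []LogEntry{{Term: 1, Command: 10}}})
	fmt.Println(ok, len(n.log)) // true 1
	// An RPC that skips ahead of the log is rejected.
	fmt.Println(n.HandleAppendEntries(AppendEntriesArgs{Term: 1, PrevLogIndex: 5})) // false
}
```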
Examples
Example 1:
Input: 3 nodes (N1, N2, N3) initialized as followers. N1 becomes the leader. N1 appends the value 10 to its log.
Output: All nodes' state machines (integer counters) should eventually be 10.
Explanation: N1 replicates the log entry to N2 and N3. Upon successful replication, N1 commits the entry and applies it to its state machine. N2 and N3 then apply the committed entry to their state machines.
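The commit decision in this explanation — the leader commits once a majority holds the entry — can be sketched as a small helper over the leader's per-node `matchIndex` values. The helper name and the -1 "nothing replicated" sentinel are assumptions:

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndexFor returns the highest log index replicated on a
// majority of nodes, given each node's matchIndex (the leader's own
// entry included). This mirrors how N1 decides the entry for value 10
// is committed once one of its two followers acknowledges it.
func commitIndexFor(matchIndex []int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	// After sorting descending, the value at position len/2 is held
	// by a majority of nodes.
	return sorted[len(sorted)/2]
}

func main() {
	// Leader has entry 0; one of two followers has acknowledged it.
	fmt.Println(commitIndexFor([]int{0, 0, -1})) // 0: committed (2 of 3)
	// No follower has acknowledged yet.
	fmt.Println(commitIndexFor([]int{0, -1, -1})) // -1: not yet committed
}
```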
Example 2:
Input: 3 nodes (N1, N2, N3) initialized as followers. N1 is the leader. N1 fails. After a timeout, N2 and N3 become candidates. N3 wins the election. N3 appends the value 20 to its log.
Output: All nodes' state machines should eventually be 20.
Explanation: N2 and N3 request votes. N3 receives a majority and becomes the leader. N3 replicates the log entry to N2. Both N3 and N2 apply the committed entry to their state machines.
Example 3:
Input: 5 nodes (N1-N5) initialized as followers. N1 is the leader. N2 experiences a network partition and is disconnected for a period longer than the election timeout. N2 rejoins the network after N4 becomes the leader.
Output: N2 should eventually catch up and have its state machine equal to N4's state machine.
Explanation: N2, after the timeout, becomes a candidate. It loses the election to N4. When N2 rejoins, it receives `AppendEntries` RPCs from N4 and applies the committed entries to its state machine, bringing it into sync.
Constraints
- Number of Nodes: The system should support 3-5 nodes.
- State Machine: The state machine is a simple integer counter.
- RPCs: Implement basic
RequestVoteandAppendEntriesRPCs. You don't need to implement a full RPC framework; simple function calls simulating RPCs are acceptable. - Election Timeout: The election timeout should be randomized within a range of, for example, 150-300 milliseconds.
- Log Size: The log size should be limited to a maximum of 10 entries.
- Performance: While not a primary focus, the implementation should be reasonably efficient. Avoid unnecessary overhead.
Notes
- This is a simplified implementation of Raft. Many complexities (e.g., persistent storage, advanced log compaction, handling of concurrent operations) are omitted for clarity.
- Focus on the core concepts of leader election, log replication, and state machine consistency.
- Use Go's concurrency features (goroutines and channels) to handle asynchronous operations.
- Consider using a simple data structure (e.g., a slice of structs) to represent the log.
- Start with a small number of nodes (3) and gradually increase the complexity.
- Thoroughly test your implementation with various scenarios, including leader failures and network partitions.
- Error handling is important, but for this challenge, you can focus on the core logic and keep error handling relatively simple. Log errors instead of panicking.