Hone logo
Hone
Problems

Implementing a Hashing-Based Sharding Strategy

Distributed systems often require data to be partitioned across multiple servers for scalability and fault tolerance. A common technique for this is sharding, where data is divided into smaller, more manageable pieces called shards. This challenge asks you to implement a basic hashing-based sharding strategy in Python.

Problem Description

You need to create a Python function that takes a piece of data (represented as a string) and a number of available shards, and determines which shard the data should be assigned to. The assignment should be based on a consistent hashing algorithm. This is a fundamental building block for many distributed databases and caching systems.

Your function should:

  • Accept a data_key (string) and num_shards (integer) as input.
  • Calculate a hash value for the data_key.
  • Use the modulo operator to map the hash value to a specific shard index.
  • Return the index of the shard (an integer between 0 and num_shards - 1, inclusive).

Consider the following:

  • Consistency: The same data_key should always map to the same shard for a given num_shards.
  • Distribution: The hashing algorithm should aim for a reasonably uniform distribution of data across shards.
  • Edge Cases: Handle cases where num_shards is 1.

Examples

Example 1:

Input: data_key="user_12345", num_shards=10
Output: 7
Explanation: The hash of "user_12345" is computed, and then the modulo 10 operation is applied to determine the shard index. For instance, if the hash is 1234567, then 1234567 % 10 = 7.

Example 2:

Input: data_key="product_abc", num_shards=5
Output: 2
Explanation: Similar to Example 1, the hash of "product_abc" is calculated, and then the modulo 5 operation determines the shard. For instance, if the hash is 98765, then 98765 % 5 = 0. (Correction: If hash is 98765, 98765 % 5 = 0. Let's assume hash was 98762 for output 2)
Let's use a more concrete hash for illustration: hash("product_abc") might be 12. 12 % 5 = 2.

Example 3:

Input: data_key="session_xyz", num_shards=1
Output: 0
Explanation: When there is only one shard, all data must be assigned to shard 0, regardless of the hash value.

Constraints

  • 1 <= num_shards <= 1000
  • data_key will be a non-empty string containing alphanumeric characters.
  • The hashing function used should be deterministic.
  • The solution should execute within reasonable time limits for typical inputs.

Notes

You can use Python's built-in hash() function for simplicity. However, be aware that hash() is not guaranteed to be consistent across different Python interpreter invocations or versions. For production systems, you would typically use a more robust and consistent hashing algorithm like MurmurHash or SHA-256. For this challenge, the built-in hash() is sufficient.

Consider how to handle potential negative results from the hash() function before applying the modulo operator to ensure a valid shard index.

Loading editor...
python