Python Data Anonymization Challenge

Protecting sensitive personal information is essential whenever data is analyzed or shared. This challenge asks you to build a Python function that anonymizes a record by replacing identifying values with pseudonyms. The technique supports privacy compliance because the data can still be analyzed and shared without exposing individual identities.

Problem Description

Your task is to create a Python function anonymize_data(data) that takes a dictionary representing a record of personal information and returns a new dictionary with sensitive fields anonymized.

Key Requirements:

  1. Anonymization Strategy: For each sensitive field, you need to replace the original value with a unique, consistent pseudonym. If the same sensitive value appears multiple times within the same record, it should be replaced by the same pseudonym.
  2. Sensitive Fields: The function should specifically anonymize the following fields:
    • 'name'
    • 'email'
    • 'phone'
    • 'address'
  3. Pseudonym Generation: Pseudonyms should follow a predictable format, 'ANONYMIZED_<FIELD>_<N>'. For example, the first distinct 'name' value becomes 'ANONYMIZED_NAME_1' and the second distinct 'email' value becomes 'ANONYMIZED_EMAIL_2'. Numbering is kept separately for each sensitive field type and starts at 1.
  4. Handling Non-Sensitive Fields: Any fields in the input dictionary that are not listed as sensitive should be copied directly to the output dictionary without modification.
  5. Data Integrity: The function must not modify the original input dictionary. It should return a new, anonymized dictionary.
  6. Case Sensitivity: Anonymization should be case-sensitive. For example, "John Doe" and "john doe" should be treated as distinct values if they appear in the 'name' field.

Expected Behavior:

The anonymize_data function will receive a dictionary. It will iterate through the dictionary, identify sensitive fields, generate unique pseudonyms for each distinct sensitive value encountered for a given field type, and construct a new dictionary with anonymized values.
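The bookkeeping this paragraph describes can be sketched as a small helper that remembers which pseudonym each distinct value received, one table per field type. This is only an illustration of the idea, not a required design; the names `make_pseudonymizer` and `pseudonym_for` are hypothetical and do not appear in the challenge.

```python
from collections import defaultdict


def make_pseudonymizer():
    """Return a function that maps (field_type, value) to a stable pseudonym."""
    # One value -> pseudonym table per field type, e.g. assigned["email"].
    assigned = defaultdict(dict)

    def pseudonym_for(field_type, value):
        table = assigned[field_type]
        if value not in table:
            # The next number is the count of distinct values seen so far, plus 1.
            table[value] = f"ANONYMIZED_{field_type.upper()}_{len(table) + 1}"
        return table[value]

    return pseudonym_for


pseudonym_for = make_pseudonymizer()
print(pseudonym_for("email", "a@example.com"))  # ANONYMIZED_EMAIL_1
print(pseudonym_for("email", "b@example.com"))  # ANONYMIZED_EMAIL_2
print(pseudonym_for("email", "a@example.com"))  # ANONYMIZED_EMAIL_1 (repeated value)
print(pseudonym_for("name", "John Doe"))        # ANONYMIZED_NAME_1
print(pseudonym_for("name", "john doe"))        # ANONYMIZED_NAME_2 (case-sensitive)
```

Creating a fresh pseudonymizer for each record keeps the numbering local to that record, and because each field type has its own table, the order in which different field types are processed does not affect the numbers.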

Edge Cases to Consider:

  • Empty Dictionary: What happens if an empty dictionary is provided as input?
  • Missing Sensitive Fields: What if a record is missing one or more of the specified sensitive fields?
  • Non-String Values: Sensitive data is typically string-based, and for this challenge you may assume sensitive fields contain strings. Still, consider how your pseudonym generation would behave if a non-string value appeared in a sensitive field.
  • Duplicate Sensitive Values: Ensure that if the same sensitive value appears multiple times within a single record (e.g., the same email listed twice, though unlikely in a single record structure), it gets the same pseudonym.
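Several of these edge cases can be turned into quick checks. The sketch below is a hypothetical helper, `failed_edge_case_checks`, that accepts any candidate implementation of anonymize_data as an argument and reports which checks it fails. The expected outputs are assumptions based on the requirements above (an empty input yields a new empty dictionary, and numbering starts at 1).

```python
import copy


def failed_edge_case_checks(anonymize):
    """Run edge-case checks against a candidate implementation.

    Returns the names of the checks that failed (an empty list means all passed).
    """
    failures = []

    # Empty dictionary: expect an empty dictionary back, as a new object.
    empty = {}
    result = anonymize(empty)
    if result != {} or result is empty:
        failures.append("empty_dictionary")

    # Missing sensitive fields: only 'name' is present in this record.
    partial = {"id": 7, "name": "Alice Smith"}
    if anonymize(partial) != {"id": 7, "name": "ANONYMIZED_NAME_1"}:
        failures.append("missing_sensitive_fields")

    # Data integrity: the input dictionary must be left unchanged.
    record = {"name": "Bob Johnson", "age": 41}
    snapshot = copy.deepcopy(record)
    anonymize(record)
    if record != snapshot:
        failures.append("input_mutated")

    return failures
```

Calling `failed_edge_case_checks(anonymize_data)` with your own function should return an empty list if these cases are handled.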

Examples

Example 1:

input_data = {
    "id": 101,
    "name": "Alice Smith",
    "email": "alice.smith@example.com",
    "age": 30,
    "city": "New York"
}

output_data = {
    "id": 101,
    "name": "ANONYMIZED_NAME_1",
    "email": "ANONYMIZED_EMAIL_1",
    "age": 30,
    "city": "New York"
}

Explanation: The name "Alice Smith" is replaced with "ANONYMIZED_NAME_1" and the email "alice.smith@example.com" is replaced with "ANONYMIZED_EMAIL_1". The id, age, and city fields are not sensitive and are copied as is.

Example 2:

input_data = {
    "record_id": "A123",
    "name": "Bob Johnson",
    "address": "123 Main St, Anytown, USA",
    "phone": "555-123-4567",
    "occupation": "Engineer"
}

output_data = {
    "record_id": "A123",
    "name": "ANONYMIZED_NAME_1",
    "address": "ANONYMIZED_ADDRESS_1",
    "phone": "ANONYMIZED_PHONE_1",
    "occupation": "Engineer"
}

Explanation: "Bob Johnson" becomes "ANONYMIZED_NAME_1", "123 Main St, Anytown, USA" becomes "ANONYMIZED_ADDRESS_1", and "555-123-4567" becomes "ANONYMIZED_PHONE_1". Non-sensitive fields remain unchanged.

Example 3:

input_data = {
    "user_id": "U789",
    "name": "Alice Smith",
    "email": "alice.smith@example.com",
    "secondary_email": "alice.s@company.net",
    "address": "123 Main St, Anytown, USA",
    "phone": "555-123-4567",
    "notes": "Customer has been with us for 5 years."
}

output_data = {
    "user_id": "U789",
    "name": "ANONYMIZED_NAME_1",
    "email": "ANONYMIZED_EMAIL_1",
    "secondary_email": "ANONYMIZED_EMAIL_2", # Distinct email, gets a new number
    "address": "ANONYMIZED_ADDRESS_1",
    "phone": "ANONYMIZED_PHONE_1",
    "notes": "Customer has been with us for 5 years."
}

Explanation: "Alice Smith" is anonymized. "alice.smith@example.com" and "alice.s@company.net" are distinct values and thus get separate pseudonyms. Note that secondary_email is treated here as an email-type field even though its key is not literally 'email'. The address and phone number are also anonymized. The user_id and notes fields are untouched.
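One possible way to reproduce all three examples is sketched below. It rests on an assumption the challenge does not state explicitly: a key is treated as sensitive if it equals one of the four field names or ends with an underscore followed by one of them, which is how secondary_email would come to be numbered as an email. The helper `field_type` and the constant `SENSITIVE_TYPES` are hypothetical names introduced for illustration.

```python
SENSITIVE_TYPES = ("name", "email", "phone", "address")


def field_type(key):
    """Return the sensitive type for a key, or None if the key is not sensitive.

    Assumption (inferred from Example 3): a key is sensitive if it equals a
    sensitive type or ends with "_<type>", e.g. "secondary_email".
    """
    for sensitive in SENSITIVE_TYPES:
        if key == sensitive or key.endswith("_" + sensitive):
            return sensitive
    return None


def anonymize_data(data):
    # Per-type tables mapping each distinct value to its pseudonym.
    assigned = {t: {} for t in SENSITIVE_TYPES}
    result = {}  # a new dictionary; the input is never modified
    for key, value in data.items():
        kind = field_type(key) if isinstance(key, str) else None
        if kind is None:
            result[key] = value  # non-sensitive: copied unchanged
            continue
        table = assigned[kind]
        if value not in table:
            table[value] = f"ANONYMIZED_{kind.upper()}_{len(table) + 1}"
        result[key] = table[value]
    return result


record = {
    "name": "Alice Smith",
    "email": "alice.smith@example.com",
    "secondary_email": "alice.s@company.net",
}
anonymized = anonymize_data(record)
print(anonymized["email"])            # ANONYMIZED_EMAIL_1
print(anonymized["secondary_email"])  # ANONYMIZED_EMAIL_2
```

Numbers are assigned in the order keys appear in the input dictionary, so 'email' receives 1 here because it precedes 'secondary_email'. Non-sensitive values are copied by reference, which is sufficient as long as the caller does not expect a deep copy.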

Constraints

  • The input data will always be a dictionary.
  • The values in the sensitive fields ('name', 'email', 'phone', 'address') will always be strings.
  • The input dictionary can contain any number of key-value pairs.
  • The function should execute within reasonable time limits for typical dictionary sizes (e.g., up to a few thousand key-value pairs).

Notes

  • You'll need a mechanism to keep track of the pseudonyms generated for each sensitive field type to ensure consistency within a single record. A dictionary mapping field names to their respective counters or lists of already used pseudonyms might be helpful.
  • Consider how to handle the numbering of pseudonyms. Should it reset for each function call, or maintain a global state? For this challenge, assume numbering is local to a single function call: counters start at 1 for each record, and no state needs to be shared between records processed in the same script run. The examples use a simple incrementing counter per field type within a single record's processing.
  • Think about the order of processing. Does it matter if you process 'name' before 'email'?
  • For a more robust solution, you might consider using hashing for pseudonym generation, but for this challenge, a simple incrementing counter is sufficient.
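The hashing alternative mentioned in the last note could look roughly like the sketch below. It is not part of the challenge's expected output format: the function name `hashed_pseudonym`, the truncation length, and the use of a keyed hash (HMAC-SHA256 from the standard library) are all assumptions made for illustration.

```python
import hashlib
import hmac


def hashed_pseudonym(field_type, value, secret_key, length=12):
    """Derive a pseudonym from a keyed hash of the value.

    The same (field_type, value, secret_key) always yields the same pseudonym,
    so no counter or lookup table has to be maintained.
    """
    message = f"{field_type}:{value}".encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return f"ANONYMIZED_{field_type.upper()}_{digest[:length]}"


key = b"replace-with-a-real-secret"  # hypothetical key, for illustration only
first = hashed_pseudonym("email", "alice.smith@example.com", key)
again = hashed_pseudonym("email", "alice.smith@example.com", key)
print(first == again)  # True: the same value always maps to the same pseudonym
```

A keyed hash is used here rather than a plain hash because low-entropy values such as phone numbers could otherwise be recovered by hashing guesses. The trade-off is that the key must be kept secret, and truncating the digest makes collisions possible, though unlikely at this length.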