Hone logo
Hone
Problems

Designing a Hierarchical Product Catalog with Temporal Data

This challenge focuses on designing and querying a database schema to represent a hierarchical product catalog that also tracks historical changes to product attributes over time. This is crucial for businesses that need to manage complex product relationships and understand how product information has evolved for reporting, auditing, and business intelligence.

Problem Description

You are tasked with designing an SQL database schema to model a product catalog with two key advanced features:

  1. Hierarchical Categories: Products can belong to multiple categories, and categories can be nested within other categories, forming a tree-like structure.
  2. Temporal Product Attributes: Product attributes (like price, description, or status) can change over time. You need to store a history of these changes, allowing you to query product information as it was at any given point in the past.

Key Requirements:

  • Categories:
    • Categories can have a parent category, allowing for nesting.
    • A category can have multiple child categories.
    • A category can be the parent of multiple products.
  • Products:
    • Each product should have a unique identifier.
    • A product can be associated with one or more categories.
  • Temporal Attributes:
    • For a given product, certain attributes (e.g., price, description, is_active) should have their history tracked.
    • Each historical record of an attribute should have a valid_from timestamp and, optionally, a valid_to timestamp.
    • When querying for a product's attribute at a specific point in time, the system should return the most recent valid record for that time.

Expected Behavior:

  • The schema should efficiently support queries for products within specific categories, including subcategories.
  • The schema should efficiently retrieve the state of product attributes at any given historical timestamp.
  • The schema should handle situations where an attribute is updated multiple times.

Edge Cases to Consider:

  • Categories without parents (top-level categories).
  • Products that are not yet active (is_active = false).
  • Products that have had no attribute changes since creation.
  • Querying for data before the first record or after the last record for temporal attributes.
  • Categories that are deleted or become orphaned (though for this challenge, assume categories are not deleted, only potentially re-parented or have their temporal validity managed).

Examples

Example 1: Basic Product and Category Structure

Input Data (Conceptual):

  • Categories:
    • category_id: 1, name: "Electronics", parent_id: NULL
    • category_id: 2, name: "Computers", parent_id: 1
    • category_id: 3, name: "Laptops", parent_id: 2
    • category_id: 4, name: "Desktops", parent_id: 2
    • category_id: 5, name: "Accessories", parent_id: 1
    • category_id: 6, name: "Mice", parent_id: 5
  • Products:
    • product_id: 101, name: "UltraBook Pro", description: "Lightweight laptop"
    • product_id: 102, name: "Gaming PC", description: "High-performance desktop"
    • product_id: 103, name: "Wireless Mouse", description: "Ergonomic mouse"
  • Product Category Associations:
    • product_id: 101, category_id: 3
    • product_id: 101, category_id: 5 (can be in multiple categories)
    • product_id: 102, category_id: 4
    • product_id: 103, category_id: 6

SQL Schema Design (Pseudocode):

-- Table for Categories
CREATE TABLE Categories (
    category_id INT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    parent_id INT NULL, -- For hierarchical structure
    FOREIGN KEY (parent_id) REFERENCES Categories(category_id)
);

-- Table for Products
CREATE TABLE Products (
    product_id INT PRIMARY KEY,
    name VARCHAR(255) NOT NULL
);

-- Junction table for many-to-many relationship between Products and Categories
CREATE TABLE ProductCategories (
    product_id INT,
    category_id INT,
    PRIMARY KEY (product_id, category_id),
    FOREIGN KEY (product_id) REFERENCES Products(product_id),
    FOREIGN KEY (category_id) REFERENCES Categories(category_id)
);

-- Table for Temporal Product Attributes
CREATE TABLE ProductAttributeHistory (
    attribute_history_id BIGINT PRIMARY KEY AUTO_INCREMENT, -- Unique ID for each history record
    product_id INT NOT NULL,
    attribute_name VARCHAR(50) NOT NULL, -- e.g., 'price', 'description', 'is_active'
    attribute_value VARCHAR(255) NOT NULL, -- Store value as string, cast as needed
    valid_from TIMESTAMP NOT NULL,
    valid_to TIMESTAMP NULL, -- NULL indicates the current valid record
    FOREIGN KEY (product_id) REFERENCES Products(product_id)
);

Output (Conceptual Query Result): Find all products in the "Electronics" category and its subcategories, as of a specific timestamp.

  • Query: Retrieve product_id, name for products associated with "Electronics" or any of its descendants.
  • Result:
    • product_id: 101, name: "UltraBook Pro"
    • product_id: 102, name: "Gaming PC"
    • product_id: 103, name: "Wireless Mouse"

Explanation: The query would use a recursive CTE (Common Table Expression) or similar mechanism to traverse the Categories hierarchy starting from "Electronics" (category_id 1). It would then join with ProductCategories and Products to fetch the relevant product information.


Example 2: Temporal Attribute Query

Input Data (Continuing from Example 1, with added history):

  • Products:
    • product_id: 101, name: "UltraBook Pro"
  • ProductAttributeHistory:
    • attribute_history_id: 1, product_id: 101, attribute_name: "price", attribute_value: "1200.00", valid_from: "2023-01-01 00:00:00", valid_to: "2023-06-30 23:59:59"
    • attribute_history_id: 2, product_id: 101, attribute_name: "price", attribute_value: "1150.00", valid_from: "2023-07-01 00:00:00", valid_to: NULL
    • attribute_history_id: 3, product_id: 101, attribute_name: "description", attribute_value: "Sleek and powerful laptop", valid_from: "2023-01-01 00:00:00", valid_to: "2023-09-15 12:00:00"
    • attribute_history_id: 4, product_id: 101, attribute_name: "description", attribute_value: "Ultra-lightweight laptop with premium features", valid_from: "2023-09-15 12:00:01", valid_to: NULL
    • attribute_history_id: 5, product_id: 101, attribute_name: "is_active", attribute_value: "TRUE", valid_from: "2023-01-01 00:00:00", valid_to: NULL

Query: Get the price and description of "UltraBook Pro" (product_id 101) as of "2023-08-15 10:00:00".

Output (Conceptual Result):

product_id | price  | description
-----------|--------|----------------------------
101        | 1150.00| Sleek and powerful laptop

Explanation:

  • For price:
    • Record 1: valid_from < "2023-08-15", valid_to < "2023-08-15". Not the latest valid.
    • Record 2: valid_from < "2023-08-15", valid_to is NULL (meaning it's current). This is the selected record.
  • For description:
    • Record 3: valid_from < "2023-08-15", valid_to < "2023-08-15". Not the latest valid.
    • Record 4: valid_from < "2023-08-15", valid_to is NULL. This is the selected record.

The query needs to find the record for each attribute_name where valid_from is less than or equal to the target timestamp, and (valid_to is NULL OR valid_to is greater than or equal to the target timestamp), and then select the one with the latest valid_from among those that satisfy the condition.


Example 3: Edge Case - Querying Past a valid_to Date

Input Data (Same as Example 2):

Query: Get the price and description of "UltraBook Pro" (product_id 101) as of "2023-07-01 15:00:00".

Output (Conceptual Result):

product_id | price  | description
-----------|--------|----------------------------
101        | 1150.00| Sleek and powerful laptop

Explanation:

  • For price:
    • Record 1: valid_from < "2023-07-01", valid_to < "2023-07-01". This record ended its validity before the query timestamp.
    • Record 2: valid_from < "2023-07-01", valid_to is NULL. This is the current valid record.
  • For description:
    • Record 3: valid_from < "2023-07-01", valid_to > "2023-07-01". This record is valid at the query timestamp.
    • Record 4: valid_from > "2023-07-01", valid_to is NULL. This record starts its validity after the query timestamp.

The query correctly identifies Record 2 for price and Record 3 for description as being valid at "2023-07-01 15:00:00".

Constraints

  • The Categories table will have a maximum of 10,000 categories.
  • The Products table will have a maximum of 1,000,000 products.
  • The ProductAttributeHistory table can grow very large, potentially containing billions of records.
  • All timestamps will be in UTC.
  • Attribute values for price will be decimal numbers, is_active will be boolean, and description will be text. Your schema should accommodate these types or a common type for storage with casting.
  • Queries for temporal data should aim to execute within acceptable performance limits for analytical workloads (e.g., sub-second to a few seconds for typical historical queries).

Notes

  • Consider using a database system that excels at handling time-series data or large datasets.
  • For hierarchical queries, research common SQL patterns like recursive CTEs or adjacency lists.
  • For temporal queries, think about how to efficiently find the correct record for a given point in time. Indexing will be critical here.
  • The valid_to field being NULL signifies that the record is the current, active version of the attribute.
  • When designing the ProductAttributeHistory table, consider the trade-offs of storing different data types in a single attribute_value column versus having separate columns for different attribute types (e.g., price_value, description_value). For this challenge, a single attribute_value column with appropriate casting is acceptable.
  • Success will be measured by a well-designed, normalized (or appropriately denormalized for performance) schema that can accurately answer the specified queries efficiently.
Loading editor...
plaintext