Designing a Hierarchical Product Catalog with Temporal Data
This challenge focuses on designing and querying a database schema to represent a hierarchical product catalog that also tracks historical changes to product attributes over time. This is crucial for businesses that need to manage complex product relationships and understand how product information has evolved for reporting, auditing, and business intelligence.
Problem Description
You are tasked with designing an SQL database schema to model a product catalog with two key advanced features:
- Hierarchical Categories: Products can belong to multiple categories, and categories can be nested within other categories, forming a tree-like structure.
- Temporal Product Attributes: Product attributes (like price, description, or status) can change over time. You need to store a history of these changes, allowing you to query product information as it was at any given point in the past.
Key Requirements:
- Categories:
- Categories can have a parent category, allowing for nesting.
- A category can have multiple child categories.
- A category can be the parent of multiple products.
- Products:
- Each product should have a unique identifier.
- A product can be associated with one or more categories.
- Temporal Attributes:
- For a given product, certain attributes (e.g.,
price,description,is_active) should have their history tracked. - Each historical record of an attribute should have a
valid_fromtimestamp and, optionally, avalid_totimestamp. - When querying for a product's attribute at a specific point in time, the system should return the most recent valid record for that time.
- For a given product, certain attributes (e.g.,
Expected Behavior:
- The schema should efficiently support queries for products within specific categories, including subcategories.
- The schema should efficiently retrieve the state of product attributes at any given historical timestamp.
- The schema should handle situations where an attribute is updated multiple times.
Edge Cases to Consider:
- Categories without parents (top-level categories).
- Products that are not yet active (
is_active= false). - Products that have had no attribute changes since creation.
- Querying for data before the first record or after the last record for temporal attributes.
- Categories that are deleted or become orphaned (though for this challenge, assume categories are not deleted, only potentially re-parented or have their temporal validity managed).
Examples
Example 1: Basic Product and Category Structure
Input Data (Conceptual):
- Categories:
category_id: 1,name: "Electronics",parent_id: NULLcategory_id: 2,name: "Computers",parent_id: 1category_id: 3,name: "Laptops",parent_id: 2category_id: 4,name: "Desktops",parent_id: 2category_id: 5,name: "Accessories",parent_id: 1category_id: 6,name: "Mice",parent_id: 5
- Products:
product_id: 101,name: "UltraBook Pro",description: "Lightweight laptop"product_id: 102,name: "Gaming PC",description: "High-performance desktop"product_id: 103,name: "Wireless Mouse",description: "Ergonomic mouse"
- Product Category Associations:
product_id: 101,category_id: 3product_id: 101,category_id: 5 (can be in multiple categories)product_id: 102,category_id: 4product_id: 103,category_id: 6
SQL Schema Design (Pseudocode):
-- Table for Categories
CREATE TABLE Categories (
category_id INT PRIMARY KEY,
name VARCHAR(255) NOT NULL,
parent_id INT NULL, -- For hierarchical structure
FOREIGN KEY (parent_id) REFERENCES Categories(category_id)
);
-- Table for Products
CREATE TABLE Products (
product_id INT PRIMARY KEY,
name VARCHAR(255) NOT NULL
);
-- Junction table for many-to-many relationship between Products and Categories
CREATE TABLE ProductCategories (
product_id INT,
category_id INT,
PRIMARY KEY (product_id, category_id),
FOREIGN KEY (product_id) REFERENCES Products(product_id),
FOREIGN KEY (category_id) REFERENCES Categories(category_id)
);
-- Table for Temporal Product Attributes
CREATE TABLE ProductAttributeHistory (
attribute_history_id BIGINT PRIMARY KEY AUTO_INCREMENT, -- Unique ID for each history record
product_id INT NOT NULL,
attribute_name VARCHAR(50) NOT NULL, -- e.g., 'price', 'description', 'is_active'
attribute_value VARCHAR(255) NOT NULL, -- Store value as string, cast as needed
valid_from TIMESTAMP NOT NULL,
valid_to TIMESTAMP NULL, -- NULL indicates the current valid record
FOREIGN KEY (product_id) REFERENCES Products(product_id)
);
Output (Conceptual Query Result): Find all products in the "Electronics" category and its subcategories, as of a specific timestamp.
- Query: Retrieve
product_id,namefor products associated with "Electronics" or any of its descendants. - Result:
product_id: 101,name: "UltraBook Pro"product_id: 102,name: "Gaming PC"product_id: 103,name: "Wireless Mouse"
Explanation: The query would use a recursive CTE (Common Table Expression) or similar mechanism to traverse the Categories hierarchy starting from "Electronics" (category_id 1). It would then join with ProductCategories and Products to fetch the relevant product information.
Example 2: Temporal Attribute Query
Input Data (Continuing from Example 1, with added history):
- Products:
product_id: 101,name: "UltraBook Pro"
- ProductAttributeHistory:
attribute_history_id: 1,product_id: 101,attribute_name: "price",attribute_value: "1200.00",valid_from: "2023-01-01 00:00:00",valid_to: "2023-06-30 23:59:59"attribute_history_id: 2,product_id: 101,attribute_name: "price",attribute_value: "1150.00",valid_from: "2023-07-01 00:00:00",valid_to: NULLattribute_history_id: 3,product_id: 101,attribute_name: "description",attribute_value: "Sleek and powerful laptop",valid_from: "2023-01-01 00:00:00",valid_to: "2023-09-15 12:00:00"attribute_history_id: 4,product_id: 101,attribute_name: "description",attribute_value: "Ultra-lightweight laptop with premium features",valid_from: "2023-09-15 12:00:01",valid_to: NULLattribute_history_id: 5,product_id: 101,attribute_name: "is_active",attribute_value: "TRUE",valid_from: "2023-01-01 00:00:00",valid_to: NULL
Query: Get the price and description of "UltraBook Pro" (product_id 101) as of "2023-08-15 10:00:00".
Output (Conceptual Result):
product_id | price | description
-----------|--------|----------------------------
101 | 1150.00| Sleek and powerful laptop
Explanation:
- For
price:- Record 1:
valid_from< "2023-08-15",valid_to< "2023-08-15". Not the latest valid. - Record 2:
valid_from< "2023-08-15",valid_tois NULL (meaning it's current). This is the selected record.
- Record 1:
- For
description:- Record 3:
valid_from< "2023-08-15",valid_to< "2023-08-15". Not the latest valid. - Record 4:
valid_from< "2023-08-15",valid_tois NULL. This is the selected record.
- Record 3:
The query needs to find the record for each attribute_name where valid_from is less than or equal to the target timestamp, and (valid_to is NULL OR valid_to is greater than or equal to the target timestamp), and then select the one with the latest valid_from among those that satisfy the condition.
Example 3: Edge Case - Querying Past a valid_to Date
Input Data (Same as Example 2):
Query: Get the price and description of "UltraBook Pro" (product_id 101) as of "2023-07-01 15:00:00".
Output (Conceptual Result):
product_id | price | description
-----------|--------|----------------------------
101 | 1150.00| Sleek and powerful laptop
Explanation:
- For
price:- Record 1:
valid_from< "2023-07-01",valid_to< "2023-07-01". This record ended its validity before the query timestamp. - Record 2:
valid_from< "2023-07-01",valid_tois NULL. This is the current valid record.
- Record 1:
- For
description:- Record 3:
valid_from< "2023-07-01",valid_to> "2023-07-01". This record is valid at the query timestamp. - Record 4:
valid_from> "2023-07-01",valid_tois NULL. This record starts its validity after the query timestamp.
- Record 3:
The query correctly identifies Record 2 for price and Record 3 for description as being valid at "2023-07-01 15:00:00".
Constraints
- The
Categoriestable will have a maximum of 10,000 categories. - The
Productstable will have a maximum of 1,000,000 products. - The
ProductAttributeHistorytable can grow very large, potentially containing billions of records. - All timestamps will be in UTC.
- Attribute values for
pricewill be decimal numbers,is_activewill be boolean, anddescriptionwill be text. Your schema should accommodate these types or a common type for storage with casting. - Queries for temporal data should aim to execute within acceptable performance limits for analytical workloads (e.g., sub-second to a few seconds for typical historical queries).
Notes
- Consider using a database system that excels at handling time-series data or large datasets.
- For hierarchical queries, research common SQL patterns like recursive CTEs or adjacency lists.
- For temporal queries, think about how to efficiently find the correct record for a given point in time. Indexing will be critical here.
- The
valid_tofield being NULL signifies that the record is the current, active version of the attribute. - When designing the
ProductAttributeHistorytable, consider the trade-offs of storing different data types in a singleattribute_valuecolumn versus having separate columns for different attribute types (e.g.,price_value,description_value). For this challenge, a singleattribute_valuecolumn with appropriate casting is acceptable. - Success will be measured by a well-designed, normalized (or appropriately denormalized for performance) schema that can accurately answer the specified queries efficiently.