Hone logo
Hone
Problems

Designing a Multi-Dimensional Data Warehouse for E-commerce Sales

Imagine you are tasked with designing a data warehouse for a rapidly growing e-commerce company. The business needs to analyze sales performance across various dimensions like products, customers, geographical regions, and time. A well-designed schema is crucial for efficient querying, reporting, and business intelligence.

Problem Description

Your goal is to design a relational schema for a data warehouse that stores historical sales data for an e-commerce platform. The schema should support a wide range of analytical queries. You will need to identify and define the fact table(s) and the dimension tables that will store the transactional data and its associated descriptive attributes.

Key requirements:

  • Fact Table: The central table should capture core sales metrics.
  • Dimension Tables: Separate tables should store descriptive attributes for products, customers, geographical locations, and time.
  • Granularity: The fact table should be at the transaction line-item level.
  • Star Schema Principles: While flexibility is key, aim for a design that aligns with star schema principles for ease of understanding and querying.
  • Degenerate Dimensions: Consider how to handle dimensions that don't have many associated attributes.

Expected behavior: The designed schema should allow for efficient retrieval of information such as:

  • Total sales revenue per product category per month.
  • Top 10 customers by total spending in the last quarter.
  • Sales volume by country and product brand.
  • Average order value for new customers versus returning customers.

Important edge cases:

  • Handling of returns and cancelled orders.
  • Dealing with products that might not have all attributes initially.
  • Ensuring historical accuracy even when dimension attributes change (e.g., customer address changes).

Examples

Example 1: Basic Sales Fact

Let's consider a single sale transaction.

Input: A single online purchase record with the following information:

  • Order ID: 12345
  • Customer ID: CUST001
  • Product ID: PROD005
  • Quantity Sold: 2
  • Price Per Unit: $15.99
  • Sale Timestamp: 2023-10-27 10:30:00
  • Shipping Address Country: USA

Output: This input would contribute to the fact table and potentially update dimension tables. A simplified representation of a fact table row might look like:

OrderItemIDOrderIDProductKeyCustomerKeyDateKeyQuantitySaleAmount
112345510120231027231.98

Note: ProductKey, CustomerKey, and DateKey are surrogate keys linking to dimension tables.

Explanation: The sales fact table captures the core transactional details. The SaleAmount is calculated as Quantity Sold * Price Per Unit. Surrogate keys are used for efficient joins to dimension tables.

Example 2: Product Dimension

Input: Information about a specific product:

  • Product ID (natural key): PROD005
  • Product Name: Wireless Mouse X1
  • Category: Electronics
  • Brand: TechGear
  • Supplier: GlobalSupplies

Output: A row in the Product Dimension table.

ProductKey (Surrogate)ProductID (Natural)ProductNameCategoryBrandSupplier
5PROD005Wireless Mouse X1ElectronicsTechGearGlobalSupplies

Explanation: This dimension table stores descriptive attributes about each product. The ProductKey is a surrogate key, while ProductID is the business's natural identifier.

Example 3: Handling Changes (Customer Address)

Input: A customer makes two purchases at different times.

  • Purchase 1:
    • Order ID: 12346
    • Customer ID: CUST002
    • Sale Timestamp: 2023-11-15 09:00:00
    • Shipping Address Country: USA
  • Purchase 2:
    • Order ID: 12347
    • Customer ID: CUST002
    • Sale Timestamp: 2023-12-01 14:00:00
    • Shipping Address Country: Canada (Customer moved)

Output: This scenario requires a dimension table design that handles Slowly Changing Dimensions (SCDs). A common approach is SCD Type 2, which creates new rows for changes.

Customer Dimension (Simplified SCD Type 2):

CustomerKey (Surrogate)CustomerID (Natural)CustomerNameCurrentAddressValidFromValidToIsCurrent
102CUST002Jane Doe123 Main St, USA2023-01-012023-11-300
103CUST002Jane Doe456 Oak Ave, Canada2023-12-019999-12-311

Note: ValidFrom, ValidTo, and IsCurrent flags are used to track the active period of a customer's record.

Explanation: When the customer's address (a dimension attribute) changed, a new record was created in the Customer dimension table. The previous record was marked as no longer current. This allows historical sales to be associated with the address that was valid at the time of the sale.

Constraints

  • The data warehouse will store at least 5 years of historical sales data.
  • The fact table can grow up to billions of rows.
  • Dimension tables will contain millions of rows for customers and products.
  • The schema must be implementable using standard SQL data types and constructs.
  • Queries for common analytical reports (as described in Expected Behavior) should execute within acceptable performance thresholds (e.g., sub-minute for typical reports).

Notes

  • Consider how you will handle date and time attributes. A dedicated Date Dimension table is highly recommended for efficient time-based analysis.
  • Think about surrogate keys for all dimension tables. These are typically auto-generated integers and are crucial for performance and managing dimension changes.
  • For attributes that change over time (like customer addresses), research and choose an appropriate Slowly Changing Dimension (SCD) strategy.
  • The definition of "sales" should include gross sales and potentially net sales (after discounts, before returns). Define how you will represent these.
  • You will need to define the primary and foreign key relationships between your tables.
  • Pseudocode for table definitions and relationships is acceptable if you cannot use specific SQL syntax. Focus on the structure and intent.
Loading editor...
plaintext