Hone logo
Hone
Problems

SQL Data Archiving Strategy

Many applications generate a significant amount of historical data that is rarely accessed but must be retained for compliance or auditing purposes. Storing this "cold" data alongside active data can lead to performance degradation and increased storage costs. This challenge requires you to design and implement a strategy for moving older, less frequently accessed data from a primary "active" table to a separate "archive" table.

Problem Description

Your task is to create a mechanism within a SQL database to archive records from a primary table (Orders) to a secondary table (OrdersArchive). This process should be triggered periodically or based on specific criteria, moving data that meets the archiving condition to the archive table and then removing it from the primary table.

Key Requirements:

  1. Archiving Condition: Records should be archived if their order_date is older than a specified retention period (e.g., 1 year).
  2. Data Transfer: Successfully archived records must be copied from Orders to OrdersArchive before being deleted from Orders.
  3. Deletion: Once successfully copied, the original records must be removed from the Orders table.
  4. Transactionality: The entire archiving process (copy and delete) must be atomic, meaning it either completes successfully or fails entirely, leaving the database in its original state.
  5. Table Structure:
    • Orders table: order_id (INT, PRIMARY KEY), customer_id (INT), order_date (DATE), total_amount (DECIMAL)
    • OrdersArchive table: Should have the same structure as Orders, potentially with an additional archived_at timestamp.

Expected Behavior:

A process should be implemented (e.g., a stored procedure or script) that, when executed:

  • Identifies all records in Orders that satisfy the archiving condition.
  • Copies these identified records into OrdersArchive.
  • Deletes the same identified records from Orders.
  • Ensures that no data is lost or duplicated during this operation.

Edge Cases:

  • What happens if the archiving process fails midway (e.g., after copying but before deletion)?
  • What if there are no records to archive? The process should handle this gracefully.
  • Consider potential locking issues if the archive process runs concurrently with other operations on the Orders table.

Examples

Example 1:

Assume the retention period is 1 year, and today's date is '2023-10-27'.

Orders Table:
+----------+-------------+------------+--------------+
| order_id | customer_id | order_date | total_amount |
+----------+-------------+------------+--------------+
| 101      | 1           | 2022-05-15 | 150.75       |
| 102      | 2           | 2023-01-20 | 75.00        |
| 103      | 1           | 2023-08-10 | 200.50       |
| 104      | 3           | 2022-11-01 | 120.00       |
+----------+-------------+------------+--------------+

OrdersArchive Table: (Initially empty)
+----------+-------------+------------+--------------+---------------+
| order_id | customer_id | order_date | total_amount | archived_at   |
+----------+-------------+------------+--------------+---------------+
|          |             |            |              |               |
+----------+-------------+------------+--------------+---------------+

Expected Output:

After executing the archiving process:

Orders Table:
+----------+-------------+------------+--------------+
| order_id | customer_id | order_date | total_amount |
+----------+-------------+------------+--------------+
| 102      | 2           | 2023-01-20 | 75.00        |
| 103      | 1           | 2023-08-10 | 200.50       |
+----------+-------------+------------+--------------+

OrdersArchive Table:
+----------+-------------+------------+--------------+-------------------------+
| order_id | customer_id | order_date | total_amount | archived_at             |
+----------+-------------+------------+--------------+-------------------------+
| 101      | 1           | 2022-05-15 | 150.75       | 2023-10-27 10:00:00     |
| 104      | 3           | 2022-11-01 | 120.00       | 2023-10-27 10:00:00     |
+----------+-------------+------------+--------------+-------------------------+

Explanation: Records with order_date before '2022-10-27' (i.e., order_id 101 and 104) are moved to OrdersArchive and then deleted from Orders.

Example 2: No records to archive

Orders Table:
+----------+-------------+------------+--------------+
| order_id | customer_id | order_date | total_amount |
+----------+-------------+------------+--------------+
| 201      | 4           | 2023-09-01 | 50.00        |
| 202      | 5           | 2023-10-15 | 80.25        |
+----------+-------------+------------+--------------+

OrdersArchive Table: (Assume it might contain older data)
+----------+-------------+------------+--------------+---------------+
| order_id | customer_id | order_date | total_amount | archived_at   |
+----------+-------------+------------+--------------+---------------+
| 101      | 1           | 2022-05-15 | 150.75       | 2023-01-01    |
+----------+-------------+------------+--------------+---------------+

Expected Output:

After executing the archiving process:

Orders Table:
+----------+-------------+------------+--------------+
| order_id | customer_id | order_date | total_amount |
+----------+-------------+------------+--------------+
| 201      | 4           | 2023-09-01 | 50.00        |
| 202      | 5           | 2023-10-15 | 80.25        |
+----------+-------------+------------+--------------+

OrdersArchive Table: (Remains unchanged)
+----------+-------------+------------+--------------+---------------+
| order_id | customer_id | order_date | total_amount | archived_at   |
+----------+-------------+------------+--------------+---------------+
| 101      | 1           | 2022-05-15 | 150.75       | 2023-01-01    |
+----------+-------------+------------+--------------+---------------+

Explanation: No records in Orders are older than 1 year from '2023-10-27', so no archiving occurs.

Constraints

  • The Orders table can contain up to 1,000,000,000 records.
  • The order_date column will always be a valid date.
  • The archiving process should ideally complete within 1 hour for a full day's worth of data to be archived.
  • The retention period will be a positive integer representing years.

Notes

  • You will need to define the retention period. For the purpose of testing, assume a retention period of 1 year from the current date.
  • Consider using SQL's transaction management features to ensure atomicity.
  • Think about how you would make this process repeatable and potentially automated (e.g., using scheduled jobs).
  • The archived_at column in OrdersArchive should record the timestamp when the record was archived.
Loading editor...
plaintext