SQL Data Archiving Strategy
Many applications generate a significant amount of historical data that is rarely accessed but must be retained for compliance or auditing purposes. Storing this "cold" data alongside active data can lead to performance degradation and increased storage costs. This challenge requires you to design and implement a strategy for moving older, less frequently accessed data from a primary "active" table to a separate "archive" table.
Problem Description
Your task is to create a mechanism within a SQL database to archive records from a primary table (Orders) to a secondary table (OrdersArchive). This process should be triggered periodically or based on specific criteria, moving data that meets the archiving condition to the archive table and then removing it from the primary table.
Key Requirements:
- Archiving Condition: Records should be archived if their
order_dateis older than a specified retention period (e.g., 1 year). - Data Transfer: Successfully archived records must be copied from
OrderstoOrdersArchivebefore being deleted fromOrders. - Deletion: Once successfully copied, the original records must be removed from the
Orderstable. - Transactionality: The entire archiving process (copy and delete) must be atomic, meaning it either completes successfully or fails entirely, leaving the database in its original state.
- Table Structure:
Orderstable:order_id(INT, PRIMARY KEY),customer_id(INT),order_date(DATE),total_amount(DECIMAL)OrdersArchivetable: Should have the same structure asOrders, potentially with an additionalarchived_attimestamp.
Expected Behavior:
A process should be implemented (e.g., a stored procedure or script) that, when executed:
- Identifies all records in
Ordersthat satisfy the archiving condition. - Copies these identified records into
OrdersArchive. - Deletes the same identified records from
Orders. - Ensures that no data is lost or duplicated during this operation.
Edge Cases:
- What happens if the archiving process fails midway (e.g., after copying but before deletion)?
- What if there are no records to archive? The process should handle this gracefully.
- Consider potential locking issues if the archive process runs concurrently with other operations on the
Orderstable.
Examples
Example 1:
Assume the retention period is 1 year, and today's date is '2023-10-27'.
Orders Table:
+----------+-------------+------------+--------------+
| order_id | customer_id | order_date | total_amount |
+----------+-------------+------------+--------------+
| 101 | 1 | 2022-05-15 | 150.75 |
| 102 | 2 | 2023-01-20 | 75.00 |
| 103 | 1 | 2023-08-10 | 200.50 |
| 104 | 3 | 2022-11-01 | 120.00 |
+----------+-------------+------------+--------------+
OrdersArchive Table: (Initially empty)
+----------+-------------+------------+--------------+---------------+
| order_id | customer_id | order_date | total_amount | archived_at |
+----------+-------------+------------+--------------+---------------+
| | | | | |
+----------+-------------+------------+--------------+---------------+
Expected Output:
After executing the archiving process:
Orders Table:
+----------+-------------+------------+--------------+
| order_id | customer_id | order_date | total_amount |
+----------+-------------+------------+--------------+
| 102 | 2 | 2023-01-20 | 75.00 |
| 103 | 1 | 2023-08-10 | 200.50 |
+----------+-------------+------------+--------------+
OrdersArchive Table:
+----------+-------------+------------+--------------+-------------------------+
| order_id | customer_id | order_date | total_amount | archived_at |
+----------+-------------+------------+--------------+-------------------------+
| 101 | 1 | 2022-05-15 | 150.75 | 2023-10-27 10:00:00 |
| 104 | 3 | 2022-11-01 | 120.00 | 2023-10-27 10:00:00 |
+----------+-------------+------------+--------------+-------------------------+
Explanation:
Records with order_date before '2022-10-27' (i.e., order_id 101 and 104) are moved to OrdersArchive and then deleted from Orders.
Example 2: No records to archive
Orders Table:
+----------+-------------+------------+--------------+
| order_id | customer_id | order_date | total_amount |
+----------+-------------+------------+--------------+
| 201 | 4 | 2023-09-01 | 50.00 |
| 202 | 5 | 2023-10-15 | 80.25 |
+----------+-------------+------------+--------------+
OrdersArchive Table: (Assume it might contain older data)
+----------+-------------+------------+--------------+---------------+
| order_id | customer_id | order_date | total_amount | archived_at |
+----------+-------------+------------+--------------+---------------+
| 101 | 1 | 2022-05-15 | 150.75 | 2023-01-01 |
+----------+-------------+------------+--------------+---------------+
Expected Output:
After executing the archiving process:
Orders Table:
+----------+-------------+------------+--------------+
| order_id | customer_id | order_date | total_amount |
+----------+-------------+------------+--------------+
| 201 | 4 | 2023-09-01 | 50.00 |
| 202 | 5 | 2023-10-15 | 80.25 |
+----------+-------------+------------+--------------+
OrdersArchive Table: (Remains unchanged)
+----------+-------------+------------+--------------+---------------+
| order_id | customer_id | order_date | total_amount | archived_at |
+----------+-------------+------------+--------------+---------------+
| 101 | 1 | 2022-05-15 | 150.75 | 2023-01-01 |
+----------+-------------+------------+--------------+---------------+
Explanation:
No records in Orders are older than 1 year from '2023-10-27', so no archiving occurs.
Constraints
- The
Orderstable can contain up to 1,000,000,000 records. - The
order_datecolumn will always be a valid date. - The archiving process should ideally complete within 1 hour for a full day's worth of data to be archived.
- The retention period will be a positive integer representing years.
Notes
- You will need to define the retention period. For the purpose of testing, assume a retention period of 1 year from the current date.
- Consider using SQL's transaction management features to ensure atomicity.
- Think about how you would make this process repeatable and potentially automated (e.g., using scheduled jobs).
- The
archived_atcolumn inOrdersArchiveshould record the timestamp when the record was archived.