I have a SQL Server dataset of 50M rows. Workbench runs the APPEND Update Method HOURLY. I use the table ID as the UPSERT key and the processing query checks the past 7 days ("Insertion Date") for new records. This updates older data unnecessarily with the same data, but protects against issues with the job not running for up to 7 days.
-"Insertion Date" column [datetime in SQL Server] is NOT seen by customers
-"Transaction Date" [datetime in SQL Server] IS seen by customers and used for filtering.
Performance is terrible once we have several cards attached to this dataset, and I'm guessing not using PARTITIONING is part of the problem.
A "day" grain on "Transaction Date" would make sense, but I can't risk data coming in with an old "Transaction Date" potentially removing any existing data from that date. My source data can get old records where "Transaction Date", but a recent "Insertion Date". For example, transactions from 6 months ago could get keyed in today.
What is a proper strategy for partitioning/maintaining this data set that would never lose data, and still update every hour?