How to properly partition in Workbench while allowing older data to come in?

JunkDoom · October 2022

I have a SQL Server dataset of 50M rows. Workbench runs the APPEND Update Method HOURLY. I use the table ID as the UPSERT key and the processing query checks the past 7 days ("Insertion Date") for new records. This updates older data unnecessarily with the same data, but protects against issues with the job not running for up to 7 days.

-"Insertion Date" column [datetime in SQL Server] is NOT seen by customers

-"Transaction Date" [datetime in SQL Server] IS seen by customers and used for filtering.

Performance is terrible once we have several cards attached to this dataset, and I'm guessing not using PARTITIONING is part of the problem.

A "day" grain on "Transaction Date" would make sense, but I can't risk data coming in with an old "Transaction Date" potentially removing any existing data from that date. My source data can get old records where "Transaction Date", but a recent "Insertion Date". For example, transactions from 6 months ago could get keyed in today.

What is a proper strategy for partitioning/maintaining this data set that would never lose data, and still update every hour?

GrantSmith · October 2022

What you might need to do is query your dataset to see what transaction dates were updated during the last partition chunk of insertion date and then re-pull the all of the transactions with the transaction dates from the list of transactions dates which were inserted / updated.

So you're partitioning based off the transaction date but keying off the insertion date if that makes sense.

This may not be as performant though depending on the amount of transaction dates that would be updating each job execution.

How to properly partition in Workbench while allowing older data to come in?

Answers

Categories