What are the best practices to maximize speed and minimize data storage when updating a large dataset? I have an example below, but I'm open to any tips on performance.
We have a dataset that consists of an initial load (~1 million records) and a daily* update of ~20k records (and growing). Within the first year we expect to be at 8 million records, 20 million by year 2, and 40 million by year 3. Each record has an ID field that is a unique identifier in the database, but the data we are reading is effectively a transaction log: each row is flagged with an I for an insert or a U for an update (no D for deletes here).
* The days are UTC days, so in a perfect world I'd grab a "day" twice: when I run the job at, say, 2am CT, I'd pull "yesterday" and "today" (so far).
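To illustrate, here's a minimal Python sketch of the window I mean (the S3 key pattern and names are placeholders, not our actual layout):

```python
from datetime import datetime, timedelta, timezone

# Running at ~2am CT: "yesterday" (UTC) just closed and "today" (UTC) is still
# in progress, so the job wants files for both UTC dates.
now_utc = datetime.now(timezone.utc)
today_utc = now_utc.date()
yesterday_utc = today_utc - timedelta(days=1)

# Placeholder S3 key pattern for the two daily extract files to pull.
keys_to_fetch = [f"transactions/{d.isoformat()}.csv" for d in (yesterday_utc, today_utc)]
print(keys_to_fetch)
```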
The end dataset should have a row for every record with the most recent update being shown (I don't need the transactional history).
I want to minimize the amount of data we store in Domo for this, since the raw data will be retained indefinitely in an S3 bucket, and I want to maximize processing speed, since this data is used downstream by datasets with larger runs. That said, I'm willing to sacrifice some storage for speed.
Option 1 - I can have the dataset just do an append, do a rank and window on update date, and filter that data into my dataset (rough sketch after the bullets).
- This would be super inefficient, as I'm processing the whole of my data every single time, and I'm ranking and windowing, which is always a performance dog in Magic (I'm guessing because we can't specify an index?).
- This would be the maximum amount of storage used, as all my raw data would continue to stack up.
- The only things I'm saving are not doing it at the visualization layer and not getting into recursion.
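For reference, this is essentially what that rank-and-window boils down to, sketched in pandas rather than Magic (the column names id/update_date and the file names are assumptions for illustration):

```python
import pandas as pd

# Full appended transaction log: every I and U row ever loaded.
raw = pd.read_csv("full_transaction_log.csv", parse_dates=["update_date"])

# Equivalent of rank-and-window partitioned by id, ordered by update_date desc:
# sort so the newest row per id comes first, then keep only that row.
latest = (
    raw.sort_values(["id", "update_date"], ascending=[True, False])
       .drop_duplicates(subset="id", keep="first")
)

latest.to_csv("current_state.csv", index=False)
```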
Alternatively, I can have my dataset do a replace and build a recursive dataset. From there I have two options...
Option 2A - Append the new data to the recursed dataset, rank and window and filter the whole thing, and write my new dataset (sketch after the bullets).
- This is still inefficient, as I'm processing the whole of the data every single time and ranking and windowing the whole thing, which is again a performance dog in Magic.
- I've reduced my storage by quite a bit, and I'm still not doing it in the visualization layer.
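Same dedupe as above, but over the previous output plus only the new day's rows instead of the full history; a rough pandas equivalent (file and column names are placeholders again):

```python
import pandas as pd

# Previous run's output (one row per id) plus today's ~20k transaction rows.
previous = pd.read_csv("current_state.csv", parse_dates=["update_date"])
new_rows = pd.read_csv("todays_transactions.csv", parse_dates=["update_date"])

# Append, then rank/window/filter the whole thing to keep the newest row per id.
combined = pd.concat([previous, new_rows], ignore_index=True)
latest = (
    combined.sort_values(["id", "update_date"], ascending=[True, False])
            .drop_duplicates(subset="id", keep="first")
)

latest.to_csv("current_state.csv", index=False)
```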
Option 2B - Bring my new data in and split it into Inserts and Updates. Append the Inserts to the recursed dataset. Rank and window and filter the Updates. Join the filtered Updates to the Recursed + Inserts and filter out those matches. Append the filtered Updates to the filtered Recursed + Inserts (sketch after the bullets).
- I think this is as efficient as I can get, since I've minimized what I'm doing a Rank and Window on, but I've added more steps. Because my join is on an INT, it should be efficient.
- Storage is still reduced, and I'm still not doing it in the visualization layer.
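Here's how I picture 2B, again as a pandas sketch rather than Magic tiles (the join-then-filter-out-matches step is shown as an isin anti-join for brevity; all names are placeholders):

```python
import pandas as pd

previous = pd.read_csv("current_state.csv", parse_dates=["update_date"])
new_rows = pd.read_csv("todays_transactions.csv", parse_dates=["update_date"])

# Split today's batch by the action flag.
inserts = new_rows[new_rows["action"] == "I"]
updates = new_rows[new_rows["action"] == "U"]

# Recursed output + brand-new Inserts (ids that shouldn't exist there yet).
base = pd.concat([previous, inserts], ignore_index=True)

# Rank/window only the small Updates set: keep the newest update per id.
latest_updates = (
    updates.sort_values(["id", "update_date"], ascending=[True, False])
           .drop_duplicates(subset="id", keep="first")
)

# Filter out of the base any id that is being replaced by an update today...
unchanged = base[~base["id"].isin(latest_updates["id"])]

# ...then append the filtered Updates to end up with one current row per id.
current = pd.concat([unchanged, latest_updates], ignore_index=True)
current.to_csv("current_state.csv", index=False)
```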
After 30 days, the data is effectively locked. It would be great if I could bring in only the last 30 days of data from the recursed dataset and write back just those 30 days, leaving days 31+ just hanging out (rough sketch below). https://www.youtube.com/watch?v=grUeb9O1Tm0 talks about partitioning, but I don't think that really helps me here.
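In case it clarifies what I'm after, this is the shape of that last idea (again just a sketch with placeholder names; in Domo the final write would ideally be a partition/UPSERT-style write that touches only the active slice):

```python
import pandas as pd
from datetime import datetime, timedelta, timezone

# Naive UTC cutoff so it compares cleanly against the parsed update_date column.
cutoff = datetime.now(timezone.utc).replace(tzinfo=None) - timedelta(days=30)

previous = pd.read_csv("current_state.csv", parse_dates=["update_date"])
new_rows = pd.read_csv("todays_transactions.csv", parse_dates=["update_date"])

# Only the still-mutable slice gets read, deduped, and rewritten.
active = previous[previous["update_date"] >= cutoff]   # last 30 days
locked = previous[previous["update_date"] < cutoff]    # days 31+ left alone

refreshed = (
    pd.concat([active, new_rows], ignore_index=True)
      .sort_values(["id", "update_date"], ascending=[True, False])
      .drop_duplicates(subset="id", keep="first")
)

# Reassemble only to show the intended end state; ideally the locked slice
# would never be read or rewritten at all.
current = pd.concat([locked, refreshed], ignore_index=True)
current.to_csv("current_state.csv", index=False)
```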