What are the best practices to maximize speed and minimize data storage when updating a large dataset

SeanPT (Contributor)
edited March 2023 in Scheduled Reports

What are the best practices to maximize speed and minimize data storage when updating a large dataset? I have an example below, but I'm willing to hear any tips on performance.

We have a dataset that consists of an initial load (~1 million records) and a daily* update of ~20k records (and growing). Within the first year we expect to be at 8 million records, 20 million by year two, and 40 million by year three. Each record has an ID field that is a unique identifier in the database, but the data we are reading is effectively a transaction log. Each row is flagged with an I for an insert or a U for an update (no D for deletes here).

* The days are UTC days, so in a perfect world I'd grab a "day" twice: when I run the job at, say, 2am CT, I get "yesterday" and "today" (so far).

The end dataset should have a row for every record with the most recent update being shown (I don't need the transactional history).

I want to minimize the amount of data we store in Domo, since the raw data will be retained indefinitely in an S3 bucket. And I want to maximize processing speed, because this data feeds downstream datasets with larger runs. That said, I'm willing to sacrifice more storage for speed.
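To make the shape of this concrete, here is a minimal pandas sketch of the input and the end state I'm after (column names and values are made up; the real feed has more fields):

```python
import pandas as pd

# Made-up columns and values standing in for the transaction log described above.
log = pd.DataFrame({
    "id":          [101, 102, 101, 103, 102],   # unique record identifier
    "op":          ["I", "I", "U", "I", "U"],    # I = insert, U = update, no deletes
    "update_date": pd.to_datetime(["2023-03-01", "2023-03-01", "2023-03-02",
                                   "2023-03-02", "2023-03-03"]),
    "value":       ["a", "b", "a2", "c", "b2"],
})

# End state: exactly one row per id, showing the most recent update.
latest = (log.sort_values("update_date")
             .drop_duplicates(subset="id", keep="last"))
# -> ids 101, 102, 103 with values a2, b2, c
```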

Option 1 - I can have the dataset just do an append, do a rank and window on the update date, and filter that data into my dataset (rough sketch after the bullets below).

  • This would be super inefficient, as I'm processing the whole of my data every single time, and I'm ranking and windowing, which is always a performance dog in Magic (I'm guessing because we can't specify an index?).
  • This would be the maximum amount of storage used as all my raw data would continue to stack up.
  • The only thing I'm saving is not doing it at the visualization layer and I'm not getting into recursion.
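For reference, this is roughly the logic Option 1 would repeat on every run - a pandas stand-in for the Rank & Window and Filter steps, not the actual Magic ETL:

```python
import pandas as pd

def latest_per_id(full_history: pd.DataFrame) -> pd.DataFrame:
    """Rank rows within each id by update_date (newest first) and keep rank 1."""
    # The entire raw history gets re-ranked on every run, which is why this
    # option only gets slower as the appended data stacks up.
    rn = (full_history.groupby("id")["update_date"]
                      .rank(method="first", ascending=False))
    return full_history[rn == 1]
```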

Alternatively, I can have my dataset do a replace and build a recursive dataset. From here I have two options...

Option 2A - Append the new data to the recursed dataset, rank and window and filter the whole thing, and write my new dataset (sketch after the bullets below).

  • This is still inefficient, as I'm processing the whole of the data every single time and ranking and windowing the whole thing, which is again a performance dog in Magic.
  • I've reduced my storage by quite a bit and I'm still not doing it in visualization.
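Sketched in the same pandas terms, where previous_output and new_batch are assumed names for the recursed output and today's load:

```python
import pandas as pd

def option_2a(previous_output: pd.DataFrame, new_batch: pd.DataFrame) -> pd.DataFrame:
    """Recursive append, then rank/window/filter the whole combined set."""
    # Even though only the ids touched by new_batch can change, every row in
    # the recursed dataset is still sorted and deduplicated on each run.
    combined = pd.concat([previous_output, new_batch], ignore_index=True)
    return (combined.sort_values("update_date")
                    .drop_duplicates(subset="id", keep="last"))
```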

Option 2B - Bring my new data in and split it into Inserts and Updates. Append the Inserts to the recursed dataset. Rank and window and filter the Updates. Join the filtered Updates to the Recursed + Inserts and filter out those matches. Append the filtered Updates to the filtered Recursed + Inserts. (Sketch after the bullets below.)

  • I think this is as efficient as I can get, since I've minimized what I'm running a Rank and Window over, but I've added more steps. Because my join is on an INT, the join should be efficient.
  • Storage is still reduced and still not doing it in the visualization layer.
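Same assumed names as above; the point is that only the small daily update slice gets ranked and windowed:

```python
import pandas as pd

def option_2b(previous_output: pd.DataFrame, new_batch: pd.DataFrame) -> pd.DataFrame:
    """Split the daily load, dedupe only the updates, anti-join, re-append."""
    inserts = new_batch[new_batch["op"] == "I"]
    updates = new_batch[new_batch["op"] == "U"]

    # Rank & window only the ~20k update rows: keep the latest update per id.
    latest_updates = (updates.sort_values("update_date")
                             .drop_duplicates(subset="id", keep="last"))

    # Recursed output + today's inserts, minus any id that has a fresher update
    # (the join/filter step), then append the filtered updates back in.
    base = pd.concat([previous_output, inserts], ignore_index=True)
    base = base[~base["id"].isin(latest_updates["id"])]
    return pd.concat([base, latest_updates], ignore_index=True)
```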

After 30 days, the data is effectively locked. It would be great if I could only bring in the last 30 days of data from the recursed dataset and write back just those 30 days of data, leaving days 31+ just hanging out. https://www.youtube.com/watch?v=grUeb9O1Tm0 talks about partitioning but I don't think that really helps me here.
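What I'd like is something along these lines, assuming a business_date column and that anything older than 30 days never changes (in practice the locked/active split would be partition selection rather than a filter):

```python
import pandas as pd

def refresh_last_30_days(previous_output: pd.DataFrame,
                         new_batch: pd.DataFrame) -> pd.DataFrame:
    """Re-state only the last 30 days; leave the locked rows untouched."""
    cutoff = (pd.Timestamp.now(tz="UTC").tz_localize(None).normalize()
              - pd.Timedelta(days=30))

    locked = previous_output[previous_output["business_date"] < cutoff]
    active = previous_output[previous_output["business_date"] >= cutoff]

    restated = (pd.concat([active, new_batch], ignore_index=True)
                  .sort_values("update_date")
                  .drop_duplicates(subset="id", keep="last"))
    return pd.concat([locked, restated], ignore_index=True)
```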


Answers

  • You could have two datasets, one to pull in 31 days ago (set to append) and another to pull in the last 30 days (set to replace) and then use an ETL to append both of the datasets together.

  • SeanPT (Contributor)

    We finally cracked this nut this month.

    Here is a big thing I was missing - the datasets created by the CLI tool or Workbench can have many, many partitions. The limit is MUCH higher than 1500. You still don't want to get stupid with them, but there is more room. Magic, however, can only process 1500 partitions. So when you bring in partitioned data, you need to get under 1500 partitions, and when you write to partitioned data, you need to write to a dataset with fewer than 1500.

    Here is what we did …

    1. We bring in data every day. We went back and grabbed all of our back data, and it is all loaded by date of business and timezone - 1460 partitions in a year. So if I ever need to parse a full year in Magic, I can, but not much more than that. And that's OK.
    2. To build the data, we use the ability in our dataflow to select only certain partitions, e.g. pulling only a year of data to back-build. We then output to data partitioned by month (YYYY-MM, to be clear).
    3. Once the back data is done, we start on our regular processing. Every day when new data comes in, we grab all partitions from the current month AND the previous month. Why? We can't trust that on April 1st we won't have to re-state data from March 30th. Best case, we are processing 29 days of data on March 1st; worst case, we are processing 62 days of data on August 31st. But that's a-okay. That data is again partitioned out by month (YYYY-MM). So if some weird thing happened where on April 1st we realized we needed to restate 2/28's data, we could manually re-select all of February's data and restate that partition.
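    In case it helps anyone, the partition selection in step 3 boils down to something like this (an illustration only, not the dataflow itself):

    ```python
    from datetime import date, timedelta

    def months_to_reprocess(today: date) -> list[str]:
        """Return the YYYY-MM partition keys for the previous and current month."""
        first_of_this_month = today.replace(day=1)
        last_of_prev_month = first_of_this_month - timedelta(days=1)
        return [last_of_prev_month.strftime("%Y-%m"), today.strftime("%Y-%m")]

    # months_to_reprocess(date(2023, 3, 1))  -> ["2023-02", "2023-03"]  (29 days, best case)
    # months_to_reprocess(date(2023, 8, 31)) -> ["2023-07", "2023-08"]  (62 days, worst case)
    ```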

    And then Bob is our uncle. If we need to re-use the finished datasets downstream, we account for those partitions and carry over the YYYY-MM partitioning. So even on our 3B row dataset, we can process what needs to be processed in minutes.

    We thought we would have to take all of our raw input datasets and JUST partition them and use those partitioned datasets downstream, effectively re-stating our data. We ended up not having to do that. Hooray.

    A few other roadblocks we thought we had are addressed above.

    Going back to how we bring the data in: here we are dealing with a SOAP XML endpoint that we hit with a Python script to grab, parse, and output the data. We used to use the Python SDK to upload to Domo, but that doesn't support partitions.

    So in Python we grab everything and then call the CLI tool and let it do the uploading. We use the CLI tool over Workbench because these are running in Linux containers: master control launches the container, the container does the work, the container closes. I make budget, everyone on my team gets a bottle of bourbon. Win, win, win.
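    The hand-off looks roughly like this; the CLI commands below are placeholders rather than the real syntax, so check the CLI tool's documentation for the actual connect/upload commands and jar invocation for your version:

    ```python
    import subprocess
    from pathlib import Path

    def upload_partition(csv_path: Path, partition_key: str) -> None:
        """Let the CLI tool do the partitioned upload; Python just shells out."""
        script = Path(f"upload_{partition_key}.script")
        script.write_text("\n".join([
            # Placeholders only -- substitute the real commands for your CLI version.
            "<connect-to-your-instance-command>",
            f"<partition-upload-command> <dataset-id> {csv_path} {partition_key}",
        ]))
        # The CLI ships as a jar; the jar name and -script flag are illustrative too.
        subprocess.run(["java", "-jar", "domoUtil.jar", "-script", str(script)],
                       check=True)
    ```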

  • Thanks for posting your solution @SeanPT! Glad you got it figured out.
