How to remove rows from a dataset, with the output being the same dataset

Summary: I want to remove rows from a dataset based on certain conditions. I want the final dataset to be the same as the input dataset (but without the deleted rows).

 

Use Case: 

 

Current Usage:

Step 1: I get a file with about 100,000 rows every day (through the Amazon S3 connector). I use the append update method, so each day's data keeps adding to the old data. Let's call this the "S3 File". 

Step 2: I then use a SQL dataflow to join this "S3 File" with another table (to add additional information). This flow produces another output dataset. Let's call this the "Final File". I use the "Final File" in a dashboard. 
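For context, the Step 2 transform is conceptually just a join like the sketch below. The table and column names (s3_file, reference_table, account_id, region_name) are placeholders I'm assuming, since the actual schema isn't shown:

    -- Hypothetical Step 2 transform: enrich each "S3 File" row with
    -- additional columns from a reference table.
    SELECT s.*,
           r.region_name
    FROM   s3_file s
           LEFT JOIN reference_table r
                  ON s.account_id = r.account_id;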

 

Problem: 

The "S3 File", over a period of a month, now has over 3 million rows. This has 2 drawbacks:

      .1. The SQL dataflow (Step 1 in my explanation) takes a lot of time to run. Now its over 1 hour

      .2. In the morning when hundreds of our company's dataflows are running, the server gets slowed down (we're assuming because of huge files like the one discussed being processed). Processing times are doubled, even tripled. 

      .3. We have a limit to the number of rows we're allowed, and it will soon run out if this keeps on going. 

       .4. With a dataflow time of one hour (which can soon go to 2 hours or even more), our dashboard users get to see the latest data very late in the day, which they prefer seeing earlier. 

 

Proposed Solution:

I'm looking for a way to keep only the rows that are relevant in the dataset. I determine relevance in Step 2 (in my explanation above) when I run the SQL dataflow (when I join tables with the "S3 File"). For simplicity, let's assume I can see the "start date" for each row. If the start date is more than 30 days before the current date, I want to delete that row. 
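In SQL terms, the relevance rule would look something like the sketch below. The dataset and column names (final_file, start_date) are assumptions for illustration only:

    -- Keep only rows whose start date falls within the last 30 days.
    SELECT *
    FROM   final_file
    WHERE  start_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY);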

 

One way proposed was to use the replace update method in Step 1 (for the "S3 File"). I would then take that file to Step 2, run the SQL dataflow, and weed out outdated rows (with a DELETE statement, or just a SELECT), storing only the relevant ones. The problem with this is that the data we get daily is ALWAYS relevant that day, and stays relevant for up to 30 days. So I cannot just weed out outdated rows when we receive the data. 

 

The only way, then, is to perform SQL operations on the big dataset and delete the outdated rows. The problem is that in a SQL dataflow, the output dataset cannot be the same as the input dataset (I tried naming them the same, but it doesn't help). And even if I create a new dataset with fewer rows, a big dataset still remains in the backend. 

 

I would love an answer on this from the Domo experts.  

 

 

Best Answer

  • zcameron (Contributor)
    Answer ✓

    hamza_123,

     

    The way to do what you're looking for is to include the output of the dataflow as an input to the same dataflow. This is referred to as a recursive dataflow. It takes the output of the previous run of the dataflow and brings it back in as an input. You can then append the new data to the recursive input, filter out the rows you don't want to keep, and write to the same recursive output. This will work in either SQL-based or Magic ETL dataflows. 
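    As a rough illustration, the SQL transform in such a recursive dataflow might look like this. The dataset and column names (final_file, s3_file_new, start_date) are placeholders, not the actual names in this setup:

        -- "final_file" is the previous run's output, re-added as an input
        -- to the same dataflow; "s3_file_new" is the latest daily load.
        SELECT *
        FROM  (
                SELECT * FROM final_file      -- recursive input (previous output)
                UNION ALL
                SELECT * FROM s3_file_new     -- newly appended rows
              ) combined
        WHERE combined.start_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY);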

     

    The key to making this work is that you have to run the dataflow once so that the output exists in the data center. Then go back, edit the dataflow, and add that output dataset as an input.

     

    I hope that helps!
