Cleaning data off an appending dataset

Hi Dojo,

We have several datasets that are appending a decent amount of data each day, and we do not have a need to keep data from as far back as our datasets are holding. Does anyone know of a solution that makes connectors work with a "moving window" to only hold relevant data?

ETL solutions work well for our datasets that use them for recursive update/append, but we also have connector datasets that simply append data onto themselves. There is the option to delete rows individually, but I do not see any good way to clear off chunks of data using any filtering logic.

Does anyone know of a good solution, outside of moving data storage to an ETL output?

Thanks,

Andrew

Answers

  • You could use the Java CLI tool to do this, but it would likely be a manual approach. I know you want to avoid creating an ETL, but a simple Magic ETL with a filter tile that compares the date field in your dataset against the current date would be an easy and dynamic way to filter out the older data (a rough sketch of that filter logic follows this post).

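    As a rough illustration of the filter-tile logic above (not Magic ETL itself, just the equivalent rule expressed in Python/pandas): the date column name "record_date" and the 30-day window are assumptions to swap for your own.

    ```python
    # Rough sketch of the "moving window" rule a Magic ETL filter tile would apply.
    # Assumptions: the appended data is available as a pandas DataFrame and has a
    # date column named "record_date"; the 30-day retention window is illustrative.
    import pandas as pd

    RETENTION_DAYS = 30  # keep only the most recent 30 days of appended rows

    def trim_to_window(df: pd.DataFrame, date_col: str = "record_date",
                       days: int = RETENTION_DAYS) -> pd.DataFrame:
        """Return only the rows whose date falls inside the rolling window."""
        cutoff = pd.Timestamp.today().normalize() - pd.Timedelta(days=days)
        dates = pd.to_datetime(df[date_col], errors="coerce")
        return df[dates >= cutoff]

    if __name__ == "__main__":
        sample = pd.DataFrame({
            "record_date": ["2023-01-01", pd.Timestamp.today().strftime("%Y-%m-%d")],
            "value": [1, 2],
        })
        print(trim_to_window(sample))  # only today's row survives the window
    ```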
  • Hello @user045907, is it possible to replace these datasets instead of appending?

    Depending on the window you are looking to populate and how you are using it, pulling only the last x days of data relative to today is an option in many connectors, but not all.

    Which connectors are you using, and do they provide a switch for append/replace?

  • user045907 (edited May 2023)

    The datasets I want to clean up are all powered through the Advanced CSV Connector, with data appending daily.

    To the point @MarkSnodgrass made, I have used this idea for several cases. I would more readily take the ETL approach if I could single out specific updates to preview and revert, or if I could roll back the entire dataset within so many days. Recursive ETL logic typically works well for us, until we have to investigate specific updates.

    The solution posed by @JasonAltenburg would work well with what we currently have, but I do not see that option anywhere in the configuration for the Advanced CSV Connector. I also do not see a good way to have the connector run through multiple files during an upload, but that is something I am looking for. We could have a moving window that holds the past 30 uploads (a month's worth) to feed into a recursive ETL for historical needs. Is there a trick to getting this connector to work in that way? I can think of some Python-based solutions to run through Workbench (sketched after this post), but that is not preferred.

    Thanks for the replies!
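    For the Workbench-style Python idea mentioned above, here is a minimal sketch of keeping only the last 30 uploaded CSV files and combining them into one file before upload. The folder path, file naming, and the final upload step are hypothetical placeholders; this only shows the moving-window bookkeeping.

    ```python
    # Rough sketch of a "last 30 uploads" moving window kept on disk, e.g. as a
    # pre-processing step before a Workbench job runs. UPLOAD_DIR and the output
    # file name are hypothetical; pushing the combined file to Domo is not shown.
    from pathlib import Path
    import pandas as pd

    UPLOAD_DIR = Path("daily_uploads")  # hypothetical folder of daily CSV extracts
    KEEP_LAST_N = 30                    # moving window: the 30 most recent uploads

    def build_window_file(out_path: Path = Path("window.csv")) -> Path:
        # Sort by modification time and keep only the newest KEEP_LAST_N files.
        files = sorted(UPLOAD_DIR.glob("*.csv"), key=lambda p: p.stat().st_mtime)
        recent = files[-KEEP_LAST_N:]
        if not recent:
            raise FileNotFoundError(f"no CSV uploads found in {UPLOAD_DIR}")

        # Concatenate the surviving uploads into one file to replace the dataset.
        combined = pd.concat((pd.read_csv(f) for f in recent), ignore_index=True)
        combined.to_csv(out_path, index=False)

        # Prune files that have aged out of the window.
        for stale in files[:-KEEP_LAST_N]:
            stale.unlink()
        return out_path
    ```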

  • @user045907 -

    Have you thought about using a Magic ETL dataflow that takes your dataset as an input, filters out the historical rows, and outputs to another dataset on a daily schedule? Once that's in place, you may be able to extract the dataflow definition with the Java CLI, change the ID of the output dataset in the definition to be your input dataset, and then upload the new definition to your dataset.


    This isn't really officially supported, but it may let you remove the historical data from your dataset on a schedule (a rough sketch of the definition-editing step follows this post).

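    A minimal sketch of the definition-editing step described above, assuming the Java CLI has already exported the dataflow definition to a local JSON file. The file names, the placeholder dataset IDs, and the assumption that dataset IDs appear as plain string values in the JSON are all illustrative; the exact CLI export/import commands are not shown, and this should be tried on a throwaway dataset first.

    ```python
    # Rough sketch: swap the output dataset ID for the input dataset ID everywhere
    # it appears in an exported dataflow definition. File names and IDs below are
    # hypothetical placeholders.
    import json

    OLD_OUTPUT_ID = "00000000-output-dataset-id"    # hypothetical placeholder
    INPUT_DATASET_ID = "11111111-input-dataset-id"  # hypothetical placeholder

    def swap_ids(node):
        """Recursively replace the old output dataset ID wherever it appears."""
        if isinstance(node, dict):
            return {k: swap_ids(v) for k, v in node.items()}
        if isinstance(node, list):
            return [swap_ids(v) for v in node]
        return INPUT_DATASET_ID if node == OLD_OUTPUT_ID else node

    with open("dataflow.json") as f:                # exported definition
        definition = json.load(f)

    with open("dataflow_modified.json", "w") as f:  # edited copy to upload back
        json.dump(swap_ids(definition), f, indent=2)
    ```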
  • user045907 (edited May 2023)

    @GrantSmith I have not thought of using the CLI tool in that way, but it makes me want to try it with a test dataset. That might be what @MarkSnodgrass was mentioning. I would be curious to see what this action does to the history tracking in the connector dataset.

  • @user045907 - Have you considered using dataset views to filter the data to only the rows you want to keep?

