Finding the sum of values based on a common column, excluding duplicates

I have a dataset with rows with IDs, start dates, end dates, and interval lengths. I want to find the sum of all the intervals for each ID, excluding duplicate intervals. There can be 2 rows with the same start and end dates so they can't be counted twice when summing (but I also cannot remove the duplicates). For example (the sum column is what I need to calculate):

ID   Start   End   Length   Sum
1    1/1     1/5   5        10
1    1/1     1/5   5        10
1    1/10    1/15  5        10
1    1/10    1/15  5        10
2    2/1     2/10  10       15
2    2/3     2/9   7        15
2    2/20    2/25  5        15
2    2/20    2/25  5        15


The sum for ID 1 is 10 because it's only counting each date interval once.

The sum for ID 2 is 15 because even though there are 3 unique intervals, the one from 2/3 to 2/9 falls inside 2/1 to 2/10 so it shouldn't be counted.

I would like to do this in ETL but if it's possible to do in Analyzer I would go that route.

Best Answer

  • GrantSmith
    GrantSmith Coach
    Answer ✓

    You’d need to incorporate a calendar dimension table in an ETL. However because magic ETL doesn’t support conditional joins it gets a little messy. You would need to add a constant to both a Calendar data set and your actual data set and Collett join column and set it to a value of one. Then you would join your calendar Dataset to your actual data set based on just the join column. After the join then you would filter your data such that the calendar date is between your start date and end date. Then you would run a group by based off of your ID and count distinct the different dates . After that you can then join this outfit to your original Dataset based on the id field to get your sum.


    alternatively you could go with the MySQL dataflow to do the conditional join with the same calendar Dataset and your actual data set and then also doing a group by on your ID and then doing a count distinct on the dates. It is a little simpler and easier to understand but magic ETL tends to run much faster than MySQL

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**

Answers

  • GrantSmith
    GrantSmith Coach
    Answer ✓

    You’d need to incorporate a calendar dimension table in an ETL. However because magic ETL doesn’t support conditional joins it gets a little messy. You would need to add a constant to both a Calendar data set and your actual data set and Collett join column and set it to a value of one. Then you would join your calendar Dataset to your actual data set based on just the join column. After the join then you would filter your data such that the calendar date is between your start date and end date. Then you would run a group by based off of your ID and count distinct the different dates . After that you can then join this outfit to your original Dataset based on the id field to get your sum.


    alternatively you could go with the MySQL dataflow to do the conditional join with the same calendar Dataset and your actual data set and then also doing a group by on your ID and then doing a count distinct on the dates. It is a little simpler and easier to understand but magic ETL tends to run much faster than MySQL

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**
  • @GrantSmith Hi, thanks for the in depth answer. Could you elaborate on what you mean by a Calendar data set? Also maybe could you provide a pseudo ETL to help me visualize the dataflow?

  • @GrantSmith Ok, this has been great help and I understand the logic. I did all the steps but I'm hung up on the end, how do I find the sum?

  • A calendar dataset is one that lists all of the dates as sort of a dimension table.

    The sum is the count of dates you're getting from the group by when grouping on your ID field.

    You just need to join your dataset with the group by tile output based on the ID field to tie the sum of days / count back to your original dataset.

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**
  • @GrantSmith I got it now. Thanks for the concise explanation!