Creating a marker and deduping in ETL

Hey guys,

Hope you all are doing well. I have an issue with data transformation in the ETL. It is best described on the screenshot below. I have multiple entries for duplicate IDs (one row for each product) and want to mark all entries for a given ID that contain Product 'c' as 1. That's the formula I used below to visualize. My next step after creating that is deduping based on the ID so that as a final product I have 1 entry for each ID and a 'Marker' indicating whether initially there was a Product 'c' for that ID. Is that possible to recreate in the ETL?

Thanks!

Best Answers

  • ArborRose
    ArborRose Coach
    Answer βœ“

    using:

    CASE
    WHEN SUM(CASE WHEN `Product`='c' THEN 1 ELSE 0 END) OVER (PARTITION BY `ID`) > 0 THEN 1
    ELSE 0
    END

    ** Was this post helpful? Click Agree or Like below. **
    ** Did this solve your problem? Accept it as a solution! **

  • DavidChurchman
    Answer βœ“

    Yes, I would create a column with this case statement in a formula tile:

    CASE WHEN `Product`='c' then `Column with values you want to add` else 0 END

    Then do a group-by tile that sums that column by ID.

    Looking at @ArborRose's solution, it looks like you could do it all within the group-by tile using a partition formula for the aggregation, but those feel like dark-magic to me, so personally I think it would be worth breaking it into two simpler steps that I know I understand.

    Please πŸ’‘/πŸ’–/πŸ‘/😊 this post if you read it and found it helpful.

    Please accept the answer if it solved your problem.

Answers

  • Yes, you could use a Group-by tile, select ID as your group. If your data is structured like this, where ID 1 is all 1s for the Marker column, then you could use "Average" as your aggregating function.

    On the other hand, the way I'd approach this is I would create the marker column with a simple CASE WHEN Product = 'c' then 1 else 0 end. Then you could aggregate with a sum and get the same result.

    Please πŸ’‘/πŸ’–/πŸ‘/😊 this post if you read it and found it helpful.

    Please accept the answer if it solved your problem.

  • @DavidChurchman Thank you, exactly what I was looking for. I'm thinking if it is possible to take it a step further, my dataset also includes quantities for those products. Please see the screenshot, I added more products for order ID 4. My dataset can include multiple entries for the same product for the same ID. Looking at ID 4. Is there a way to sum up both entries for product C, so that it is combined together and shown as 3?

  • ArborRose
    ArborRose Coach
    Answer βœ“

    using:

    CASE
    WHEN SUM(CASE WHEN `Product`='c' THEN 1 ELSE 0 END) OVER (PARTITION BY `ID`) > 0 THEN 1
    ELSE 0
    END

    ** Was this post helpful? Click Agree or Like below. **
    ** Did this solve your problem? Accept it as a solution! **

  • DavidChurchman
    Answer βœ“

    Yes, I would create a column with this case statement in a formula tile:

    CASE WHEN `Product`='c' then `Column with values you want to add` else 0 END

    Then do a group-by tile that sums that column by ID.

    Looking at @ArborRose's solution, it looks like you could do it all within the group-by tile using a partition formula for the aggregation, but those feel like dark-magic to me, so personally I think it would be worth breaking it into two simpler steps that I know I understand.

    Please πŸ’‘/πŸ’–/πŸ‘/😊 this post if you read it and found it helpful.

    Please accept the answer if it solved your problem.