I have a dataset with financial data, and it has one row per month per project. However, it also pulls a Run Cost from an attribute, and this attribute is displayed with the exact same value on every row, because it's not captured as a monthly figure; it's just a total.
I need to SUM the Run Cost for matching criteria, but I need to ignore these duplicate rows and count it for each ProjectId only once.
I previously thought I could do this using MAX instead of SUM, because for each ProjectID the MAX would give me what I need. However, my code is giving me the MAX for the entire dataset column where the criteria match, and not the SUM/MAX for each Project ID.
Here's a mockup of a portion of my dataset:
BatchTimeStamp | ProjectId | First Year of Impact | Yearly Total Run Costs (CHF) |
---|
2023-11-22 | ID1 | 2024 | 700 |
2023-11-22 | ID1 | 2024 | 700 |
2023-11-22 | ID1 | 2024 | 700 |
2023-11-22 | ID2 | 2024 | 300 |
2023-11-22 | ID2 | 2024 | 300 |
2023-11-22 | ID2 | 2024 | 300 |
Long story short, the result I'm looking for here would be 1000 (700+300)
This is my not working BM at the moment
MAX
(CASE
WHEN DATE(`BatchTimestamp`) = DATE('2023-11-22')
AND (`First Year of Impact_p` = '2024' OR `First Year of Impact_p` IS NULL)
THEN `Yearly Total RUN Costs (CHF)_p`
ELSE 0
END)
This is part of an ETL, and I can't change the input dataset. I do need the monthly rows for another part of the ETL, so I can't remove the duplicate rows either.
Is there a solution someone could provide?
Thanks!