Can't get Group & Join to Work

PJG
PJG Member
edited August 2024 in Magic ETL

I have a singular dataset that includes Live dynamic data, and historical fixed data. For each ProjectId, I need to SUM the PlannedCost of the historical data with a specific timestamp, and SUM the PlannedCost of Live data, so that I can compare the two in Analyzer.

Below is an example of the Input dataset for a single ProjectId, and how I'd like the Output dataset configured. In the Input dataset, the FinancialCategoryName is irrelevant to this, but I included it to show why there are multiple rows for each ProjectId

Group By

Join

What it's currently doing is it's only giving me the total for some of the Live column, and for none of the Nov 2023 column. Any ideas where I've gone wrong?

Thanks!

Best Answers

  • rco
    rco Domo Employee
    Answer ✓

    The issue is the comparison with that string, 'Nov 22, 2023 12:00:00 AM'. That comparison will first convert the left hand side to a string according to the Timestamp Format setting or the default timestamp format, which would be '2023-11-22T00:00:00'. Then it will perform the comparison on the strings, and they won't be equal (unless your timestamp format setting does match that format you used in the string constant, but I assume it doesn't).

    This behavior is unique to Magic ETL. Most other SQL implementations will either produce an error when you attempt to compare a timestamp with a string, or convert the string to a timestamp for the comparison rather than the timestamp to a string.

    I think what you probably want to do is compare these as Dates, which would look like this:

    DATE(BatchTimestamp) = DATE('2023-11-22')

    Randall Oveson <randall.oveson@domo.com>

  • rco
    rco Domo Employee
    Answer ✓

    What's happening there is pretty weird, I'll try to explain it:

    When you write a Group By formula, Magic ETL allows you to reference columns both inside an aggregate function (like SUM()) and outside one. When you reference a column inside an aggregate, that reference is evaluated for every row to produce the input to that function. But when you reference a column outside an aggregate function, it is only evaluated once for every group, for the first value. What you have there is really this:

    CASE WHEN FIRST_VALUE(RecordType) = 'Live' THEN SUM(PlannedCost) END

    Because the first value of RecordType is actually Historical, this just evaluates to null.

    What you really want is this:

    SUM(CASE WHEN RecordType = 'Live' THEN PlannedCost END)

    Randall Oveson <randall.oveson@domo.com>

  • rco
    rco Domo Employee
    Answer ✓

    Sure, add an ELSE 0 to the case statement.

    Randall Oveson <randall.oveson@domo.com>

Answers

  • I think we would have to see what you have in the group by. The group by paths are going to aggregate the left and right sides. Assuming what I show below in yellow is the Nov 2023 path, you would be summing the planned cost and grouping by the other fields. On the green path, you would be summing the same way. Your tiles would determine the fields.

    ** Was this post helpful? Click Agree or Like below. **
    ** Did this solve your problem? Accept it as a solution! **

  • PJG
    PJG Member

    Oops, sorry, I thought I had that in there. Added to my original post.

  • From what I can see, your output is already coming from your Group By path (left side of the join). Although I wouldn't put the condition the way you have it. I would compare the date only and knock off the timestamp. Your bottom branch does not appear to be doing anything.

    ** Was this post helpful? Click Agree or Like below. **
    ** Did this solve your problem? Accept it as a solution! **

  • PJG
    PJG Member

    I believe I need the timestamp as there are other RecordType = Historic with different BatchTimestamps. In this case, I only care about the Nov 22, 2023 12:00:00 AM one.

    Not following what you mean by bottom branch not doing anything… it's needed to add back in some extra columns that I need that were removed with the Group…. or is there a better way to do this?

    Thanks!

  • "Not following what you mean by bottom branch not doing anything…"

    Sorry, I don't see any additional columns in your screenshots. Therefore, I couldn't see it doing anything beneficial.

    ** Was this post helpful? Click Agree or Like below. **
    ** Did this solve your problem? Accept it as a solution! **

  • PJG
    PJG Member

    Ah, understood. Yes, my output screenshot is just a simplified version of what I need. Adding in my extra columns is working as intended.

    The issue is just that my output is only giving me the total for some of the Live column, and for none of the Nov 2023 column. Not sure if it's an issue with my Beast Mode, or if it's the join or something else.

  • rco
    rco Domo Employee
    Answer ✓

    The issue is the comparison with that string, 'Nov 22, 2023 12:00:00 AM'. That comparison will first convert the left hand side to a string according to the Timestamp Format setting or the default timestamp format, which would be '2023-11-22T00:00:00'. Then it will perform the comparison on the strings, and they won't be equal (unless your timestamp format setting does match that format you used in the string constant, but I assume it doesn't).

    This behavior is unique to Magic ETL. Most other SQL implementations will either produce an error when you attempt to compare a timestamp with a string, or convert the string to a timestamp for the comparison rather than the timestamp to a string.

    I think what you probably want to do is compare these as Dates, which would look like this:

    DATE(BatchTimestamp) = DATE('2023-11-22')

    Randall Oveson <randall.oveson@domo.com>

  • PJG
    PJG Member

    Hi @rco, thank you! Progress!

    The Nov 2023 Planned Cost column is now populated for all projects with a Historic timestamp of 2023-11-22, which it wasn't before! Thank you!

    However, it is not populating the Live Planned Cost column if the Nov 2023 Planned Cost column is already populated. I.e. only one of the two columns are populated. In the input dataset, many projects do have both Historical Nov 2023 PlannedCosts and Live PlannedCosts.

  • rco
    rco Domo Employee

    I think you're saying that the case statement matching in the first formula is affecting the second formula somehow. I'm not sure how that could happen; they should be independent of each other.

    However, I do notice that you don't have a SUM() function wrapping your formulas, and it seems like you would want one. Could that be the cause of unexpected results?

    Randall Oveson <randall.oveson@domo.com>

  • PJG
    PJG Member

    I did have SUM() initially, but I read elsewhere that it wasn't needed, and Group would SUM it without that, so took it out. However, tried adding it back in, and same output unfortunately.

    To troubleshoot, I tried temporarily changing the Group to only include the Live code:

    I then did Run Preview and checked the dataset on the Group By. It does indeed show no values for the entire Live Planned Cost column.

    It should have values there. For example first row from above, ProjectId 2751 has a Planned Cost of 50,000.00 per Input dataset:

  • rco
    rco Domo Employee
    Answer ✓

    What's happening there is pretty weird, I'll try to explain it:

    When you write a Group By formula, Magic ETL allows you to reference columns both inside an aggregate function (like SUM()) and outside one. When you reference a column inside an aggregate, that reference is evaluated for every row to produce the input to that function. But when you reference a column outside an aggregate function, it is only evaluated once for every group, for the first value. What you have there is really this:

    CASE WHEN FIRST_VALUE(RecordType) = 'Live' THEN SUM(PlannedCost) END

    Because the first value of RecordType is actually Historical, this just evaluates to null.

    What you really want is this:

    SUM(CASE WHEN RecordType = 'Live' THEN PlannedCost END)

    Randall Oveson <randall.oveson@domo.com>

  • PJG
    PJG Member

    Randall, thank you so much for all your help on this. Understood, and perfectly explained. I realized I needed to wrap my other code like that too, and it is now working perfectly! Any way to force blanks to appear as 0 in the output dataset?

  • rco
    rco Domo Employee
    Answer ✓

    Sure, add an ELSE 0 to the case statement.

    Randall Oveson <randall.oveson@domo.com>

  • PJG
    PJG Member

    Beautiful, thank you so much again!

  • PJG
    PJG Member

    Hi Randall, please let me trouble you for one more question on this.

    I have another column, Yearly Total RUN Costs (CHF)_p. It's the same for every row, so I don't want to SUM it, but I do want it displayed in the output dataset, and I again need to distinguish between Live and Historic/ 2023-11-22.

    Because I'm not using a SUM to wrap it around, how do I avoid it looking only at the first row? I thought I could use MAX() since all values are the same per ProjectId, but within Add Formula, this does not seem to be possible:

  • rco
    rco Domo Employee

    To be clear, you just want the maximum (or only) value of "Yearly Total RUN Costs (CHF)_p" where BatchTimestamp is November 22nd 2023 to show up on every row of the dataset?

    You'll have to use a Group By instead of an Add Formula since it's an aggregate function. Depending on your Domo version, it may allow you to specify no grouping columns. If it forces you to specify a grouping column, you could add a constant zero and group by that. This will produce just one row with the desired maximum value. Then you join that row to your original dataset. Again, depending on your Domo version it may allow you to join without specifying any join keys, but if it requires join keys then simply add constant zeroes to both sides of the join and join on those.

    Randall Oveson <randall.oveson@domo.com>

  • PJG
    PJG Member
    edited August 2024

    How can I check my DOMO version? There's a few things I don't have access, as it's a vendor controlled layer over another system that we use, and I know some features are restricted (Buzz for instance).

    Yes, for "Yearly Total RUN Costs (CHF)_p", I just need to show the only value per project for Live and only value per project where BatchTimestamp is November 22nd 2023.

    Could I please trouble you to elaborate on how to set this up "If it forces you to specify a grouping column, you could add a constant zero and group by that"

  • rco
    rco Domo Employee

    I'm referring to the first configuration step of the Group By, which in my Domo version looks like this:

    Here you really want to put nothing, since you just want to do the MAX() across the entire dataset. But it may not let you put nothing, so we put a constant value 0 instead. i.e. add an Add Constants tile before the Group By tile that adds a constant named "const" with value 0 and reference that here in the Group By tile. Then, put your MAX() formula in for the second step of the Group By configuration, and finally do they join to the original source.

    With the join, there's a very similar problem. The join requires you to configure these join keys in step 2:

    But since you want to join your one Group By output row with every row on the other side of the join, you really want to put nothing at all, to do a "Cross Join". If it doesn't let you do that (it probably won't unless you're enrolled in a very recent beta), you have to add a constant 0 as above to both sides of the join, and join on that. You can drop the constant immediately from both sides in the same join tile.

    Randall Oveson <randall.oveson@domo.com>

  • PJG
    PJG Member

    Maybe I was making this too complicated, or explained it poorly, but I tried just adding this to the existing Group by before getting to your reply, and this looks to be working.

  • rco
    rco Domo Employee

    This will give you the maximum/only matching value within each ProjectId. The other solution was for the max/only across the whole dataset. If the value you want is within each unique ProjectId then this is your solution.

    Randall Oveson <randall.oveson@domo.com>

  • PJG
    PJG Member

    Hi yes, per ProjectId is what I'm looking for, and again, sorry if I explained that poorly! Each ProjectId could appear in the dataset dozens of times, but each ProjectId will have the same Yearly Total RUN Costs for Live and the same Yearly Total RUN Costs for Nov 2023, so I only needed to identify a value for each ProjectId and make a Nov 2023 column and a Live column

    Everything you've helped me with today is enough knowledge gained to build more ETL, more datasets, and many more cards. Thank you!

    And I appreciate you replying to my other topic also; I will try to tackle that tomorrow, but I will no doubt have more questions, as there are many things there I haven't used before; only started using ETL this month! :)