Beast Mode "Share Calculation on DataSet" vs. Calculating in Data Flow

Question

Hi all,

Are there performance advantages to using the Beast Mode "Share Calculation on DataSet" to create a metric for a DataSet compared to putting logic/formula in a DataFlow to have the metric added as a physical column on the DataSet?

Curious to learn others' thoughts/approaches on balancing when logic should be part of the underlying DataFlow versus handled by a shared Beast Mode.

Thanks,

Samir

jaeW_at_Onyx · Accepted Answer

Are there performance advantages to using the Beast Mode "Share Calculation on DataSet" to create a metric for a DataSet compared to putting logic/formula in a DataFlow to have the metric added as a physical column on the DataSet?

Yes...beast modes are evaluated / calculated at runtime.  For small datasets, the impact will be trivial or unnoticeable; however, as datasets get large (100+ million rows), certain types of calculations will take longer to evaluate.

If possible / reasonable, materialize transforms.  This will make managing beast modes easier.  That said, from a usability perspective, if you can surface the transform to business users in a beast mode, it makes metadata management slightly easier (assuming they can read basic SQL).

Also, keep in mind that calculated metrics like percents or ratios often cannot be implemented at the dataset level, because they MUST be calculated at runtime in order to return the 'right' answer.

If you have specific questions, let me know!

jaeW_at_Onyx · Accepted Answer

Sure @sdarba

I'll compare two major use cases.

1) adding dimension attributes to a dataset

2) adding metrics to a dataset

-- Dimension Attributes --

Some clients will build complex beast modes for categorizing data.  Consider:

CASE
  WHEN LOWER(`campaign name`) LIKE '%disney%' THEN 'Disney'
  WHEN LOWER(`campaign name`) LIKE '%universal%' THEN 'Universal'
  ...
  ELSE 'Campaign Not Matched'
END

UPSIDE

Beast modes like this are easy to manage / see because you just open the beast mode to understand why you're not getting the expected result.

DOWNSIDE

Imagine you have the same beast mode deployed to 15 datasets and you add a new campaign.  Now you need to update 15 beast modes.

From the technology side, imagine your dataset is hundreds of millions of rows.  The more data you have, the worse a transform with LIKE will perform.

RECOMMENDATION

If it's reasonable, materialize the transform and make it part of the dataset (use a lookup table).
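To make the lookup-table idea concrete, here is a minimal Python sketch of the logic a materialized transform would apply. It is not Domo code: in a real DataFlow this would be a join against a lookup DataSet. The names CAMPAIGN_LOOKUP, categorize and the "campaign category" column are made up for illustration.

```python
# Hypothetical lookup table: keyword -> category. In practice this would be
# its own DataSet joined in the DataFlow, so there is one place to edit
# when a new campaign is added.
CAMPAIGN_LOOKUP = [
    ("disney", "Disney"),
    ("universal", "Universal"),
]


def categorize(campaign_name, lookup=CAMPAIGN_LOOKUP, default="Campaign Not Matched"):
    """Return the category of the first keyword found in the name (case-insensitive)."""
    name = campaign_name.lower()
    for keyword, category in lookup:
        if keyword in name:
            return category
    return default


# Made-up sample rows standing in for the DataSet.
rows = [
    {"campaign name": "Disney Summer 2020"},
    {"campaign name": "UNIVERSAL - Spring"},
    {"campaign name": "Generic Brand"},
]

# Materialize the category once, as a physical column on each row.
for row in rows:
    row["campaign category"] = categorize(row["campaign name"])

print([row["campaign category"] for row in rows])
# ['Disney', 'Universal', 'Campaign Not Matched']
```

The point of the design is that adding a campaign means adding one row to the lookup, rather than editing the same CASE statement on every dataset.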

-- Metrics --

Consider the example of profit margin percent: (sum(amount) - sum(cost)) / sum(cost)

Suppose you calculate profit margin percent on each row of your data (.02, .03, .07, etc.).  If you were to add up that profit margin percent column, eventually the total would exceed 100%, which is not a meaningful profit margin.  The row-level percents cannot be summed to get the overall percent, so it is inappropriate to materialize this metric in the dataset.
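A quick worked example of why the ratio has to be computed at query time, sketched in Python with made-up numbers. It reads the formula above as (sum(amount) - sum(cost)) / sum(cost); the function name margin and the sample rows are just for illustration.

```python
# Made-up sample rows; only the column names amount/cost come from the post.
rows = [
    {"amount": 100.0, "cost": 80.0},
    {"amount": 200.0, "cost": 100.0},
    {"amount": 50.0, "cost": 40.0},
]


def margin(amount, cost):
    """Ratio as written in the post: (amount - cost) / cost."""
    return (amount - cost) / cost


# What a materialized per-row column would hold.
row_level = [margin(r["amount"], r["cost"]) for r in rows]

# Summing (or averaging) the materialized column.
sum_of_row_ratios = sum(row_level)
avg_of_row_ratios = sum_of_row_ratios / len(row_level)

# What a beast mode computes at runtime: aggregate first, then divide.
total_amount = sum(r["amount"] for r in rows)
total_cost = sum(r["cost"] for r in rows)
ratio_of_sums = margin(total_amount, total_cost)

print(row_level)                    # [0.25, 1.0, 0.25]
print(sum_of_row_ratios)            # 1.5
print(avg_of_row_ratios)            # 0.5
print(round(ratio_of_sums, 4))      # 0.5909
```

Neither the sum nor the average of the row-level column matches the aggregate ratio, and the correct value changes with every filter or grouping on the card, which is why this kind of metric belongs in a beast mode.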

THAT SAID

Consider the example of profit (sales - cost).  You COULD materialize that calculation because you CAN sum profit and get a sensible result.
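By contrast, a short Python sketch (made-up rows, hypothetical "region" column) of why profit is safe to materialize: summing the per-row column gives the same answer as aggregating sales and cost first, whatever grouping is applied.

```python
from collections import defaultdict

# Made-up sample rows; "region" is a hypothetical grouping column.
rows = [
    {"region": "East", "sales": 100.0, "cost": 80.0},
    {"region": "East", "sales": 200.0, "cost": 100.0},
    {"region": "West", "sales": 50.0, "cost": 40.0},
]

# Materialize profit as a physical per-row column.
for r in rows:
    r["profit"] = r["sales"] - r["cost"]

profit_by_region = defaultdict(float)  # sum of the materialized column
sales_by_region = defaultdict(float)
cost_by_region = defaultdict(float)
for r in rows:
    profit_by_region[r["region"]] += r["profit"]
    sales_by_region[r["region"]] += r["sales"]
    cost_by_region[r["region"]] += r["cost"]

for region in profit_by_region:
    # Sum of row-level profit equals sum(sales) - sum(cost) for each group.
    runtime_profit = sales_by_region[region] - cost_by_region[region]
    print(region, profit_by_region[region], runtime_profit)
# East 120.0 120.0
# West 10.0 10.0
```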

THAT SAID, this type of basic math is something that Adrenaline will be good at even into the hundreds of millions of rows, so it makes more sense to show the math to the users (in a beast mode) where they can understand the metadata (i.e. how profit margin is calculated).

Hope that helps!