Clicks vs Sales Conversion Calculation - Modeling large datasets

New to Domo and seeking best practices or options for modeling. I have two tables, Clicks and Sales. We look at clicks for a given day divided by sales for a given day to measure conversion. The Clicks table is very large, approximately 558 GB uncompressed, and we update it by appending. The Sales table is smaller, around 9 GB, and we update it by overwriting.

 

What's the best way to model this data? I'm assuming we can't use Data Fusion because of the volume of data. We also want to avoid creating a flattened view of Clicks + Sales prior to ingestion; since Domo is a BI tool, we expect it can handle the calculation.
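
For context, the calculation we're after is roughly the following (the table and column names here are illustrative, not our actual schema):

    -- Daily conversion: clicks per day divided by sales per day.
    -- Table/column names (clicks, sales, click_date, sale_date) are placeholders.
    SELECT c.click_date                                   AS report_date,
           c.click_count,
           s.sale_count,
           1.0 * c.click_count / NULLIF(s.sale_count, 0)  AS conversion
    FROM  (SELECT click_date, COUNT(*) AS click_count
           FROM clicks
           GROUP BY click_date) c
    JOIN  (SELECT sale_date, COUNT(*) AS sale_count
           FROM sales
           GROUP BY sale_date) s
      ON  s.sale_date = c.click_date;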

 

Open to ideas/feedback - Thanks!

Comments

  • RGranada (Contributor)

    Hi,

     

    You can't use data from two different datasets in one Domo card; you have to create one dataset with the data already related (a DataFlow: either ETL, MySQL, or Redshift).

     

    With the kind of sizes you are reporting, I would always update by append or upsert in order to keep the data refresh interval to a minimum.

     

    In your case, I would definitely use Redshift because of the sizes involved. I have a lot of DataFlows with datasets even bigger than what you stated, with no problems.
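
    As a rough sketch of what a single Redshift transform could look like for your case (all table and column names below are placeholders, not your schema): pre-aggregate each input and output one daily dataset that the card reads directly. If the card only needs a trailing window (say, the last 90 days), filtering the large Clicks input up front also keeps the transform fast:

        -- Sketch of a Redshift DataFlow transform (all names are placeholders).
        -- Output: one row per day with clicks, sales, and conversion, ready for a card.
        SELECT c.click_date                                   AS report_date,
               c.click_count,
               s.sale_count,
               1.0 * c.click_count / NULLIF(s.sale_count, 0)  AS conversion
        FROM  (SELECT click_date, COUNT(*) AS click_count
               FROM clicks_raw
               -- optional: if the card only needs a trailing window, filter the large input up front
               WHERE click_date >= DATEADD(day, -90, CURRENT_DATE)
               GROUP BY click_date) c
        LEFT JOIN
              (SELECT sale_date, COUNT(*) AS sale_count
               FROM sales_raw
               GROUP BY sale_date) s
          ON  s.sale_date = c.click_date;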

     

    Hope this helps to answer your questions. Don't hesitate to ask if you need any more insight on this.

    Ricardo Granada 

    MajorDomo@Lusiaves

  • Thanks for the feedback. I understand the datasets have to be merged before exposing them via Analyzer/cards. I've been given varying information regarding data volume and Data Fusion vs. DataFlows. What's the best approach for modeling large data volumes? Also, how are you utilizing Redshift - can you provide more specifics?

  • RGranada (Contributor)

    I had the chance to try out various approaches to modeling and uploading large volumes of data in Domo when we started our project in 2015.

     

    These days I usually choose between two data-loading approaches, depending on where the data is headed, which type of user will consume it, and the level of reusability I'm aiming for with that particular data.

     

    If the dataset contains, say, sales information that will be used directly by a large number of users and needs "almost" real-time accuracy, I try to get most of the ETL done before Domo (it can be done by other tools, or even at the Workbench level), with the goal of getting into Domo the widest possible dataset, containing all the relevant data in each row. A dataset built this way tends to act as the source for a very large number of cards, sometimes coming close to fulfilling the needs of a whole department. I have also found that end users appreciate this kind of dataset, as it lets them do something close to data mining using nothing more than the Domo Analyzer, either by building new cards or by modifying copies of cards built by our BI team (with no ETL or SQL knowledge).
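
    Purely as an illustration for the clicks/sales case (the extra dimensions here, like channel and campaign, are invented for the example), the "wide" version would be a single daily dataset that already carries every dimension users might want to slice on, so most card-level questions can be answered from it alone:

        -- Illustrative wide output; dimension names (channel, campaign, country) are invented.
        SELECT report_date,
               channel,
               campaign,
               country,
               SUM(click_count)                                     AS clicks,
               SUM(sale_count)                                      AS sales,
               1.0 * SUM(click_count) / NULLIF(SUM(sale_count), 0)  AS conversion
        FROM   daily_click_sales_staging   -- joined/prepared upstream of Domo or in Workbench
        GROUP  BY report_date, channel, campaign, country;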

     

    In the other case, the data tends to come from a large set of heterogeneous systems as narrow datasets, represents more "core" areas of information, has a great number of dependencies on other datasets, and will be processed exclusively by our BI team and presented to users in a ready-made fashion. Here we upload it as narrow datasets, each with a high level of reusability, and then merge, join, and even apply some business rules inside Domo through several levels of Redshift DataFlows, where each level can produce datasets that are used directly in cards or processed further by other DataFlows. Once the resulting datasets reach a decent level of maturity, their reusability also tends to become very high. This is my personal favourite approach because, as more business areas get their data modeled in this fashion, reusability keeps growing and time to production drops noticeably when addressing multiple departments and user needs, especially for niche KPIs.
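
    A simplified illustration of the layered idea (again, all names are invented): the first level of transforms produces narrow, reusable daily aggregates, and a second level joins them into the conversion dataset; the output of either level can feed cards or further DataFlows:

        -- Level 1: narrow, reusable daily aggregates (one transform per source).
        CREATE TABLE clicks_daily AS
        SELECT click_date, COUNT(*) AS click_count
        FROM clicks_raw
        GROUP BY click_date;

        CREATE TABLE sales_daily AS
        SELECT sale_date, COUNT(*) AS sale_count
        FROM sales_raw
        GROUP BY sale_date;

        -- Level 2: the business-facing dataset, built from the level-1 outputs.
        CREATE TABLE conversion_daily AS
        SELECT c.click_date                                   AS report_date,
               c.click_count,
               s.sale_count,
               1.0 * c.click_count / NULLIF(s.sale_count, 0)  AS conversion
        FROM clicks_daily c
        LEFT JOIN sales_daily s
          ON s.sale_date = c.click_date;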

     

    In my experience, when handling large amounts of data, Redshift DataFlows are far more performant than their MySQL counterparts, and personally I find Redshift's SQL dialect much more feature-rich than MySQL's.
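
    For instance (illustrative only, building on the conversion_daily output sketched above), window functions in Redshift's dialect make something like a rolling 7-day conversion a small addition on top of the daily dataset:

        -- Rolling 7-day conversion over the daily output (names are placeholders).
        SELECT report_date,
               1.0 * SUM(click_count) OVER (ORDER BY report_date
                                            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
                   / NULLIF(SUM(sale_count) OVER (ORDER BY report_date
                                            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), 0)
                   AS conversion_7d
        FROM conversion_daily;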

     

    Although Domo is a Business Intelligence tool, a good data modeling and upload strategy can let it be used not only as a BI tool but, in some cases, as something close to a real-time or near-real-time reporting system; in some cases we get data into our cards within 10 to 15 minutes of its creation in the various business systems (some of which are legacy systems that need their own ETL processes to get data into a format Domo can read).

     

    In summary, when processing large amounts of data, whatever the modeling approach, avoid the other ETL options in Domo and stick to Redshift DataFlows.

     

    This is my own personal take on the subject; I'm certain there are other valid views on it.

     

    I hope I was able to convey this clearly enough to answer your questions. If you still have doubts, feel free to contact me and I will try to help.

     

    Regards

     

    Ricardo Granada 

    MajorDomo@Lusiaves
