Best strategy for appending data from multiple sites to one DataSet?

DataSquirrel
DataSquirrel Contributor

Background
I posted a question earlier that had way too many parts, so I'm breaking out individual questions. Hopefully, these are easier to answer now and easier for other people to discover later. I can figure much of this out by testing myself, but am hoping others already know and can answer without much effort.

 

Question

Currently, we're storing a lot of custom data in a central server. To get it into Domo, we summarize it, generate spreadsheets, and feed them in via Workbench. It works really well, partly because we've been able to do full DataSet replacements rather than "appends." As we accumulate more data, the process is slowing down. Also, as we build out, we'll have multiple servers, not a single consolidated server. This is the background for what we're looking at doing next.

 

We've got our DataSets, page filters, and PDP rules set up to allow us to reuse pages and cards to show very different slices of data. Our DataSets use consistent column names, which makes it easy to write straightforward PDP rules that make Page Filters super nice. Rather than rework cards or pages, we've got consolidated DataSets + filters for custom views for each site/department/user. What we're hoping to do is continue along these lines, but also have multiple sites appending data to a single DataSet.

 

Is this possible, or does each site need a distinct DataSet that we then combine on the back-end? If we do need to combine DataSets, should I look at a DataFlow or a DataFusion? The columns will be identical for all DataSets. (It's been a few months since I tried out DataFlows and I've never tried DataFusion.)

 

Hopefully, we won't need to clutter up our system with lots of little DataSets + DataFlow/Fusion and can append directly to the master DataSet. I'm sure others have faced exactly the situation we're in, and I would be grateful for advice. I've had a look at several options for how to handle multiple sites pushing into Domo:

 

  • Workbench + append.
  • Centralizing everything in Postgres and pulling from there into Domo directly, without Workbench.
  • ODBC calls from Workbench.
  • Using the DataSet API to perform custom appends.
  • Using the Stream API to perform custom appends.
  • What have I missed?

Of these options, the DataSet API seems like the most straightforward. I've used a bunch of the "read" APIs in the past to pull down DataSet details and the like for auditing and review, and they were fine to work with. I just tried a little "hello world" code and was able to create a DataSet easily enough. Below are my initial observations on each option I've found, using a nightly update file on each site as a sample scenario.
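
For reference, that "hello world" boils down to roughly the following. It's a minimal Python sketch against the public DataSet API; the credentials, DataSet name, and columns are placeholders, and the endpoint paths are the ones I understand from the developer docs, so double-check them before relying on this.

import requests

# Placeholders: an OAuth client created under developer.domo.com with the "data" scope.
CLIENT_ID = "your-client-id"
CLIENT_SECRET = "your-client-secret"

# 1. Exchange client credentials for an access token.
token_resp = requests.post(
    "https://api.domo.com/oauth/token",
    params={"grant_type": "client_credentials", "scope": "data"},
    auth=(CLIENT_ID, CLIENT_SECRET),
)
access_token = token_resp.json()["access_token"]
headers = {"Authorization": f"Bearer {access_token}"}

# 2. Create an empty DataSet with the shared column layout every site will use.
schema = {
    "name": "site_nightly_rollup",  # hypothetical name
    "description": "Nightly summary rows from all sites",
    "schema": {
        "columns": [
            {"type": "STRING", "name": "site"},
            {"type": "DATE", "name": "summary_date"},
            {"type": "DOUBLE", "name": "value"},
        ]
    },
}
resp = requests.post("https://api.domo.com/v1/datasets", json=schema, headers=headers)
print("Created DataSet", resp.json()["id"])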

 

Workbench + append
Problematic. Workbench looks for a specific, exact file path and does not manage a watch folder. Therefore, if Workbench fails to process a nightly job, that night's data is lost. (It gets overwritten by our code the next night.) I can't see a sensible way to automate detecting when our pushes have gaps. Also, the forums have many mentions of bugs/problems with APPEND from Workbench. These may no longer be active bugs; I don't know.

 

Centralizing everything in Postgres and pulling from there into Domo directly, without Workbench.
Love Postgres, don't really want to do this. Also, the Domo PostgreSQL Connector is...not super efficient. I've watched the raw SQL coming through, and it generated ~25,000 log lines to pull 10 simple rows. So, I'd rather not lean too hard on the Domo PostgreSQL connector. (It looks to be using some kind of Java framework that relies on cursors and excessively iterative code.)

 

ODBC calls from Workbench
No. Just no. I'm totally alergic to ODBC.

 

Using the DataSet API to perform custom appends
Looks good!
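
If this pans out, the nightly job on each site would look something like the sketch below: push the day's CSV to the data-import endpoint with an append update method. This assumes the updateMethod=APPEND query parameter I believe the DataSet API docs describe; if that endpoint turns out to be replace-only, the Stream API below is the fallback. The DataSet ID, token, and file name are placeholders.

import requests

DATASET_ID = "replace-with-dataset-id"   # the consolidated DataSet created earlier
ACCESS_TOKEN = "replace-with-oauth-token"

# Read the nightly extract produced on this site.
with open("nightly_summary.csv", "rb") as f:  # hypothetical file name
    csv_body = f.read()

# Import the rows, appending rather than replacing.
# NOTE: updateMethod=APPEND is an assumption; verify it against the current docs
# before trusting it for production loads.
resp = requests.put(
    f"https://api.domo.com/v1/datasets/{DATASET_ID}/data",
    params={"updateMethod": "APPEND"},
    data=csv_body,
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Content-Type": "text/csv",
    },
)
resp.raise_for_status()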

 

Using the Stream API to perform custom appends
Looks good...but more complicated. We can do it this way if necessary.
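
For completeness, the Stream route would be roughly: create the Stream once with updateMethod set to APPEND, then each night open an execution, upload one or more CSV parts, and commit. A rough sketch, again with placeholder IDs and file names, and endpoint paths taken from the Stream API docs as I read them:

import requests

STREAM_ID = "replace-with-stream-id"     # a Stream created once with updateMethod=APPEND
ACCESS_TOKEN = "replace-with-oauth-token"
BASE = f"https://api.domo.com/v1/streams/{STREAM_ID}"
headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

# 1. Open an execution for tonight's upload.
execution = requests.post(f"{BASE}/executions", headers=headers).json()
exec_id = execution["id"]

# 2. Upload the nightly file as part 1 (bigger files can be split across many parts).
with open("nightly_summary.csv", "rb") as f:
    requests.put(
        f"{BASE}/executions/{exec_id}/part/1",
        data=f.read(),
        headers={**headers, "Content-Type": "text/csv"},
    )

# 3. Commit the execution. Only committed executions land in the DataSet,
# which is what makes retrying a failed night relatively safe.
requests.put(f"{BASE}/executions/{exec_id}/commit", headers=headers)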

 

Thanks very much for any help.

Comments

  • DataSquirrel
    DataSquirrel Contributor

    ODBC even makes me spell wrong. Like I said, I'm allergic to it. Smiley Wink

  • rado98
    rado98 Contributor

    I'll tell you how I do it, in case it applies or would work for you.

     

    I hate the idea of appending, as the thought of having to worry about data not being pushed through, and then spending hours trying to work out what happened plus fixing the gap, is just too much.

     

    What I do is create a historical and a current part of the DataSet. The historical would push data from up to, say, 31 Dec 2017, and the Current from 1 Jan 2018. The historical runs once a day, the Current once every hour. Both replace the data in DOMO, and then I append them using a DataFlow (which is just too easy).

    Every couple of months I change the dates.

     

    This way there is no chance of having a gap and the job that needs to run often is processed quite fast.

     

    Appending DataSets using a DataFlow once they're in DOMO is dead easy, and that is probably the way I would do it if I had multiple sites.

     

     

    I'm using WB and ODBC for this.

  • I am still very new to the tool, but this is how I envisage implementing this:

     

    Step 1 - In Domo, create the different DataSets for each site as 'Replace':

    - STAGE_SITE_ONE

    - STAGE_SITE_TWO

    - STAGE_SITE_THREE

    etc.

     

    Step 2 - Create an SQL Mapping with the three sites as inputs. In the Output, do an SQL union of the three datasets like this:
    SELECT COL_A, COL_B, COL_C FROM STAGE_SITE_ONE

    UNION SELECT COL_A, COL_B, COL_C FROM STAGE_SITE_TWO

    UNION SELECT COL_A, COL_B, COL_C FROM STAGE_SITE_THREE;

     

    Give the output mapping a new name like 'ALL_SITES_DATASET'

     

    Step 3 - Save and Run the SQL Mapping.

     

    Step 4 - You want to make sure that the next time you run the mapping, you will not replace the old data.

    So now edit the SQL Mapping again; since you ran the mapping once, you should now also be able to add 'ALL_SITES_DATASET' as an input dataset.

     

    Step 5 - Now once again add 'ALL_SITES_DATASET' to the union in the output dataset so it reads something like this:

     

    SELECT COL_A, COL_B, COL_C FROM STAGE_SITE_ONE

    UNION SELECT COL_A, COL_B, COL_C FROM STAGE_SITE_TWO

    UNION SELECT COL_A, COL_B, COL_C FROM STAGE_SITE_THREE

    UNION SELECT COL_A, COL_B, COL_C FROM ALL_SITES_DATASET;

    (Plain UNION removes exact duplicate rows, which is what stops each run from re-adding history that is already in ALL_SITES_DATASET; UNION ALL would keep duplicating it.)

     

     

    I am still testing this method out and I am not sure how it will perform with very large datasets.

    Please test it out and let me know if you find it good.

     

  • Jon
    Jon Member

    We have a scenario similar to this. I've chosen, for a variety of reasons, to let Domo do the work.

     

    My data (fortunately) has a "week number" embedded. So when I do the updates to historical data, I look for week numbers in my new dataset, and eliminate those from my historical, so that I never get a duplicate week in my history.

     

    [Our data is attendance data, and gets updated after-the-fact quite often, so we have to merge "what we thought was correct last week" with "now it's correct this week".]

     

    This keeps me from having to change my date ranges; I simply get the most recent 4 weeks of data every time, and it runs like a champ.

     

    This only works if you can, with 100% certainty, identify duplicate data.
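
    To make that de-duplication step concrete, here is the logic sketched in Python/pandas. This is purely an illustration with made-up file and column names; the actual work happens inside Domo.

    import pandas as pd

    # Hypothetical extracts of the two inputs described above.
    historical = pd.read_csv("historical_attendance.csv")
    recent = pd.read_csv("last_4_weeks.csv")

    # Drop every week from history that also shows up in the new pull, then
    # append the new pull, so each week number appears exactly once.
    kept = historical[~historical["week_number"].isin(recent["week_number"])]
    combined = pd.concat([kept, recent], ignore_index=True)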

This discussion has been closed.