Strategies for monitoring or alerting about failed updates

For production data, we've got two paths for our data to get into Domo:

 

1) Scripts build Excel workbooks that are imported via Workbench.

 

2) Centralized data is stored in Postgres, exposed through a view, and then pulled in via the Postgres SSH connector.

 

The question is, how do people go about detecting late jobs? This came up the other day with our Workbench jobs. We noticed that several DataSets had not been updated for some time. It turned out that Workbench was no longer set to run these jobs. The schedule was set to never. No clue why. I rescheduled everything and now it's flowing. Workbench can notify you if a job fails, but it doesn't have any reason to notify you if a job isn't ever set to run.

 

So what I'm after is a server-side monitoring strategy - something that would detect that a Workbench job hasn't shown up in too long.

 

On the Postgres side, I think I've got options in the Advanced Scheduling section.

 

It would be nice to have a single strategy for monitoring that our flows are all working as expected. This doesn't have to be a Domo feature; it could just be something that pulls data from Domo to feed into something else. For example, the old Domo DataSet API returns 'last updated' information. I could potentially script something to pull the DataSet definitions on a regular schedule and combine that with thresholds to build a custom alert system. That seems a bit over the top for what must be a quite common requirement. Can anyone point me in the right direction for a way to detect late jobs that doesn't rely on the settings section of the connectors themselves?

 

Thanks.

Best Answer

  • AS
    AS Coach
    Answer ✓

    I've had the same question, and I've used or heard of a few ways to keep watch.

    First, if your users are like mine, they'll definitely tell you before you figure anything out yourself.  They're like autodetect. The bad thing is that those are exactly the people you don't want finding out when something has gone wrong.

    So I've built cards whose sole purpose is to alert me if there are delays.  I need one card for each dataset I'm concerned about.  For example, we have a dataset that runs about every 4 hours, and I can build a card with a beast mode that calculates the time between now and the last update.  If that goes over 4 hours, a counter increments, and I set an alert on that counter to fire if it's ever above zero.  In small doses, this data health dashboard of cards works well.

    When you run hundreds or thousands of datasets, this solution isn't as tenable.  I haven't found a solution for that yet, but I imagine the DomoStats apps could provide some help.

    Once upon a time I posted an idea here, or sent a request to Support, asking for an automated alerting system that uses dataset update patterns and calls out any outliers.  I think that would be great, and the technology is probably already available.

    Aaron
    MajorDomo @ Merit Medical

    **Say "Thanks" by clicking the heart in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"

Answers

  • Thanks for the detailed answer!

     

    I like your idea of a status card: simple and workable. We don't have so many DataSets right now that it couldn't work, but I see your point about scale.

     

    Your feature requests sound good; I'd vote for them if I could find them!

    I may end up writing something with the API, as I've already got code to do the pulls. It's a bit elaborate, but it should work.

     

    * Use the API to pull DataSet summaries, which include a last update timestamp. They're keyed off an internal ID, so it's okay if the name changes.

    * Stash each DataSet ID, name, last update, column count, row count, etc. in a table.

    * Add a field to the table for "too late threshold."

    * Periodically run a script to find too late jobs.


    In my case it would be a server-side script, pushing to Postgres, and then pulling back into Domo. It would be nice if there were a place in the DataSet settings to specify "Send an alert if this DataSet hasn't updated in the last 8 hours", or whatever you need. I didn't see anything of that sort for DataSets overall.

  • You're welcome and that method sounds great.

    I'd set up a single card with the summary number as a beast mode that calculates the number of datasets running late, and the main content of the card being a list of all datasets and their statuses.

    You'd get an alert when the count of datasets out of date is too high and then you'd check the card to see which are out of date.

    A single card to monitor all statuses is definitely better than dozens or hundreds.

    Aaron
    MajorDomo @ Merit Medical

    **Say "Thanks" by clicking the heart in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • Now that is a great idea! I wasn't thinking about the presentation, and I would not have come up with something so simple and complete. Now I want to go and write the code to make it happen!

     

    Thanks again!

  • If you're willing to share, I bet a number of us here would be interested in what that solution looks like!

    Aaron
    MajorDomo @ Merit Medical

    **Say "Thanks" by clicking the heart in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • One option that would let you monitor all of your datasets in one visualization is to use the datasets in the new Governance connector.

    If the frequency doesn't lend itself to a simple calculation, you could set up a webform with the datasets you'd like to monitor and a frequency value that compares properly with your time calculation.

    You could even color-code the output in green, yellow, and red to show on time, small delay, and major delay.

    If you run into any issues, please feel free to post and tag me.  

    DataMaven
    Breaking Down Silos - Building Bridges
    **Say "Thanks" by clicking a reaction in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • I was not aware of the Governance or DomoStats connectors; they look very interesting. I'll check these out next. I see that the Governance module was only released in December of 2018, so it's probably new for most people here. If anyone else checks it out and has comments, please post back!

     

     

  • I need to get my head back into a lot of Postgres+Domo work, so I figured I'd dive in here...it's a no-risk problem to solve. I'm excited about checking out the Governance connector, as that's likely what we'll want.

     

    But I figured I'd give this a shot, and I came up with a visualization that I don't love. The plan was a simple bar chart showing late DataSets and how late they are; I ended up with a scatterplot. mockup.png

    On the full chart, I've got quick filters to hide/show things, which makes all of the not-late jobs on the left easy to hide. That's fine, but I'm unclear on why my simple bar chart looked terrible. Anyway, here are some nerdy notes for anyone who has a good reason to roll their own. We, for example, might want that because we send "red alerts" to a Slack channel and, last I checked, Slack integration wasn't available for Domo.

     

    Okay, the first step is to start from your scripting tool of choice. I'm using a minority language, so there's no point in posting the code here. The Domo API is pretty straightforward for anyone used to dealing with things like that: you get an auth token, pass it in, and grab results. The relevant endpoint is documented here:

    https://developer.domo.com/docs/dataset-api-reference/dataset#List%20DataSets

    I reworked some code I wrote months ago and didn't look too closely. If I remember correctly, you don't get a count up front and have to keep fetching until the well runs dry. In any case, it's not that hard to do. What you get back is an array of JSON objects with a bit of information about each DataSet, such as this example from the docs:

    {
      "id": "08a061e2-12a2-4646-b4bc-20beddb403e3",
      "name": "Questions regarding Euclid's Fundamental Theorem of Arithmetic",
      "rows": 1,
      "columns": 6,
      "createdAt": "2015-12-10T07:06:14Z",
      "updatedAt": "2016-02-29T20:56:20.567Z"
    }

     

    That's very close to everything you need. All that's required past this is a threshold for how long you should wait before declaring a DataSet late/overdue/stale. 
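
    Since my script is in a language nobody else uses, here's a rough Python sketch of the pull instead, just to show the shape of it, not my actual code. It assumes the requests library, the client-credentials token flow, and the /v1/datasets endpoint from the developer docs; double-check the scope, page-size limit, and field names against the current documentation rather than taking them from me.

    import requests

    API_BASE = "https://api.domo.com"

    def get_token(client_id, client_secret):
        # Client-credentials flow from the developer docs; the 'data' scope
        # should be enough for reading DataSet metadata.
        resp = requests.get(
            f"{API_BASE}/oauth/token",
            params={"grant_type": "client_credentials", "scope": "data"},
            auth=(client_id, client_secret),
        )
        resp.raise_for_status()
        return resp.json()["access_token"]

    def list_datasets(token, page_size=50):
        # There's no total count up front, so keep paging until a fetch
        # comes back empty.
        headers = {"Authorization": f"Bearer {token}"}
        offset = 0
        while True:
            resp = requests.get(
                f"{API_BASE}/v1/datasets",
                headers=headers,
                params={"limit": page_size, "offset": offset},
            )
            resp.raise_for_status()
            page = resp.json()
            if not page:
                break
            yield from page
            offset += page_size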

     

    I pushed everything up into Postgres in a simple table using the columns from Domo, plus an extra number for the threshold value.

    BEGIN;

    DROP TABLE IF EXISTS "admin"."domo_dataset" CASCADE;

    -- name_ uses citext (case-insensitive text), which needs the citext extension.
    CREATE TABLE IF NOT EXISTS "admin"."domo_dataset" (
        "id"                     uuid        NOT NULL,
        "name_"                  citext      NOT NULL,
        "rows_count"             integer     NOT NULL DEFAULT 0,
        "columns_count"          integer     NOT NULL DEFAULT 0,
        "created_at_dts"         timestamptz NOT NULL DEFAULT '-infinity',
        "updated_at_dts"         timestamptz NOT NULL DEFAULT '-infinity',
        "late_threshold_minutes" integer     NOT NULL DEFAULT 0,

        CONSTRAINT domo_dataset_id_pkey
            PRIMARY KEY ("id")
    );

    ALTER TABLE "admin"."domo_dataset" OWNER TO "user_change_structure";

    COMMIT;

     

    The exact details will depend on your database... I used Postgres-style names (all lowercase with underscores).
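
    For what it's worth, the push into that table from the script side is just an upsert. A rough Python/psycopg2 sketch (again, not my actual code) might look like the following; note that late_threshold_minutes is deliberately left alone so the hand-entered thresholds survive each refresh.

    import psycopg2

    UPSERT_SQL = """
        insert into admin.domo_dataset
            (id, name_, rows_count, columns_count, created_at_dts, updated_at_dts)
        values (%s, %s, %s, %s, %s, %s)
        on conflict (id) do update
            set name_          = excluded.name_,
                rows_count     = excluded.rows_count,
                columns_count  = excluded.columns_count,
                created_at_dts = excluded.created_at_dts,
                updated_at_dts = excluded.updated_at_dts
        -- late_threshold_minutes is deliberately not touched here.
    """

    def push_datasets(conn_string, datasets):
        # 'datasets' is the list of JSON objects returned by the API,
        # shaped like the example above.
        rows = [
            (d["id"], d["name"], d["rows"], d["columns"],
             d["createdAt"], d["updatedAt"])
            for d in datasets
        ]
        with psycopg2.connect(conn_string) as conn:
            with conn.cursor() as cur:
                cur.executemany(UPSERT_SQL, rows)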

    From there, the next step is to add a view that does the calculations. This kind of thing is very idiom-specific, so you would need different functions in Oracle/MS SQL Server/MySQL, etc. Here's a Postgres 10 version...and I may have messed up a calculation...this is just a proof-of-concept exercise for now.

    -- View: admin.domo_dataset_plus
    DROP VIEW IF EXISTS admin.domo_dataset_plus;

    CREATE OR REPLACE VIEW admin.domo_dataset_plus AS
    /* Use a CTE so that we can reuse the minutes_since_update calculation easily in the case structures in the main select. */
    with datasets as (
        select domo_dataset.id,
               domo_dataset.name_,
               domo_dataset.columns_count,
               domo_dataset.rows_count,
               domo_dataset.created_at_dts,
               domo_dataset.updated_at_dts,
               domo_dataset.late_threshold_minutes,
               domo_dataset.late_threshold_minutes > 0 as monitor_this,
               -- extract(epoch ...) returns seconds, so divide by 60 to get minutes.
               round((extract(epoch from age(now(), domo_dataset.updated_at_dts)) / 60)::numeric, 0) as minutes_since_update
          from domo_dataset
    )

    select id,
           name_,
           columns_count,
           rows_count,
           created_at_dts,
           updated_at_dts,
           monitor_this,
           late_threshold_minutes,
           minutes_since_update,
           minutes_since_update - late_threshold_minutes as minutes_late,

           case
               when late_threshold_minutes < 1 then 'unmonitored'
               when late_threshold_minutes < minutes_since_update then 'late'
               else 'okay'
           end as dataset_status,

           case
               when late_threshold_minutes < 1 then false
               when minutes_since_update - late_threshold_minutes > 0 then true
               else false
           end as is_late,

           case
               when late_threshold_minutes < 1 then 0
               when minutes_since_update - late_threshold_minutes > 0 then 1
               else 0
           end as is_late_number

      from datasets;

     

    Honestly, that kind of dense, crazy-long SQL is pretty much impossible to read. You kind of have to build it up piece by piece. The final result is a view that's easy to search. So, to find the DataSets that are being monitored, here's all you need:

     

    select * from domo_dataset_plus
    where monitor_this = true

     

    To find the late jobs, all that you need is this:

    select * from domo_dataset_plus
    where is_late = true

     

    In Domo, I just pull in the whole table, by way of the view, as a replace:

    select * from domo_dataset_plus

     That's it. Then the DataSet can be used in a card as you like. I added columns in the view instead of Beast Modes because, well, I don't have the Beast Mode syntax memorized that well, and putting it in the view makes the data useful from other locations. (Like if we needed to build out the Slack alert idea.) But you could absolutely use Beast Mode calculated columns against a raw table instead of using a view.
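
     And since I mentioned Slack: if we ever do build that out, the alert side could be as small as this. A hedged sketch, assuming psycopg2 and a Slack incoming webhook; the webhook URL here is a placeholder, not a real one.

    import psycopg2
    import requests

    # Placeholder incoming-webhook URL; Slack just wants a JSON body with "text".
    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

    def alert_late_datasets(conn_string):
        # Pull the late jobs from the view defined above.
        with psycopg2.connect(conn_string) as conn:
            with conn.cursor() as cur:
                cur.execute("""
                    select name_, minutes_late
                      from admin.domo_dataset_plus
                     where is_late
                     order by minutes_late desc
                """)
                late = cur.fetchall()
        if late:
            lines = ["{}: {} minutes late".format(name, int(minutes)) for name, minutes in late]
            requests.post(SLACK_WEBHOOK_URL, json={"text": "Late DataSets:\n" + "\n".join(lines)})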

    Thoughts on how to more effectively visualize everything would be much appreciated!


  • P.S. The summary count is the sum of is_late_number, one of the calculated columns. That really was a great idea! Very simple to use. I'm going to keep that trick in mind for other situations.

  • If the Data Governance connector has the information we need for metrics like this, is there a reason to bring Domo data out to another source with the API, other than to make the data available for other systems?

     

    Aaron
    MajorDomo @ Merit Medical

    **Say "Thanks" by clicking the heart in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • I doubt it - I stay within Domo as much as possible. There's very little I have found greatly improved by pushing outside Domo.
    DataMaven
    Breaking Down Silos - Building Bridges
    **Say "Thanks" by clicking a reaction in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • No, I can't see any good reason, past what you mentioned. Pulling data out of Domo, repackaging it, and then pulling it back into Domo is downright perverse. So even though I posted a sample of how to do just that, I'd not promote the idea. We may have other uses for the data but, right now, we don't. And if Domo adds (have they?) a way to push alerts to Slack, I don't think we would have any other reason to pull it out.

     

    But every now and then, you just want to build a Rube Goldberg machine to expand your mind a little. Sort of like those whimsical sculptures that used to be in airports a lot, with all kinds of balls and conveyor belts:

    https://en.wikipedia.org/wiki/George_Rhoads

    Incredibly fun to look at, mesmerizing. But you wouldn't want it in your house!

     

    In my case, I can't get into the Governance connector; I started a question on that elsewhere. (I seem to have everything I should.) And I just saw some of the cards that Domo builds on top of the DomoStats connector and, dang, those cards are sharp. Very nice. Excited to look at the Governance DataSets.

  • @DataSquirrel - This post is another reminder of why you are one of my favorites on the Dojo!  

    For the Governance Tables - You have to generate an API key.  You can find this under Admin -> Security.  If you want to webex, I'd be happy to show you how I recommend setting them up.  I just did it with someone else yesterday.  Perhaps we can record the webex and post it for others to use.  

    DataMaven
    Breaking Down Silos - Building Bridges
    **Say "Thanks" by clicking a reaction in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • DataSquirrel
    DataSquirrel Contributor

    Thanks! You're kind.

     

    Finally got the connector working. I'd been copying and pasting all along, but this time it worked. So now I've got an account configured for access to the governance DataSets, and I've scheduled nearly all of them to run once a night.

     

    What I _don't_ have is a clue as to what is in there. I've always got four pots on the boil*, so I've only found time to build one card off of one of the generated DataSets. Do you know if there are any detailed docs, videos, etc? 

     

    I just stumbled across a series of beautiful cards that Domo builds from the DomoStats reports. The same sort of thing would be very nice for the Governance data.

     

    I'd be really curious to hear more about any ways you've found the various DataSets useful.

     

    Thanks

    * Just watched your cautionary tale.

  • AS
    AS Coach

    We just started building out a governance dashboard with the governance datasets.  So far, here's what we have:
    Vacant Pages By User: A count of pages by user that have no cards and no subpages. Built on the Governance Pages dataset.  We found out one of our users had built over two dozen pages with no content!

    Outdated Datasets: A count of datasets by dataset type and user, which haven't been refreshed in at least 30 days.  This helps uncover people who, perhaps, are uploading stale spreadsheets or creating static webforms. Built on the Governance Datasets dataset.

    Dataset Naming Issues: A count of datasets which, as evaluated with a case statement or regular expression, don't match our company's best practices for dataset naming conventions.  In order to cooperate well across teams, we need a common language about how we describe data. This helps point out those inconsistencies.  Built on the Governance Datasets dataset.

    Datasets with zero rows: As named, this card indicates datasets that have no data. Some of these may be valid, but most will show that either something is wrong or the process that got the data there is no longer valid. Built on the Governance Datasets dataset.

     

    We are just getting to this, but it's been helpful in the short time we've been connected.

    Aaron
    MajorDomo @ Merit Medical

    **Say "Thanks" by clicking the heart in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • DataSquirrel
    DataSquirrel Contributor

    Thanks, Aaron, those are all excellent examples. I'm going to build them all and see what turns up.

     

    Since I posted, I've built a few more cards looking at things like PDP policy status, columns by data type, cards & pages, and a few others. Honestly, none of them are particularly useful yet, but they're pretty interesting. Your examples sound immediately helpful.

     

     

  • DataSquirrel
    DataSquirrel Contributor

    I just tried out Aaron's four ideas, and they're great! We only have a limited set of users creating DataSets and cards right now, so a couple of the ideas aren't as useful for us yet. But outdated datasets and empty datasets are handy already. I adapted the ideas a bit for our context, where I'm more interested in connector type and less interested in user/owner.

     

    The quick filters make these cards fun to use and accessible to people who don't want to write queries, etc.

    datasets_by_rows_and_cards.png  datasets_without_recent_updates.png