Average run time of a Redshift data flow on a large dataset

WizardOz
WizardOz Contributor

Hello, I wanted to get a general idea of your Redshift run times.  Any input is appreciated. 

 

I have a dataset with about a million rows and 123 columns.  However, it took 3.5 hours simply to pull a subset of one column, and almost 10 hours to run my data flow with some left joins.  This dataset needs to be refreshed every day, so you can imagine how difficult it is for me to get anything done.

 

I am sure this dataset is not "Big Data" at all, so I am curious how long it takes you to run a similarly sized dataset.

 

Also, please give me any advice on how to speed things up.

 

Thank you.

 

Olivia   

Comments

  • Not having seen your joins, I suspect that adding indexes to your input datasets on the columns you join on will vastly improve the run time.

     

    What sort of ETL are you using? If it's the MySQL ETL, you can use the following syntax as a transform to add an index:

     

    "ALTER TABLE `table_name` ADD INDEX (`column_name`)"

     

    You should add an index to each input dataset on the columns you join on, which will improve the execution time.
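
    As a minimal sketch (the table and column names here are placeholders, not from your flow): if two input datasets `orders` and `customers` were joined on `customer_id`, you could add one such transform for each input before the join:

    "ALTER TABLE `orders` ADD INDEX (`customer_id`)"

    "ALTER TABLE `customers` ADD INDEX (`customer_id`)"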

     

    Try adding those and let me know if that doesn't reduce the run time.

  • WizardOz
    WizardOz Contributor

    Thank you, dthierjung.

    I will try what you suggested.  May I ask what your average run time is, and for what size of data?

    But here is an example of a data flow I ran yesterday, and it is still running after 17 hours - I missed my deadline for an analysis :(.

     

    This is what is in this data flow: 

    3 input tables:

    Table A: 54 million rows, 9 columns
    Table B: 24 rows, 3 columns
    Table C: 26.4K rows, 32 columns

     

    SELECT a.*, b.`Competitor`, e.`ActivityDate`, e.`Company`, e.`Title`, e.`Full Name`

    from `Table A` a
    left join `Table B` b
    on a.`company_name`=b.`DB compnay Name`
    left join `Table C` e
    on b.`Company Name` = e.`Company`

  • We don't have any inputs that exceed 3 million rows, so I don't think I'd be able to make a useful comparison for you, although I'm sure others can comment on their run times. As an aside, I had a dataflow that executed in 3 hours but only had ~600k rows total. After adding indexes, that flow completed in less than 10 minutes, so they can really make a huge difference.

     

    But SQL best practices are applicable no matter the data size, and given the size of your inputs, using all the tools at your disposal would be best. There are a few more items you can consider:

     

    1) Joining on integers is typically faster than joining on strings, so if you can join on `company_id`, for instance, that would be ideal.

    2) If you can filter your data in any way prior to joining, that will reduce your run-time as well.

    3) Also consider whether you need ALL the columns from `Table A` in the example you provided. (A rough sketch applying 2) and 3) follows below.)
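
    For illustration, here is a rough sketch of how 2) and 3) could look applied to the query you posted. The Table A columns kept and the date cutoff are placeholders (your Table A column names aren't listed in the thread), so substitute whatever your analysis actually needs:

    SELECT a.`company_name`, a.`order_date`, a.`order_amount`,   -- placeholder columns: keep only what you use
           b.`Competitor`,
           e.`ActivityDate`, e.`Company`, e.`Title`, e.`Full Name`
    from (
        -- filter the 54M-row table before joining; the cutoff is just an example
        SELECT `company_name`, `order_date`, `order_amount`
        FROM `Table A`
        WHERE `order_date` >= '2018-01-01'
    ) a
    left join `Table B` b
    on a.`company_name` = b.`DB compnay Name`
    left join `Table C` e
    on b.`Company Name` = e.`Company`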

  • You reference a dataset of 123 columns and a million rows, but a table like this isn't mentioned in the query.  Those three tables are all different sizes.  Where does this first dataset come into play?

     

    In my experience with Domo, 54 million rows is rather large for most companies, and I'm not surprised a dataflow took a few hours to run on that.  I wouldn't expect one transform combining that data with a couple of smaller tables to take 10 or 17 hours.

    @dthierjung gives really solid advice, though.  Strings, especially long ones, can destroy your run time. Since you're running Redshift, I don't think indexing the tables is going to be necessary; it's really only MySQL you need to do that with, and indexing absolutely helps there.

    Do try your dataflow with a fairly limited subset, like a particular customer, and see how that improves the run time.

    Aaron
    MajorDomo @ Merit Medical

    **Say "Thanks" by clicking the heart in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • WizardOz
    WizardOz Contributor

    Hi dthierjung, your suggestion works - it now takes about 3.5 hours to finish.   But I think this is still too long.  I am wondering how you all run ad hoc queries?

     

    Thank you.

  • Domo is fantastic for prototyping; however, with the sheer amount of data you're working with, ad hoc reports will be tough to do quickly. 

     

    My company has moved our productionized data flows into an MS SQL environment, which does all of the heavy lifting on our server, so processing times are typically much shorter.

     

    Have you tried the Fusion ETL? I believe that handles datasets in the millions of rows more efficiently. 

  • WizardOz
    WizardOz Contributor

    Hi Aaron,

     

    Thank you for your comments.  The data with 123 columns and 54 million rows is the original source data.  The example Table A I gave above is only a subset of this original data - I have another query to select those columns. 

     

    I have a data flow based on the original dataset, and it ran 10 hours in Redshift.  I already asked a Domo tech person to take a look, and he said it looks good and nothing can be optimized in Redshift.  Like you said, indexing only works in MySQL.   This data needs to be refreshed every day, and its size is growing every day. 

     

    I am not sure I understand what you mean by running just a particular customer - for us, this is our day-to-day situation. If I only run a subset, how can I do my analysis when I need to compare with last year's numbers or summarize overall results? I really need to get the data flow to run fast.

     

    Thank you.

     

    Olivia


  • I suggested limiting to just one customer simply to see how fast the dataflow would run, as a sort of check on the logic.
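
    As a minimal sketch of what I mean (the customer name is just a stand-in), you could temporarily add a filter to the transform you posted:

    SELECT a.*, b.`Competitor`, e.`ActivityDate`, e.`Company`, e.`Title`, e.`Full Name`
    from `Table A` a
    left join `Table B` b
    on a.`company_name` = b.`DB compnay Name`
    left join `Table C` e
    on b.`Company Name` = e.`Company`
    where a.`company_name` = 'Some Customer'   -- temporary filter for the run-time check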

     

    Running 54 million rows with that many columns, I'm not surprised by a three or four hour runtime.  

    What is the nature of the data?  Does it go back many years?  Is that why there are so many rows? Or are there so many dimensions that it takes a lot of rows to use them all?

     

    One thing we've done to get descriptive data out faster is to run two datasets simultaneously: one with just a year or two of data that refreshes often, like every hour, and another with more historical data that refreshes about every four hours. That adds complexity to the process, but it gets the right data into the hands of the right people at the right time.

    You might try a similar strategy where you become more precise with the requirements, given the time constraints.
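
    A rough sketch of that split, in Redshift syntax - `original_source` and `record_date` are placeholders for your 123-column input and whatever date field it carries:

    -- "recent" dataflow: just the last two years, refreshed often (e.g. hourly)
    SELECT *
    FROM original_source
    WHERE record_date >= DATEADD(year, -2, CURRENT_DATE);

    -- "historical" dataflow: the full history, refreshed about every four hours
    SELECT *
    FROM original_source;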

    Aaron
    MajorDomo @ Merit Medical

    **Say "Thanks" by clicking the heart in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • WizardOz
    WizardOz Contributor

    Thanks for letting me know about Fusion ETL.  I am trying to get that one.

     

    Olivia

  • WizardOz
    WizardOz Contributor

    Hi, Aaron,

     

    Thank you for your great suggestion! I am going to see how to optimize my data flow like that.

     

     

    Olivia

  • WHM
    WHM Contributor

    Redshift datasets can have a sort key added that will help them join faster. Below is some of our code to speed up a dataflow with big datasets (the BSEG customer table has 85 million rows). The biggest impact on run times will be the IO of moving the data into the Redshift environment. If you can limit the columns you use, it will go faster; you can fuse attributes back on later with a Fusion if they are not needed for whatever processing you are doing in the dataflow.

     

     

    /* ===== Prepare Sorted Table =====
    IMPORTANT! Using BEGIN & END makes these steps run as a single transaction.
    Without them, the next transform can occasionally start before the sorted table is ready.
    */

    BEGIN;

    DROP TABLE IF EXISTS raw_sap_bseg_customer_sorted;

    CREATE TABLE raw_sap_bseg_customer_sorted
    (
    BUKRS VARCHAR(24),
    BELNR VARCHAR(24),
    GJAHR VARCHAR(24),
    BUZEI VARCHAR(24),
    LIFNR VARCHAR(24),
    XREF1 VARCHAR(24),
    XREF2 VARCHAR(24),
    KUNNR VARCHAR(24)
    )
    SORTKEY(BUKRS,BELNR,GJAHR);   -- compound sort key: rows are stored ordered by these columns

    -- reload the raw table so its rows land in sort-key order
    INSERT INTO raw_sap_bseg_customer_sorted
    (SELECT * FROM raw_sap_bseg_customer);

    END;
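
    A downstream transform can then join against the sorted table on those same columns. `fact_invoices` below is just a made-up name for a second dataset keyed the same way:

    SELECT f.*, c.KUNNR, c.LIFNR
    FROM fact_invoices f   -- hypothetical table with the same BUKRS/BELNR/GJAHR keys
    LEFT JOIN raw_sap_bseg_customer_sorted c
    ON f.BUKRS = c.BUKRS
    AND f.BELNR = c.BELNR
    AND f.GJAHR = c.GJAHR;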