How to Speed Up Input in Redshift SQL Data Flow

Options

Hi,

I have the current dataflow setup, where Table_A and Table_B have the exact same columns:

  1. Input Table_A (1 hour 2 mins)
  2. Input Table_B  (1 hour 53 mins)
  3. Delete from Table_B Where Table_B.id = Table_A.id (3 secs)
  4. Output Table_B as SELECT * FROM Table_A UNION ALL SELECT * FROM Table_B (1 hour 36 minutes)

As you can see, this flow is 'recursive' in the sense that the output is also the input, and works as an incremental load where Table_A is the 'new' data, and Table_B contains the old data. Domo Currently takes a lot of time on Steps 1,2,4. 

 

There are a couple of issues with this:

  • Firstly, the SQL flow spends most of its time 'selecting' input datasets. The Datasets are already created within Domo, so why does it take so long (nearly 2 hours) to finish selecting them? I would understand if these datasets are being uploaded, but they are all on the same system. Is there a way to speed this up?
  • When outputting a dataset, we are essentially re-creating a dataset using a union all every time, rather than appending it. Is there is a way only to append the output and not replace the output?
  • Due to above it looks like the run time of a job will increase exponentially as the dataset increases, this makes it difficult to maintain.

Can someone provide suggestions to speed this up or a better way of doing it?

 

Thanks,

Jake