Performance Issue MySQL DataFlow

Hi there!

I have a bit of an performance issue.

 

I am basically trying to create a table for our sales department, which includes all sales countries (as a filter), sales towns (column 1) and sales (Total) from this year (column 2), last year sales (column 3) and a variance column (column 4). Furthermore I want to insert a date picker and a possibility for YTD figures.

 

Here is my existing code:

1) Select created_at_trans,shop_town_id,
Sum(Total) AS Total
From buchungsliste_sh GROUP BY created_at_trans,shop_town_id

-Generate Output Table: proc_group

 

2)

SELECT DISTINCT shop_town_id from buchungsliste_sh

-Generate Output Table: townsliste

 

3) 

Select a.Dates,shop_town_id
from (
select curdate() - INTERVAL (a.a + (10 * b.a) + (100 * c.a)+ (1000*d.a)) DAY as Dates,townsliste.shop_town_id
from (select 0 as a union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as a
cross join (select 0 as a union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as b
cross join (select 0 as a union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as c
cross join (select 0 as a union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as d
CROSS JOIN townsliste
) a
where a.Dates between '2010-01-01' and '2025-01-01'
order by dates

-Generate Output Table: proc1

 

4) 

SELECT dt.Dates as datum,dt.shop_town_id,
d.Total,
DATE_SUB(dt.Dates, interval WEEK(STR_TO_DATE(CONCAT('12-31-',YEAR(dt.Dates)-1),'%m-%d-%Y'),0) week) AS LY_Date
From proc1 dt LEFT JOIN proc_group d ON dt.dates = d.created_at_trans AND dt.shop_town_id = d.shop_town_id
order by dt.dates

-Generate Output Table: proc_join

 

5)

Select n.*,
n1.total AS Ly_Total
From proc_join n LEFT JOIN proc_group n1 on n1.created_at_trans = n.ly_date
AND n1.shop_town_id=n.shop_town_id

-Generate Output Table: proc2

 

6)

ALTER TABLE proc2
ADD Season text;

 

7)

UPDATE proc2
SET season = (CASE
WHEN datum BETWEEN '2005-05-01' and '2006-04-30'
THEN ("2005/2006")
WHEN datum BETWEEN '2006-05-01' and '2007-04-30'
THEN ("2006/2007")
WHEN datum BETWEEN '2007-05-01' and '2008-04-30'
THEN ("2007/2008")
WHEN datum BETWEEN '2008-05-01' and '2009-04-30'
THEN ("2008/2009")
WHEN datum BETWEEN '2009-05-01' and '2010-04-30'
THEN ("2009/2010")
WHEN datum BETWEEN '2010-05-01' and '2011-04-30'
THEN ("2010/2011")
WHEN datum BETWEEN '2011-05-01' and '2012-04-30'
THEN ("2011/2012")
WHEN datum BETWEEN '2012-05-01' and '2013-04-30'
THEN ("2012/2013")
WHEN datum BETWEEN '2013-05-01' and '2014-04-30'
THEN ("2013/2014")
WHEN datum BETWEEN '2014-05-01' and '2015-04-30'
THEN ("2014/2015")
WHEN datum BETWEEN '2015-05-01' and '2016-04-30'
THEN ("2015/2016")
WHEN datum BETWEEN '2016-05-01' and '2017-04-30'
THEN ("2016/2017")
WHEN datum BETWEEN '2017-05-01' and '2018-04-30'
THEN ("2017/2018")
WHEN datum BETWEEN '2018-05-01' and '2019-04-30'
THEN ("2018/2019")
WHEN datum BETWEEN '2019-05-01' and '2020-04-30'
THEN ("2019/2020")
WHEN datum BETWEEN '2020-05-01' and '2021-04-30'
THEN ("2020/2021")
WHEN datum BETWEEN '2021-05-01' and '2022-04-30'
THEN ("2021/2022")
WHEN datum BETWEEN '2022-05-01' and '2023-04-30'
THEN ("2022/2023")
WHEN datum BETWEEN '2023-05-01' and '2024-04-30'
THEN ("2023/2024")
WHEN datum BETWEEN '2024-05-01' and '2025-04-30'
THEN ("2024/2025")
WHEN datum BETWEEN '2025-05-01' and '2026-04-30'
THEN ("2025/2026")
END
)

 

As you can see it is a very large and complex statement. Unfortunately, it is necessary to include all dates per town, because otherwise I have incorrect last year figures.

 

e.g.

Town A had sales last year, but not on the same day this year. The figure won´t show up, if I change the code.

 

Step 4 took about 6 hours and step 5 about 10h.

 

Does someone could advise me how to have the same result, but in a more efficient way?

 

Many thanks for your help!

 

 

Best Answer

  • ST_-Superman-_
    Answer ✓

    You can improve the performance by adding an index prior to the joins.  You have to add a transform for each table you are indexing.

     

    3b)

    ALTER TABLE proc1 ADD INDEX(`dates`,`shop_town_id`);

     

    3c)

    ALTER TABLE proc_group ADD INDEX(`created_at_trans`,`shop_town_id`);

     

    then run your join statement (#4)

     

    4b)

    ALTER TABLE proc_join ADD INDEX(`created_at_trans`,`shop_town_id`);

     

    4c)

    ALTER TABLE proc_group ADD INDEX(`ly_date`,`shop_town_id`);

     

    then run your join statement (#5)

     

     

    Let me know if that speeds up the process.  My next thought would be to create a "join" field in those data tables by concating the two fields you are joining on CONCAT(`dates`,`shop_town_id`) as `join`

     

    You would then want to index that new field prior to the join.  

     

    Good luck to you, let me know how it turns out.


    “There is a superhero in all of us, we just need the courage to put on the cape.” -Superman

Answers

  • ST_-Superman-_
    Answer ✓

    You can improve the performance by adding an index prior to the joins.  You have to add a transform for each table you are indexing.

     

    3b)

    ALTER TABLE proc1 ADD INDEX(`dates`,`shop_town_id`);

     

    3c)

    ALTER TABLE proc_group ADD INDEX(`created_at_trans`,`shop_town_id`);

     

    then run your join statement (#4)

     

    4b)

    ALTER TABLE proc_join ADD INDEX(`created_at_trans`,`shop_town_id`);

     

    4c)

    ALTER TABLE proc_group ADD INDEX(`ly_date`,`shop_town_id`);

     

    then run your join statement (#5)

     

     

    Let me know if that speeds up the process.  My next thought would be to create a "join" field in those data tables by concating the two fields you are joining on CONCAT(`dates`,`shop_town_id`) as `join`

     

    You would then want to index that new field prior to the join.  

     

    Good luck to you, let me know how it turns out.


    “There is a superhero in all of us, we just need the courage to put on the cape.” -Superman
  • @ST_-Superman-_ gives very solid advice.

    I offer one more suggestion to possibly speed things up for you.  I have a similar transform process in one of my dataflows where I combine a list of all days with a list of all customers to get an array of all possible customer-day combinations.  Instead of doing date math using your self-generated list of n days ago, I used Domo's Domo Dimensions connector (https://yourinstance.domo.com/appstore/connectors/com.domo.connector.domodimension) to bring to our data center a calendar file that has all days in it.  Then I just join that calendar against a list of all customers to generate the customer-day combinations:

     

    SELECT

      a.dt

      ,b.customer_id

    FROM

      (select distinct customer_id from customers) a

      ,domo_calendar b

    --notice no join condition since you want a cartesian product, but you could still use CROSS JOIN if you wanted

     

    I think it's really user-friendly this way.

    Aaron
    MajorDomo @ Merit Medical

    **Say "Thanks" by clicking the heart in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • Thanks Scott, that helped out a lot!

     

    @AS Thank you as well for your input. I didn´t know about Domo Dimension, very helpful feature!