making sum of daily averages

So I have a data set with a date column and a column that holds a certain value for each date. For several years of data I have a single corresponding value per date, but my latest data has multiple values per day - i.e. I basically have an entry every 5 minutes. When I graph this in a bar chart it shows a daily average for each day, but I can't figure out how to create an output in Beast Mode that will show me a sum of all the average daily values in any selected date range. 

 

For example, let's say on the 1st of a month I have one value of $5, so its average is $5. On the 2nd of the month I have values of $4, $6, and $5, so its average is $5. But if I sum all the values, I don't get the sum of the averages ($10) but the sum of all the values in the column ($20)... so any suggestions will be very much appreciated...
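To restate the arithmetic, here is a quick sketch of the two sums (dates and values made up to match the example above):

```python
# Sum of daily averages vs. sum of all raw values, using the post's example:
# one $5 reading on day 1, three readings ($4, $6, $5) on day 2.
from statistics import mean

readings = {
    "2023-01-01": [5],
    "2023-01-02": [4, 6, 5],
}

# Average per day: both days come out to 5.
daily_averages = {day: mean(values) for day, values in readings.items()}

sum_of_averages = sum(daily_averages.values())                # 5 + 5 = 10
sum_of_all_values = sum(v for vals in readings.values() for v in vals)  # 20
```

The chart sums the raw column (20), while the desired number is the sum of the per-day averages (10).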

 

Thanks

Best Answer

  • jhl
    jhl Member
    Answer ✓

You're right, my mistake - you can ignore the rank and filtering (unless you have another column to set as a second ranking column). Instead, you can use the ETL function "remove duplicates" as a last step before creating the output dataset to remove any duplicate dates you have - because the average will be the same in all rows for a given date, you can do this without worrying about which rows it removes.

     

Regarding the preceding and following values: set the maximum (or expected) number of observations per day as "following" and tick the "unbounded" checkbox for "preceding". Since you have an observation every five minutes, that is 60/5*24 = 288; you can also set this to something higher, like 500, to be on the safe side (in Redshift this can be set dynamically in code, but not in the ETL GUI).

     

The partition means that only the rows with the same date are evaluated for the averaging; the window simply stops early if the partition has fewer values than the bounds you set in the paragraph above.
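A sketch of what that window setup amounts to in SQL, illustrated here with SQLite (which supports window functions from version 3.25; Redshift syntax is similar but not identical, and the table and column names are made up):

```python
# Demonstrate that a wide frame plus PARTITION BY the date gives every row
# its day's average: the frame is clipped to rows sharing the same date.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (d TEXT, value REAL)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)",
    [("2023-01-01", 5), ("2023-01-02", 4), ("2023-01-02", 6), ("2023-01-02", 5)],
)

rows = conn.execute(
    """
    SELECT d,
           AVG(value) OVER (
               PARTITION BY d
               ORDER BY d
               ROWS BETWEEN UNBOUNDED PRECEDING AND 288 FOLLOWING
           ) AS daily_avg
    FROM source
    """
).fetchall()
```

Every row carries its own day's average (here 5.0 for both days), which is why duplicate dates can then be dropped safely.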

     

    HTH

     

EDIT: as @Valiant mentioned here, you could also just use "group by" in ETL to save yourself all the hassle with rank and window.

Answers

  • Hi,

     

I think what you need is a window function - Beast Mode does not support them, unfortunately. However, Magic ETL does. (There is a chance that it works in SQL too, using

     

SELECT `Date`, AVG(Value) FROM source

GROUP BY `Date`

but I haven't tested this.) 
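That untested GROUP BY idea can at least be sanity-checked against SQLite (a different dialect than Redshift, so this only checks the logic, not the syntax; table and values are made up):

```python
# Sanity check of SELECT Date, AVG(Value) ... GROUP BY Date on sample data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (Date TEXT, Value REAL)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)",
    [("2023-01-01", 5), ("2023-01-02", 4), ("2023-01-02", 6), ("2023-01-02", 5)],
)

rows = conn.execute(
    "SELECT Date, AVG(Value) FROM source GROUP BY Date ORDER BY Date"
).fetchall()
# One row per date, carrying that day's average.
```

Summing the averaged column of this result then gives the sum of daily averages.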

     

You will need to use your date as the partition - you may need to add a column that is a date rather than a datetime using the ETL function "Set column type", and partition on that.

     

     

Then use the Average function and a ranking function. Then, in a next step, filter the rows so that you only keep, say, the lowest rank per day - this will give you a dataset with the date and the daily average (and a few columns you can either remove or ignore).

     

Voila, you can graph your heart out!

     

    HTH JHL

Thank you for the suggestion, but can you elaborate step by step on how to do it? I can't figure out how to implement what you propose in the ETL to try it out...

     

    Thanks!

     

Hi, sure, I'll try.

     

    1) use "set column type" on your datetime column to make it a date (without a timestamp)

    2) use "rank and window" for two functions:

- use "rank" (ascending) via "add function" - you will have to give the new column a name.

- use "average" on your value column (whatever you are actually measuring) - you will have to set the preceding and following cells to evaluate; make this as wide as possible (the partition will take care of it if there are fewer rows, but not if there are more, so pick the maximum number of observations you have in a day) - you will have to give this column a name as well.

    - order them on your date and set the date field as the partition as well.

This adds two columns - one with the average per day and one with a ranking. The rank is effectively a "count" of observations per day, going up from 1 to however many you have; this column is just there to filter on.

3) use "filter rows" on your rank column - "equals" "specific value" "1" (this removes everything but one row per day)

    4) use "select columns" to remove everything but the date and your new average column.

    5) make an output data set.

     

And that should be it. You should now have a dataset with two columns and one value (the average) per day.
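The five steps above can be sketched in plain Python (made-up sample rows; the steps mirror the ETL tiles but are only an illustration of the logic):

```python
# Pipeline sketch: datetime -> date, per-day rank and average,
# keep rank 1, keep only date + average.
from datetime import datetime
from statistics import mean

rows = [
    ("2023-01-01 00:00", 5.0),
    ("2023-01-02 00:00", 4.0),
    ("2023-01-02 00:05", 6.0),
    ("2023-01-02 00:10", 5.0),
]

# 1) "set column type": truncate the datetime to a date
dated = [(datetime.strptime(ts, "%Y-%m-%d %H:%M").date(), v) for ts, v in rows]

# 2) "rank and window": per-date rank and per-date average
by_date = {}
for d, v in dated:
    by_date.setdefault(d, []).append(v)

ranked = [
    (d, rank, mean(by_date[d]))
    for d in by_date
    for rank, _ in enumerate(by_date[d], start=1)
]

# 3) "filter rows": keep only rank 1 -> one row per day
# 4) "select columns": keep just the date and the average
output = [(d, avg) for d, rank, avg in ranked if rank == 1]
# 5) output: one (date, daily average) row per day
```

The `output` list corresponds to the final output dataset: two columns, one averaged row per day.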

     

    HTH

Thank you so much for the explanation, but I have a couple of most likely stupid questions... here goes:

So once I set the rank - what it does is assign a number to each row, and if there are rows with the same date, the number assigned is the same - I'll call this column X.

So where I get stuck is this: I add a new rank and window function where I use the average function on column X, and then I need to select the frame and range, and I have no idea what to put there - how do I express that the average needs to be taken over all the rows that have the same value in column X? 

     

    Thanks


Yep - I guess that solution answered this question as well! Thank you!