Re: Efficiently updating running sums only on new data

2022-10-13 Thread Igor Calabria
You can tag the last entry for each key using the same window you're using for your rolling sum. Something like this: "LEAD(date, 1) OVER your_window IS NULL AS last_record" (LEAD takes a column argument plus an offset, not just an offset). Then you just UNION ALL the last entry of each key (which you tagged) with the new data and run the same query over the windowed
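A minimal pure-Python sketch of this idea (Spark-free, so the shape of the logic is easy to see; the function names and the exact row schema are mine, not from the thread). The "tag the last record" step becomes keeping one row per key, and the UNION ALL plus re-run becomes seeding the sums from that row before folding in the new data:

```python
def last_records(rows):
    """Keep only the last record per `thing` -- the pure-Python analogue of
    tagging rows with `LEAD(date, 1) OVER your_window IS NULL AS last_record`
    and filtering on that flag."""
    last = {}
    for r in sorted(rows, key=lambda r: (r["thing"], r["date"])):
        last[r["thing"]] = r  # later dates overwrite earlier ones
    return list(last.values())

def update_sums(old_summed, new_rows):
    """UNION ALL the tagged last entries with the new data and rerun the
    running-sum pass over that much smaller set.  Each seed row already
    carries foo_sum/bar_sum, so the sums continue from those values.
    Returns summed rows for the new data only."""
    out = []
    seeds = {s["thing"]: s for s in last_records(old_summed)}
    for thing in sorted({r["thing"] for r in new_rows}):
        seed = seeds.get(thing)
        foo_sum = seed["foo_sum"] if seed else 0
        bar_sum = seed["bar_sum"] if seed else 0
        for r in sorted((n for n in new_rows if n["thing"] == thing),
                        key=lambda r: r["date"]):
            foo_sum += r["foo"]
            bar_sum += r["bar"]
            out.append({**r, "foo_sum": foo_sum, "bar_sum": bar_sum})
    return out
```

The point of the trick is cost: the window query only ever runs over (one row per key) + (new rows), not over the whole history.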

Re: Efficiently updating running sums only on new data

2022-10-12 Thread Artemis User
Do you have to use SQL/window functions for this? If I understand this correctly, you could just keep track of the last record of each "thing", then calculate the new sum by adding the current value of "thing" to the sum from that last record whenever a new record is generated. Looks like your problem will
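The stateful alternative being suggested can be sketched in a few lines of plain Python (illustrative only; the class and method names are mine): keep a map from key to its latest sums, and extend it as each record arrives, with no window function involved.

```python
class RunningSums:
    """Track the last (foo_sum, bar_sum) per `thing` and extend the sums
    incrementally as each new record arrives."""

    def __init__(self):
        self._last = {}  # thing -> (foo_sum, bar_sum)

    def add(self, thing, foo, bar):
        """Record a new observation and return its running sums."""
        foo_sum, bar_sum = self._last.get(thing, (0, 0))
        foo_sum, bar_sum = foo_sum + foo, bar_sum + bar
        self._last[thing] = (foo_sum, bar_sum)
        return foo_sum, bar_sum
```

This is O(1) per record, but it assumes records arrive in date order per key; out-of-order arrivals would need the window-based approach instead.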

Efficiently updating running sums only on new data

2022-10-11 Thread Greg Kopff
I'm new to Spark and would like to seek some advice on how to approach a problem. I have a large dataset that has dated observations. There are also columns that are running sums of some of the other columns:

date | thing | foo | bar | foo_sum | bar_sum
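For concreteness, the running-sum columns described above correspond to a windowed sum per key ordered by date (in Spark SQL, `SUM(foo) OVER (PARTITION BY thing ORDER BY date)`). A pure-Python sketch of that computation, using the column names from the post (the function name is mine):

```python
from itertools import groupby
from operator import itemgetter

def add_running_sums(rows):
    """For each `thing`, walk its observations in date order and
    accumulate foo_sum / bar_sum -- the batch computation whose
    incremental variant the thread is asking about."""
    out = []
    rows = sorted(rows, key=itemgetter("thing", "date"))
    for _thing, group in groupby(rows, key=itemgetter("thing")):
        foo_sum = bar_sum = 0
        for r in group:
            foo_sum += r["foo"]
            bar_sum += r["bar"]
            out.append({**r, "foo_sum": foo_sum, "bar_sum": bar_sum})
    return out
```

Rerunning this over the full history on every batch of new data is what the question wants to avoid.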