Re: dropDuplicates and watermark in structured streaming

2020-02-28 Thread Tathagata Das
why do you have two watermarks? once you apply the watermark to a column (i.e., "time"), it can be used in all later operations as long as the column is preserved. So the above code should be equivalent to df.withWarmark("time","window size").dropDulplicates("id").groupBy(window("time","window siz

Re: dropDuplicates and watermark in structured streaming

2020-02-27 Thread lec ssmi
Such as : df.withWarmark("time","window size").dropDulplicates("id").withWatermark("time","real watermark").groupBy(window("time","window size","window size")).agg(count("id")) can It make count(distinct id) success? lec ssmi 于2020年2月28日周五 下午1:11写道: > Such as : > df.

Re: dropDuplicates and watermark in structured streaming

2020-02-27 Thread lec ssmi
Such as : df.withWarmark("time","window size").dropDulplicates("id").withWatermark("time","real watermark").groupBy(window("time","window size","window size")).agg(count("id")) can It make count(distinct count) success? Tathagata Das 于2020年2月28日周五 上午10:25写道: > 1. Yes. All times

Re: dropDuplicates and watermark in structured streaming

2020-02-27 Thread Tathagata Das
1. Yes. All times in event time, not processing time. So you may get 10AM event time data at 11AM processing time, but it will still be compared again all data within 9-10AM event times. 2. Show us your code. On Thu, Feb 27, 2020 at 2:30 AM lec ssmi wrote: > Hi: > I'm new to structured stre

dropDuplicates and watermark in structured streaming

2020-02-27 Thread lec ssmi
Hi: I'm new to structured streaming. Because the built-in API cannot perform the Count Distinct operation of Window, I want to use dropDuplicates first, and then perform the window count. But in the process of using, there are two problems: 1. Because it is streaming computing, in