Such as:

df.withWatermark("time", "window size")
  .dropDuplicates("id")
  .withWatermark("time", "real watermark")
  .groupBy(window("time", "window size", "window size"))
  .agg(count("id"))
  ...

Can this make count(distinct id) work?
lec ssmi <shicheng31...@gmail.com> wrote on Fri, Feb 28, 2020 at 1:11 PM:

> Such as:
>
> df.withWatermark("time", "window size")
>   .dropDuplicates("id")
>   .withWatermark("time", "real watermark")
>   .groupBy(window("time", "window size", "window size"))
>   .agg(count("id"))
>   ...
>
> Can this make count(distinct id) work?
>
> Tathagata Das <tathagata.das1...@gmail.com> wrote on Fri, Feb 28, 2020 at 10:25 AM:
>
>> 1. Yes. All times are in event time, not processing time. So you may get
>> 10 AM event-time data at 11 AM processing time, but it will still be
>> compared against all data with event times between 9 and 10 AM.
>>
>> 2. Show us your code.
>>
>> On Thu, Feb 27, 2020 at 2:30 AM lec ssmi <shicheng31...@gmail.com> wrote:
>>
>>> Hi:
>>> I'm new to Structured Streaming. Because the built-in API cannot
>>> perform a windowed count-distinct, I want to use dropDuplicates first
>>> and then perform the windowed count.
>>> But while doing this, two questions came up:
>>> 1. Because this is streaming computation, the deduplication
>>> state needs to be cleared in time, which requires a watermark. Assuming
>>> my event-time field is monotonically increasing and I set the watermark
>>> to 1 hour, does that mean data at 10 o'clock will only be compared
>>> against data from 9 to 10 o'clock, and data from before 9 o'clock will
>>> be cleared?
>>> 2. Because the deduplication is per window, I set the watermark
>>> before dropDuplicates to the window size. But after deduplication I
>>> need to call withWatermark() again to set the real watermark. Will
>>> setting the watermark a second time take effect?
>>>
>>> Thanks a lot!
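To make the semantics in question 1 concrete, here is a minimal plain-Python toy model (not Spark's actual implementation, and not runnable Spark code) of "dropDuplicates with a watermark, then a windowed distinct count". All names, the 1-hour delay, and the tumbling-window size are illustrative assumptions chosen to mirror the question:

```python
from collections import defaultdict

WATERMARK_DELAY = 3600            # assumed 1-hour watermark, as in the question
WINDOW_SIZE = 3600                # assumed 1-hour tumbling windows

seen_ids = {}                     # dedup state: id -> event time first seen
window_counts = defaultdict(set)  # window start -> distinct ids in that window
max_event_time = 0

def process(event_time, event_id):
    """Toy model: dedup by id under a watermark, then count per window."""
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - WATERMARK_DELAY

    # Data arriving with an event time behind the watermark is dropped.
    if event_time < watermark:
        return

    # Evict dedup state older than the watermark: an id first seen before
    # the watermark can no longer be compared against in-scope data.
    for k in [k for k, t in seen_ids.items() if t < watermark]:
        del seen_ids[k]

    if event_id in seen_ids:      # duplicate within the watermark horizon
        return
    seen_ids[event_id] = event_time

    window_start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
    window_counts[window_start].add(event_id)

events = [(100, "a"), (200, "a"), (300, "b"), (4000, "a")]
for t, i in events:
    process(t, i)

result = {w: len(s) for w, s in window_counts.items()}
print(result)  # {0: 2, 3600: 1}
```

Note how "a" at t=4000 is counted again: its earlier sighting at t=100 fell behind the watermark and was evicted, which matches the answer to question 1 (an event is only deduplicated against data inside the watermark horizon, and older state is cleared).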