Kevin Zhang created SPARK-21944: ----------------------------------- Summary: Watermark on window column is wrong Key: SPARK-21944 URL: https://issues.apache.org/jira/browse/SPARK-21944 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Kevin Zhang
When I use a watermark with dropDuplicates in the following way, the watermark is calculated wrong {code:java} val counts = events.select(window($"time", "5 seconds"), $"time", $"id") .withWatermark("window", "10 seconds") .dropDuplicates("id", "window") .groupBy("window") .count {code} where events is a dataframe with a timestamp column "time" and long column "id". I registered a listener to print the event time stats in each batch, and the results is like the following {code:shell} ------------------------------------------- Batch: 0 ------------------------------------------- +---------------------------------------------+-----+ |window |count| +---------------------------------------------+-----+ |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3 | +---------------------------------------------+-----+ {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z} {watermark=1970-01-01T00:00:00.000Z} {watermark=1970-01-01T00:00:00.000Z} {watermark=1970-01-01T00:00:00.000Z} {watermark=1970-01-01T00:00:00.000Z} ------------------------------------------- Batch: 1 ------------------------------------------- +---------------------------------------------+-----+ |window |count| +---------------------------------------------+-----+ |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1 | |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3 | +---------------------------------------------+-----+ {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z} {watermark=1970-01-01T19:05:09.476Z} ------------------------------------------- Batch: 2 ------------------------------------------- +---------------------------------------------+-----+ |window |count| +---------------------------------------------+-----+ |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1 | |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4 | +---------------------------------------------+-----+ {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z} {watermark=1970-01-01T19:05:09.476Z} {code} As can be seen, the event time stats are wrong which are always in 1970-01-01, so the watermark is calculated wrong. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org