Hello Spark community -
I am running a Spark SQL query to calculate the difference in time between
consecutive events, using lag(event_time) over window -
SELECT device_id,
unix_time,
event_id,
unix_time - lag(unix_time)
OVER
(PARTITION BY device_id ORDER BY unix_time,event_id)
AS seconds_since_last_event
FROM ios_d_events;
This is giving me some strange results in the case where the first two events
for a particular device_id have the same timestamp.
I used to following query to take a look at what value was being returned by
lag():
SELECT device_id,
event_time,
unix_time,
event_id,
lag(event_time) OVER (PARTITION BY device_id ORDER BY
unix_time,event_id) AS lag_time
FROM ios_d_events;
I’m seeing that in these cases, I am getting something like 1970-01-03 …
instead of a null value, and the following lag times are all following the same
format.
I posted a section of this output in this SO question:
http://stackoverflow.com/questions/33482167/spark-sql-window-function-lag-giving-unexpected-resutls
The errant results are labeled with device_id 99999999999999999999999.
Any idea why this is occurring?
- Ross