Hi folks, I have a stream coming from Kafka.
It has this schema:

{ "id": 4, "account_id": 1070998, "uid": "green.th", "last_activity_time": "2020-09-03 13:04:04.520129" }

Another event for the same user arrives a few milliseconds or seconds later:

{ "id": 9, "account_id": 1070998, "uid": "green.th", "last_activity_time": "2020-09-03 13:04:05.123456" }

Later, say 12 seconds on, another event arrives for her:

{ "id": 119, "account_id": 1070998, "uid": "green.th", "last_activity_time": "2020-09-03 13:04:17.345678" }

At this point she has been in the queue for 13 seconds (and the total number of users in the queue is now 43).

I need to keep track of two things:

(1) Since events like these are arriving from many users, I want to know how many users are in the queue at any given point, aggregated over some time window.
(2) If Ms green.th was first seen at 13:04:04 and then at 13:04:05, she has been in the queue for 1 second (ignoring the milliseconds).

How does one go about computing these sorts of more complex things in Spark Streaming? Would one have to keep track of her first-seen time in a column and then take a diff against it the next time she's seen? With append/update output modes, how does one begin doing this sort of thing?

Any help would be most gratefully appreciated.

Hamish
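P.S. In case it helps clarify the question, below is a rough sketch of what I've been imagining, not a working job. I'm assuming Structured Streaming on Spark 3.x, that "in the queue" simply means "has produced an event recently", and that the broker address, topic name, window sizes, and console sinks are all placeholders:

import java.sql.Timestamp

import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(id: Long, account_id: Long, uid: String, last_activity_time: Timestamp)
case class QueueTime(uid: String, firstSeen: Timestamp, lastSeen: Timestamp, secondsInQueue: Long)

object QueueTracking {

  // (2) Keep each user's first-seen time in state; on every new batch of
  // events for that uid, diff the latest event time against it.
  // Integer division by 1000 drops the sub-second part, as intended.
  def trackUser(uid: String, events: Iterator[Event],
                state: GroupState[Timestamp]): QueueTime = {
    val times = events.map(_.last_activity_time).toSeq
    val earliest = times.minBy(_.getTime)
    val latest = times.maxBy(_.getTime)
    val firstSeen = state.getOption.getOrElse(earliest)
    state.update(firstSeen)
    QueueTime(uid, firstSeen, latest, (latest.getTime - firstSeen.getTime) / 1000L)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("queue-tracking").getOrCreate()
    import spark.implicits._

    // Parse the Kafka value bytes into typed Event rows.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
      .option("subscribe", "user-activity")                // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS json")
      .select(from_json($"json", Encoders.product[Event].schema).as("e"))
      .select("e.*")
      .as[Event]

    // (1) Distinct users seen per 10-second window of event time.
    val usersInQueue = events
      .withWatermark("last_activity_time", "30 seconds")
      .groupBy(window($"last_activity_time", "10 seconds"))
      .agg(approx_count_distinct("uid").as("users_in_queue"))

    // (2) Per-user time in queue via arbitrary stateful processing.
    val timeInQueue = events
      .groupByKey(_.uid)
      .mapGroupsWithState(GroupStateTimeout.NoTimeout)(trackUser)

    usersInQueue.writeStream.outputMode(OutputMode.Update).format("console").start()
    timeInQueue.writeStream.outputMode(OutputMode.Update).format("console").start()
    spark.streams.awaitAnyTermination()
  }
}

My understanding is that exact count(distinct ...) isn't supported on streams, hence approx_count_distinct, and that mapGroupsWithState forces update output mode. Does that match how people do this in practice, or is flatMapGroupsWithState (perhaps with a richer state class and a timeout to expire users who leave the queue) the more usual approach?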