Hi folks,

I have a stream coming from Kafka.

Each event looks like this:
{
    "id": 4,
    "account_id": 1070998,
    "uid": "green.th",
    "last_activity_time": "2020-09-03 13:04:04.520129"
}

Another event arrives a few milliseconds/seconds later:

{
    "id": 9,
    "account_id": 1070998,
    "uid": "green.th",
    "last_activity_time": "2020-09-03 13:04:05.123456"
}

Say 12 seconds later, she's seen again. Her event is now:

{
    "id": 119,
    "account_id": 1070998,
    "uid": "green.th",
    "last_activity_time": "2020-09-03 13:04:17.345678"
}

She's now been in the queue for 13 seconds (and say the total number of
users in the queue is now 43).
I need to keep track of two things:

(1) Since events like these are coming in from many users, I want to know
how many distinct users are in the queue over some time window (see the
first sketch below),

(2) How long each user has been in the queue: if Ms green.th was first seen
at 13:04:04 and then at 13:04:05, she's been in the queue for 1 second
(ignoring the milliseconds).
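
For (1), here's roughly what I've sketched so far in Scala, assuming the
Kafka value holds the JSON above (the bootstrap servers and topic name are
placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("queue-tracker").getOrCreate()

// Shape of the JSON events; the timestamp is read as a string and cast
// below, since "2020-09-03 13:04:04.520129" casts cleanly to a timestamp
val eventSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("account_id", LongType),
  StructField("uid", StringType),
  StructField("last_activity_time", StringType)
))

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "user-activity")                // placeholder topic
  .load()
  .select(from_json(col("value").cast("string"), eventSchema).as("e"))
  .select("e.*")
  .withColumn("last_activity_time", col("last_activity_time").cast("timestamp"))

// (1) Distinct users seen in each 10-second window of event time. Exact
// countDistinct isn't supported on streams, hence the approximate version.
val usersInQueue = events
  .withWatermark("last_activity_time", "1 minute")
  .groupBy(window(col("last_activity_time"), "10 seconds"))
  .agg(approx_count_distinct("uid").as("users_in_queue"))

With the watermark in place this aggregation should run in either append or
update mode, if I understand the docs right.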

How does one go about computing these sorts of things in Spark Structured
Streaming? Would one have to keep each user's first-seen time in state and
compute the difference the next time she's seen? And how does the choice of
append vs. update output mode come into it?
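
For (2), is arbitrary stateful processing the right tool? A minimal sketch
of what I'm imagining with mapGroupsWithState, keeping each user's
first-seen time (as epoch millis) in state, reusing `events` from the
snippet above (the case class and names are just my guesses):

import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._

case class Event(id: Long, account_id: Long, uid: String,
                 last_activity_time: Timestamp)
case class QueueStatus(uid: String, secondsInQueue: Long)

// Remember the earliest time each uid was seen and emit the difference
// between that and the latest event every time the uid shows up again.
def trackQueueTime(
    uid: String,
    batch: Iterator[Event],
    state: GroupState[Long]): QueueStatus = {
  val times = batch.map(_.last_activity_time.getTime).toList
  val firstSeen = state.getOption.getOrElse {
    state.update(times.min) // first time we've seen this uid
    times.min
  }
  QueueStatus(uid, (times.max - firstSeen) / 1000)
}

val queueTimes = events.as[Event]
  .groupByKey(_.uid)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(trackQueueTime)

queueTimes.writeStream
  .outputMode("update") // mapGroupsWithState queries run in update mode
  .format("console")
  .start()

Does that look like a sane way to do it, or is there something simpler?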

Any help would be most gratefully appreciated.
Hamish
