Hi, I know there is an improvement in Blink SQL that can handle the top-k problem, as in the SQL shown below, by maintaining an in-memory "heap" to store the top-k records. That works well when a user's score only ever increases.
SELECT user_id, score
FROM (
  SELECT *,
    ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num
  FROM user_scores)
WHERE row_num <= 3

My question is how to deal with such a top-k problem when a user's score can decrease as well. Suppose there is a SQL query like this:

SELECT user_id, score
FROM (
  SELECT *,
    ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num
  FROM (
    SELECT user_id, LAST_VAL(score) AS score
    FROM user_scores
    GROUP BY user_id))
WHERE row_num <= 3

`user_scores` is a dynamic table converted from a DataStream, and `LAST_VAL` is a UDAF that returns the latest value, so a user's score can increase but also decrease over time. If we only maintain a heap of the top-k elements, there is a problem. For example, suppose a top-3 heap already holds users A, B, and C with scores 4, 3, and 2. If the next record is user D with score 1, it is dropped because its score is lower than those of A, B, and C. But if a later record updates user A's score to 0, the true top-3 becomes B, C, and D, yet with a heap there is no way to get user D back. The heap works fine in batch mode because the users' scores don't change over time. In this streaming case, I think it has to fall back to storing all users and their scores, and updating the top-k every time a new record arrives.

If the heap optimization won't work in streaming mode, is there any other optimization that can be applied in this case? It is not necessary to focus on SQL only; any improvement on the DataStream API is also welcome.

Thank you.

Best Regards,
Tony Wei
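To make the fallback concrete, here is a minimal plain-Java sketch (not Flink operator code; the class name `UpdatableTopK` is my own) of the structure the mail describes: keep the latest score for *every* user, plus an ordering over all of them, so that a score decrease can promote a user an eager top-k heap would already have dropped. In a real Flink job this state would live in keyed state inside a ProcessFunction rather than on the heap.

```java
import java.util.*;

// Sketch of the fallback: store ALL users and their latest scores,
// not just the current top k, so decreases are handled correctly.
class UpdatableTopK {
    private final int k;
    // latest score per user
    private final Map<String, Integer> scores = new HashMap<>();
    // all (user, score) pairs, ordered by score descending, ties by user id
    private final TreeSet<Map.Entry<String, Integer>> ranking =
        new TreeSet<>((a, b) -> {
            int c = Integer.compare(b.getValue(), a.getValue());
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        });

    UpdatableTopK(int k) { this.k = k; }

    // Apply the latest score for a user; it may be lower than before.
    void update(String user, int score) {
        Integer old = scores.put(user, score);
        if (old != null) {
            ranking.remove(Map.entry(user, old)); // drop the stale entry
        }
        ranking.add(Map.entry(user, score));
    }

    // Current top-k user ids, best score first.
    List<String> topK() {
        List<String> out = new ArrayList<>(k);
        for (Map.Entry<String, Integer> e : ranking) {
            if (out.size() == k) break;
            out.add(e.getKey());
        }
        return out;
    }
}
```

Replaying the example from the mail: after A=4, B=3, C=2, D=1 the top-3 is [A, B, C]; after the update A=0 it becomes [B, C, D], which is exactly the answer the heap cannot produce. The cost is O(n) state for n users and O(log n) per update, versus O(k) state for the append-only heap, which is the trade-off the question is about.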