I've done some work on this with Nico Kruber. In our benchmarking, the performance loss (from not being able to use the namespace) was roughly a factor of two, so it is significant. We prototyped an API extension that addresses this particular concern but without exposing the namespace directly, which I believe there is some reluctance to do. I've been thinking of turning this into a FLIP, but it needs more work first.
Another direction that could be explored would be to use finer-grained timestamps. E.g., with nanosecond-precision timestamps the number of colliding events (events that share the same timestamp) will be dramatically smaller.

David

On Wed, Feb 16, 2022 at 10:17 PM David Anderson <dander...@apache.org> wrote:

> I'm afraid not. The DataStream window implementation uses internal APIs to
> manipulate the state backend namespace, which isn't possible to do with the
> public-facing API. And without this, you can't implement this as
> efficiently.
>
> David
>
> On Wed, Feb 16, 2022 at 12:04 PM Ruibin Xing <xingro...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to implement customized state logic with KeyedProcessFunction.
>> But I'm not quite sure how to implement the correct watermark behavior when
>> late data is involved.
>>
>> According to the answer on stackoverflow:
>> https://stackoverflow.com/questions/59468154/how-to-sort-an-out-of-order-event-time-stream-using-flink
>> , there should be a state buffering all events until watermark passed the
>> expected time and an event time trigger will fetch from the state and do the
>> calculation. The buffer type should be Map<T, List<E>> where T is the
>> timestamp and E is the event type.
>>
>> However, the interface provided by Flink currently is only a MapState<K,
>> V>. If the V is a List<E> and buffered all events, every time an event
>> comes Flink will do ser/deser and could be very expensive when throughput
>> is huge.
>>
>> I checked the built-in window implementation which implements the
>> watermark buffering. It seems that WindowOperator consists of some
>> InternalStates, of which signature is where window is namespace or key, if
>> I understand correctly. But internal states are not available for Flink
>> users.
>>
>> So my question is: is there an efficient way to simulate watermark
>> buffering using process function for Flink users?
>>
>> Thanks.
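
For anyone skimming the thread, here is a rough plain-Java sketch of the buffering pattern being discussed: a map from timestamp to a list of events, drained in timestamp order once the watermark passes. It deliberately has no Flink dependency. The names WatermarkBuffer, advanceWatermark, and largestBucket are made up for this illustration and are not part of any Flink API. It also assumes the usual convention that an event is late when its timestamp is at or behind the current watermark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical, Flink-free illustration of buffering events by timestamp
 * and releasing them in order once the watermark has passed them.
 */
final class WatermarkBuffer<E> {
    // timestamp -> events carrying that timestamp, sorted by timestamp
    private final TreeMap<Long, List<E>> buffer = new TreeMap<>();
    private long currentWatermark = Long.MIN_VALUE;

    /** Buffers the event; returns false if it is late and was dropped. */
    boolean add(long timestamp, E event) {
        if (timestamp <= currentWatermark) {
            return false; // assumption: late events are simply rejected here
        }
        buffer.computeIfAbsent(timestamp, t -> new ArrayList<>()).add(event);
        return true;
    }

    /** Advances the watermark and returns all events with timestamp <= watermark. */
    List<E> advanceWatermark(long watermark) {
        currentWatermark = Math.max(currentWatermark, watermark);
        List<E> ready = new ArrayList<>();
        // headMap(..., true) is a live view of every entry at or before the watermark
        Map<Long, List<E>> due = buffer.headMap(currentWatermark, true);
        for (List<E> events : due.values()) {
            ready.addAll(events);
        }
        due.clear(); // clearing the view removes the entries from the buffer
        return ready;
    }

    /** Size of the largest list stored under a single timestamp key. */
    int largestBucket() {
        int max = 0;
        for (List<E> events : buffer.values()) {
            max = Math.max(max, events.size());
        }
        return max;
    }
}
```

In a KeyedProcessFunction the TreeMap would presumably become a MapState<Long, List<E>>, which is where the cost raised in the original question comes from: every append to a list rewrites the whole value stored under that timestamp. largestBucket is only there to make the collision point visible. Keying by millisecond puts every event from the same millisecond into one list, while keying by a nanosecond timestamp would usually leave each list with a single element.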