ozankabak commented on issue #11404: URL: https://github.com/apache/datafusion/issues/11404#issuecomment-2222100150
The consensus at the time when we proposed #4285 was to add general-purpose functionality upstream and keep stream-processing-focused features downstream. To that end, we added things like

- Sort-based optimizations
- Equivalence/order tracking
- Interval arithmetic
- Various join functionality

(and many other things I forget now) upstream. Per this consensus, checkpointing and watermarking (especially when it throws away data depending on processing time) did not get the same treatment, as they are quite specific to stream processing.

Having added many features to upstream DF for a long time now, and having gone through the experience of implementing specific functionality like checkpointing, watermarking and others, I think the consensus reached at the time of #4285 proved to be quite a reasonable one.

> A pluggable state backend support would be ideal, this also may be useful for operators that spill to disk.

This is an interesting idea. If there is sufficient interest in generalizing spill-to-disk code to go through a backend, then we should add this upstream. In that case, it would be a win-win: we would be helping general-purpose cases and also simplifying downstream code for people like you and us.

> Happy to take a stab at putting together a gentle introduction for future developers from our learnings.

This would be very nice and will certainly be helpful to others who want to build streaming systems on top of DataFusion.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
