GitHub user gyfora opened a pull request: https://github.com/apache/flink/pull/395
StreamWindow abstraction + modular window computations This PR introduces a rework to the current windowing semantics, by major API and Runtime improvements. The core new abstraction is the StreamWindow which encapsulates the contents of a window in the data stream. Now the WindowedDataStream symbolises transformations on a DataStream of StreamWindows. This allows the users to properly collect, and work with the results of the window transformations. API changes: Previously the call `ds.window(...).groupBy(...).reduce(...)` would have returned a stream of reduced values by key in the given window. There was no way of properly obtaining all the (key, value) pairs from the window reduce calls just the flattened windows. This also made it impossible to apply more than one transformations on the same window without having to re-descritize it. The new api introduces several changes here: `ds.window(...).groupBy(...).reduceWindow(...)` returns a WindowedDataStream which than can be further transformed by applying transformations on the resulting window. The user can also call `.flatten()` or `.getDiscretizedStream()` to obtain the DataStream of the flattened values or the DataStream of StreamWindows. The reduce and reduceGroup functions have been renamed to reduceWindow and mapWindow respectively to emphasise the functionality. Runtime changes: The windowing runtime has been completely reworked to make the components modular for future improvements: DataStream -> StreamDiscretizer -> (PreAggregator) -> WindowPartitioner -> Transformation (reduce/map) -> WindowMerge This allows easier special-casing for specific window types for maximal performance, such as applying special preaggregators or parallel stream discretisation for global windows (time) as well. Further tasks after merge: -Add pre-aggregators for time windows -Add pre-aggregator for sliding count windows -Improve test coverage for all pre-implemented policies -Further optimise partitioning-reduce parallelism I will add one last commit which will update the javadocs and the streaming guide. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mbalassi/flink window Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/395.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #395 ---- commit f41a237ee05dd13caf6ddb805a917fc63d20c1a8 Author: Gyula Fora <gyf...@apache.org> Date: 2015-02-04T16:14:51Z [streaming] StreamWindow abstraction added with typeinfo and tests commit 6b7243bda432e7cd7d0d3dff37871a67f7c3fbc0 Author: Gyula Fora <gyf...@apache.org> Date: 2015-02-04T16:26:16Z [FLINK-1176] [streaming] Added invokables for modular windowing tranformations commit 3df8b613801c2dd7da12a7467acabd21cac14c8e Author: Gyula Fora <gyf...@apache.org> Date: 2015-02-04T16:27:34Z [FLINK-1176] [streaming] WindowedDataStream rework for new windowing runtime commit df54364fe3bfa38a4eaafd5a7e774087a9996aaf Author: Gyula Fora <gyf...@apache.org> Date: 2015-02-08T12:09:34Z [streaming] StreamDiscretizer rework to support only 1 eviction and trigger for robustness + test cleanup commit f14936ea62ebc999c765f536d79ba91f3709256b Author: Gyula Fora <gyf...@apache.org> Date: 2015-02-08T12:10:00Z [streaming] Integration test added for windowed operations commit fca43c132f6988440d179f5922f3c150dc2dc9c9 Author: Gyula Fora <gyf...@apache.org> Date: 2015-02-08T12:46:47Z [streaming] Streaming scala api fix for window changes + example cleanup commit 8109c85a987fcb312d2575962abcf4b6f971b122 Author: Gyula Fora <gyf...@apache.org> Date: 2015-02-08T21:59:22Z [streaming] GroupedTimeDiscretizer added for lighter time policy thread management commit 8df13b8d9f117633ac4c7537b3c03d8d2f4bbf40 Author: Gyula Fora <gyf...@apache.org> Date: 2015-02-10T11:21:09Z [streaming] WindowBuffer interface added for preaggregator logic + simple tumbling prereducer commit bae76b7b310d313d2296d91e910cbf5d0d13c344 Author: Gyula Fora <gyf...@apache.org> Date: 2015-02-11T11:57:06Z [streaming] Reimplemented StreamWindow.split to avoid dependency issues commit 2a88e35cd59a9f8cdc2e460f6ec89c03e430308d Author: Gyula Fora <gyf...@apache.org> Date: 2015-02-12T19:45:34Z [streaming] TumblingGroupedPreReducer added + new tests for windowing commit 0dba2c5cfc1bd9a8dd51a0ac630fe0a0ac3f9f5d Author: Gyula Fora <gyf...@apache.org> Date: 2015-02-13T10:03:08Z [streaming] WindowMapFunction added + Streaming package structure cleanup commit 592a6ec90ade022ef4758ef5cc678d17b33a7afd Author: Gyula Fora <gyf...@apache.org> Date: 2015-02-13T18:57:55Z [FLINK-1539] [streaming] Remove calls to uninitalized runtimecontexts ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---