Hi Dmitry, In all cases, the result of the countWindow will be also grouped by key because of the keyBy() that you are using.
If you want to have a non-keyed stream and then split it in count windows, remove the keyBy() and instead of countWindow(), use countWindowAll(). This will have parallelism 1 but then you can repartition your stream so that the downstream operators have higher parallelism. Hope this helps, Kostas > On Jan 23, 2017, at 11:05 AM, Dmitry Golubets <dgolub...@gmail.com> wrote: > > Hi, > > I'm looking for the right way to do the following scheme: > > 1. Read data > 2. Split it into partitions for parallel processing > 3. In every partition group data in N elements batches > 4. Process these batches > > My first attempt was: dataStream.keyBy(_.key).countWindow(..) > But countWindow groups by a key. I however want to group all elements in > partition. > > Then I tried: dataStream.keyBy(_.key).countWindowAll(..) > But apparently countWindowAll doesn't work on partitioned data. > > So, my last version is: > dataStream.keyBy(_.key.hashCode % 4).countWindow(..) > But is looks kida hacky with hardcoded partitions number. > > So, what's the right way of doing it? > > Best regards, > Dmitry