Hi Dmitry,

The third version is the way to go, IMO. You might want to use a larger number of partitions if you plan to increase the parallelism of the job later. Also note that it is not guaranteed that 4 keys are uniformly distributed across 4 tasks. It can happen that one task ends up with two keys and another with none.
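To see why the distribution is not guaranteed to be uniform, here is a minimal plain-Scala sketch of the pigeonhole effect (the key names are made up; Flink's real keyBy() hashes through murmur hash and key groups rather than `hashCode % parallelism`, so this only illustrates the principle, not Flink's exact assignment):

```scala
// Four distinct keys hashed into four buckets: two keys can collide
// on one bucket while another bucket stays empty.
val parallelism = 4
val keys = Seq("a", "b", "c", "e") // hypothetical key values
val assignment = keys.groupBy(k => math.abs(k.hashCode % parallelism))
assignment.foreach { case (task, ks) =>
  println(s"task $task -> ${ks.mkString(", ")}")
}
// With these particular keys, "a" and "e" land on the same bucket
// and one bucket receives no keys at all.
```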
If you want more control, you can use partitionCustom (which does not produce a KeyedStream) and a stateful Map or FlatMap function to do the aggregation yourself.

Best, Fabian

2017-01-23 11:12 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
> Hi Dmitry,
>
> In all cases, the result of the countWindow will also be grouped by key
> because of the keyBy() that you are using.
>
> If you want to have a non-keyed stream and then split it into count windows,
> remove the keyBy() and use countWindowAll() instead of countWindow(). This
> will have *parallelism 1*, but you can then repartition your stream so that
> the downstream operators have higher parallelism.
>
> Hope this helps,
> Kostas
>
> On Jan 23, 2017, at 11:05 AM, Dmitry Golubets <dgolub...@gmail.com> wrote:
>
> Hi,
>
> I'm looking for the right way to implement the following scheme:
>
> 1. Read data
> 2. Split it into partitions for parallel processing
> 3. In every partition, group data into batches of N elements
> 4. Process these batches
>
> My first attempt was:
> *dataStream.keyBy(_.key).countWindow(..)*
> But countWindow groups by key. I, however, want to group all elements in a
> partition.
>
> Then I tried:
> *dataStream.keyBy(_.key).countWindowAll(..)*
> But apparently countWindowAll doesn't work on partitioned data.
>
> So, my last version is:
> *dataStream.keyBy(_.key.hashCode % 4).countWindow(..)*
> But it looks kinda hacky with a hardcoded number of partitions.
>
> So, what's the right way of doing it?
>
> Best regards,
> Dmitry
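The core of the partitionCustom + stateful FlatMap route is just per-instance count batching. Here is a minimal, framework-free sketch of that buffering logic (names like `CountBatcher` are my own; in Flink you would wrap this in a RichFlatMapFunction applied after `dataStream.partitionCustom(partitioner, _.key)`, and note that plain instance state like this buffer is not fault-tolerant unless you also store it in operator state via CheckpointedFunction):

```scala
import scala.collection.mutable.ArrayBuffer

// Accumulates elements and emits a full batch every `batchSize` elements.
// This is the logic a stateful FlatMap would run on each parallel instance.
class CountBatcher[T](batchSize: Int) {
  private val buffer = ArrayBuffer.empty[T]

  // Returns Some(batch) when a batch of `batchSize` elements is complete,
  // None while still accumulating.
  def add(element: T): Option[List[T]] = {
    buffer += element
    if (buffer.size >= batchSize) {
      val batch = buffer.toList
      buffer.clear()
      Some(batch)
    } else None
  }
}

val batcher = new CountBatcher[Int](3)
// Feeding 7 elements emits two batches of 3; the 7th element stays buffered.
val emitted = (1 to 7).flatMap(batcher.add)
println(emitted) // Vector(List(1, 2, 3), List(4, 5, 6))
```

Unlike the `hashCode % 4` workaround, the number of partitions here follows the operator's parallelism automatically, so scaling the job up does not require touching the code.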