Hi Dmitry,

In all cases, the result of the countWindow will be also grouped by key because 
of 
the keyBy() that you  are using.

If you want to have a non-keyed stream and then split it in count windows, 
remove 
the keyBy() and instead of countWindow(), use countWindowAll(). This will have
parallelism 1 but then you can repartition your stream so that the downstream 
operators have higher parallelism.

Hope this helps,
Kostas

> On Jan 23, 2017, at 11:05 AM, Dmitry Golubets <dgolub...@gmail.com> wrote:
> 
> Hi,
> 
> I'm looking for the right way to do the following scheme:
> 
> 1. Read data
> 2. Split it into partitions for parallel processing
> 3. In every partition group data in N elements batches
> 4. Process these batches
> 
> My first attempt was: dataStream.keyBy(_.key).countWindow(..)
> But countWindow groups by a key. I however want to group all elements in 
> partition.
> 
> Then I tried: dataStream.keyBy(_.key).countWindowAll(..)
> But apparently countWindowAll doesn't work on partitioned data.
> 
> So, my last version is: 
> dataStream.keyBy(_.key.hashCode % 4).countWindow(..)
> But is looks kida hacky with hardcoded partitions number.
> 
> So, what's the right way of doing it?
> 
> Best regards,
> Dmitry

Reply via email to