If the data does not have a key (or you do not care about it) you can also
use a FlatMapFunction (or ProcessFunction) with Operator State. Operator
State is not bound to a key but to a parallel operator instance. Have a
look at the ListCheckpointed interface and its JavaDocs.
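That suggestion might look like the following minimal sketch (assuming a Flink 1.x setup where the ListCheckpointed interface is available; the actual DynamoDB BatchGetItem call is left out as a placeholder):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.streaming.api.checkpoint.ListCheckpointed;
import org.apache.flink.util.Collector;

// Buffers records in operator state (bound to the parallel instance, not to
// a key) and emits them in batches of 100 -- e.g. for a DynamoDB batch get.
public class BatchBuffer extends RichFlatMapFunction<String, List<String>>
        implements ListCheckpointed<String> {

    private static final int BATCH_SIZE = 100;
    private final List<String> buffer = new ArrayList<>();

    @Override
    public void flatMap(String value, Collector<List<String>> out) {
        buffer.add(value);
        if (buffer.size() >= BATCH_SIZE) {
            out.collect(new ArrayList<>(buffer)); // hand the batch downstream
            buffer.clear();
        }
    }

    @Override
    public List<String> snapshotState(long checkpointId, long timestamp) {
        return buffer; // buffered records survive failures via operator state
    }

    @Override
    public void restoreState(List<String> state) {
        buffer.addAll(state); // state is redistributed on restore/rescaling
    }
}
```

Because the state is per parallel instance, no keyBy() is needed and the existing partitioning of the stream is kept.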
2017-06-23 18:27 GMT+
So there is no way to do a countWindow(100) and preserve data locality?
My use case is this: augment a data stream with new fields from DynamoDB
lookup. DynamoDB allows batch gets of up to 100 records, so I am trying to
collect 100 records before making that call. I have no other reason to do a
r
No, you will lose data locality if you use keyBy(), which is the only way
to obtain a KeyedStream.
2017-06-23 17:52 GMT+02:00 Edward :
Thanks, Fabian.
In this case, I could just extend your idea by creating some deterministic
multiplier of the subtask index:
RichMapFunction<String, Tuple2<Integer, String>> keyByMap = new
RichMapFunction<String, Tuple2<Integer, String>>() {
    public Tuple2<Integer, String> map(String value) {
        int indexOfCounter = Math.abs(value.hashCode()) % 4
Flink hashes the keys and computes the target partition using modulo. This
works well, if you have many keys but leads to skew if the number of keys
is close to the number of partitions.
If you use partitionCustom, you can explicitly define the target partition;
however, partitionCustom does not return a KeyedStream.
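The skew effect can be seen with a simplified pure-Java model of hash partitioning (key, hashCode, modulo; Flink's real routing additionally goes through key groups, but the effect with few distinct keys is the same):

```java
import java.util.Map;
import java.util.TreeMap;

public class PartitionSkew {
    // Simplified model: the key's hash modulo the partition count decides
    // the target partition.
    static int targetPartition(Object key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 4;
        // Four distinct keys for four partitions: hash collisions send all
        // of them to partition 1 and leave three partitions idle.
        String[] keys = {"a", "e", "i", "m"};
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String k : keys) {
            counts.merge(targetPartition(k, partitions), 1, Integer::sum);
        }
        System.out.println(counts); // {1=4}
    }
}
```

With many distinct keys the modulo spreads the load evenly; with a handful of keys, a single collision pattern like this one already halves or worse the usable parallelism.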
Hi Fabian -
I've tried this idea of creating a KeyedStream based on
getRuntimeContext().getIndexOfThisSubtask(). However, not all target
subtasks are receiving records.
Both operators have a parallelism of 12, so I have 12 source subtasks and 12
target subtasks. I've confirmed that the call to getI
If I understood you correctly, you want to compute windows in parallel
without using a key.
Are you aware that the results of such a computation are not deterministic
and kind of arbitrary?
If that is still OK for you, you can use a mapper to assign the current
parallel index as a key field, i.e.,
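A minimal sketch of that mapper (Flink 1.x Java API; the downstream keyBy/window wiring is assumed):

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;

// Tags each record with the index of the parallel subtask that processed
// it, so the stream can afterwards be keyed on that index field.
public class SubtaskKeyAssigner
        extends RichMapFunction<String, Tuple2<Integer, String>> {
    @Override
    public Tuple2<Integer, String> map(String value) {
        return Tuple2.of(getRuntimeContext().getIndexOfThisSubtask(), value);
    }
}
```

Something like stream.map(new SubtaskKeyAssigner()).keyBy(0).countWindow(100) then builds per-subtask batches. Note the caveat discussed elsewhere in this thread: the index is hashed like any other key, so the window task for index i is not guaranteed to run on subtask i, and true data locality is still not preserved.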
Thx, now I use element.hashCode() % nPartitions and it works as expected.
But I'm afraid it's not a best practice for just turning a plain (already
parallelized) DataStream into a KeyedStream? Because it introduces some
overhead due to physical repartitioning by key, which is unnecessary since
I d
Hi Yukun,
the problem is that the KeySelector is internally invoked multiple times.
Hence it must be deterministic, i.e., it must extract the same key for the
same object if invoked multiple times.
The documentation is not discussing this aspect and should be extended.
Thanks for pointing out this!
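The determinism requirement can be illustrated with plain Java function objects (no Flink dependency; the selector signatures here are illustrative, not Flink's KeySelector interface):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

public class KeySelectorDeterminism {
    // Safe: the same record always maps to the same key, no matter how
    // often the selector is invoked.
    static final Function<String, Integer> deterministic =
            s -> Math.abs(s.hashCode()) % 4;

    // Unsafe: a fresh random key per invocation. Since Flink may invoke the
    // selector several times per record, the record's key would differ
    // between calls -- the bug described above.
    static final Function<String, Integer> nonDeterministic =
            s -> ThreadLocalRandom.current().nextInt(4);

    public static void main(String[] args) {
        String record = "some-record";
        // A deterministic selector returns the same key on repeated calls.
        System.out.println(deterministic.apply(record)
                .equals(deterministic.apply(record))); // true
    }
}
```

This is also why selectors based on random numbers, timestamps, or mutable counters (like an incrementing field) break keyed processing even though they compile fine.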