> This would mean two shuffles, and one node might become a bottleneck if
> one topic has too much data?
Yes.
> Is there a way to avoid shuffling at all (or do only one shuffle) and
> avoid a situation where one node becomes a hotspot?
Do you know the amount of data per Kafka topic beforehand, or does this
have to be dynamic?
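For the "known beforehand" case, a minimal sketch (not from the thread
itself) of one way to answer that question: a static topic-to-channel map
plus DataStream#partitionCustom gives exactly one shuffle and lets a heavy
topic get a dedicated subtask. The Event class and the TOPIC_TO_CHANNEL
contents are hypothetical.

import java.util.Map;

import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.streaming.api.datastream.DataStream;

public class StaticTopicRouting {

    // Hypothetical event type standing in for the job's record class.
    public static class Event {
        public String topic;
        public String payload;
    }

    // Hypothetical static assignment derived from known per-topic volumes:
    // the big topic gets its own channel, small topics share one.
    private static final Map<String, Integer> TOPIC_TO_CHANNEL = Map.of(
            "big-topic", 0,
            "medium-topic", 1,
            "small-topic-a", 2,
            "small-topic-b", 2);

    public static DataStream<Event> route(DataStream<Event> events) {
        return events.partitionCustom(
                (Partitioner<String>) (topic, numPartitions) ->
                        TOPIC_TO_CHANNEL.getOrDefault(topic, 0) % numPartitions,
                event -> event.topic);
    }
}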
This would mean two shuffles, and one node might become a bottleneck if one
topic has too much data? Is there a way to avoid shuffling at all (or do
only one shuffle) and avoid a situation where one node becomes a hotspot?

Alex

On Thu, Jun 25, 2020 at 8:05 AM Kostas Kloudas wrote:

> Hi Alexander,
>
> Routing of input data in Flink can be done through keying [...]
Hi Alexander,

Routing of input data in Flink can be done through keying, and this can
guarantee collocation constraints. This means that you can send two
records to the same node by giving them the same key, e.g. the topic
name. Keep in mind that elements with different keys do not
necessarily go to different nodes.
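A minimal sketch of that suggestion, assuming a hypothetical Event class
with a topic field and a hypothetical output path: keyBy collocates all
records of one topic on one subtask, and a bucket assigner keyed on the
topic gives each topic its own directory, so each directory only ever sees
part files from a single subtask.

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

public class TopicKeyedSink {

    // Hypothetical event type standing in for the job's record class.
    public static class Event {
        public String topic;
        public String payload;

        @Override
        public String toString() { return payload; }
    }

    public static void attachSink(DataStream<Event> events) {
        StreamingFileSink<Event> sink = StreamingFileSink
                .forRowFormat(new Path("gs://my-bucket/output"), // hypothetical path
                              new SimpleStringEncoder<Event>("UTF-8"))
                .withBucketAssigner(new BucketAssigner<Event, String>() {
                    @Override
                    public String getBucketId(Event element, Context context) {
                        return element.topic; // one directory per topic
                    }

                    @Override
                    public SimpleVersionedSerializer<String> getSerializer() {
                        return SimpleVersionedStringSerializer.INSTANCE;
                    }
                })
                .build();

        events
                .keyBy(event -> event.topic) // same key -> same subtask
                .addSink(sink);
    }
}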
Maybe I am misreading the documentation, but:

"Data within the partition directories are split into part files. Each
partition will contain at least one part file for each subtask of the sink
that has received data for that partition."

So, it is at least one part file per subtask for each partition. I'm trying
to figure out how to end up with a single file per topic.
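A hedged illustration of what that quoted paragraph implies (bucket layout
and paths hypothetical): with parallelism 3, if keying routes all of topic
"a" to subtask 0 and all of topic "b" to subtask 2, each topic directory
only receives part files from the one subtask that got its data, e.g.

gs://my-bucket/output/a/part-0-0
gs://my-bucket/output/b/part-2-0

instead of up to one part file per subtask in every directory.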
You can achieve this in Flink 1.10 using the StreamingFileSink.
I’d also like to note that Flink 1.11 (which is currently going through
release testing and should be available imminently) has support for exactly
this functionality in the Table API.
https://ci.apache.org/projects/flink/flink-docs-
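A hedged sketch of the Flink 1.11 Table API route mentioned above, using
the filesystem connector's PARTITIONED BY support; the table name, columns,
and output path are hypothetical.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PartitionedFsSink {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // One output directory per distinct value of the partition column.
        tEnv.executeSql(
                "CREATE TABLE dfs_sink (" +
                "  topic   STRING," +
                "  payload STRING" +
                ") PARTITIONED BY (topic) WITH (" +
                "  'connector' = 'filesystem'," +
                "  'path'      = 'gs://my-bucket/output'," + // hypothetical path
                "  'format'    = 'json'" +
                ")");

        // A matching Kafka source table would then feed it, e.g.:
        // tEnv.executeSql("INSERT INTO dfs_sink SELECT topic, payload FROM kafka_source");
    }
}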
Hello!

We are working on a Flink streaming job that reads data from multiple Kafka
topics and writes them to DFS. We are using StreamingFileSink with a custom
implementation for the GCS FS, and it generates a lot of files as streams
are partitioned among multiple subtasks. In the ideal case we would have at
most one file per Kafka topic.
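A hedged reconstruction of the setup described above, so the replies have
something concrete to refer to: one consumer over several Kafka topics,
written to DFS via StreamingFileSink. The topic names, bootstrap address,
and output path are hypothetical.

import java.util.Arrays;
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaToDfsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // hypothetical

        DataStream<String> events = env.addSource(new FlinkKafkaConsumer<>(
                Arrays.asList("topic-a", "topic-b", "topic-c"), // hypothetical topics
                new SimpleStringSchema(),
                props));

        // Without any routing, every subtask that receives data opens its
        // own part files, which is where the large file count comes from.
        events.addSink(StreamingFileSink
                .forRowFormat(new Path("gs://my-bucket/output"), // hypothetical
                              new SimpleStringEncoder<String>("UTF-8"))
                .build());

        env.execute("kafka-to-dfs");
    }
}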