There are two "kinds" of splits in SDF - one splits the restriction
*before* being processed and the other *during* processing. The first
one is supported (it is needed for correctness) and the other is in
bounded case only an optimization (which is not currently supported). It
seems to me, tha
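
To make the two split points concrete, here is a rough Java SDF sketch
(sizeOf() and readRecordAt() are made-up stand-ins for real file I/O,
and the 64 MB split size is arbitrary):

import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;

class ReadFileFn extends DoFn<String, String> {

  @GetInitialRestriction
  public OffsetRange initialRestriction(@Element String file) {
    return new OffsetRange(0, sizeOf(file));
  }

  // Split kind 1: the runner invokes this once, *before* processing starts.
  @SplitRestriction
  public void splitRestriction(@Restriction OffsetRange range,
                               OutputReceiver<OffsetRange> out) {
    for (OffsetRange part : range.split(64 * 1024 * 1024, 1)) {
      out.output(part);
    }
  }

  // Split kind 2 (dynamic): the runner calls trySplit() on the tracker
  // *while* this loop runs; the DoFn itself only claims positions.
  // OffsetRange brings a default tracker, so no @NewTracker is needed.
  @ProcessElement
  public void process(@Element String file,
                      RestrictionTracker<OffsetRange, Long> tracker,
                      OutputReceiver<String> out) {
    for (long pos = tracker.currentRestriction().getFrom();
         tracker.tryClaim(pos);
         pos++) {
      out.output(readRecordAt(file, pos));
    }
  }

  // Hypothetical helpers so the sketch compiles:
  private long sizeOf(String file) { return 1L << 30; }
  private String readRecordAt(String file, long pos) { return file + "@" + pos; }
}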
Thanks for the response! That's what I feared was going on.
I consider this a huge shortcoming, particularly because it does not
affect only users with large files, as you said. The same happens with
many small files, because file globs are also fused onto one worker. The
only way to process fi
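
For the many-small-files case, the usual fusion-break workaround is a
Reshuffle between matching the glob and reading the files, roughly like
this (the filepattern is made up, and note this spreads whole files
across workers but still does not split within a single file):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

Pipeline p = Pipeline.create();
PCollection<String> lines =
    p.apply(FileIO.match().filepattern("gs://my-bucket/input/*.txt"))
     // Break fusion: redistribute the matched files so the per-file
     // reads do not all run on the worker that performed the matching.
     .apply(Reshuffle.viaRandomKey())
     .apply(FileIO.readMatches())
     .apply(TextIO.readFiles());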
Hi Janek,
I think you hit a deficiency in the FlinkRunner's SDF implementation.
AFAIK the runner is unable to do dynamic splitting, which is what you
are probably looking for. What you describe essentially works in the
model, but FlinkRunner does not implement the complete contract to make
us
I went through all the Flink and Beam documentation I could find to see
whether I had overlooked something, but I could not get the text input
source to unfuse the file input splits. This creates a huge input
bottleneck: one worker is busy reading records from a huge input file
while 99 others wait
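
A related knob worth checking (it is not guaranteed to change fusion
behavior on the FlinkRunner) is TextIO's many-files hint, which makes
the read expand into a match step plus per-file reads instead of a
single monolithic source; sketched here with a made-up path:

import org.apache.beam.sdk.io.TextIO;

// withHintMatchesManyFiles() tells TextIO to expect many files and
// expand into match + per-file reads.
p.apply(TextIO.read()
    .from("hdfs:///data/input-*.txt")
    .withHintMatchesManyFiles());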
Hey there,
According to the docs, when using a FileBasedSource or a splittable
DoFn, the runner is free to initiate splits that can be run in parallel.
As far as I can tell, the former is actually happening on my Apache
Flink cluster, but the latter is not. This causes a single TaskManager
to
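
A minimal version of the pipeline in question might look roughly like
this (the path is hypothetical, with runner flags passed on the command
line):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class Repro {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).create());
    // One huge file: the source is split up front (initial splits),
    // but without dynamic splitting the remaining work of a slow split
    // cannot be rebalanced across TaskManagers.
    p.apply(TextIO.read().from("hdfs:///data/huge-input.txt"));
    p.run().waitUntilFinish();
  }
}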