Re: Beam on Flink not processing input splits in parallel

2022-03-09 Thread Jan Lukavský
There are two "kinds" of splits in SDF - one splits the restriction *before* being processed and the other *during* processing. The first one is supported (it is needed for correctness) and the other is in bounded case only an optimization (which is not currently supported). It seems to me, tha

Re: Beam on Flink not processing input splits in parallel

2022-03-09 Thread Janek Bevendorff
Thanks for the response! That's what I feared was going on. I consider this a huge shortcoming, particularly because it does not only affect users with large files like you said. The same happens with many small files, because file globs are also fused to one worker. The only way to process fi

Re: Beam on Flink not processing input splits in parallel

2022-03-09 Thread Jan Lukavský
Hi Janek, I think you hit a deficiency in the FlinkRunner's SDF implementation. AFAIK the runner is unable to do dynamic splitting, which is what you are probably looking for. What you describe essentially works in the model, but FlinkRunner does not implement the complete contract to make us

Re: Beam on Flink not processing input splits in parallel

2022-03-09 Thread Janek Bevendorff
I went through all Flink and Beam documentation I could find to see if I overlooked something, but I could not get the text input source to unfuse the file input splits. This creates a huge input bottleneck, because one worker is busy reading records from a huge input file while 99 others wait

Beam on Flink not processing input splits in parallel

2022-03-08 Thread Janek Bevendorff
Hey there, According to the docs, when using a FileBasedSource or a splittable DoFn, the runner is free to initiate splits that can be run in parallel. As far as I can tell, the former is actually happening on my Apache Flink cluster, but the latter is not. This causes a single Taskmanager to