Question about Spark Runner's Filter in ParDo

LDesire Sun, 22 Sep 2024 20:58:46 -0700

Hello Beam community.

I'm currently trying out the Spark Runner, and while going through the code 
I noticed that when it evaluates a ParDo operation, 
it applies more filter operations than seem necessary (from line 467 in 
TransformTranslator.java).


The original intent of this code seems to be to apply filters because a 
ParDo can have multiple outputs.
In other words, it makes sense to apply the filter operation when there are 
multiple outputs, but I believe that applying the filter operation when there 
is only one output actually degrades pipeline performance (because an equals 
comparison has to be evaluated for every element).


So I changed the code so that the filter is applied only when there are 
multiple outputs, and tested it.
I need to do more testing, but the change didn't affect the output and the 
results weren't bad.
If this looks reasonable, would it be OK to open a PR?
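
To illustrate the idea, here is a minimal sketch in plain Java. It is not the 
actual TransformTranslator code: the class ConditionalTagFilter, the method 
splitByTag, and the use of (tag, value) pairs in a List as a stand-in for the 
runner's tagged ParDo output are all hypothetical. It also assumes that with a 
single declared output, every element carries that one tag.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConditionalTagFilter {

    // Hypothetical stand-in for the tagged ParDo output: a list of (tag, value) pairs.
    static <T> Map<String, List<T>> splitByTag(
            List<Map.Entry<String, T>> tagged, List<String> outputTags) {
        Map<String, List<T>> outputs = new LinkedHashMap<>();

        if (outputTags.size() == 1) {
            // Single output: assume every element has this tag, so no
            // per-element equals check is needed; just take the values.
            List<T> values = new ArrayList<>();
            for (Map.Entry<String, T> e : tagged) {
                values.add(e.getValue());
            }
            outputs.put(outputTags.get(0), values);
            return outputs;
        }

        // Multiple outputs: filter once per tag, comparing each element's tag.
        for (String tag : outputTags) {
            List<T> values = new ArrayList<>();
            for (Map.Entry<String, T> e : tagged) {
                if (tag.equals(e.getKey())) {
                    values.add(e.getValue());
                }
            }
            outputs.put(tag, values);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> single =
                List.of(Map.entry("main", 1), Map.entry("main", 2));
        System.out.println(splitByTag(single, List.of("main")));

        List<Map.Entry<String, Integer>> multi =
                List.of(Map.entry("main", 1), Map.entry("side", 2), Map.entry("main", 3));
        System.out.println(splitByTag(multi, List.of("main", "side")));
    }
}
```

In the single-output branch the result is the same as in the filtered branch, 
but without one equals call per element, which is the saving I have in mind.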

Also, if I'm missing anything, I'd be grateful if you could let me know.

Cheers.
