Why is GroupBy involved in the file save operation?

Tao Li Fri, 21 May 2021 16:16:03 -0700

Hi Beam community,

I wonder why a GroupBy operation is involved in WriteFiles: 
https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/WriteFiles.html


The doc says: “The exact parallelism of the write stage can be controlled
using withNumShards(int)
<https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/WriteFiles.html#withNumShards-int->,
typically used to control how many files are produced or to globally limit the
number of workers connecting to an external service. However, this option can
often hurt performance: it adds an additional GroupByKey
<https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/transforms/GroupByKey.html>
to the pipeline.”
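
To make the quoted behavior concrete, here is my rough mental model of it,
written as plain Python rather than Beam code. This is only a sketch under my
own assumptions, not the actual WriteFiles implementation: the helper names
(assign_shards, group_by_key, shard_file_contents) are made up for
illustration, and the round-robin key assignment is a guess at one possible
strategy.

```python
from collections import defaultdict


def assign_shards(elements, num_shards):
    """Tag each element with a shard key, round-robin (hypothetical helper)."""
    return [(index % num_shards, element) for index, element in enumerate(elements)]


def group_by_key(keyed_elements):
    """Collect all values that share a key, which is the job a GroupByKey does."""
    groups = defaultdict(list)
    for key, value in keyed_elements:
        groups[key].append(value)
    return dict(groups)


def shard_file_contents(elements, num_shards):
    """Map each shard index to the records a single writer would put in that file."""
    return group_by_key(assign_shards(elements, num_shards))


# Five records split across exactly two output "files".
records = ["a", "b", "c", "d", "e"]
files = shard_file_contents(records, num_shards=2)
```

In this model the grouping step is what brings every record for a given shard
together, so that exactly num_shards files come out regardless of how the
input happened to be distributed.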

When we are saving the PCollection into multiple files, why can’t we simply 
split the PCollection without a key and save each split as a file?

Thanks!
