Re: Why is GroupBy involved in the file save operation?

Reuven Lax Fri, 21 May 2021 16:46:23 -0700

On Fri, May 21, 2021 at 4:35 PM Tao Li <t...@zillow.com> wrote:

> Reuven thanks for your response.  GroupBy is not involved if we are not
> specifying fixed number of files, correct?
>


Correct.

>
>
> And what’s the implication of not specifying the shard number? Is the
> parallelism determined by the number of spark executors that hold data to
> save? This is assuming we are using spark runner.
>

Each bundle becomes a file. I'm not entirely sure how the spark runner
determines what the bundles should be.


>
>
> What would be the best practice? Specifying the fixed shard number or
> asking beam to figure it out for us?
>
>
>
> *From: *Reuven Lax <re...@google.com>
> *Reply-To: *"user@beam.apache.org" <user@beam.apache.org>
> *Date: *Friday, May 21, 2021 at 4:27 PM
> *To: *user <user@beam.apache.org>
> *Cc: *Lian Jiang <li...@zillowgroup.com>
> *Subject: *Re: Why is GroupBy involved in the file save operation?
>
>
>
> What you describe is what happens (at least in the Dataflow runner) if
> auto sharding is specified in batch. This mechanism tries to split the
> PColllection to fully utilize every worker, so is not appropriate when a
> fixed number of shards is desired. A GroupByKey is also necessary in
> streaming in order to split an unbounded PColllection using
> windows/triggers, as windows and triggers are applied during GroupByKey.
>
>
>
> On Fri, May 21, 2021 at 4:16 PM Tao Li <t...@zillow.com> wrote:
>
> Hi Beam community,
>
>
>
> I wonder why a GroupBy operation is involved in WriteFiles:
> https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/WriteFiles.html
> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbeam.apache.org%2Freleases%2Fjavadoc%2F2.29.0%2Forg%2Fapache%2Fbeam%2Fsdk%2Fio%2FWriteFiles.html&data=04%7C01%7Ctaol%40zillow.com%7Cb15e8362b9f84ca3794508d91cb00d54%7C033464830d1840e7a5883784ac50e16f%7C0%7C0%7C637572364724252236%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=rOcYW37RAqNkyPX%2BiuA60jBsMyk9OtNMy8y5D4hhNL8%3D&reserved=0>
>
>
>
> This doc mentioned “ The exact parallelism of the write stage can be
> controlled using withNumShards(int)
> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbeam.apache.org%2Freleases%2Fjavadoc%2F2.29.0%2Forg%2Fapache%2Fbeam%2Fsdk%2Fio%2FWriteFiles.html%23withNumShards-int-&data=04%7C01%7Ctaol%40zillow.com%7Cb15e8362b9f84ca3794508d91cb00d54%7C033464830d1840e7a5883784ac50e16f%7C0%7C0%7C637572364724262189%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Id%2BDaRGTvkksknEAOF5HZDhiOlaSsNluxfc38oGvoTA%3D&reserved=0>,
> typically used to control how many files are produced or to globally limit
> the number of workers connecting to an external service. However, this
> option can often hurt performance: it adds an additional GroupByKey
> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbeam.apache.org%2Freleases%2Fjavadoc%2F2.29.0%2Forg%2Fapache%2Fbeam%2Fsdk%2Ftransforms%2FGroupByKey.html&data=04%7C01%7Ctaol%40zillow.com%7Cb15e8362b9f84ca3794508d91cb00d54%7C033464830d1840e7a5883784ac50e16f%7C0%7C0%7C637572364724262189%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=unk2nGlUp5ey77DPFwEdceM%2FMQ30RjK0TtxgQzc7g4Q%3D&reserved=0>
>  to
> the pipeline.”
>
>
>
> When we are saving the PCollection into multiple files, why can’t we
> simply split the PCollection without a key and save each split as a file?
>
>
>
> Thanks!
>
>

Re: Why is GroupBy involved in the file save operation?

Reply via email to