Thanks Jan, that clears things up!

On Fri, Jan 17, 2025 at 3:20 AM Jan Lukavský <[email protected]> wrote:

> Hi Joey,
>
> the fanout parameter accepts two types of values (besides None):
>
>   a) int constant
>
>   b) function mapping a key to int
>
> Both types result in a mapping from key to integer, where this integer
> signals how many downstream copies (or "buckets") there should be for a
> given key. If there is more than one bucket for a given key, then a
> random bucket is chosen for each input key-value pair and the pair is
> sent to that bucket for processing. The buckets are then combined to
> produce the final value per key. As a consequence, if the "fanout" is
> too high for a scarce key, there might be no combining at all within
> the buckets (every instance of that key ends up in its own bucket).
> This results in increased shuffling and lower efficiency.
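
The bucketing described above can be sketched in plain Python. This is only a simulation of the described behavior, not Beam's actual implementation: the helper names (resolve_fanout, combine_per_key_with_fanout) are hypothetical, and it assumes the combiner (here the built-in sum) can be re-applied to its own partial results.

```python
import random
from collections import defaultdict


def resolve_fanout(fanout, key):
    """Number of buckets for `key`; `fanout` is None, an int, or key -> int."""
    if fanout is None:
        return 1
    n = fanout(key) if callable(fanout) else fanout
    return max(1, int(n))


def combine_per_key_with_fanout(pairs, combine_fn, fanout, rng=None):
    """Hypothetical two-stage combine mimicking the fanout idea.

    Returns (final value per key, number of non-empty buckets per key).
    """
    if rng is None:
        rng = random.Random()

    # Stage 1: send each pair to a randomly chosen bucket of its key.
    buckets = defaultdict(list)
    for key, value in pairs:
        bucket = rng.randrange(resolve_fanout(fanout, key))
        buckets[(key, bucket)].append(value)

    # Pre-combine the values inside each bucket.
    partials = defaultdict(list)
    for (key, _bucket), values in buckets.items():
        partials[key].append(combine_fn(values))

    # Stage 2: merge the per-bucket partial results into one value per key.
    totals = {key: combine_fn(parts) for key, parts in partials.items()}
    buckets_used = {key: len(parts) for key, parts in partials.items()}
    return totals, buckets_used


# A hot key with many values and a scarce key with only three.
demo_pairs = [("hot", 1)] * 1000 + [("rare", 1)] * 3

# Form (b): a function mapping key -> int; only the hot key is split up.
totals, buckets_used = combine_per_key_with_fanout(
    demo_pairs, sum, lambda key: 16 if key == "hot" else 1
)
print(totals)        # {'hot': 1000, 'rare': 3}
print(buckets_used)  # 'rare' uses exactly 1 bucket, 'hot' at most 16
```

Passing an int constant such as 1000 instead (form (a)) would give the scarce key up to 1000 buckets for its 3 values, so most likely each value sits alone in its bucket and stage 1 combines nothing for that key. In Beam itself both forms are passed to CombinePerKey(...).with_hot_key_fanout(fanout).
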
>
> Best,
>
>   Jan
>
> On 1/16/25 22:01, Joey Tran wrote:
> > Hi,
> >
> > I've read the documentation for CombinePerKey with hot key fanout and
> > I think I understand it at a high level (split up and combine sharded
> > keys before merging all values in one key) but I'm confused by the
> > parameter that this method takes and how it affects the behavior of
> > the transform.
> >
> > Is the fanout parameter the... number of shards per key? Am I thinking
> > about this right? Any help would be appreciated, thanks!
> >
> > Joey
> >
> > [1]
> > https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.CombinePerKey.with_hot_key_fanout
>
