Thanks Jan, that clears things up!

On Fri, Jan 17, 2025 at 3:20 AM Jan Lukavský <[email protected]> wrote:
> Hi Joey,
>
> the fanout parameter takes two types of parameters (besides None):
>
> a) an int constant
>
> b) a function mapping a key to an int
>
> Both types result in a mapping from key to integer, where this integer
> signals how many downstream copies (or "buckets") there should be for a
> given key. If there is more than one bucket for a given key, a random
> bucket is chosen for each input key-value pair and the pair is sent to
> the selected bucket for processing. These buckets are then combined
> together to produce the final value per key. As a consequence, if the
> fanout is too high for a scarce key, there may be no combining at all
> (every instance of that key ends up in its own bucket), which results in
> increased shuffling and lower efficiency.
>
> Best,
>
> Jan
>
> On 1/16/25 22:01, Joey Tran wrote:
> > Hi,
> >
> > I've read the documentation for CombinePerKey with hot key fanout and
> > I think I understand it at a high level (split up and combine sharded
> > keys before merging all values in one key), but I'm confused by the
> > parameter that this method takes and how it affects the behavior of
> > the transform.
> >
> > Is the fanout parameter the... number of shards per key? Am I thinking
> > about this right? Any help would be appreciated, thanks!
> >
> > Joey
> >
> > [1]
> > https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.CombinePerKey.with_hot_key_fanout
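
For anyone finding this thread later, here is a minimal sketch of both forms of the fanout argument with the Python SDK's CombinePerKey.with_hot_key_fanout (the method in the pydoc linked above); the pipeline, sample data, and step labels are made up for illustration:

import apache_beam as beam

# Hypothetical sample data: "hot_page" stands in for a hot key,
# "rare_page" for a scarce one.
with beam.Pipeline() as p:
    visits = p | "CreateVisits" >> beam.Create([
        ("hot_page", 1), ("hot_page", 1), ("hot_page", 1),
        ("rare_page", 1),
    ])

    # a) int constant: every key is pre-combined in 10 intermediate
    #    buckets before the final per-key merge.
    totals_const = (
        visits
        | "SumConstFanout" >> beam.CombinePerKey(sum).with_hot_key_fanout(10))

    # b) function mapping a key to an int: give extra buckets only to keys
    #    expected to be hot, and leave scarce keys in a single bucket.
    totals_fn = (
        visits
        | "SumFnFanout" >> beam.CombinePerKey(sum).with_hot_key_fanout(
            lambda key: 100 if key == "hot_page" else 1))

With the callable form, only the keys you expect to be hot pay the extra shuffle of the intermediate buckets, which matches Jan's point about scarce keys ending up with no combining if the fanout is too high.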
