Hey Lukasz, I have a follow-up question about this -
What if I want to do something very similar, but instead of 4 instances of
AvroIO following the Partition transform, I want 4 instances of a DoFn that
I've written. I want to ensure that each partition is processed by a single
DoFn instance/thread. Is this possible with Beam?

Thanks,
Josh

On Wed, May 24, 2017 at 6:15 PM, Josh <jof...@gmail.com> wrote:

> Ahh I see - Ok, I'll try out this solution then. Thanks Lukasz!
>
> On Wed, May 24, 2017 at 5:20 PM, Lukasz Cwik <lc...@google.com> wrote:
>
>> Google Cloud Dataflow won't override your setting. The dynamic sharding
>> occurs only if you don't explicitly set a numShards value.
>>
>> On Wed, May 24, 2017 at 9:14 AM, Josh <jof...@gmail.com> wrote:
>>
>>> Hi Lukasz,
>>>
>>> Thanks for the example. That sounds like a nice solution -
>>> I am running on Dataflow though, which dynamically sets numShards - so
>>> if I set numShards to 1 on each of those AvroIO writers, can I be sure
>>> that Dataflow isn't going to override my setting? I guess this should
>>> work fine as long as I partition my stream into a large enough number
>>> of partitions so that Dataflow won't override numShards.
>>>
>>> Josh
>>>
>>> On Wed, May 24, 2017 at 4:10 PM, Lukasz Cwik <lc...@google.com> wrote:
>>>
>>>> Since you're using a small number of shards, add a Partition transform
>>>> which uses a deterministic hash of the key to choose one of 4
>>>> partitions. Write each partition with a single shard.
>>>>
>>>> (Fixed-width diagram below)
>>>> Pipeline -> AvroIO(numShards = 4)
>>>> Becomes:
>>>> Pipeline -> Partition --> AvroIO(numShards = 1)
>>>>                       |-> AvroIO(numShards = 1)
>>>>                       |-> AvroIO(numShards = 1)
>>>>                       \-> AvroIO(numShards = 1)
>>>>
>>>> On Wed, May 24, 2017 at 1:05 AM, Josh <jof...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am using a FileBasedSink (AvroIO.write) on an unbounded stream
>>>>> (withWindowedWrites, hourly windows, numShards=4).
>>>>>
>>>>> I would like to partition the stream by some key in the element, so
>>>>> that all elements with the same key will get processed by the same
>>>>> shard writer, and therefore written to the same file. Is there a way
>>>>> to do this? Note that in my stream the number of keys is very large
>>>>> (most elements have a unique key, while a few elements share a key).
>>>>>
>>>>> Thanks,
>>>>> Josh
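The hash-based routing Lukasz describes can be sketched outside of Beam as a plain partition function. This is a minimal plain-Java illustration, assuming String keys; in a real pipeline this logic would go inside the PartitionFn you pass to Beam's Partition transform, and the class and method names here are made up for the example:

```java
// Deterministic key-to-partition mapping: hash the key and map it onto
// one of numPartitions buckets. Every element with the same key lands
// in the same partition, so (with numShards = 1 per partition) in the
// same output file for a given window.
public class KeyPartitioner {
    private final int numPartitions;

    public KeyPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Math.floorMod keeps the result in [0, numPartitions) even when
    // hashCode() is negative.
    public int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        KeyPartitioner p = new KeyPartitioner(4);
        // The same key always maps to the same partition index...
        System.out.println(p.partitionFor("user-42") == p.partitionFor("user-42"));
        // ...and every index falls in [0, 4).
        int idx = p.partitionFor("another-key");
        System.out.println(idx >= 0 && idx < 4);
    }
}
```

Note that String.hashCode() is specified by the Java language and stable across JVMs, so it is safe to use across workers; for other key types you may want an explicit hash function to keep the mapping deterministic.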