Hey Lukasz,

I have a follow-up question about this -

What if I want to do something very similar, but with 4 instances of a
DoFn that I've written following the partition transform, instead of 4
instances of AvroIO? I want to ensure that each partition is processed by
a single DoFn instance/thread. Is this possible with Beam?
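
For example, something along these lines, where MyDoFn stands in for my
own DoFn and the partition function is the same deterministic key hash as
before (the names are just placeholders):

import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollectionList;

// Fan the stream out into 4 partitions by a deterministic hash of the key.
PCollectionList<MyRecord> parts = input.apply(
    Partition.of(4, (MyRecord r, int numPartitions) ->
        Math.floorMod(r.getKey().hashCode(), numPartitions)));

// Apply my DoFn to each partition in place of the AvroIO writers.
for (int i = 0; i < parts.size(); i++) {
  parts.get(i).apply("Process" + i, ParDo.of(new MyDoFn()));
}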

Thanks,
Josh



On Wed, May 24, 2017 at 6:15 PM, Josh <jof...@gmail.com> wrote:

> Ahh I see - Ok I'll try out this solution then. Thanks Lukasz!
>
> On Wed, May 24, 2017 at 5:20 PM, Lukasz Cwik <lc...@google.com> wrote:
>
>> Google Cloud Dataflow won't override your setting. Dynamic sharding
>> only occurs if you don't explicitly set a numShards value.
>>
>> On Wed, May 24, 2017 at 9:14 AM, Josh <jof...@gmail.com> wrote:
>>
>>> Hi Lukasz,
>>>
>>> Thanks for the example. That sounds like a nice solution -
>>> I am running on Dataflow though, which sets numShards dynamically - so
>>> if I set numShards to 1 on each of those AvroIO writers, can I be sure
>>> that Dataflow won't override my setting? I guess this should work fine
>>> as long as I partition my stream into a large enough number of
>>> partitions that Dataflow won't need to override numShards.
>>>
>>> Josh
>>>
>>>
>>> On Wed, May 24, 2017 at 4:10 PM, Lukasz Cwik <lc...@google.com> wrote:
>>>
>>>> Since you're using a small number of shards, add a Partition transform
>>>> that uses a deterministic hash of the key to choose one of 4 partitions,
>>>> then write each partition with a single shard.
>>>>
>>>> (Fixed width diagram below)
>>>> Pipeline -> AvroIO(numShards = 4)
>>>> Becomes:
>>>> Pipeline -> Partition --> AvroIO(numShards = 1)
>>>>                       |-> AvroIO(numShards = 1)
>>>>                       |-> AvroIO(numShards = 1)
>>>>                       \-> AvroIO(numShards = 1)
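>>>>
>>>> In Java, that layout looks roughly like this sketch (MyRecord,
>>>> getKey(), and the output paths are placeholders for your own types
>>>> and locations; keep whatever windowing and filename settings you
>>>> already use):
>>>>
>>>> import org.apache.beam.sdk.io.AvroIO;
>>>> import org.apache.beam.sdk.transforms.Partition;
>>>> import org.apache.beam.sdk.values.PCollectionList;
>>>>
>>>> // Choose one of 4 partitions via a deterministic hash of the key.
>>>> PCollectionList<MyRecord> parts = records.apply(
>>>>     Partition.of(4, (MyRecord r, int numPartitions) ->
>>>>         Math.floorMod(r.getKey().hashCode(), numPartitions)));
>>>>
>>>> // Give each partition its own writer with a single shard.
>>>> for (int i = 0; i < parts.size(); i++) {
>>>>   parts.get(i).apply("WritePartition" + i,
>>>>       AvroIO.write(MyRecord.class)
>>>>           .to("gs://my-bucket/output/part-" + i)
>>>>           .withWindowedWrites()
>>>>           .withNumShards(1));
>>>> }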
>>>>
>>>> On Wed, May 24, 2017 at 1:05 AM, Josh <jof...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am using a FileBasedSink (AvroIO.write) on an unbounded stream
>>>>> (withWindowedWrites, hourly windows, numShards=4).
>>>>>
>>>>> I would like to partition the stream by some key in the element, so
>>>>> that all elements with the same key will get processed by the same shard
>>>>> writer, and therefore written to the same file. Is there a way to do this?
>>>>> Note that in my stream the number of keys is very large (most
>>>>> elements have a unique key, while a few elements share a key).
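>>>>>
>>>>> For reference, my write currently looks roughly like this (MyRecord
>>>>> and the output path are placeholders for my own type and location):
>>>>>
>>>>> import org.apache.beam.sdk.io.AvroIO;
>>>>> import org.apache.beam.sdk.transforms.windowing.FixedWindows;
>>>>> import org.apache.beam.sdk.transforms.windowing.Window;
>>>>> import org.joda.time.Duration;
>>>>>
>>>>> records
>>>>>     // Hourly fixed windows on the unbounded stream.
>>>>>     .apply(Window.<MyRecord>into(
>>>>>         FixedWindows.of(Duration.standardHours(1))))
>>>>>     // Windowed Avro writes, 4 shards per window.
>>>>>     .apply(AvroIO.write(MyRecord.class)
>>>>>         .to("gs://my-bucket/output/")
>>>>>         .withWindowedWrites()
>>>>>         .withNumShards(4));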
>>>>>
>>>>> Thanks,
>>>>> Josh
>>>>>
>>>>
>>>>
>>>
>>
>
