Ok great, thanks Lukasz. I will try turning off the timestamp attribute on
some of these jobs then!

On Thu, Aug 3, 2017 at 10:14 PM, Lukasz Cwik <[email protected]> wrote:

> To my knowledge, autoscaling is dependent on how many messages are
> backlogged within Pubsub and independent of the second subscription (the
> second subscription is really to compute a better watermark).
>
> On Thu, Aug 3, 2017 at 1:34 PM, <[email protected]> wrote:
>
>> Thanks Lukasz that's good to know! It sounds like we can halve our PubSub
>> costs then!
>>
>> Just to clarify, if I were to remove withTimestampAttribute from a job,
>> this would cause the watermark to always be up to date (processing time)
>> even if the job starts to lag behind (build up of unacknowledged PubSub
>> messages). In this case would Dataflow's autoscaling still scale up? I
>> thought the reason the autoscaler scales up is due to the watermark lagging
>> behind, but is it also aware of the acknowledged PubSub messages?
>>
>> On 3 Aug 2017, at 18:58, Lukasz Cwik <[email protected]> wrote:
>>
>> You understanding is correct - the data watermark will only matter for
>> windowing. It will not affect auto-scaling. If the pipeline is not doing
>> any windowing, triggering, etc then there is no need to pay for the cost of
>> the second subscription.
>>
>> On Thu, Aug 3, 2017 at 8:17 AM, Josh <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> We've been running a few streaming Beam jobs on Dataflow, where each job
>>> is consuming from PubSub via PubSubIO. Each job does something like this:
>>>
>>> PubsubIO.readMessagesWithAttributes()
>>>             .withIdAttribute("unique_id")
>>>             .withTimestampAttribute("timestamp");
>>>
>>> My understanding of `withTimestampAttribute` is that it means we use the
>>> timestamp on the PubSub message as Beam's concept of time (the watermark) -
>>> so that any windowing we do in the job uses "event time" rather than
>>> "processing time".
>>>
>>> My question is: is my understanding correct, and does using
>>> `withTimestampAttribute` have any effect in a job that doesn't do any
>>> windowing? I have a feeling it may also have an effect on Dataflow's
>>> autoscaling, since I think Dataflow scales up when the watermark timestamp
>>> lags behind, but I'm not sure about this.
>>>
>>> The reason I'm concerned about this is because we've been using it in
>>> all our Dataflow jobs, and have now realised that whenever
>>> `withTimestampAttribute` is used, Dataflow creates an additional PubSub
>>> subscription (suffixed with `__streaming_dataflow_internal`), which
>>> appears to be doubling PubSub costs (since we pay per subscription)! So I
>>> want to remove `withTimestampAttribute` from jobs where possible, but want
>>> to first understand the implications.
>>>
>>> Thanks for any advice,
>>> Josh
>>>
>>
>>
>

Reply via email to