To my knowledge, autoscaling depends on how many messages are backlogged in PubSub and is independent of the second subscription (the second subscription exists only to compute a better watermark).
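
As a rough illustration of that distinction, here is a toy model in plain Java. It is not Dataflow's autoscaler and uses no Beam API; the class and method names are made up for this sketch. It only shows that the event-time watermark lag and the backlog size are two separate quantities computed from the same pending messages, so a job can report one without the other.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these names come from Beam or Dataflow.
final class WatermarkVsBacklog {

    // Lag of an event-time watermark: "now" minus the oldest event timestamp
    // among the pending (unacknowledged) messages. Zero if nothing is pending.
    static Duration eventTimeLag(List<Instant> pendingEventTimes, Instant now) {
        Instant oldest = now;
        for (Instant t : pendingEventTimes) {
            if (t.isBefore(oldest)) {
                oldest = t;
            }
        }
        return Duration.between(oldest, now);
    }

    // Backlog size: the number of pending messages, regardless of their timestamps.
    // Per the reply above, this is the kind of signal autoscaling follows.
    static int backlog(List<Instant> pendingEventTimes) {
        return pendingEventTimes.size();
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2017-08-03T18:00:00Z");
        // Two pending messages whose "timestamp" attribute is 10 minutes and 30 seconds old.
        List<Instant> pending = Arrays.asList(now.minusSeconds(600), now.minusSeconds(30));

        System.out.println("backlog = " + backlog(pending));
        System.out.println("event-time lag = " + eventTimeLag(pending, now));
    }
}
```

Running `main` prints a backlog of 2 and an event-time lag of PT10M. Dropping the timestamp attribute would change what the lag is measured against, but the backlog count stays the same.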

On Thu, Aug 3, 2017 at 1:34 PM, <[email protected]> wrote:

> Thanks Lukasz, that's good to know! It sounds like we can halve our PubSub
> costs then!
>
> Just to clarify, if I were to remove withTimestampAttribute from a job,
> this would cause the watermark to always be up to date (processing time)
> even if the job starts to lag behind (build up of unacknowledged PubSub
> messages). In this case would Dataflow's autoscaling still scale up? I
> thought the reason the autoscaler scales up is due to the watermark lagging
> behind, but is it also aware of the unacknowledged PubSub messages?
>
> On 3 Aug 2017, at 18:58, Lukasz Cwik <[email protected]> wrote:
>
> Your understanding is correct - the data watermark will only matter for
> windowing. It will not affect auto-scaling. If the pipeline is not doing
> any windowing, triggering, etc., then there is no need to pay for the cost of
> the second subscription.
>
> On Thu, Aug 3, 2017 at 8:17 AM, Josh <[email protected]> wrote:
>
>> Hi all,
>>
>> We've been running a few streaming Beam jobs on Dataflow, where each job
>> is consuming from PubSub via PubSubIO. Each job does something like this:
>>
>>     PubsubIO.readMessagesWithAttributes()
>>         .withIdAttribute("unique_id")
>>         .withTimestampAttribute("timestamp");
>>
>> My understanding of `withTimestampAttribute` is that it means we use the
>> timestamp on the PubSub message as Beam's concept of time (the watermark) -
>> so that any windowing we do in the job uses "event time" rather than
>> "processing time".
>>
>> My question is: is my understanding correct, and does using
>> `withTimestampAttribute` have any effect in a job that doesn't do any
>> windowing? I have a feeling it may also have an effect on Dataflow's
>> autoscaling, since I think Dataflow scales up when the watermark timestamp
>> lags behind, but I'm not sure about this.
>>
>> The reason I'm concerned about this is that we've been using it in all
>> our Dataflow jobs, and have now realised that whenever
>> `withTimestampAttribute` is used, Dataflow creates an additional PubSub
>> subscription (suffixed with `__streaming_dataflow_internal`), which
>> appears to be doubling PubSub costs (since we pay per subscription)! So I
>> want to remove `withTimestampAttribute` from jobs where possible, but want
>> to first understand the implications.
>>
>> Thanks for any advice,
>> Josh
