Ok great, thanks Lukasz. I will try turning off the timestamp attribute on some of these jobs then!
On Thu, Aug 3, 2017 at 10:14 PM, Lukasz Cwik <[email protected]> wrote:

> To my knowledge, autoscaling is dependent on how many messages are
> backlogged within Pubsub and independent of the second subscription (the
> second subscription is really to compute a better watermark).
>
> On Thu, Aug 3, 2017 at 1:34 PM, <[email protected]> wrote:
>
>> Thanks Lukasz, that's good to know! It sounds like we can halve our
>> PubSub costs then!
>>
>> Just to clarify: if I were to remove withTimestampAttribute from a job,
>> this would cause the watermark to always be up to date (processing time),
>> even if the job starts to lag behind (a build-up of unacknowledged PubSub
>> messages). In this case, would Dataflow's autoscaling still scale up? I
>> thought the reason the autoscaler scales up is that the watermark lags
>> behind, but is it also aware of the unacknowledged PubSub messages?
>>
>> On 3 Aug 2017, at 18:58, Lukasz Cwik <[email protected]> wrote:
>>
>> Your understanding is correct - the data watermark will only matter for
>> windowing. It will not affect autoscaling. If the pipeline is not doing
>> any windowing, triggering, etc., then there is no need to pay for the
>> cost of the second subscription.
>>
>> On Thu, Aug 3, 2017 at 8:17 AM, Josh <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> We've been running a few streaming Beam jobs on Dataflow, where each
>>> job is consuming from PubSub via PubsubIO. Each job does something like
>>> this:
>>>
>>> PubsubIO.readMessagesWithAttributes()
>>>     .withIdAttribute("unique_id")
>>>     .withTimestampAttribute("timestamp");
>>>
>>> My understanding of `withTimestampAttribute` is that it means we use
>>> the timestamp on the PubSub message as Beam's concept of time (the
>>> watermark), so that any windowing we do in the job uses "event time"
>>> rather than "processing time".
>>>
>>> My question is: is my understanding correct, and does using
>>> `withTimestampAttribute` have any effect in a job that doesn't do any
>>> windowing? I have a feeling it may also have an effect on Dataflow's
>>> autoscaling, since I think Dataflow scales up when the watermark lags
>>> behind, but I'm not sure about this.
>>>
>>> The reason I'm concerned about this is that we've been using it in all
>>> our Dataflow jobs, and have now realised that whenever
>>> `withTimestampAttribute` is used, Dataflow creates an additional PubSub
>>> subscription (suffixed with `__streaming_dataflow_internal`), which
>>> appears to be doubling our PubSub costs (since we pay per
>>> subscription)! So I want to remove `withTimestampAttribute` from jobs
>>> where possible, but want to first understand the implications.
>>>
>>> Thanks for any advice,
>>> Josh
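For reference, a minimal sketch of the change discussed in the thread, assuming the Beam Java SDK's PubsubIO API: the timestamp attribute is dropped (so no second internal subscription is created) while the ID attribute is kept for de-duplication. The project and subscription paths here are hypothetical placeholders.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadWithoutTimestampAttribute {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // With withTimestampAttribute removed, elements are timestamped with
    // the Pub/Sub publish time rather than an event-time attribute, so
    // this is only appropriate for jobs that do no event-time windowing
    // or triggering.
    PCollection<PubsubMessage> messages = p.apply(
        PubsubIO.readMessagesWithAttributes()
            // hypothetical subscription path
            .fromSubscription("projects/my-project/subscriptions/my-sub")
            // keep the ID attribute for de-duplication
            .withIdAttribute("unique_id"));

    p.run();
  }
}
```

This is a sketch, not a drop-in file: it assumes the `beam-sdks-java-io-google-cloud-platform` dependency is on the classpath and that the job genuinely does no windowing, per Lukasz's advice above.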
