Thanks Lukasz, that's good to know! It sounds like we can halve our PubSub 
costs then!

Just to clarify: if I were to remove `withTimestampAttribute` from a job, the 
watermark would always be up to date (processing time), even if the job starts 
to lag behind (a build-up of unacknowledged PubSub messages). In that case, 
would Dataflow's autoscaling still scale up? I thought the autoscaler scales up 
because the watermark is lagging behind, but is it also aware of the backlog of 
unacknowledged PubSub messages?
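
To make the two signals concrete, here is a toy sketch in plain Java. It is not Beam or Dataflow code, and the class and method names are made up for illustration. It only computes the two quantities I'm asking about, the number of unacknowledged messages and the age of the oldest one, and says nothing about which of them the autoscaler actually uses:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Beam or Dataflow.
public class LagSignals {

    // Signal 1: how many messages are published but not yet acknowledged.
    public static int backlogSize(List<Instant> unackedPublishTimes) {
        return unackedPublishTimes.size();
    }

    // Signal 2: how old the oldest unacknowledged message is at time `now`.
    // Returns zero when nothing is pending.
    public static Duration oldestUnackedAge(List<Instant> unackedPublishTimes, Instant now) {
        return unackedPublishTimes.stream()
                .min(Instant::compareTo)
                .map(oldest -> Duration.between(oldest, now))
                .orElse(Duration.ZERO);
    }

    public static void main(String[] args) {
        // Made-up publish times for three pending messages.
        Instant now = Instant.parse("2017-08-03T18:00:00Z");
        List<Instant> unacked = Arrays.asList(
                Instant.parse("2017-08-03T17:55:00Z"),
                Instant.parse("2017-08-03T17:58:30Z"),
                Instant.parse("2017-08-03T17:59:45Z"));

        System.out.println("backlog size: " + backlogSize(unacked));
        System.out.println("oldest unacked age: " + oldestUnackedAge(unacked, now));
    }
}
```

So the question is whether scaling decisions can still see something like the second number once the timestamp attribute is gone, or only something like the first.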

> On 3 Aug 2017, at 18:58, Lukasz Cwik <[email protected]> wrote:
> 
> Your understanding is correct - the data watermark will only matter for 
> windowing. It will not affect auto-scaling. If the pipeline is not doing any 
> windowing, triggering, etc., then there is no need to pay for the cost of the 
> second subscription. 
> 
>> On Thu, Aug 3, 2017 at 8:17 AM, Josh <[email protected]> wrote:
>> Hi all,
>> 
>> We've been running a few streaming Beam jobs on Dataflow, where each job is 
>> consuming from PubSub via PubSubIO. Each job does something like this:
>> 
>> PubsubIO.readMessagesWithAttributes()
>>             .withIdAttribute("unique_id")
>>             .withTimestampAttribute("timestamp");
>> 
>> My understanding of `withTimestampAttribute` is that it means we use the 
>> timestamp on the PubSub message as Beam's concept of time (the watermark) - 
>> so that any windowing we do in the job uses "event time" rather than 
>> "processing time".
>> 
>> My question is: is my understanding correct, and does using 
>> `withTimestampAttribute` have any effect in a job that doesn't do any 
>> windowing? I have a feeling it may also have an effect on Dataflow's 
>> autoscaling, since I think Dataflow scales up when the watermark timestamp 
>> lags behind, but I'm not sure about this.
>> 
>> The reason I'm concerned about this is because we've been using it in all 
>> our Dataflow jobs, and have now realised that whenever 
>> `withTimestampAttribute` is used, Dataflow creates an additional PubSub 
>> subscription (suffixed with `__streaming_dataflow_internal`), which appears 
>> to be doubling PubSub costs (since we pay per subscription)! So I want to 
>> remove `withTimestampAttribute` from jobs where possible, but want to first 
>> understand the implications.
>> 
>> Thanks for any advice,
>> Josh
> 
