Watermarks are generated using the PeriodicWatermarkAssigner using a timestamp 
field from within the records. We are processing log data from an S3 bucket and 
logs are always processed in chronological order using a custom 
ContinuousFileMonitoringFunction but the standard ContinousFileReaderOperator. 
Certainly with a larger cluster splits would be processed more quickly and as 
such the watermark would advance at a quicker pace. Why do you think a more 
quickly advancing watermark would affect state size in this case?

Seth Wiesman

From: Aljoscha Krettek <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, December 23, 2016 at 1:43 PM
To: "[email protected]" <[email protected]>
Subject: Re: state size in relation to cluster size and processing speed

Hi,
how are you generating your watermarks? Could it be that they advance faster 
when the job is processing more data?

Cheers,
Aljoscha

On Fri, 16 Dec 2016 at 21:01 Seth Wiesman 
<[email protected]<mailto:[email protected]>> wrote:
Hi,

I’ve noticed something peculiar about the relationship between state size and 
cluster size and was wondering if anyone here knows of the reason. I am running 
a job with 1 hour tumbling event time windows which have an allowed lateness of 
7 days. When I run on a 20-node cluster with FsState I can process 
approximately 1.5 days’ worth of data in an hour with the most recent 
checkpoint being ~20gb.  Now if I run the same job with the same configurations 
on a 40-node cluster I can process 2 days’ worth of data in 20 min (expected) 
but the state size is only ~8gb. Because allowed lateness is 7 days no windows 
should be purged yet and I would expect the larger cluster which has processed 
more data to have a larger state. Is there some why a slower running job or a 
smaller cluster would require more state?

This is more of a curiosity than an issue. Thanks’ in advance for any insights 
you may have.

Seth Wiesman

Reply via email to