Hi Mark, could you double check if these spikes co-occur with checkpointing? If there is an alignment, certain channels are blocked from taking in data. If all keys are more or less contained in a shard with less data, it would why only these keys are affected.
On Mon, Nov 30, 2020 at 9:27 PM Kegel, Mark <mark.ke...@disneystreaming.com> wrote: > We have a high volume (600-700 shards) kinesis data stream that we are > doing a simple keying and aggregation on. The logic is very simple: kinesis > source, key by fields (A,B,C), window (1-minute, tumbling), aggregate by > summing over integer field R, connect to sink. > > > > We are seeing some anomalous spikes in our aggregations. From one minute > to the next, the sum total for one particular key may increase 25x or more > and then drop back down to a normal level, yet sums for other keys in the > same window remain roughly the same, which we expect. > > > > We don’t see this too often. Maybe 1-5 data points (key + timestamp) in an > hour’s worth of 1-minute windowed data will have these spikes. The data has > fairly low cardinality. There are only roughly two hundred distinct keys. > > > > We inspected the raw kinesis stream and found no duplicates. It isn’t > clear how these spikes could happen or what we might do to work around the > issue since the code is as idiomatic as possible. > > > > We are running the job as part of Kinesis Data Analytics, which is using > Flink version 1.8. To connect to Kinesis we are using the > amazon-kinesis-connection-flink library (v1.0.4) library and the EFO > consumer mode. > > > -- Arvid Heise | Senior Java Developer <https://www.ververica.com/> Follow us @VervericaData -- Join Flink Forward <https://flink-forward.org/> - The Apache Flink Conference Stream Processing | Event Driven | Real Time -- Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany -- Ververica GmbH Registered at Amtsgericht Charlottenburg: HRB 158244 B Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng