Our data arrives in order from Kafka, so we are hoping to use that same order for our processing.
On Tue, Jan 26, 2021 at 8:40 PM Rex Fenley <r...@remind101.com> wrote: > Going further, if "Flink provides no guarantees about the order of the > elements within a window" then with minibatch, which I assume uses a window > under the hood, any aggregates that expect rows to arrive in order will > fail to keep their consistency. Is this correct? > > On Tue, Jan 26, 2021 at 5:36 PM Rex Fenley <r...@remind101.com> wrote: > >> Hello, >> >> We have a job from CDC to a large unbounded Flink plan to Elasticsearch. >> >> Currently, we have been relentlessly trying to reduce our record >> amplification which, when our Elasticsearch index is near fully populated, >> completely bottlenecks our write performance. We decided recently to try a >> new job using mini-batch. At first this seemed promising but at some point >> we began getting huge record amplification in a join operator. It appears >> that minibatch may only batch on aggregate operators? >> >> So we're now thinking that we should have a window before our ES sink >> which only takes the last record for any unique document id in the window, >> since that's all we really want to send anyway. However, when investigating >> turning a table, to a keyed window stream for deduping, and then back into >> a table I read the following: >> >> >Attention Flink provides no guarantees about the order of the elements >> within a window. This implies that although an evictor may remove elements >> from the beginning of the window, these are not necessarily the ones that >> arrive first or last. [1] >> >> which has put a damper on our investigation. >> >> I then found the deduplication SQL doc [2], but I have a hard time >> parsing what the SQL does and we've never used TemporaryViews or proctime >> before. >> Is this essentially what we want? >> Will just using this SQL be safe for a job that is unbounded and just >> wants to deduplicate a document write to whatever the most current one is >> (i.e. will restoring from a checkpoint maintain our unbounded consistency >> and will deletes work)? >> >> [1] >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html >> [2] >> https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/queries.html#deduplication >> >> Thanks! >> >> >> -- >> >> Rex Fenley | Software Engineer - Mobile and Backend >> >> >> Remind.com <https://www.remind.com/> | BLOG <http://blog.remind.com/> >> | FOLLOW US <https://twitter.com/remindhq> | LIKE US >> <https://www.facebook.com/remindhq> >> > > > -- > > Rex Fenley | Software Engineer - Mobile and Backend > > > Remind.com <https://www.remind.com/> | BLOG <http://blog.remind.com/> | > FOLLOW US <https://twitter.com/remindhq> | LIKE US > <https://www.facebook.com/remindhq> > -- Rex Fenley | Software Engineer - Mobile and Backend Remind.com <https://www.remind.com/> | BLOG <http://blog.remind.com/> | FOLLOW US <https://twitter.com/remindhq> | LIKE US <https://www.facebook.com/remindhq>