Hello, I began reading https://ci.apache.org/projects/flink/flink-docs-release-1.0/concepts/concepts.html#parallel-dataflows -
*> Redistributing* streams (between *map()* and *keyBy/window*, as well as between *keyBy/window* and *sink*) change the partitioning of streams. Each *operator subtask* sends data to different target subtasks, depending on the selected transformation. Examples are *keyBy()* (re-partitions by hash code), *broadcast()*, or *rebalance()* (random redistribution). In a *redistributing* exchange, order among elements is only preserved for each pair of sending- and receiving task (for example subtask[1] of *map()* and subtask[2] of *keyBy/window*). This makes it sounds like ordering on the same partition/key is always maintained. Which is exactly the ordering guarantee that I need. This seems to slightly contradict the statement "Flink provides no guarantees about the order of the elements within a window" for keyed state. So is it true that ordering _is_ guaranteed for identical keys? If I'm not mistaken, the state in the TableAPI is always considered keyed state for a join or aggregate. Or am I missing something? Thanks! On Tue, Jan 26, 2021 at 8:53 PM Rex Fenley <r...@remind101.com> wrote: > Our data arrives in order from Kafka, so we are hoping to use that same > order for our processing. > > On Tue, Jan 26, 2021 at 8:40 PM Rex Fenley <r...@remind101.com> wrote: > >> Going further, if "Flink provides no guarantees about the order of the >> elements within a window" then with minibatch, which I assume uses a window >> under the hood, any aggregates that expect rows to arrive in order will >> fail to keep their consistency. Is this correct? >> >> On Tue, Jan 26, 2021 at 5:36 PM Rex Fenley <r...@remind101.com> wrote: >> >>> Hello, >>> >>> We have a job from CDC to a large unbounded Flink plan to Elasticsearch. >>> >>> Currently, we have been relentlessly trying to reduce our record >>> amplification which, when our Elasticsearch index is near fully populated, >>> completely bottlenecks our write performance. We decided recently to try a >>> new job using mini-batch. At first this seemed promising but at some point >>> we began getting huge record amplification in a join operator. It appears >>> that minibatch may only batch on aggregate operators? >>> >>> So we're now thinking that we should have a window before our ES sink >>> which only takes the last record for any unique document id in the window, >>> since that's all we really want to send anyway. However, when investigating >>> turning a table, to a keyed window stream for deduping, and then back into >>> a table I read the following: >>> >>> >Attention Flink provides no guarantees about the order of the elements >>> within a window. This implies that although an evictor may remove elements >>> from the beginning of the window, these are not necessarily the ones that >>> arrive first or last. [1] >>> >>> which has put a damper on our investigation. >>> >>> I then found the deduplication SQL doc [2], but I have a hard time >>> parsing what the SQL does and we've never used TemporaryViews or proctime >>> before. >>> Is this essentially what we want? >>> Will just using this SQL be safe for a job that is unbounded and just >>> wants to deduplicate a document write to whatever the most current one is >>> (i.e. will restoring from a checkpoint maintain our unbounded consistency >>> and will deletes work)? >>> >>> [1] >>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html >>> [2] >>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/queries.html#deduplication >>> >>> Thanks! >>> >>> >>> -- >>> >>> Rex Fenley | Software Engineer - Mobile and Backend >>> >>> >>> Remind.com <https://www.remind.com/> | BLOG <http://blog.remind.com/> >>> | FOLLOW US <https://twitter.com/remindhq> | LIKE US >>> <https://www.facebook.com/remindhq> >>> >> >> >> -- >> >> Rex Fenley | Software Engineer - Mobile and Backend >> >> >> Remind.com <https://www.remind.com/> | BLOG <http://blog.remind.com/> >> | FOLLOW US <https://twitter.com/remindhq> | LIKE US >> <https://www.facebook.com/remindhq> >> > > > -- > > Rex Fenley | Software Engineer - Mobile and Backend > > > Remind.com <https://www.remind.com/> | BLOG <http://blog.remind.com/> | > FOLLOW US <https://twitter.com/remindhq> | LIKE US > <https://www.facebook.com/remindhq> > -- Rex Fenley | Software Engineer - Mobile and Backend Remind.com <https://www.remind.com/> | BLOG <http://blog.remind.com/> | FOLLOW US <https://twitter.com/remindhq> | LIKE US <https://www.facebook.com/remindhq>