Re: Deduplicating record amplification

Rex Fenley Tue, 26 Jan 2021 20:53:35 -0800

Our data arrives in order from Kafka, so we are hoping to use that same
order for our processing.


On Tue, Jan 26, 2021 at 8:40 PM Rex Fenley <r...@remind101.com> wrote:

> Going further, if "Flink provides no guarantees about the order of the
> elements within a window" then with minibatch, which I assume uses a window
> under the hood, any aggregates that expect rows to arrive in order will
> fail to keep their consistency. Is this correct?
>
> On Tue, Jan 26, 2021 at 5:36 PM Rex Fenley <r...@remind101.com> wrote:
>
>> Hello,
>>
>> We have a job from CDC to a large unbounded Flink plan to Elasticsearch.
>>
>> Currently, we have been relentlessly trying to reduce our record
>> amplification which, when our Elasticsearch index is near fully populated,
>> completely bottlenecks our write performance. We decided recently to try a
>> new job using mini-batch. At first this seemed promising but at some point
>> we began getting huge record amplification in a join operator. It appears
>> that minibatch may only batch on aggregate operators?
>>
>> So we're now thinking that we should have a window before our ES sink
>> which only takes the last record for any unique document id in the window,
>> since that's all we really want to send anyway. However, when investigating
>> turning a table, to a keyed window stream for deduping, and then back into
>> a table I read the following:
>>
>> >Attention Flink provides no guarantees about the order of the elements
>> within a window. This implies that although an evictor may remove elements
>> from the beginning of the window, these are not necessarily the ones that
>> arrive first or last. [1]
>>
>> which has put a damper on our investigation.
>>
>> I then found the deduplication SQL doc [2], but I have a hard time
>> parsing what the SQL does and we've never used TemporaryViews or proctime
>> before.
>> Is this essentially what we want?
>> Will just using this SQL be safe for a job that is unbounded and just
>> wants to deduplicate a document write to whatever the most current one is
>> (i.e. will restoring from a checkpoint maintain our unbounded consistency
>> and will deletes work)?
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html
>> [2]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/queries.html#deduplication
>>
>> Thanks!
>>
>>
>> --
>>
>> Rex Fenley  |  Software Engineer - Mobile and Backend
>>
>>
>> Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>
>>  |  FOLLOW US <https://twitter.com/remindhq>  |  LIKE US
>> <https://www.facebook.com/remindhq>
>>
>
>
> --
>
> Rex Fenley  |  Software Engineer - Mobile and Backend
>
>
> Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>  |
>  FOLLOW US <https://twitter.com/remindhq>  |  LIKE US
> <https://www.facebook.com/remindhq>
>


-- 

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>
 |  FOLLOW
US <https://twitter.com/remindhq>  |  LIKE US
<https://www.facebook.com/remindhq>

Re: Deduplicating record amplification

Reply via email to