Re: Deduplicating record amplification

Rex Fenley Tue, 26 Jan 2021 21:08:15 -0800

Hello,

I began reading
https://ci.apache.org/projects/flink/flink-docs-release-1.0/concepts/concepts.html#parallel-dataflows
-


*> Redistributing* streams (between *map()* and *keyBy/window*, as well as
between *keyBy/window* and *sink*) change the partitioning of streams.
Each *operator
subtask* sends data to different target subtasks, depending on the selected
transformation. Examples are *keyBy()* (re-partitions by hash code),
*broadcast()*, or *rebalance()* (random redistribution). In a
*redistributing* exchange, order among elements is only preserved for each
pair of sending- and receiving task (for example subtask[1] of *map()* and
subtask[2] of *keyBy/window*).
This makes it sounds like ordering on the same partition/key is always
maintained. Which is exactly the ordering guarantee that I need. This seems
to slightly contradict the statement "Flink provides no guarantees about
the order of the elements within a window" for keyed state. So is it true
that ordering _is_ guaranteed for identical keys?

If I'm not mistaken, the state in the TableAPI is always considered keyed
state for a join or aggregate. Or am I missing something?

Thanks!

On Tue, Jan 26, 2021 at 8:53 PM Rex Fenley <r...@remind101.com> wrote:

> Our data arrives in order from Kafka, so we are hoping to use that same
> order for our processing.
>
> On Tue, Jan 26, 2021 at 8:40 PM Rex Fenley <r...@remind101.com> wrote:
>
>> Going further, if "Flink provides no guarantees about the order of the
>> elements within a window" then with minibatch, which I assume uses a window
>> under the hood, any aggregates that expect rows to arrive in order will
>> fail to keep their consistency. Is this correct?
>>
>> On Tue, Jan 26, 2021 at 5:36 PM Rex Fenley <r...@remind101.com> wrote:
>>
>>> Hello,
>>>
>>> We have a job from CDC to a large unbounded Flink plan to Elasticsearch.
>>>
>>> Currently, we have been relentlessly trying to reduce our record
>>> amplification which, when our Elasticsearch index is near fully populated,
>>> completely bottlenecks our write performance. We decided recently to try a
>>> new job using mini-batch. At first this seemed promising but at some point
>>> we began getting huge record amplification in a join operator. It appears
>>> that minibatch may only batch on aggregate operators?
>>>
>>> So we're now thinking that we should have a window before our ES sink
>>> which only takes the last record for any unique document id in the window,
>>> since that's all we really want to send anyway. However, when investigating
>>> turning a table, to a keyed window stream for deduping, and then back into
>>> a table I read the following:
>>>
>>> >Attention Flink provides no guarantees about the order of the elements
>>> within a window. This implies that although an evictor may remove elements
>>> from the beginning of the window, these are not necessarily the ones that
>>> arrive first or last. [1]
>>>
>>> which has put a damper on our investigation.
>>>
>>> I then found the deduplication SQL doc [2], but I have a hard time
>>> parsing what the SQL does and we've never used TemporaryViews or proctime
>>> before.
>>> Is this essentially what we want?
>>> Will just using this SQL be safe for a job that is unbounded and just
>>> wants to deduplicate a document write to whatever the most current one is
>>> (i.e. will restoring from a checkpoint maintain our unbounded consistency
>>> and will deletes work)?
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html
>>> [2]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/queries.html#deduplication
>>>
>>> Thanks!
>>>
>>>
>>> --
>>>
>>> Rex Fenley  |  Software Engineer - Mobile and Backend
>>>
>>>
>>> Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>
>>>  |  FOLLOW US <https://twitter.com/remindhq>  |  LIKE US
>>> <https://www.facebook.com/remindhq>
>>>
>>
>>
>> --
>>
>> Rex Fenley  |  Software Engineer - Mobile and Backend
>>
>>
>> Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>
>>  |  FOLLOW US <https://twitter.com/remindhq>  |  LIKE US
>> <https://www.facebook.com/remindhq>
>>
>
>
> --
>
> Rex Fenley  |  Software Engineer - Mobile and Backend
>
>
> Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>  |
>  FOLLOW US <https://twitter.com/remindhq>  |  LIKE US
> <https://www.facebook.com/remindhq>
>


-- 

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>
 |  FOLLOW
US <https://twitter.com/remindhq>  |  LIKE US
<https://www.facebook.com/remindhq>

Re: Deduplicating record amplification

Reply via email to