Digging around, it looks like Upsert Kafka, which requires a primary key,
will actually do what I want and uses compaction, but it doesn't look
compatible with the Debezium format. Is this on the roadmap?
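
For reference, a minimal sketch of the Upsert Kafka table we have in mind
(table, column, and topic names are hypothetical; the options are the
standard upsert-kafka connector options):

CREATE TABLE users_by_pk (
  user_id BIGINT,
  state   STRING,
  PRIMARY KEY (user_id) NOT ENFORCED  -- upsert-kafka requires a primary key
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'users_by_pk',            -- a compacted topic
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);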

In the meantime, we're considering consuming from Debezium Kafka (still
compacted), writing directly to an Upsert Kafka sink, and then reading
right back out of a corresponding Upsert Kafka source. Since that little
roundabout will key all changes by primary key, it should give us a
compacted topic to start with. Once we get that working, we can probably
do the same thing with intermediate Flink jobs too.
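
Roughly, the roundabout would look like the following sketch (hypothetical
names again; the Debezium topic is read with the regular kafka connector plus
the debezium-json format and re-keyed into the upsert-kafka table above):

CREATE TABLE users_cdc (
  user_id BIGINT,
  state   STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'dbserver.public.users',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'users-rekey',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'debezium-json'
);

-- re-key every change by primary key into the compacted upsert topic
INSERT INTO users_by_pk
SELECT user_id, state FROM users_cdc;

Downstream jobs would then define the same upsert-kafka table and read
users_by_pk instead of the raw Debezium topic.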

Would appreciate any feedback on this approach, thanks!

On Fri, Feb 26, 2021 at 10:52 AM Rex Fenley <r...@remind101.com> wrote:

> Does this also imply that it's not safe to compact the initial topic where
> data is coming from Debezium? I'd think that Flink's Kafka source would
> emit retractions on any existing data with a primary key as new data with
> the same pk arrived (in our case, all data has primary keys). I guess that
> goes back to my original question, however: is this not what the Kafka
> source does? Is there no way to make that happen?
>
> We really can't live with the record amplification; it's sometimes
> nonlinear and randomly kills RocksDB performance.
>
> On Fri, Feb 26, 2021 at 2:16 AM Arvid Heise <ar...@apache.org> wrote:
>
>> Just to clarify, intermediate topics should in most cases not be
>> compacted, for exactly these reasons, if your application depends on all
>> of the intermediate data. For the final topic, it makes sense. If you also
>> consume intermediate topics for a web application, one solution is to split
>> them into two topics (like topic-raw for Flink and topic-compacted for
>> applications) and live with some amplification.
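>>
>> A rough sketch of that split, assuming Flink SQL and purely hypothetical
>> table, column, and topic names (topic-raw keeps the full changelog via the
>> kafka connector with the debezium-json format, topic-compacted is keyed and
>> compacted via upsert-kafka, and source_changelog stands in for whatever
>> changelog the job produces):
>>
>> CREATE TABLE users_raw (
>>   user_id BIGINT,
>>   state   STRING
>> ) WITH (
>>   'connector' = 'kafka',
>>   'topic' = 'topic-raw',
>>   'properties.bootstrap.servers' = 'localhost:9092',
>>   'format' = 'debezium-json'          -- keeps every change event
>> );
>>
>> CREATE TABLE users_compacted (
>>   user_id BIGINT,
>>   state   STRING,
>>   PRIMARY KEY (user_id) NOT ENFORCED  -- upsert-kafka requires a primary key
>> ) WITH (
>>   'connector' = 'upsert-kafka',
>>   'topic' = 'topic-compacted',
>>   'properties.bootstrap.servers' = 'localhost:9092',
>>   'key.format' = 'json',
>>   'value.format' = 'json'
>> );
>>
>> -- the same changelog goes to both topics: Flink jobs read topic-raw,
>> -- web applications read topic-compacted
>> INSERT INTO users_raw SELECT user_id, state FROM source_changelog;
>> INSERT INTO users_compacted SELECT user_id, state FROM source_changelog;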
>>
>> On Thu, Feb 25, 2021 at 12:11 AM Rex Fenley <r...@remind101.com> wrote:
>>
>>> All of our Flink jobs are (currently) used for web applications at the
>>> end of the day. We see a lot of latency spikes from record amplification,
>>> and we were initially hoping we could pass intermediate results through
>>> Kafka and compact them to reduce the amplification, but then it hit
>>> me that this might be an issue.
>>>
>>> Thanks for the detailed explanation, though it seems like we'll need to
>>> look for a different solution or only compact records we know will never
>>> mutate.
>>>
>>> On Wed, Feb 24, 2021 at 6:38 AM Arvid Heise <ar...@apache.org> wrote:
>>>
>>>> Jan's response is correct, but I'd like to emphasize the impact on a
>>>> Flink application.
>>>>
>>>> If the compaction happens before the data arrives in Flink, the
>>>> intermediate updates are lost and just the final result appears.
>>>> Also, if you restart your Flink application and reprocess older data, it
>>>> will naturally only see the compacted data, save for the active segment.
>>>>
>>>> So how to make it deterministic? Simply drop topic compaction. If it's
>>>> coming from CDC and you want to process and produce changelog streams over
>>>> several applications, you probably don't want to use log compaction
>>>> anyway.
>>>>
>>>> Log compaction only makes sense in the snapshot topic that displays the
>>>> current state (KTable), where you don't think in CDC updates anymore but
>>>> just final records, like
>>>> (user_id: 1, state: "california")
>>>> (user_id: 1, state: "ohio")
>>>>
>>>> Usually, if you use CDC in your company, each application is
>>>> responsible for building its own current model by tapping into the relevant
>>>> changes. Log-compacted topics would then only appear at the end of
>>>> processing, when you hand data over to non-analytical applications, such
>>>> as web apps.
>>>>
>>>> On Wed, Feb 24, 2021 at 10:01 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>>
>>>>> Hi Rex,
>>>>>
>>>>> If I understand correctly, you are concerned about the behavior of the
>>>>> Kafka source in the case of a compacted topic, right? If that is the case,
>>>>> then this is not directly related to Flink; Flink will expose the behavior
>>>>> defined by Kafka. You can read about it for instance here [1]. TL;DR - your
>>>>> pipeline is guaranteed to see every record written to the topic (every single
>>>>> update, be it later "overwritten" or not) if it processes the record with
>>>>> a latency of at most 'delete.retention.ms'. This is configurable per topic -
>>>>> default 24 hours. If you want to reprocess the data later, your consumer
>>>>> might see only the resulting compacted ("retracted") stream, and not every
>>>>> record actually written to the topic.
>>>>>
>>>>>  Jan
>>>>>
>>>>> [1]
>>>>> https://medium.com/swlh/introduction-to-topic-log-compaction-in-apache-kafka-3e4d4afd2262
>>>>> On 2/24/21 3:14 AM, Rex Fenley wrote:
>>>>>
>>>>> Apologies, forgot to finish. If the Kafka source performs its own
>>>>> retractions of old data keyed by (user_id) for every append it receives,
>>>>> I think that should resolve this discrepancy.
>>>>>
>>>>> Again, is this true? Anything else I'm missing?
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>> On Tue, Feb 23, 2021 at 6:12 PM Rex Fenley <r...@remind101.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm concerned about the impact of Kafka's compaction when sending
>>>>>> data between running Flink jobs.
>>>>>>
>>>>>> For example, one job produces retract stream records in the sequence
>>>>>> (false, (user_id: 1, state: "california")) -- retract
>>>>>> (true, (user_id: 1, state: "ohio")) -- append
>>>>>> which Kafka consumes keyed by user_id; this could end up
>>>>>> compacting to just
>>>>>> (true, (user_id: 1, state: "ohio")) -- append
>>>>>> If some other downstream Flink job has a filter on state ==
>>>>>> "california" and reads from the Kafka stream, I assume it will miss the
>>>>>> retract message altogether and produce incorrect results.
>>>>>>
>>>>>> Is this true? How do we prevent this from happening? We need to use
>>>>>> compaction since all our jobs are based on CDC and we can't just drop 
>>>>>> data
>>>>>> after x number of days.
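>>>>>>
>>>>>> For concreteness, the downstream job I have in mind is roughly the
>>>>>> following (hypothetical names; users_changelog would be a kafka source
>>>>>> table over the compacted topic, reading a changelog format):
>>>>>>
>>>>>> SELECT user_id, state
>>>>>> FROM users_changelog
>>>>>> WHERE state = 'california';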
>>>>>>
>>>>>> Thanks
>>>>>>


-- 

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com <https://www.remind.com/>  |  BLOG <http://blog.remind.com/>  |
FOLLOW US <https://twitter.com/remindhq>  |  LIKE US <https://www.facebook.com/remindhq>
