Thank you for the reply.

Yes, Kylin would not know the semantics of a duplicate through the Kafka consumer API; 
that is left to custom application code.


The question is really: "is there any best practice for implementing de-duplication 
in custom application code with a Kylin streaming cube?"


For example, a naive solution would be:

Assign each "row" a UUID and ensure each row goes to a fixed topic partition. 
Assume that no Kafka retry will happen after 10 seconds. In Kylin's "read Kafka 
messages to HDFS" step, add application logic that keeps the UUIDs seen in the 
past 10 s for each topic partition and discards any message whose UUID has 
already been seen.

However, this naive solution requires modifying KafkaInputRecordReader (if using 
the MR engine) and costs some memory.

Is there any suggested way or best practice to do this? Thanks.

________________________________
From: Billy Liu <[email protected]>
Sent: Tuesday, May 16, 2017 3:07 PM
To: user
Subject: Re: Streaming cube - workaround to duplicate messages by kafka 
producer retry?

Kafka provides an ack mechanism, although the all-acks setting would hurt 
throughput and performance. Users can configure it via Kafka client parameters. 
Kylin does not and should not know how to process duplicate messages: duplication 
is a semantic concept. What Kylin can guarantee is not to consume the messages 
more than once.
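For reference, the ack trade-off mentioned here is set through standard Kafka 
producer configuration keys. The snippet below only builds the properties (the 
broker address is a placeholder); constructing an actual KafkaProducer would need 
the kafka-clients dependency, omitted here.

```java
import java.util.Properties;

// Illustrative producer settings: "acks", "retries", and "bootstrap.servers"
// are standard Kafka producer configuration keys.
public class ProducerAckConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("acks", "all");  // wait for all in-sync replicas: safest, slowest
        props.put("retries", "3"); // retries are exactly what can introduce duplicates
        return props;
    }

    public static void main(String[] args) {
        System.out.println(producerProps().getProperty("acks"));
    }
}
```

Note that higher "retries" with "acks=all" improves durability but makes the 
duplicate problem discussed in this thread more likely, not less.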

2017-05-16 22:37 GMT+08:00 Tingmao Lin <[email protected]>:

Hi,


The current version of the Kafka producer provides at-least-once semantics. 
Duplicates may occur in the stream due to producer retries.

(The idempotent producer is still under development; KAFKA-4815 
<https://issues.apache.org/jira/browse/KAFKA-4815> tracks the implementation of 
KIP-98: 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging.)

When using a streaming cube, Kylin may therefore consume duplicated messages and 
produce unexpected results.

Does anyone have experience dealing with this problem? I think this is more about 
Kafka itself, but since no idempotent producer is available at the moment, could I 
have some advice on working around it on the Kylin side? Thanks.



