Hi Yi,

Thanks for the clarification; it was helpful.

I would also like to know your views on the issues below, and whether you have done anything to work around them.

LogCompaction Issues:

https://issues.apache.org/jira/browse/KAFKA-2163 - Offsets manager cache should prevent stale-offset-cleanup while an offset load is in progress; otherwise we can lose consumer offsets
*Might be an issue, as it will result in no offset being read, thereby failing the bootstrap of the local key-value store.*

https://issues.apache.org/jira/browse/KAFKA-2118 - Cleaner cannot clean after shutdown during replaceSegments
*Will prevent reading the log-compacted topic, causing the local key-value store bootstrap to fail.*

https://issues.apache.org/jira/browse/KAFKA-2235 - LogCleaner offset map overflow
*Will probably be an issue for clients that have small messages and a large number of keys. They need to fine-tune a lot to make sure that this doesn't happen.*

Replication Issues:

https://issues.apache.org/jira/browse/KAFKA-2477 - Replicas spuriously deleting all segments in partition
*Will cause the data in the changelog topic to be lost, resulting in failure of the local key-value store bootstrap.*

Though Samza can be plugged into different messaging systems, Kafka is the major system supported today for stateful processing. If that's the case, these bugs could also prevent Samza from working properly (e.g. if the replication issue called out above happens in a log-compacted topic, Samza might not be able to restore its local key-value store).

Since you are running Samza with stateful processing, the above issues might leave a Samza job's key-value store in an inconsistent state. Are you using Samza with stateful processing for critical applications that cannot tolerate data loss or inconsistencies? (With the above bugs, you might not be able to run the job for a critical application, as it might fail if it hits one of these issues.)
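
To give a rough sense of when the offset map overflow in KAFKA-2235 becomes a concern, the tuning can be estimated with some back-of-the-envelope arithmetic. This is only a sketch under assumptions that do not come from this thread: it assumes the cleaner's map needs about 24 bytes per key (a 16-byte hash plus an 8-byte offset), that the dedupe buffer is split evenly across cleaner threads, and a load factor of 0.9. The function names and the example sizes are hypothetical; check the real values against your broker version.

```python
# Rough estimate of whether one log segment could overflow the log cleaner's
# offset map (the situation described in KAFKA-2235).
# All constants and defaults are ASSUMPTIONS for illustration only.

BYTES_PER_ENTRY = 24  # assumed: 16-byte key hash + 8-byte offset


def offset_map_capacity(dedupe_buffer_bytes, cleaner_threads=1, load_factor=0.9):
    """Approximate number of keys one cleaner thread can track."""
    if cleaner_threads < 1:
        raise ValueError("cleaner_threads must be >= 1")
    per_thread_bytes = dedupe_buffer_bytes // cleaner_threads
    slots = per_thread_bytes // BYTES_PER_ENTRY
    return int(slots * load_factor)


def max_messages_per_segment(segment_bytes, avg_message_bytes):
    """Upper bound on messages (hence keys) a single segment can hold."""
    if avg_message_bytes < 1:
        raise ValueError("avg_message_bytes must be >= 1")
    return segment_bytes // avg_message_bytes


def might_overflow(segment_bytes, avg_message_bytes, dedupe_buffer_bytes,
                   cleaner_threads=1, load_factor=0.9):
    """True if a full segment could hold more keys than the map can track."""
    capacity = offset_map_capacity(dedupe_buffer_bytes, cleaner_threads, load_factor)
    return max_messages_per_segment(segment_bytes, avg_message_bytes) > capacity


if __name__ == "__main__":
    one_gib = 1024 ** 3          # hypothetical segment size
    buffer_128_mib = 128 * 1024 ** 2  # hypothetical dedupe buffer size
    # Small messages: many keys per segment, so overflow is plausible.
    print(might_overflow(one_gib, 100, buffer_128_mib))
    # Larger messages: far fewer keys per segment.
    print(might_overflow(one_gib, 1000, buffer_128_mib))
```

Under these assumptions, the risk grows as messages get smaller or segments get larger, which matches the concern about small messages and many keys.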

I believe that upgrading to Kafka 0.9 is critical to ensure that Samza also works properly. I do understand that this is not an issue with Samza itself, but I believe that one of the primary reasons customers and developers choose Samza is its ability to do stateful processing; if that does not work, or fails because of the dependency on Kafka, it becomes necessary to upgrade Kafka as soon as possible. Please correct me if I am wrong here.

Thanks,
Nick

------------------------------

Hi, Nick,

Let me try to answer inline:

On Thu, Mar 31, 2016 at 12:49 AM, nick xander <nickxander...@gmail.com> wrote:

>
> * Do you guys experience issue with Kafka when it is used with log
> compaction for Samza's stateful management?
>

The critical log-compaction issue in Kafka that we care about is the case where message compression and log compaction are *both* used in the same topic. Currently, we forcibly turn off compression for changelog topics. Hence, it is not a problem for Samza's KV-stores. It is still a problem for checkpoint topics if the Kafka producer is configured to use message compression.

> * What is the avg number of keys per partition that you have observed in
> Kafka's log compacted topic for stateful management, total number of
> partition, replication factor and number of Kafka brokers?
>

This number varies *a lot*, depending on how big your KV-store is. For example, we have seen around 5-10GB of RocksDB KV-store data stored in a changelog at LinkedIn. That causes a long bootstrap time when the container is restarted on a different host. Hence, we included the host-affinity feature in Samza 0.10, which cut down the bootstrap time for that particular job by 20x.

> * Will Kafka 0.9 upgrade be included as part of Samza 0.10.1 as it
> seems critical if Samza is used for stateful management? And what is the
> timeline for Samza 0.10.1 that you are expecting?
>

We are planning to release Samza 0.10.1 very soon and are working on pending code reviews and validations now. Depending on the test/validation cycles, we hope to have a Samza 0.10.1 release candidate ready in a month or so. The Kafka 0.9 upgrade will likely not be in Samza 0.10.1, due to the tight release timeline this time.

> * What is recommendation between the usage of Samza vs Kafka connect?
> Should we use Samza for stateful management and Kafka connect for other
> stateless streaming solution?
>
>

KafkaConnect is mainly an ingest/output connector to/from Kafka and does not do much stateful processing. Samza does both ingest/output and stateful processing. If there are input data sources that Samza does not have a SystemConsumer implementation for yet, you can definitely use KafkaConnect for ingestion and Samza for stateful processing.

Hope the above answers your questions.

Thanks!

-Yi

On Thu, Mar 31, 2016 at 9:49 AM, nick xander <nickxander...@gmail.com> wrote:

> Hi All,
> As per this article:
> http://www.confluent.io/blog/290-reasons-to-upgrade-to-apache-kafka-0.9.0.0
> there are some well-known bugs and feature improvements around log
> compaction (stateful management in Samza) and Replication. I also saw in
> Samza issues about this upgrade:
> https://issues.apache.org/jira/browse/SAMZA-855. My questions here:
>
> * Do you guys experience issue with Kafka when it is used with log
> compaction for Samza's stateful management?
> * What is the avg number of keys per partition that you have observed in
> Kafka's log compacted topic for stateful management, total number of
> partition, replication factor and number of Kafka brokers?
> * Will Kafka 0.9 upgrade be included as part of Samza 0.10.1 as it
> seems critical if Samza is used for stateful management? And what is the
> timeline for Samza 0.10.1 that you are expecting?
> * What is recommendation between the usage of Samza vs Kafka connect?
> Should we use Samza for stateful management and Kafka connect for other
> stateless streaming solution?
>
> Thanks,
> Nick
>