Nick,

The relationship between Kafka Connect
<http://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines>
and any stream processing system (whether it is Samza, Kafka Streams or
anything else) is complementary: Kafka Connect makes data available in
Kafka that stream processing systems can then process.

The purpose of Kafka Connect is to offer a framework for running real-time
streaming ingestion connectors for Kafka as easily as possible, without
having to write extra code. It builds on top of Kafka primitives to offer
the fault tolerance, offset management (very soon exactly-once), and
scalability that every connector needs. Since the Kafka community announced
it, the community has built 20 open-source connectors
<http://www.confluent.io/developers/connectors> :-)
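
To give a feel for what "without having to write extra code" can look like,
here is a minimal sketch of a standalone connector configuration. It uses the
FileStreamSource example connector that ships with Apache Kafka; the connector
name, file path and topic name are assumptions chosen for illustration only.

```properties
# Hypothetical connector instance name (any unique name works)
name=local-file-source
# Example file source connector bundled with Apache Kafka
connector.class=FileStreamSource
# Maximum number of tasks the framework may create for this connector
tasks.max=1
# Assumed input file; each appended line becomes one record
file=/tmp/input.txt
# Assumed destination topic; a stream processor (Samza, Kafka Streams, ...)
# would then consume from this topic
topic=connect-test
```

With a file like this saved as, say, connect-file-source.properties, the
standalone worker is started with
bin/connect-standalone.sh config/connect-standalone.properties connect-file-source.properties
and the framework takes care of offset tracking for the connector.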

Kafka Streams
<http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple>,
on the other hand, is a lightweight library that offers stateful stream
processing capability. It is meant to make an application developer's life
easy by providing a simple embeddable library for all sorts of stream
processing operations. Kafka Streams is available as part of Apache Kafka.

As far as the relationship between Kafka Connect and Samza is concerned,
this <http://www.confluent.io/blog/hello-world-kafka-connect-kafka-streams>
blog post talks about building the Hello Samza Wikipedia example using
Kafka Connect and Kafka Streams. You might find it useful.

Thanks,
Neha

On Thu, Apr 7, 2016 at 10:00 PM, Kartik Paramasivam <
kparamasi...@linkedin.com.invalid> wrote:

> The log compaction fixes in Kafka 0.9 were done by our LinkedIn Kafka
> developers to fix issues faced by Samza.
>
> So yes, Samza 0.10 can be safely used with Kafka 0.9.
> At LinkedIn we currently run Kafka brokers built from Apache Kafka trunk,
> so it is a more recent Kafka version, but that should be fine.
>
> Regarding Kafka Connect: we don't use it at LinkedIn, so we don't have
> any position on it. We do have several Samza jobs at LinkedIn that
> have implemented system consumers to read from Kinesis and DynamoDB
> streams and publish to Kafka. We have also been working on a new version
> of Databus, which will be built on top of Kafka (it is similar to Kafka
> Connect).
>
>
> On Thu, Apr 7, 2016 at 3:51 PM, nick xander <nickxander...@gmail.com>
> wrote:
>
> > Hi Yi,
> >
> >         Thanks for the support; I really appreciate having an
> > active, supportive community. It makes sense not to upgrade Samza to the
> > new Kafka 0.9 client (which doesn't rely on ZooKeeper), because that might
> > break clients using Kafka 0.8.2 brokers. But as you were saying, we might
> > be able to use a 0.9 broker with Samza 0.10. (Can you please confirm this?
> > From the documentation I went through it seems possible, but “This means
> > that upgraded brokers and clients may not be compatible with older
> > versions” in the Kafka 0.9 documentation worries me.) It would be great if
> > you could do some sanity testing with a Kafka 0.9 broker and see whether
> > there are any issues with Samza 0.10, since you are the experts in the
> > field and we will not be able to identify all the ways Samza uses Kafka.
> > For example: tools packaged under *org.apache.kafka.clients.tools.** have
> > been moved to *org.apache.kafka.tools.* (I suppose this only affects
> > scripts), a number of Kafka configurations were deprecated and might
> > affect Samza, and there are other potential breaking changes
> > <http://kafka.apache.org/documentation.html#upgrade_9_breaking>. It would
> > be really helpful for this community if the Samza team could confirm that
> > Kafka 0.9 can be safely used with Samza 0.10 or 0.10.1.
> >
> >
> >
> > Thanks,
> >
> > Nick
> > ------------------------------
> >
> >
> > Hi, Nick,
> >
> > Thanks for digging out the details from the Kafka JIRAs! I appreciate it!
> >
> > As for upgrading to Kafka 0.9 to fix those critical issues, I am totally
> > with you. The discussion on whether Samza 0.10.1 should include the Kafka
> > 0.9 fixes has just started (with your thread :)). So, we are happy to
> > accommodate the request if the community needs it.
> >
> > As for the LinkedIn deployment, we have already deployed an internal
> > version of Kafka that has most of the 0.9 fixes for log compaction with
> > compressed messages. I will need to check with our Kafka team to see
> > whether fixes for the bugs you mentioned are also included. There is some
> > concern about pushing Kafka 0.9 (with client libs) into Samza 0.10.1,
> > because some community members are still running Kafka 0.8.2 brokers in
> > production and this change might incur some migration cost. Besides,
> > Kafka 0.9 also introduces new client libraries that require code changes
> > in Samza's KafkaSystemConsumer/KafkaSystemProducer. Hence, our original
> > thought is to keep Samza 0.10.1 as a lightweight release and incorporate
> > Kafka 0.9 in the next major release.
> >
> > However, if Kafka 0.9 brokers support Kafka 0.8.2 clients, I don't
> > think that should block you from using a Kafka 0.9 broker and Samza 0.10
> > together to fix the server-side issues you mentioned. If any client-side
> > change in Samza is needed, we are happy to help, and if necessary we can
> > also change the scope of Samza 0.10.1 to include the Kafka 0.9 client
> > libraries.
> >
> > Please let me know if the above works for you. If not, let me know the
> > specific issues that require the Kafka 0.9 client and we can find a
> > solution together.
> >
> > Thanks a lot!
> >
> > -Yi
> >
> >
> > On Fri, Apr 1, 2016 at 3:34 PM, nick xander <nickxander...@gmail.com>
> > wrote:
> >
> >
> >
> > > Hi Yi,
> > >
> > >         Thanks for the clarification, it was helpful.
> > >
> > > I would also like to know your views on the issues below, and whether
> > > you have employed anything to overcome them.
> > >
> > > Log compaction issues:
> > >
> > > https://issues.apache.org/jira/browse/KAFKA-2163 - Offsets manager cache
> > > should prevent stale-offset-cleanup while an offset load is in progress;
> > > otherwise we can lose consumer offsets –
> > > *Might be an issue, as it will result in no offset to read, thereby
> > > failing the bootstrap of the local key-value store*
> > >
> > > https://issues.apache.org/jira/browse/KAFKA-2118 - Cleaner cannot clean
> > > after shutdown during replaceSegments –
> > > *Will prevent reading the log-compacted topic, causing failure of the
> > > local key-value store bootstrap*
> > >
> > > https://issues.apache.org/jira/browse/KAFKA-2235 - LogCleaner offset map
> > > overflow –
> > > *Will probably be an issue for clients who have small message sizes and
> > > a large number of keys. They need to fine-tune a lot to make sure that
> > > this doesn't happen.*
> > >
> > > Replication issues:
> > >
> > > https://issues.apache.org/jira/browse/KAFKA-2477 - Replicas spuriously
> > > deleting all segments in partition –
> > > *Will cause the data in the changelog topic to be lost, resulting in
> > > failure of the local key-value store bootstrap.*
> > >
> > > Though Samza can be plugged into different messaging systems, Kafka is
> > > the major system supported today for stateful processing. If that's the
> > > case, the bugs above could also keep Samza from working properly (e.g.,
> > > if the replication issue called out above hits a log-compacted topic,
> > > Samza might not be able to restore its local key-value store). Since you
> > > are running Samza with stateful processing, the above issues might leave
> > > the key-value store of your Samza job in an inconsistent state. Are you
> > > using Samza with stateful processing for critical applications that
> > > cannot tolerate loss of data or inconsistencies? (With the above bugs
> > > you might not be able to run the job for a critical application, as it
> > > might fail if it hits these issues.) I believe that upgrading to Kafka
> > > 0.9 is critical to ensure that Samza also works properly. (I do
> > > understand that it's not an issue with Samza, but I believe that one of
> > > the primary reasons customers and developers choose Samza is its ability
> > > to do stateful processing, and if that will fail because of a dependency
> > > on Kafka, it becomes necessary to upgrade Kafka ASAP.) Please correct me
> > > if I am wrong here.
> > >
> > > Thanks,
> > >
> > > Nick
> > >
> > > ------------------------------
> > >
> > > Hi, Nick,
> > >
> > > Let me try to answer in-between the lines:
> > >
> > > On Thu, Mar 31, 2016 at 12:49 AM, nick xander <nickxander...@gmail.com>
> > > wrote:
> > >
> > > > * Do you guys experience issues with Kafka when it is used with log
> > > > compaction for Samza's stateful management?
> > >
> > > The critical issue on log compaction in Kafka that we care about is the
> > > case where message compression and log compaction are *both* used in the
> > > same topic. Currently, for changelog topics, we forcefully turn off
> > > compression. Hence, it is not a problem for Samza's KV-stores. It is
> > > still a problem for checkpoint topics if the Kafka producer is
> > > configured to use message compression.
> > >
> > > > * What is the avg number of keys per partition that you have observed
> > > > in Kafka's log compacted topic for stateful management, total number
> > > > of partitions, replication factor and number of Kafka brokers?
> > >
> > > This number varies *a lot*, depending on how big your KV-store is. For
> > > example, we have seen around 5-10GB of RocksDB KV-stores being stored in
> > > changelog at LinkedIn. That will cause a long bootstrap time when the
> > > container is restarted on a different host. Hence, we included the
> > > host-affinity feature in Samza 0.10, which cut down the bootstrap time
> > > for that particular job by 20x.
> > >
> > > > * Will the Kafka 0.9 upgrade be included as part of Samza 0.10.1, as
> > > > it seems critical if Samza is used for stateful management? And what
> > > > is the timeline for Samza 0.10.1 that you are expecting?
> > >
> > > We are planning to release Samza 0.10.1 very soon and are working on
> > > pending code reviews and validations now. Depending on the
> > > test/validation cycles, we hope to get a Samza 0.10.1 release candidate
> > > ready in a month or so. The Kafka 0.9 upgrade will likely not be in
> > > Samza 0.10.1, due to the tight release timeline this time.
> > >
> > > > * What is the recommendation between the usage of Samza vs Kafka
> > > > Connect? Should we use Samza for stateful management and Kafka Connect
> > > > for other stateless streaming solutions?
> > >
> > > KafkaConnect is mainly an ingest/output connector to/from Kafka, without
> > > much stateful processing. Samza actually does both ingest/output and
> > > stateful processing. If there are input data sources that Samza does not
> > > have a SystemConsumer implementation for yet, you can definitely use
> > > KafkaConnect for ingestion and Samza for stateful processing.
> > >
> > > Hope the above answered your questions.
> > >
> > > Thanks!
> > >
> > > -Yi
> > >
> > > On Thu, Mar 31, 2016 at 9:49 AM, nick xander <nickxander...@gmail.com>
> > > wrote:
> > >
> > > > Hi All,
> > > >     As per this article:
> > > > http://www.confluent.io/blog/290-reasons-to-upgrade-to-apache-kafka-0.9.0.0
> > > > there are some well-known bugs and feature improvements around log
> > > > compaction (stateful management in Samza) and replication. I also saw
> > > > a Samza issue about this upgrade:
> > > > https://issues.apache.org/jira/browse/SAMZA-855. My questions here:
> > > >
> > > > * Do you guys experience issues with Kafka when it is used with log
> > > > compaction for Samza's stateful management?
> > > > * What is the avg number of keys per partition that you have observed
> > > > in Kafka's log compacted topic for stateful management, total number
> > > > of partitions, replication factor and number of Kafka brokers?
> > > > * Will the Kafka 0.9 upgrade be included as part of Samza 0.10.1, as
> > > > it seems critical if Samza is used for stateful management? And what
> > > > is the timeline for Samza 0.10.1 that you are expecting?
> > > > * What is the recommendation between the usage of Samza vs Kafka
> > > > Connect? Should we use Samza for stateful management and Kafka Connect
> > > > for other stateless streaming solutions?
> > > >
> > > > Thanks,
> > > > Nick
> > >
>



-- 
Thanks,
Neha
