I just finished reading up on Kafka 
Connect<http://kafka.apache.org/documentation.html#connect> and am trying to 
wrap my head around where it fits within the big data ecosystem.

Other than the high-level overview provided in the docs, I haven't heard much 
about this feature. My limited understanding of it so far is that it includes 
semantics similar to Storm (sources/spouts, sinks/bolts) and allows for 
distributed processing of streams using tasks that handle data defined in 
records conforming to a schema.  Assuming that's mostly accurate, is anyone 
able to speak to why a developer would want to use Kafka Connect over Spark 
(or maybe even Storm, but to a lesser degree)?  Is Kafka Connect trying to 
address any shortcomings?  I understand it greatly simplifies offset 
persistence, but that's not terribly difficult to implement on top of Spark 
(see my offset persistence 
hack<https://gist.github.com/ariens/e6a39bc3dbeb11467e53>).  Where is Kafka 
Connect targeted within the vast ecosystem that is big data?
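
For what it's worth, here's roughly how I picture a source task looking 
against the 0.9 Connect API, with the framework (rather than my own code) 
committing and restoring the source offsets; the class, file, and topic names 
below are just placeholders I made up:

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.util.Collections;
import java.util.List;
import java.util.Map;

public class LineSourceTask extends SourceTask {

    private long nextLine = 0;

    @Override
    public String version() { return "0.1"; }

    @Override
    public void start(Map<String, String> props) {
        // On (re)start the framework hands back the last committed offset,
        // so the task resumes without rolling its own offset store.
        Map<String, Object> offset = context.offsetStorageReader()
                .offset(Collections.singletonMap("file", "demo.txt"));
        if (offset != null) {
            nextLine = (Long) offset.get("line");
        }
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        Thread.sleep(1000);                    // stand-in for real I/O
        String value = "line " + nextLine;     // stand-in for real data
        SourceRecord record = new SourceRecord(
                Collections.singletonMap("file", "demo.txt"),  // source partition
                Collections.singletonMap("line", nextLine++),  // source offset, persisted by the framework
                "demo-topic", Schema.STRING_SCHEMA, value);
        return Collections.singletonList(record);
    }

    @Override
    public void stop() { }
}

If that's right, then the offset bookkeeping I hacked together for Spark is 
essentially what the worker already does for me here.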

Does Kafka Connect offer efficiencies 'under the hood' by taking advantage of 
data locality and the fact that it distributes the workload across the Kafka 
cluster itself?

I can see it simplifying basic ETL and data warehouse bulk operations, where 
one just wants an easy way to get all data into/out of Kafka and reduce the 
network IO of running multiple compute clusters, but for any data-science-type 
operations (machine learning, etc.) I would expect working with Spark's RDDs 
to be more efficient.
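
To make the ETL case concrete, my impression is that getting data into and 
out of Kafka is meant to be little more than a standalone worker plus a pair 
of connector configs along the lines of the quickstart, both handed to 
bin/connect-standalone.sh with a worker config (file paths and topic names 
here are just placeholders):

# connect-file-source.properties: tail a file into a topic
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=demo.txt
topic=demo-topic

# connect-file-sink.properties: write the topic back out to a file
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=demo.sink.txt
topics=demo-topic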









