I just finished reading up on Kafka Connect<http://kafka.apache.org/documentation.html#connect> and am trying to wrap my head around where it fits within the big data ecosystem.
Other than the high-level overview provided in the docs, I haven't heard much about this feature. My limited understanding so far is that it offers semantics similar to Storm's (sources/spouts, sinks/bolts) and allows for distributed processing of streams using tasks that handle data defined as records conforming to a schema.

Assuming that's mostly accurate, can anyone speak to why a developer would want to use Kafka Connect over Spark (or even Storm, to a lesser degree)? Is Kafka Connect trying to address specific shortcomings? I understand it greatly simplifies offset persistence, but that's not terribly difficult to implement on top of Spark (see my offset persistence hack<https://gist.github.com/ariens/e6a39bc3dbeb11467e53>; a rough sketch of that approach is at the end of this message).

Where is Kafka Connect being targeted within the vast ecosystem that is big data? Does it offer efficiencies under the hood by taking advantage of data locality and distributing the workload across the Kafka cluster itself? I can see it simplifying basic ETL and data warehouse bulk operations, where one just wants an easy way to get all data into or out of Kafka while reducing the network I/O of running multiple compute clusters, but for data science workloads (machine learning, etc.) I would expect working with Spark's RDDs to be more efficient.
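For the ETL case, my understanding from the quickstart is that draining a topic is just a small properties file handed to the standalone worker, something like the file sink example that ships with 0.9 (I'm going from memory here, so treat the exact property names and paths as approximate):

  # connect-file-sink.properties: copy everything from a topic into a local file
  name=local-file-sink
  connector.class=FileStreamSink
  tasks.max=1
  file=test.sink.txt
  topics=connect-test

  bin/connect-standalone.sh config/connect-standalone.properties connect-file-sink.properties

That is obviously attractive for the bulk in/out case, which is partly why I'm asking where the boundary with Spark/Storm is meant to sit.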
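And regarding the offset persistence point above, this is roughly what my gist does (a minimal spark-shell-style sketch against the spark-streaming-kafka direct API; saveOffsets is a hypothetical stand-in for whatever external store you use, not a real API):

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

  // Hypothetical helper: persist offsets to ZooKeeper, a DB, etc.
  def saveOffsets(ranges: Array[OffsetRange]): Unit =
    ranges.foreach(r => println(s"${r.topic}-${r.partition}: ${r.untilOffset}"))

  val conf = new SparkConf().setAppName("offset-persistence-sketch").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(10))
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("my-topic"))

  stream.foreachRDD { rdd =>
    // The direct stream exposes exactly which offsets each batch covers.
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... process the batch here ...
    saveOffsets(ranges)  // only commit once the batch has succeeded
  }

  ssc.start()
  ssc.awaitTermination()

It works, but it's the kind of plumbing I'd happily hand off if Connect really does take care of it.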