Hey Dave,

We're separating the problem of getting data in and out of Kafka from the problem of transforming it. If you think about ETL (Extract, Transform, Load), Kafka Connect does E and L really well and doesn't do T at all; stream processing systems focus on T, with E and L being a bit of a necessary evil. If you are trying to get a single stream of data for one application, using Storm or Spark directly with the appropriate plugins is totally reasonable. If you are trying to capture a bunch of different data sources for multiple uses, these systems get really awkward really fast.
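To make the single-stream case concrete, that path looks something like this with Spark's direct Kafka stream (a rough sketch; the app name, topic, and broker address are made up):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils  # spark-streaming-kafka package

    sc = SparkContext(appName="clicks")
    ssc = StreamingContext(sc, 5)  # 5-second micro-batches

    # One topic, one application: pull "clicks" straight into a DStream.
    stream = KafkaUtils.createDirectStream(
        ssc, ["clicks"], {"metadata.broker.list": "broker1:9092"})
    stream.map(lambda kv: kv[1]).pprint()  # print the message values

    ssc.start()
    ssc.awaitTermination()

That works great for one pipeline; the pain starts when you multiply it.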
Imagine a world in which you wanted to capture a significant portion of what happens in your company as real-time streams, and where many things consume that data. You could theoretically set up a Storm or Spark job for each database table purely for the purpose of loading data, but managing this would be a bit of a nightmare. I think this is where Kafka Connect really shines.

The other advantage is that transformation of data is inherently a deep problem that is close to the programmer. There is lots of room here for query languages, frameworks in different languages, etc. Ingress and egress, on the other hand, is a much more well-defined problem.

So the approach we're building towards is one where data is captured more or less as it is, at large scale, and is then available for further transformation or loading into many other systems. Transformation would be the role of the stream processing systems, and loading and unloading the role of Kafka Connect.

The advantages Kafka Connect has are the following:

- No additional cluster is needed; it coordinates directly with the Kafka cluster.
- It does a good job of capturing schema information from sources when it is present.
- It does a good job of handling scalable data capture--adding a new table to the set of things you're pulling data from is just a simple REST call, not another job to manually configure and manage (see the sketch in the P.S. below).

Hope that sheds some light on things.

-Jay

On Wed, Nov 25, 2015 at 7:50 AM, Dave Ariens <dari...@blackberry.com> wrote:

> I just finished reading up on Kafka Connect
> <http://kafka.apache.org/documentation.html#connect> and am trying to wrap
> my head around where it fits within the big data ecosystem.
>
> Other than the high-level overview provided in the docs, I haven't heard
> much about this feature. My limited understanding of it so far is that it
> includes semantics similar to Storm (sources/spouts, sinks/bolts) and
> allows for distributed processing of streams using tasks that handle data
> defined in records conforming to a schema. Assuming that's mostly
> accurate, is anyone able to speak to why a developer would want to use
> Kafka Connect over Spark (or maybe even Storm, but to a lesser degree)? Is
> Kafka Connect trying to address any shortcomings? I understand it greatly
> simplifies offset persistence, but that's not terribly difficult to
> implement on top of Spark (see my offset persistence hack
> <https://gist.github.com/ariens/e6a39bc3dbeb11467e53>). Where is Kafka
> Connect being targeted within the vast ecosystem that is big data?
>
> Does Kafka Connect offer efficiencies 'under the hood', taking advantage
> of data locality and the fact that it distributes workload on the actual
> Kafka cluster itself?
>
> I can see basic ETL and data warehouse bulk operations simplified where
> one just wants an easy way to get all data in/out of Kafka and reduce the
> network IO of having multiple compute clusters, but for any data science
> type operations (machine learning, etc.) I would expect working with
> Spark's RDDs to be more efficient.
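P.S. For concreteness, here is roughly what that REST call looks like. This is just a sketch: the worker URL uses Connect's default port, and the connector name and config keys follow the Confluent JDBC source connector, so treat them as illustrative rather than gospel.

    import requests  # third-party HTTP library, assumed installed

    # Connect's REST interface; 8083 is the default worker port.
    connect_url = "http://localhost:8083/connectors"

    # Illustrative config for pulling one more table out of a database.
    # The connector class and keys are from the Confluent JDBC source
    # connector; other connectors take different keys.
    payload = {
        "name": "orders-source",
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": "jdbc:postgresql://db.example.com/shop",
            "table.whitelist": "orders",
            "mode": "incrementing",
            "incrementing.column.name": "id",
            "topic.prefix": "db-",
        },
    }

    resp = requests.post(connect_url, json=payload)
    resp.raise_for_status()
    print(resp.json())  # echoes back the created connector and its config

Adding another table is just another POST like this one; no new cluster or job scheduler involved.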