Hey Dave,

We're separating the problem of getting data in and out of Kafka from the
problem of transforming it. If you think about ETL (Extract, Transform,
Load), Kafka Connect does E and L really well and T not at all; stream
processing systems focus on T, with E and L being a bit of a necessary
evil. If you are trying to get a single stream of data for one
application, using Storm or Spark directly with the appropriate plugins is
totally reasonable. If you are trying to capture a bunch of different data
sources for multiple uses, these systems get awkward very fast.

Imagine a world in which you wanted to capture a significant portion of
what happens in your company as real-time streams, with many different
things consuming that data. You could theoretically set up a Storm or
Spark job for each database table purely for the purpose of loading data,
but managing this would be a bit of a nightmare. I think this is where
Kafka Connect really shines.
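
To make that concrete, here is roughly what registering a source looks
like. This is just a sketch: I'm assuming a Connect worker on its default
REST port (8083) and Confluent's JDBC source connector, and the connector
name, connection URL, table, and topic prefix are all made up:

    import requests  # third-party HTTP library, assumed installed

    # Hypothetical config: snapshot a database table, then poll it
    # incrementally into a Kafka topic.
    connector = {
        "name": "orders-source",  # made-up name
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": "jdbc:postgresql://db.example.com/shop",
            "mode": "incrementing",
            "incrementing.column.name": "id",
            "table.whitelist": "orders",
            "topic.prefix": "db-",  # rows land in the topic "db-orders"
        },
    }

    # POST it to the Connect worker; Connect schedules the tasks itself.
    requests.post("http://connect-host:8083/connectors",
                  json=connector).raise_for_status()

Capturing another source is another config like this one, not another
cluster or another job to babysit.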

The other advantage of this split is that transformation of data is
inherently a deep problem that is close to the programmer: there is lots
of room here for query languages, frameworks in different languages, etc.
Ingress and egress, on the other hand, is a much more well-defined
problem.

So the approach we're building towards is one where data is captured more
or less as it is, at large scale, and is then available for further
transformation or loading into many other systems. Transformation is the
role of the stream processing systems; loading and unloading is the role
of Kafka Connect.
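
On the consuming side, a stream processor just reads whatever raw topics
Connect is populating. A minimal sketch of the "T" step using Spark
Streaming's direct Kafka API (the broker address and topic name are the
made-up ones from above):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="transform-orders")
    ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

    # Read the raw "db-orders" topic that Connect is loading into Kafka.
    stream = KafkaUtils.createDirectStream(
        ssc,
        topics=["db-orders"],
        kafkaParams={"metadata.broker.list": "kafka-host:9092"},
    )

    # All the transformation logic lives here, in the processing layer;
    # Connect never touches it. (pprint stands in for real work.)
    stream.map(lambda kv: kv[1]).pprint()

    ssc.start()
    ssc.awaitTermination()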

The advantages Kafka Connect has are the following:
- No additional cluster is needed; it coordinates directly with the Kafka
cluster
- It does a good job of capturing schema information from sources when it
is present
- It does a good job of handling scalable data capture: if you want to add
a new table to the set of things you're pulling data from, that is just a
simple REST call, not another job to manually configure and manage (as
sketched below).
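
For example, pulling in one more table is a config update against the
running connector, not new code. Continuing the hypothetical connector
from the sketch above:

    import requests

    # Fetch the connector's current config and widen the table whitelist.
    base = "http://connect-host:8083/connectors/orders-source"
    config = requests.get(base + "/config").json()
    config["table.whitelist"] = "orders,customers"  # add the new table

    # PUT the updated config back; Connect rebalances tasks on its own.
    requests.put(base + "/config", json=config).raise_for_status()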

Hope that sheds some light on things.

-Jay

On Wed, Nov 25, 2015 at 7:50 AM, Dave Ariens <dari...@blackberry.com> wrote:

> I just finished reading up on Kafka Connect
> <http://kafka.apache.org/documentation.html#connect> and am trying to wrap
> my head around where it fits within the big data ecosystem.
>
> Other than the high level overview provided in the docs I haven't heard
> much about this feature. My limited understanding of it so far is that it
> includes semantics similar to Storm (sources/spouts, sinks/bolts) and
> allows for distributed processing of streams using tasks that handle data
> defined in records conforming to a schema.  Assuming that's mostly
> accurate, is anyone able to speak to why a developer would want to use
> Kafka Connect over Spark (or maybe even Storm, but to a lesser degree)?  Is
> Kafka Connect trying to address any shortcomings?  I understand it greatly
> simplifies offset persistence, but that's not terribly difficult to
> implement on top of Spark (see my offset persistence hack
> <https://gist.github.com/ariens/e6a39bc3dbeb11467e53>).  Where is Kafka
> Connect being targeted within the vast ecosystem that is big data?
>
> Does Kafka Connect offer efficiencies 'under the hood', taking advantage of
> data locality and the fact that it distributes workload on the actual Kafka
> cluster itself?
>
> I can see basic ETL and data warehouse bulk operations simplified where
> one just wants an easy way to get all data in/out of Kafka and reduce the
> network IO of having multiple compute clusters, but for any data science
> type operations (machine learning, etc.) I would expect working with
> Spark's RDDs to be more efficient.
