On Tue, Jun 16, 2015 at 5:00 PM, Joe Stein <joe.st...@stealth.ly> wrote:
> Hey Ewen, very interesting!
>
> I like the idea of the connector and making one side always being Kafka for
> all the reasons you mentioned. It makes having to build consumers (over and
> over and over (and over)) again for these type of tasks much more
> consistent for everyone.
>
> Some initial comments (will read a few more times and think more through
> it).
>
> 1) Copycat, it might be weird/hard to talk about producers, consumers,
> brokers and copycat for what and how "kafka" runs. I think the other naming
> makes sense but maybe we can call it something else? "Sinks" or whatever
> (don't really care just bringing up it might be something to consider). We
> could also just call it "connectors"...dunno.... producers, consumers,
> brokers and connectors...

I'm very open to naming changes. It's hard to come up with names that are
intuitive but don't conflict with existing terms; even in writing this up I was
fighting the names a lot. It gets especially confusing because many of the
names you would think are intuitive, like source and sink, are confusing if
everyone isn't using the same frame of reference. If you're just thinking about
data in Kafka, you could think of "source" as being a Kafka consumer, but at
the level I think of Copycat, "source" means a source of data for import into
Kafka and is therefore tied to a Kafka producer. The perspective of someone who
already uses the Kafka APIs a lot vs. the perspective of a new user or admin
who is just trying to get data copied may be very different.

I think the important things to distinguish are:

* Import and export, since they need different APIs for tasks. Anything
  suggesting directionality (e.g., import/export, source/sink,
  producer/consumer) is potentially confusing.
* The connector (the top-level job) vs. its tasks (the subsets of the job that
  do the actual copying).
* Worker/coordinator -- this one is probably uncontroversial.
* Even the data model names are confusing -- "record" vs.
  object/dictionary/whatever. We need one term for the complex data structures
  being copied and another term for the actual records being processed, like
  ProducerRecord/ConsumerRecord. This might get a bit easier when we start
  talking about real classes (i.e. CopycatRecord), but having a clear
  distinction would be helpful since it still gets confusing talking about
  these things in documentation.

> 2) Can we do copycat-workers without having to rely on Zookeeper? So much
> work has been done to remove this dependency if we can do something without
> ZK lets try (or at least abstract it so it is easier later to make it
> pluggable).

Agreed. I think if we hide this behind a Coordinator interface, where most of
the Coordinator public API corresponds to the actions you'd take from the REST
API/CLI, it'll sufficiently isolate it. Even if we use ZK for the distributed
version, we can probably get a good interface to start with by implementing
the standalone version as a separate Coordinator implementation. This would
force us to think that API through thoroughly and to properly layer the code.
I suspect that in practice it's unlikely we'd see an alternative implementation
any time soon, but I think it's a great idea to design around that possibility
here since it costs us very little when we're starting from scratch.
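To make the connector/task split and the Coordinator layering a bit more
concrete, here's a rough sketch of how the pieces could be separated. To be
clear, none of these interfaces exist yet; every name and signature below is a
hypothetical placeholder for illustration, not the proposed API:

// Hypothetical sketch only -- names and signatures are placeholders, not the KIP's API.
// (Each interface would live in its own source file; shown together here for brevity.)
import java.util.List;
import java.util.Map;

/** The top-level job: understands the external system and decides how to split up the work. */
interface Connector {
    void start(Map<String, String> config);

    /** Produce one configuration per task; the tasks are what do the actual copying. */
    List<Map<String, String>> taskConfigs(int maxTasks);

    void stop();
}

/** One subset of the connector's job, performing the actual data copying. */
interface Task {
    void start(Map<String, String> config);
    void stop();
}

/**
 * Cluster management hidden behind a single interface. The methods mirror the
 * actions exposed through the REST API/CLI, so a standalone implementation and
 * a ZooKeeper-backed distributed implementation are interchangeable.
 */
interface Coordinator {
    void start();
    void stop();
    void addConnector(String name, Map<String, String> connectorConfig);
    void deleteConnector(String name);
    List<String> connectors();
}

With a shape like this, standalone mode is just a second Coordinator
implementation, which is exactly what would force the API to be thought through
and the code to be layered properly.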
> 3) Even though connectors being managed in project has already been
> rejected... maybe we want to have a few (or one) that are in the project
> and maintained. This makes out of the box really out of the box (if only
> file or hdfs or something).

Heh, I included some items there just so I'd have a place to put our thoughts
about those issues without making it look like I was including them in the
proposal. Obviously nothing in here is really off the table yet.

There are a couple of reasons to have something built in. First, you can't
really test without *something*, even if it's trivial. Second, it's hard for
people to write connectors without any reference example. File is the obvious
choice since it can be kept very simple, it doesn't need any extra dependencies
or infrastructure for simple tests or a quickstart, and with a pretty minimal
feature set it can provide basic log import.

I think we could go two ways with an included connector. We could make it a
fully featured example, which would mean quite a bit more, and more complex,
code. Or we could keep it minimal and use it as a helpful example and skeleton
connector. I don't feel too strongly either way on this, but I definitely think
the file connector is the right thing to include with the framework itself.

> 4) "all records include schemas which describe the format of their data" I
> don't totally get this... a lot of data doesn't have the schema with it, we
> have to plug that in... so would the plugin you are talking about for
> serializer would inject the schema to use with the record when it sees the
> data?

Good question. That phrase may have been overreaching. I think the
serialization needs to be pluggable, and with a generic data format we'll need
some APIs (or at least implementations) that are different from the current
serializers/deserializers. How we handle schemas is a bit tricky since some
formats wouldn't need them and might just discard them (e.g. JSON), whereas
others will require them (e.g. Avro). The key point is that we need to provide
APIs that allow schemas to be passed through the system so they can make it all
the way from an input system to an output system. One way to accomplish this
would be to have a very generic catch-all schema that can handle, for example,
JSON inputs that don't have associated schemas.

I think this part is going to be tricky -- where schema info is available, it'd
be really helpful to preserve it, especially since some connectors will require
it (or at least become a lot less useful without it). I think it's a good idea
to encourage connector developers to provide it if possible, which is why I
suggested it should be required. On the other hand, if there's an easy out like
using a catch-all, we might just end up with a bunch of connectors that use
that instead of providing the real schema....
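This ties the record-naming question above to the schema question here: the
schema would travel with the record, so connectors and formats that need it
(e.g. Avro) can use it, while schema-less formats (e.g. JSON) can ignore or
discard it. A very rough, hypothetical sketch -- CopycatRecord is only floated
as a possible name above, and the Schema and serializer shapes below are
assumptions for illustration, not the proposed API:

// Hypothetical sketch only -- names and signatures are placeholders, not the KIP's API.
import java.util.Map;

/** Placeholder for whatever generic schema representation Copycat ends up with. */
interface Schema {
    String type();                 // e.g. "struct", "string", "int64"
    Map<String, Schema> fields();  // nested fields for complex structures; empty for primitives
}

/** A record flowing through Copycat, optionally carrying a schema that describes its value. */
class CopycatRecord {
    private final String topic;
    private final Object key;
    private final Object value;
    private final Schema schema;  // may be null for schema-less sources

    CopycatRecord(String topic, Object key, Object value, Schema schema) {
        this.topic = topic;
        this.key = key;
        this.value = value;
        this.schema = schema;
    }

    String topic()  { return topic; }
    Object key()    { return key; }
    Object value()  { return value; }
    Schema schema() { return schema; }
}

/**
 * Pluggable serialization: an Avro-style implementation can require the
 * schema, while a JSON-style implementation can simply ignore it.
 */
interface RecordSerializer {
    byte[] serialize(Schema schema, Object value);
    Object deserialize(Schema schema, byte[] data);
}

A catch-all would then just be a trivial (or null) Schema that schema-less
connectors fall back to when the source provides no real schema information.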
-Ewen

> ~ Joe Stein
> - - - - - - - - - - - - - - - - -
>  http://www.stealth.ly
> - - - - - - - - - - - - - - - - -
>
> On Tue, Jun 16, 2015 at 4:33 PM, Ewen Cheslack-Postava <e...@confluent.io>
> wrote:
>
> > Oops, linked the wrong thing. Here's the correct one:
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767
> >
> > -Ewen
> >
> > On Tue, Jun 16, 2015 at 4:32 PM, Ewen Cheslack-Postava <e...@confluent.io>
> > wrote:
> >
> > > Hi all,
> > >
> > > I just posted KIP-26 - Add Copycat, a connector framework for data
> > > import/export here:
> > > https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> > >
> > > This is a large KIP compared to what we've had so far, and is a bit
> > > different from most. We're proposing the addition of a fairly big new
> > > component to Kafka because we think including it as part of Kafka rather
> > > than as an external project is in the best interest of both Copycat and
> > > Kafka itself.
> > >
> > > The goal with this KIP is to decide whether such a tool would make sense
> > > in Kafka, give a high level sense of what it would entail, and scope what
> > > would be included vs. what would be left to third parties. I'm hoping to
> > > leave discussion of specific design and implementation details, as well as
> > > logistics like how best to include it in the Kafka repository & project,
> > > to the subsequent JIRAs or follow-up KIPs.
> > >
> > > Looking forward to your feedback!
> > >
> > > -Ewen
> > >
> > > P.S. Preemptive relevant XKCD: https://xkcd.com/927/
> >
> > --
> > Thanks,
> > Ewen

--
Thanks,
Ewen