On Tue, Jun 16, 2015 at 5:00 PM, Joe Stein <joe.st...@stealth.ly> wrote:
> Hey Ewen, very interesting!
>
> I like the idea of the connector and making one side always being Kafka for
> all the reasons you mentioned. It makes having to build consumers (over and
> over and over (and over)) again for these type of tasks much more
> consistent for everyone.
>
> Some initial comments (will read a few more times and think more through
> it).
>
> 1) Copycat, it might be weird/hard to talk about producers, consumers,
> brokers and copycat for what and how "kafka" runs. I think the other naming
> makes sense but maybe we can call it something else? "Sinks" or whatever
> (don't really care just bringing up it might be something to consider). We
> could also just call it "connectors"...dunno.... producers, consumers,
> brokers and connectors...

I'm very open to naming changes. It's hard to come up with names that are
intuitive but don't conflict with existing terms; even in writing this up I was
fighting the names a lot. It gets especially confusing because many of the
names you would think are intuitive, like source and sink, are confusing if
everyone isn't using the same frame of reference. If you're just thinking about
data in Kafka, you could think of "source" as being a Kafka consumer, but at
the level I think of Copycat, "source" means a source of data for import into
Kafka and is therefore tied to a Kafka producer. The perspective of someone who
already uses the Kafka APIs a lot vs. the perspective of a new user or admin
who is just trying to get data copied may be very different.

I think the important things to distinguish are:

* Import and export, since they need different APIs for tasks. Anything
  suggesting directionality (e.g., import/export, source/sink,
  producer/consumer) is potentially confusing.
* The connector (the top-level job) vs. its tasks (the subsets of the job that
  do the actual copying).
* Worker/coordinator -- this one is probably uncontroversial.
* Even the data model names are confusing -- "record" vs.
  object/dictionary/whatever. We need one term for the complex data structures
  being copied and another term for the actual records being processed, like
  ProducerRecord/ConsumerRecord. This might get a bit easier when we start
  talking about real classes (i.e. CopycatRecord), but having a clear
  distinction would be helpful since it still gets confusing talking about
  these things in documentation.

> 2) Can we do copycat-workers without having to rely on Zookeeper? So much
> work has been done to remove this dependency if we can do something without
> ZK lets try (or at least abstract it so it is easier later to make it
> pluggable).

Agreed. I think if we hide this behind a Coordinator interface, where most of
the Coordinator public API corresponds to the actions you'd take from the REST
API/CLI, it'll sufficiently isolate it. Even if we use ZK for the distributed
version, we can probably get a good interface to start with by implementing
the standalone version as a separate Coordinator implementation. This would
force us to think that API through thoroughly and to properly layer the code.
I suspect that in practice it's unlikely we'd see an alternative implementation
any time soon, but I think it's a great idea to design around that possibility
here since it costs us very little when we're starting from scratch.
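To make the connector/task split and the Coordinator layering a bit more
concrete, here's a rough sketch of how the pieces could be separated. To be
clear, none of these interfaces exist yet; every name and signature below is a
hypothetical placeholder for illustration, not the proposed API:

// Hypothetical sketch only -- names and signatures are placeholders, not the KIP's API.
// (Each interface would live in its own source file; shown together here for brevity.)
import java.util.List;
import java.util.Map;

/** The top-level job: understands the external system and decides how to split up the work. */
interface Connector {
    void start(Map<String, String> config);

    /** Produce one configuration per task; the tasks are what do the actual copying. */
    List<Map<String, String>> taskConfigs(int maxTasks);

    void stop();
}

/** One subset of the connector's job, performing the actual data copying. */
interface Task {
    void start(Map<String, String> config);
    void stop();
}

/**
 * Cluster management hidden behind a single interface. The methods mirror the
 * actions exposed through the REST API/CLI, so a standalone implementation and
 * a ZooKeeper-backed distributed implementation are interchangeable.
 */
interface Coordinator {
    void start();
    void stop();
    void addConnector(String name, Map<String, String> connectorConfig);
    void deleteConnector(String name);
    List<String> connectors();
}

With a shape like this, standalone mode is just a second Coordinator
implementation, which is exactly what would force the API to be thought through
and the code to be layered properly.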
> 3) Even though connectors being managed in project has already been
> rejected... maybe we want to have a few (or one) that are in the project
> and maintained. This makes out of the box really out of the box (if only
> file or hdfs or something).

Heh, I included some items there just so I'd have a place to put our thoughts
about those issues without making it look like I was including them in the
proposal. Obviously nothing in here is really off the table yet.

There are a couple of reasons to have something built in. First, you can't
really test without *something*, even if it's trivial. Second, it's hard for
people to write connectors without any reference example. File is the obvious
choice since it can be kept very simple, it doesn't need any extra dependencies
or infrastructure for simple tests or a quickstart, and with a pretty minimal
feature set it can provide basic log import.

I think we could go two ways with an included connector. We could make it a
fully featured example, which would mean quite a bit more, and more complex,
code. Or we could keep it minimal and use it as a helpful example and skeleton
connector. I don't feel too strongly either way on this, but I definitely think
the file connector is the right thing to include with the framework itself.

> 4) "all records include schemas which describe the format of their data" I
> don't totally get this... a lot of data doesn't have the schema with it, we
> have to plug that in... so would the plugin you are talking about for
> serializer would inject the schema to use with the record when it sees the
> data?

Good question. That phrase may have been overreaching. I think the
serialization needs to be pluggable, and with a generic data format we'll need
some APIs (or at least implementations) that are different from the current
serializers/deserializers. How we handle schemas is a bit tricky since some
formats wouldn't need them and might just discard them (e.g. JSON), whereas
others will require them (e.g. Avro). The key point is that we need to provide
APIs that allow schemas to be passed through the system so they can make it all
the way from an input system to an output system. One way to accomplish this
would be to have a very generic catch-all schema that can handle, for example,
JSON inputs that don't have associated schemas.

I think this part is going to be tricky -- where schema info is available, it'd
be really helpful to preserve it, especially since some connectors will require
it (or at least become a lot less useful without it). I think it's a good idea
to encourage connector developers to provide it if possible, which is why I
suggested it should be required. On the other hand, if there's an easy out like
using a catch-all, we might just end up with a bunch of connectors that use
that instead of providing the real schema....
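This ties the record-naming question above to the schema question here: the
schema would travel with the record, so connectors and formats that need it
(e.g. Avro) can use it, while schema-less formats (e.g. JSON) can ignore or
discard it. A very rough, hypothetical sketch -- CopycatRecord is only floated
as a possible name above, and the Schema and serializer shapes below are
assumptions for illustration, not the proposed API:

// Hypothetical sketch only -- names and signatures are placeholders, not the KIP's API.
import java.util.Map;

/** Placeholder for whatever generic schema representation Copycat ends up with. */
interface Schema {
    String type();                 // e.g. "struct", "string", "int64"
    Map<String, Schema> fields();  // nested fields for complex structures; empty for primitives
}

/** A record flowing through Copycat, optionally carrying a schema that describes its value. */
class CopycatRecord {
    private final String topic;
    private final Object key;
    private final Object value;
    private final Schema schema;  // may be null for schema-less sources

    CopycatRecord(String topic, Object key, Object value, Schema schema) {
        this.topic = topic;
        this.key = key;
        this.value = value;
        this.schema = schema;
    }

    String topic()  { return topic; }
    Object key()    { return key; }
    Object value()  { return value; }
    Schema schema() { return schema; }
}

/**
 * Pluggable serialization: an Avro-style implementation can require the
 * schema, while a JSON-style implementation can simply ignore it.
 */
interface RecordSerializer {
    byte[] serialize(Schema schema, Object value);
    Object deserialize(Schema schema, byte[] data);
}

A catch-all would then just be a trivial (or null) Schema that schema-less
connectors fall back to when the source provides no real schema information.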
-Ewen

> ~ Joe Stein
> - - - - - - - - - - - - - - - - -
>  http://www.stealth.ly
> - - - - - - - - - - - - - - - - -
>
> On Tue, Jun 16, 2015 at 4:33 PM, Ewen Cheslack-Postava <e...@confluent.io>
> wrote:
>
> > Oops, linked the wrong thing. Here's the correct one:
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767
> >
> > -Ewen
> >
> > On Tue, Jun 16, 2015 at 4:32 PM, Ewen Cheslack-Postava <e...@confluent.io>
> > wrote:
> >
> > > Hi all,
> > >
> > > I just posted KIP-26 - Add Copycat, a connector framework for data
> > > import/export here:
> > > https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> > >
> > > This is a large KIP compared to what we've had so far, and is a bit
> > > different from most. We're proposing the addition of a fairly big new
> > > component to Kafka because we think including it as part of Kafka rather
> > > than as an external project is in the best interest of both Copycat and
> > > Kafka itself.
> > >
> > > The goal with this KIP is to decide whether such a tool would make sense
> > > in Kafka, give a high level sense of what it would entail, and scope what
> > > would be included vs. what would be left to third parties. I'm hoping to
> > > leave discussion of specific design and implementation details, as well as
> > > logistics like how best to include it in the Kafka repository & project,
> > > to the subsequent JIRAs or follow-up KIPs.
> > >
> > > Looking forward to your feedback!
> > >
> > > -Ewen
> > >
> > > P.S. Preemptive relevant XKCD: https://xkcd.com/927/
> >
> > --
> > Thanks,
> > Ewen

--
Thanks,
Ewen