Hey Roshan,

That is definitely the key question in this space--what can we do that other systems don't?
It's true that there are a number of systems that copy data between things, and at a high enough level of abstraction I suppose they are somewhat the same. But this area is the source of rather a lot of pain for people running these systems, so it is hard to imagine the problem is totally solved in its current state. All the systems you mention are good, and a few we have even contributed to, so this is not to disparage anything. Here are the advantages of what we are proposing:

1. Unlike Sqoop and Camus, this treats batch load as a special case of continuous load (where the stream happens to be a bit bursty). I think this is the right approach and it enables real-time integration without giving up the possibility of periodic dumps.

2. We are trying to make it possible to capture and integrate the schema metadata with the data whenever possible. This metadata travels with the records, and the connectors themselves have access to it. I think this is a big deal versus just delivering opaque byte[]/String rows, and is really required for doing this kind of thing well at scale. It allows a lot of simple filtering, projection, mapping, etc. without custom code, and it makes it possible to start to have notions of compatibility and schema evolution. We hope to make the byte[]/String case just a special case of the richer record model where you have a trivially simple schema (there is a rough sketch of what I mean below, after the other points).

3. This has a built-in notion of parallelism throughout.

4. This maps well to Kafka. For people using Kafka, sharing a data model (topics, partitions, etc.) makes things a lot simpler. It also makes it a lot easier to reason about guarantees.

5. Philosophically we are very committed to the idea of O(1) data loads, which I think Gwen has more eloquently called the "factory model", and which in other contexts I have heard described as "cattle, not pets". The idea is that if you accept up front that you are going to have ~1000 data streams in a company and dozens of sources and sinks, the approach you take is radically different than if you assume a few inputs, one output, and a dozen data streams. I think this plays out in a bunch of ways around management, configuration, etc.

Ultimately I think one thing we learned in thinking about this area is that the system you come up with really comes down to what assumptions you make.

To address a few of your other points:

- We agree that running in YARN is a good thing, but requiring YARN is a bad thing. I think you may be seeing things from a somewhat Hadoop-centric view where YARN is much more prevalent. The scope of the problem is not at all specific to Hadoop, though, and beyond the Hadoop ecosystem we don't see heavy use of YARN (Mesos is more prevalent, but neither is particularly common). Our approach is that copycat runs as a process: if you run it on YARN it should work under Slider, if you run it on Mesos then under Marathon, and if you run it with old-fashioned ops tools then you just manage it like any other process.

- Exactly-once: Yes, but when we add that support in Kafka you will get it end-to-end, which is important.

- I agree that all existing systems have more connectors. We are willing to do the work to catch up there, as we think it is possible to get to an overall better state. I definitely agree this is significant work.
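To make point 2 a bit more concrete, here is a very rough sketch of the kind of record a source connector might hand to the framework. To be clear, this is purely illustrative--the class and field names below are hypothetical, not a committed copycat API--but it shows the idea that the schema rides along with the value, so the opaque byte[]/String case and the rich structured case are the same model:

// Rough illustration only -- these class names are hypothetical, not a committed copycat API.
// The idea: a source connector emits records that carry a schema alongside the value, so the
// framework and sinks can do projection/filtering/column mapping without connector-specific
// code, and the opaque byte[]/String case is just the degenerate one-field schema.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RichRecordSketch {

    // A trivially simple schema: ordered field names with primitive type tags.
    static class Field {
        final String name;
        final String type; // e.g. "string", "int64", "boolean"
        Field(String name, String type) { this.name = name; this.type = type; }
    }

    static class Schema {
        final List<Field> fields;
        Schema(List<Field> fields) { this.fields = fields; }
    }

    // What a connector might hand to the framework: destination topic plus schema plus value.
    static class SourceRecord {
        final String topic;
        final Schema schema;
        final Map<String, Object> value;
        SourceRecord(String topic, Schema schema, Map<String, Object> value) {
            this.topic = topic; this.schema = schema; this.value = value;
        }
    }

    public static void main(String[] args) {
        // A JDBC-style source can map table columns straight into the schema, so a sink
        // (HDFS, another database, etc.) can recreate columns without custom code.
        Schema users = new Schema(Arrays.asList(
                new Field("id", "int64"),
                new Field("email", "string"),
                new Field("active", "boolean")));
        Map<String, Object> row = new HashMap<>();
        row.put("id", 42L);
        row.put("email", "a@example.com");
        row.put("active", true);
        SourceRecord dbRecord = new SourceRecord("db.users", users, row);

        // The "just a string" case is the same model with a one-field schema.
        Schema rawLine = new Schema(Arrays.asList(new Field("line", "string")));
        Map<String, Object> line = new HashMap<>();
        line.put("line", "127.0.0.1 - - [19/Jun/2015] \"GET / HTTP/1.1\" 200 2326");
        SourceRecord logRecord = new SourceRecord("apache.logs", rawLine, line);

        System.out.println(dbRecord.topic + " " + dbRecord.value);
        System.out.println(logRecord.topic + " " + logRecord.value);
    }
}

With something like this, a JDBC source can map columns into the schema automatically, a sink can project or rename fields without source-specific code, and a plain log line is just the degenerate one-field case.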
-Jay

On Fri, Jun 19, 2015 at 7:57 PM, Roshan Naik <ros...@hortonworks.com> wrote:

> My initial thoughts:
>
> Although it is kind of discussed very broadly, I did struggle a bit to properly grasp the value add this adds over the alternative approaches that are available today (or need a little work to accomplish) in specific use cases. I feel it's better to take specific common use cases and show why this will do better to make it clear. For example, a data flow starting from a pool of web servers and finally ending up in HDFS or Hive while providing at-least-once guarantees.
>
> Below are more specific points that occurred to me:
>
> - Import: Today we can create data flows to pick up data from a variety of sources and push data into Kafka using Flume. Not clear how this system can do better in this specific case.
> - Export: For pulling data out of Kafka there is Camus (which limits the destination to HDFS), Flume (which can deliver to many places) and also Sqoop (which could be extended to support Kafka). Camus and Sqoop don't have the "requires defining many tasks" issue for parallelism.
> - YARN support: Letting YARN manage things is actually a good thing (not a bad thing as indicated), since it's easier for scaling in/out as needed without worrying too much about hardware allocation.
> - Exactly-once: It is clear that on the import side you won't support that for now. Not clear how you will support that on the export side for a destination like HDFS or some other. Exactly-once only makes sense when we can have that guarantee on the entire data flow (not just portions of the flow).
> - Connector variety: Flume and Sqoop already have out-of-the-box support for multiple destinations and sources.
>
> -roshan
>
> On 6/19/15 2:47 PM, "Jay Kreps" <j...@confluent.io> wrote:
>
> I think we want the connectors to be federated just because trying to maintain all the connectors centrally would be really painful. I think if we really do this well we would want to have >100 of these connectors, so it really won't make sense to maintain them with the project. I think the thought was just to include the framework and maybe one simple connector as an example.
>
> Thoughts?
>
> -Jay
>
> On Fri, Jun 19, 2015 at 2:38 PM, Gwen Shapira <gshap...@cloudera.com> wrote:
>
> I think BikeShed will be a great name.
>
> Can you clarify the scope? The KIP discusses a framework and also a few examples for connectors. Does the addition include just the framework (and perhaps an example or two), or do we plan to start accepting connectors to the Apache Kafka project?
>
> Gwen
>
> On Thu, Jun 18, 2015 at 3:09 PM, Jay Kreps <j...@confluent.io> wrote:
> > I think the only problem we came up with was that Kafka KopyKat abbreviates as KKK, which is not ideal in the US. Copykat would still be googlable without that issue. :-)
> >
> > -Jay
> >
> > On Thu, Jun 18, 2015 at 1:20 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> >
> >> Just a comment on the name. KopyKat? More unique, easy to write, pronounce, remember...
> >>
> >> Otis
> >>
> >> > On Jun 18, 2015, at 13:36, Jay Kreps <j...@confluent.io> wrote:
> >> >
> >> > 1. We were calling the plugins connectors (which is kind of a generic way to say either source or sink) and the framework copycat.
> >> > The pro of copycat is it is kind of fun. The con is that it doesn't really say what it does. The Kafka Connector Framework would be a duller but more intuitive name, but I suspect people would then just shorten it to KCF which again has no intuitive meaning.
> >> >
> >> > 2. Potentially. One alternative we had thought of wrt the consumer was to have the protocol just handle the group management part and have the partition assignment be purely driven by the client. At the time copycat wasn't even a twinkle in our eyes so we weren't really thinking about that. There were pros and cons to this and we decided it was better to do partition assignment on the broker side. We could revisit this, it might not be a massive change in the consumer, but it would definitely add work there. I do agree that if we have learned one thing it is to keep clients away from zk. This zk usage is more limited though, in that there is no intention of having copycat in different languages as the clients are.
> >> >
> >> > 4. I think the idea is to include the structural schema information you have available so it can be taken advantage of. Obviously the easiest approach would just be to have a static schema for the messages like timestamp + string/byte[]. However this means that if the source has schema information there is no real official way to propagate that. Having a real built-in schema mechanism gives you a little more power to make the data usable. So if you were publishing apache logs the low-touch generic way would just be to have the schema be "string" since that is what apache log entries are. However if you had the actual format string used for the log you could use that to have a richer schema and parse out the individual fields, which is significantly more usable. The advantage of this is that systems like databases, Hadoop, and so on that have some notion of schemas can take advantage of this information that is captured with the source data. So, e.g., the JDBC plugin can map the individual fields to columns automatically, and you can support features like projecting out particular fields and renaming fields easily without having to write custom source-specific code.
> >> >
> >> > -Jay
> >> >
> >> >> On Tue, Jun 16, 2015 at 5:00 PM, Joe Stein <joe.st...@stealth.ly> wrote:
> >> >>
> >> >> Hey Ewen, very interesting!
> >> >>
> >> >> I like the idea of the connector and making one side always being Kafka for all the reasons you mentioned. It makes having to build consumers (over and over and over (and over)) again for these types of tasks much more consistent for everyone.
> >> >>
> >> >> Some initial comments (will read a few more times and think more through it).
> >> >>
> >> >> 1) Copycat, it might be weird/hard to talk about producers, consumers, brokers and copycat for what and how "kafka" runs. I think the other naming makes sense but maybe we can call it something else? "Sinks" or whatever (don't really care, just bringing up it might be something to consider). We could also just call it "connectors"...dunno.... producers, consumers, brokers and connectors...
> >> >>
> >> >> 2) Can we do copycat-workers without having to rely on Zookeeper? So much work has been done to remove this dependency; if we can do something without ZK let's try (or at least abstract it so it is easier later to make it pluggable).
> >> >>
> >> >> 3) Even though connectors being managed in the project has already been rejected... maybe we want to have a few (or one) that are in the project and maintained. This makes out of the box really out of the box (if only file or hdfs or something).
> >> >>
> >> >> 4) "all records include schemas which describe the format of their data" I don't totally get this... a lot of data doesn't have the schema with it, we have to plug that in... so would the plugin you are talking about for the serializer inject the schema to use with the record when it sees the data?
> >> >>
> >> >> ~ Joe Stein
> >> >> - - - - - - - - - - - - - - - - -
> >> >> http://www.stealth.ly
> >> >> - - - - - - - - - - - - - - - - -
> >> >>
> >> >> On Tue, Jun 16, 2015 at 4:33 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> >> >>
> >> >>> Oops, linked the wrong thing. Here's the correct one:
> >> >>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767
> >> >>>
> >> >>> -Ewen
> >> >>>
> >> >>> On Tue, Jun 16, 2015 at 4:32 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> >> >>>
> >> >>>> Hi all,
> >> >>>>
> >> >>>> I just posted KIP-26 - Add Copycat, a connector framework for data import/export here:
> >> >>>> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> >> >>>>
> >> >>>> This is a large KIP compared to what we've had so far, and is a bit different from most. We're proposing the addition of a fairly big new component to Kafka because we think including it as part of Kafka rather than as an external project is in the best interest of both Copycat and Kafka itself.
> >> >>>>
> >> >>>> The goal with this KIP is to decide whether such a tool would make sense in Kafka, give a high level sense of what it would entail, and scope what would be included vs what would be left to third parties. I'm hoping to leave discussion of specific design and implementation details, as well as logistics like how best to include it in the Kafka repository & project, to the subsequent JIRAs or follow-up KIPs.
> >> >>>>
> >> >>>> Looking forward to your feedback!
> >> >>>>
> >> >>>> -Ewen
> >> >>>>
> >> >>>> P.S. Preemptive relevant XKCD: https://xkcd.com/927/
> >> >>>
> >> >>> --
> >> >>> Thanks,
> >> >>> Ewen