Thanks! So I just override the conf while doing the API call? It'd be great to have this documented somewhere on the Confluent website. I couldn't find it.
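(Along the lines of Ewen's answer below, a per-connector converter override set when creating the job through the REST API might look like this -- a minimal sketch, where the connector name, file, topic, and host are placeholder values; the connector and converter classes ship with Apache Kafka:)

    # Create a JSON-producing connector on a cluster whose worker-level
    # default converters are, say, Avro. The key.converter/value.converter
    # keys in the connector config override the worker defaults (KIP-75).
    curl -X POST http://connect-host:8083/connectors \
      -H "Content-Type: application/json" \
      -d '{
        "name": "json-file-source",
        "config": {
          "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
          "tasks.max": "1",
          "file": "/tmp/input.txt",
          "topic": "json-events",
          "key.converter": "org.apache.kafka.connect.json.JsonConverter",
          "value.converter": "org.apache.kafka.connect.json.JsonConverter"
        }
      }'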
On 6 January 2017 at 3:42:45 pm, Ewen Cheslack-Postava (e...@confluent.io) wrote:

On Thu, Jan 5, 2017 at 7:19 PM, Stephane Maarek <steph...@simplemachines.com.au> wrote:

> Thanks a lot for the guidance, I think we'll go ahead with one cluster. I just need to figure out how our CD pipeline can talk to our Connect cluster securely (because it'll need direct access to perform API calls).

The documentation isn't great here, but you can apply all the normal security configs to Connect (in distributed mode, it's basically equivalent to a consumer, so everything you can do with a consumer you can do with Connect).

> Lastly, a question or maybe a piece of feedback… is it not possible to specify the key serializer and deserializer as part of the REST API job config? The issue is that sometimes our data is Avro, sometimes it's JSON. And it seems I'd need two separate clusters for that?

This is new! As of 0.10.1.0, we have https://cwiki.apache.org/confluence/display/KAFKA/KIP-75+-+Add+per-connector+Converters which allows you to include it in the connector config. It's called a "Converter" in Connect because it does a bit more than ser/des if you've written them for Kafka, but they are basically just pluggable ser/des. We knew folks would want this; it just took us a while to find the bandwidth to implement it. Now you shouldn't need to do anything special or deploy multiple clusters -- it's baked in and supported as long as you are willing to override it on a per-connector basis (and this seems reasonable for most folks, since *ideally* you are *somewhat* standardized on a common serialization format).

-Ewen

> On 6 January 2017 at 1:54:10 pm, Ewen Cheslack-Postava (e...@confluent.io) wrote:
>
> On Thu, Jan 5, 2017 at 3:12 PM, Stephane Maarek <steph...@simplemachines.com.au> wrote:
>
> > Hi,
> >
> > We like to operate in micro-services (Dockerize and ship everything on ECS), and I was wondering which approach is preferred. We have one Kafka cluster, one ZooKeeper cluster, etc., but when it comes to Kafka Connect I have some doubts.
> >
> > Is it better to have one big Kafka Connect cluster with multiple nodes, or many small Kafka Connect clusters or standalone instances, one for each connector / ETL?
>
> You can do any of these, and it may depend on how you do orchestration/deployment.
>
> We built Connect to support running one big cluster with a bunch of connectors. It balances work automatically and provides a way to control scale up/down via increased parallelism. This means we don't need to make any assumptions about how you deploy, how you handle elastically scaling your clusters, etc. But if you run in an environment and have the tooling in place to do that already, you can also opt to run many smaller clusters and use that tooling to scale up/down. In that case you'd just make sure there were enough tasks for each connector so that when you scale up the number of workers in a cluster, the rebalancing of work keeps every worker occupied.
>
> The main drawback of doing this is that Connect uses a few topics for configs, status, and offsets, and you need these to be unique per cluster. This means you'll have 3N more topics. If you're running a *lot* of connectors, that could eventually become a problem. It also means you have that many more worker configs to handle, clusters to monitor, etc. And deploying a connector is no longer as simple as making a call to the service's REST API, since there isn't a single centralized service. The main benefits I can think of are a) you may already have preferred tooling for handling elasticity, and b) better resource isolation between connectors (i.e. an OOM error in one connector won't affect any other connectors).
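(Tying the security and internal-topics points together -- a minimal sketch of a distributed worker config, where the broker address, cluster name, and paths are placeholders:)

    # connect-distributed.properties (sketch)
    bootstrap.servers=broker1:9093

    # The "normal security configs" apply to the worker's Kafka clients,
    # just like they would for a plain consumer:
    security.protocol=SSL
    ssl.truststore.location=/etc/kafka/secrets/truststore.jks
    ssl.truststore.password=changeit

    # Each Connect cluster needs its own group.id and its own copies of the
    # three internal topics -- this is where the "3N more topics" come from:
    group.id=connect-cluster-a
    config.storage.topic=connect-cluster-a-configs
    offset.storage.topic=connect-cluster-a-offsets
    status.storage.topic=connect-cluster-a-status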
> For standalone mode, we'd generally recommend only using it when distributed mode doesn't make sense, e.g. for log file collection. Other than that, having the fault tolerance and high availability of distributed mode is preferred.
>
> On your specific points:
>
> > The issues I'm trying to address are:
> >
> > - Integration with our CI/CD pipeline
>
> I'm not sure anything about Connect affects this. Is there a specific concern you have about the CI/CD pipeline & Connect?
>
> > - Efficient resource utilisation
>
> Putting all the connectors into one cluster will probably result in better resource utilization unless you're already automatically tracking usage and scaling appropriately. The reason is that if you use a bunch of small clusters, you're stuck trying to optimize N separate deployments. Since Connect can already (roughly) balance work, putting all the work into one cluster and having Connect split it up means you just need to watch utilization of the nodes in that one cluster and scale up or down as appropriate.
>
> > - Easily add new jar files that connectors depend on, with minimal downtime
>
> This one is a bit interesting. You shouldn't have any downtime adding jars, in the sense that you can do rolling bounces of Connect. The one caveat is a current limitation in how it rebalances work: it halts work for all connectors/tasks, does the rebalance, and then starts them up again. We plan to improve this, but the timeframe is still uncertain. Usually these rebalance steps should be pretty quick. The main reason this can be a concern is that halting some connectors could take some time (e.g. because they need to fully flush their data), which means the period of time your connectors are not processing data during one of those rebalances is controlled by the "worst" connector.
>
> I would recommend trying a single cluster but monitoring whether you see stalls due to rebalances. If you do, then moving to multiple clusters might make sense. (This also, obviously, depends a lot on your SLA for data delivery.)
>
> > - Monitoring operations
>
> Multiple clusters definitely seem messier and more complicated for this. There will be more workers in a single cluster, but it's a single service you need to monitor and maintain.
>
> Hope that helps!
>
> -Ewen
>
> > Thanks for your guidance
> >
> > Regards,
> > Stephane
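(For the monitoring and rebalance-stall questions, the same REST API exposes connector and task state -- a quick sketch, reusing the placeholder connector name and host from above:)

    # Poll connector and task state (RUNNING, UNASSIGNED, PAUSED, FAILED);
    # a FAILED task's entry includes a stack trace in its "trace" field.
    curl -s http://connect-host:8083/connectors/json-file-source/status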