The JDBC connector I started implementing just handles this manually; it
isn't much code and could be factored out into a simple utility:
https://github.com/confluentinc/copycat-jdbc/blob/master/src/main/java/io/confluent/copycat/jdbc/JdbcSourceTask.java#L152

Given the current APIs, sources can handle this on their own if they want
to, because the expectation is that when we call `poll()` on them, they can
hold control of the thread indefinitely.
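
For illustration, the pattern boils down to something like the sketch below.
It's only a rough outline, not the actual connector code -- the package/class
names, the poll.interval.ms config key, and the fetchBatch() helper are all
placeholders:

import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Hypothetical sketch, not the real JdbcSourceTask: a source task that only
// produces a batch of records once per configured interval by blocking in poll().
public class PeriodicBatchSourceTask extends SourceTask {
    private long pollIntervalMs;
    private long nextPollTime;

    @Override
    public void start(Map<String, String> props) {
        // Hypothetical config key; default to one hour between batches.
        pollIntervalMs = Long.parseLong(props.getOrDefault("poll.interval.ms", "3600000"));
        nextPollTime = System.currentTimeMillis();
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        long now = System.currentTimeMillis();
        if (now < nextPollTime) {
            // The framework lets poll() hold the thread, so just sleep out the interval.
            Thread.sleep(nextPollTime - now);
        }
        nextPollTime = System.currentTimeMillis() + pollIntervalMs;
        return fetchBatch();
    }

    // Stand-in for the real work (e.g. running a JDBC query and converting rows).
    private List<SourceRecord> fetchBatch() {
        return Collections.emptyList();
    }

    @Override
    public void stop() {
        // Nothing to clean up in this sketch.
    }

    @Override
    public String version() {
        return "0.0-sketch";
    }
}

The same idea generalizes to whatever schedule the source wants; the
framework doesn't need to know anything about it.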

So I think this is mainly a question for sinks, like the Camus-like example
you mentioned. And I definitely think this is a valid use case -- if I want
hourly files in HDFS, it's probably better to just run the job once per
hour and quickly dump all that data to HDFS than to stream it gradually.

A different option from your suggestion would be to expose the upcoming
pause/resume functionality of the consumer (assuming you agree with my
analysis that this is primarily a sink connector issue). In that case, sink
connectors could just pause their inputs and sleep during the periods when
processing should not occur. I'm not sure whether a batch mode or exposing
pause/resume is better -- both add more API surface area.
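
To make the pause/resume idea concrete, the loop I'm imagining looks roughly
like the sketch below. It's written against the new consumer's pause/resume
calls (treat the exact signatures as illustrative, since that API is still
settling), and the topic name, configs, and writeBatchToHdfs() are
placeholders for whatever the sink actually does:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class HourlyDrainLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "hourly-hdfs-dump");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        long intervalMs = Duration.ofHours(1).toMillis();

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("logs"));
            while (true) {
                // Drain whatever has accumulated since the last run.
                ConsumerRecords<byte[], byte[]> records;
                do {
                    records = consumer.poll(Duration.ofSeconds(5));
                    writeBatchToHdfs(records); // placeholder for the real sink work
                } while (!records.isEmpty());
                consumer.commitSync();

                // Pause every assigned partition and sit out the rest of the interval;
                // polling while paused returns nothing but keeps group membership alive.
                consumer.pause(consumer.assignment());
                long wakeAt = System.currentTimeMillis() + intervalMs;
                while (System.currentTimeMillis() < wakeAt) {
                    consumer.poll(Duration.ofSeconds(30));
                }
                consumer.resume(consumer.assignment());
            }
        }
    }

    private static void writeBatchToHdfs(ConsumerRecords<byte[], byte[]> records) {
        // Stand-in: a real connector would append these records to an hourly HDFS file.
    }
}

The nice property is that the connector stays in the consumer group the whole
time, so it keeps its partition assignments instead of rejoining every hour.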

-Ewen



On Thu, Aug 13, 2015 at 10:23 PM, Gwen Shapira <g...@confluent.io> wrote:

> Hi Team Kafka,
>
> (sorry for the flood, this is last one! promise!)
>
> If you tried out PR-99, you know that CopyCat now does ongoing
> export/import. So it will continuously read data from a source and write it
> to Kafka (or vice versa). This is great for tailing logs and replicating
> from a MySQL binlog.
>
> But I'm wondering if there's a need for a batch mode too.
> This can be useful for:
> * A Camus-like thing. You can stream data to HDFS, but the benefits are
> limited and there are some known issues there.
> * Dumping large parts of an RDBMS at once.
>
> Do you agree that this need exists, or is streaming export/import good enough?
>
> Also, does anyone have ideas about how they would like the batch mode to work?
>
> Gwen
>



-- 
Thanks,
Ewen
