For sink side:
I'm a bit more comfortable with "batch mode" than with "run once and it
will do something every hour", because the former puts scheduling firmly in
the users' hands (and their cron), while the latter means that connector
developers need to figure out schedules.
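
As a concrete illustration, here is a rough sketch of the "read up to the end
offsets captured at startup, then exit" behavior Jay describes below. It is
written against today's plain KafkaConsumer API rather than anything CopyCat
would actually expose; the class name, wiring, and writeToSink() are made up
for illustration:

    import java.time.Duration;
    import java.util.*;
    import org.apache.kafka.clients.consumer.*;
    import org.apache.kafka.common.TopicPartition;

    // Hypothetical sketch, not the CopyCat API: drain each partition up to the
    // end offset snapshotted at startup, then let the process exit so cron can
    // schedule the next run.
    class SinkBatchRun {
        static void run(Properties props, List<TopicPartition> partitions) {
            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                consumer.assign(partitions);
                // Snapshot "the end of the stream"; data produced after this point
                // is left for the next scheduled run.
                Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
                Set<TopicPartition> done = new HashSet<>();
                while (done.size() < partitions.size()) {
                    for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofSeconds(1)))
                        writeToSink(record);              // stand-in for the real sink write
                    for (TopicPartition tp : partitions)
                        if (consumer.position(tp) >= endOffsets.get(tp))
                            done.add(tp);                 // caught up to the snapshot
                }
                consumer.commitSync();                    // record progress before exiting
            }
        }

        static void writeToSink(ConsumerRecord<byte[], byte[]> record) {
            // stand-in for the real sink (HDFS file, JDBC insert, etc.)
        }
    }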

For source side:
I'm not convinced the use case doesn't exist - again, because I doubt that
connector developers want to handle scheduling. However, I think we can
punt on this until a good use case shows up.
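
If and when it does show up, one way it could work is what Ewen suggests
below: the task's poll() simply returns with no data once the source is
exhausted, and the framework shuts it down. A minimal sketch follows, with a
made-up task shape that only approximates the real CopyCat SourceTask
interface:

    import java.util.*;

    // Hypothetical sketch of the "signal completion by returning no data" idea.
    // The class shape and record type are illustrative, not the CopyCat API.
    class BatchDumpSourceTask {
        private final Iterator<String> rows;   // stand-in for a cursor over a table dump

        BatchDumpSourceTask(Iterator<String> rows) { this.rows = rows; }

        // In streaming mode poll() may block until data arrives, so it never has
        // to return empty; in batch mode an empty return can therefore mean "this
        // source is exhausted" and the framework can stop the task and exit.
        List<String> poll() {
            List<String> batch = new ArrayList<>();
            for (int i = 0; i < 1000 && rows.hasNext(); i++)
                batch.add(rows.next());
            return batch;                      // empty => nothing left to copy
        }
    }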

Gwen

On Fri, Aug 14, 2015 at 11:17 AM, Ewen Cheslack-Postava <e...@confluent.io>
wrote:

> On Fri, Aug 14, 2015 at 10:57 AM, Jay Kreps <j...@confluent.io> wrote:
>
> > I thought batch was dead? :-)
> >
> > Yeah I think this would be really useful. Kafka kind of allows you to
> > unify batch and streams since you produce or consume your stream on your
> > own schedule so you would want the ingress/egress to work the same.
> >
> > Ewen, rather than sleeping, I think the use case is that I want to be
> > able to crontab up the copycat process to run hourly or daily to either
> > push or pull data and then quit when there is no more data. Scheduling
> > the process to start is easy; the challenge is how does copycat know it
> > is done?
> >
> > The sink side is a little easier since you can define the end of the
> > stream to be the last offset for each partition at the time the connector
> > starts (this is what Camus does iirc). So at startup you check the end
> > offset for each partition; a partition is complete when it reaches that
> > offset. When all jobs are complete, the process exits.
> >
> > Not sure how the source side could work since the offset concept is
> > heterogeneous across different systems.
> >
>
> You can indicate this by returning from poll() without any data. Since
> poll() is allowed to block indefinitely, there is no reason in streaming
> mode that it would ever need to return without data.
>
> -Ewen
>
>
> >
> > -Jay
> >
> > On Thu, Aug 13, 2015 at 10:23 PM, Gwen Shapira <g...@confluent.io>
> > wrote:
> >
> > > Hi Team Kafka,
> > >
> > > (sorry for the flood, this is last one! promise!)
> > >
> > > If you tried out PR-99, you know that CopyCat now does ongoing
> > > export/import. So it will continuously read data from a source and
> > > write it to Kafka (or vice versa). This is great for tailing logs and
> > > replicating from a MySQL binlog.
> > >
> > > But I'm wondering if there's a need for a batch mode too.
> > > This can be useful for:
> > > * A Camus-like thing. You can stream data to HDFS, but the benefits are
> > > limited and there are some known issues there.
> > > * Dumping large parts of an RDBMS at once.
> > >
> > > Do you agree that this need exists? Or is stream export/import good
> > > enough?
> > >
> > > Also, does anyone have ideas on how they would like the batch mode to work?
> > >
> > > Gwen
> > >
> >
>
>
>
> --
> Thanks,
> Ewen
>
