Hi Aljoscha,

> Unfortunately, it's not that easy right now because normal Sinks that
> rely on checkpointing to write out data, such as Kafka, don't work in
> BATCH execution mode because we don't have checkpoints there. It will
> work, however, if you use a sink that doesn't rely on checkpointing.
> The FlinkKafkaProducer with Semantic.NONE should work, for example.
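Thanks for the tip. Just to double-check my understanding, you mean replacing the exactly-once producer in (E) with something like the following? (a rough sketch; the broker address, topic name, and plain String records are placeholders for what we actually use)

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// stand-in for the output of the session window (D)
DataStream<String> resultStream = env.fromElements("a", "b");

Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker:9092");  // placeholder

// Semantic.NONE doesn't participate in checkpointing (no delivery
// guarantees), so it should also work in BATCH execution mode.
FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
        "output-topic",
        (KafkaSerializationSchema<String>) (element, timestamp) ->
                new ProducerRecord<>("output-topic",
                        element.getBytes(StandardCharsets.UTF_8)),
        props,
        FlinkKafkaProducer.Semantic.NONE);

resultStream.addSink(producer);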
As the output produced to Kafka is eventually stored in Cassandra, I might
use a different sink that writes directly to Cassandra for BATCH execution.
In that case, I'd have to replace both the (A) source and the (E) sink.
(I've put a rough sketch of what I mean at the bottom of this mail.)

> There is HiveSource, which is built on the new Source API that will work
> well with both BATCH and STREAMING. It's quite new and it was only added
> to be used by a Table/SQL connector but you might have some success with
> that one.

Oh, this is a new one that will be introduced in the upcoming 1.12 release;
no wonder I missed it. I'm going to give it a try :-)

BTW, thanks a lot for the input and the nice presentation - it's very
helpful and insightful.

Best,

Dongwon

On Wed, Nov 18, 2020 at 9:44 PM Aljoscha Krettek <aljos...@apache.org> wrote:

> Hi Dongwon,
>
> Unfortunately, it's not that easy right now because normal Sinks that
> rely on checkpointing to write out data, such as Kafka, don't work in
> BATCH execution mode because we don't have checkpoints there. It will
> work, however, if you use a sink that doesn't rely on checkpointing.
> The FlinkKafkaProducer with Semantic.NONE should work, for example.
>
> There is HiveSource, which is built on the new Source API that will work
> well with both BATCH and STREAMING. It's quite new and it was only added
> to be used by a Table/SQL connector but you might have some success with
> that one.
>
> Best,
> Aljoscha
>
> On 18.11.20 07:03, Dongwon Kim wrote:
> > Hi,
> >
> > Recently I've been working on a real-time data stream processing pipeline
> > with DataStream API while preparing for a new service to launch.
> > Now it's time to develop a back-fill job to produce the same result by
> > reading data stored on Hive, which we use for long-term storage.
> >
> > Meanwhile, I watched Aljoscha's talk [1] and just wondered if I could reuse
> > major components of the pipeline written in DataStream API.
> > The pipeline conceptually looks as follows:
> > (A) reads input from Kafka
> > (B) performs AsyncIO to Redis in order to enrich the input data
> > (C) appends timestamps and emits watermarks before the time-based window
> > (D) keyBy followed by a session window with a custom trigger for early
> > firing
> > (E) writes output to Kafka
> >
> > I have simple (maybe stupid) questions on reusing components of the
> > pipeline written in DataStream API.
> > (1) By replacing (A) with a bounded source, can I execute the pipeline with
> > the new BATCH execution mode without modifying (B)~(E)?
> > (2) Is there a bounded source for Hive available for the DataStream API?
> >
> > Best,
> >
> > Dongwon
> >
> > [1] https://www.youtube.com/watch?v=z9ye4jzp4DQ
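P.S. For the archive, this is roughly what I have in mind for the back-fill
job: switch the runtime mode to BATCH and write directly to Cassandra
instead of (E). Just a sketch based on the flink-connector-cassandra
builder; the host, keyspace, table, and record type are all made up:

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.cassandra.CassandraSink;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);  // new in 1.12

// stand-in for the output of (B)~(D); Tuple2 replaces our actual record type
DataStream<Tuple2<String, Long>> result = env.fromElements(Tuple2.of("id", 1L));

// The plain Cassandra sink writes record-by-record and doesn't depend on
// checkpoints, so it should behave the same in BATCH and STREAMING.
CassandraSink.addSink(result)
        .setHost("cassandra-host")  // placeholder
        .setQuery("INSERT INTO my_keyspace.my_table (id, value) VALUES (?, ?);")
        .build();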