Hi Aljoscha,

> Unfortunately, it's not that easy right now because normal Sinks that
> rely on checkpointing to write out data, such as Kafka, don't work in
> BATCH execution mode because we don't have checkpoints there. It will
> work, however, if you use a sink that doesn't rely on checkpointing.
> The FlinkKafkaProducer with Semantic.NONE should work, for example.
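Thanks for the tip. Just to double-check my understanding, you mean replacing the exactly-once producer in (E) with something like the following? (a rough sketch; the broker address, topic name, and plain String records are placeholders for what we actually use)

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// stand-in for the output of the session window (D)
DataStream<String> resultStream = env.fromElements("a", "b");

Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker:9092");  // placeholder

// Semantic.NONE doesn't participate in checkpointing (no delivery
// guarantees), so it should also work in BATCH execution mode.
FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
        "output-topic",
        (KafkaSerializationSchema<String>) (element, timestamp) ->
                new ProducerRecord<>("output-topic",
                        element.getBytes(StandardCharsets.UTF_8)),
        props,
        FlinkKafkaProducer.Semantic.NONE);

resultStream.addSink(producer);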
As the output produced to Kafka is eventually stored in Cassandra, I might
use a different sink that writes directly to Cassandra for BATCH execution.
In that case, I'd have to replace both the (A) source and the (E) sink.
(I've put a rough sketch of what I mean at the bottom of this mail.)

> There is HiveSource, which is built on the new Source API that will work
> well with both BATCH and STREAMING. It's quite new and it was only added
> to be used by a Table/SQL connector but you might have some success with
> that one.

Oh, this is a new one that will be introduced in the upcoming 1.12 release;
no wonder I missed it. I'm going to give it a try :-)

BTW, thanks a lot for the input and the nice presentation - it's very
helpful and insightful.

Best,

Dongwon

On Wed, Nov 18, 2020 at 9:44 PM Aljoscha Krettek <aljos...@apache.org> wrote:

> Hi Dongwon,
>
> Unfortunately, it's not that easy right now because normal Sinks that
> rely on checkpointing to write out data, such as Kafka, don't work in
> BATCH execution mode because we don't have checkpoints there. It will
> work, however, if you use a sink that doesn't rely on checkpointing.
> The FlinkKafkaProducer with Semantic.NONE should work, for example.
>
> There is HiveSource, which is built on the new Source API that will work
> well with both BATCH and STREAMING. It's quite new and it was only added
> to be used by a Table/SQL connector but you might have some success with
> that one.
>
> Best,
> Aljoscha
>
> On 18.11.20 07:03, Dongwon Kim wrote:
> > Hi,
> >
> > Recently I've been working on a real-time data stream processing pipeline
> > with DataStream API while preparing for a new service to launch.
> > Now it's time to develop a back-fill job to produce the same result by
> > reading data stored on Hive, which we use for long-term storage.
> >
> > Meanwhile, I watched Aljoscha's talk [1] and just wondered if I could reuse
> > major components of the pipeline written in DataStream API.
> > The pipeline conceptually looks as follows:
> > (A) reads input from Kafka
> > (B) performs AsyncIO to Redis in order to enrich the input data
> > (C) appends timestamps and emits watermarks before the time-based window
> > (D) keyBy followed by a session window with a custom trigger for early
> > firing
> > (E) writes output to Kafka
> >
> > I have simple (maybe stupid) questions on reusing components of the
> > pipeline written in DataStream API.
> > (1) By replacing (A) with a bounded source, can I execute the pipeline with
> > the new BATCH execution mode without modifying (B)~(E)?
> > (2) Is there a bounded source for Hive available for the DataStream API?
> >
> > Best,
> >
> > Dongwon
> >
> > [1] https://www.youtube.com/watch?v=z9ye4jzp4DQ
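P.S. For the archive, this is roughly what I have in mind for the back-fill
job: switch the runtime mode to BATCH and write directly to Cassandra
instead of (E). Just a sketch based on the flink-connector-cassandra
builder; the host, keyspace, table, and record type are all made up:

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.cassandra.CassandraSink;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);  // new in 1.12

// stand-in for the output of (B)~(D); Tuple2 replaces our actual record type
DataStream<Tuple2<String, Long>> result = env.fromElements(Tuple2.of("id", 1L));

// The plain Cassandra sink writes record-by-record and doesn't depend on
// checkpoints, so it should behave the same in BATCH and STREAMING.
CassandraSink.addSink(result)
        .setHost("cassandra-host")  // placeholder
        .setQuery("INSERT INTO my_keyspace.my_table (id, value) VALUES (?, ?);")
        .build();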