If you want to write in batches from a streaming source, you will always need 
some state, i.e. a state database (a NoSQL database such as a key-value store 
makes sense here). Then you can grab the data at certain points in time and 
convert it to Avro. You need to make sure that the state is logically 
consistent (e.g. everything from the last day), so that events arriving later 
than expected do not end up missing from the files.
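
As a rough sketch of that idea (untested; MyEvent, the key and the one-hour 
allowed lateness are just placeholder assumptions), Flink's own keyed window 
state can play the role of that state store: an event-time tumbling window of 
one day collects the events, and the watermark decides when a day is complete, 
so each batch handed downstream is logically consistent:

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class DailyBatches {

    /** Placeholder for whatever the Kafka topic actually carries. */
    public static class MyEvent {
        public String id;
        public long timestamp;
    }

    /**
     * Groups a timestamped, watermarked stream into one batch per key and per
     * day. The watermark decides when a day is complete, so each emitted list
     * is a logically consistent chunk that can be written as one file.
     */
    public static DataStream<List<MyEvent>> dailyBatches(DataStream<MyEvent> events) {
        return events
            .keyBy(e -> e.id)
            .window(TumblingEventTimeWindows.of(Time.days(1)))
            // grace period for events that arrive later than expected
            .allowedLateness(Time.hours(1))
            .process(new ProcessWindowFunction<MyEvent, List<MyEvent>, String, TimeWindow>() {
                @Override
                public void process(String key, Context ctx, Iterable<MyEvent> elements,
                                    Collector<List<MyEvent>> out) {
                    List<MyEvent> batch = new ArrayList<>();
                    elements.forEach(batch::add);
                    out.collect(batch); // one complete day, ready for the Avro writer
                }
            });
    }
}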

You can write your own sink, but it would still require some state database so 
that the data can be written afterwards as a batch.
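
If the consistent batches are assembled upstream as in the sketch above, the 
custom sink itself only has to serialise each batch into an Avro container 
file. A rough sketch using Avro's reflect API (again untested; MyEvent and 
basePath are placeholders, and error handling / exactly-once are left out):

import java.io.File;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.reflect.ReflectData;
import org.apache.avro.reflect.ReflectDatumWriter;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

/**
 * Each incoming element is already a complete, logically consistent batch
 * (e.g. one day for one key), so the sink just serialises it into an Avro
 * container file. Path naming, error handling and exactly-once guarantees
 * are deliberately left out of this sketch.
 */
public class AvroBatchSink extends RichSinkFunction<List<DailyBatches.MyEvent>> {

    private final String basePath;

    public AvroBatchSink(String basePath) {
        this.basePath = basePath;
    }

    @Override
    public void invoke(List<DailyBatches.MyEvent> batch) throws Exception {
        if (batch.isEmpty()) {
            return;
        }
        Schema schema = ReflectData.get().getSchema(DailyBatches.MyEvent.class);
        File target = new File(basePath, "batch-" + System.nanoTime() + ".avro");
        try (DataFileWriter<DailyBatches.MyEvent> writer =
                 new DataFileWriter<>(new ReflectDatumWriter<DailyBatches.MyEvent>(schema))) {
            writer.create(schema, target);
            for (DailyBatches.MyEvent event : batch) {
                writer.append(event);
            }
        }
    }
}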

Maybe this could be a generic Flink component, i.e. writing to a state database 
and later writing a logically consistent state (what "consistent" means is 
defined by the application) into other sinks (CSV sink, Avro sink, etc.).

> On 13. May 2018, at 14:02, Padarn Wilson <pad...@gmail.com> wrote:
> 
> Hi all,
> 
> I am writing some jobs intended to run with the DataStream API using a 
> Kafka source. However, we also have a lot of data in Avro archives (of the 
> same Kafka source). I would like to be able to run the processing code over 
> parts of the archive so I can generate some "example output".
> 
> I've written the transformations needed to read the data from the archives 
> and process it, but now I'm trying to figure out the best way to write the 
> results to some storage.
> 
> At the moment I can easily write to JSON or CSV using the bucketing sink 
> (although I'm curious about using the watermark time rather than the system 
> time to name the buckets), but I'd really like to write to something more 
> compact like Avro.
> 
> However, I'm not sure this makes sense. Writing to a compressed file format 
> in this way from a streaming job doesn't sound intuitively right. What would 
> make the most sense? I could write to some temporary database and then pipe 
> that into an archive, but this seems like a lot of trouble. Is there a way to 
> pipe the output directly into the batch API of Flink?
> 
> Thanks
