Hi Dan,
InputFormats are the connectors of the DataSet API. Yes, you can use
readFile, readCsvFile, readFileOfPrimitives, etc. However, I would
recommend also giving the Table API a try. The unified TableEnvironment
can perform batch processing and is integrated with a number of
connectors, such as the filesystem connector [1] and Hive [2].
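For example, a batch-mode TableEnvironment with the filesystem connector
could look roughly like this (the table name, path, schema, and format
below are only placeholders; your protobuf byte records would need a
format that matches how they were actually written):

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;

    EnvironmentSettings settings = EnvironmentSettings.newInstance()
            .useBlinkPlanner()
            .inBatchMode()
            .build();
    TableEnvironment tEnv = TableEnvironment.create(settings);

    // Placeholder table over files on disk; adjust path, schema, and format.
    tEnv.executeSql(
            "CREATE TABLE raw_logs (" +
            "  user_id BIGINT," +
            "  payload STRING" +
            ") WITH (" +
            "  'connector' = 'filesystem'," +
            "  'path' = 'file:///path/to/logs'," +
            "  'format' = 'csv'" +
            ")");

    Table logs = tEnv.sqlQuery("SELECT * FROM raw_logs");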
I hope this helps.
Regards,
Timo
[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/connectors/filesystem.html
[2]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/hive_read_write.html
On 11.08.20 00:13, Dan Hill wrote:
Hi. I have a streaming job that writes to
StreamingFileSink.forRowFormat(...) with an encoder that converts
protocol buffers to byte arrays.
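Roughly, the sink is set up like this (MyEvent stands in for my generated
protobuf class; the newline delimiter is only how I'm sketching it here):

    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

    // MyEvent is a placeholder for the generated protobuf class.
    StreamingFileSink<MyEvent> sink = StreamingFileSink
            .<MyEvent>forRowFormat(
                    new Path("file:///path/to/logs"),
                    (element, stream) -> {
                        stream.write(element.toByteArray());
                        stream.write('\n');
                    })
            .build();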
How do I read this data back in during a batch pipeline (using the
DataSet API)? Do I use env.readFile with a custom DelimitedInputFormat?
The StreamingFileSink documentation
<https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/streamfile_sink.html>
is a bit vague.
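Something along these lines is what I had in mind (ProtoRecordInputFormat
is just a hypothetical sketch; it assumes one record per line and that the
payload can never contain the delimiter, e.g. because it is base64-encoded):

    import org.apache.flink.api.common.io.DelimitedInputFormat;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.core.fs.Path;

    // Hypothetical format: emits each newline-delimited chunk as a byte[];
    // parsing back into the protobuf class could happen here or in a later map().
    public class ProtoRecordInputFormat extends DelimitedInputFormat<byte[]> {

        public ProtoRecordInputFormat(Path filePath) {
            super(filePath, null);
            setDelimiter("\n");
        }

        @Override
        public byte[] readRecord(byte[] reuse, byte[] bytes, int offset, int numBytes) {
            byte[] record = new byte[numBytes];
            System.arraycopy(bytes, offset, record, 0, numBytes);
            return record;
        }
    }

and then wiring it into the batch job with something like:

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<byte[]> records = env.readFile(
            new ProtoRecordInputFormat(new Path("file:///path/to/logs")),
            "file:///path/to/logs");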
These files are used as raw logs. They're processed offline, and each
record is read and used in its entirety.
Thanks!
- Dan