Hi,

whether a file can be read in parallel depends on its format. Basically,
you must be able to identify valid offsets at which to start reading.
There are a few techniques for this, such as fixed-size blocks with padding
or a footer section that lists split offsets, but if the file has already
been written and does not offer these features, there is no way to read it
in parallel.

To read a file without splitting it, you can implement a custom
FileInputFormat and set its "unsplittable" member field to "true".
This will create one input split per file. In nextRecord(), you can then
parse the file record by record.
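A minimal sketch of such an input format could look like this (MyProto is a placeholder for your generated protobuf class):

```java
import java.io.IOException;

import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.core.fs.FileInputSplit;

// Sketch: an unsplittable input format that reads one delimited
// protobuf record per call to nextRecord().
public class ProtoInputFormat extends FileInputFormat<MyProto> {

    private boolean reachedEnd;

    public ProtoInputFormat() {
        // One input split per file; Flink will not try to split mid-file.
        this.unsplittable = true;
    }

    @Override
    public void open(FileInputSplit split) throws IOException {
        super.open(split); // opens this.stream at the start of the file
        reachedEnd = false;
    }

    @Override
    public boolean reachedEnd() {
        return reachedEnd;
    }

    @Override
    public MyProto nextRecord(MyProto reuse) throws IOException {
        // parseDelimitedFrom returns null when the stream is exhausted
        MyProto record = MyProto.parseDelimitedFrom(stream);
        if (record == null) {
            reachedEnd = true;
        }
        return record;
    }
}
```

You can then pass an instance to the environment, e.g. env.readFile(new ProtoInputFormat(), "s3://bucket/path"), and Flink will distribute the files across the parallel source tasks.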

Hope this helps,
Fabian

2017-07-26 20:47 GMT+02:00 ShB <shon.balakris...@gmail.com>:

> I'm working with Apache Flink on reading, parsing and processing data from
> S3. I'm using the DataSet API, as my data is bounded and doesn't need
> streaming semantics.
>
> My data is on S3 in binary protobuf format in the form of a large number of
> timestamped files. Each of these files has to be read, parsed (using
> parseDelimitedFrom
> <https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/Parser#parseDelimitedFrom-java.io.InputStream->
> ) into their custom protobuf Java classes and then processed.
>
> I’m currently using the aws-java-sdk to read these files, as I couldn’t
> figure out how to read binary protobufs via Flink semantics (env.readFile).
> But I'm getting OOM errors as the number/size of the files is too large.
>
> So I'm looking to do distributed/parallel reading and parsing of the files
> in Flink. How can these custom binary files be read from S3 using the Flink
> DataSet API (like env.readFile)? How can they be read from S3 in a
> distributed manner?
>
>
>
> --
> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Distributed-reading-and-parsing-of-protobuf-files-from-S3-in-Apache-Flink-tp14480.html
> Sent from the Apache Flink User Mailing List archive at Nabble.com.
>
