Re: Distributed reading and parsing of protobuf files from S3 in Apache Flink

2017-09-04 Thread Fabian Hueske
Hi, readFile() requests a FileInputFormat, i.e., your custom InputFormat would need to extend FileInputFormat. In general, any InputFormat decides what to read when generating InputSplits. In your case, the createInputSplits() method should return one InputSplit for each file it wants to rea
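
The "one InputSplit for each file" idea can be sketched in plain Java. This is only an illustration under stated assumptions: FileSplit below is a hypothetical stand-in class, not Flink's FileInputSplit, and the file names under s3://my-bucket/ are made up. A real implementation would do the equivalent inside createInputSplits() of a class extending FileInputFormat.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OneSplitPerFile {
    /** Hypothetical stand-in for an input split: a split number plus the one file it covers. */
    static final class FileSplit {
        final int splitNumber;
        final String path;

        FileSplit(int splitNumber, String path) {
            this.splitNumber = splitNumber;
            this.path = path;
        }
    }

    /** Builds exactly one split per wanted file, numbered in list order. */
    static List<FileSplit> createSplits(List<String> wantedFiles) {
        List<FileSplit> splits = new ArrayList<>();
        for (int i = 0; i < wantedFiles.size(); i++) {
            splits.add(new FileSplit(i, wantedFiles.get(i)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Hypothetical file names; only the bucket prefix appears in the thread.
        List<String> wanted = Arrays.asList("s3://my-bucket/a.pb", "s3://my-bucket/b.pb");
        for (FileSplit split : createSplits(wanted)) {
            System.out.println(split.splitNumber + " -> " + split.path);
        }
    }
}
```

Because the format itself decides which files become splits, files in the bucket that are not in the list are never read.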

Re: Distributed reading and parsing of protobuf files from S3 in Apache Flink

2017-08-31 Thread ShB
Hi Fabian, Thanks for your response. If I implemented my own InputFormat, how would I read a specific list of files from S3? Assuming I need to use readFile(), the call below would read all of the files from the specified S3 bucket or path: env.readFile(MyInputFormat, "s3://my-bucket/") Is there a way

Re: Distributed reading and parsing of protobuf files from S3 in Apache Flink

2017-08-31 Thread Fabian Hueske
Hi, this is a valid approach. It might suffer from unbalanced load if the reader tasks process the files at different speeds (or the files vary in size), because each task has to process the same number of files. An alternative would be to implement your own InputFormat. The input format would crea
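
The load-balancing concern can be made concrete with a toy calculation. The sketch below is plain Java, not Flink code, and every name and number in it is an assumption chosen for illustration. It compares a fixed assignment, where each task gets the same number of files, with a hypothetical pull-based scheme in which the least-loaded task takes the next file.

```java
import java.util.Arrays;

final class SplitAssignmentDemo {
    /** Fixed assignment: file i goes to task i % parallelism (assumes parallelism >= 1). */
    static long staticMakespan(long[] fileCosts, int parallelism) {
        long[] load = new long[parallelism];
        for (int i = 0; i < fileCosts.length; i++) {
            load[i % parallelism] += fileCosts[i];
        }
        // The job finishes when the busiest task finishes.
        return Arrays.stream(load).max().orElse(0L);
    }

    /** Pull-based assignment: the task with the least work so far takes the next file. */
    static long pullMakespan(long[] fileCosts, int parallelism) {
        long[] load = new long[parallelism];
        for (long cost : fileCosts) {
            int idle = 0;
            for (int t = 1; t < parallelism; t++) {
                if (load[t] < load[idle]) {
                    idle = t;
                }
            }
            load[idle] += cost;
        }
        return Arrays.stream(load).max().orElse(0L);
    }

    public static void main(String[] args) {
        // Made-up processing costs: two large files among many small ones.
        long[] costs = {10, 1, 1, 1, 10, 1, 1, 1};
        System.out.println("static: " + staticMakespan(costs, 2)); // prints "static: 22"
        System.out.println("pull:   " + pullMakespan(costs, 2));   // prints "pull:   13"
    }
}
```

With these numbers both tasks receive four files under the fixed assignment, yet one of them ends up with both large files and determines the total runtime.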

Re: Distributed reading and parsing of protobuf files from S3 in Apache Flink

2017-08-30 Thread ShB
Hi Fabian, Thank you so much for your quick response, I appreciate it. Since I'm working with a very large number of files of small sizes, I don't necessarily need to read each file in parallel. I need to read my large list of files in parallel - that is, split up my list of files into small
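
Splitting a list of file paths into pieces that can be handed to parallel readers might look like the following plain-Java sketch. The class name, the chunk size, and the file names are assumptions for illustration; nothing here is Flink API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FileListChunker {
    /** Splits items into consecutive chunks of at most chunkSize elements each. */
    static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            // Copy the sublist so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Hypothetical object keys; a real list would come from an S3 listing.
        List<String> files = Arrays.asList("f1.pb", "f2.pb", "f3.pb", "f4.pb", "f5.pb");
        for (List<String> part : chunk(files, 2)) {
            System.out.println(part);
        }
    }
}
```

Each chunk could then be treated as one element of a parallel collection, with the reading and protobuf parsing done per chunk by whichever task receives it.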

Re: Distributed reading and parsing of protobuf files from S3 in Apache Flink

2017-07-27 Thread Fabian Hueske
> read from s3 in a distributed manner?
> --
> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Distributed-reading-and-parsing-of-protobuf-files-from-S3-in-Apache-Flink-tp14480.html
> Sent from the Apache Flink User Mailing List archive at Nabble.com.

Distributed reading and parsing of protobuf files from S3 in Apache Flink

2017-07-26 Thread ShB
files be read from S3 using the Flink DataSet API (like env.readFile)? How can these custom binary files be read from S3 in a distributed manner?