Hi,
readFile() expects a FileInputFormat, i.e., your custom InputFormat would
need to extend FileInputFormat.
In general, an InputFormat decides what to read when it generates its
InputSplits. In your case, the createInputSplits() method should return one
InputSplit for each file it wants to read.
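A minimal sketch of such a format could look like the following (the class
name FileListInputFormat and the choice of TextInputFormat as the base class
are just assumptions for illustration, not tested code):

import java.io.IOException;
import java.util.List;

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.Path;

// Hypothetical sketch: read only an explicit list of files, one split per file.
public class FileListInputFormat extends TextInputFormat {

    private final List<String> filePaths;

    public FileListInputFormat(List<String> filePaths) {
        super(new Path(filePaths.get(0))); // base path is not used for split creation
        this.filePaths = filePaths;
    }

    @Override
    public FileInputSplit[] createInputSplits(int minNumSplits) throws IOException {
        FileInputSplit[] splits = new FileInputSplit[filePaths.size()];
        for (int i = 0; i < filePaths.size(); i++) {
            Path file = new Path(filePaths.get(i));
            long length = file.getFileSystem().getFileStatus(file).getLen();
            // one split that covers the whole file
            splits[i] = new FileInputSplit(i, file, 0, length, null);
        }
        return splits;
    }
}

Since Flink hands input splits to the reader tasks on request at runtime,
faster readers simply pick up more of the per-file splits, which also helps
with the load-balancing concern mentioned later in the thread.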
Hi Fabian,
Thanks for your response.
If I implemented my own InputFormat, how would I read a specific list of
files from S3?
Assuming I need to use readFile(), the call below would read all of the
files from the specified S3 bucket or path:
env.readFile(new MyInputFormat(), "s3://my-bucket/")
Is there a way to read only a specific list of files from that bucket,
rather than everything under the path?
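For example, what I'd like is to hand the list straight to the environment,
roughly along these lines (file names made up, and FileListInputFormat
standing for whatever custom format I would have to write):

import java.util.Arrays;
import java.util.List;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ReadFileListJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // made-up file names, just to show the intent
        List<String> files = Arrays.asList(
            "s3://my-bucket/part-0001",
            "s3://my-bucket/part-0002");

        // createInput() takes the format as-is, so the format itself can
        // decide which files to read when it creates its input splits
        DataSet<String> lines = env.createInput(new FileListInputFormat(files));

        lines.print();
    }
}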
Hi,
this is a valid approach.
It might suffer from unbalanced load if the reader tasks process the files
at different speeds (or the files vary in size), because each task has to
process the same number of files.
An alternative would be to implement your own InputFormat.
The input format would create the input splits itself, e.g., one InputSplit
per file.
Hi Fabian,
Thank you so much for your quick response, I appreciate it.
Since I'm working with a very large number of files of small sizes, I don't
necessarily need to read each file in parallel.
I need to read my large list of files in parallel - that is, split up my
list of files into smaller batches and have each parallel task read one
batch of files.
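Roughly, what I have in mind is something like the sketch below (the file
names and the line-by-line flatMap reading are just made up to show the
idea):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.core.fs.Path;
import org.apache.flink.util.Collector;

public class ReadManySmallFiles {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // made-up list of small files; in practice this would be the large file list
        List<String> files = Arrays.asList(
            "s3://my-bucket/a.txt",
            "s3://my-bucket/b.txt");

        DataSet<String> lines = env
            .fromCollection(files)
            // distribute the paths round-robin, so every task gets roughly
            // the same number of paths, regardless of file size
            .rebalance()
            .flatMap(new FlatMapFunction<String, String>() {
                @Override
                public void flatMap(String path, Collector<String> out) throws Exception {
                    Path file = new Path(path);
                    // each task reads its files completely, one after the other
                    try (BufferedReader reader = new BufferedReader(
                            new InputStreamReader(file.getFileSystem().open(file)))) {
                        String line;
                        while ((line = reader.readLine()) != null) {
                            out.collect(line);
                        }
                    }
                }
            });

        lines.print();
    }
}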
Hi,
it depends on the file format whether a file can be read in parallel or
not. Basically, you have to be able to identify valid offsets from which
you can start reading.
There are a few techniques like fixed-size blocks with padding or a footer
section with split offsets, but if the file is already written without such
provisions, there is no safe way to split it.
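As a toy example of the fixed-size case (all numbers made up): if every
record has the same length, every multiple of the record length is a valid
start offset, so the split boundaries can be computed up front:

public class FixedSizeSplits {
    public static void main(String[] args) {
        long recordSize = 128;        // assumed fixed record length in bytes
        long fileLength = 1_000_000;  // assumed total file size in bytes
        int numSplits = 4;

        long numRecords = fileLength / recordSize;
        long recordsPerSplit = (numRecords + numSplits - 1) / numSplits; // ceiling

        for (int i = 0; i < numSplits; i++) {
            long startRecord = (long) i * recordsPerSplit;
            long endRecord = Math.min(startRecord + recordsPerSplit, numRecords);
            if (startRecord < numRecords) {
                long start = startRecord * recordSize;          // always a record boundary
                long length = (endRecord - startRecord) * recordSize;
                System.out.printf("split %d: offset=%d, length=%d%n", i, start, length);
            }
        }
    }
}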