Re: Reading files from an S3 folder

2016-11-23 Thread Steve Morin
Alex, We are working on the same thing, for exactly the same reason. We are trying to avoid the complexity of running HDFS just for file storage, and we are okay with the S3 limitations that introduces. We'll try to update the group if we find solutions for parallelizing the file consumption ...

Re: Reading files from an S3 folder

2016-11-23 Thread Alex Reid
Each file is ~1.8G compressed (and about 15G uncompressed, so a little over 300G total for all the files). In the Web Client UI, when I look at the plan and click on the subtask that reads in the files, I see a line for each host, and the Bytes Sent for each host is around 350G. The job takes longer ...

Re: Reading files from an S3 folder

2016-11-23 Thread Robert Metzger
Hi, This is not the expected behavior. Each parallel instance should read only one file; the files should not be read multiple times by the different parallel instances. How did you check / find out that each node is reading all the data? Regards, Robert On Tue, Nov 22, 2016 at 7:42 PM, Alex Reid ...
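
[Editorial note] One way to answer Robert's question ("how did you check that each node is reading all the data?") is to count records per parallel subtask with accumulators and compare the counts after the job finishes. The sketch below is only an illustration of that idea; the class name, output path, and accumulator names are made up, and it assumes the same S3 input as in the thread.

    import org.apache.flink.api.common.JobExecutionResult;
    import org.apache.flink.api.common.accumulators.LongCounter;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;

    public class CountRecordsPerSubtask {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<String> records = env.readTextFile("s3://my-s3-bucket/some-folder/");

            DataSet<String> tagged = records.map(new RichMapFunction<String, String>() {
                private LongCounter counter;

                @Override
                public void open(Configuration parameters) {
                    // Register one accumulator per parallel subtask, named by its index.
                    counter = getRuntimeContext().getLongCounter(
                            "records-subtask-" + getRuntimeContext().getIndexOfThisSubtask());
                }

                @Override
                public String map(String value) {
                    counter.add(1L);
                    return value;
                }
            });

            // Hypothetical sink, just so the pipeline has something to execute.
            tagged.writeAsText("s3://my-s3-bucket/debug-output/");

            JobExecutionResult result = env.execute("Count records per subtask");

            // If each parallel instance reads only one of the 20 gzipped files,
            // each counter should hold roughly 1/20th of the total record count;
            // if every subtask reads everything, each counter holds the full count.
            result.getAllAccumulatorResults().forEach((name, value) ->
                    System.out.println(name + " = " + value));
        }
    }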

Reading files from an S3 folder

2016-11-22 Thread Alex Reid
Hi, I've been playing around with using Apache Flink to process some data, and I'm starting out with the batch DataSet API. To start, I read in some data from files in an S3 folder: DataSet<String> records = env.readTextFile("s3://my-s3-bucket/some-folder/"); Within the folder, there are 20 gzipped files ...
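
[Editorial note] For reference, here is a minimal, self-contained sketch of the setup Alex describes: reading a folder of gzipped text files from S3 with the batch DataSet API. The bucket and folder names come from the message above; the output path and job name are made up for illustration.

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class ReadS3Folder {
        public static void main(String[] args) throws Exception {
            final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Read every file in the S3 folder. Gzipped files are decompressed
            // transparently, but each .gz file is a single non-splittable input
            // split, so at most one parallel instance works on a given file.
            DataSet<String> records = env.readTextFile("s3://my-s3-bucket/some-folder/");

            // ... transformations on the records would go here ...

            // Hypothetical sink, just so the job has something to execute.
            records.writeAsText("s3://my-s3-bucket/some-output/");

            env.execute("Read gzipped files from an S3 folder");
        }
    }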