Alex,
We are working on the same thing, for the same exact reason. We are
trying to avoid the complexities of running HDFS just for the file
storage. We are also okay with the S3 limitations it introduces. We'll
try to update the group if we find solutions for parallelizing the file
consumption.
Each file is ~1.8G compressed (and about 15G uncompressed, so a little over
300G total for all the files).
In the Web Client UI, when I look at the Plan and click on the subtask for
reading in the files, I see a line for each host, and the Bytes Sent for
each host is about 350G.
The job takes longer than I would expect as a result.
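To double-check beyond the Bytes Sent numbers, here is a minimal sketch that
counts records per parallel subtask with accumulators (the class name, the
accumulator names, and the discarding sink are just illustrative, not part of
my actual job):

import org.apache.flink.api.common.accumulators.LongCounter;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.DiscardingOutputFormat;
import org.apache.flink.configuration.Configuration;

public class CountPerSubtask {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> records = env.readTextFile("s3://my-s3-bucket/some-folder/");

        records.map(new RichMapFunction<String, String>() {
            private final LongCounter counter = new LongCounter();

            @Override
            public void open(Configuration parameters) {
                // Register one counter per parallel subtask, e.g. "records-subtask-3".
                getRuntimeContext().addAccumulator(
                        "records-subtask-" + getRuntimeContext().getIndexOfThisSubtask(),
                        counter);
            }

            @Override
            public String map(String line) {
                counter.add(1L); // count every record this subtask sees
                return line;
            }
        }).output(new DiscardingOutputFormat<String>());

        // If every subtask reports roughly the full record count, the files
        // are being read multiple times rather than split across subtasks.
        System.out.println(env.execute("count per subtask").getAccumulatorResults());
    }
}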
Hi,
This is not the expected behavior.
Each parallel instance should read only one file. The files should not be
read multiple times by the different parallel instances.
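For reference, a minimal sketch of what I would expect (the class name and the
explicit parallelism of 20, matching your file count, are just illustrative):
as far as I know, Flink treats gzipped files as non-splittable, so each file
becomes a single input split that is assigned to exactly one subtask.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ReadGzipFolder {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // One subtask per file; with fewer subtasks, some subtasks simply
        // get more than one split, but no split should be read twice.
        env.setParallelism(20);

        DataSet<String> records = env.readTextFile("s3://my-s3-bucket/some-folder/");

        // With 20 gzipped files, count() should reflect each record once,
        // not 20 times.
        System.out.println(records.count());
    }
}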
How did you check / find out that each node is reading all the data?
Regards,
Robert
On Tue, Nov 22, 2016 at 7:42 PM, Alex Reid wrote:
Hi, I've been playing around with using apache flink to process some data,
and I'm starting out using the batch DataSet API.
To start, I read in some data from files in an S3 folder:
DataSet<String> records = env.readTextFile("s3://my-s3-bucket/some-folder/");
Within the folder, there are 20 gzipped files.