Hi, I've been experimenting with Apache Flink to process some data,
and I'm starting out with the batch DataSet API.

To start, I read in some data from files in an S3 folder:

DataSet<String> records = env.readTextFile("s3://my-s3-bucket/some-folder/");


Within the folder there are 20 gzipped files, and I run the job with 20
nodes/tasks (so parallelism 20). It looks like each node is reading
ALL the files in the folder, but what I really want is for each
node/task to read exactly one file and process only the data within
the file it read.

Is this expected behavior? Am I supposed to be doing something
different here to get the result I want?

Thanks.
