Dear Flink community,
I was wondering what would be the recommended (best?) way to implement
a file conversion that runs in parallel on all available Flink nodes,
since the work is "embarrassingly parallel" (no dependencies between
files).
Say I have an HDFS folder that contains multiple structured text files
holding (x,y) pairs (think of CSV).
For each of these files I want to do the following (individually per
file):
* Read file from HDFS
* Extract dataset(s) from the file (e.g. a list of (x,y) pairs); a
sketch of these first two steps follows this list
* Apply some filter (e.g. smoothing)
* Do some pattern recognition on smoothed data
* Write results back to HDFS (different format)
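To make the first two steps concrete, here is roughly what I picture
ReadDataSetFromFile doing. This is only a sketch: the file layout
(comma-separated pairs, with blank lines separating multiple data sets
within one file) and the double[][] output type are assumptions, not my
final code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.util.Collector;

// Takes one HDFS path and emits one double[][] per data set in that file.
public class ReadDataSetFromFile implements FlatMapFunction<String, double[][]> {

    @Override
    public void flatMap(String fileName, Collector<double[][]> out) throws Exception {
        FileSystem fs = FileSystem.get(new URI(fileName));
        List<double[]> points = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(fileName))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    // a blank line ends the current data set (my assumption)
                    if (!points.isEmpty()) {
                        out.collect(points.toArray(new double[0][]));
                        points.clear();
                    }
                } else {
                    // each non-blank line is one "x,y" pair
                    String[] xy = line.split(",");
                    points.add(new double[] {
                            Double.parseDouble(xy[0].trim()),
                            Double.parseDouble(xy[1].trim()) });
                }
            }
        }
        if (!points.isEmpty()) {
            out.collect(points.toArray(new double[0][])); // last data set in the file
        }
    }
}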
Would the following be a good idea?
DataSource<String> fileList = ...; // contains the list of file names in HDFS

// For each file name in the list do...
DataSet<FeatureList> featureList = fileList
    .flatMap(new ReadDataSetFromFile()) // flatMap, because there might be multiple data sets in one file
    .map(new Smoothing())
    .map(new FindPatterns());

featureList.writeAsFormattedText( ... );
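(Plus, I assume, an env.execute() at the end, since if I understand it
correctly the sink only runs once the job is actually triggered.)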
However, I have the feeling that Flink does not distribute these
independent tasks across the available nodes, but executes everything
on a single node.
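Could that be because my file list comes from a collection-based
source? As far as I understand, such sources run with parallelism 1, so
everything downstream might stay on a single node unless the records
are explicitly redistributed. Would a rebalance() after the source,
roughly like this, be the right fix? (Just a sketch; fileNames stands
for however I obtain the list of files.)

List<String> fileNames = ...; // however the file names are obtained

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<String> fileList = env
    .fromCollection(fileNames) // collection-based sources are non-parallel (parallelism 1)
    .rebalance();              // redistribute the file names round-robin across all parallel tasks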
Cheers
Tim