Dear Flink community,
I was wondering what would be the recommended (best?) way to implement
a file conversion that runs in parallel on all available Flink nodes,
since the work is "embarrassingly parallel" (no dependencies between
files).
Say I have an HDFS folder that contains multiple structured text files
holding (x,y) pairs (think of CSV).
For each of these files I want to do the following (individually per
file):
* Read file from HDFS
* Extract dataset(s) from the file (e.g. a list of (x,y) pairs); a
sketch of these first two steps follows this list
* Apply some filter (e.g. smoothing)
* Do some pattern recognition on smoothed data
* Write results back to HDFS (different format)
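To make the first two steps concrete, here is roughly what I picture
ReadDataSetFromFile doing. This is only a sketch: the file layout
(comma-separated pairs, with blank lines separating multiple data sets
within one file) and the double[][] output type are assumptions, not my
final code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.util.Collector;

// Takes one HDFS path and emits one double[][] per data set in that file.
public class ReadDataSetFromFile implements FlatMapFunction<String, double[][]> {

    @Override
    public void flatMap(String fileName, Collector<double[][]> out) throws Exception {
        FileSystem fs = FileSystem.get(new URI(fileName));
        List<double[]> points = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(fileName))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    // a blank line ends the current data set (my assumption)
                    if (!points.isEmpty()) {
                        out.collect(points.toArray(new double[0][]));
                        points.clear();
                    }
                } else {
                    // each non-blank line is one "x,y" pair
                    String[] xy = line.split(",");
                    points.add(new double[] {
                            Double.parseDouble(xy[0].trim()),
                            Double.parseDouble(xy[1].trim()) });
                }
            }
        }
        if (!points.isEmpty()) {
            out.collect(points.toArray(new double[0][])); // last data set in the file
        }
    }
}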
Would the following be a good idea?
DataSource<String> fileList = ...; // contains the list of file names in HDFS

// For each file name in the list do...
DataSet<FeatureList> featureList = fileList
    .flatMap(new ReadDataSetFromFile()) // flatMap, because there might be multiple data sets in one file
    .map(new Smoothing())
    .map(new FindPatterns());

featureList.writeAsFormattedText( ... );
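(Plus, I assume, an env.execute() at the end, since if I understand it
correctly the sink only runs once the job is actually triggered.)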
However, I have the feeling that Flink does not distribute these
independent tasks across the available nodes, but executes everything
on a single node.
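Could that be because my file list comes from a collection-based
source? As far as I understand, such sources run with parallelism 1, so
everything downstream might stay on a single node unless the records
are explicitly redistributed. Would a rebalance() after the source,
roughly like this, be the right fix? (Just a sketch; fileNames stands
for however I obtain the list of files.)

List<String> fileNames = ...; // however the file names are obtained

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<String> fileList = env
    .fromCollection(fileNames) // collection-based sources are non-parallel (parallelism 1)
    .rebalance();              // redistribute the file names round-robin across all parallel tasks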
Cheers
Tim