Dear Flink community,

I was wondering what would be the recommended (best?) way to achieve a kind of file conversion that runs in parallel on all available Flink nodes, since the problem is "embarrassingly parallel" (no dependencies between files).

Say I have an HDFS folder containing multiple structured text files of (x,y) pairs (think of CSV).

For each of these files, I want to do the following (individually per file):

* Read file from HDFS
* Extract dataset(s) from file (e.g. list of (x,y) pairs)
* Apply some filter (e.g. smoothing)
* Do some pattern recognition on smoothed data
* Write results back to HDFS (different format)

Would the following be a good idea?

DataSource<String> fileList = ... // contains the list of file names in HDFS

// For each "filename" in the list do...
DataSet<FeatureList> featureList = fileList
        .flatMap(new ReadDataSetFromFile()) // flatMap because there might be multiple data sets in a file
        .map(new Smoothing())
        .map(new FindPatterns());

featureList.writeAsFormattedText( ... );
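
For concreteness, here is a rough sketch of how I imagine ReadDataSetFromFile: each subtask opens its input files directly from HDFS via Hadoop's FileSystem API, so the file contents never pass through a single node. The comma-separated format, the blank-line separator between data sets, and the List<Tuple2<Double, Double>> element type are just assumptions for the sketch:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// One input record = one HDFS path; one output record = one (x,y) series.
public class ReadDataSetFromFile
        implements FlatMapFunction<String, List<Tuple2<Double, Double>>> {

    @Override
    public void flatMap(String filePath, Collector<List<Tuple2<Double, Double>>> out)
            throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(filePath))))) {
            List<Tuple2<Double, Double>> series = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    // Assumption: a blank line separates multiple data sets in one file.
                    if (!series.isEmpty()) {
                        out.collect(series);
                        series = new ArrayList<>();
                    }
                } else {
                    String[] parts = line.split(",");
                    series.add(Tuple2.of(Double.parseDouble(parts[0].trim()),
                                         Double.parseDouble(parts[1].trim())));
                }
            }
            if (!series.isEmpty()) {
                out.collect(series); // emit the last (or only) data set
            }
        }
    }
}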

I have the feeling that Flink does not distribute the independent tasks across the available nodes but executes everything on a single node.
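
As far as I understand, a source created with env.fromElements() / fromCollection() runs with parallelism 1, so I suspect I need an explicit rebalance() to spread the file names round-robin over all parallel subtasks, plus a default parallelism > 1 (e.g. "flink run -p <n> ..."). A minimal sketch of the driver, with placeholder paths and a trivial stand-in for Smoothing/FindPatterns:

import java.util.List;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class ParallelFileConversion {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Parallelism-1 source that holds only the file names, not the data.
        DataSet<String> fileList = env.fromElements(
                "hdfs:///input/a.csv", "hdfs:///input/b.csv"); // placeholder paths

        DataSet<String> results = fileList
                .rebalance() // distribute the file names evenly over all subtasks
                .flatMap(new ReadDataSetFromFile())
                .map(new MapFunction<List<Tuple2<Double, Double>>, String>() {
                    @Override
                    public String map(List<Tuple2<Double, Double>> series) {
                        // Stand-in for Smoothing + FindPatterns.
                        return "series with " + series.size() + " points";
                    }
                });

        results.writeAsText("hdfs:///output/features"); // placeholder path
        env.execute("parallel file conversion");
    }
}

Is rebalance() the right tool here, or does Flink redistribute the records automatically when the parallelism changes between operators?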


Cheers
Tim


