How to make Flink read all files in HDFS folder and do transformations on th e data

Palle Sat, 07 May 2016 03:13:27 -0700

Hi there.

I've got a HDFS folder containing a lot of files. All files contains a lot of 
JSON objects, one for each line. I will have several TB in the HDFS folder.


My plan is to make Flink read all files and all JSON objects and then do some 
analysis on the data, actually very similar to the flatMap/groupBy/reduceGroup 
transformations that is done in the WordCount example.

But I am a bit stuck, because I cannot seem to find out how to make Flink read 
all files in a HDFS dir and then perform the transformations on the data. I 
have googled quite a bit and also looked in the Flink API and mail history.

Can anyone point me to an example where Flink is used to read all files in a 
HDFS folder and then do transformations on the data)?

- and a second question: Is there an elegant way to make Flink handle the JSON 
objects? - can they be converted to POJOs by something similar to the 
pojoType() method?

/Palle

How to make Flink read all files in HDFS folder and do transformations on th e data

Reply via email to