Thanks all for your responses. I am now much clearer on this. Thanks
On Tue, Aug 14, 2018 at 9:46 AM, Fabian Hueske <fhue...@gmail.com> wrote:

> Hi,
>
> Flink InputFormats generate their InputSplits sequentially on the JobManager.
> These splits are stored in the heap of the JM process and handed out lazily to SourceTasks when they request them.
> Split assignment is done by an InputSplitAssigner, which can be customized.
> FileInputFormats typically use a LocatableInputSplitAssigner, which tries to assign splits based on locality.
>
> I see three potential problems:
> 1) InputSplit generation might take a long while. The JM is blocked until the splits are generated.
> 2) All InputSplits need to be stored on the JM heap. You might need to assign more memory to the JM process.
> 3) Split assignment might take a while, depending on the complexity of the InputSplitAssigner. You can implement a custom assigner to make this more efficient (from an assignment point of view).
>
> Best, Fabian
>
> 2018-08-14 8:19 GMT+02:00 Jörn Franke <jornfra...@gmail.com>:
>
>> It causes more overhead (processes etc.), which might make it slower.
>> Furthermore, if the files are stored on HDFS, the bottleneck is the NameNode, which will have to answer millions of requests.
>> The latter point will change in future Hadoop versions with http://ozone.hadoop.apache.org/
>>
>> On 13. Aug 2018, at 21:01, Darshan Singh <darshan.m...@gmail.com> wrote:
>>
>> Hi Guys,
>>
>> Is there a limit on the number of files a Flink DataSet can read? My question is: will there be any sort of issue if I have, say, millions of files to read to create a single DataSet?
>>
>> Thanks
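
To illustrate Fabian's third point, here is a minimal sketch of a custom assigner. It assumes the InputSplitAssigner interface as it looked around Flink 1.6 (a single getNextInputSplit(String host, int taskId) method; later releases add further methods such as returnInputSplit), and the class name FifoSplitAssigner is purely illustrative. It hands out splits in FIFO order, trading data locality for constant-time assignment, which can matter with millions of splits:

    import org.apache.flink.core.io.InputSplit;
    import org.apache.flink.core.io.InputSplitAssigner;

    import java.util.Arrays;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Illustrative sketch, not a built-in assigner: hands out splits in
    // arrival order and ignores the requesting host, so each request is
    // O(1) regardless of how many splits exist.
    public class FifoSplitAssigner implements InputSplitAssigner {

        private final ConcurrentLinkedQueue<InputSplit> pending;

        public FifoSplitAssigner(InputSplit[] allSplits) {
            this.pending = new ConcurrentLinkedQueue<>(Arrays.asList(allSplits));
        }

        @Override
        public InputSplit getNextInputSplit(String host, int taskId) {
            // The host hint is deliberately ignored; returning null signals
            // that all splits have been handed out.
            return pending.poll();
        }
    }

Such an assigner is typically wired in by overriding the input format's getInputSplitAssigner method. Note that FileInputFormat declares that method with a concrete return type (LocatableInputSplitAssigner, in the versions I checked), so the exact override depends on which InputFormat you extend; for a fully custom format implementing InputFormat directly, returning the assigner above should work.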