Thanks all for your responses. I am now much clearer on this. Thanks
On Tue, Aug 14, 2018 at 9:46 AM, Fabian Hueske <fhue...@gmail.com> wrote:

> Hi,
>
> Flink InputFormats generate their InputSplits sequentially on the JobManager.
> These splits are stored in the heap of the JM process and handed out lazily to SourceTasks when they request them.
> Split assignment is done by an InputSplitAssigner, which can be customized.
> FileInputFormats typically use a LocatableInputSplitAssigner, which tries to assign splits based on locality.
>
> I see three potential problems:
> 1) InputSplit generation might take a long while. The JM is blocked until the splits are generated.
> 2) All InputSplits need to be stored on the JM heap. You might need to assign more memory to the JM process.
> 3) Split assignment might take a while, depending on the complexity of the InputSplitAssigner. You can implement a custom assigner to make this more efficient (from an assignment point of view).
>
> Best, Fabian
>
> 2018-08-14 8:19 GMT+02:00 Jörn Franke <jornfra...@gmail.com>:
>
>> It causes more overhead (processes etc.), which might make it slower.
>> Furthermore, if the files are stored on HDFS, the bottleneck is the NameNode, which will have to answer millions of requests.
>> The latter point will change in future Hadoop versions with http://ozone.hadoop.apache.org/
>>
>> On 13. Aug 2018, at 21:01, Darshan Singh <darshan.m...@gmail.com> wrote:
>>
>> Hi Guys,
>>
>> Is there a limit on the number of files a Flink DataSet can read? My question is: will there be any sort of issue if I have, say, millions of files to read to create a single DataSet?
>>
>> Thanks
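
To illustrate Fabian's third point, here is a minimal sketch of a custom assigner. It assumes the InputSplitAssigner interface as it looked around Flink 1.6 (a single getNextInputSplit(String host, int taskId) method; later releases add further methods such as returnInputSplit), and the class name FifoSplitAssigner is purely illustrative. It hands out splits in FIFO order, trading data locality for constant-time assignment, which can matter with millions of splits:

    import org.apache.flink.core.io.InputSplit;
    import org.apache.flink.core.io.InputSplitAssigner;

    import java.util.Arrays;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Illustrative sketch, not a built-in assigner: hands out splits in
    // arrival order and ignores the requesting host, so each request is
    // O(1) regardless of how many splits exist.
    public class FifoSplitAssigner implements InputSplitAssigner {

        private final ConcurrentLinkedQueue<InputSplit> pending;

        public FifoSplitAssigner(InputSplit[] allSplits) {
            this.pending = new ConcurrentLinkedQueue<>(Arrays.asList(allSplits));
        }

        @Override
        public InputSplit getNextInputSplit(String host, int taskId) {
            // The host hint is deliberately ignored; returning null signals
            // that all splits have been handed out.
            return pending.poll();
        }
    }

Such an assigner is typically wired in by overriding the input format's getInputSplitAssigner method. Note that FileInputFormat declares that method with a concrete return type (LocatableInputSplitAssigner, in the versions I checked), so the exact override depends on which InputFormat you extend; for a fully custom format implementing InputFormat directly, returning the assigner above should work.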