In my test I was using the local fs (ext4).
On 18 Nov 2015 19:17, "Stephan Ewen" <se...@apache.org> wrote:

> The JobManager does not read all files, but it has to query HDFS for
> all file metadata (size, blocks, block locations), which can take a bit.
> There is a separate call to the HDFS NameNode for each file. The more
> files, the more metadata has to be collected.
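>
> For illustration, the enumeration step is roughly a loop of this shape (a
> simplified sketch using Flink's core-fs classes, not the actual Flink
> source):
>
>     import org.apache.flink.core.fs.*;
>
>     // hypothetical helper: enumerate block metadata for every file in a dir
>     static void enumerateBlocks(Path dir) throws java.io.IOException {
>         FileSystem fs = dir.getFileSystem();
>         for (FileStatus file : fs.listStatus(dir)) {
>             if (!file.isDir()) {
>                 // one metadata round-trip to the NameNode per file
>                 BlockLocation[] blocks =
>                     fs.getFileBlockLocations(file, 0, file.getLen());
>             }
>         }
>     }
>
> With ~50,000 files that means on the order of 50,000 such calls before a
> single split is assigned.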
>
>
> On Wed, Nov 18, 2015 at 7:15 PM, Flavio Pompermaier <pomperma...@okkam.it>
> wrote:
>
>> So why does it take so long to start the job? Because in any case the
>> JobManager has to read all the lines of the input files before generating
>> the splits?
>> On 18 Nov 2015 17:52, "Stephan Ewen" <se...@apache.org> wrote:
>>
>>> Late answer, sorry:
>>>
>>> The splits are created in the JobManager, so the job submission should
>>> not be affected by that.
>>>
>>> The assignment of splits to workers is very fast, so many splits with
>>> little data each is not very different from a few splits with large data.
>>>
>>> Lines are never materialized and the operators do not work differently
>>> based on different numbers of splits.
>>>
>>> On Wed, Oct 7, 2015 at 4:26 PM, Flavio Pompermaier <pomperma...@okkam.it
>>> > wrote:
>>>
>>>> I've tried to split my huge file by lines count (using the bash command
>>>> split -l) in 2 different ways:
>>>>
>>>>    1. small lines count (huge number of small files)
>>>>    2. big lines count (small number of big files)
>>>>
>>>> I can't understand why the time required to actually start the job
>>>> is more or less the same in both cases:
>>>>
>>>>    - in 1. it takes a long time to fetch the file list (~50,000 files),
>>>>    and while the split assigner is fast to assign the splits, there are
>>>>    a lot of them
>>>>    - in 2. Flink is fast at fetching the file list but extremely slow
>>>>    to generate the splits to assign
>>>>
>>>> Initially I was thinking that Flink was eagerly materializing the lines
>>>> somewhere, but neither memory nor disk usage increases.
>>>> What is going on underneath? Is it normal?
>>>>
>>>> Thanks in advance,
>>>> Flavio
>>>>
>>>>
>>>>
>>>> On Wed, Oct 7, 2015 at 3:27 PM, Stephan Ewen <se...@apache.org> wrote:
>>>>
>>>>> The split functionality is in the FileInputFormat, and the
>>>>> functionality that takes care of lines across splits is in the
>>>>> DelimitedInputFormat.
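>>>>>
>>>>> In plain terms the rule is: a split that does not start at offset 0
>>>>> skips everything up to the first delimiter (the previous split finishes
>>>>> that record), and every split reads past its nominal end to complete
>>>>> the last record it started. A minimal standalone sketch of that rule
>>>>> (plain java.io, not the actual Flink code):
>>>>>
>>>>>     import java.io.IOException;
>>>>>     import java.io.RandomAccessFile;
>>>>>
>>>>>     // hypothetical demo of per-split line reading
>>>>>     static void readSplit(String file, long start, long len)
>>>>>             throws IOException {
>>>>>         try (RandomAccessFile in = new RandomAccessFile(file, "r")) {
>>>>>             in.seek(start);
>>>>>             // skip the partial line at the split start; the previous
>>>>>             // split is responsible for it
>>>>>             if (start > 0) {
>>>>>                 in.readLine();
>>>>>             }
>>>>>             String line;
>>>>>             // keep reading while the line *starts* inside this split,
>>>>>             // even if it ends beyond the split boundary
>>>>>             while (in.getFilePointer() <= start + len
>>>>>                     && (line = in.readLine()) != null) {
>>>>>                 System.out.println(line);
>>>>>             }
>>>>>         }
>>>>>     }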
>>>>>
>>>>> On Wed, Oct 7, 2015 at 3:24 PM, Fabian Hueske <fhue...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I'm sorry there is no such documentation.
>>>>>> You need to look at the code :-(
>>>>>>
>>>>>> 2015-10-07 15:19 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>>>
>>>>>>> And what is the split policy for the FileInputFormat? Does it depend
>>>>>>> on the fs block size?
>>>>>>> Is there a pointer to the several Flink input formats and a
>>>>>>> description of their internals?
>>>>>>>
>>>>>>> On Wed, Oct 7, 2015 at 3:09 PM, Fabian Hueske <fhue...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Flavio,
>>>>>>>>
>>>>>>>> it is not possible to split by line count because that would mean
>>>>>>>> reading and parsing the file just for splitting.
>>>>>>>>
>>>>>>>> Parallel processing of data sources depends on the input splits
>>>>>>>> created by the InputFormat. Local files can be split just like files
>>>>>>>> in HDFS. Usually, each file corresponds to at least one split, but
>>>>>>>> multiple files could also be put into a single split if necessary.
>>>>>>>> The logic for that would go into the
>>>>>>>> InputFormat.createInputSplits() method.
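>>>>>>>>
>>>>>>>> For example, a format that forces exactly one split per file could
>>>>>>>> be sketched like this (hypothetical code against the flink-core
>>>>>>>> classes, a sketch rather than a tested implementation):
>>>>>>>>
>>>>>>>>     import java.io.IOException;
>>>>>>>>     import java.util.ArrayList;
>>>>>>>>     import java.util.List;
>>>>>>>>     import org.apache.flink.api.java.io.TextInputFormat;
>>>>>>>>     import org.apache.flink.core.fs.*;
>>>>>>>>
>>>>>>>>     public class OneSplitPerFileFormat extends TextInputFormat {
>>>>>>>>
>>>>>>>>         public OneSplitPerFileFormat(Path path) {
>>>>>>>>             super(path);
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         @Override
>>>>>>>>         public FileInputSplit[] createInputSplits(int minNumSplits)
>>>>>>>>                 throws IOException {
>>>>>>>>             FileSystem fs = getFilePath().getFileSystem();
>>>>>>>>             List<FileInputSplit> splits = new ArrayList<>();
>>>>>>>>             for (FileStatus file : fs.listStatus(getFilePath())) {
>>>>>>>>                 if (file.isDir()) {
>>>>>>>>                     continue;
>>>>>>>>                 }
>>>>>>>>                 BlockLocation[] blocks =
>>>>>>>>                     fs.getFileBlockLocations(file, 0, file.getLen());
>>>>>>>>                 String[] hosts = blocks.length > 0
>>>>>>>>                     ? blocks[0].getHosts() : new String[0];
>>>>>>>>                 // one split covering the whole file, block size ignored
>>>>>>>>                 splits.add(new FileInputSplit(splits.size(),
>>>>>>>>                         file.getPath(), 0, file.getLen(), hosts));
>>>>>>>>             }
>>>>>>>>             return splits.toArray(new FileInputSplit[splits.size()]);
>>>>>>>>         }
>>>>>>>>     }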
>>>>>>>>
>>>>>>>> Cheers, Fabian
>>>>>>>>
>>>>>>>> 2015-10-07 14:47 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it
>>>>>>>> >:
>>>>>>>>
>>>>>>>>> Hi to all,
>>>>>>>>>
>>>>>>>>> is there a way to split a single local file by line count (e.g. a
>>>>>>>>> split every 100 lines) in a LocalEnvironment to speed up a simple
>>>>>>>>> map function? It is not very clear to me how local files (the files
>>>>>>>>> in a directory if recursive=true) are managed by Flink. Is there
>>>>>>>>> any reference to these internals?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Flavio
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>