It was long ago, but if I remember correctly there were about 50k files.

On 18 Nov 2015 19:22, "Stephan Ewen" <se...@apache.org> wrote:

Okay, let me take a step back and make sure I understand this right... With many small files it takes longer to start the job, correct? How much time did it actually take, and how many files did you have?

On Wed, Nov 18, 2015 at 7:18 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

In my test I was using the local fs (ext4).

On 18 Nov 2015 19:17, "Stephan Ewen" <se...@apache.org> wrote:

The JobManager does not read all the files, but it has to query the HDFS for all file metadata (size, blocks, block locations), which can take a bit. There is a separate call to the HDFS NameNode for each file: the more files, the more metadata has to be collected.
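To make that cost concrete, the per-file metadata lookups look roughly like this against Flink's FileSystem abstraction. A minimal sketch, not the actual FileInputFormat code: the SplitMetadataProbe class is hypothetical, while the org.apache.flink.core.fs types are real.

    import org.apache.flink.core.fs.BlockLocation;
    import org.apache.flink.core.fs.FileStatus;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.Path;

    // Hypothetical probe illustrating why many small files slow down split
    // creation: there is one status/block-location lookup per file (one
    // NameNode round trip each in the HDFS case).
    public class SplitMetadataProbe {
        public static void main(String[] args) throws Exception {
            Path dir = new Path(args[0]);
            FileSystem fs = dir.getFileSystem();
            for (FileStatus file : fs.listStatus(dir)) {
                // Cheap for a few big files, expensive for ~50k small ones.
                BlockLocation[] blocks =
                    fs.getFileBlockLocations(file, 0, file.getLen());
                System.out.println(file.getPath() + ": " + blocks.length + " block(s)");
            }
        }
    }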
On Wed, Nov 18, 2015 at 7:15 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

So why does it take so long to start the job? Because in any case the JobManager has to read all the lines of the input files before generating the splits?

On 18 Nov 2015 17:52, "Stephan Ewen" <se...@apache.org> wrote:

Late answer, sorry:

The splits are created in the JobManager, so the job submission should not be affected by that.

The assignment of splits to workers is very fast, so many splits with small data is not very different from few splits with large data.

Lines are never materialized, and the operators do not work differently based on different numbers of splits.

On Wed, Oct 7, 2015 at 4:26 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

I've tried to split my huge file by line count (using the bash command split -l) in 2 different ways:

1. small line count (huge number of small files)
2. big line count (small number of big files)

I can't understand why the time required to effectively start the job is more or less the same:

- in 1. it takes a long time to fetch the file list (~50,000 files) and the split assigner is fast to assign the splits (but even so, there are a lot of them)
- in 2. Flink is fast in fetching the file list but it's extremely slow to generate the splits to assign

Initially I was thinking that Flink was eagerly materializing the lines somewhere, but neither memory nor disk usage increases. What is going on underneath? Is it normal?

Thanks in advance,
Flavio

On Wed, Oct 7, 2015 at 3:27 PM, Stephan Ewen <se...@apache.org> wrote:

The split functionality is in the FileInputFormat, and the functionality that takes care of lines across splits is in the DelimitedInputFormat.
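For what that line-across-splits handling means in practice, here is a simplified sketch of the convention, not the actual DelimitedInputFormat source (the SplitLineReader class is hypothetical): a reader skips its partial first line unless the split starts at offset 0, and reads its last line past the split end, so every line is read exactly once.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical illustration of the record-across-split-boundary
    // convention; boundary handling is simplified compared to the real code.
    public class SplitLineReader {

        // Returns the lines "owned" by the byte range [start, start + length).
        static List<String> readSplit(String path, long start, long length)
                throws IOException {
            List<String> lines = new ArrayList<>();
            long end = start + length;
            try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
                file.seek(start);
                if (start != 0) {
                    // The (possibly partial) first line belongs to the
                    // previous split, which reads past its own end.
                    file.readLine();
                }
                while (file.getFilePointer() <= end) {
                    String line = file.readLine();
                    if (line == null) {
                        break; // end of file
                    }
                    // A line that starts at or before 'end' is completed
                    // here, even if it extends beyond the split boundary.
                    lines.add(line);
                }
            }
            return lines;
        }
    }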
On Wed, Oct 7, 2015 at 3:24 PM, Fabian Hueske <fhue...@gmail.com> wrote:

I'm sorry, there is no such documentation. You need to look at the code :-(

2015-10-07 15:19 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

And what is the split policy for the FileInputFormat? Does it depend on the fs block size? Is there a pointer to the several Flink input formats and a description of their internals?

On Wed, Oct 7, 2015 at 3:09 PM, Fabian Hueske <fhue...@gmail.com> wrote:

Hi Flavio,

it is not possible to split by line count because that would mean reading and parsing the file just for splitting.

Parallel processing of data sources depends on the input splits created by the InputFormat. Local files can be split just like files in HDFS. Usually, each file corresponds to at least one split, but multiple files could also be put into a single split if necessary. The logic for that would go into the InputFormat.createInputSplits() method (see the sketch at the end of this thread).

Cheers, Fabian

2015-10-07 14:47 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

Hi to all,

is there a way to split a single local file by line count (e.g. a split every 100 lines) in a LocalEnvironment to speed up a simple map function? For me it is not very clear how local files (the files in a directory, if recursive=true) are managed by Flink. Is there any reference to these internals?

Best,
Flavio
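Fabian's pointer above, that custom split policy goes into createInputSplits(), could look roughly like the following. A sketch under assumptions: WholeFileTextInputFormat is hypothetical, it simply emits one split per whole file instead of splitting by block size, and actually packing several files into one split would additionally require a custom InputSplit type.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flink.api.java.io.TextInputFormat;
    import org.apache.flink.core.fs.FileInputSplit;
    import org.apache.flink.core.fs.FileStatus;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.Path;

    // Hypothetical input format overriding the split-creation hook:
    // one split per file, regardless of block size.
    public class WholeFileTextInputFormat extends TextInputFormat {

        public WholeFileTextInputFormat(Path path) {
            super(path);
        }

        @Override
        public FileInputSplit[] createInputSplits(int minNumSplits) throws IOException {
            Path path = getFilePath();
            FileSystem fs = path.getFileSystem();
            FileStatus root = fs.getFileStatus(path);

            // Collect the input files (one directory level deep, for brevity).
            List<FileStatus> files = new ArrayList<>();
            if (root.isDir()) {
                for (FileStatus f : fs.listStatus(path)) {
                    if (!f.isDir()) {
                        files.add(f);
                    }
                }
            } else {
                files.add(root);
            }

            // Emit exactly one split spanning each whole file.
            List<FileInputSplit> splits = new ArrayList<>();
            int num = 0;
            for (FileStatus f : files) {
                splits.add(new FileInputSplit(num++, f.getPath(), 0, f.getLen(), new String[0]));
            }
            return splits.toArray(new FileInputSplit[0]);
        }
    }

Such a format would be used like any other, e.g. env.readFile(new WholeFileTextInputFormat(new Path(dir)), dir). Note that FileInputFormat's unsplittable flag gives a similar one-split-per-file effect; the override is shown only to mark where the policy lives.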