in my test I was using the local fs (ext4)

On 18 Nov 2015 19:17, "Stephan Ewen" <se...@apache.org> wrote:
The JobManager does not read all files, but it has to query HDFS for the metadata of every file (size, blocks, block locations), which can take a while. There is a separate call to the HDFS NameNode for each file; the more files, the more metadata has to be collected.

On Wed, Nov 18, 2015 at 7:15 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

So why does it take so long to start the job? Because in any case the job manager has to read all the lines of the input files before generating the splits?

On 18 Nov 2015 17:52, "Stephan Ewen" <se...@apache.org> wrote:

Late answer, sorry:

The splits are created in the JobManager, so the job submission should not be affected by that.

The assignment of splits to workers is very fast, so many splits with small data is not very different from few splits with large data.

Lines are never materialized, and the operators do not work differently based on different numbers of splits.

On Wed, Oct 7, 2015 at 4:26 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

I've tried to split my huge file by line count (using the bash command split -l) in two different ways:

1. small line count (huge number of small files)
2. big line count (small number of big files)

I can't understand why the time required to effectively start the job is more or less the same in both cases:

- in 1. it takes a long time to fetch the file list (~50,000 files), and the split assigner is fast to assign the splits (but even though each assignment is fast, there are a lot of them)
- in 2. Flink is fast at fetching the file list, but it is extremely slow at generating the splits to assign

Initially I thought that Flink was eagerly materializing the lines somewhere, but neither memory nor disk usage increases.
What is going on underneath? Is it normal?

Thanks in advance,
Flavio

On Wed, Oct 7, 2015 at 3:27 PM, Stephan Ewen <se...@apache.org> wrote:

The split functionality is in the FileInputFormat, and the functionality that takes care of lines across splits is in the DelimitedInputFormat.

On Wed, Oct 7, 2015 at 3:24 PM, Fabian Hueske <fhue...@gmail.com> wrote:

I'm sorry, there is no such documentation. You need to look at the code :-(

2015-10-07 15:19 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

And what is the split policy for the FileInputFormat? Does it depend on the fs block size?
Is there a pointer to the several Flink input formats and a description of their internals?

On Wed, Oct 7, 2015 at 3:09 PM, Fabian Hueske <fhue...@gmail.com> wrote:

Hi Flavio,

it is not possible to split by line count because that would mean reading and parsing the file just for splitting.

Parallel processing of data sources depends on the input splits created by the InputFormat. Local files can be split just like files in HDFS. Usually, each file corresponds to at least one split, but multiple files could also be put into a single split if necessary. The logic for that would go into the InputFormat.createInputSplits() method.

Cheers,
Fabian
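To illustrate that last point, here is a minimal sketch of where such split logic would live. The class name is hypothetical and, for simplicity, it just emits one split per file (flat directories only) instead of letting FileInputFormat cut files along block boundaries; packing several files into one split would need a custom split type, but the override point is the same createInputSplits() method:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.FileStatus;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

// Hypothetical example: emit exactly one split per file instead of
// splitting files along block boundaries.
public class OneSplitPerFileInputFormat extends TextInputFormat {

    public OneSplitPerFileInputFormat(Path filePath) {
        super(filePath);
    }

    @Override
    public FileInputSplit[] createInputSplits(int minNumSplits) throws IOException {
        List<FileInputSplit> splits = new ArrayList<>();
        FileSystem fs = getFilePath().getFileSystem();

        int splitNum = 0;
        for (FileStatus file : fs.listStatus(getFilePath())) {
            if (!file.isDir()) {
                // one split covering the whole file, no preferred hosts
                splits.add(new FileInputSplit(splitNum++, file.getPath(), 0, file.getLen(), new String[0]));
            }
        }
        return splits.toArray(new FileInputSplit[splits.size()]);
    }
}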
2015-10-07 14:47 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

Hi to all,

is there a way to split a single local file by line count (e.g. a split every 100 lines) in a LocalEnvironment to speed up a simple map function? For me it is not very clear how the local files (the files in a directory if recursive=true) are managed by Flink. Is there any reference to these internals?

Best,
Flavio
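Tying the answers back to the original question: with the default formats there is no need to pre-split a local file by line count. FileInputFormat cuts each file into byte ranges, and DelimitedInputFormat takes care of lines that cross a range boundary, so a simple map already runs in parallel. A minimal sketch (the path and the parallelism of 4 are assumed values, not taken from the thread):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

// Illustrative only: read a local file in a LocalEnvironment and map over its lines.
public class LocalFileLineLengths {

    public static void main(String[] args) throws Exception {
        // local environment with 4 slots (assumed value)
        ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(4);

        DataSet<Integer> lineLengths = env
            .readTextFile("file:///tmp/input.txt")   // hypothetical path
            .map(new MapFunction<String, Integer>() {
                @Override
                public Integer map(String line) {
                    return line.length();
                }
            });

        lineLengths.print();
    }
}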