It was long ago, but if I remember correctly there were about 50k files.

On 18 Nov 2015 19:22, "Stephan Ewen" <se...@apache.org> wrote:

Okay, let me take a step back and make sure I understand this right... With many small files it takes longer to start the job, correct? How much time did it actually take, and how many files did you have?

On Wed, Nov 18, 2015 at 7:18 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

In my test I was using the local fs (ext4).

On 18 Nov 2015 19:17, "Stephan Ewen" <se...@apache.org> wrote:

The JobManager does not read all the files, but it has to query the HDFS for all file metadata (size, blocks, block locations), which can take a bit. There is a separate call to the HDFS NameNode for each file: the more files, the more metadata has to be collected.
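To make that cost concrete, the per-file metadata lookups look roughly like this against Flink's FileSystem abstraction. A minimal sketch, not the actual FileInputFormat code: the SplitMetadataProbe class is hypothetical, while the org.apache.flink.core.fs types are real.

    import org.apache.flink.core.fs.BlockLocation;
    import org.apache.flink.core.fs.FileStatus;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.Path;

    // Hypothetical probe illustrating why many small files slow down split
    // creation: there is one status/block-location lookup per file (one
    // NameNode round trip each in the HDFS case).
    public class SplitMetadataProbe {
        public static void main(String[] args) throws Exception {
            Path dir = new Path(args[0]);
            FileSystem fs = dir.getFileSystem();
            for (FileStatus file : fs.listStatus(dir)) {
                // Cheap for a few big files, expensive for ~50k small ones.
                BlockLocation[] blocks =
                    fs.getFileBlockLocations(file, 0, file.getLen());
                System.out.println(file.getPath() + ": " + blocks.length + " block(s)");
            }
        }
    }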
On Wed, Nov 18, 2015 at 7:15 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

So why does it take so long to start the job? Because in any case the JobManager has to read all the lines of the input files before generating the splits?

On 18 Nov 2015 17:52, "Stephan Ewen" <se...@apache.org> wrote:

Late answer, sorry:

The splits are created in the JobManager, so the job submission should not be affected by that.

The assignment of splits to workers is very fast, so many splits with small data is not very different from few splits with large data.

Lines are never materialized, and the operators do not work differently based on different numbers of splits.

On Wed, Oct 7, 2015 at 4:26 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

I've tried to split my huge file by line count (using the bash command split -l) in 2 different ways:

1. small line count (huge number of small files)
2. big line count (small number of big files)

I can't understand why the time required to effectively start the job is more or less the same:

- in 1. it takes a long time to fetch the file list (~50,000 files) and the split assigner is fast to assign the splits (but even so, there are a lot of them)
- in 2. Flink is fast in fetching the file list but it's extremely slow to generate the splits to assign

Initially I was thinking that Flink was eagerly materializing the lines somewhere, but neither memory nor disk usage increases. What is going on underneath? Is it normal?

Thanks in advance,
Flavio

On Wed, Oct 7, 2015 at 3:27 PM, Stephan Ewen <se...@apache.org> wrote:

The split functionality is in the FileInputFormat, and the functionality that takes care of lines across splits is in the DelimitedInputFormat.
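For what that line-across-splits handling means in practice, here is a simplified sketch of the convention, not the actual DelimitedInputFormat source (the SplitLineReader class is hypothetical): a reader skips its partial first line unless the split starts at offset 0, and reads its last line past the split end, so every line is read exactly once.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical illustration of the record-across-split-boundary
    // convention; boundary handling is simplified compared to the real code.
    public class SplitLineReader {

        // Returns the lines "owned" by the byte range [start, start + length).
        static List<String> readSplit(String path, long start, long length)
                throws IOException {
            List<String> lines = new ArrayList<>();
            long end = start + length;
            try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
                file.seek(start);
                if (start != 0) {
                    // The (possibly partial) first line belongs to the
                    // previous split, which reads past its own end.
                    file.readLine();
                }
                while (file.getFilePointer() <= end) {
                    String line = file.readLine();
                    if (line == null) {
                        break; // end of file
                    }
                    // A line that starts at or before 'end' is completed
                    // here, even if it extends beyond the split boundary.
                    lines.add(line);
                }
            }
            return lines;
        }
    }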
On Wed, Oct 7, 2015 at 3:24 PM, Fabian Hueske <fhue...@gmail.com> wrote:

I'm sorry, there is no such documentation. You need to look at the code :-(

2015-10-07 15:19 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

And what is the split policy for the FileInputFormat? Does it depend on the fs block size? Is there a pointer to the several Flink input formats and a description of their internals?

On Wed, Oct 7, 2015 at 3:09 PM, Fabian Hueske <fhue...@gmail.com> wrote:

Hi Flavio,

it is not possible to split by line count because that would mean reading and parsing the file just for splitting.

Parallel processing of data sources depends on the input splits created by the InputFormat. Local files can be split just like files in HDFS. Usually, each file corresponds to at least one split, but multiple files could also be put into a single split if necessary. The logic for that would go into the InputFormat.createInputSplits() method (see the sketch at the end of this thread).

Cheers, Fabian

2015-10-07 14:47 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

Hi to all,

is there a way to split a single local file by line count (e.g. a split every 100 lines) in a LocalEnvironment to speed up a simple map function? For me it is not very clear how local files (the files in a directory, if recursive=true) are managed by Flink. Is there any reference to these internals?

Best,
Flavio
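Fabian's pointer above, that custom split policy goes into createInputSplits(), could look roughly like the following. A sketch under assumptions: WholeFileTextInputFormat is hypothetical, it simply emits one split per whole file instead of splitting by block size, and actually packing several files into one split would additionally require a custom InputSplit type.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flink.api.java.io.TextInputFormat;
    import org.apache.flink.core.fs.FileInputSplit;
    import org.apache.flink.core.fs.FileStatus;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.Path;

    // Hypothetical input format overriding the split-creation hook:
    // one split per file, regardless of block size.
    public class WholeFileTextInputFormat extends TextInputFormat {

        public WholeFileTextInputFormat(Path path) {
            super(path);
        }

        @Override
        public FileInputSplit[] createInputSplits(int minNumSplits) throws IOException {
            Path path = getFilePath();
            FileSystem fs = path.getFileSystem();
            FileStatus root = fs.getFileStatus(path);

            // Collect the input files (one directory level deep, for brevity).
            List<FileStatus> files = new ArrayList<>();
            if (root.isDir()) {
                for (FileStatus f : fs.listStatus(path)) {
                    if (!f.isDir()) {
                        files.add(f);
                    }
                }
            } else {
                files.add(root);
            }

            // Emit exactly one split spanning each whole file.
            List<FileInputSplit> splits = new ArrayList<>();
            int num = 0;
            for (FileStatus f : files) {
                splits.add(new FileInputSplit(num++, f.getPath(), 0, f.getLen(), new String[0]));
            }
            return splits.toArray(new FileInputSplit[0]);
        }
    }

Such a format would be used like any other, e.g. env.readFile(new WholeFileTextInputFormat(new Path(dir)), dir). Note that FileInputFormat's unsplittable flag gives a similar one-split-per-file effect; the override is shown only to mark where the policy lives.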