It was long ago, but if I remember correctly there were about 50k files.
On 18 Nov 2015 19:22, "Stephan Ewen" wrote:
Okay, let me take a step back and make sure I understand this right...
With many small files it takes longer to start the job, correct? How much
time did it actually take and how many files did you have?
On Wed, Nov 18, 2015 at 7:18 PM, Flavio Pompermaier wrote:
in my test I was using the local fs (ext4)
On 18 Nov 2015 19:17, "Stephan Ewen" wrote:
The JobManager does not read all files, but it has to query HDFS for
all file metadata (size, blocks, block locations), which can take a bit.
There is a separate call to the HDFS Namenode for each file. The more
files, the more metadata has to be collected.
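To make that concrete, here is a minimal sketch of the per-file metadata
lookups (illustrative code against the Hadoop FileSystem API, not Flink's
actual split-creation logic; the input path is made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SplitMetadataSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // One listing call, then one block-location call per file:
            // with ~50k files that is ~50k NameNode round trips before
            // a single input split exists.
            FileStatus[] files = fs.listStatus(new Path("hdfs:///input")); // made-up path
            for (FileStatus file : files) {
                BlockLocation[] blocks =
                        fs.getFileBlockLocations(file, 0, file.getLen());
                // ... create one input split per block (omitted) ...
            }
        }
    }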
On Wed, Nov 18, 2015 at 7:15 PM, Flavio Pompermaier wrote:
So why does it take so long to start the job? Is it because the JobManager
has to read all the lines of the input files before generating the splits?
On 18 Nov 2015 17:52, "Stephan Ewen" wrote:
Late answer, sorry:
The splits are created in the JobManager, so the job submission should not
be affected by that.
The assignment of splits to workers is very fast, so many splits with small
data are not very different from a few splits with large data.
Lines are never materialized and the operators receive them one record at a
time.
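A rough sketch of what "never materialized" means, loosely following the
open/reachedEnd/nextRecord/close contract of Flink's InputFormat (the driver
loop below is illustrative, not the actual task code):

    import java.io.IOException;

    import org.apache.flink.api.common.io.InputFormat;
    import org.apache.flink.core.fs.FileInputSplit;
    import org.apache.flink.util.Collector;

    public class SplitDriverSketch {

        static void driveSplit(InputFormat<String, FileInputSplit> format,
                               FileInputSplit split,
                               Collector<String> out) throws IOException {
            format.open(split);
            try {
                String reuse = null;
                while (!format.reachedEnd()) {
                    String record = format.nextRecord(reuse);
                    if (record != null) {
                        out.collect(record); // handed downstream immediately,
                                             // never buffered as a whole file
                    }
                    reuse = record;
                }
            } finally {
                format.close();
            }
        }
    }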
I've tried to split my huge file by line count (using the bash command
split -l) in 2 different ways:
1. small line count (huge number of small files)
2. big line count (small number of big files)
I can't understand why the time required to effectively start the job is
more or less the same.
The split functionality is in the FileInputFormat and the functionality
that takes care of lines across splits is in the DelimitedInputFormat.
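The usual trick for lines that cross split boundaries looks roughly like
this (an illustrative sketch of the general technique, not
DelimitedInputFormat's actual code): every split except the first skips the
partial line at its start, and the last line of a split may be read past the
split's end, so each line is read exactly once.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    public class LineBoundarySketch {

        // Reads the lines that "belong" to the byte range [start, end).
        // Simplified: RandomAccessFile.readLine is byte-based and slow,
        // but it keeps the boundary rules easy to see.
        static List<String> readSplit(String path, long start, long end)
                throws IOException {
            List<String> lines = new ArrayList<>();
            try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
                if (start == 0) {
                    in.seek(0);
                } else {
                    // Seek one byte back and discard a line: this skips the
                    // partial line at the boundary, but keeps a line that
                    // begins exactly at 'start' (its '\n' is at start - 1).
                    in.seek(start - 1);
                    in.readLine();
                }
                // Read whole lines as long as the line *starts* before
                // 'end'; the last line may run past 'end' by design.
                String line;
                while (in.getFilePointer() < end
                        && (line = in.readLine()) != null) {
                    lines.add(line);
                }
            }
            return lines;
        }
    }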
On Wed, Oct 7, 2015 at 3:24 PM, Fabian Hueske wrote:
I'm sorry there is no such documentation.
You need to look at the code :-(
2015-10-07 15:19 GMT+02:00 Flavio Pompermaier:
And what is the split policy for the FileInputFormat? Does it depend on the
fs block size?
Is there a pointer to the several Flink input formats and a description of
their internals?
On Wed, Oct 7, 2015 at 3:09 PM, Fabian Hueske wrote:
Hi Flavio,
it is not possible to split by line count because that would mean reading
and parsing the file just to compute the splits.
Parallel processing of data sources depends on the input splits created by
the InputFormat. Local files can be split just like files in HDFS. Usually,
each file corresponds to at least one input split.
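For example, computing byte-range splits for a single local file could look
roughly like this (illustrative names, not Flink's code; the boundary rule
from the sketch earlier in the thread then applies per range):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    public class ByteRangeSplitSketch {

        static final class Range {
            final long start; // inclusive
            final long end;   // exclusive
            Range(long start, long end) { this.start = start; this.end = end; }
        }

        // One byte range per desired parallel task, independent of where
        // the line breaks fall.
        static List<Range> createSplits(File file, int parallelism) {
            long len = file.length();
            long chunk = Math.max(1, (len + parallelism - 1) / parallelism);
            List<Range> splits = new ArrayList<>();
            for (long start = 0; start < len; start += chunk) {
                splits.add(new Range(start, Math.min(start + chunk, len)));
            }
            return splits;
        }
    }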
Hi to all,
is there a way to split a single local file by line count (e.g. a split
every 100 lines) in a LocalEnvironment to speed up a simple map function?
For me it is not very clear how local files (the files inside a directory
if recursive=true) are managed by Flink. Is there any reference to these
internals?
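For reference, the setup being asked about looks roughly like this in the
DataSet API (a hedged sketch; the path and parallelism are made up). As the
replies above explain, no per-line pre-splitting is needed: readTextFile's
FileInputFormat splits the single local file into byte ranges.

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class LocalSplitExample {
        public static void main(String[] args) throws Exception {
            // Local environment with 4 parallel slots (illustrative value).
            ExecutionEnvironment env =
                    ExecutionEnvironment.createLocalEnvironment(4);

            env.readTextFile("file:///tmp/huge-input.txt") // made-up path
               .map(new MapFunction<String, Integer>() {
                   @Override
                   public Integer map(String line) {
                       return line.length(); // trivial example map
                   }
               })
               .print();
        }
    }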