Yes, make both block sizes the same and you're good.
I think you can neglect the overhead, unless we are talking about
1000s of small files (smaller than the block size).
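Something like this should do it (just a minimal sketch assuming the parquet-hadoop writer path; "dfs.blocksize" and "parquet.block.size" are the standard Hadoop/Parquet property keys, while the 128 MB fallback and the class name are only placeholders):

import org.apache.hadoop.conf.Configuration;

public class AlignParquetBlockSize {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Cluster HDFS block size; falls back to 128 MB if the loaded configs don't set it.
        long hdfsBlockSize = conf.getLongBytes("dfs.blocksize", 128L * 1024 * 1024);
        // Use the same value for the Parquet row group ("block") size.
        conf.setLong("parquet.block.size", hdfsBlockSize);
        System.out.println("parquet.block.size = " + hdfsBlockSize + " bytes");
    }
}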
2016-01-29 12:06 GMT+01:00 Flavio Pompermaier :
So there's no need to worry about the number or size of the Parquet files from the
Flink point of view if I set the Parquet block size correctly (equal to the
HDFS block size)...
It only affects the Parquet file overhead (header and footer present in
each file) and the HDFS resources required to handle them.
The number of input splits does not depend on the number of files but on
the number of HDFS blocks of all files.
Reading a single file with 100 HDFS blocks and reading 100 files with 1
block each should both be divided into 100 input splits, which can be read by 100
tasks concurrently (or fewer tasks, each reading multiple splits).
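Just as a back-of-the-envelope illustration (expectedSplits below is a made-up helper, not a Flink API), both layouts end up with the same split count:

import java.util.Arrays;

public class SplitCount {
    // Splits for a set of files when input splits follow HDFS block boundaries.
    static long expectedSplits(long[] fileSizes, long blockSize) {
        long splits = 0;
        for (long size : fileSizes) {
            splits += (size + blockSize - 1) / blockSize;  // ceil(size / blockSize)
        }
        return splits;
    }

    public static void main(String[] args) {
        long block = 128L * 1024 * 1024;        // example 128 MB HDFS block size
        long[] oneBigFile = {100 * block};      // 1 file spanning 100 blocks
        long[] manySmallFiles = new long[100];  // 100 files of 1 block each
        Arrays.fill(manySmallFiles, block);
        System.out.println(expectedSplits(oneBigFile, block));      // prints 100
        System.out.println(expectedSplits(manySmallFiles, block));  // prints 100
    }
}

So the scheduler sees 100 units of work either way.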
Hi Fabian,
thanks for the response!
From my understanding (correct me if I'm wrong), once I produce some
Parquet dir that I want to read later, the number of files in the dir
affects the initial parallelism of the next job, i.e.:
- If I have fewer files than available tasks, I will not fully exploit the available parallelism.
Hi Flavio,
using a default FileOutputFormat, Flink writes one output file for each
data sink task, i.e., as many files as the defined parallelism.
The size of these files depends on the total output size and on how the data
is distributed across the sink tasks. If you write to HDFS, a file consists of
one or more HDFS blocks.
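For example, with the DataSet API something like the sketch below writes 4 files into the output directory (writeAsText and the HDFS path are only used to keep the example short; the same applies to a Hadoop/Parquet output format):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class SinkParallelismExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> data = env.fromElements("a", "b", "c", "d");

        // With sink parallelism 4, "hdfs:///tmp/out" becomes a directory holding
        // 4 files (one per sink task); with parallelism 1 it is a single file.
        data.writeAsText("hdfs:///tmp/out").setParallelism(4);

        env.execute("sink parallelism example");
    }
}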
Hi to all,
I was reading about optimal Parquet file size and HDFS block size.
The ideal situation for Parquet is when its block size (and thus the
maximum size of each row group) is equal to the HDFS block size. The
default behaviour of Flink is that the output file's size depends on the
output parallelism of the job.