Hi,
We have many Spark jobs that create multiple small files. We would like to
improve read performance for our analysts, so I'm testing for the optimal
Parquet file size.
I've found that the optimal file size should be around 1GB, and not less
than 128MB, depending on the size of the data.
I took on
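(A minimal sketch of how one might compact existing small Parquet files toward
that size range; the paths, the 1GB target, and the spark-shell's built-in sc
and sqlContext are assumptions for illustration, not details from this thread:)
import org.apache.hadoop.fs.{FileSystem, Path}
val inputPath = "/data/events_small_files"   // hypothetical dir of small Parquet files
val outputPath = "/data/events_compacted"    // hypothetical output directory
val targetFileBytes = 1024L * 1024 * 1024    // aim for roughly 1GB per output file
// Estimate how many output files are needed from the total input size on disk.
val fs = FileSystem.get(sc.hadoopConfiguration)
val totalBytes = fs.getContentSummary(new Path(inputPath)).getLength
val numFiles = math.max(1, (totalBytes / targetFileBytes).toInt)
// Rewrite the data into roughly numFiles larger Parquet files.
sqlContext.read.parquet(inputPath)
  .repartition(numFiles)
  .write.parquet(outputPath)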
From: odeach...@gmail.com [odeach...@gmail.com] on behalf of Deng Ching-Mallete
[och...@apache.org]
Sent: Wednesday, October 07, 2015 9:14 PM
To: Younes Naguib
Cc: Cheng Lian; user@spark.apache.org
Subject: Re: Parquet file size
Hi,
In our case, we're using
the org.apache.hadoop.mapreduce.lib.input.FileInputFormat.SPLIT_MINSIZE to
increase the size of the RDD partitions when loading text files, so it
would generate larger parquet files. We just set
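(The archived message is cut off before the value being set; a minimal sketch
of the approach, using a placeholder 1GB minimum split size and the
spark-shell's built-in sc, could look like this:)
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
// Ask Hadoop for input splits of at least ~1GB, so each RDD partition read from
// the text files (and each Parquet file later written from it) is larger.
sc.hadoopConfiguration.setLong(FileInputFormat.SPLIT_MINSIZE, 1024L * 1024 * 1024)
val lines = sc.textFile("/data/large_input.tsv")   // hypothetical TSV path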
From: Cheng Lian [lian.cs@gmail.com]
Sent: Wednesday, October 07, 2015 7:01 PM
To: Younes Naguib; 'user@spark.apache.org'
Subject: Re: Parquet file size
The reason why so many small files are generated is probably that you are
inserting into a partitioned table with three partition columns.
If you want large Parquet files, you may try
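(The rest of the suggestion is cut off in the archive. One common way to get
fewer, larger files per partition, not necessarily what was about to be
proposed here, is to cluster the rows by the partition columns before the
insert, so each (year, month, day) combination is written by a single task; a
sketch, assuming a HiveContext as sqlContext:)
// DISTRIBUTE BY shuffles rows so all rows of one partition land in one task.
sqlContext.sql("""
  INSERT OVERWRITE TABLE tbl PARTITION (year, month, day)
  SELECT * FROM tbl_tsv
  DISTRIBUTE BY year, month, day
""")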
The original TSV files are 600GB and generated 40k files of 15-25MB.
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: October-07-15 3:18 PM
To: Younes Naguib; 'user@spark.apache.org'
Subject: Re: Parquet file size
Why do you want larger files? Doesn't the resulting Parquet file contain
all the data in the original TSV file?
Cheng
On 10/7/15 11:07 AM, Younes Naguib wrote:
Hi,
I'm reading a large TSV file and creating Parquet files using Spark SQL:
insert overwrite table tbl partition(year, month, day)
select * from tbl_tsv;
This works nicely, but generates small Parquet files (15MB each).
I want to generate larger files; any idea how to address this?
Thanks,
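(For completeness, a sketch of the DataFrame-writer analogue of clustering by
the partition columns, the DISTRIBUTE BY idea sketched further up the thread.
Repartitioning by columns needs Spark 1.6 or later, the output path is a
placeholder, and this writes to a directory rather than into the Hive table:)
import org.apache.spark.sql.functions.col
sqlContext.table("tbl_tsv")
  .repartition(col("year"), col("month"), col("day"))   // one task per (year, month, day)
  .write
  .mode("overwrite")
  .partitionBy("year", "month", "day")
  .parquet("/warehouse/tbl_parquet")                     // hypothetical output path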