Spark Parquet file size

2020-11-10 Thread Tzahi File
Hi, We have many Spark jobs that create multiple small files. We would like to improve analysts' reading performance, so I'm testing for the optimal Parquet file size. I've found that the optimal file size should be around 1 GB, and not less than 128 MB, depending on the size of the data. I took on ...
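
A minimal sketch (not from the thread) of one common way to land in that 128 MB to 1 GB range: repartition the data down to fewer partitions before the write, so each output Parquet file covers more data. The paths and the target file count below are assumptions for illustration.

    // Sketch: compact many small Parquet files into fewer, larger ones.
    // Paths and the target file count are assumptions, not from the thread.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parquet-compaction").getOrCreate()

    val df = spark.read.parquet("/data/events_small_files")   // hypothetical input

    // Choose numOutputFiles so that (total input size / numOutputFiles)
    // falls roughly in the 128 MB - 1 GB range discussed above.
    val numOutputFiles = 64

    df.repartition(numOutputFiles)
      .write
      .mode("overwrite")
      .parquet("/data/events_compacted")                      // hypothetical output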

Re: Parquet file size

2015-10-08 Thread Cheng Lian
Hi, In our case, we're using the org.apache.hadoop.mapreduce.lib.input.FileInputFormat.SPLIT_MINSIZE to increase the size of the RDD partitions when loading text files, so it would generate larger parquet files. We just set ...
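
A minimal sketch of the SPLIT_MINSIZE approach described above, assuming a current SparkSession-style job rather than the Spark 1.x API in use at the time of the thread; the input path and the 256 MB value are assumptions.

    // Raise the minimum input split size so text files are read into fewer,
    // larger partitions, which in turn yields fewer, larger Parquet files on write.
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("split-minsize").getOrCreate()

    // 256 MB minimum split size (value is an assumption; tune to your data).
    spark.sparkContext.hadoopConfiguration
      .setLong(FileInputFormat.SPLIT_MINSIZE, 256L * 1024 * 1024)

    val lines = spark.sparkContext.textFile("/data/raw_tsv")  // hypothetical path
    // ... parse the lines, build a DataFrame, then write it out as Parquet ...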

RE: Parquet file size

2015-10-07 Thread Younes Naguib
From: odeach...@gmail.com [odeach...@gmail.com] on behalf of Deng Ching-Mallete [och...@apache.org] Sent: Wednesday, October 07, 2015 9:14 PM To: Younes Naguib Cc: Cheng Lian; user@spark.apache.org Subject: Re: Parquet file size. Hi, In our case, we ...

Re: Parquet file size

2015-10-07 Thread Deng Ching-Mallete
Hi, In our case, we're using the org.apache.hadoop.mapreduce.lib.input.FileInputFormat.SPLIT_MINSIZE to increase the size of the RDD partitions when loading text files, so it would generate larger parquet files. We just set ...

RE: Parquet file size

2015-10-07 Thread Younes Naguib
The reason why so many small files are generated should probably be the fact that you are inserting into a partitioned table with three partition columns. If you want larger Parquet files, you may try ...
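
The quoted suggestion is cut off in the snippet; one hedged guess at the usual continuation is to redistribute the rows by the partition columns before the insert, so each (year, month, day) partition is written by a single task and ends up as one large Parquet file. The sketch below reuses the table and column names from the original post, uses the current SparkSession API, and the DISTRIBUTE BY clause is an assumption rather than quoted text.

    // Sketch: cluster rows by the partition columns before a dynamic
    // partition insert, so each output partition is written by one task.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("partitioned-insert")
      .enableHiveSupport()
      .getOrCreate()

    // Fully dynamic partition inserts into a Hive table typically need this.
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // Assumption: year, month, day are the trailing columns of tbl_tsv.
    spark.sql("""
      INSERT OVERWRITE TABLE tbl PARTITION (year, month, day)
      SELECT * FROM tbl_tsv
      DISTRIBUTE BY year, month, day
    """)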

Re: Parquet file size

2015-10-07 Thread Cheng Lian
Why do you want larger files? Doesn't the resulting Parquet file contain all the data in the original TSV file? Cheng. On 10/7/15 11:07 AM, Younes Naguib wrote: Hi, I'm reading a large tsv file, and ...

RE: Parquet file size

2015-10-07 Thread Younes Naguib
The original TSV files are 600 GB and generated 40k files of 15-25 MB. From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: October-07-15 3:18 PM To: Younes Naguib; 'user@spark.apache.org' Subject: Re: Parquet file size. Why do you want larger files? Doesn't the resulting Parquet file ...

Re: Parquet file size

2015-10-07 Thread Cheng Lian
Why do you want larger files? Doesn't the resulting Parquet file contain all the data in the original TSV file? Cheng. On 10/7/15 11:07 AM, Younes Naguib wrote: Hi, I'm reading a large tsv file, and creating parquet files using sparksql: insert overwrite table tbl partition(year, month, day) ...

Parquet file size

2015-10-07 Thread Younes Naguib
Hi, I'm reading a large tsv file and creating parquet files using sparksql: insert overwrite table tbl partition(year, month, day) select ... from tbl_tsv; This works nicely, but it generates small parquet files (about 15 MB each). I want to generate larger files; any idea how to address this? Thanks,
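
A hedged sketch, not from the thread, of a DataFrame-based alternative that gives more direct control over how many files each partition receives: read the TSV directly, repartition by the partition columns, and write Parquet with partitionBy. The path, read options, and column names are assumptions for illustration.

    // Sketch: TSV -> partitioned Parquet, clustering rows by (year, month, day)
    // so each partition directory gets a few large files instead of many tiny ones.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("tsv-to-parquet").getOrCreate()

    val tsv = spark.read
      .option("sep", "\t")          // tab-separated input
      .option("header", "true")     // assumption: first line holds column names
      .csv("/data/large_file.tsv")  // hypothetical input path

    tsv.repartition(tsv.col("year"), tsv.col("month"), tsv.col("day"))
      .write
      .mode("overwrite")
      .partitionBy("year", "month", "day")
      .parquet("/warehouse/tbl")    // hypothetical output location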