From: odeach...@gmail.com [odeach...@gmail.com] on behalf of Deng Ching-Mallete [och...@apache.org]
Sent: Wednesday, October 07, 2015 9:14 PM
To: Younes Naguib
Cc: Cheng Lian; user@spark.apache.org
Subject: Re: Parquet file size
Hi,
In our case, we're using
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.SPLIT_MINSIZE to
increase the size of the RDD partitions when loading text files, so it
would generate larger parquet files. We just set it in the Hadoop
configuration of the Spark context.
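In Spark's Scala API, that looks roughly like the sketch below (run in
spark-shell, where sc is the SparkContext; the 256 MB minimum and the input
path are placeholders, not from the original message):

    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    // SPLIT_MINSIZE is the key "mapreduce.input.fileinputformat.split.minsize".
    // Raising it makes each input split (and hence each RDD partition, and
    // hence each Parquet output file) cover more data. 256 MB is a placeholder.
    sc.hadoopConfiguration.setLong(FileInputFormat.SPLIT_MINSIZE, 256L * 1024 * 1024)

    // Text files loaded after this point should pick up the larger splits.
    val lines = sc.textFile("hdfs:///data/input.tsv")  // placeholder path
    println(lines.partitions.length)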
> Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 |
> younes.naguib@tritondigital.com
> --
> *From:* Cheng Lian [lian.cs@gmail.com]
> *Sent:* Wednesday, October 07, 2015 7:01 PM
> *To:* Younes Naguib; 'user@spark.apache.org'
> *Subject:* Re: Parquet file size
The reason why so many small files are generated is probably the fact
that you are inserting into a partitioned table with three partition columns.
If you want large Parquet files, you may try ...
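The suggestion is cut off above; one common approach in this situation (not
necessarily the one that was about to be suggested) is to repartition the
DataFrame before writing, so each table partition receives fewer, larger
files. A minimal sketch, assuming the Spark 1.5-era DataFrame API; the table
name, partition count, and output path are all placeholders:

    // "raw_tsv", the count of 8, and the path are hypothetical values;
    // repartition-before-write is just one common fix.
    val df = sqlContext.table("raw_tsv")

    df.repartition(8)                        // fewer partitions -> fewer, larger files
      .write
      .partitionBy("year", "month", "day")
      .mode("overwrite")
      .parquet("hdfs:///warehouse/tbl")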
The original TSV file is 600GB and generated 40k files of 15-25MB.
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: October-07-15 3:18 PM
To: Younes Naguib; 'user@spark.apache.org'
Subject: Re: Parquet file size
Why do you want larger files? Doesn't the resulting Parquet file contain
all the data in the original TSV file?
Cheng
On 10/7/15 11:07 AM, Younes Naguib wrote:
Hi,
I’m reading a large tsv file, and creating parquet files using sparksql:
insert overwrite table tbl partition(year, month, day)..
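For reference, a fleshed-out version of that statement might look like the
sketch below, run through a HiveContext; the source table and select list are
hypothetical, since the original message elides them:

    // Dynamic-partition inserts into a Hive table need nonstrict mode when
    // no static partition values are given.
    sqlContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // "raw_tsv" and the column list are placeholders, not from the thread;
    // the partition columns must come last in the SELECT.
    sqlContext.sql("""
      INSERT OVERWRITE TABLE tbl PARTITION (year, month, day)
      SELECT col1, col2, year, month, day
      FROM raw_tsv
    """)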