Are you using any compression? Maybe some is enabled by default in your 
Hadoop environment?
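
For example (a rough sketch, not tested here -- I'm assuming a Spark 1.5-style
sqlContext and using "df" as a placeholder for your cached DataFrame), you
could write the same data once with compression disabled and once with snappy,
to see whether the codec is what's keeping the CPUs busy. If I remember right,
the Parquet default in that version is gzip, which is quite expensive on CPU:

    # Parquet codec: "uncompressed", "snappy", "gzip" or "lzo"
    sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
    df.write.parquet("hdfs:///tmp/test_uncompressed")   # hypothetical path

    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
    df.write.parquet("hdfs:///tmp/test_snappy")          # hypothetical path

If either of those is dramatically faster than your current write, the default
codec is the likely culprit.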

> On 06 Nov 2015, at 00:34, rok <rokros...@gmail.com> wrote:
> 
> Apologies if this appears a second time! 
> 
> I'm writing a ~100 GB PySpark DataFrame with a few hundred partitions into a
> Parquet file on HDFS. I've got a few hundred nodes in the cluster, so for a
> file of this size the cluster is heavily over-provisioned (I've tried it with
> fewer partitions and fewer nodes, with no obvious effect). I was expecting the
> dump to disk to be very fast -- the DataFrame is cached in memory and contains
> just 14 columns (13 are floats and one is a string). When I write it out in
> JSON format, it is indeed reasonably fast (though it still takes a few
> minutes, which is longer than I would expect). 
> 
> However, when I try to write a Parquet file it takes much longer -- the first
> set of tasks finishes in a few minutes, but the subsequent tasks take more
> than twice as long. In the end it takes over half an hour to write the file.
> I've looked at the disk I/O and CPU usage on the compute nodes, and it looks
> like the processors are fully loaded while the disk I/O is essentially zero
> for long periods of time. I don't see any obvious garbage collection issues
> and there are no problems with memory. 
> 
> Any ideas on how to debug/fix this? 
> 
> Thanks!
> 
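
One more quick check (again just a sketch -- the paths, the "df" name and the
1% sample fraction are all hypothetical): time the JSON and Parquet writes on
a small sample of the DataFrame, so you can iterate quickly while trying
different codecs:

    import time

    sample = df.sample(False, 0.01)   # ~1% of the data
    sample.cache().count()            # materialize the sample before timing

    for fmt, path in [("json", "hdfs:///tmp/sample_json"),
                      ("parquet", "hdfs:///tmp/sample_parquet")]:
        start = time.time()
        sample.write.format(fmt).mode("overwrite").save(path)
        print("%s: %.1f s" % (fmt, time.time() - start))

If Parquet is still several times slower than JSON on the sample, that would
point at the encoding/compression work on the executors rather than at HDFS.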

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
