Yes, I was expecting that too because of all the metadata generation and compression. But I haven't seen performance this bad for other Parquet files I've written, so I was wondering if there could be something obvious (and wrong) about how I've specified the schema, etc. It's a very simple schema: a StructType with a few float StructFields and a string. I'm using all the Spark defaults for I/O compression.
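Roughly, the schema and the write look like this -- the field names, the stand-in data, and the output path below are placeholders, not the real ones, but the structure is the same:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import StructType, StructField, FloatType, StringType

    sc = SparkContext(appName="parquet-write-test")
    sqlContext = SQLContext(sc)

    # Placeholder field names -- the real ones differ, but it's 13 floats plus one string.
    fields = [StructField("f%d" % i, FloatType()) for i in range(13)]
    fields.append(StructField("label", StringType()))
    schema = StructType(fields)

    # Tiny stand-in for the real ~100 GB cached DataFrame.
    rows = [tuple([float(i)] * 13 + ["row%d" % i]) for i in range(1000)]
    df = sqlContext.createDataFrame(sc.parallelize(rows), schema).cache()

    # Nothing overridden -- default spark.sql.parquet.compression.codec.
    df.write.parquet("hdfs:///tmp/parquet_write_test")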
I'll see what I can do about running a profiler -- can you point me to a resource/example? (A rough sketch of what I was thinking of trying is at the bottom of this message.)

Thanks,

Rok

PS: my post is still listed as not accepted by the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/very-slow-parquet-file-write-td25295.html -- none of your responses are there either. I am definitely subscribed to the list, though (I get the daily digests). Any clue how to fix it?

On Nov 6, 2015, at 9:26 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

I'd expect writing Parquet files to be slower than writing JSON files, since Parquet involves more complicated encoders, but maybe not that slow. Would you mind trying to profile one Spark executor using a tool like YJP to see where the hotspot is?

Cheng

On 11/6/15 7:34 AM, rok wrote:

Apologies if this appears a second time!

I'm writing a ~100 GB PySpark DataFrame with a few hundred partitions into a Parquet file on HDFS. I've got a few hundred nodes in the cluster, so for a file of this size that is well over-provisioned (I've tried it with fewer partitions and fewer nodes, with no obvious effect). I was expecting the dump to disk to be very fast -- the DataFrame is cached in memory and contains just 14 columns (13 floats and one string).

When I write it out in JSON format, it is indeed reasonably fast (though it still takes a few minutes, which is longer than I would expect). However, when I try to write a Parquet file it takes far longer -- the first set of tasks finishes in a few minutes, but the subsequent tasks take more than twice as long or longer. In the end it takes over half an hour to write the file.

I've looked at the disk I/O and CPU usage on the compute nodes, and it looks like the processors are fully loaded while the disk I/O is essentially zero for long periods of time. I don't see any obvious garbage-collection issues, and there are no problems with memory.

Any ideas on how to debug/fix this? Thanks!
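Re the profiling question at the top: here is roughly what I was planning to try -- attaching the YourKit agent to the executor JVMs via spark.executor.extraJavaOptions. This is only a sketch; the agent library path is a placeholder for wherever YJP is actually installed on the worker nodes.

    from pyspark import SparkConf, SparkContext

    # The .so path below is a placeholder -- it depends on where YourKit
    # is installed on the workers. YJP startup options (e.g. sampling mode,
    # listen port) can be appended after the library path if needed.
    conf = (SparkConf()
            .setAppName("parquet-write-profiling")
            .set("spark.executor.extraJavaOptions",
                 "-agentpath:/opt/yourkit/bin/linux-x86-64/libyjpagent.so"))

    sc = SparkContext(conf=conf)
    # ... build the DataFrame and call df.write.parquet(...) as before,
    # then attach the YourKit UI to one of the executor JVMs to see the hotspots.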