Hi Swapnil,

We are facing the same issue. Could you please let me know how you found out that the partitions are being merged?
Thanks in advance!

From: Swapnil Shinde [mailto:swapnilushi...@gmail.com]
Sent: Thursday, March 09, 2017 1:31 AM
To: cht liu <liucht...@gmail.com>
Cc: user@spark.apache.org
Subject: Re: Huge partitioning job takes longer to close after all tasks finished

Thank you, liu. Can you please explain what you mean by enabling the Spark fault-tolerance mechanism?

I observed that after all tasks finish, Spark concatenates the same partitions from all tasks on the file system. For example:

task1 - partition1, partition2, partition3
task2 - partition1, partition2, partition3

After task1 and task2 finish, Spark concatenates partition1 from task1 and task2 to create the final partition1. This takes longer when there is a large number of files. I am not sure whether there is a way to tell Spark not to concatenate partitions from each task.

Thanks
Swapnil

On Tue, Mar 7, 2017 at 10:47 PM, cht liu <liucht...@gmail.com> wrote:

Did you enable the Spark fault-tolerance mechanism (RDD checkpointing)? If so, at the end of the job Spark starts a separate job to write the checkpoint data to the file system, so that it is persisted for high availability.

2017-03-08 2:45 GMT+08:00 Swapnil Shinde <swapnilushi...@gmail.com>:

Hello all,

I have a Spark job that reads Parquet data and partitions it based on one of the columns. I made sure the partitions are equally distributed and not skewed. My code looks like this:

    datasetA.write.partitionBy("column1").parquet(outputPath)

Execution plan: [Inline image 1]

All tasks (~12,000) finish in 30-35 minutes, but it takes another 40-45 minutes to close the application. I am not sure what Spark is doing after all tasks have been processed successfully. I checked thread dumps (using the Executors tab in the UI) on a few executors but couldn't find anything major. Overall, a few shuffle-client processes are "RUNNABLE" and a few dispatched-* processes are "WAITING".

Please let me know what Spark is doing at this stage (after all tasks have finished) and whether there is any way I can optimize it.
Thanks
Swapnil
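
For readers following the thread, the behaviour Swapnil describes (each task writes its own file for every partition value, and those files are gathered into the final partition directories only after all tasks finish) can be illustrated with a toy model. This is a minimal sketch using only the Python standard library. It is not Spark's or Hadoop's actual commit code, and the function names, the `_temporary` staging layout and the file names are all assumptions made for illustration. It only shows why the number of file moves at the end grows with tasks times partitions.

```python
import os
import tempfile


def write_task_output(staging_root, task_id, partitions):
    """Simulate one task writing one file per partition value into its own staging dir."""
    for part in partitions:
        part_dir = os.path.join(staging_root, f"task_{task_id}", f"column1={part}")
        os.makedirs(part_dir)
        with open(os.path.join(part_dir, f"part-{task_id:05d}.parquet"), "w") as f:
            f.write(f"rows for {part} from task {task_id}\n")


def commit_job(staging_root, output_root):
    """Move every staged file into its final partition directory, one rename at a time.

    Returns the number of renames performed (hypothetical stand-in for the
    work done after all tasks have finished).
    """
    renames = 0
    for task_dir in sorted(os.listdir(staging_root)):
        task_path = os.path.join(staging_root, task_dir)
        for part_dir in sorted(os.listdir(task_path)):
            final_dir = os.path.join(output_root, part_dir)
            os.makedirs(final_dir, exist_ok=True)
            src_dir = os.path.join(task_path, part_dir)
            for name in sorted(os.listdir(src_dir)):
                os.rename(os.path.join(src_dir, name), os.path.join(final_dir, name))
                renames += 1
    return renames


def simulate(num_tasks, partitions):
    """Run the toy write and commit; return (renames, files per final partition dir)."""
    with tempfile.TemporaryDirectory() as root:
        staging = os.path.join(root, "_temporary")
        output = os.path.join(root, "output")
        os.makedirs(staging)
        os.makedirs(output)
        for task_id in range(num_tasks):
            write_task_output(staging, task_id, partitions)
        renames = commit_job(staging, output)
        files_per_partition = {
            d: len(os.listdir(os.path.join(output, d)))
            for d in sorted(os.listdir(output))
        }
    return renames, files_per_partition


if __name__ == "__main__":
    # Two tasks and three partition values, as in the example in the thread.
    print(simulate(2, ["partition1", "partition2", "partition3"]))
```

In this toy model the serial commit step performs one move per (task, partition) pair, so with roughly 12,000 tasks even a modest number of distinct `column1` values would mean a very large number of file-system operations after the last task completes. Whether that accounts for the 40-45 minutes observed here would need to be confirmed against the driver logs.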