RE: Huge partitioning job takes longer to close after all tasks finished

PSwain Thu, 09 Mar 2017 04:11:08 -0800

Hi Swapnil,

  We are facing the same issue. Could you please let me know how you found out 
that the partitions are getting merged?

Thanks in advance!

From: Swapnil Shinde [mailto:swapnilushi...@gmail.com]
Sent: Thursday, March 09, 2017 1:31 AM
To: cht liu <liucht...@gmail.com>
Cc: user@spark.apache.org
Subject: Re: Huge partitioning job takes longer to close after all tasks 
finished

Thank you, Liu. Can you please explain what you mean by enabling Spark's fault 
tolerance mechanism?
I observed that after all tasks finish, Spark works on concatenating the same 
partitions from all tasks on the file system, e.g.:
task1 - partition1, partition2, partition3
task2 - partition1, partition2, partition3

Then, after task1 and task2 finish, Spark concatenates partition1 from task1 
and task2 to create partition1. This takes longer when there is a large number 
of files. I am not sure whether there is a way to tell Spark not to concatenate 
partitions from each task.

Thanks
Swapnil
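
The merge step described above can be pictured with a toy model. This is a 
minimal sketch, assuming the post-task work is a rename-based job commit of the 
kind Hadoop's FileOutputCommitter performs with algorithm version 1, where each 
task's files are staged under a temporary directory and moved into the final 
partition directories once all tasks are done. The directory layout, file 
names, and function names below are hypothetical, the files are moved rather 
than literally concatenated, and none of this is Spark's actual code.

```python
import os
import shutil
import tempfile


def write_task_output(staging_dir, task_id, partition_values):
    """Each task writes one file per partition value under its own staging directory."""
    for value in partition_values:
        part_dir = os.path.join(staging_dir, f"task_{task_id}", f"column1={value}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, f"part-{task_id:05d}.parquet"), "w") as f:
            f.write("data")


def commit_job(staging_dir, output_dir):
    """Move every staged file into its final partition directory, one at a time.

    Returns the number of file moves performed.
    """
    moves = 0
    for task_dir in sorted(os.listdir(staging_dir)):
        task_path = os.path.join(staging_dir, task_dir)
        for part_dir in sorted(os.listdir(task_path)):
            src_dir = os.path.join(task_path, part_dir)
            final_dir = os.path.join(output_dir, part_dir)
            os.makedirs(final_dir, exist_ok=True)
            for name in sorted(os.listdir(src_dir)):
                shutil.move(os.path.join(src_dir, name), os.path.join(final_dir, name))
                moves += 1
    shutil.rmtree(staging_dir)  # drop the now-empty staging tree
    return moves


def simulate(num_tasks, partition_values):
    """Run all tasks, commit the job, and report moves and files per partition."""
    root = tempfile.mkdtemp()
    output_dir = os.path.join(root, "output")
    staging_dir = os.path.join(output_dir, "_temporary")
    os.makedirs(staging_dir)
    for task_id in range(num_tasks):
        write_task_output(staging_dir, task_id, partition_values)
    moves = commit_job(staging_dir, output_dir)
    files_per_partition = {
        d: len(os.listdir(os.path.join(output_dir, d)))
        for d in sorted(os.listdir(output_dir))
    }
    shutil.rmtree(root)
    return moves, files_per_partition
```

Calling simulate(2, ["partition1", "partition2", "partition3"]) performs 6 
moves and leaves 2 files in each partition directory. The number of moves grows 
as tasks times partition values, and they happen serially in this model, which 
may explain a long tail after many tasks on a file system where renames are 
slow. One way to check this on a real job, assuming that committer is in use, 
is to list the output path while the application is closing and see whether 
files are still leaving the _temporary directory.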

On Tue, Mar 7, 2017 at 10:47 PM, cht liu <liucht...@gmail.com> wrote:

Have you enabled Spark's fault tolerance mechanism (RDD checkpointing)? If so, 
at the end of the job Spark starts a separate job to write the checkpoint data 
to the file system, where it is persisted for high availability.

2017-03-08 2:45 GMT+08:00 Swapnil Shinde <swapnilushi...@gmail.com>:
Hello all
   I have a Spark job that reads Parquet data and partitions it based on one of 
the columns. I made sure the partitions are equally distributed and not skewed. 
My code looks like this:

datasetA.write.partitionBy("column1").parquet(outputPath)

Execution plan -
[Inline image 1]

All tasks (~12,000) finish in 30-35 minutes, but it takes another 40-45 minutes 
to close the application. I am not sure what Spark is doing after all tasks 
have been processed successfully.
I checked thread dumps (using the Executors tab in the UI) on a few executors 
but couldn't find anything major. Overall, a few shuffle-client threads are 
"RUNNABLE" and a few dispatched-* threads are "WAITING".

Please let me know what Spark is doing at this stage (after all tasks have 
finished) and whether there is any way I can optimize it.

Thanks
Swapnil
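
Two adjustments are commonly tried for this symptom, sketched here in PySpark 
under stated assumptions. The first assumes the job commits through Hadoop's 
FileOutputCommitter and sets mapreduce.fileoutputcommitter.algorithm.version 
to 2, so that files are moved at task commit instead of in one pass at the end 
of the job. The second repartitions by the partition column before writing so 
that fewer files are produced. The helper names configure and 
write_partitioned are hypothetical; builder stands for SparkSession.builder 
and dataset for datasetA.

```python
# Assumption: Hadoop's FileOutputCommitter is the committer in use. Version "2"
# selects the algorithm in which each task moves its files to the destination
# when the task commits, leaving little work for the final job commit.
COMMITTER_CONF = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
}


def configure(builder, conf=COMMITTER_CONF):
    """Apply each setting to a SparkSession.builder-like object and return it."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


def write_partitioned(dataset, output_path, column="column1"):
    """Repartition by `column`, then write Parquet partitioned by that column.

    After the shuffle, all rows with the same value of `column` sit in the same
    task, so each output directory typically receives one file instead of one
    file per task.
    """
    dataset.repartition(column).write.partitionBy(column).parquet(output_path)
```

Usage would look like spark = configure(SparkSession.builder).getOrCreate() 
followed by write_partitioned(datasetA, outputPath). Both are trade-offs rather 
than guaranteed fixes: version 2 can leave partial output visible if the job 
fails midway, and repartitioning by the column limits write parallelism to the 
number of distinct values in it.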
