The problem with doing the work in the call-site thread is that a
number of data structures are updated during job submission, and these
data structures are guarded by the event loop, which ensures that only
one thread accesses them.  I don't think there is an easy fix for this
given the structure of the DAGScheduler.
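
To make the constraint concrete, here is a minimal sketch (toy code,
not the actual DAGScheduler) of the pattern: mutable state is touched
only from the single event-loop thread, so call sites may only enqueue.

  import java.util.concurrent.LinkedBlockingQueue
  import scala.collection.mutable

  sealed trait Event
  case class JobSubmitted(jobId: Int) extends Event
  case class TaskCompletion(jobId: Int) extends Event

  class ToyEventLoop {
    // Guarded by convention: mutated only from loopThread below.
    private val activeJobs = mutable.Set[Int]()
    private val queue = new LinkedBlockingQueue[Event]()

    private val loopThread = new Thread(() => {
      while (true) queue.take() match {
        case JobSubmitted(id)   => activeJobs += id // safe: one thread
        case TaskCompletion(id) => activeJobs -= id // safe: one thread
      }
    })
    loopThread.setDaemon(true)
    loopThread.start()

    // Call sites only enqueue; touching activeJobs here would race.
    def post(e: Event): Unit = queue.put(e)
  }

Doing job-submission work on the call-site thread would mean mutating
state like activeJobs from outside loopThread, hence the difficulty.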

Thanks
Shivaram

On Tue, Mar 6, 2018 at 8:53 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> I agree with Reynold. We don't need to use a separate pool, which would have
> the problem you raised about FIFO. We just need to do the planning outside
> of the scheduler loop. The call site thread sounds like a reasonable place
> to me.
>
> On Mon, Mar 5, 2018 at 12:56 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>> Rather than using a separate thread pool, perhaps we can just move the
>> prep code to the call site thread?
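>>
>> Purely as a sketch with made-up names (not an actual patch), something
>> like: run the expensive preparation on the caller's thread, then post
>> only a lightweight event to the loop.
>>
>>   // Illustrative only: names are hypothetical.
>>   case class PreparedStages(jobId: Int /* + result stage, deps, ... */)
>>
>>   def submitJob(jobId: Int, postToLoop: PreparedStages => Unit): Unit = {
>>     val prepared = PreparedStages(jobId) // createResultStage-style work here
>>     postToLoop(prepared)                 // loop does only cheap bookkeeping
>>   }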
>>
>>
>> On Sun, Mar 4, 2018 at 11:15 PM, Ajith shetty <ajith.she...@huawei.com>
>> wrote:
>>>
>>> DAGScheduler becomes a bottleneck in a cluster when multiple JobSubmitted
>>> events have to be processed, because DAGSchedulerEventProcessLoop is single
>>> threaded and blocks other events in the queue, such as TaskCompletion.
>>>
>>> Handling a JobSubmitted event can be time consuming depending on the nature
>>> of the job (for example, calculating parent stage dependencies, shuffle
>>> dependencies, and partitions), and while it runs it blocks all other events
>>> from being processed.
>>>
>>>
>>>
>>> I see multiple JIRAs referring to this behavior:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-2647
>>>
>>> https://issues.apache.org/jira/browse/SPARK-4961
>>>
>>>
>>>
>>> Similarly, in my cluster the partition calculation for some jobs is time
>>> consuming (similar to the stack trace in SPARK-2647), which slows down the
>>> DAGSchedulerEventProcessLoop and in turn slows down user jobs, even when
>>> their tasks finish within seconds, because TaskCompletion events are
>>> processed at a slower rate due to the blockage.
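>>>
>>> (A toy demonstration of the effect, not Spark code: one slow JobSubmitted
>>> handler delays a TaskCompletion event queued behind it on the same
>>> single-threaded loop.)
>>>
>>>   import java.util.concurrent.LinkedBlockingQueue
>>>
>>>   object SingleLoopDemo extends App {
>>>     case class Event(name: String, enqueuedAt: Long)
>>>     val queue = new LinkedBlockingQueue[Event]()
>>>     queue.put(Event("JobSubmitted", System.nanoTime()))   // slow to handle
>>>     queue.put(Event("TaskCompletion", System.nanoTime())) // fast, but queued behind
>>>
>>>     while (!queue.isEmpty) {
>>>       val e = queue.take()
>>>       if (e.name == "JobSubmitted") Thread.sleep(2000) // simulate slow prep
>>>       val waitedMs = (System.nanoTime() - e.enqueuedAt) / 1e6
>>>       println(f"${e.name} done ${waitedMs}%.0f ms after enqueue")
>>>     }
>>>   }
>>>
>>> Here TaskCompletion finishes roughly 2000 ms after enqueue even though its
>>> own handling is instant.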
>>>
>>>
>>>
>>> I think we can split the JobSubmitted event into two events:
>>>
>>> Step 1. JobSubmittedPreparation - Runs in a separate thread on job
>>> submission; this step runs
>>> org.apache.spark.scheduler.DAGScheduler#createResultStage
>>>
>>> Step 2. JobSubmittedExecution - If Step 1 succeeds, fire an event to
>>> DAGSchedulerEventProcessLoop and let it process the output of
>>> org.apache.spark.scheduler.DAGScheduler#createResultStage
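>>>
>>> Roughly, the flow could look like this (a sketch only; the event and
>>> helper names are hypothetical, not an actual patch):
>>>
>>>   import java.util.concurrent.Executors
>>>   import scala.concurrent.{ExecutionContext, Future}
>>>
>>>   // Hypothetical types standing in for the createResultStage output.
>>>   case class PreparedJob(jobId: Int /* + result stage, etc. */)
>>>   case class JobSubmittedExecution(prepared: PreparedJob)
>>>
>>>   class Submitter(postToLoop: JobSubmittedExecution => Unit) {
>>>     // Step 1 runs here, so the event loop is never blocked by prep work.
>>>     private implicit val prepPool: ExecutionContext =
>>>       ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))
>>>
>>>     def submitJob(jobId: Int): Unit =
>>>       Future {
>>>         PreparedJob(jobId) // parent stages, shuffle deps, partitions...
>>>       }.foreach { prepared =>
>>>         // Step 2: only the prepared output goes to the single-threaded loop.
>>>         postToLoop(JobSubmittedExecution(prepared))
>>>       }
>>>   }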
>>>
>>>
>>>
>>> One effect of doing this is that job submissions may no longer be FIFO,
>>> depending on how much time Step 1 above takes.
>>>
>>>
>>>
>>> Does the above solution suffice for the problem described? And are there any
>>> other side effects of this solution?
>>>
>>>
>>>
>>> Regards
>>>
>>> Ajith
>>
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
