Re: Parallel Execution of Spark Jobs

Ankit Jain Thu, 26 Jul 2018 21:36:54 -0700

Thanks for further clarification Jeff.


> On Jul 26, 2018, at 8:11 PM, Jeff Zhang <zjf...@gmail.com> wrote:
> 
> Let me rephrase it. In scoped mode, there are multiple Interpreter Groups 
> (personally I prefer to call them sessions) in one JVM (for the Spark 
> interpreter, that means multiple SparkInterpreter instances). 
> There is one SparkContext in this JVM, which is shared by all the 
> SparkInterpreter instances. Regarding the scheduler, there are multiple 
> schedulers in this JVM in scoped mode: each SparkInterpreter instance owns 
> its own scheduler. Let me know if you have any other questions.
> 
> 
> 
> On Wed, Jul 25, 2018 at 10:27 PM, Ankit Jain <ankitjain....@gmail.com> wrote:
>> Jeff, what you said seems to be in conflict with what is detailed here - 
>> https://medium.com/@leemoonsoo/apache-zeppelin-interpreter-mode-explained-bae0525d0555
>> 
>> "In Scoped mode, Zeppelin still runs single interpreter JVM process but 
>> multiple Interpreter Group serve each Note."
>> 
>> In practice as well we see one Interpreter process for scoped mode.
>> 
>> Can you please clarify?
>> 
>> Adding Moon too.
>> 
>> Thanks
>> Ankit
>> 
>>> On Tue, Jul 24, 2018 at 11:09 PM, Ankit Jain <ankitjain....@gmail.com> 
>>> wrote:
>>> Aah, that makes sense - so only jobs from the same user will block each 
>>> other in a FIFOScheduler.
>>> 
>>> By moving to ParallelScheduler, the only gain is that jobs from the same 
>>> user can also run in parallel, but they may hit dependency resolution issues.
>>> 
>>> Just to confirm I have it right - if "Run all" on a notebook is not a 
>>> requirement and users run one paragraph at a time from different notebooks, 
>>> ParallelScheduler should be OK?
>>> 
>>> Thanks
>>> Ankit
>>> 
>>>> On Tue, Jul 24, 2018 at 10:38 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>> 
>>>> 1. ZEPPELIN-3563 forces FAIR scheduling and only allows you to specify the pool.
>>>> 2. The scheduler cannot figure out the dependencies between paragraphs. 
>>>> That's why SparkInterpreter uses FIFOScheduler. 
>>>> If you use per-user scoped mode, SparkContext is shared between users but 
>>>> SparkInterpreter is not. That means there are multiple 
>>>> SparkInterpreter instances that share the same SparkContext, but they 
>>>> don't share the same FIFOScheduler: each SparkInterpreter uses its own 
>>>> FIFOScheduler. 
>>>> 
>>>> On Wed, Jul 25, 2018 at 12:58 PM, Ankit Jain <ankitjain....@gmail.com> wrote:
>>>>> Thanks for the quick feedback Jeff.
>>>>> 
>>>>> Re:1 - I did see Zeppelin-3563 but we are not on .8 yet and also we may 
>>>>> want to force FAIR execution instead of letting user control it.
>>>>> 
>>>>> Re:2 - Is there an architecture issue here, or do we just need better 
>>>>> thread safety? Ideally the scheduler should be able to figure out the 
>>>>> dependencies and run whatever can run in parallel.
>>>>> 
>>>>> Re:Interpreter mode, I may not have been clear but we are running per 
>>>>> user scoped mode - so Spark context is shared among all users. 
>>>>> 
>>>>> Doesn't that mean all jobs from different users go to one FIFOScheduler, 
>>>>> forcing all small jobs to block on a big one? That is specifically what 
>>>>> we are trying to avoid.
>>>>> 
>>>>> Thanks
>>>>> Ankit
>>>>> 
>>>>>> On Tue, Jul 24, 2018 at 5:40 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>> Regarding 1.  ZEPPELIN-3563 should be helpful. See 
>>>>>> https://github.com/apache/zeppelin/blob/master/docs/interpreter/spark.md#running-spark-sql-concurrently
>>>>>> for more details. 
>>>>>> https://issues.apache.org/jira/browse/ZEPPELIN-3563
>>>>>> 
>>>>>> Regarding 2. If you use ParallelScheduler for SparkInterpreter, you may 
>>>>>> hit weird issues if your paragraphs depend on each other, 
>>>>>> e.g. paragraph p1 uses variable v1, which is defined in paragraph p2. 
>>>>>> Then the order of paragraph execution matters, and 
>>>>>> ParallelScheduler cannot guarantee the order of execution.
>>>>>> That's why we use FIFOScheduler for SparkInterpreter. 
>>>>>> 
>>>>>> In your scenario, where multiple users share the same SparkContext, I 
>>>>>> would suggest using per-user scoped mode. Then all users share 
>>>>>> the same SparkContext, which saves resources, and each user 
>>>>>> gets their own FIFOScheduler, isolated from the others. 
>>>>>> 
>>>>>> On Wed, Jul 25, 2018 at 8:14 AM, Ankit Jain <ankitjain....@gmail.com> wrote:
>>>>>>> Forgot to mention this is for shared scoped mode, so same Spark 
>>>>>>> application and context for all users on a single Zeppelin instance.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Ankit
>>>>>>> 
>>>>>>>> On Jul 24, 2018, at 4:12 PM, Ankit Jain <ankitjain....@gmail.com> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> I am playing around with the execution policy of Spark jobs (and all 
>>>>>>>> Zeppelin paragraphs, actually).
>>>>>>>> 
>>>>>>>> Looks like there are a couple of control points:
>>>>>>>> 1. Spark scheduling - FIFO vs FAIR, as documented in 
>>>>>>>> https://spark.apache.org/docs/2.1.1/job-scheduling.html#fair-scheduler-pools.
>>>>>>>> 
>>>>>>>> Since we are still on .7 version and don't have 
>>>>>>>> https://issues.apache.org/jira/browse/ZEPPELIN-3563, I am forcefully 
>>>>>>>> doing sc.setLocalProperty("spark.scheduler.pool", "fair");
>>>>>>>> in both SparkInterpreter.java and SparkSqlInterpreter.java.
>>>>>>>> 
>>>>>>>> Also, because we are exposing Zeppelin to multiple users, we may not 
>>>>>>>> want any user to hog the cluster, so we would always use FAIR.
>>>>>>>> 
>>>>>>>> This may complicate our merge to .8 though.
>>>>>>>> 
>>>>>>>> 2. On top of Spark scheduling, each Zeppelin Interpreter itself seems 
>>>>>>>> to have a scheduler queue. Each task is submitted to a FIFOScheduler, 
>>>>>>>> except in SparkSqlInterpreter, which creates a ParallelScheduler if the 
>>>>>>>> concurrentsql flag is turned on.
>>>>>>>> 
>>>>>>>> I am changing SparkInterpreter.java to use ParallelScheduler too and 
>>>>>>>> that seems to do the trick.
>>>>>>>> 
>>>>>>>> Now multiple notebooks are able to run in parallel.
>>>>>>>> 
>>>>>>>> My question is whether other people have tested SparkInterpreter with 
>>>>>>>> ParallelScheduler. Also, ideally this should be configurable: users 
>>>>>>>> should be able to specify fifo or parallel.
>>>>>>>> 
>>>>>>>> Executing all paragraphs does add more complication and maybe
>>>>>>>> https://issues.apache.org/jira/browse/ZEPPELIN-2368 will help us keep 
>>>>>>>> the execution order sane.
>>>>>>>> 
>>>>>>>> Thoughts?
>>>>>>>> 
>>>>>>>> -- 
>>>>>>>> Thanks & Regards,
>>>>>>>> Ankit.
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Thanks & Regards,
>>>>> Ankit.
>>> 
>>> 
>>> 
>>> -- 
>>> Thanks & Regards,
>>> Ankit.
>> 
>> 
>> 
>> -- 
>> Thanks & Regards,
>> Ankit.
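
The scoped-mode setup Jeff describes in this thread (one shared SparkContext, one FIFO scheduler per SparkInterpreter instance) can be sketched roughly as below. This is a toy model in plain Java, not Zeppelin code: ScopedModeSketch, SharedContext, submit, shutdownAndCollect and forSession are made-up names, and a single-threaded executor merely stands in for a per-session FIFOScheduler. It only illustrates that paragraphs of one session keep their order while different sessions do not queue behind each other.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Toy model of per-user scoped mode; hypothetical names, not Zeppelin's classes. */
final class ScopedModeSketch {

    /** Stand-in for the single SparkContext shared by every session in the JVM. */
    static final class SharedContext {
        private final List<String> log = Collections.synchronizedList(new ArrayList<>());

        void run(String session, String paragraph) {
            log.add(session + ":" + paragraph);
        }

        List<String> log() {
            synchronized (log) {
                return new ArrayList<>(log);
            }
        }
    }

    private final SharedContext context = new SharedContext();

    // One single-threaded executor per session plays the role of that
    // session's FIFO scheduler (an assumption made for illustration).
    private final Map<String, ExecutorService> schedulers = new ConcurrentHashMap<>();

    /** Queue a paragraph on the scheduler that belongs to the given session. */
    public void submit(String session, String paragraph) {
        schedulers
            .computeIfAbsent(session, s -> Executors.newSingleThreadExecutor())
            .submit(() -> context.run(session, paragraph));
    }

    /** Wait for all queued paragraphs and return the shared execution log. */
    public List<String> shutdownAndCollect() throws InterruptedException {
        for (ExecutorService scheduler : schedulers.values()) {
            scheduler.shutdown();
            scheduler.awaitTermination(10, TimeUnit.SECONDS);
        }
        return context.log();
    }

    /** Log entries of one session, in the order they actually ran. */
    public static List<String> forSession(List<String> log, String session) {
        List<String> out = new ArrayList<>();
        for (String entry : log) {
            if (entry.startsWith(session + ":")) {
                out.add(entry);
            }
        }
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        ScopedModeSketch sketch = new ScopedModeSketch();
        sketch.submit("alice", "p1");
        sketch.submit("alice", "p2");
        sketch.submit("bob", "p1");
        sketch.submit("alice", "p3");
        sketch.submit("bob", "p2");

        List<String> log = sketch.shutdownAndCollect();
        // Order within a session is preserved; interleaving across sessions may vary.
        System.out.println("alice: " + forSession(log, "alice"));
        System.out.println("bob:   " + forSession(log, "bob"));
    }
}
```

Replacing each per-session executor with one shared thread pool would correspond to the ParallelScheduler case discussed in the thread: paragraphs from the same session could then run in any order, which is where the p1/p2 dependency problem comes from.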
