Thanks for the quick feedback, Jeff.

Re: 1 - I did see ZEPPELIN-3563, but we are not on 0.8 yet, and we may also
want to force FAIR execution instead of letting users control it.
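For context, forcing FAIR mode also needs a pool definition on the Spark side. A minimal fairscheduler.xml sketch along the lines of the Spark fair-scheduler docs (the pool name "fair" matches the property we set below, but the name itself is our own choice):

```xml
<!-- fairscheduler.xml: sketch of a FAIR pool; name "fair" is our assumption -->
<allocations>
  <pool name="fair">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```

Jobs then opt into the pool via sc.setLocalProperty("spark.scheduler.pool", "fair").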

Re: 2 - Is there an architectural issue here, or do we just need better
thread safety? Ideally the scheduler should be able to figure out the
dependencies and run whatever can run in parallel.
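To sketch what a dependency-aware scheduler could look like: independent paragraphs run concurrently, and a paragraph that reads variables defined by others waits for them. A minimal plain-Java illustration (paragraph names p1/p2/p3 and the two-thread pool are made up for the example, not Zeppelin internals):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DependencyAwareScheduler {
    // Completion order of the "paragraphs"; synchronized so parallel tasks can append safely
    static List<String> done = Collections.synchronizedList(new ArrayList<>());

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // p1 and p2 are independent paragraphs: safe to run in parallel
        CompletableFuture<Void> p1 = CompletableFuture.runAsync(() -> done.add("p1"), pool);
        CompletableFuture<Void> p2 = CompletableFuture.runAsync(() -> done.add("p2"), pool);
        // p3 reads variables defined by p1 and p2, so it must wait for both
        CompletableFuture<Void> p3 = CompletableFuture.allOf(p1, p2)
                .thenRunAsync(() -> done.add("p3"), pool);
        p3.join();
        pool.shutdown();
        System.out.println(done); // p1/p2 order may vary, but p3 is always last
    }
}
```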

Re: interpreter mode - I may not have been clear, but we are running
per-user scoped mode, so the Spark context is shared among all users.

Doesn't that mean all jobs from different users go to one FIFOScheduler,
forcing all small jobs to block on a big one? That is specifically what we
are trying to avoid.
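To illustrate the concern: a single FIFOScheduler behaves like a single-threaded executor, so a quick job submitted after a long one has to wait for it. A minimal plain-Java sketch (the job names and timings are invented for the example):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FifoBlockingDemo {
    // Completion order of the jobs; synchronized for safe reuse with a thread pool
    static List<String> completionOrder = Collections.synchronizedList(new ArrayList<>());

    static Runnable job(String name, long millis) {
        return () -> {
            try {
                Thread.sleep(millis); // simulate the job's running time
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            completionOrder.add(name);
        };
    }

    public static void main(String[] args) throws InterruptedException {
        // One FIFOScheduler ~ a single-threaded executor: jobs run strictly in submit order
        ExecutorService fifo = Executors.newSingleThreadExecutor();
        fifo.submit(job("big-job", 200));  // long-running paragraph submitted first
        fifo.submit(job("small-job", 10)); // quick paragraph stuck behind it
        fifo.shutdown();
        fifo.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(completionOrder); // [big-job, small-job]
    }
}
```

With a multi-threaded pool instead, small-job would finish first; that is the behavior we want across users.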

Thanks
Ankit

On Tue, Jul 24, 2018 at 5:40 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> Regarding 1. ZEPPELIN-3563 should be helpful. See
> https://github.com/apache/zeppelin/blob/master/docs/interpreter/spark.md#running-spark-sql-concurrently
> for more details.
> https://issues.apache.org/jira/browse/ZEPPELIN-3563
>
> Regarding 2. If you use ParallelScheduler for SparkInterpreter, you may
> hit weird issues if your paragraphs depend on each other, e.g.
> paragraph p1 uses variable v1 which is defined in paragraph p2. Then the
> order of paragraph execution matters, and ParallelScheduler cannot
> guarantee the order of execution.
> That's why we use FIFOScheduler for SparkInterpreter.
>
> In your scenario where multiple users share the same SparkContext, I would
> suggest you use scoped per-user mode. Then all users will share the same
> SparkContext, which means you can save resources, and each user gets their
> own FIFOScheduler, which is isolated from the others.
>
> Ankit Jain <ankitjain....@gmail.com> wrote on Wed, Jul 25, 2018 at 8:14 AM:
>
>> Forgot to mention this is for shared scoped mode, so same Spark
>> application and context for all users on a single Zeppelin instance.
>>
>> Thanks
>> Ankit
>>
>> On Jul 24, 2018, at 4:12 PM, Ankit Jain <ankitjain....@gmail.com> wrote:
>>
>> Hi,
>> I am playing around with execution policy of Spark jobs(and all Zeppelin
>> paragraphs actually).
>>
>> Looks like there are couple of control points-
>> 1) Spark scheduling - FIFO vs FAIR, as documented in
>> https://spark.apache.org/docs/2.1.1/job-scheduling.html#fair-scheduler-pools.
>>
>> Since we are still on the 0.7 version and don't have
>> https://issues.apache.org/jira/browse/ZEPPELIN-3563, I am forcefully doing
>> sc.setLocalProperty("spark.scheduler.pool", "fair");
>> in both SparkInterpreter.java and SparkSqlInterpreter.java.
>>
>> Also, because we are exposing Zeppelin to multiple users, we may not
>> actually want users to hog the cluster, so we always use FAIR.
>>
>> This may complicate our merge to .8 though.
>>
>> 2) On top of Spark scheduling, each Zeppelin interpreter itself has a
>> scheduler queue. Each task is submitted to a FIFOScheduler, except for
>> SparkSqlInterpreter, which creates a ParallelScheduler if the concurrentSQL
>> flag is turned on.
>>
>> I am changing SparkInterpreter.java to use ParallelScheduler too, and
>> that seems to do the trick.
>>
>> Now multiple notebooks are able to run in parallel.
>>
>> My question is: have other people tested SparkInterpreter with
>> ParallelScheduler?
>> Also, ideally this should be configurable - the user should be able to
>> specify FIFO or parallel.
>>
>> Executing all paragraphs at once does add more complication, and maybe
>> https://issues.apache.org/jira/browse/ZEPPELIN-2368 will help us keep
>> the execution order sane.
>>
>>
>> Thoughts?
>>
>> --
>> Thanks & Regards,
>> Ankit.
>>
>>


-- 
Thanks & Regards,
Ankit.
