Re: why zeppelin SparkInterpreter use FIFOScheduler

Pranav Kumar Agarwal Sun, 16 Aug 2015 23:55:06 -0700

Hi Moon,

Yes, the notebook id comes from InterpreterContext. At the moment, destroying SparkIMain on deletion of a notebook is not handled. I think SparkIMain is a lightweight object; do you see a concern with keeping these objects in a map? One possible option could be to destroy notebook-related objects when the inactivity on a notebook is greater than, say, 8 hours.
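
For illustration, the inactivity-based cleanup suggested above could look roughly like the following. This is only a sketch, not Zeppelin code: the class NoteScopedRegistry, its methods, and the explicit nowMillis argument (passed in so the logic can be exercised without waiting) are all hypothetical, and the onEvict callback stands in for whatever teardown a SparkIMain would actually need.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical registry: one object per notebook id, dropped after a period of inactivity. */
class NoteScopedRegistry<T> {
    private static final class Slot<V> {
        final V value;
        volatile long lastUsedMillis;
        Slot(V value, long now) { this.value = value; this.lastUsedMillis = now; }
    }

    private final Map<String, Slot<T>> slots = new ConcurrentHashMap<>();
    private final long maxIdleMillis;

    NoteScopedRegistry(long maxIdleMillis) { this.maxIdleMillis = maxIdleMillis; }

    /** Returns the object for this notebook, creating it on first use, and records the access time. */
    T get(String noteId, long nowMillis, Function<String, T> factory) {
        Slot<T> slot = slots.computeIfAbsent(noteId, id -> new Slot<>(factory.apply(id), nowMillis));
        slot.lastUsedMillis = nowMillis;
        return slot.value;
    }

    /** Removes every entry idle for longer than maxIdleMillis; returns how many were removed. */
    int evictIdle(long nowMillis, Consumer<T> onEvict) {
        int removed = 0;
        for (Map.Entry<String, Slot<T>> e : slots.entrySet()) {
            Slot<T> slot = e.getValue();
            if (nowMillis - slot.lastUsedMillis > maxIdleMillis && slots.remove(e.getKey(), slot)) {
                onEvict.accept(slot.value); // teardown hook for the evicted object
                removed++;
            }
        }
        return removed;
    }

    int size() { return slots.size(); }
}
```

A periodic task would call evictIdle with the current time. A real version would also have to guard against a paragraph picking up an object just as it is being evicted, which this sketch does not attempt.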

>> 4. Build a queue inside interpreter to allow only one paragraph execution
>> at a time per notebook.
> One downside of this approach is, GUI will display RUNNING instead of PENDING for jobs inside of queue in interpreter.

Yes, that's a good point. Having a scheduler at the Zeppelin server that is parallel across notebooks and FIFO across paragraphs would be nice. Is there any plan for such a scheduler?
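
To make the idea concrete, here is a minimal sketch of a scheduler that is parallel across notebooks and FIFO within one. It is not an existing Zeppelin class: PerNoteFifoScheduler and its methods are hypothetical names, and it simply keeps one single-threaded executor per notebook id.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical scheduler: paragraphs of one note run in submission order, notes run in parallel. */
class PerNoteFifoScheduler {
    // One single-threaded executor per note id; its internal queue provides the FIFO order.
    private final Map<String, ExecutorService> lanes = new ConcurrentHashMap<>();

    /** Queues a paragraph behind earlier paragraphs of the same note only. */
    Future<?> submit(String noteId, Runnable paragraph) {
        return lanes
            .computeIfAbsent(noteId, id -> Executors.newSingleThreadExecutor())
            .submit(paragraph);
    }

    /** Stops accepting new paragraphs; already queued ones still run. */
    void shutdown() {
        for (ExecutorService lane : lanes.values()) {
            lane.shutdown();
        }
    }
}
```

Because the waiting happens in the scheduler rather than inside the interpreter, a queued paragraph has not been handed to the interpreter yet, which is presumably what would let the GUI keep showing it as PENDING.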


Regards,
-Pranav.

On 17/08/15 5:38 am, moon soo Lee wrote:

Pranav, proposal looks awesome!

I have a question and feedback,

You said you tested 1, 2 and 3. To create a SparkIMain per notebook, you need the notebook id. Did you get it from InterpreterContext? Then how did you handle destroying the SparkIMain (when a notebook is deleted)? As far as I know, the interpreter is not able to get information about notebook deletion.

>> 4. Build a queue inside interpreter to allow only one paragraph execution
>> at a time per notebook.

One downside of this approach is, GUI will display RUNNING instead of PENDING for jobs inside of queue in interpreter.


Best,
moon

On Sun, Aug 16, 2015 at 12:55 AM IT CTO <goi....@gmail.com> wrote:


    +1 for "to re-factor the Zeppelin architecture so that it can
    handle multi-tenancy easily"

    On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <doanduy...@gmail.com> wrote:

        Agree with Joel, we may think about refactoring the Zeppelin
        architecture so that it can handle multi-tenancy easily. The
        technical solution proposed by Pranav is great but it only
        applies to Spark. Right now, each interpreter has to manage
        multi-tenancy its own way. Ultimately Zeppelin can propose a
        multi-tenancy contract/info (like UserContext, similar to
        InterpreterContext) so that each interpreter can choose whether
        to use it or not.


        On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <djo...@gmail.com> wrote:

            I think the idea of running multiple notes
            simultaneously is great, but it is really dancing around the
            lack of true multi-user support in Zeppelin. While the
            proposed solution would work if the application's resources
            are those of the whole cluster, if the app is limited (say
            it has 8 cores of 16, with some distribution in memory)
            then potentially your note can hog all the resources and
            the scheduler will have to throttle all other executions,
            leaving you exactly where you are now.
            While I think the solution is a good one, maybe this
            question should make us think about adding true multi-user
            support, where we isolate resources (cluster and the notebooks
            themselves), have separate login/identity and (I don't
            know if it's possible) share the same context.

            Thanks,
            Joel

            > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mindpri...@gmail.com> wrote:
            >
            > If the problem is that multiple users have to wait for each other while
            > using Zeppelin, the solution already exists: they can create a new
            > interpreter by going to the interpreter page and attach it to their
            > notebook - then they don't have to wait for others to submit their job.
            >
            > But I agree, having paragraphs from one note wait for paragraphs from other
            > notes is a confusing default. We can get around that in two ways:
            >
            >   1. Create a new interpreter for each note and attach that interpreter to
            >   that note. This approach would require the least amount of code changes but
            >   is resource heavy and doesn't let you share Spark Context between different
            >   notes.
            >   2. If we want to share the Spark Context between different notes, we can
            >   submit jobs from different notes into different fair scheduler pools
            >   (https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
            >   This can be done by submitting jobs from different notes in different
            >   threads. This will make sure that jobs from one note are run sequentially
            >   but jobs from different notes will be able to run in parallel.
            >
            > Neither of these options requires any change in the Spark code.
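
As a rough illustration of option 2: Spark selects the fair scheduler pool per submitting thread through the local property "spark.scheduler.pool" (set with setLocalProperty on the SparkContext), and fair scheduling itself has to be enabled with spark.scheduler.mode=FAIR. The sketch below only shows the wrapping step. LocalProps is a stand-in interface so the snippet compiles without Spark on the classpath, and FairPools, inPool and the "note-" pool naming are hypothetical.

```java
import java.util.concurrent.Callable;

/** Stand-in for the single SparkContext method used here, so the sketch needs no Spark dependency. */
interface LocalProps {
    void setLocalProperty(String key, String value);
}

/** Hypothetical helper that tags a job with a per-note fair scheduler pool. */
class FairPools {
    /**
     * Wraps a job so that, on whichever thread eventually runs it, the pool
     * property is set first. Spark jobs started by that thread afterwards are
     * then submitted to the note's own pool.
     */
    static <T> Callable<T> inPool(LocalProps sc, String noteId, Callable<T> job) {
        return () -> {
            sc.setLocalProperty("spark.scheduler.pool", "note-" + noteId); // pool name is an assumption
            return job.call();
        };
    }
}
```

With a real JavaSparkContext the first argument could be the method reference sc::setLocalProperty, and each note's wrapped jobs would be handed to that note's own thread.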
            >
            > --
            > Thanks & Regards
            > Rohit Agarwal
            > https://www.linkedin.com/in/rohitagarwal003
            >
            > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <praag...@gmail.com> wrote:
            >
            >>> If someone can share about the idea of sharing single SparkContext through
            >>> multiple SparkILoop safely, it'll be really helpful.
            >> Here is a proposal:
            >> 1. In Spark code, change SparkIMain.scala to allow setting the virtual
            >> directory. While creating new instances of SparkIMain per notebook from
            >> the Zeppelin Spark interpreter, set all the instances of SparkIMain to the same
            >> virtual directory.
            >> 2. Start an HTTP server on that virtual directory and set this HTTP server in
            >> the Spark Context using the classserverUri method.
            >> 3. Scala generated code has a notion of packages. The default package name
            >> is "line$<linenumber>". The package name can be controlled using the System
            >> Property scala.repl.name.line. Setting this property to "notebook id"
            >> ensures that code generated by individual instances of SparkIMain is
            >> isolated from other instances of SparkIMain.
            >> 4. Build a queue inside interpreter to allow only one paragraph execution
            >> at a time per notebook.
            >>
            >> I have tested 1, 2, and 3 and it seems to provide isolation across
            >> class names. I'll work towards submitting a formal patch soon. Is there any
            >> Jira already for the same that I can take up? Also I need to understand:
            >> 1. How does Zeppelin take up Spark fixes? Or do I need to first work
            >> towards getting the Spark changes merged in Apache Spark github?
            >>
            >> Any suggestions or comments on the proposal are highly welcome.
            >>
            >> Regards,
            >> -Pranav.
            >>
            >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
            >>>
            >>> Hi piyush,
            >>>
            >>> Separate instance of SparkILoop SparkIMain for each notebook while
            >>> sharing the SparkContext sounds great.
            >>>
            >>> Actually, I tried to do it, and found the problem that multiple SparkILoop could
            >>> generate the same class name, and the Spark executor confuses class names since
            >>> they're reading classes from a single SparkContext.
            >>>
            >>> If someone can share about the idea of sharing single SparkContext
            >>> through multiple SparkILoop safely, it'll be really helpful.
            >>>
            >>> Thanks,
            >>> moon
            >>>
            >>>
            >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform)
            >>> <piyush.muk...@flipkart.com> wrote:
            >>>
            >>>    Hi Moon,
            >>>    Any suggestion on it? We have to wait a lot when multiple people are working
            >>>    with Spark.
            >>>    Can we create separate instances of SparkILoop, SparkIMain and
            >>>    print streams for each notebook while sharing the SparkContext,
            >>>    ZeppelinContext, SQLContext and DependencyResolver, and then use the parallel
            >>>    scheduler?
            >>>    thanks
            >>>
            >>>    -piyush
            >>>
            >>>    Hi Moon,
            >>>
            >>>    How about tracking a dedicated SparkContext for a notebook in Spark's
            >>>    remote interpreter? This will allow multiple users to run their Spark
            >>>    paragraphs in parallel. Also, within a notebook only one paragraph is
            >>>    executed at a time.
            >>>
            >>>    Regards,
            >>>    -Pranav.
            >>>
            >>>
            >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
            >>>> Hi,
            >>>>
            >>>> Thanks for asking question.
            >>>>
            >>>> The reason is simply that it is running code statements. The
            >>>> statements can have order and dependency. Imagine I have two
            >>>> paragraphs
            >>>>
            >>>> %spark
            >>>> val a = 1
            >>>>
            >>>> %spark
            >>>> print(a)
            >>>>
            >>>> If they're not running one by one, they may run in
            >>>> random order and the output will not always be the same: either '1' or
            >>>> an error that val a cannot be found.
            >>>>
            >>>> This is the reason why. But if there is a nice idea to handle this
            >>>> problem, I agree that using a parallel scheduler would help a lot.
            >>>>
            >>>> Thanks,
            >>>> moon
            >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
            >>>> <linxizeng0...@gmail.com> wrote:
            >>>>
            >>>>    Does anyone have the same question as me? Or is this not a
            >>>>    question?
            >>>>
            >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0...@gmail.com>:
            >>>>
            >>>>        hi, Moon:
            >>>>           I notice that the getScheduler function in
            >>>>        SparkInterpreter.java returns a FIFOScheduler, which makes the
            >>>>        Spark interpreter run Spark jobs one by one. It's not a good
            >>>>        experience when a couple of users do some work on Zeppelin at
            >>>>        the same time, because they have to wait for each other.
            >>>>        At the same time, SparkSqlInterpreter can choose what
            >>>>        scheduler to use via "zeppelin.spark.concurrentSQL".
            >>>>        My question is, what considerations was this decision
            >>>>        based on?
            >>>
