Re: Implementing run all paragraphs sequentially

Jeff Zhang Fri, 06 Oct 2017 00:14:50 -0700

+1 for serial run by default. Let's leave the others for the future.

On Fri, Oct 6, 2017 at 7:48 AM, Mohit Jaggi <[email protected]> wrote:


> +1 for serial run by default.
>
> Sent from my iPhone
>
> On Oct 5, 2017, at 3:36 PM, moon soo Lee <[email protected]> wrote:
>
> I'd like us to also consider simplicity of use.
>
> We can have two different modes, or two different run buttons, for Serial
> and Parallel run. This gives the flexibility of choosing between two
> schedulers, but for users to understand the difference between the two run
> buttons, there must be really good UI treatment.
>
> I see there is high user demand for running notebooks sequentially, and I
> think there are 3 action items in this discussion thread.
>
> 1. Change the current run all button behavior from Parallel to Serial.
> 2. Provide both Parallel and Serial run buttons with really good UI
> treatment.
> 3. Provide a DAG.
>
> I think 1) does not stop 2) and 3) in the future. 2) also does not stop 3)
> in the future.
>
> So, why don't we try 1) first and keep discussing and polishing the ideas
> for 2) and 3)?
>
>
> Thanks,
> moon
>
> On Mon, Oct 2, 2017 at 10:22 AM Michael Segel <[email protected]>
> wrote:
>
>> Whoa!
>> Seems I walked into something.
>>
>> Herval,
>>
>> What do you suggest?  A simple switch that runs everything in serial, or
>> everything in parallel?
>> That would be a very bad idea.
>>
>> I gave you an example of a class of solutions where you don’t want that
>> behavior.
>> E.g., unit testing, where you have one setup and then run several unit tests
>> in parallel.
>>
>> If that’s not enough for you… how about if you want to test
>> producer/consumer problems?
>>
>> Or if you want to define classes in one paragraph but then call on them
>> in later paragraphs. If everything runs in parallel from the start of time
>> 0, you can’t do this.
>>
>>
>> So, if you want to do it right the first time… you need to establish a
>> way to control the dependency of paragraphs. This isn’t rocket science.
>> And frankly not that complex.
>>
>> BTW, this is the user list not the dev list…
>>
>> Just saying…  ;-)
>>
>>
>> On Oct 2, 2017, at 11:24 AM, Herval Freire <[email protected]> wrote:
>>
>>  "nice to have" isn't a very strong requirement. I strongly suggest you
>> really, really think about this before you start pounding an overengineered
>> solution to a non-issue :-)
>>
>> h
>>
>> On Mon, Oct 2, 2017 at 9:12 AM, Michael Segel <[email protected]>
>> wrote:
>>
>>> Yes…
>>>  You have bunch of unit tests you can run in parallel where you only
>>> need one constructor and one cleanup.
>>>
>>> I would strongly suggest that you really, really think about this long
>>> and hard before you start to pound code.
>>> It's going to be harder to back out and fix than if you take the time to
>>> think through the problem and not make a dumb mistake.
>>>
>>> On Oct 2, 2017, at 11:02 AM, Herval Freire <[email protected]> wrote:
>>>
>>> Did anyone request such a case ("running some in parallel and some in
>>> sequence")? I haven't seen any requests for this in the wild (nor on this
>>> thread), other than a theoretical "what if", which is totally fine when it
>>> doesn't introduce a lot of unnecessary complexity for little to no gain
>>> (which seems to be the case here).
>>>
>>> h
>>>
>>> On Mon, Oct 2, 2017 at 8:48 AM, Michael Segel <[email protected]
>>> > wrote:
>>>
>>>> Because that simplicity doesn’t work.
>>>>
>>>> You will want to run some things serial and some things in parallel.
>>>>
>>>> Which is why you will need a dependency graph.
>>>>
>>>> On Oct 2, 2017, at 10:40 AM, Herval Freire <[email protected]> wrote:
>>>>
>>>> Why do you need rules and graphs and any of that to support running
>>>> everything sequentially or everything in parallel?
>>>>
>>>> 3) add a “run mode” to the note. If it’s “sequential”, run the
>>>> paragraphs one at a time, in the order they’re defined. If parallel, run
>>>> using current scheme (as many at the same time as the threadpool permits)
>>>>
>>>> Simpler and covers all cases, imo
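
The note-level "run mode" described above could look roughly like the sketch below. It is only an illustration, not Zeppelin code: the names run_note, run_mode and pool_size are made up, and a paragraph is modelled as a plain Python callable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical model: a paragraph is a callable that returns its output.
Paragraph = Callable[[], str]


def run_note(paragraphs: List[Paragraph], run_mode: str, pool_size: int = 4) -> List[str]:
    """Run all paragraphs and return their outputs in definition order."""
    if run_mode == "sequential":
        # One at a time, in the order the paragraphs are defined.
        return [paragraph() for paragraph in paragraphs]
    if run_mode == "parallel":
        # Submit everything at once; the pool size caps how many run together.
        with ThreadPoolExecutor(max_workers=pool_size) as pool:
            futures = [pool.submit(paragraph) for paragraph in paragraphs]
            return [future.result() for future in futures]
    raise ValueError(f"unknown run mode: {run_mode!r}")
```

In "sequential" mode a later paragraph can rely on everything an earlier one did; in "parallel" mode it cannot, which is the %jdbc then %pyspark problem from the start of the thread.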
>>>>
>>>> ------------------------------
>>>> *From:* Polyakov Valeriy <[email protected]>
>>>> *Sent:* Monday, October 2, 2017 8:24:35 AM
>>>> *To:* [email protected]
>>>> *Subject:* RE: Implementing run all paragraphs sequentially
>>>>
>>>> Let me try to summarize the discussion. Evidently, the current behavior of
>>>> running notes does not meet actual requirements. The most important thing
>>>> we need is the ability to run sequentially. At the same time, we want to
>>>> keep the ability to run in parallel. We discussed that the most suitable
>>>> way to model paragraph dependencies is a DAG (directed acyclic graph).
>>>> These dependencies should therefore be defined in the note, and the
>>>> running order should not depend on how we launch it (button / scheduler /
>>>> API). Our objectives, then, are to implement a “dependency definition
>>>> engine” and to use it in the “run engine”.
>>>> What are the options?
>>>> 1)      Explicit dependency definition.
>>>> We could take as a rule that each paragraph waits for ALL previous
>>>> paragraphs to finish. Then we add a paragraph option “Wait for …” that
>>>> selects the paragraph to wait for before starting. When the option is
>>>> set, the paragraph starts immediately after the selected paragraph
>>>> finishes. This pattern allows us to implement a fully parallel DAG running
>>>> order. What are the disadvantages? They all come down to the same thing:
>>>> the dependency management process is not easy for users to understand
>>>> (and the functionality is probably redundant, in my personal view).
>>>> First, we would have to use the strange format of paragraph IDs, which
>>>> are hidden as well. We could come up with visible, good-looking paragraph
>>>> ID aliases, but then we would need to guard against duplicates. Second,
>>>> some scenarios require changing existing dependencies (e.g. to add a new
>>>> paragraph between one paragraph and a group that depends on it, you have
>>>> to change the “Wait for …” option for each paragraph in the group).
>>>> 2)      Implicit dependency definition.
>>>>
>>>> We could take as a rule that each paragraph waits for ALL previous
>>>> paragraphs to finish. Then we add a paragraph option “Run in parallel
>>>> with previous”, which lets us create groups of paragraphs that run in
>>>> parallel. The groups themselves then run sequentially, group by group,
>>>> with the paragraphs inside each group running in parallel. This approach
>>>> is much easier for users to understand, but its obvious drawback compared
>>>> with “Explicit definition” is that the dependency graph is less
>>>> expressive and the level of parallelism is lower.
>>>> I am not sure which option, (1) or (2), is the right one to implement at
>>>> the moment. I hope to hear from the product visionaries which way to
>>>> choose and to get approval to start the implementation.
>>>> Thank you!
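
Option (2) above, the implicit definition, can be sketched as follows. This is a rough illustration under assumptions, not an existing Zeppelin feature: the Paragraph class, the parallel_with_previous flag, and the build_groups / run_groups helpers are all hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Paragraph:
    name: str
    body: Callable[[], None]
    # Hypothetical stand-in for the "Run in parallel with previous" option.
    parallel_with_previous: bool = False


def build_groups(paragraphs: List[Paragraph]) -> List[List[Paragraph]]:
    """Split paragraphs, in note order, into groups that run one after another."""
    groups: List[List[Paragraph]] = []
    for paragraph in paragraphs:
        if paragraph.parallel_with_previous and groups:
            groups[-1].append(paragraph)  # joins the previous paragraph's group
        else:
            groups.append([paragraph])    # starts a new group
    return groups


def run_groups(groups: List[List[Paragraph]], pool_size: int = 4) -> None:
    """Run groups sequentially; paragraphs inside one group run in parallel."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for group in groups:
            futures = [pool.submit(p.body) for p in group]
            for future in futures:
                future.result()  # wait for the whole group; re-raises paragraph errors


if __name__ == "__main__":
    noop = lambda: None
    note = [
        Paragraph("setup", noop),
        Paragraph("test_a", noop),
        Paragraph("test_b", noop, parallel_with_previous=True),
        Paragraph("report", noop),
    ]
    print([[p.name for p in g] for g in build_groups(note)])
    # prints [['setup'], ['test_a', 'test_b'], ['report']]
```

Inserting a new paragraph between "setup" and the test group needs no changes to the other paragraphs, which is the usability advantage claimed for this option. The cost is that a paragraph cannot skip ahead of an unrelated group.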
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *Valeriy Polyakov *
>>>>
>>>>
>>>> *From:* Michael Segel [mailto:[email protected]
>>>> <[email protected]>]
>>>> *Sent:* Saturday, September 30, 2017 4:22 PM
>>>> *To:* [email protected]
>>>> *Subject:* Re: Implementing run all paragraphs sequentially
>>>>
>>>>
>>>> Sorry to jump in…
>>>>
>>>>
>>>> If you want to run paragraphs in parallel, you are going to want to
>>>> have some sort of dependency graph.  Think of a common set up where you
>>>> need to set up common functions and imports. (setup of %spark.dep)
>>>>
>>>>
>>>> A good example is if your notebook is a bunch of unit tests and you
>>>> need to build the common tear down / set up methods to be used by the other
>>>> paragraphs.
>>>>
>>>>
>>>> If you’re going to do that, you’ll need to build out a metadata
>>>> structure where you can set up your dependencies as well as add things
>>>> like labels beyond the ids (which only need to be unique within the given
>>>> notebook).
>>>>
>>>>
>>>> Just my $0.02
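
A labelled dependency graph of the kind suggested above could be run along the lines of the sketch below. It is only a sketch under assumptions: paragraphs are modelled as callables keyed by a hypothetical label, waits_for is an invented metadata structure, and the scheduling uses Python's standard graphlib module (Python 3.9+) rather than anything from Zeppelin.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set


def run_dag(
    bodies: Dict[str, Callable[[], None]],
    waits_for: Dict[str, Set[str]],
    pool_size: int = 4,
) -> List[str]:
    """Run each paragraph as soon as every paragraph it waits for has finished.

    Every label used in waits_for must also be a key of bodies.
    Returns the labels in the order the paragraphs completed.
    """
    sorter = TopologicalSorter({label: waits_for.get(label, set()) for label in bodies})
    sorter.prepare()  # raises graphlib.CycleError if the dependencies form a cycle
    finished: List[str] = []
    running = {}  # future -> label
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        while sorter.is_active():
            # Start every paragraph whose dependencies are all done.
            for label in sorter.get_ready():
                running[pool.submit(bodies[label])] = label
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raises any error from the paragraph
                label = running.pop(future)
                finished.append(label)
                sorter.done(label)  # may unblock dependent paragraphs
    return finished
```

For the unit-test notebook example, waits_for would map each test label to {"setup"} and the teardown label to the set of all test labels, so setup runs first, the tests run in parallel, and teardown runs last.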
>>>>
>>>>
>>>>
>>>> On Sep 29, 2017, at 1:30 PM, moon soo Lee <[email protected]> wrote:
>>>>
>>>>
>>>> The current behavior is as parallel as possible.
>>>> The Run notebook button currently submits all paragraphs in a notebook to
>>>> each interpreter's own scheduler (FIFO, Parallel) at once, and each
>>>> interpreter's scheduler then runs its paragraphs.
>>>>
>>>>
>>>> I think we can provide a "sequential" run button for easier use, which
>>>> submits one paragraph and waits for it to finish before submitting the next.
>>>>
>>>>
>>>> And I think a sequential run button doesn't stop us from having a more
>>>> complex / flexible DAG in the future?
>>>>
>>>>
>>>> Thanks,
>>>> moon
>>>>
>>>>
>>>> On Fri, Sep 29, 2017 at 10:08 AM Mohit Jaggi <[email protected]>
>>>> wrote:
>>>>
>>>> What is the current behavior?
>>>>
>>>>
>>>> On Fri, Sep 29, 2017 at 6:56 AM, Herval Freire <[email protected]>
>>>> wrote:
>>>>
>>>> At least in our case, the notebooks that we need to run sequentially
>>>> are expected to *always* run sequentially, so it makes more sense for this
>>>> to be a note option than a per-run mode
>>>>
>>>>
>>>> H
>>>>
>>>>
>>>>
>>>> _____________________________
>>>> From: moon soo Lee <[email protected]>
>>>> Sent: Thursday, September 28, 2017 9:03 PM
>>>> Subject: Re: Implementing run all paragraphs sequentially
>>>> To: <[email protected]>
>>>>
>>>> This is going to be really useful!
>>>>
>>>>
>>>> Curious: why do you prefer 'note option' instead of 'run option'?
>>>> Could you compare their pros and cons?
>>>>
>>>>
>>>> Thanks,
>>>> moon
>>>>
>>>>
>>>> On Thu, Sep 28, 2017 at 8:32 AM Herval Freire <[email protected]>
>>>> wrote:
>>>>
>>>> +1, our internal users at Twitter also often request this
>>>>
>>>>
>>>> ------------------------------
>>>> *From:* Belousov Maksim Eduardovich <[email protected]>
>>>> *Sent:* Thursday, September 28, 2017 8:28:58 AM
>>>> *To:* [email protected]
>>>> *Subject:* Implementing run all paragraphs sequentially
>>>>
>>>>
>>>> Hello, users!
>>>>
>>>>
>>>> At the moment our analysts often mix interpreters in their
>>>> notes.
>>>> For example, they prepare data using %jdbc and then use it in %pyspark.
>>>> They also often use scheduling for regular reporting, so they
>>>> have to do something like `time.sleep()` to wait for the data from %jdbc.
>>>> That doesn't guarantee the result and doesn't look good.
>>>>
>>>>
>>>> You can find early attempts to implement sequential running of all
>>>> paragraphs in [1].
>>>> We are really interested in implementation of the issue [2] and are
>>>> ready to solve it.
>>>>
>>>>
>>>> It seems a good idea to discuss the requirements.
>>>> My idea is to introduce a note setting that defines the type of running
>>>> to use (parallel or sequential) and leave "Run all" as the only button
>>>> that runs all the cells in the note. This makes sequential or parallel
>>>> running a `note option` rather than a `run option`.
>>>> The option would be controlled by a nearby button, as shown:
>>>>
>>>>
>>>> <~WRD000.jpg>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> For new notes the default state would be "Run sequential all", for old
>>>> - "Run parallel for interpreters"
>>>>
>>>>
>>>> We are glad to hear any thoughts.
>>>> Thank you.
>>>>
>>>>
>>>>
>>>>
>>>> [1] https://issues.apache.org/jira/browse/ZEPPELIN-1165
>>>> [2] https://issues.apache.org/jira/browse/ZEPPELIN-2368
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *Maksim Belousov*
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
