Hi all,

Here is a PR that changes the "Run all" behavior so that all paragraphs run sequentially:
https://github.com/apache/zeppelin/pull/2627
Any comments on this PR are welcome.

David Howell <david.how...@zipmoney.com.au> wrote on Sunday, October 8, 2017 at 9:10 PM:

> This should be implemented as a DAG that is defined sequentially by
> default; additional paragraphs should be appended to the DAG. Reordering
> paragraphs should reorder the DAG.
>
> Implementing it as a DAG will make adding future functionality easier.
>
> Later you can add the functionality to rearrange paragraph dependencies
> within the DAG, perhaps by creating a special %dag interpreter. If you
> define a dag interpreter and forget to add some paragraphs to the DAG
> definition, they should either run sequentially by default (probably hard
> to get right given ambiguous possibilities when missing paragraphs could be
> anywhere) or should error that not all paragraphs have been
> dependency-linked (easier to implement). The output of a DAG paragraph
> should be a visual dependency graph. The syntax for the %dag paragraph
> should follow other conventions like using some arrow to indicate upstream
> to downstream, e.g. -> or >>
>
> paragraph1 -> paragraph2
>
> And allow some diamond dependencies, e.g.:
>
> paragraph1 >> paragraph2
> paragraph1 >> paragraph3
> paragraph2 >> paragraph4
> paragraph3 >> paragraph4
>
> Dave
>
> *From:* Jianfeng (Jeff) Zhang <jzh...@hortonworks.com>
> *Sent:* Saturday, 7 October 2017 11:57 AM
> *To:* users@zeppelin.apache.org
> *Subject:* Re: Implementing run all paragraphs sequentially
>
> Since almost everyone agrees on running serially by default, we could
> implement that first. As for the parallel mode, we could leave it for the
> future, although personally I prefer defining a DAG for the note.
>
> Best regards,
> Jeff Zhang
>
> From: Michael Segel <msegel_had...@hotmail.com>
> Reply-To: "users@zeppelin.apache.org" <users@zeppelin.apache.org>
> Date: Friday, October 6, 2017 at 10:08 PM
> To: "users@zeppelin.apache.org" <users@zeppelin.apache.org>
> Subject: Re: Implementing run all paragraphs sequentially
>
> Guys…
>
> 1) You’re posting this to the user list… Isn’t this a dev question?
>
> 2) +1 on the run serial… but doesn’t that already exist with the “run all
> paragraphs” button?
>
> 3) -1 on a ‘run all in parallel’ button. (It’s like putting lipstick on a
> pig.)
>
> Are you really going to run all of the paragraphs in parallel? You’re not
> going to have a paragraph that is used to set things up? Import external
> libraries? Define classes/functions for future paragraphs to use?
>
> IMHO I would much rather see a DAG where each paragraph can set its
> dependency… (this isn’t the right term. I’m trying to think back to how it
> was described in NeXTStep Objective-C code.)
> Then you could set your parallel button to run in parallel, but if your
> paragraph is dependent on another, it’s blocked from executing until its
> predecessor completes.
>
> But that’s just my $0.02
>
> On Oct 6, 2017, at 2:25 AM, Polyakov Valeriy <v.polja...@tinkoff.ru>
> wrote:
>
> Thank you all for sharing the problem. Naman Mishra had started the
> implementation of serial run in [1], so I propose to come back to the
> discussion of the next step (both Parallel and Serial run buttons) after
> [1] is resolved.
>
> [1] https://issues.apache.org/jira/browse/ZEPPELIN-2368
>
> *Valeriy Polyakov*
>
> *From:* Jeff Zhang <zjf...@gmail.com>
> *Sent:* Friday, October 06, 2017 10:14 AM
> *To:* users@zeppelin.apache.org
> *Subject:* Re: Implementing run all paragraphs sequentially
>
> +1 for serial run by default. Let's leave the others for the future.
>
> Mohit Jaggi <mohitja...@gmail.com> wrote on Friday, October 6, 2017 at 7:48 AM:
>
> +1 for serial run by default.
>
> Sent from my iPhone
>
> On Oct 5, 2017, at 3:36 PM, moon soo Lee <m...@apache.org> wrote:
>
> I'd like us to also consider simplicity of use.
>
> We can have two different modes, or two different run buttons, for Serial
> or Parallel run. This gives the flexibility of choosing between two
> schedulers, but to make users understand the difference between the two
> run buttons, there must be really good UI treatment.
>
> I see there is high user demand for running notebooks sequentially. And I
> think there are 3 action items in this discussion thread.
>
> 1. Change the current run all button behavior from Parallel to Serial
> 2. Provide both Parallel and Serial run buttons with really good UI
>    treatment.
> 3. Provide a DAG
>
> I think 1) does not stop 2) and 3) in the future. 2) also does not stop 3)
> in the future.
>
> So, why don't we try 1) first and keep discussing and polishing the ideas
> about 2) and 3)?
>
> Thanks,
> moon
>
> On Mon, Oct 2, 2017 at 10:22 AM Michael Segel <msegel_had...@hotmail.com>
> wrote:
>
> Whoa!
> Seems I walked into something.
>
> Herval,
>
> What do you suggest? A simple switch that runs everything in serial, or
> everything in parallel?
> That would be a very bad idea.
>
> I gave you an example of a class of solutions where you don’t want that
> behavior.
> E.g. unit testing, where you have one setup and then run several unit
> tests in parallel.
>
> If that’s not enough for you… how about if you want to test
> producer/consumer problems?
>
> Or if you want to define classes in one paragraph but then call on them in
> later paragraphs. If everything runs in parallel from time 0, you can’t do
> this.
>
> So, if you want to do it right the first time… you need to establish a way
> to control the dependency of paragraphs. This isn’t rocket science.
> And frankly not that complex.
>
> BTW, this is the user list, not the dev list…
>
> Just saying… ;-)
>
> On Oct 2, 2017, at 11:24 AM, Herval Freire <hfre...@twitter.com> wrote:
>
> "nice to have" isn't a very strong requirement. I strongly suggest you
> really, really think about this before you start pounding out an
> overengineered solution to a non-issue :-)
>
> h
>
> On Mon, Oct 2, 2017 at 9:12 AM, Michael Segel <msegel_had...@hotmail.com>
> wrote:
>
> Yes…
> You have a bunch of unit tests you can run in parallel where you only need
> one constructor and one cleanup.
>
> I would strongly suggest that you really, really think about this long and
> hard before you start to pound code.
> It’s going to be harder to back out and fix than if you take the time to
> think through the problem and not make a dumb mistake.
>
> On Oct 2, 2017, at 11:02 AM, Herval Freire <hfre...@twitter.com> wrote:
>
> Did anyone request such a case ("running some in parallel and some in
> sequence")? I haven't seen any requests for this in the wild (nor on this
> thread), other than theoretical "what if" - which is totally fine, when it
> doesn't introduce a lot of unnecessary complexity for little to no gain
> (which seems to be the case here)
>
> h
>
> On Mon, Oct 2, 2017 at 8:48 AM, Michael Segel <msegel_had...@hotmail.com>
> wrote:
>
> Because that simplicity doesn’t work.
>
> You will want to run some things serial and some things in parallel.
>
> Which is why you will need a dependency graph.
>
> On Oct 2, 2017, at 10:40 AM, Herval Freire <hfre...@twitter.com> wrote:
>
> Why do you need rules and graphs and any of that to support running
> everything sequentially or everything in parallel?
>
> 3) add a “run mode” to the note. If it’s “sequential”, run the paragraphs
> one at a time, in the order they’re defined.
> If it’s “parallel”, run using the current scheme (as many at the same
> time as the thread pool permits)
>
> Simpler and covers all cases, imo
>
> ------------------------------
> *From:* Polyakov Valeriy <v.polja...@tinkoff.ru>
> *Sent:* Monday, October 2, 2017 8:24:35 AM
> *To:* users@zeppelin.apache.org
> *Subject:* RE: Implementing run all paragraphs sequentially
>
> Let me try to summarize the discussion. Evidently, the current behavior of
> running notes does not meet actual requirements. The most important thing
> that we need is the ability to run sequentially. However, at the same
> time we want to keep the functionality of parallel running. We discussed
> that the most suitable way to build paragraph dependencies is a DAG
> (directed acyclic graph). Therefore, surely, this kind of dependency
> should be defined in the note, and the running order should not depend on
> how we launch it (button / scheduler / API). In this way, our objectives
> are to implement a “dependency definition engine” and to use it in a
> “run engine”.
> What are the options?
>
> 1) Explicit dependency definition.
>
> We could take as a rule that each paragraph should wait for the end of
> execution of ALL previous paragraphs. Then we add a paragraph option “Wait
> for …” where we can choose the paragraph whose completion we wait for
> before starting execution. When the option is set, we start execution
> immediately after the selected paragraph finishes. This pattern allows us
> to implement a fully parallel DAG running order. What are the
> disadvantages? All of them are about the same thing: the dependency
> management process is not easy to understand from the users' perspective
> (and the functionality is probably redundant, in my personal view). First,
> we would have to use the strange format of paragraph IDs, which in
> addition is hidden. We could come up with visible and handsome paragraph
> ID aliases, but then we would need duplication control.
> The second thing is that in some scenarios we have to change existing
> dependencies (e.g. if you need to add a new paragraph between one
> paragraph and a group that depends on it, you have to change the
> “Wait for …” option for each paragraph in the group).
>
> 2) Implicit dependency definition.
>
> We could take as a rule that each paragraph should wait for the end of
> execution of ALL previous paragraphs. Then we add a paragraph option “Run
> in parallel with previous” which allows us to create paragraph groups that
> run in parallel. It turns out that we get sequential running of paragraph
> groups: group by group, with the paragraphs in each group running in
> parallel. This approach is much more understandable for the users, but the
> obvious defect in comparison with “Explicit definition” is that the
> dependency graph and the level of parallelism are more limited.
>
> I am not sure which option, (1) or (2), is correct to implement at the
> moment. I hope to hear from product visionaries which way to choose and to
> get approval to start the implementation.
> Thank you!
>
> *Valeriy Polyakov*
>
> *From:* Michael Segel <msegel_had...@hotmail.com>
> *Sent:* Saturday, September 30, 2017 4:22 PM
> *To:* users@zeppelin.apache.org
> *Subject:* Re: Implementing run all paragraphs sequentially
>
> Sorry to jump in…
>
> If you want to run paragraphs in parallel, you are going to want to have
> some sort of dependency graph. Think of a common setup where you need to
> set up common functions and imports. (setup of %spark.dep)
>
> A good example is if your notebook is a bunch of unit tests and you need
> to build the common tear down / set up methods to be used by the other
> paragraphs.
>
> If you’re going to do that, you’ll need to build out a metadata structure
> where you can set up your dependencies as well as add things like labels
> beyond the IDs (which only need to be unique to the given notebook).
>
> Just my $0.02
>
> On Sep 29, 2017, at 1:30 PM, moon soo Lee <m...@apache.org> wrote:
>
> The current behavior is as parallel as possible.
> The run notebook button currently submits all paragraphs in a notebook
> into each interpreter's own scheduler (FIFO, Parallel) at once. And each
> interpreter's scheduler runs the paragraphs.
>
> I think we can provide a "sequential" run button for easier use, which
> submits one paragraph and waits for it to finish before submitting the
> next.
>
> And I think a sequential run button doesn't stop us from having a more
> complex / flexible DAG in the future?
>
> Thanks,
> moon
>
> On Fri, Sep 29, 2017 at 10:08 AM Mohit Jaggi <mohitja...@gmail.com> wrote:
>
> What is the current behavior?
>
> On Fri, Sep 29, 2017 at 6:56 AM, Herval Freire <hfre...@twitter.com>
> wrote:
>
> At least in our case, the notebooks that we need to run sequentially are
> expected to *always* run sequentially - thus it makes more sense to be a
> note option than a per-run mode
>
> H
>
> _____________________________
> From: moon soo Lee <m...@apache.org>
> Sent: Thursday, September 28, 2017 9:03 PM
> Subject: Re: Implementing run all paragraphs sequentially
> To: <users@zeppelin.apache.org>
>
> This is going to be really useful!
>
> Curious why you prefer 'note option' instead of 'run option'?
> Could you compare their pros and cons?
>
> Thanks,
> moon
>
> On Thu, Sep 28, 2017 at 8:32 AM Herval Freire <hfre...@twitter.com> wrote:
>
> +1, our internal users at Twitter also often request this
>
> ------------------------------
> *From:* Belousov Maksim Eduardovich <m.belou...@tinkoff.ru>
> *Sent:* Thursday, September 28, 2017 8:28:58 AM
> *To:* users@zeppelin.apache.org
> *Subject:* Implementing run all paragraphs sequentially
>
> Hello, users!
>
> At the moment our analysts often use mixes of interpreters in their notes.
> For example, they prepare data using %jdbc and then use it in %pyspark.
> Besides, they often use scheduling to do some regular reporting. And they
> have to do something like `time.sleep()` to wait for the data from %jdbc.
> It doesn't guarantee the result and doesn't look cool.
>
> You can find early attempts to implement sequential running of all
> paragraphs in [1].
> We are really interested in the implementation of the issue [2] and are
> ready to solve it.
>
> It seems a good idea to discuss any requirements.
> My idea is to introduce a note setting that defines the type of running to
> use (parallel or sequential) and leave "Run all" as the only button that
> runs all the cells in the note. This will make sequential or parallel
> running a `note option`, not a `run option`.
> The option will be controlled by a nearby button, as shown:
>
> <~WRD000.jpg>
>
> For new notes the default state would be "Run sequential all", for old
> ones "Run parallel for interpreters".
>
> We are glad to hear any thoughts.
> Thank you.
>
> [1] https://issues.apache.org/jira/browse/ZEPPELIN-1165
> [2] https://issues.apache.org/jira/browse/ZEPPELIN-2368
>
> *Maksim Belousov*
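
For anyone who wants to experiment with the DAG idea discussed in this thread, here is a minimal Python sketch. It is not Zeppelin code: `parse_dag`, `sequential_edges`, and `execution_waves` are hypothetical names made up for this illustration. It assumes the arrow syntax proposed above (`paragraph1 >> paragraph2` or `paragraph1 -> paragraph2`), treats "sequential by default" as a chain of edges in note order, and takes the "easier to implement" route of raising an error when a paragraph is not dependency-linked. Paragraphs in the same wave have no pending upstream work, so a scheduler could run them in parallel.

```python
from collections import defaultdict


def parse_dag(text):
    """Parse lines like 'p1 >> p2' or 'p1 -> p2' into (upstream, downstream) pairs."""
    edges = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        arrow = ">>" if ">>" in line else "->"
        parts = [part.strip() for part in line.split(arrow)]
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"cannot parse dependency line: {raw!r}")
        edges.append((parts[0], parts[1]))
    return edges


def sequential_edges(paragraphs):
    """Default DAG: every paragraph depends on the one just before it."""
    return list(zip(paragraphs, paragraphs[1:]))


def execution_waves(paragraphs, edges):
    """Group paragraphs into waves; a wave starts only after earlier waves finish."""
    known = set(paragraphs)
    linked = {name for edge in edges for name in edge}
    if linked - known:
        raise ValueError(f"unknown paragraphs: {sorted(linked - known)}")
    if len(paragraphs) > 1 and known - linked:
        raise ValueError(f"not dependency-linked: {sorted(known - linked)}")

    downstream = defaultdict(list)
    pending = {name: 0 for name in paragraphs}  # unfinished upstream count
    for up, down in edges:
        downstream[up].append(down)
        pending[down] += 1

    waves = []
    ready = [name for name in paragraphs if pending[name] == 0]
    while ready:
        waves.append(ready)
        unblocked = []
        for name in ready:
            for child in downstream[name]:
                pending[child] -= 1
                if pending[child] == 0:
                    unblocked.append(child)
        # Keep note order inside a wave so the result is deterministic.
        ready = sorted(unblocked, key=paragraphs.index)
    if sum(len(wave) for wave in waves) != len(paragraphs):
        raise ValueError("dependency cycle detected")
    return waves


note = ["paragraph1", "paragraph2", "paragraph3", "paragraph4"]
diamond = """
paragraph1 >> paragraph2
paragraph1 >> paragraph3
paragraph2 >> paragraph4
paragraph3 >> paragraph4
"""

print(execution_waves(note, parse_dag(diamond)))
# [['paragraph1'], ['paragraph2', 'paragraph3'], ['paragraph4']]
print(execution_waves(note, sequential_edges(note)))
# [['paragraph1'], ['paragraph2'], ['paragraph3'], ['paragraph4']]
```

With the diamond example, paragraph2 and paragraph3 land in the same wave, while the default sequential DAG gives one paragraph per wave, which matches the "run serial by default" behavior most people in the thread voted for.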