RE: Implementing run all paragraphs sequentially

Polyakov Valeriy Fri, 29 Sep 2017 04:51:51 -0700

I suppose there is a fairly simple solution to the problem. We could put a flag 
on a paragraph that means “this paragraph should be run in parallel with the 
previous one”. Such logic would allow mixed sequential-parallel runs. It does 
not provide full DAG capabilities, but it is easy to understand and to use.
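
A minimal sketch of how such a flag could be interpreted, assuming each 
paragraph carries a boolean "parallel with previous" flag. The function name, 
the flag name and the paragraph ids are hypothetical and are not part of 
Zeppelin. Paragraphs are grouped into stages; the stages run one after another, 
and the paragraphs inside one stage may run in parallel.

```python
from typing import List, Tuple


def group_into_stages(paragraphs: List[Tuple[str, bool]]) -> List[List[str]]:
    """Group (paragraph_id, parallel_with_previous) pairs into ordered stages.

    A paragraph whose flag is True joins the stage of the paragraph before it;
    otherwise it starts a new stage. The flag on the first paragraph is ignored
    because there is no previous paragraph to join.
    """
    stages: List[List[str]] = []
    for paragraph_id, parallel_with_previous in paragraphs:
        if parallel_with_previous and stages:
            stages[-1].append(paragraph_id)
        else:
            stages.append([paragraph_id])
    return stages


# Hypothetical note: p3 is flagged to run in parallel with p2.
note = [("p1", False), ("p2", False), ("p3", True), ("p4", False)]
stages = group_into_stages(note)
print(stages)
```

With this scheme a paragraph can only join the stage directly before it, which 
is why it cannot express an arbitrary DAG.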

Valeriy Polyakov

From: Sotnichenko Sergey [mailto:[email protected]]
Sent: Friday, September 29, 2017 2:45 PM
To: [email protected]
Subject: RE: Implementing run all paragraphs sequentially

To be honest, it would be very complicated to build a DAG with names like 
‘20170929-143857_1744629322’. Imagine we have 20 paragraphs with such 
names.

Sergey Sotnichenko

From: Jeff Zhang [mailto:[email protected]]
Sent: Friday, September 29, 2017 2:35 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: Implementing run all paragraphs sequentially

'p1' and 'p2' are paragraph IDs. Regarding readability, we could allow the user 
to set a paragraph name, but that is another story and could be a later 
improvement.

Partridge, Lucas (GE Aviation) 
<[email protected]<mailto:[email protected]>> wrote on Fri, Sep 29, 2017 at 7:30 PM:
Interesting idea. But by ‘p1’, ‘p2’, etc., did you literally mean that, or were 
you using that as shorthand for the ID of the paragraph?
If the former, then what happens if someone inserts, deletes or reorders 
paragraphs? But if the latter, then the paragraph IDs wouldn’t be very easy for 
someone to read when following the dependency relationships…

From: Jeff Zhang [mailto:[email protected]<mailto:[email protected]>]
Sent: 29 September 2017 11:58
To: [email protected]<mailto:[email protected]>
Subject: EXT: Re: Implementing run all paragraphs sequentially

I don't think a two-valued note setting (parallel/sequential) is sufficient for 
paragraph scheduling. Take the Spark tutorial note as an example: we should run 
the paragraph that loads the bank data first, and then we could run all the SQL 
paragraphs in parallel. So the key is how we define the dependency relationship 
between paragraphs. The paragraphs of a note could form a DAG (directed acyclic 
graph). Sequential running is just one special kind of DAG (a linked list).

I believe we discussed this before in the community. My proposal is that we 
could add an attribute to the interpreter indicator of each paragraph, so that 
the user can specify the paragraph's dependencies (if the user doesn't specify 
any, the default dependency is the paragraph ahead of it). Take the Spark 
tutorial note as an example again. We have 3 paragraphs: the first one loads 
the bank data, and the second and third query the data. So paragraphs 2 and 3 
can run in parallel but must run after paragraph 1, and we need to specify that 
dependency in the interpreter indicator. Of course, the user doesn't need to 
specify dependencies to run all the paragraphs sequentially, because the 
default dependency is the paragraph ahead.

Paragraph 1.

%spark
// code to load bank data

Paragraph 2.

%spark.sql(deps=p1)
// query the bank data

Paragraph 3.

%spark.sql(deps=p1)
// query the bank data
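
To make the proposal concrete, here is a rough sketch of how the `deps` 
attribute could be read and turned into a run order. It is only an 
illustration under stated assumptions: the `(deps=...)` syntax is the proposal 
above and is not an existing Zeppelin feature, the comma-separated form for 
several dependencies is my assumption, and `parse_deps` and `execution_waves` 
are hypothetical names. A paragraph without `deps` depends on the paragraph 
ahead of it, as proposed.

```python
import re
from typing import Dict, List

# Hypothetical indicator syntax taken from the proposal: %interpreter(deps=p1,p2)
INDICATOR = re.compile(r"^%(?P<interp>[\w.]+)(?:\(deps=(?P<deps>[\w,\s]*)\))?")


def parse_deps(paragraphs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each paragraph id to the ids it depends on, in note order."""
    deps: Dict[str, List[str]] = {}
    previous = None
    for pid, text in paragraphs.items():
        match = INDICATOR.match(text.lstrip())
        declared = match.group("deps") if match else None
        if declared is not None:
            deps[pid] = [d.strip() for d in declared.split(",") if d.strip()]
        else:
            # Default dependency: the paragraph ahead of this one, if any.
            deps[pid] = [previous] if previous is not None else []
        previous = pid
    return deps


def execution_waves(deps: Dict[str, List[str]]) -> List[List[str]]:
    """Return waves of paragraph ids; each wave may run in parallel."""
    remaining = dict(deps)
    done = set()
    waves: List[List[str]] = []
    while remaining:
        ready = [p for p, ds in remaining.items() if all(d in done for d in ds)]
        if not ready:
            raise ValueError("dependency cycle or unknown paragraph id")
        waves.append(ready)
        done.update(ready)
        for p in ready:
            del remaining[p]
    return waves


note = {
    "p1": "%spark\n// code to load bank data",
    "p2": "%spark.sql(deps=p1)\n// query the bank data",
    "p3": "%spark.sql(deps=p1)\n// query the bank data",
}
waves = execution_waves(parse_deps(note))
print(waves)
```

If no paragraph declares `deps`, every wave contains exactly one paragraph, 
which is the sequential (linked list) case.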

afancy <[email protected]<mailto:[email protected]>> wrote on Fri, Sep 29, 2017 at 5:35 PM:
+1

I think this is one of the most important features. I don't know why this 
requirement has been skipped.

/afancy

On Thu, Sep 28, 2017 at 5:28 PM, Belousov Maksim Eduardovich 
<[email protected]<mailto:[email protected]>> wrote:
Hello, users!
At the moment our analysts often mix interpreters in their notes.
For example, they prepare data using %jdbc and then use it in %pyspark. 
Besides, they often use scheduling for regular reporting, so they have to 
do something like `time.sleep()` to wait for the data from %jdbc. That 
doesn't guarantee the result and doesn't look good.

You can find early attempts to implement sequential running of all paragraphs 
in [1].
We are really interested in getting issue [2] implemented and are ready to 
work on it.
It seems a good idea to discuss the requirements first.
My idea is to introduce a note setting that defines the type of run to use 
(parallel or sequential) and to leave "Run all" as the only button that runs 
all the cells in the note. This makes sequential or parallel running a `note 
option` rather than a `run option`.
The option would be controlled by a nearby button, as shown.

For new notes the default state would be "Run sequential all"; for old notes, 
"Run parallel for interpreters".
We would be glad to hear any thoughts.
Thank you.
Thank you.

[1] https://issues.apache.org/jira/browse/ZEPPELIN-1165
[2] https://issues.apache.org/jira/browse/ZEPPELIN-2368

Maksim Belousov
