Re: Implementing run all paragraphs sequentially

Jeff Zhang Fri, 29 Sep 2017 04:57:08 -0700

>>> I suppose there is a fairly simple solution to the problem. We can use
a flag on a paragraph which means “this paragraph should be run in parallel
with the previous one”. Such logic could help to create sequential-parallel
running. It does not implement full DAG capabilities, but it’s easy to
understand and to use.


This can cover some cases, but I don't think it can cover all of them.
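
A minimal sketch of the flag idea quoted above, assuming each paragraph carries a hypothetical boolean `parallel_with_previous` (this name and the function below are illustrative only, not Zeppelin APIs). Paragraphs are grouped into stages; stages run one after another, and paragraphs inside a stage may run concurrently.

```python
from typing import List, Tuple


def group_into_stages(paragraphs: List[Tuple[str, bool]]) -> List[List[str]]:
    """Group paragraphs into sequential stages.

    Each item is (paragraph_id, parallel_with_previous), given in note order.
    The flag name is a hypothetical illustration, not an existing setting.
    """
    stages: List[List[str]] = []
    for paragraph_id, parallel_with_previous in paragraphs:
        if parallel_with_previous and stages:
            # Join the stage of the previous paragraph.
            stages[-1].append(paragraph_id)
        else:
            # Start a new stage that waits for all earlier stages.
            stages.append([paragraph_id])
    return stages


# Spark tutorial shape: load data first, then two queries side by side.
stages = group_into_stages([("p1", False), ("p2", False), ("p3", True)])
```

One case this grouping cannot express precisely: if p3 needs only p1 and p4 needs only p2, the stages would be [p1, p2] then [p3, p4], so p3 also waits for p2 even though it does not depend on it.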


Jeff Zhang <zjf...@gmail.com> wrote on Fri, Sep 29, 2017 at 7:52 PM:

> Yes, it may look a little complicated, but I think that is due to how we
> name paragraphs, not due to this approach. IMHO, without specifying the
> dependency relationships between paragraphs, it is almost impossible to
> schedule paragraphs correctly.
>
>
>
>
> Sotnichenko Sergey <s.sotniche...@tinkoff.ru> wrote on Fri, Sep 29, 2017 at 7:45 PM:
>
>> To be honest, it would be very complicated to build a DAG with names like
>> ‘20170929-143857_1744629322’. Imagine we have 20 paragraphs with such
>> names.
>>
>>
>>
>>
>>
>>
>> *Sergey Sotnichenko *
>>
>>
>>
>>
>>
>> *From:* Jeff Zhang [mailto:zjf...@gmail.com]
>> *Sent:* Friday, September 29, 2017 2:35 PM
>> *To:* users@zeppelin.apache.org
>> *Subject:* Re: Implementing run all paragraphs sequentially
>>
>>
>>
>>
>>
>> 'p1' and 'p2' are paragraph ids. Regarding readability, we could allow users
>> to set paragraph names, but that is another story and could be a later
>> improvement.
>>
>>
>>
>>
>>
>>
>>
>> Partridge, Lucas (GE Aviation) <lucas.partri...@ge.com> wrote on Fri,
>> Sep 29, 2017 at 7:30 PM:
>>
>> Interesting idea.  But by ‘p1’, ‘p2’, etc did you literally mean that; or
>> were you using that as shorthand for the id of the paragraph?
>>
>> If the former then what happens if someone inserts, deletes or reorders
>> paragraphs? But if the latter then the paragraph ids wouldn’t be very easy
>> for someone to read and follow the dependency relationships…
>>
>>
>>
>> *From:* Jeff Zhang [mailto:zjf...@gmail.com]
>> *Sent:* 29 September 2017 11:58
>> *To:* users@zeppelin.apache.org
>> *Subject:* EXT: Re: Implementing run all paragraphs sequentially
>>
>>
>>
>>
>>
>> I don't think two note settings (parallel/sequential) are sufficient for
>> paragraph scheduling. Take the Spark tutorial note as an example: we should
>> run the paragraph that loads the bank data first, and then we could run all
>> the SQL paragraphs in parallel. So the key is how we define the dependency
>> relationships between paragraphs. The paragraphs of a note could form a DAG
>> (directed acyclic graph). Sequential running is just one special kind of
>> DAG (a linked list).
>>
>>
>>
>> I believe we discussed this before in the community. My proposal is that we
>> add an attribute to the interpreter indicator of each paragraph, so that
>> the user can specify the paragraph's dependencies (if the user doesn't
>> specify any, the default dependency is the paragraph ahead of it). Take the
>> Spark tutorial note as an example again. We have 3 paragraphs: the first one
>> loads the bank data, and the second and third query the data. So paragraphs
>> 2 and 3 can run in parallel but must run after paragraph 1. We then need to
>> specify their dependencies in the interpreter indicator part. Of course,
>> users don't need to specify dependencies if they want to run all the
>> paragraphs sequentially, because the default dependency is the paragraph
>> ahead of it.
>>
>>
>>
>> Paragraph 1.
>>
>>
>>
>> %spark
>>
>> // code to load bank data
>>
>>
>>
>> Paragraph 2.
>>
>>
>>
>> %spark.sql(deps=p1)
>>
>> // query the bank data
>>
>>
>>
>> Paragraph 3.
>>
>> %spark.sql(deps=p1)
>>
>> // query the bank data
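
A minimal sketch of how the `deps` proposal above might be scheduled, under stated assumptions: paragraph ids are given in note order, declared dependencies are passed in as a plain dict, and a paragraph with no declaration depends on the one ahead of it. The function names are hypothetical and none of this is Zeppelin code. It groups paragraphs into "waves", a simplification; a real scheduler could start each paragraph as soon as its own dependencies finish.

```python
from typing import Dict, List


def resolve_deps(order: List[str], declared: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Fill in the default dependency: the paragraph ahead in note order."""
    deps: Dict[str, List[str]] = {}
    for index, paragraph_id in enumerate(order):
        if paragraph_id in declared:
            deps[paragraph_id] = list(declared[paragraph_id])
        else:
            deps[paragraph_id] = [order[index - 1]] if index > 0 else []
    return deps


def execution_waves(order: List[str], declared: Dict[str, List[str]]) -> List[List[str]]:
    """Return lists of paragraphs; each list can run in parallel after the previous ones."""
    deps = resolve_deps(order, declared)
    unknown = {d for ds in deps.values() for d in ds} - set(order)
    if unknown:
        raise ValueError(f"unknown paragraph ids in deps: {sorted(unknown)}")

    done = set()
    remaining = list(order)
    waves: List[List[str]] = []
    while remaining:
        # A paragraph is ready once every dependency has finished.
        ready = [p for p in remaining if all(d in done for d in deps[p])]
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        done.update(ready)
        remaining = [p for p in remaining if p not in done]
    return waves


# The three-paragraph example: p2 and p3 both declare deps=p1.
tutorial = execution_waves(["p1", "p2", "p3"], {"p2": ["p1"], "p3": ["p1"]})
# No declarations at all: the default falls back to sequential running.
sequential = execution_waves(["p1", "p2", "p3"], {})
```

With the declarations from the example, p1 runs alone and p2 and p3 share the second wave; with no declarations, each paragraph gets its own wave, which matches the "linked list" special case.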
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> afancy <grou...@gmail.com> wrote on Fri, Sep 29, 2017 at 5:35 PM:
>>
>> +1
>>
>> I think this is one of the most important features. I don't know why this
>> requirement has been skipped.
>>
>>
>>
>> /afancy
>>
>> On Thu, Sep 28, 2017 at 5:28 PM, Belousov Maksim Eduardovich <
>> m.belou...@tinkoff.ru> wrote:
>>
>> Hello, users!
>>
>> At the moment our analysts often use mixes of interpreters in their notes.
>>
>> For example, they prepare data using %jdbc and then use it in %pyspark.
>> Besides, they often use scheduling to produce regular reports. To wait for
>> the data from %jdbc they have to do something like `time.sleep()`. It
>> doesn't guarantee the result and doesn't look cool.
>>
>>
>>
>> You can find early attempts to implement sequential running of all
>> paragraphs in [1].
>>
>> We are really interested in implementation of the issue [2] and are ready
>> to solve it.
>>
>> It seems a good idea to discuss any requirements.
>>
>> My idea is to introduce a note setting that defines the type of running to
>> use (parallel or sequential) and leave "Run all" as the only button that
>> runs all the cells in the note. This will make sequential or parallel
>> running a `note option` rather than a `run option`.
>>
>> The option will be controlled by a nearby button as shown
>>
>>
>>
>>
>>
>> For new notes the default state would be "Run sequential all"; for old
>> notes, "Run parallel for interpreters".
>>
>> We are glad to hear any thoughts.
>>
>> Thank you.
>>
>>
>>
>> [1] https://issues.apache.org/jira/browse/ZEPPELIN-1165
>>
>> [2] https://issues.apache.org/jira/browse/ZEPPELIN-2368
>>
>>
>>
>>
>>
>>
>> *Maksim Belousov*
>>
>>
>>
>>
