Re: Zeppelin in GSOC 2019

Xun Liu Thu, 07 Mar 2019 00:06:10 -0800

Hi Vasiliy Morkovkin

Thank you very much for your willingness to implement the workflow feature.
I will work with you as my highest priority.
I plan to first update the system design document for workflow at
https://issues.apache.org/jira/browse/ZEPPELIN-4018 .
Please add yourself as a watcher on ZEPPELIN-4018.
That way you will be notified of document updates promptly.


We can discuss all questions in the ZEPPELIN-4018 JIRA comments.
If you need to, you can email me at [email protected];
I will reply as quickly as I can.
Do you think this way of working together is OK?


@moon, @Jeff, @Jongyoul Lee: if you are interested, please help us improve the
system design. Thanks!

:-)

> On Mar 7, 2019, at 6:04 AM, Морковкин, Василий Владимирович
> <[email protected]> wrote:
> 
> Thank you for such detailed feedback!
> I am definitely interested in working on the workflow implementation with
> you, Xun Liu! Could you become a GSOC mentor for this task?
> Some front-end work is not a problem at all.
> I'm ready to work at least 30 hours per week in the summer. For now, I'd
> like to take on some smaller tasks to get a closer look at the existing
> codebase and become familiar with your development workflow. Do you have
> such tasks in mind?
> 
> On Wed, Mar 6, 2019 at 05:23, Xun Liu <[email protected]> wrote:
> Hi Vasiliy Morkovkin
> 
> I have written up my thoughts on workflow at
> https://issues.apache.org/jira/browse/ZEPPELIN-4018
> 
> Because Zeppelin has more than 20 interpreters,
> data analysts can use it for many kinds of data development,
> and much of that work is interdependent. For example,
> developing machine learning algorithms relies on Spark to
> preprocess the data, and so on.
> 
> The open-source workflow tools available today include Azkaban and Airflow.
> Azkaban is relatively simple and covers most scenarios, and
> our company is using it.
> Airflow looks complicated, and I have not used it.
> In fact, I have previously implemented workflow for notes and
> paragraphs in Zeppelin via Azkaban:
> https://youtu.be/2r6q-2Tq7hk?t=33
> 
> However, I think Zeppelin should have built-in workflow capabilities
> instead of relying on external software to schedule notes, for the
> following reasons:
> 1. We have moved from the data processing era to the algorithm era.
> Once Zeppelin has its own workflow, it will form a data loop.
> 
> 2. Zeppelin's powerful interactive processing capabilities help algorithm
> engineers improve their productivity.
> Zeppelin should give algorithm engineers more direct control,
> instead of handing their algorithms to other teams (or software) to run the
> workflow.
> 
> 3. Zeppelin knows more about the processing status of the data than Azkaban
> or Airflow do,
> so a built-in workflow would offer better performance, user experience, and
> control.
> 
> If you are interested in workflow (ZEPPELIN-4018),
> I am willing to work with you on all of the system design and code
> development.
> 
> :-)
> 
>> On Mar 6, 2019, at 9:32 AM, Jeff Zhang <[email protected]> wrote:
>> 
>> Hi Basil,
>> 
>> Thanks for your interest in Zeppelin. Here are my comments on the tickets
>> you are interested in.
>> 
>> 1. https://issues.apache.org/jira/browse/ZEPPELIN-3651
>>    This involves two sides of work, frontend and backend.
>>    In the frontend, we should use Arrow JS to handle the table data,
>> including displaying it and processing it (such as aggregation).
>>    In the backend, we should use Arrow for each language and allow them to
>> exchange data within the same process, and use Arrow IPC to exchange data
>> across processes.
>>    Overall, this is a pretty large task. If you really want to do it, I
>> would suggest taking on just part of it.
>> 
>> 2. https://issues.apache.org/jira/browse/ZEPPELIN-3994
>>    Regarding model serving, I don't have a clear picture of this. Others
>> can comment on it.
>> 
>> 3. https://issues.apache.org/jira/browse/ZEPPELIN-4018
>>    Job scheduling is pretty important for Zeppelin; I would make it the
>> highest priority among these tickets. Airflow is one option, but I am open
>> to other solutions. First we need to figure out how users schedule jobs in
>> Zeppelin, then choose the right framework. It would also involve some
>> frontend work.
>> 
>> 4. https://issues.apache.org/jira/browse/ZEPPELIN-3857
>>    Spark 2.4.0 support is already there, but Scala 2.12 is not
>> supported yet. It won't be a big project for GSOC, IMO.
>> 
>> 5. OLAP.
>>    Regarding OLAP, as long as the OLAP engine provides a JDBC interface,
>> Zeppelin can support it very well. But we could create a specific
>> interpreter for an OLAP engine if its native API performs better than JDBC.
>> Another thing I can think of for improving OLAP is visualization: although
>> Zeppelin already supports some built-in visualizations, some are still
>> missing. We could provide more.
>> 
>> 6. Auto-completions.
>>   We already support IPython[1] in Zeppelin, which provides almost the
>> same auto-completion as Jupyter. But it lacks access to the Python API
>> docs, which is also pretty important for Python users, IMO. SQL is another
>> popular language in Zeppelin, but it doesn't provide a good
>> code-completion experience either; we can do better there as well.
>> 
>> 7. Notifications.
>>   I think notifications can be integrated into job scheduling. A
>> notification can be sent when a job fails or succeeds.
>> 
>> 
>> Let us know which JIRA you are most interested in, and please also consider
>> how much time you can spend on this. Again, we very much appreciate your
>> interest in Zeppelin and look forward to your contribution.
>> 
>> 
>> [1]
>> http://zeppelin.apache.org/docs/0.8.1/interpreter/python.html#ipython-support
>> 
>> 
>> 
>> On Wed, Mar 6, 2019 at 7:41 AM, Морковкин, Василий Владимирович
>> <[email protected]> wrote:
>> 
>>> Thank you for your replies! I've checked the existing set of issues and
>>> found several interesting ones:
>>> - https://issues.apache.org/jira/browse/ZEPPELIN-3651 seems to be a very
>>> nice way to increase analytical processing performance using the Arrow
>>> project;
>>> - https://issues.apache.org/jira/browse/ZEPPELIN-3994 deploying models
>>> independently of ZeppelinServer sounds quite intriguing too, although
>>> there is much to think about;
>>> - https://issues.apache.org/jira/browse/ZEPPELIN-4018 at first glance,
>>> https://airflow.apache.org/ seems to be useful for implementing complex
>>> execution workflows.
>>> These tasks are broad and intriguing, and they require complex
>>> architectural solutions.
>>> I've also probably found a ticket that is suitable for me to get
>>> involved in the project:
>>> - https://issues.apache.org/jira/browse/ZEPPELIN-3857. What do you think?
>>> Are there any "low-hanging fruits"?
>>> 
>>> I also have several ideas of my own. Some of them might not be relevant
>>> because of the vision of the project or other reasons. Just ideas:
>>> - OLAP. As Zeppelin is a tool aimed at analytics, it seems quite
>>> logical to add more integrations with existing OLAP solutions like Pinot,
>>> ClickHouse and Druid. So far I've found an integration only with Kylin;
>>> - Better autocompletion. Jupyter offers not only a list of already
>>> initialized variables, but also quick access to documentation. It's
>>> convenient;
>>> - Notifications. Some colleagues would appreciate a notification
>>> service that sends you messages (via mail, a Slack bot or something else)
>>> indicating that your long-running paragraphs have completed.
>>> 
>>> Feedback is much appreciated :)
>>>
>>> It would be wonderful if someone agreed to give their time and become a
>>> mentor in the GSOC program!
>>> 
>>> ----------------------------------------
>>> Best regards, Basil Morkovkin.
>>> 
>>> 
>>> On Tue, Mar 5, 2019 at 11:48, Jongyoul Lee <[email protected]> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> I've confirmed that I can add more issues for GSOC. Can you explain what
>>>> you would like to contribute to? I can add more issues.
>>>> 
>>>> JL
>>>> 
>>>> On Tue, Mar 5, 2019 at 1:03 PM Xun Liu <[email protected]> wrote:
>>>> 
>>>>> Hi, Vasiliy Morkovkin
>>>>> 
>>>>> Welcome to the Zeppelin community! :-)
>>>>> 
>>>>>> On Mar 5, 2019, at 11:49 AM, Jongyoul Lee <[email protected]> wrote:
>>>>>> 
>>>>>> Thanks for contacting Zeppelin with your interest.
>>>>>>
>>>>>> I added FE topics for GSOC because FE is the most urgent issue I have
>>>>>> thought about. We always encourage contributions to Zeppelin on a range
>>>>>> of topics, including your own ideas.
>>>>>>
>>>>>> Please describe your ideas in more detail.
>>>>>> 
>>>>>> Thanks.
>>>>>> JL
>>>>>> 
>>>>>> On Tue, Mar 5, 2019 at 10:41 AM moon soo Lee <[email protected]> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> Great to see your interest in the project. Thanks!
>>>>>>> Looks like we need volunteers for a mentor and some backend subjects
>>>>>>> for GSoC 2019.
>>>>>>> Any ideas?
>>>>>>> 
>>>>>>> Best,
>>>>>>> moon
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Mar 4, 2019 at 3:05 PM Vasiliy Morkovkin <[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi everyone, I'm pursuing a bachelor's degree at the Moscow Institute
>>>>>>>> of Physics and Technology and am eager to contribute to Zeppelin as
>>>>>>>> part of GSOC 2019. I've become a real fan of Zeppelin over the past
>>>>>>>> couple of months, using it at my job. But I have found only one
>>>>>>>> ticket (a front-end task) with the GSOC 2019 label on your Jira.
>>>>>>>> Perhaps you have ideas for new features or improvements in Zeppelin
>>>>>>>> but don't have enough hands for them. It would be wonderful if anyone
>>>>>>>> agreed to mentor these ideas within GSOC :)
>>>>>>>> I have been working as a Scala developer (back-end) for 1.5 years.
>>>>>>>> I can also write Java or Python without any problems if necessary.
>>>>>>>> I'm really fond of databases and high-load systems. I also have
>>>>>>>> experience with some other great Apache projects like Cassandra,
>>>>>>>> Kafka and Spark.
>>>>>>>> 
>>>>>>>> Best regards, Basil Morkovkin.
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> 이종열, Jongyoul Lee, 李宗烈
>>>>>> http://madeng.net
>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> 이종열, Jongyoul Lee, 李宗烈
>>>> http://madeng.net
>>>> 
>>> 
>> 
>> 
>> -- 
>> Best Regards
>> 
>> Jeff Zhang
>
