Re: [DISCUSS] Share Data in Zeppelin

Jeff Zhang Thu, 12 Jul 2018 20:01:51 -0700

Thanks Sanjay, I have fixed the example note.

*Folks, please note:* the example note is only a mock-up; it won't work
for now.
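
Since the example note cannot be run yet, here is a rough, self-contained
Python sketch of the 2.b flow quoted further down (save a SQL result under a
name, then read it from another paragraph). Everything in it is an assumption
made for illustration: `resource_pool`, `run_sql_paragraph` and `get_table`
are hypothetical names, the pool is a plain in-process dict rather than
Zeppelin's ResourcePool, and sqlite3 stands in for the jdbc interpreter.

```python
import sqlite3

# Hypothetical stand-in for Zeppelin's ResourcePool: a plain in-process dict.
# The real pool is shared across interpreter processes; this one is not.
resource_pool = {}


def run_sql_paragraph(conn, sql, save_as_table=None):
    """Run a query; if save_as_table is set, store the result under that name."""
    cursor = conn.execute(sql)
    columns = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    if save_as_table is not None:
        resource_pool[save_as_table] = {"columns": columns, "rows": rows}
    return columns, rows


def get_table(name):
    """Hypothetical analogue of the proposed z.getTable(name)."""
    return resource_pool[name]


# "Paragraph 1": plays the role of %jdbc(saveAsTable=people).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oracle_table (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO oracle_table VALUES (?, ?)",
    [("alice", 30), ("bob", 25)],
)
run_sql_paragraph(
    conn, "SELECT * FROM oracle_table ORDER BY name", save_as_table="people"
)

# "Paragraph 2": plays the role of %python reading the shared table by name.
people = get_table("people")
age_index = people["columns"].index("age")
ages = [row[age_index] for row in people["rows"]]
```

The consumer only needs the name "people"; it does not need to know which
paragraph produced the data. Holding everything in memory is exactly the
small-data case, so this says nothing about large results.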




Jongyoul Lee <jongy...@gmail.com> wrote on Fri, Jul 13, 2018 at 10:54 AM:

> BTW, we need to consider at design time the case where the result is
> large. In my experience, if we implement this feature, users could well
> use it with large data.
>
> On Fri, Jul 13, 2018 at 11:51 AM, Sanjay Dasgupta <
> sanjay.dasgu...@gmail.com> wrote:
>
>> I prefer 2.b also. Could we use (save*Result*AsTable=people) instead?
>>
>> There are a few typos in the example note shared:
>>
>> 1) The line val peopleDF = spark.read.format("zeppelin").load() should
>> mention the table name (possibly as argument to load?)
>> 2) The Python line val peopleDF = z.getTable("people").toPandas() should
>> not have the val
>>
>>
>> The z.getTable(<table-name>) method could be a very good tool to judge
>> which use-cases are important in the community. It is easy to implement for
>> the in-memory data case, and could be very useful for many situations where
>> a small amount of data is being transferred across interpreters (like the
>> jdbc -> matplotlib case mentioned).
>>
>> Thanks,
>> Sanjay
>>
>> On Fri, Jul 13, 2018 at 8:07 AM, Jongyoul Lee <jongy...@gmail.com> wrote:
>>
>>> Yes, it's similar to 2.b.
>>>
>>> Basically, my concern is handling all kinds of data, whereas your
>>> proposal seems to focus on table data. That is useful too, but it would
>>> be better to handle all data, including tables and plain text.
>>> WDYT?
>>>
>>> About storage, we could discuss it later.
>>>
>>> On Fri, Jul 13, 2018 at 11:25 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>
>>>>
>>>> I think your use case is the same as 2.b.  Personally, I don't recommend
>>>> using z.get(noteId, paragraphId) to get the shared data, for 2 reasons:
>>>> 1. noteId and paragraphId are meaningless to a reader, so the code is
>>>>    not readable.
>>>> 2. The note will break if we clone it, as the noteId changes.
>>>> That's why I suggest using a paragraph property to save the paragraph's
>>>> result.
>>>>
>>>> Regarding the intermediate storage, I also thought about it and agree
>>>> that in the long term we should provide such a layer to support large
>>>> data; currently we put the shared data in memory, which is not a scalable
>>>> solution.  One candidate in my mind is Alluxio [1], and regarding the data
>>>> format I think Apache Arrow [2] is another good option for Zeppelin to
>>>> share table data across interpreter processes and different languages. But
>>>> these are all implementation details; I think we can talk about them in
>>>> another thread. In this thread, I think we should focus on the user-facing
>>>> API.
>>>>
>>>>
>>>> [1] http://www.alluxio.org/
>>>> [2] https://arrow.apache.org/
>>>>
>>>>
>>>>
>>>> Jongyoul Lee <jongy...@gmail.com> wrote on Fri, Jul 13, 2018 at 10:11 AM:
>>>>
>>>>> I have a slightly different idea for sharing data.
>>>>>
>>>>> In my case,
>>>>>
>>>>> It would be very useful to get a paragraph's result as input to
>>>>> other paragraphs.
>>>>>
>>>>> e.g.
>>>>>
>>>>> -- Paragraph 1
>>>>> %jdbc
>>>>> select * from some_table;
>>>>>
>>>>> -- Paragraph 2
>>>>> %spark
>>>>> val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
>>>>> spark.read(table).select....
>>>>>
>>>>> If paragraph 1's result is too big to show on the FE, it would be saved
>>>>> in the Zeppelin server in a suitable way and passed to SparkInterpreter
>>>>> when Paragraph 2 is executed.
>>>>>
>>>>> Basically, I think we need intermediate storage to store paragraphs'
>>>>> results in order to share them. We can introduce another layer or extend
>>>>> NotebookRepo. In some cases, we might change notebook repos as well.
>>>>>
>>>>> JL
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jul 13, 2018 at 10:39 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>
>>>>>> Hi Folks,
>>>>>>
>>>>>> Recently, there have been several tickets [1][2][3] about sharing data
>>>>>> in Zeppelin. Zeppelin's goal is to be a unified data analysis platform
>>>>>> that integrates most of the big data tools and helps users switch
>>>>>> between tools and share data between them easily. So sharing data is a
>>>>>> critical, killer feature of Zeppelin IMHO.
>>>>>>
>>>>>> I am raising this thread to discuss the scenarios for sharing data and
>>>>>> how to support them. Although Zeppelin already provides tools and APIs
>>>>>> to share data, I don't think they are mature and stable enough. After
>>>>>> seeing these tickets, I think it might be a good time to talk about it
>>>>>> in the community and gather more feedback, so that we can provide a
>>>>>> more stable and mature approach.
>>>>>>
>>>>>> Currently, there are 3 approaches to sharing data between interpreters
>>>>>> and interpreter processes:
>>>>>> 1. Sharing data across interpreters in the same interpreter process,
>>>>>>    e.g. sharing data via the same SparkContext in %spark,
>>>>>>    %spark.pyspark and %spark.r.
>>>>>> 2. Sharing data between frontend and backend via angularObject
>>>>>> 3. Sharing data across interpreter processes via Zeppelin's
>>>>>>    ResourcePool
>>>>>>
>>>>>> In this thread, I would like to talk about approach 3 (sharing data
>>>>>> via Zeppelin's ResourcePool).
>>>>>>
>>>>>> Here's my current thinking on sharing data.
>>>>>> 1. What kind of data would be shared?
>>>>>>    IMHO, users would share 2 kinds of data: primitive data (string,
>>>>>>    number) and table data.
>>>>>>
>>>>>> 2. How to write shared data?
>>>>>>     Users may want to share data via 2 approaches:
>>>>>>     a. Use ZeppelinContext (e.g. z.put).
>>>>>>     b. Share the paragraph result via paragraph properties. E.g. a user
>>>>>>     may want to read data from an Oracle database via the jdbc
>>>>>>     interpreter and then do plotting in the python interpreter. In such
>>>>>>     a scenario, they can save the jdbc result in the ResourcePool via a
>>>>>>     paragraph property and then read it via z.get. Here's one simple
>>>>>>     example (not implemented yet):
>>>>>>
>>>>>>         %jdbc(saveAsTable=people)
>>>>>>         select * from oracle_table
>>>>>>
>>>>>>         %python
>>>>>>         z.getTable("people").toPandas()
>>>>>>
>>>>>> 3. How to read shared data?
>>>>>>     Users also have 2 approaches to reading the shared data:
>>>>>>     a. Via ZeppelinContext (e.g. z.get, z.getTable)
>>>>>>     b. Via variable substitution [1]
>>>>>>
>>>>>> Here's one sample note which illustrates the scenario of sharing data.
>>>>>>
>>>>>> https://www.zepl.com/viewer/notebooks/bm90ZTovL3pqZmZkdS8zMzkxZjg3YmFhMjg0MDY3OGM1ZmYzODAwODAxMGJhNy9ub3RlLmpzb24
>>>>>>
>>>>>> This is just my current thinking on sharing data in Zeppelin. It
>>>>>> definitely doesn't cover all the scenarios, so I am raising this thread
>>>>>> to discuss it in the community. Any feedback and comments are welcome.
>>>>>>
>>>>>>
>>>>>> [1]. https://issues.apache.org/jira/browse/ZEPPELIN-3377
>>>>>> [2]. https://issues.apache.org/jira/browse/ZEPPELIN-3596
>>>>>> [3]. https://issues.apache.org/jira/browse/ZEPPELIN-3617
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> 이종열, Jongyoul Lee, 李宗烈
>>>>> http://madeng.net
>>>>>
>>>>
>>>
>>>
>>> --
>>> 이종열, Jongyoul Lee, 李宗烈
>>> http://madeng.net
>>>
>>
>>
>
>
> --
> 이종열, Jongyoul Lee, 李宗烈
> http://madeng.net
>
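
For anyone skimming the thread, the clone problem raised against
z.get(noteId, paragraphId) can be illustrated with a toy model. This is only
a sketch under stated assumptions: the note is a plain dict, `clone_note` and
`run_producer` are hypothetical helpers, and paragraph ids are assumed to
stay the same on clone (the thread only says the noteId changes).

```python
import copy
import uuid


def clone_note(note):
    """Toy clone: same paragraphs and text, but a new note id."""
    cloned = copy.deepcopy(note)
    cloned["note_id"] = uuid.uuid4().hex
    return cloned


def run_producer(note, pool):
    """Publish the producer paragraph's result under both key styles."""
    producer = note["producer"]
    # Id-keyed, as in z.get(noteId, paragraphId).
    pool[(note["note_id"], producer["paragraph_id"])] = producer["result"]
    # Name-keyed, as in the proposed saveAsTable / z.getTable(name).
    pool[producer["save_as"]] = producer["result"]


original = {
    "note_id": uuid.uuid4().hex,
    "producer": {
        "paragraph_id": "p1",
        "save_as": "people",
        "result": [("alice", 30)],
    },
}

# Consumer code written against the original note; a clone copies it verbatim.
hard_coded_key = (original["note_id"], "p1")
table_name = "people"

clone = clone_note(original)
pool = {}  # fresh pool in which only the clone has been run
run_producer(clone, pool)

found_by_id = hard_coded_key in pool    # the old note id no longer matches
found_by_name = table_name in pool      # the name is unaffected by cloning
```

In this model the id-based lookup misses after cloning while the name-based
one still resolves, which is the behaviour reason 2 in the thread describes.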
