Re: [DISCUSS] Share Data in Zeppelin

Jongyoul Lee Thu, 12 Jul 2018 19:55:25 -0700

BTW, we need to consider the case where the result is large at design
time. In my experience, if we implement this feature, users will use it
with large data.
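
To make the large-result concern concrete, here is a minimal sketch of one possible policy: keep small results in memory and spill larger ones to a temporary file. Everything in it is an assumption for illustration only (the SharedResultStore class, its method names and the byte threshold are hypothetical and are not Zeppelin APIs); a real design might use a different storage layer entirely.

```python
import os
import tempfile

# Hypothetical threshold; a real deployment would make this configurable.
MAX_IN_MEMORY_BYTES = 1024


class SharedResultStore:
    """Toy store for paragraph results (an illustration, not part of Zeppelin)."""

    def __init__(self, max_in_memory_bytes=MAX_IN_MEMORY_BYTES):
        self.max_in_memory_bytes = max_in_memory_bytes
        self._memory = {}   # name -> result text kept in memory
        self._spilled = {}  # name -> path of a temp file holding the result

    def put(self, name, text):
        # Drop any earlier copy so a name never lives in both places.
        self.remove(name)
        data = text.encode("utf-8")
        if len(data) <= self.max_in_memory_bytes:
            self._memory[name] = text
        else:
            # Large result: write it to a temp file instead of holding it in RAM.
            fd, path = tempfile.mkstemp(prefix="shared-result-")
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            self._spilled[name] = path

    def get(self, name):
        if name in self._memory:
            return self._memory[name]
        # Raises KeyError if the name was never stored.
        with open(self._spilled[name], "rb") as f:
            return f.read().decode("utf-8")

    def is_spilled(self, name):
        return name in self._spilled

    def remove(self, name):
        self._memory.pop(name, None)
        path = self._spilled.pop(name, None)
        if path is not None:
            os.remove(path)
```

The reader of a shared result does not need to know where it lives; only the store decides, based on size.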


On Fri, Jul 13, 2018 at 11:51 AM, Sanjay Dasgupta <sanjay.dasgu...@gmail.com> wrote:

> I prefer 2.b also. Could we use (save*Result*AsTable=people) instead?
>
> There are a few typos in the example note shared:
>
> 1) The line val peopleDF = spark.read.format("zeppelin").load() should
> mention the table name (possibly as argument to load?)
> 2) The Python line val peopleDF = z.getTable("people").toPandas() should
> not have the val
>
>
> The z.getTable(<table-name>) method could be a very good tool to judge
> which use-cases are important in the community. It is easy to implement for
> the in-memory data case, and could be very useful for many situations where
> a small amount of data is being transferred across interpreters (like the
> jdbc -> matplotlib case mentioned).
>
> Thanks,
> Sanjay
>
> On Fri, Jul 13, 2018 at 8:07 AM, Jongyoul Lee <jongy...@gmail.com> wrote:
>
>> Yes, it's similar to 2.b.
>>
>> Basically, my concern is handling all kinds of data. Your case seems to
>> focus on table data. That is useful too, but it would be better to handle
>> all data, including both tables and plain text.
>> WDYT?
>>
>> About storage, we could discuss it later.
>>
>> On Fri, Jul 13, 2018 at 11:25 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>>
>>> I think your use case is the same as 2.b.  Personally I don't recommend
>>> using z.get(noteId, paragraphId) to get the shared data, for 2 reasons:
>>> 1. noteId and paragraphId are opaque identifiers, so the code is not
>>> readable.
>>> 2. The note will break if we clone it, because the noteId changes.
>>> That's why I suggest using a paragraph property to save the paragraph's
>>> result.
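
The second reason can be illustrated with a toy model. This is only a sketch under stated assumptions: the Note and Server classes are hypothetical stand-ins rather than Zeppelin code, a clone is assumed to copy results and receive a fresh note id, and named results are assumed to be looked up within the note that is currently running.

```python
import copy
import uuid


class Note:
    """Toy note: an id plus per-paragraph results and named results."""

    def __init__(self):
        self.note_id = uuid.uuid4().hex
        self.results_by_paragraph = {}  # paragraph_id -> result
        self.results_by_name = {}       # user-chosen name -> result


class Server:
    """Toy server holding notes by id (hypothetical, not Zeppelin code)."""

    def __init__(self):
        self.notes = {}

    def add(self, note):
        self.notes[note.note_id] = note
        return note

    def clone(self, note):
        # Assumption: a clone copies the contents but gets a new note id.
        new_note = Note()
        new_note.results_by_paragraph = copy.deepcopy(note.results_by_paragraph)
        new_note.results_by_name = copy.deepcopy(note.results_by_name)
        return self.add(new_note)

    def remove(self, note):
        del self.notes[note.note_id]

    def get_by_ids(self, note_id, paragraph_id):
        # Id-based lookup: the caller must hard-code both identifiers.
        return self.notes[note_id].results_by_paragraph[paragraph_id]

    def get_by_name(self, current_note, name):
        # Name-based lookup, resolved against the note being executed.
        return current_note.results_by_name[name]
```

In this model, code in a cloned note that still carries the original note id fails with a KeyError once the original is removed, while a lookup by name keeps working in the clone.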
>>>
>>> Regarding the intermediate storage, I also thought about it and agree
>>> that in the long term we should provide such a layer to support large
>>> data. Currently we put the shared data in memory, which is not a scalable
>>> solution.  One candidate in my mind is Alluxio [1], and regarding the data
>>> format I think Apache Arrow [2] is another good option for Zeppelin to
>>> share table data across interpreter processes and different languages. But
>>> these are all implementation details, and I think we can talk about them
>>> in another thread. In this thread, I think we should focus on the
>>> user-facing API.
>>>
>>>
>>> [1] http://www.alluxio.org/
>>> [2] https://arrow.apache.org/
>>>
>>>
>>>
>>> On Fri, Jul 13, 2018 at 10:11 AM, Jongyoul Lee <jongy...@gmail.com> wrote:
>>>
>>>> I have a slightly different idea for sharing data.
>>>>
>>>> In my case,
>>>>
>>>> It would be very useful to get a paragraph's result as an input of
>>>> other paragraphs.
>>>>
>>>> e.g.
>>>>
>>>> -- Paragraph 1
>>>> %jdbc
>>>> select * from some_table;
>>>>
>>>> -- Paragraph 2
>>>> %spark
>>>> val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
>>>> spark.read(table).select....
>>>>
>>>> If paragraph 1's result is too big to show on the FE, it would be saved
>>>> in Zeppelin Server in a suitable way and passed to SparkInterpreter when
>>>> Paragraph 2 is executed.
>>>>
>>>> Basically, I think we need intermediate storage to hold paragraphs'
>>>> results so that they can be shared. We can introduce another layer or
>>>> extend NotebookRepo. In some cases, we might change notebook repos as well.
>>>>
>>>> JL
>>>>
>>>>
>>>>
>>>> On Fri, Jul 13, 2018 at 10:39 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>
>>>>> Hi Folks,
>>>>>
>>>>> Recently, there have been several tickets [1][2][3] about sharing data
>>>>> in Zeppelin. Zeppelin's goal is to be a unified data analytics platform
>>>>> that integrates most of the big data tools and helps users switch
>>>>> between tools and share data between them easily. So sharing data is a
>>>>> critical, killer feature of Zeppelin IMHO.
>>>>>
>>>>> I raise this thread to discuss the scenarios for sharing data and how to
>>>>> support them. Although Zeppelin already provides tools and APIs to share
>>>>> data, I don't think they are mature and stable enough. After seeing
>>>>> these tickets, I think it is a good time to talk about this in the
>>>>> community and gather more feedback, so that we can provide a more stable
>>>>> and mature approach.
>>>>>
>>>>> Currently, there are 3 approaches to sharing data between interpreters
>>>>> and interpreter processes:
>>>>> 1. Sharing data across interpreters in the same interpreter process,
>>>>> e.g. via the same SparkContext in %spark, %spark.pyspark and %spark.r.
>>>>> 2. Sharing data between frontend and backend via angularObject.
>>>>> 3. Sharing data across interpreter processes via Zeppelin's ResourcePool.
>>>>>
>>>>> In this thread, I would like to talk about approach 3 (sharing data via
>>>>> Zeppelin's ResourcePool).
>>>>>
>>>>> Here's my current thinking on sharing data.
>>>>> 1. What kind of data would be shared?
>>>>>    IMHO, users would share 2 kinds of data: primitive data (string,
>>>>> number) and table data.
>>>>>
>>>>> 2. How to write shared data?
>>>>>     Users may want to share data via 2 approaches:
>>>>>     a. Use ZeppelinContext (e.g. z.put).
>>>>>     b. Share the paragraph result via paragraph properties. E.g. a user
>>>>> may want to read data from an Oracle database via the jdbc interpreter
>>>>> and then do plotting in the python interpreter. In such a scenario, they
>>>>> can save the jdbc result in ResourcePool via a paragraph property and
>>>>> then read it via z.get. Here's one simple example (not implemented yet):
>>>>>
>>>>>         %jdbc(saveAsTable=people)
>>>>>          select * from oracle_table
>>>>>
>>>>>          %python
>>>>>          z.getTable("people").toPandas()
>>>>>
>>>>> 3. How to read shared data?
>>>>>     Users also have 2 ways to read the shared data:
>>>>>     a. Via ZeppelinContext. (e.g.  z.get, z.getTable)
>>>>>     b. Via variable substitution [1]
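
The write and read paths in points 2 and 3 can be modelled with a small in-memory stand-in. This is only a sketch: ToyResourcePool, save_as_table and get_table are hypothetical names chosen to mirror the proposal, not Zeppelin's actual classes, and the paragraph result is assumed to be tab-separated text with a header row.

```python
import csv
import io


class ToyResourcePool:
    """In-memory stand-in for a shared pool (hypothetical, not Zeppelin's class)."""

    def __init__(self):
        self._resources = {}

    def put(self, name, value):
        # Path 2.a: explicitly share a primitive value under a name.
        self._resources[name] = value

    def get(self, name):
        return self._resources[name]

    def save_as_table(self, name, tsv_text):
        # Path 2.b: store a paragraph's table result under a readable name.
        # Assumes tab-separated text whose first line is the header.
        reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
        self._resources[name] = list(reader)

    def get_table(self, name):
        # Returns a list of dicts, one per row; values stay as strings.
        return self._resources[name]
```

For example, after pool.save_as_table("people", "name\tage\nalice\t31\n"), a later paragraph in another interpreter could call pool.get_table("people") and convert the rows into whatever structure its language prefers.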
>>>>>
>>>>> Here's one sample note which illustrates the scenario of sharing data.
>>>>> https://www.zepl.com/viewer/notebooks/bm90ZTovL3pqZmZkdS8zMzkxZjg3YmFhMjg0MDY3OGM1ZmYzODAwODAxMGJhNy9ub3RlLmpzb24
>>>>>
>>>>> This is just my current thinking on sharing data in Zeppelin. It
>>>>> definitely doesn't cover all the scenarios, so I raise this thread to
>>>>> discuss it in the community. Any feedback and comments are welcome.
>>>>>
>>>>>
>>>>> [1]. https://issues.apache.org/jira/browse/ZEPPELIN-3377
>>>>> [2]. https://issues.apache.org/jira/browse/ZEPPELIN-3596
>>>>> [3]. https://issues.apache.org/jira/browse/ZEPPELIN-3617
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> 이종열, Jongyoul Lee, 李宗烈
>>>> http://madeng.net
>>>>
>>>
>>
>>
>> --
>> 이종열, Jongyoul Lee, 李宗烈
>> http://madeng.net
>>
>
>


-- 
이종열, Jongyoul Lee, 李宗烈
http://madeng.net
