BTW, we need to consider at design time the case where the result is large. In my experience, if we implement this feature, users could use it with large data.
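
To make the concern concrete, here is a minimal Python sketch of the kind of guard I have in mind. It is not Zeppelin code: SharedResultStore, max_in_memory_bytes, and the spill-to-temp-file behaviour are all hypothetical names and choices made up for illustration. A real implementation would sit behind ResourcePool or whatever storage layer we agree on.

```python
import os
import tempfile


class SharedResultStore:
    """Hypothetical store: small results stay in memory, large ones spill to disk."""

    def __init__(self, max_in_memory_bytes=1024 * 1024):
        # Threshold is an assumption for this sketch, not a Zeppelin setting.
        self.max_in_memory_bytes = max_in_memory_bytes
        self._memory = {}   # name -> bytes held in memory
        self._spilled = {}  # name -> path of the temp file holding the bytes

    def put(self, name, data: bytes):
        # Drop any earlier value stored under the same name.
        self.remove(name)
        if len(data) <= self.max_in_memory_bytes:
            self._memory[name] = data
        else:
            # Too large for memory: write it to a temp file instead.
            fd, path = tempfile.mkstemp(prefix="shared-result-")
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            self._spilled[name] = path

    def get(self, name) -> bytes:
        if name in self._memory:
            return self._memory[name]
        # Raises KeyError if the name was never stored.
        with open(self._spilled[name], "rb") as f:
            return f.read()

    def is_spilled(self, name) -> bool:
        return name in self._spilled

    def remove(self, name):
        self._memory.pop(name, None)
        path = self._spilled.pop(name, None)
        if path is not None:
            os.remove(path)


store = SharedResultStore(max_in_memory_bytes=16)
store.put("people", b"id,name\n1,Ann\n")  # 14 bytes, stays in memory
store.put("events", b"x" * 1000)          # over the threshold, spills to disk
```

The point is only that the reader side (get) does not need to know where the data lives, so the size limit can be decided on the server without changing the user-facing API.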

On Fri, Jul 13, 2018 at 11:51 AM, Sanjay Dasgupta <sanjay.dasgu...@gmail.com> wrote:

> I prefer 2.b also. Could we use (save*Result*AsTable=people) instead?
>
> There are a few typos in the example note shared:
>
> 1) The line val peopleDF = spark.read.format("zeppelin").load() should
> mention the table name (possibly as argument to load?)
> 2) The python line val peopleDF = z.getTable("people").toPandas() should
> not have the val
>
> The z.getTable(<table-name>) method could be a very good tool to judge
> which use-cases are important in the community. It is easy to implement for
> the in-memory data case, and could be very useful for many situations where
> a small amount of data is being transferred across interpreters (like the
> jdbc -> matplotlib case mentioned).
>
> Thanks,
> Sanjay
>
> On Fri, Jul 13, 2018 at 8:07 AM, Jongyoul Lee <jongy...@gmail.com> wrote:
>
>> Yes, it's similar to 2.b.
>>
>> Basically, my concern is to handle all kinds of data. But in your case,
>> it looks like the focus is on table data. It's also useful, but it would be
>> better to handle all of the data, including table or plain text as well.
>> WDYT?
>>
>> About storage, we could discuss it later.
>>
>> On Fri, Jul 13, 2018 at 11:25 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> I think your use case is the same as 2.b. Personally I don't recommend
>>> using z.get(noteId, paragraphId) to get the shared data, for 2 reasons:
>>> 1. noteId and paragraphId are meaningless, so the code is not readable.
>>> 2. The note will break if we clone it, as the noteId changes.
>>> That's why I suggest using a paragraph property to save the paragraph's
>>> result.
>>>
>>> Regarding the intermediate storage, I also thought about it and agree
>>> that in the long term we should provide such a layer to support large
>>> data; currently we put the shared data in memory, which is not a scalable
>>> solution. One candidate in my mind is alluxio [1], and regarding the data
>>> format I think apache arrow [2] is another good option for zeppelin to
>>> share table data across interpreter processes and different languages. But
>>> these are all implementation details, I think we can talk about them in
>>> another thread. In this thread, I think we should focus on the user facing
>>> api.
>>>
>>> [1] http://www.alluxio.org/
>>> [2] https://arrow.apache.org/
>>>
>>> Jongyoul Lee <jongy...@gmail.com> wrote on Fri, Jul 13, 2018 at 10:11 AM:
>>>
>>>> I have a slightly different idea for sharing data.
>>>>
>>>> In my case,
>>>>
>>>> It would be very useful to get a paragraph's result as an input of
>>>> other paragraphs.
>>>>
>>>> e.g.
>>>>
>>>> -- Paragraph 1
>>>> %jdbc
>>>> select * from some_table;
>>>>
>>>> -- Paragraph 2
>>>> %spark
>>>> val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
>>>> spark.read(table).select....
>>>>
>>>> If paragraph 1's result is too big to show on the FE, it would be saved
>>>> in Zeppelin Server in a proper way and passed to SparkInterpreter when
>>>> Paragraph 2 is executed.
>>>>
>>>> Basically, I think we need an intermediate storage to store paragraphs'
>>>> results in order to share them. We can introduce another layer or extend
>>>> NotebookRepo. In some cases, we might change notebook repos as well.
>>>>
>>>> JL
>>>>
>>>> On Fri, Jul 13, 2018 at 10:39 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>
>>>>> Hi Folks,
>>>>>
>>>>> Recently, there are several tickets [1][2][3] about sharing data in
>>>>> zeppelin. Zeppelin's goal is to be a unified data analysis platform
>>>>> which could integrate most of the big data tools and help users to
>>>>> switch between tools and share data between tools easily. So sharing
>>>>> data is a very critical and killer feature of Zeppelin IMHO.
>>>>>
>>>>> I raise this thread to discuss the scenarios of sharing data and how
>>>>> to do that. Although zeppelin already provides tools and api to share
>>>>> data, I don't think it is mature and stable enough. After seeing these
>>>>> tickets, I think it might be a good time to talk about it in the
>>>>> community and gather more feedback, so that we could provide a more
>>>>> stable and mature approach for it.
>>>>>
>>>>> Currently, there are 3 approaches to share data between interpreters
>>>>> and interpreter processes.
>>>>> 1. Sharing data across interpreters in the same interpreter process.
>>>>> Like sharing data via the same SparkContext in %spark, %spark.pyspark
>>>>> and %spark.r.
>>>>> 2. Sharing data between frontend and backend via angularObject
>>>>> 3. Sharing data across interpreter processes via Zeppelin's
>>>>> ResourcePool
>>>>>
>>>>> For this thread, I would like to talk about approach 3 (Sharing data
>>>>> via Zeppelin's ResourcePool)
>>>>>
>>>>> Here's my current thinking of sharing data.
>>>>> 1. What kind of data would be shared ?
>>>>> IMHO, users would share 2 kinds of data: primitive data (string,
>>>>> number) and table data.
>>>>>
>>>>> 2. How to write shared data ?
>>>>> User may want to share data via 2 approaches
>>>>> a. Use ZeppelinContext (e.g. z.put).
>>>>> b. Share the paragraph result via paragraph properties. e.g. user may
>>>>> want to read data from oracle database via jdbc interpreter and then do
>>>>> plotting in python interpreter. In such a scenario, he can save the jdbc
>>>>> result in ResourcePool via paragraph property and then read it via
>>>>> z.get. Here's one simple example (Not implemented yet)
>>>>>
>>>>> %jdbc(saveAsTable=people)
>>>>> select * from oracle_table
>>>>>
>>>>> %python
>>>>> z.getTable("people").toPandas()
>>>>>
>>>>> 3. How to read shared data ?
>>>>> User can also have 2 approaches to read the shared data.
>>>>> a. Via ZeppelinContext. (e.g. z.get, z.getTable)
>>>>> b. Via variable substitution [1]
>>>>>
>>>>> Here's one sample note which illustrates the scenario of sharing data.
>>>>> https://www.zepl.com/viewer/notebooks/bm90ZTovL3pqZmZkdS8zMzkxZjg3YmFhMjg0MDY3OGM1ZmYzODAwODAxMGJhNy9ub3RlLmpzb24
>>>>>
>>>>> This is just my current thinking of sharing data in zeppelin, it
>>>>> definitely doesn't cover all the scenarios, so I raise this thread to
>>>>> discuss it in the community; any feedback and comments are welcome.
>>>>>
>>>>> [1]. https://issues.apache.org/jira/browse/ZEPPELIN-3377
>>>>> [2]. https://issues.apache.org/jira/browse/ZEPPELIN-3596
>>>>> [3]. https://issues.apache.org/jira/browse/ZEPPELIN-3617
>>>>>
>>>>
>>>> --
>>>> 이종열, Jongyoul Lee, 李宗烈
>>>> http://madeng.net
>>>>
>>>
>>
>> --
>> 이종열, Jongyoul Lee, 李宗烈
>> http://madeng.net
>>
>

--
이종열, Jongyoul Lee, 李宗烈
http://madeng.net