Sure, we can support plain text as well.

On Fri, Jul 13, 2018 at 10:37 AM, Jongyoul Lee <jongy...@gmail.com> wrote:
> Yes, it's similar to 2.b.
>
> Basically, my concern is handling all kinds of data. Your proposal looks
> focused on table data. That is useful too, but it would be better to handle
> all data, including both tables and plain text. WDYT?
>
> About storage, we can discuss it later.
>
> On Fri, Jul 13, 2018 at 11:25 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> I think your use case is the same as 2.b. Personally, I don't recommend
>> using z.get(noteId, paragraphId) to get the shared data, for 2 reasons:
>> 1. noteId and paragraphId are meaningless identifiers, so the code is
>>    not readable.
>> 2. The note will break if we clone it, because the noteId changes.
>> That's why I suggest using a paragraph property to save the paragraph's
>> result.
>>
>> Regarding the intermediate storage, I also thought about it and agree that
>> in the long term we should provide such a layer to support large data.
>> Currently we put the shared data in memory, which is not a scalable
>> solution. One candidate in my mind is Alluxio [1], and regarding the data
>> format I think Apache Arrow [2] is another good option for Zeppelin to
>> share table data across interpreter processes and different languages. But
>> these are all implementation details; I think we can talk about them in
>> another thread. In this thread, I think we should focus on the user-facing
>> API.
>>
>> [1] http://www.alluxio.org/
>> [2] https://arrow.apache.org/
>>
>> On Fri, Jul 13, 2018 at 10:11 AM, Jongyoul Lee <jongy...@gmail.com> wrote:
>>
>>> I have a slightly different idea for sharing data.
>>>
>>> In my case, it would be very useful to get a paragraph's result as an
>>> input to other paragraphs.
>>>
>>> e.g.
>>>
>>> -- Paragraph 1
>>> %jdbc
>>> select * from some_table;
>>>
>>> -- Paragraph 2
>>> %spark
>>> val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
>>> spark.read(table).select....
>>>
>>> If paragraph 1's result is too big to show in the frontend, it would be
>>> saved on the Zeppelin server in a suitable way and passed to
>>> SparkInterpreter when paragraph 2 is executed.
>>>
>>> Basically, I think we need intermediate storage to hold paragraphs'
>>> results so they can be shared. We can introduce another layer or extend
>>> NotebookRepo. In some cases, we might change notebook repos as well.
>>>
>>> JL
>>>
>>> On Fri, Jul 13, 2018 at 10:39 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>
>>>> Hi Folks,
>>>>
>>>> Recently, there have been several tickets [1][2][3] about sharing data
>>>> in Zeppelin. Zeppelin's goal is to be a unified data analysis platform
>>>> that integrates most of the big data tools and helps users switch
>>>> between tools and share data between them easily. So sharing data is a
>>>> critical, killer feature of Zeppelin, IMHO.
>>>>
>>>> I am raising this thread to discuss the scenarios for sharing data and
>>>> how to support them. Although Zeppelin already provides tools and APIs
>>>> to share data, I don't think they are mature and stable enough. After
>>>> seeing these tickets, I think it is a good time to talk about it in the
>>>> community and gather more feedback, so that we can provide a more stable
>>>> and mature approach.
>>>>
>>>> Currently, there are 3 approaches to sharing data between interpreters
>>>> and interpreter processes:
>>>> 1. Sharing data across interpreters in the same interpreter process,
>>>>    like sharing data via the same SparkContext in %spark, %spark.pyspark
>>>>    and %spark.r.
>>>> 2. Sharing data between frontend and backend via angularObject.
>>>> 3. Sharing data across interpreter processes via Zeppelin's ResourcePool.
>>>>
>>>> In this thread, I would like to talk about approach 3 (sharing data
>>>> via Zeppelin's ResourcePool).
>>>>
>>>> Here's my current thinking on sharing data.
>>>> 1. What kind of data would be shared?
>>>> IMHO, users would share 2 kinds of data: primitive data (string,
>>>> number) and table data.
>>>>
>>>> 2. How to write shared data?
>>>> Users may want to share data via 2 approaches:
>>>> a. Use ZeppelinContext (e.g. z.put).
>>>> b. Share the paragraph result via paragraph properties. E.g. a user may
>>>>    want to read data from an Oracle database via the jdbc interpreter
>>>>    and then plot it in the python interpreter. In that scenario, they
>>>>    can save the jdbc result in ResourcePool via a paragraph property and
>>>>    then read it via z.get. Here's one simple example (not implemented
>>>>    yet):
>>>>
>>>> %jdbc(saveAsTable=people)
>>>> select * from oracle_table
>>>>
>>>> %python
>>>> z.getTable("people").toPandas()
>>>>
>>>> 3. How to read shared data?
>>>> Users also have 2 approaches to read the shared data:
>>>> a. Via ZeppelinContext (e.g. z.get, z.getTable)
>>>> b. Via variable substitution [1]
>>>>
>>>> Here's one sample note which illustrates the scenario of sharing data.
>>>>
>>>> https://www.zepl.com/viewer/notebooks/bm90ZTovL3pqZmZkdS8zMzkxZjg3YmFhMjg0MDY3OGM1ZmYzODAwODAxMGJhNy9ub3RlLmpzb24
>>>>
>>>> This is just my current thinking on sharing data in Zeppelin. It
>>>> definitely doesn't cover all the scenarios, so I am raising this thread
>>>> to discuss it in the community. Any feedback and comments are welcome.
>>>>
>>>> [1]. https://issues.apache.org/jira/browse/ZEPPELIN-3377
>>>> [2]. https://issues.apache.org/jira/browse/ZEPPELIN-3596
>>>> [3]. https://issues.apache.org/jira/browse/ZEPPELIN-3617
>>>>
>>>
>>> --
>>> 이종열, Jongyoul Lee, 李宗烈
>>> http://madeng.net
>>>
>
> --
> 이종열, Jongyoul Lee, 李宗烈
> http://madeng.net
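
To make the name-based sharing discussed in this thread a bit more concrete, here is a minimal Python sketch. It is not Zeppelin code: the ResourcePool and Table classes and the put / get / get_table methods below are hypothetical stand-ins that only mimic the shape of z.put, z.get and z.getTable from the proposal. It assumes an in-memory pool holding both plain values and table data, looked up by a user-chosen name rather than by noteId and paragraphId.

```python
from typing import Any, Dict, List, Tuple


class Table:
    """Hypothetical tabular paragraph result: column names plus rows."""

    def __init__(self, columns: List[str], rows: List[Tuple[Any, ...]]) -> None:
        self.columns = columns
        self.rows = rows

    def column(self, name: str) -> List[Any]:
        # Return all values of one column, in row order.
        idx = self.columns.index(name)
        return [row[idx] for row in self.rows]


class ResourcePool:
    """Hypothetical in-memory pool keyed by a user-chosen name."""

    def __init__(self) -> None:
        self._items: Dict[str, Any] = {}

    def put(self, name: str, value: Any) -> None:
        # Store any shared value: a string, a number, or a Table.
        self._items[name] = value

    def get(self, name: str) -> Any:
        return self._items[name]

    def get_table(self, name: str) -> Table:
        # Same lookup as get(), but insists that the value is table data.
        value = self._items[name]
        if not isinstance(value, Table):
            raise TypeError(f"{name!r} is not table data")
        return value


pool = ResourcePool()

# The writing paragraph publishes its result under a readable name,
# in the spirit of %jdbc(saveAsTable=people).
pool.put("people", Table(["name", "age"], [("ann", 31), ("bob", 45)]))

# Plain (non-table) data goes through the same pool.
pool.put("greeting", "hello")

# The reading paragraph only needs the name, not a note or paragraph id.
people = pool.get_table("people")
ages = people.column("age")
```

Because the reader depends only on the name "people", nothing in this sketch refers to a note id, which is the property that reason 2 above (cloning changes the noteId) is asking for.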