Thanks Sanjay, I have fixed the example note. *Folks, please note:* the example note is only a mock-up; it won't work for now.
Jongyoul Lee <jongy...@gmail.com> wrote on Fri, Jul 13, 2018 at 10:54 AM:

> BTW, we need to consider the case where the result is large at design
> time. In my experience, if we implement this feature, users will use it
> with large data.
>
> On Fri, Jul 13, 2018 at 11:51 AM, Sanjay Dasgupta <
> sanjay.dasgu...@gmail.com> wrote:
>
>> I prefer 2.b also. Could we use (save*Result*AsTable=people) instead?
>>
>> There are a few typos in the example note shared:
>>
>> 1) The line val peopleDF = spark.read.format("zeppelin").load() should
>> mention the table name (possibly as argument to load?)
>> 2) The python line val peopleDF = z.getTable("people").toPandas() should
>> not have the val
>>
>>
>> The z.getTable(<table-name>) method could be a very good tool to judge
>> which use-cases are important in the community. It is easy to implement for
>> the in-memory data case, and could be very useful for many situations where
>> a small amount of data is being transferred across interpreters (like the
>> jdbc -> matplotlib case mentioned).
>>
>> Thanks,
>> Sanjay
>>
>> On Fri, Jul 13, 2018 at 8:07 AM, Jongyoul Lee <jongy...@gmail.com> wrote:
>>
>>> Yes, it's similar to 2.b.
>>>
>>> Basically, my concern is to handle all kinds of data. But your case
>>> looks like it focuses on table data. That's also useful, but it would be
>>> better to handle all of the data, including table or plain text as well.
>>> WDYT?
>>>
>>> About storage, we could discuss it later.
>>>
>>> On Fri, Jul 13, 2018 at 11:25 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>
>>>>
>>>> I think your use case is the same as 2.b. Personally I don't recommend
>>>> using z.get(noteId, paragraphId) to get the shared data, for 2 reasons:
>>>> 1. noteId and paragraphId are meaningless, so they are not readable
>>>> 2. The note will break if we clone it, as the noteId is changed.
>>>> That's why I suggest using a paragraph property to save the paragraph's
>>>> result.
>>>>
>>>> Regarding the intermediate storage, I also thought about it and agree
>>>> that in the long term we should provide such a layer to support large data;
>>>> currently we put the shared data in memory, which is not a scalable
>>>> solution. One candidate in my mind is alluxio [1], and regarding the data
>>>> format I think apache arrow [2] is another good option for zeppelin to
>>>> share table data across interpreter processes and different languages. But
>>>> these are all implementation details; I think we can talk about them in
>>>> another thread. In this thread, I think we should focus on the user-facing
>>>> api.
>>>>
>>>>
>>>> [1] http://www.alluxio.org/
>>>> [2] https://arrow.apache.org/
>>>>
>>>>
>>>>
>>>> Jongyoul Lee <jongy...@gmail.com> wrote on Fri, Jul 13, 2018 at 10:11 AM:
>>>>
>>>>> I have a slightly different idea for sharing data.
>>>>>
>>>>> In my case,
>>>>>
>>>>> It would be very useful to get a paragraph's result as an input of
>>>>> other paragraphs.
>>>>>
>>>>> e.g.
>>>>>
>>>>> -- Paragraph 1
>>>>> %jdbc
>>>>> select * from some_table;
>>>>>
>>>>> -- Paragraph 2
>>>>> %spark
>>>>> val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
>>>>> spark.read(table).select....
>>>>>
>>>>> If paragraph 1's result is too big to show on the FE, it would be saved in
>>>>> Zeppelin Server in a proper way and passed to SparkInterpreter when
>>>>> Paragraph
>>>>> 2 is executed.
>>>>>
>>>>> Basically, I think we need intermediate storage to store
>>>>> paragraphs' results in order to share them. We can introduce another layer or
>>>>> extend
>>>>> NotebookRepo. In some cases, we might change notebook repos as well.
>>>>>
>>>>> JL
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jul 13, 2018 at 10:39 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>
>>>>>> Hi Folks,
>>>>>>
>>>>>> Recently, there have been several tickets [1][2][3] about sharing data in
>>>>>> zeppelin.
>>>>>> Zeppelin's goal is to be a unified data analysis platform which could
>>>>>> integrate most of the big data tools and help users to switch between
>>>>>> tools
>>>>>> and share data between tools easily. So sharing data is a very
>>>>>> critical and
>>>>>> killer feature of Zeppelin IMHO.
>>>>>>
>>>>>> I raise this thread to discuss the scenarios of sharing data and
>>>>>> how
>>>>>> to do that. Although zeppelin already provides tools and an api to share
>>>>>> data,
>>>>>> I don't think it is mature and stable enough. After seeing these
>>>>>> tickets, I
>>>>>> think it might be a good time to talk about it in the community and
>>>>>> gather more
>>>>>> feedback, so that we could provide a more stable and mature approach
>>>>>> for
>>>>>> it.
>>>>>>
>>>>>> Currently, there are 3 approaches to share data between interpreters
>>>>>> and
>>>>>> interpreter processes.
>>>>>> 1. Sharing data across interpreters in the same interpreter process.
>>>>>> Like
>>>>>> sharing data via the same SparkContext in %spark, %spark.pyspark and
>>>>>> %spark.r.
>>>>>> 2. Sharing data between frontend and backend via angularObject
>>>>>> 3. Sharing data across interpreter processes via Zeppelin's
>>>>>> ResourcePool
>>>>>>
>>>>>> For this thread, I would like to talk about approach 3 (Sharing
>>>>>> data
>>>>>> via Zeppelin's ResourcePool)
>>>>>>
>>>>>> Here's my current thinking on sharing data.
>>>>>> 1. What kind of data would be shared ?
>>>>>> IMHO, users would share 2 kinds of data: primitive data (string,
>>>>>> number)
>>>>>> and table data.
>>>>>>
>>>>>> 2. How to write shared data ?
>>>>>> Users may want to share data via 2 approaches
>>>>>> a. Use ZeppelinContext (e.g. z.put).
>>>>>> b. Share the paragraph result via paragraph properties. e.g. a user
>>>>>> may
>>>>>> want to read data from an oracle database via the jdbc interpreter and then
>>>>>> do
>>>>>> plotting in the python interpreter. In such a scenario,
>>>>>> he can save the jdbc
>>>>>> result in ResourcePool via a paragraph property and then read it via
>>>>>> z.get. Here's one simple example (Not implemented yet)
>>>>>>
>>>>>> %jdbc(saveAsTable=people)
>>>>>> select * from oracle_table
>>>>>>
>>>>>> %python
>>>>>> z.getTable("people").toPandas()
>>>>>>
>>>>>> 3. How to read shared data ?
>>>>>> Users also have 2 approaches to read the shared data.
>>>>>> a. Via ZeppelinContext. (e.g. z.get, z.getTable)
>>>>>> b. Via variable substitution [1]
>>>>>>
>>>>>> Here's one sample note which illustrates the scenario of sharing data.
>>>>>>
>>>>>> https://www.zepl.com/viewer/notebooks/bm90ZTovL3pqZmZkdS8zMzkxZjg3YmFhMjg0MDY3OGM1ZmYzODAwODAxMGJhNy9ub3RlLmpzb24
>>>>>>
>>>>>> This is just my current thinking on sharing data in zeppelin; it
>>>>>> definitely
>>>>>> doesn't cover all the scenarios, so I raise this thread to discuss
>>>>>> it in the
>>>>>> community. Any feedback and comments are welcome.
>>>>>>
>>>>>>
>>>>>> [1]. https://issues.apache.org/jira/browse/ZEPPELIN-3377
>>>>>> [2]. https://issues.apache.org/jira/browse/ZEPPELIN-3596
>>>>>> [3]. https://issues.apache.org/jira/browse/ZEPPELIN-3617
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> 이종열, Jongyoul Lee, 李宗烈
>>>>> http://madeng.net
>>>>>
>>>>
>>>
>>>
>>> --
>>> 이종열, Jongyoul Lee, 李宗烈
>>> http://madeng.net
>>>
>>
>>
>
>
> --
> 이종열, Jongyoul Lee, 李宗烈
> http://madeng.net
>
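
For readers following the thread, here is a rough sketch of the name-keyed sharing proposed in 2.b, assuming the in-memory storage the thread describes. Every name in it (InMemoryResourcePool, save_as_table, get_table, to_records) is hypothetical and is not Zeppelin's API; the thread itself says saveAsTable and z.getTable were not implemented yet. It only illustrates why a readable table name is easier to work with than a (noteId, paragraphId) pair: the lookup key does not change when a note is cloned.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence, Tuple


@dataclass
class Table:
    """Hypothetical container for a shared paragraph result."""
    columns: List[str]
    rows: List[Tuple[Any, ...]]

    def to_records(self) -> List[Dict[str, Any]]:
        # One dict per row, keyed by column name.
        return [dict(zip(self.columns, row)) for row in self.rows]


class InMemoryResourcePool:
    """Hypothetical stand-in for a pool shared across interpreters.

    Tables are keyed by a user-chosen name, not by noteId/paragraphId,
    so the lookup still works after a note is cloned.
    """

    def __init__(self) -> None:
        self._tables: Dict[str, Table] = {}

    def save_as_table(self, name: str, columns: Sequence[str],
                      rows: Sequence[Sequence[Any]]) -> None:
        # Mimics what %jdbc(saveAsTable=people) might do with its result.
        self._tables[name] = Table(list(columns), [tuple(r) for r in rows])

    def get_table(self, name: str) -> Table:
        # Mimics the proposed z.getTable("people").
        if name not in self._tables:
            raise KeyError(f"no shared table named {name!r}")
        return self._tables[name]


# "Paragraph 1": a jdbc-like paragraph saves its result under a name.
pool = InMemoryResourcePool()
pool.save_as_table("people", ["name", "age"], [("alice", 30), ("bob", 25)])

# "Paragraph 2": another interpreter reads it back by the same name.
records = pool.get_table("people").to_records()
```

Because everything is held in memory, this sketch has the same scalability limit raised in the thread for large results.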