@Jeff, Thanks for sharing your opinions and important questions.

> Q1. What does the resource registration mean? IIUC, currently it means it
would cache the data in Interpreter Process. Then it might be a memory
issue when more and more resources are registered. Maybe we could introduce
resource retention mechanism or cache the data in other formats (just like
the spark table cache policy, user can specify how to cache the data, like
memory, disk, etc.)

A1. It depends on the implementation of TableData for each interpreter.

For example, if the JDBC interpreter only keeps the SQL of a paragraph to
reproduce the table, we don't need to persist the whole table data in memory,
on the file system, or in an external storage. That's what section 3.2
describes.

[image: Inline image 2]
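To make A1 a bit more concrete, here is a minimal sketch of that idea in
Java. The `TableData` interface shown is simplified for illustration and the
class and method names are hypothetical, not the actual API from
ZEPPELIN-753: the JDBC interpreter registers only the connection URL and the
query, and re-runs the query when another interpreter actually asks for the
rows.

```java
// Hypothetical sketch: a TableData that persists only the SQL needed to
// reproduce the table, instead of the table rows themselves.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

interface TableData {
  // Materialize rows on demand; a simplified stand-in for the real interface.
  List<Object[]> rows() throws Exception;
}

class JdbcQueryTableData implements TableData {
  private final String jdbcUrl;  // e.g. "jdbc:postgresql://host/db"
  private final String sql;      // the paragraph's query: the only thing we keep

  JdbcQueryTableData(String jdbcUrl, String sql) {
    this.jdbcUrl = jdbcUrl;
    this.sql = sql;
  }

  @Override
  public List<Object[]> rows() throws SQLException {
    // Re-run the query only when the table data is actually requested.
    List<Object[]> result = new ArrayList<>();
    try (Connection conn = DriverManager.getConnection(jdbcUrl);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(sql)) {
      int columnCount = rs.getMetaData().getColumnCount();
      while (rs.next()) {
        Object[] row = new Object[columnCount];
        for (int i = 0; i < columnCount; i++) {
          row[i] = rs.getObject(i + 1);
        }
        result.add(row);
      }
    }
    return result;
  }
}
```

Whether the rows are cached after the first fetch, spilled to disk, or
re-queried every time would then be a per-interpreter decision, which is why
A1 says it depends on each implementation.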




> Q2. The scope of resource sharing. For now, it seems it is globally
shared. But I think user level sharing might be more common. Then we need
to create a namespace for each user. That means the same resource name
could exist in different user namespaces.

A2. Regarding the namespace concept, the proposal only describes what the
table resource name should be (section 5.3); it does not cover namespaces.

The namespace could be the name of a note, or something custom (e.g. a
per-user namespace). We can discuss this.

Personally, +1 for having namespaces, since they are helpful for searching
and sharing. This might be handled by `ResourceRegistry`, as sketched below.


[image: Inline image 1]
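Purely as an illustration of how a namespace could compose with the table
resource name from section 5.3 (the prefix layout and class below are
hypothetical, not part of the proposal), the registry could key resources by
a namespace prefix so the same table name can exist for different users or
notes:

```java
// Hypothetical sketch of namespaced resource names; the layout is illustrative.
final class NamespacedResourceName {
  private NamespacedResourceName() {}

  // e.g. "user/alice/note/2CBX123" or simply "note/2CBX123"
  static String userNamespace(String user, String noteId) {
    return "user/" + user + "/note/" + noteId;
  }

  // e.g. "user/alice/note/2CBX123/order_summary"
  static String of(String namespace, String tableName) {
    return namespace + "/" + tableName;
  }
}
```

A `ResourceRegistry` could then list or search within a prefix (e.g.
everything under `user/alice/`), which is what makes namespaces useful for
searching and sharing.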


> Q3. The data route might cause a performance issue.  From the diagram, if
the spark interpreter needs to access a resource from the jdbc interpreter,
the data first needs to be sent to the zeppelin server, and then the zeppelin
server sends the data to the spark interpreter. This kind of data route
introduces a bit more overhead to me. And the zeppelin server will become a
bottleneck and require large memory when there are many resources to be
shared across users/interpreters. So I would suggest the following approach:
Zeppelin Server just controls the metadata and ACL of resources, and the
Spark Interpreter fetches data from the Jdbc Interpreter directly instead of
going through the zeppelin server.  Here's the sequence:
       1). SparkInterpreter asks for the metadata and a token for the resource
       2). Zeppelin Server checks whether this SparkInterpreter has
permission to access this resource; if yes, it sends the metadata and
token to SparkInterpreter. The metadata includes the RPC address of the
JdbcInterpreter, and the token is for security.
       3). SparkInterpreter asks JdbcInterpreter for the resource via the
token and metadata received in step 2
       4). JdbcInterpreter verifies the token and sends the data to
SparkInterpreter.

A3. +1 for the Spark interpreter accessing JDBC directly, since it's better
for handling large data. But I'm not sure how other interpreters can do the
same thing (e.g. it's a trivial case, but consider the shell interpreter,
which keeps its table data in memory). A rough sketch of the flow is below.


------------------------------

Some people might wonder why we do not use an external storage to persist
(large) table resources instead of keeping them in the memory of
ZeppelinServer.

The authors originally discussed whether to use an external storage or not.
But having an external storage

- requires additional (potentially many) dependencies (Geode? Redis? HDFS?
Which one should we use, or should we support them all?)
- even with an external storage, we might not be able to persist 400GB or
10TB of data.

Thus, the proposal was written to

- utilize each interpreter's own storage (e.g. the Spark cluster for the
Spark interpreter)
- keep only the minimal information needed to reproduce the table result
(e.g. keeping just the query), without depending on an external storage at
first.


And now that we are discussing it, I hope we can improve the proposal and
turn it into a real implementation soon. :)



Thanks.




On Wed, Jun 14, 2017 at 12:20 PM, Jeff Zhang <zjf...@gmail.com> wrote:

>
> Hi Park,
>
> Thanks for the sharing, this is a very interested and innovated idea. I
> have several comments and concerns.
>
> 1. What does the resource registration mean ?
>    IIUC, currently it means it would cache the data in Interpreter
> Process. Then it might be a memory issue when more and more resources are
> registered. Maybe we could introduce resource retention mechanism or cache
> the data in other formats (just like the spark table cache policy, user can
> specify how to cache the data, like memory, disk and etc.)
>
> 2. The scope of resource sharing
>    For now, it seems it is globally shared. But I think user level sharing
> might be more common. Then we need to create a namespace for each user.
> That means the same resource name could exist in different user namespace.
>
> 3. The data route might cause performance issue.
>    From the diagram, If spark interpreter needs to access a resource from
> jdbc interpreter. Then first data needs to be send to zeppelin server, and
> then zeppelin server send the data to spark interpreter. This kind of data
> route introduce a bit more overhead to me. And zeppelin server will become
> a bottleneck and require large memory when there're many resources to be
> shared across users/interpreters. So I would suggest the following
> approach. Zeppelin Server just control the metadata and ACL of resources.
> And Spark Interpreter will fetch data from Jdbc Interpreter directly
> instead of through zeppelin server.  Here's the sequences
>        1). SparkInterpreter ask for metadata and token for the resource
>        2). Zeppelin Server will check whether this SparkInterprter has
> permission to access this resource, if yes, then send the metadata and
> token to SparkInterpreter. The metadata includes the RPC address of the
> JdbcInterpreter and token is for security.
>        3). SparkInterpreter ask JdbcInterpreter for the resource via the
> the token and metadata received in step 2
>        4). JdbcInterpreter verify the token, and send the data to
> SparkInterpreter.
> [image: image.png]
>
>
> Khalid Huseynov <kha...@apache.org>于2017年6月13日周二 上午11:53写道:
>
>> Thanks for the questions guys!
>>
>> @Jun Kim actually that feature was originally discussed and was put into
>> backlog since proposal was more about tables processed by interpreters and
>> their sharing. However having quick visualisation on the fly for not so
>> large data makes sense indeed, and possibly could be done by importing data
>> into some interpreter by default (Spark, python, etc). So I believe it can
>> be done once initial basics for resource sharing is completed.
>>
>> @Andrea Santurbano there should be listing of tables with schema info,
>> but i'm not sure exactly what you mean by drop-down feature between
>> tables in the UI. Could you give little more details/example on that as
>> well as  enhancements on graph part you meant?
>>
>>
>> On Mon, Jun 12, 2017 at 4:01 PM, Andrea Santurbano <sant...@gmail.com>
>> wrote:
>>
>>> Hi guys,
>>> this is great! I think this can also enable some drop-down feature
>>> between tables in the UI...
>>> Do you think this enhancements can also include the graph part?
>>>
>>> Andrea
>>>
>>> Il giorno lun 12 giu 2017 alle ore 05:47 Jun Kim <i2r....@gmail.com> ha
>>> scritto:
>>>
>>>> All of the enhancements looks great to me!
>>>>
>>>> And I wish a feature which can upload a small CSV file (maybe about
>>>> 20MB..?) and play with it directly.
>>>> It would be great if I can drag a file to Zeppelin and register it as
>>>> the table.
>>>>
>>>> Thanks :)
>>>>
>>>> 2017년 6월 12일 (월) 오전 11:40, Park Hoon <1am...@gmail.com>님이 작성:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Recently, ZEPPELIN-753
>>>>> <https://issues.apache.org/jira/browse/ZEPPELIN-753> (Tabledata
>>>>> abstraction) and ZEPPELIN-2020
>>>>> <https://issues.apache.org/jira/browse/ZEPPELIN-2020> (Remote method
>>>>> invocation for resources) were resolved.
>>>>> Based on this work, we can improve Zeppelin with the following
>>>>> enhancements:
>>>>>
>>>>> * register the table result as a shared resource
>>>>> * list all available (registered) tables
>>>>> * preview tables including its meta information (e.g columns, types,
>>>>> ..)
>>>>> * download registered tables as CSV, and other formats.
>>>>> * pivoting/filtering in backend to transforming larger data
>>>>> * cross join tables in different interpreters (e.g Spark interpreter
>>>>> uses a table result generated from JDBC interpreter)
>>>>>
>>>>> You can find the full proposal in Extending Table Data API
>>>>> <https://cwiki.apache.org/confluence/display/ZEPPELIN/Proposal%3A+Extending+TableData+API>
>>>>>  which
>>>>> is contributed by @1ambda, @khalidhuseynov, @Leemoonsoo.
>>>>>
>>>>> Any question, feedback or discussion will be welcomed.
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>> --
>>>> Taejun Kim
>>>>
>>>> Data Mining Lab.
>>>> School of Electrical and Computer Engineering
>>>> University of Seoul
>>>>
>>>
>>
