zkytech opened a new pull request, #4423:
URL: https://github.com/apache/zeppelin/pull/4423

   ### What is this PR for?
   Add **cross datasource query** support to the Spark SQL interpreter. Cross datasource queries are currently supported for these datasources:
   
   - hive
   - jdbc
   - mongodb
   
   #### How to Use
   1. Declare a cross datasource table in the format `interpreterName.databaseName.tableName` (see the example after this list).
   2. `interpreterName` must exist in the Zeppelin interpreter configuration.
   3. For a JDBC datasource, the JDBC driver jars must be included in the interpreter dependencies.
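
   For example, a cross datasource join might look like the sketch below; the jdbc interpreter name `mysql_prod`, the hive database `hive_db`, and all table/column names are hypothetical:

   ```sql
   %sql
   -- join a regular hive table with a table exposed through a configured
   -- jdbc interpreter named mysql_prod (all names here are illustrative)
   select h.user_id, h.event_time, u.user_name
   from hive_db.events h
   join mysql_prod.crm.users u
     on h.user_id = u.id;
   ```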
   
   ### What type of PR is it?
   Feature
   
   
   ### Need Help
   #### 1. Is there a better way to load all interpreter settings?
   
   This is currently implemented by reading the interpreter settings list inside the zengine module and passing the list to the Spark SQL interpreter. As a result, this pull request touches two modules:
   
   1. `zeppelin-zengine`: read all interpreter settings and pass them to the Spark SQL interpreter
   2. `spark-interpreter`: add Spark SQL cross datasource query support
   
   
   When Spark is launched in `local` or `yarn-client` mode, it is easy to load the interpreter settings list inside `spark-interpreter`, so no change to `zeppelin-zengine` is needed. But when the Spark interpreter is launched in `yarn-cluster` mode, `interpreter.json` does not exist on the yarn-cluster driver node, so the interpreter settings cannot be read there. That is why I changed `zeppelin-zengine` to read all interpreter settings and pass them to the Spark SQL interpreter, which works in `yarn-cluster` mode.
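
   A minimal sketch of the idea, assuming the settings are serialized on the Zeppelin server side and shipped to the driver as a JSON-valued property; the class, method, and property names below are illustrative, not the actual PR code:

   ```java
   import com.google.gson.Gson;

   import java.util.List;
   import java.util.Properties;

   // Hypothetical helper running on the Zeppelin server, where interpreter.json
   // is readable: serialize all interpreter settings and attach them to the
   // Spark interpreter properties, so the yarn-cluster driver can deserialize
   // them later instead of looking for interpreter.json on its local disk.
   public class InterpreterSettingForwarder {
     private static final Gson GSON = new Gson();

     public static void attachSettings(Properties sparkProperties,
                                       List<?> allInterpreterSettings) {
       // The property name is illustrative only.
       sparkProperties.setProperty("zeppelin.spark.sql.interpreterSettings",
           GSON.toJson(allInterpreterSettings));
     }
   }
   ```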
   
   I don't think it is good to change zengine. Is there a better way to get all interpreter settings in `yarn-cluster` mode without changing `zengine`?
   
   #### 2. How to distinguish between users and roles in the `option.owners` field of an interpreter setting?
   
   I cannot distinguish users from roles inside the `option.owners` field of an interpreter setting, so the datasource authorization check is implemented with this code:
   ```java
   HashSet<String> usersAndRoles = new HashSet<>(authenticationInfo.getUsersAndRoles());
   HashSet<String> owners = new HashSet<>(iSetting.option.owners);
   // if owners is empty, it means all users can access
   if (!owners.isEmpty()) {
     // keep only the owners that match the current user's name or roles
     owners.retainAll(usersAndRoles);
     if (owners.isEmpty()) {
       // neither the user nor any of the user's roles is an owner
       throw new InvalidCredentialsException(String.format(
           "user %s has no privilege to access interpreter %s",
           authenticationInfo.getUser(), interpreterId));
     }
   }
   ```
   
   If there are any security concerns, how can I make this authorization check better?
   
   ### What is the Jira issue?
   [ZEPPELIN-5781](https://issues.apache.org/jira/browse/ZEPPELIN-5781)
   
   ### How should this be tested?
   1. Make sure the Spark SQL interpreter (`%sql`) works.
   2. Configure a JDBC / MongoDB interpreter with the name `interpreter-nameX`.
   3. Test a JDBC/MongoDB query in `%sql`:
   ```sql
   %sql
   select * from interpreter-nameX.databaseName.tableName;
   ```
   
   ### Screenshots (if appropriate)
   
   
![image](https://user-images.githubusercontent.com/30063898/180590511-231d7a69-1be4-4157-9cc9-13dfc04d4654.png)
   
   
   ### Questions:
   * Do the license files need to be updated? no
   * Are there breaking changes for older versions? yes
   * Does this need documentation? yes
   

