[GitHub] zeppelin issue #1339: [WIP][ZEPPELIN-1332] Remove spark-dependencies & sugge...

AhyoungRyu Sat, 20 Aug 2016 02:09:32 -0700

Github user AhyoungRyu commented on the issue:

    https://github.com/apache/zeppelin/pull/1339
  
    @bzz Thank you for such a precise comment! Let me go through your feedback 
one by one (just to make it clear) :)
    
    1.
    >/.spark-dist/ is under cache on TravisCI which is S3 bucket that gets 
synced automatically with the content of this folder while running a build. 
    
    Right, that's my bad. I'll move it to a different directory. How about 
`ZEPPELIN_HOME/interpreter/spark/`, as before? 
    
    2, 3, 4.
    >what is the benefit and what problem does this change solves?
    
    I tried to describe the current problem and the advantage of this change 
in the Jira issue and the PR description, but I guess I didn't do it well. I 
should have explained more clearly. Let me explain it here with actual 
numbers. (I'll update the Jira issue and the PR description as well.)
    
     - **What was the problem?**
    
    As you said above, yes: the main problem is the Zeppelin binary package 
size. For the latest release, the binary package sizes were
    ```
    zeppelin-0.6.1-bin-all.tgz: 517MB
    zeppelin-0.6.1-bin-netinst.tgz: 236MB
    ```
    Didn't we have to ask the ASF infra team(?) at every release because of 
Zeppelin's huge package size?
    
     - **What is the benefit?**
    
    When I created the binary packages without `spark-dependencies`, the 
sizes were
    ```
    zeppelin-0.6.1-bin-all.tgz: 344MB
    zeppelin-0.6.1-bin-netinst.tgz: 64MB
    ```
    As you can see, the size difference in both cases is about `170MB`! 
Moreover, users no longer need to pass build profiles such as `-Pr` or 
`-Psparkr`. I have seen many users who try to use `%sparkr` in Zeppelin and hit 
an NPE because they didn't build with `-Psparkr`. It is really confusing, 
probably because they aren't familiar with the Maven build mechanism. With 
this change, they don't need to know about the complicated Maven build 
profiles. 
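
For reference, the "about 170MB" figure comes straight from the two listings above. A quick check in shell arithmetic (the variable names are only for illustration):

```shell
# Package sizes in MB, copied from the two listings above.
all_with=517;     all_without=344
netinst_with=236; netinst_without=64

# Size saved by dropping spark-dependencies from each package.
all_saved=$((all_with - all_without))
netinst_saved=$((netinst_with - netinst_without))

echo "bin-all saves ${all_saved}MB, bin-netinst saves ${netinst_saved}MB"
# prints: bin-all saves 173MB, bin-netinst saves 172MB
```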
    
    5.
    > Also regarding user experience - while running zeppelin-demon.sh user 
does not usually expect it to be network-dependant and download 100Mb archives 
- is there at least a user notification\progress indicator
    
    So far, I have just added the line below, which is printed to the console 
after users start `zeppelin-daemon.sh`:
    ```
    echo "There is no SPARK_HOME in your system. After successful Spark bin installation, Zeppelin will be started."
    ``` 
    Then it starts downloading the Spark binary from the mirror site. I'm 
planning to add a description to the README, since we already provide a lot of 
build profile information there. I also agree there must be a better way to 
notify users than just writing "We will download a 100MB Spark binary package 
if you haven't set SPARK_HOME yet" in the README. 
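
To make the flow concrete, here is a rough sketch of how the check, the notice, and a visible download progress bar could fit together. This is not the actual code in this PR: the function names, `SPARK_ARCHIVE_URL`, `DRY_RUN`, and the `spark.tgz` file name are all hypothetical, and it assumes `curl` is available on the user's machine.

```shell
# Hypothetical sketch, not the code in this PR.
# Assumed names: needs_spark_download, install_spark_if_missing,
# SPARK_ARCHIVE_URL (mirror URL supplied by the caller), DRY_RUN.

# Returns 0 and prints the notice when SPARK_HOME is unset or empty;
# returns 1 (and prints nothing) when the user already has a Spark install.
needs_spark_download() {
  if [ -z "${SPARK_HOME:-}" ]; then
    echo "There is no SPARK_HOME in your system. After successful Spark bin installation, Zeppelin will be started."
    return 0
  fi
  return 1
}

# Downloads the Spark archive into the directory given as $1, but only when
# SPARK_HOME is missing. With DRY_RUN set, the commands are printed instead
# of executed, so the flow can be checked without touching the network.
install_spark_if_missing() {
  target_dir="$1"
  if needs_spark_download; then
    echo "Downloading Spark (roughly 100MB) into ${target_dir} ..."
    ${DRY_RUN:+echo} mkdir -p "${target_dir}"
    # --progress-bar gives the user a simple progress indicator.
    ${DRY_RUN:+echo} curl --fail --location --progress-bar \
      --output "${target_dir}/spark.tgz" \
      "${SPARK_ARCHIVE_URL:?set SPARK_ARCHIVE_URL to a Spark mirror URL}"
  fi
}

# Example (dry run): DRY_RUN=1 SPARK_ARCHIVE_URL=<mirror url> install_spark_if_missing /tmp/spark
```

Keeping the check separate from the download would also make it easy to ask the user for confirmation before fetching anything.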
    
    After I first came up with removing `spark-dependencies` to reduce the 
Zeppelin binary package size, I spent a long time thinking about how we could 
seamlessly replace the existing way of providing embedded Spark in Zeppelin. 
Please regard this PR as a first step. I would appreciate it if you could 
share your awesome ideas about this issue! :)




