1. EMR does not currently provide anything like built-in version control for Zeppelin notes. (Good idea, though!) Zeppelin's built-in S3 notebook storage might help you, especially if you turn on bucket versioning on the notebook bucket, but I have not tried this.
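Something like the following in zeppelin-env.sh is roughly what I have in mind (property names as in the Zeppelin S3 notebook storage docs; the bucket and user values are placeholders, the conf file location may differ on EMR, and again this is untested by me):

  # zeppelin-env.sh -- store notes in S3 instead of the local filesystem
  export ZEPPELIN_NOTEBOOK_S3_BUCKET=my-zeppelin-notes    # placeholder bucket name
  export ZEPPELIN_NOTEBOOK_S3_USER=hadoop                 # notes end up under s3://<bucket>/<user>/notebook/
  export ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.S3NotebookRepo

With versioning enabled on that bucket, S3 keeps prior revisions of each note's note.json, which gives you a crude history even without real version control integration.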
2. Yes, if you go to the ResourceManager UI on port 8088 and click the ApplicationMaster link next to the Zeppelin application, you can get to the Spark UI associated with the Zeppelin SparkContext (assuming you have first run a notebook containing Spark code; otherwise the Zeppelin YARN app won't exist yet).

3. Sorry, I have not tried using Zeppelin's notebook scheduler, but yes, Data Pipeline would probably give you more reliability for production batch ETL jobs. I don't know what your use case is, but maybe you could use Data Pipeline to generate some dataset that you store in S3 and can query via Zeppelin?

4. This is a limitation of Zeppelin (really, of Spark), not specifically of Zeppelin on EMR: you must load any dependencies before running any Spark code, because the dependencies can only be loaded once. However, once you solve this issue, you will run into a known issue with Zeppelin on EMR where you hit a weird NPE caused by the zeppelin user not having write access to /usr/lib/zeppelin/local-repo. I would suggest creating /var/lib/zeppelin/local-repo and then symlinking /usr/lib/zeppelin/local-repo to it (the commands are spelled out at the bottom of this message). We will fix this in emr-4.3.0.

~ Jonathan

— Sent from Mailbox

On Fri, Dec 4, 2015 at 11:18 PM, armen donigian <donig...@gmail.com> wrote:
> Hi all,
> Installed Zeppelin on Amazon EMR and it's running swell. Had a few questions...
> 1. How do we version control Zeppelin notes?
> 2. How do you check for status of a long running Zeppelin task? Is there a web UI for this or do you simply check the Resource Manager UI @master-node:8088 (in case of AWS)?
> 3. Are there any known issues/limitations of running Zeppelin note scheduler in production for batch ETL jobs? Trying to assess it vs Amazon Data Pipelines.
> 4. When trying to add an external jar, I'm getting this error.
> %dep
> z.reset()
> z.load("com.databricks:spark-redshift_2.10:0.5.2")
> Must be used before SparkInterpreter (%spark) initialized
> Thanks
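P.S. To spell out the local-repo workaround from #4, something along these lines on the master node should do it (paths assume the default EMR Zeppelin install under /usr/lib/zeppelin; the chown is my addition, to give the zeppelin user the write access whose absence causes the NPE):

  sudo mkdir -p /var/lib/zeppelin/local-repo
  sudo chown zeppelin:zeppelin /var/lib/zeppelin/local-repo   # zeppelin user needs write access here
  # if /usr/lib/zeppelin/local-repo already exists, move it out of the way first
  sudo mv /usr/lib/zeppelin/local-repo /usr/lib/zeppelin/local-repo.orig 2>/dev/null
  sudo ln -s /var/lib/zeppelin/local-repo /usr/lib/zeppelin/local-repo

Then restart Zeppelin (or at least the Spark interpreter) so %dep picks up the writable path; adjust the paths if your install differs.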