1. EMR does not currently provide anything like this for Zeppelin. (Good idea, 
though!) Zeppelin's built-in S3 notebook storage might help, especially with 
bucket versioning turned on, but I have not tried it myself.


2. Yes. If you go to the ResourceManager UI on port 8088 and click the 
ApplicationMaster link next to the Zeppelin application, you can get to the 
Spark UI associated with the Zeppelin SparkContext (assuming you have first 
run a notebook containing Spark code; otherwise the Zeppelin YARN app won't 
exist yet).
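
If you'd rather check status programmatically than click through the UI, the 
ResourceManager also exposes a REST API that lists running applications along 
with their tracking URLs (the hostname below is a placeholder for your master 
node):

    # List running YARN apps; the Zeppelin app's trackingUrl leads to the Spark UI
    curl "http://<master-node-dns>:8088/ws/v1/cluster/apps?states=RUNNING"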




3. Sorry, I have not tried Zeppelin's notebook scheduler, but yes, 
DataPipelines would probably give you more reliability for production batch 
ETL jobs. I don't know your use case, but perhaps you could use DataPipelines 
to generate a dataset, store it in S3, and query it via Zeppelin?
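
For instance, if a pipeline wrote Parquet output to S3, a Zeppelin paragraph 
along these lines (the path is hypothetical) could pick it up for interactive 
querying:

    %spark
    // Read the dataset the pipeline produced (hypothetical S3 path)
    val df = sqlContext.read.parquet("s3://my-bucket/pipeline-output/")
    // Register it so it can be queried with %sql in another paragraph
    df.registerTempTable("pipeline_output")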




4. This is a limitation of Zeppelin (really, of Spark), not of Zeppelin on EMR 
specifically: dependencies can only be loaded once per SparkContext, so you 
must load them before running any Spark code. However, once you solve this, 
you will run into a known issue with Zeppelin on EMR where you hit a strange 
NPE caused by the zeppelin user not having write access to 
/usr/lib/zeppelin/local-repo. I would suggest creating 
/var/lib/zeppelin/local-repo and then symlinking /usr/lib/zeppelin/local-repo 
to it. We will fix this in emr-4.3.0.
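
The workaround looks something like this (run as root on the master node, and 
assuming the existing local-repo is empty or otherwise safe to remove):

    mkdir -p /var/lib/zeppelin/local-repo
    chown zeppelin:zeppelin /var/lib/zeppelin/local-repo
    rm -rf /usr/lib/zeppelin/local-repo    # remove the unwritable directory
    ln -s /var/lib/zeppelin/local-repo /usr/lib/zeppelin/local-repo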




~ Jonathan

On Fri, Dec 4, 2015 at 11:18 PM, armen donigian <donig...@gmail.com>
wrote:

> Hi all,
> Installed Zeppelin on Amazon EMR and it's running swell. Had a few
> questions...
> 1. How do we version control Zeppelin notes?
> 2. How do you check for status of a long running Zeppelin task? Is there a
> web UI for this or do you simply check the Resource Manager UI
> @master-node:8088 (in case of AWS)?
> 3. Are there any known issues/limitations of running Zeppelin note
> scheduler in production for batch ETL jobs? Trying to assess it vs Amazon
> Data Pipelines.
> 4. When trying to add an external jar, I'm getting this error.
> %dep
> z.reset()
> z.load("com.databricks:spark-redshift_2.10:0.5.2")
> Must be used before SparkInterpreter (%spark) initialized
> Thanks
