I am glad to see this thread about multi-user support in Zeppelin. I think
this is a very important and urgent feature for Zeppelin's next step.
Here are the issues that I see for multi-user support. I am not sure whether
there is an umbrella ticket for it; if not, I think we should create one and
start from there.
1. Interpreter setting.
- User-level interpreter settings. For now the interpreter settings are applied
globally. That means if user A changes a Spark interpreter setting, the change
applies to everyone else, which is a pretty bad user experience.
2. Interpreter Instance.
- Although there are several options for this, the default behavior
is to share the interpreter instance. ZEPPELIN-1210 is about creating an
interpreter per user; I think this should be the default behavior.
- There is also a performance issue, as Zeppelin only supports yarn-client mode
(I think the yarn-cluster mode in the previous reply refers to Livy). Supporting
yarn-cluster mode for the native Spark interpreter is also necessary (a sketch
follows after this list).
3. Note management
- For now, there is no concept of a per-user workspace. Every user can
see all the notes, which makes them pretty hard to manage and organize. I think
there should be a module for managing and organizing notes per user.
4. Secured cluster.
- In a kerberized environment, all the interpreters share the same
keytab/principal, which is pretty dangerous. E.g., user A could use the shell
interpreter, which runs as user B, to delete all the files owned by user B.
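For illustration only, here is a rough Scala sketch of what a per-user,
yarn-cluster interpreter configuration could look like. The spark.* keys
(master yarn-cluster, spark.yarn.keytab, spark.yarn.principal,
spark.executor.memory) are standard Spark properties, but the per-user lookup,
keytab path and realm are made-up assumptions; Zeppelin does not provide this
wiring today:

  import org.apache.spark.SparkConf

  // Sketch only: assemble per-user launch properties for an isolated interpreter process.
  // The keytab path and Kerberos realm below are placeholders.
  def confForUser(user: String): SparkConf = new SparkConf()
    .setAppName(s"zeppelin-$user")
    .setMaster("yarn-cluster") // the driver would run inside YARN, not on the Zeppelin host
    .set("spark.yarn.keytab", s"/etc/security/keytabs/$user.keytab")
    .set("spark.yarn.principal", s"$user@EXAMPLE.COM")
    .set("spark.executor.memory", "4g")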
Best Regards,
Jeff Zhang
From: vincent gromakowski <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Saturday, August 6, 2016 at 5:11 AM
To: "[email protected]" <[email protected]>
Subject: Re: Multiuser support of Zeppelin.
One Zeppelin per user in a Mesos container on a datanode-type server is fine for
me. An Ansible script configures each instance with user-specific settings and
launches it in Marathon. A service-discovery job (a basic shell script) updates
an Apache server with basic auth and routes each user to his instance. Mesos
also runs a SMACK stack on which Zeppelin relies.
On Aug 5, 2016 at 11:01 PM, "Egor Pahomov" <[email protected]> wrote:
I need to build a chart for 10 days for all countries (200) for several products
by some dimensions. I would need at least 4-6 GB per Zeppelin for it.
2016-08-05 12:31 GMT-07:00 Mohit Jaggi
<[email protected]<mailto:[email protected]>>:
put your big results somewhere else not in Z’s memory?
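For example, a rough sketch of that idea (this assumes the sqlContext that
Zeppelin's Spark interpreter provides; the "events" table and output path are
made up): aggregate with Spark, write the full result to HDFS, and pull back
only a small preview for the chart.

  // Sketch: keep the heavy result out of the Zeppelin/driver JVM.
  val byCountry = sqlContext.table("events")   // "events" is a placeholder table name
    .groupBy("country", "product")
    .count()

  byCountry.write.mode("overwrite").parquet("/tmp/by_country")  // full result stays on HDFS
  val preview = byCountry.limit(1000).collect()                 // only a small sample returns to the driver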
On Aug 5, 2016, at 12:26 PM, Egor Pahomov
<[email protected]<mailto:[email protected]>> wrote:
- Use the Spark driver in “cluster mode”, where the driver runs on a worker instead of
the node running Z
Even without the driver, Z is a heavy process. You need a lot of RAM to keep big
results from jobs. And most of all, Zeppelin 0.5.6 does not support cluster
mode and I'm not ready to move to 0.6.
2016-08-05 12:03 GMT-07:00 Mohit Jaggi
<[email protected]<mailto:[email protected]>>:
Egor,
Running a scale-out system like Spark with multiple users is always tricky.
Operating systems are designed to let multiple users share a single machine,
but for “big data” a single user requires several machines, which is the exact
opposite. Having said that, I would suggest the following:
- Use the Spark driver in “cluster mode”, where the driver runs on a worker instead of
the node running Z
- Set appropriate limits/sizes in the Spark master configuration (see the sketch after this list)
- Run separate instances of Z per user, but then you will have a tough time
collaborating and sharing notebooks… maybe they can be stored in a shared space
and all Z instances can read them, but I am afraid that shared access might
clobber the files. Z developers can tell us if that is true.
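For the second point, a minimal sketch of capping a single user's footprint
with standard Spark properties; the values are placeholders, spark.cores.max
applies to a standalone master and the dynamic-allocation cap to YARN:

  import org.apache.spark.SparkConf

  // Sketch: per-user resource caps via standard Spark properties (placeholder values).
  val limitedConf = new SparkConf()
    .set("spark.cores.max", "8")                     // total-core cap on a standalone master
    .set("spark.executor.memory", "2g")
    .set("spark.dynamicAllocation.enabled", "true")  // on YARN, cap the executor count instead
    .set("spark.dynamicAllocation.maxExecutors", "4")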
Another alternative is virtualization using containers but I think that will
not be easy either.
Mohit
Founder,
Data Orchard LLC
www.dataorchardllc.com
On Aug 5, 2016, at 11:45 AM, Egor Pahomov
<[email protected]<mailto:[email protected]>> wrote:
Hi, I'd like to discuss best practices for using Zeppelin in a multi-user
environment. There are several naive approaches; I've tried each for at least a
couple of months and not a single one worked:
All users on one Zeppelin.
* One Spark context - people really do break sc, and when they are all in the
same boat, a single person can stop many others from working.
* No resource management support. One person can allocate all resources for
a long time.
* The number of notebooks is enormous - it's hard to find anything in them.
* No security separation - everyone sees everything. I do not care about
security, but I care about foolproofing, and people can accidentally delete
each other's notebooks.
Every user has his own Zeppelin on one machine.
* Every Zeppelin instance eats memory for Zeppelin itself. At some point there
is not enough memory.
* Every Spark driver (I use yarn-client mode) eats memory. Same issue.
* Single point of failure.
* Cores might not be enough.
* I cannot prove it, but even if memory and cores are enough, Zeppelin has
problems when there are more than 10 Zeppelin instances on one machine. I do
not know the reason; maybe it's Spark driver issues.
Our current approach:
Every department has its own VM, with its own Zeppelin in it.
* I'm not DevOps; I do not have experience supporting multiple VMs.
* It's expensive to have hardware for a lot of VMs.
* Most of this hardware is not doing work even 20% of the time.
How are you dealing with this situation?
--
Sincerely yours
Egor Pakhomov