On Tue, May 5, 2009 at 10:44 AM, Dan Milstein <[email protected]> wrote:
> Best-practices-type question: when a single cluster is being used by a team
> of folks to run jobs, how do people on this list handle user accounts?
>
> Many of the examples seem to show everything being run as root on the
> master, which is hard to imagine is a great idea.
>
> Do you:
>
>  - Create a distinct account for every developer who will need to run jobs?
>
>  - Create a single hadoop-dev or hadoop-jobs account, have everyone use
> that?
>
>  - Just stick with running it all as root?
>
> Thanks,
> -Dan Milstein
>
This is an interesting issue. First, remember that the user that starts hadoop is considered the hadoop 'superuser'. You probably do not want to run hadoop as root, or someone could write an 'rm -rf /' map/reduce application and execute it across your cluster.

We run hadoop as the hadoop user. We use LDAP with public key authentication over ssh. Every user has their own account and their own home directories: /usr/<username> and /user/<username> (hadoop).

Now the fun comes: user1 runs a process that creates files owned by 'user1'. No surprise. What happens when 'user2' needs to modify one of these files? This is not a hadoop issue; we have the same scenario whenever people try to share any file system in unix. On the unix side, the sticky bit and umask help.

What I do is give each user the ability to log in as themselves and as the team user.

passwd
hadoop:hadoop
user1:user1
user2:user2
team1:team1

group
groups
team1:user1:user2

In this way the burden falls on the user to make sure the file ownership is correct. If user1 intends for user2 to see the work, they have two options:

1) set liberal HDFS file permissions
2) start the process as 'team1', not 'user1'

This is suitable for a development style cluster. Some people follow the policy that a production environment should not allow user access. In that case only one user would exist besides hadoop.

passwd
hadoop:hadoop
mysoftware:mysoftware

Code that runs on that type of cluster would be deployed and run by an automated process or configuration management system. Individual users could not directly log in.
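
To make the two options concrete, here is a rough sketch in Python of the owner/group/other write check that unix file systems use and that HDFS permissions are modeled on. It is only an illustration, not how hadoop implements it: the function can_write, the GROUPS table, and the SUPERUSER constant are hypothetical names made up for this example, and the accounts simply mirror the passwd/group listing above.

```python
import stat

# Hypothetical group table mirroring the listing above: group team1 has
# members user1 and user2, plus the team1 account itself.
GROUPS = {"team1": {"user1", "user2", "team1"}}

# Assumption for this sketch: the account that started the hadoop daemons.
SUPERUSER = "hadoop"


def can_write(user, owner, group, mode):
    """Simplified POSIX-style write check (owner, then group, then other)."""
    if user == SUPERUSER:
        return True  # the superuser bypasses permission checks
    if user == owner:
        return bool(mode & stat.S_IWUSR)
    if user in GROUPS.get(group, set()):
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


# The problem case: user1 created the file with typical 644 permissions.
print(can_write("user2", "user1", "user1", 0o644))

# Option 1: liberal permissions on user1's file.
print(can_write("user2", "user1", "user1", 0o666))

# Option 2: the process ran as team1, and user2 logs in as team1.
print(can_write("team1", "team1", "team1", 0o644))
```

Option 2 also works without logging in as team1 if the files are group-writable (for example mode 664 with group team1), since user2 is a member of that group. On the HDFS side the permissions themselves would be changed with something like 'hadoop fs -chmod'.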
