On Wed, Sep 4, 2013 at 9:36 AM, janI <j...@apache.org> wrote:
> Hi.
>
> We have had some longer discussions on different ML/IRC about how a
> vm-admin should behave and which level of service we expect for our
> servers.
>
> We need new admins, so this is also a request for anyone interested to chip
> in.
>
> We have had some unfortunate incidents on all 3 vms, of different nature,
> which made me question if we as a community:
> a) want servers that are cared for professionally or by happenstance.
> b) want to (are capable to) maintain the servers ourselves.
> c) are prepared to support a change that makes a), b) possible.
>
> I have formulated some thoughts on how admins could work, but in general I
> believe we should convince infra to take over the vm responsibility and
> keep our well functioning forum/wiki admins.
>
> We have a vm-team in place, that was created with the purpose of not having
> a single person as admin. In my opinion the team has not lived up to that
> purpose but I am still thankful for the help I have received.
>
> Remark: the ideas below are my personal thoughts, which I have used during
> the time where I maintained the servers:
>
> ===========
> The server should at all times be maintained with the following priority:
> 1) security (the downside of being popular is to have the attention of
> people who want to gain merit by breaking our servers)
> 2) stability (we have limited cpu/ram/disk so we must optimize)
> 3) add user wishes (we already have stable systems, 1,2 are far more
> important than enhancing the systems).
>
> Being an admin on a vm is a job that does not take so much time, but
> requires a lot of monitoring and communication (especially with infra).
>
> A good setup would be 3 types of admin:
> Each server will have an appointed "owner" (anchor-admin)
> A number of persons have full sudo on a server (admin)
> A number of persons can reboot/restart/work on po files (help-admin)
>
> === Anchor-admin responsibilities ===
> Anchor-admin is the "owner" of the vm and the prime contact to infra.
>
> Anchor-admin has the overall responsibility of the vm.
> 1) help when receiving alerts
> 2) keep informed on available patches, especially security related patches
> 3) create/keep a maintenance plan
> 4) coordinate changes external to vm (like dns) with infra
> 5) participate in infra discussions relevant for the vm (e.g. certificates)
> 6) monitor the vm regularly for resource usage
> 7) ensure that application changes are implemented with relevant consensus
> 8) discuss work with admins, with the goal that they should be able to take
> over one day.
>
> These activities are expected to take 3-4 hours per week, more in the
> beginning and less later. The hour usage highly depends on the number and
> level of admins.
>
> === Admin responsibilities ===
> Admins help the anchor admin with ongoing maintenance and have full sudo.
>
> All changes must be discussed and agreed with the anchor admin, no change
> is so important that it cannot wait until discussed!
>
> Admins are expected to:
> 1) help when receiving alerts
> 2) stay informed about the vm configuration
> including but not limited to:
> - where each configuration is done and stored (svn/backup)
> - how the apps are configured
> - read and update runbook, if something is unclear
> 3) participate in the regular maintenance
> 4) coordinate all non-scheduled work with anchor-admin
>
> These activities are expected to take 1-2 hours per week, more in the
> beginning and less later.
>
> Admins do not need to be specialists, we all learn, but it is important
> that an admin has the motivation and time to learn.
>
>
> === Help-admin responsibilities ===
> Help-admins are located in different timezones, so we have 24/7 coverage
> and have limited sudo (only restart/reboot/handle po files).
>
> When a help-admin receives an alert mail, actions should be taken
> 1) is the vm reachable via ssh, then login else escalate to admin/infra
> 2) is the vm overloaded, or is apache/mysql not running
> 3) restart the needed processes
> 4) mail at least anchor-admin with observations and what was done.
>
>
> ===
> Remark: the above are just my thoughts, there are a lot of other
> possibilities.
>
> Let's hear your opinion?
>
> rgds
> jan I.
>
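
To make sure I read the help-admin alert steps above correctly, here is how I would sketch them as a small decision procedure. This is only an illustration, not an existing script: the field names, the action strings, and the overload threshold are my own assumptions (the mail does not define "overloaded"), and I have assumed that an unreachable vm leads to escalation only.

```python
from dataclasses import dataclass
from typing import List

# Assumed threshold; the original mail does not say what "overloaded" means.
OVERLOAD_THRESHOLD = 2.0


@dataclass
class Observation:
    """What a help-admin sees after an alert mail (hypothetical field names)."""
    ssh_reachable: bool
    load_per_cpu: float = 0.0   # e.g. 1-minute load average / number of CPUs
    apache_running: bool = True
    mysql_running: bool = True


def triage(obs: Observation) -> List[str]:
    """Return the ordered actions for one alert, following steps 1-4."""
    # Step 1: without ssh access nothing can be done locally, so escalate.
    if not obs.ssh_reachable:
        return ["escalate to admin/infra"]

    actions = []
    # Steps 2-3: restart only the services that are actually down.
    if not obs.apache_running:
        actions.append("restart apache")
    if not obs.mysql_running:
        actions.append("restart mysql")
    # Step 2: overload is recorded so it ends up in the report.
    if obs.load_per_cpu > OVERLOAD_THRESHOLD:
        actions.append("record overload")
    # Step 4: always report observations and what was done.
    actions.append("mail anchor-admin")
    return actions
```

For example, `triage(Observation(ssh_reachable=True, apache_running=False))` yields a restart of apache followed by the mail to the anchor-admin.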
I would like to discuss this topic further, much further as a matter of fact,
but right now I don't really have enough information.

Can you provide details on the following (or point to a document that
describes this):

* to aid our memories, who is on the current vm-team
* what are the three servers now under the vm-team
* what vm-OS does each use
* for each server, what are the specific applications a vm-sysadmin would
  need to know/become familiar with to be an effective sysadmin
* how are alerts on system failure currently handled
* what resources would a vm-admin need to respond to a system failure

Your role outline is good, but I think before we discuss future strategy, we
need a better idea about what's involved.

--
-------------------------------------------------------------------------------------------------
MzK

"Truth is stranger than fiction, but it is because Fiction is obliged to
stick to possibilities. Truth isn't."
    -- "Following the Equator", Mark Twain