On 22/05/13 15:37, Reuti wrote:
Hi Reuti,

I have finally decided to look into upgrading our SGE 6.2 installation, mainly
to see if it helps with my job scheduling problem.

I'm trying to build Son of Grid Engine - succeeded actually.
Currently trying to make it run / import my old configuration.
Which mostly worked. Couple of niggles.

Our setup is SGE_ROOT on shared NFS file system, SGE running as a
non-root user. I'd quite like to keep it that way (it worked well
for us).

The real and effective user is not root? I wonder how to change to a
different user during execution then. Often this can be seen:

$ ps -e -o user,ruser,group,rgroup,command
USER     RUSER    GROUP    RGROUP   COMMAND
...
sgeadmin root     gridware root     /usr/sge/bin/lx24-x86/sge_execd

The real and effective user is not root, and never was. Never caused us any 
problems. The NFS share is exported with root_squash.

This is quite interesting. And do all jobs run under the account of the user
who submitted them, or do you use one common user account for all jobs?

Jobs are running as the user that submitted them, yes. No common account. Been set up like that since we installed it.

Haven't had time to progress with this setup much; is there any documentation
on how the 'builtin' qrsh etc. work? At the moment my test installation works
and I can submit jobs (and they run), but interactive sessions don't work. I
get a commlib error:

[kdf51254@ws112 ~]$ qrsh
error: commlib error: got read error (closing "cs04r-sc-com99-04.diamond.ac.uk/shepherd_ijs/2")

Didn't have that problem on my old 6.2 installation :)
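
One thing that can be checked quickly is which interactive-session method the cell is configured for. The global configuration printed by `qconf -sconf` has six entries for this (qlogin_command, qlogin_daemon, rlogin_command, rlogin_daemon, rsh_command, rsh_daemon), each of which can be set to `builtin`. The sketch below only filters that output and prints any of those entries that is not `builtin`. The function name is made up for illustration, and this says nothing about the actual cause of the commlib error.

```shell
# Reads `qconf -sconf` output on stdin and prints every interactive-session
# entry whose value is not "builtin".
# non_builtin_entries is a hypothetical helper name, not part of Grid Engine.
non_builtin_entries() {
    awk '$1 ~ /^(qlogin|rlogin|rsh)_(command|daemon)$/ && $2 != "builtin" { print $1, $2 }'
}

# Usage, assuming qconf is on the PATH:
#   qconf -sconf | non_builtin_entries
```

No output would mean all six entries are already on the builtin method.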

Tina

Managed to build & install, got the qmaster running, managed to start
execds. However, at least inst_sge.sh -upd-execd simply refuses to work
if you're not root, if I remember correctly (not helping!).

Script(s) sometimes say 'You are not installing as user >root< -
Can't set the file owner/group and permissions'. It would help if they'd
tell me (without digging through them) what files they're trying to
chown/chmod and what they're trying to chown/chmod it to - so I can fix
that, if there is a problem. Goes for a lot of these sort of errors (to
do with running as non-root) - if it fails to do something, it would
really help to know what it failed to do.
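
Until the scripts report this themselves, one way to narrow it down is to list everything under SGE_ROOT that is not owned by the admin user. A minimal sketch, assuming the admin account is called sgeadmin (substitute your own); the function name is hypothetical and this is not what the install scripts actually check.

```shell
# Lists files and directories below a directory that are NOT owned by the
# given user (name or numeric uid).
# list_foreign_owned is an illustrative name, not a Grid Engine tool.
list_foreign_owned() {
    dir=$1
    owner=$2
    find "$dir" ! -user "$owner" -print
}

# Usage (sgeadmin is an assumed account name):
#   list_foreign_owned "$SGE_ROOT" sgeadmin
```

Anything it prints is a candidate for the chown the script could not do.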

The other thing is that I keep having to run it with -nobincheck, as
far as I can tell simply because I didn't build qmon. Annoying - should
it not just check for actually required binaries?

Importing my old installation / upgrading from my old installation
didn't quite work. Mostly did, it seems, which is something. No error
that I'd seen during the import/upgrade, but none of my queues are
there. Host groups are; exec hosts are; complexes look okay; global
config looks right. PEs aren't there; trying to create the PEs from the
config files I originally created them from I get 'error: required
attribute "qsort_args" is missing'. Assume that's the root problem (i.e.
did not manage to import PEs, thus can't import queues). Anyone else had
issues with that? Should the save_config script have caught that?

The "qsort_args" attribute is new in the PE definition. You dumped the
old configuration using $SGE_ROOT/util/upgrade_modules/save_sge_config.sh?
Then it should work to add just this line to the text files generated for
the PEs in the directory the script created.
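
Something along these lines should do for a directory of dumped PE files. It is only a sketch: it assumes one PE definition per file in the directory you pass in, and that NONE is an acceptable default value for qsort_args (check sge_pe(5) for your version). The function name is made up here.

```shell
# Appends "qsort_args NONE" to every regular file in the given directory that
# has no qsort_args line yet.
# add_qsort_args is a hypothetical helper; the one-PE-per-file layout and the
# NONE default are assumptions, not taken from the upgrade scripts.
add_qsort_args() {
    pe_dir=$1
    for f in "$pe_dir"/*; do
        [ -f "$f" ] || continue
        if ! grep -q '^qsort_args[[:space:]]' "$f"; then
            # Make sure the file ends with a newline before appending.
            if [ -n "$(tail -c 1 "$f")" ]; then
                printf '\n' >> "$f"
            fi
            printf 'qsort_args         NONE\n' >> "$f"
        fi
    done
}

# Usage (the path is a placeholder; work on a copy of the dump):
#   add_qsort_args /path/to/copy_of_dump/pe_files
```

Running it twice is harmless, since files that already have the line are skipped.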

I did indeed dump the config using said script. I was just wondering whether
the script is supposed to add a default qsort_args line, or whether the import
script should at least warn you that it's missing and that the import will thus
not work (or the export script should tell you).


And now for the important question :). My execds currently are a mix
of RHEL5 and RHEL6; SoGE got compiled on RHEL6, doesn't work on RHEL5
execds.

Do you use the old original execds or the newly compiled one?

If you use the new ones: compiling everything on RHEL5 and running those
binaries on RHEL6 might have a better chance of working.

I shall try that; I was just wondering if anyone already knows of a way to make 
them work on both.
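
One possible cause, though the thread does not say what the actual failure was, is that binaries linked on RHEL6 reference versioned glibc symbols that the older glibc on RHEL5 does not provide. The sketch below reads `objdump -T` output on stdin and prints the highest GLIBC_x.y version a binary references, which can then be compared with the glibc on the RHEL5 nodes. The function name is hypothetical, and `sort -V` assumes a reasonably recent GNU coreutils, so run it on the newer host.

```shell
# Reads `objdump -T <binary>` output on stdin and prints the highest glibc
# symbol version the binary references (for example 2.12).
# max_glibc_version is a hypothetical helper name.
max_glibc_version() {
    grep -o 'GLIBC_[0-9][0-9.]*' | sed 's/^GLIBC_//' | sort -u -V | tail -n 1
}

# Usage (the path is a placeholder):
#   objdump -T /path/to/sge_execd | max_glibc_version
```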

Also, all nodes and the master/shadow hosts get software upgrades
quite regularly

I would fear that with updates to the nodes all the software you use
also needs to be revalidated, i.e. running the test suite for all of it.
Otherwise a change to e.g. a mathematical library may lead to different
results after an update.

The cluster node configuration is very similar to our standard workstation(s) - 
and there is a lot of software people are using on both. A lot of it compiled 
(and/or written) in house, and in a central location. So the risk of said 
libraries being out of sync (as it were) with the standard workstation setup 
(and hence, things that work on workstations not working on the cluster or vice 
versa) is - to us - much more of a concern. So, cluster nodes get upgraded 
along with the rest of the estate.


--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
