Hi all. OK, so after disappearing for days to track down why I was getting very frustrating "commlib" errors, I *think* I finally have an answer, sort of. But behind this is a critical question I'm seeking opinions on.
In my case, the software (CASAVA from Illumina, Inc) runs under SGE. Swell. So far, so good. But after several hours, I get frequent "commlib" errors with really NO other errors or info. Just a connection reset.

Well, when I finally had our researcher "scale things back" a bit, the job actually ran to completion. <-- light bulb went off here. The difference? Number of simultaneous jobs PER HOST. At least, that's my working theory. We have a dedicated 10-gig network behind our cluster, but I think I'm running into network limitations anyway. I am working on ways to confirm that.

*Here's my question*: In our case, we have nodes with 8, 12 and even 24 cores. That's a much different world than not too long ago, when everything was single, dual or (at most) quad core. And while I have no doubt the boxes can handle the CPU load of many jobs, I think I'm hitting network limitations and stuff is getting dropped.

Can anyone here share opinions, experiences, etc. on "max simultaneous jobs per execution host" as it relates to networking? I'd love to hear any insight on this.

Thanks all,
--Kent
_______________________________________________ users mailing list [email protected] https://gridengine.org/mailman/listinfo/users
