This happens when two threads start backing up the system object and the second one begins sending data before the first has created the group leader. The group leader is the anchor for management and expiration of the entire system object as a single entity, even though it is made up of multiple objects.
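The "resourceutil" setting mentioned in the workaround below is the backup-archive client's RESOURCEUTILIZATION option, which caps how many sessions the client opens in parallel. A minimal sketch of what that looks like in the Windows client options file (the value 2 is taken from the workaround; verify the exact behaviour for your client level):

    * dsm.opt -- cap parallel client sessions, intended to keep a second
    * data session from starting before the group leader exists
    RESOURCEUTILIZATION 2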
As a workaround, you can set RESOURCEUTILIZATION to 2 on all of your Windows clients, do another backup of the system objects, and expire the old ones (through policy changes or just by waiting).

The hang is related to the defect involving RESTORE STGVOL. We had the same problem; however, the RESTORE STGVOL process never actually made its way into the process table, and initially I was still able to get in and HALT dsmserv. Officially, the defect indicated that, left to its own devices, the lock condition would degrade until the server became unreachable. The fix is in 5.3.2.3.

HOWEVER: we upgraded to 5.3.2.3 and have had SERIOUS lock issues. SHOW DEADLOCK doesn't show anything. The actlog will periodically show a swarm of errors about operations failing due to lock conflicts, similar to:

    2006-02-26 13:00:18.000000 ANR2033E UPDATE STGPOOL: Command failed - lock conflict. (SESSION: 124639)
    2006-02-26 13:00:18.000000 ANR2033E QUERY STGPOOL: Command failed - lock conflict. (SESSION: 124664)
    2006-02-26 13:00:18.000000 ANR2033E QUERY DRMEDIA: Command failed - lock conflict. (SESSION: 124670)

and similar.

ALSO: MIGRATE STG will lock tables in such a way that Q STG hangs, while Q PROC and Q SES still work. Client sessions continue writing to whatever volume they already have, but most new sessions will also hang. Once the offending process is killed, everything resumes.

ALSO: I've found that REPAIR STGVOL (a subprocess of RECLAIM STG) has been showing up very often.

ALSO: Tonight, REPAIR STGVOL, two RECLAIM STG processes, and one AUDIT LIC were all running and had hung. Unfortunately, I didn't pull dbtxn, txn, lock, etc. info prior to issuing HALT.

ALSO: dsmserv seems to chew up more CPU now than at 5.3.1.6 and 5.3.2.1; however, I don't have quantitative measurements from the previous levels.

I'm not sure whether this progression of locking issues is limited to us or is a 5.3.2.3 problem, but I'm very worried about the safety and stability of TSM.

-Josh
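For what it's worth, the dbtxn/txn/lock information mentioned above can usually still be gathered from an administrative client before issuing HALT, so support has something to work with afterwards. A rough sketch using the diagnostic SHOW commands (these are undocumented and unsupported, so treat the exact names as an assumption to confirm with IBM support; the admin ID and password are placeholders):

    dsmadmc -id=admin -password=XXXXXXXX "show deadlock" > deadlock.out
    dsmadmc -id=admin -password=XXXXXXXX "show locks"    > locks.out
    dsmadmc -id=admin -password=XXXXXXXX "show txnt"     > txnt.out
    dsmadmc -id=admin -password=XXXXXXXX "show dbtxnt"   > dbtxnt.out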
On 06.03.03 at 14:51 [EMAIL PROTECTED] wrote:

Date: Fri, 3 Mar 2006 14:51:52 -0800
From: Larry Peifer <[EMAIL PROTECTED]>
Reply-To: "ADSM: Dist Stor Manager" <ADSM-L@VM.MARIST.EDU>
To: ADSM-L@VM.MARIST.EDU
Subject: Re: dsmserv process hung.

We too have just started to have this problem in the last 4 days. In our case the symptoms and solutions seem to fit what's described in IBM Document Ref #: PK00196. However, that was supposed to have been fixed in the 5.3.1 release, which we are running. Can anyone shed more light on what might be triggering this situation?

AIX 5.2 ML5
TSM 5.3.1.0

Here's a series of errors that cropped up this week for the first time. Any insights would be helpful.

02/27/06 21:59:00 ANR9999D imgroup.c(1180): ThreadId<90> Error 8 retrieving Backup Objects row for object 0.101495737 (SESSION: 2838)
02/27/06 21:59:00 ANR9999D ThreadId<90> issued message 9999 from: <-0x000000010001bf74 outDiagf <-0x00000001003fb114 imIsGroupLeader <-0x0000000100396b9c SmNodeSession <-0x000000010047f854 HandleNodeSession <-0x0000000100485760 smExecuteSession <-0x000000010051c3e4 SessionThread <-0x000000010000e958 StartThread <-0x0900000000286460 _pthread_body (SESSION: 2838)
02/27/06 21:59:00 ANR9999D smnode.c(7353): ThreadId<90> Session 2838: Invalid Group Id 0,101495737 for ADD function (SESSION: 2838)
02/27/06 21:59:00 ANR9999D ThreadId<90> issued message 9999 from: <-0x000000010001bf74 outDiagf <-0x0000000100396bc4 SmNodeSession <-0x000000010047f854 HandleNodeSession <-0x0000000100485760 smExecuteSession <-0x000000010051c3e4 SessionThread <-0x000000010000e958 StartThread <-0x0900000000286460 _pthread_body (SESSION: 2838)
02/28/06 23:24:55 ANR9999D lmlcaud.c(506): ThreadId<75> Error 17 checking filespace data for license audit. (PROCESS: 72)
02/28/06 23:24:55 ANR9999D ThreadId<75> issued message 9999 from: <-0x000000010001bf74 outDiagf <-0x00000001006d8e70 LmLcAuditThread <-0x000000010000e958 StartThread <-0x0900000000286460 _pthread_body (PROCESS: 72)
03/01/06 11:20:55 ANR9999D lmlcaud.c(506): ThreadId<43> Error 17 checking filespace data for license audit. (PROCESS: 79)
03/01/06 11:20:55 ANR9999D ThreadId<43> issued message 9999 from: <-0x000000010001bf74 outDiagf <-0x00000001006d8e70 LmLcAuditThread <-0x000000010000e958 StartThread <-0x0900000000286460 _pthread_body (PROCESS: 79)
03/03/06 03:41:10 ANR9999D lmlcaud.c(506): ThreadId<51> Error 17 checking filespace data for license audit. (PROCESS: 29)
03/03/06 03:41:10 ANR9999D ThreadId<51> issued message 9999 from: <-0x000000010001bf74 outDiagf <-0x00000001006d8e70 LmLcAuditThread <-0x000000010000e958 StartThread <-0x0900000000286460 _pthread_body (PROCESS: 29)

In each case we need to halt and restart the TSM server to free up the locks. Finding slack time to do that is not always easy.

"Ochs, Duane" <[EMAIL PROTECTED]>
Sent by: "ADSM: Dist Stor Manager" <ADSM-L@VM.MARIST.EDU>
01/30/2006 12:44 PM
Please respond to "ADSM: Dist Stor Manager" <ADSM-L@VM.MARIST.EDU>
To: ADSM-L@VM.MARIST.EDU
Subject: [ADSM-L] dsmserv process hung.

AIX 5.3
TSM 5.3.1.2

This weekend one of my three TSM servers had the DSMSERV process hang. The machine was accessible and the DSMSERV process still existed; it was still accepting connections but not talking to them. In turn, our cross-server backups and volume reconciliation hung from the other two TSM servers. One server ended up crashing due to a full recovery log, and the other was near that same point. It looks like the root cause was a full recovery log on the hung server.

I monitor to see if DSMSERV exists, and I monitor for backup and archive failures.
I use operational reporting to give me additional information for clients, and I even monitor to make sure the client scheduler is running and communicating. Does anybody have a method in place, or an idea, for monitoring whether the TSM server is actually capable of communication?

Duane Ochs
Information Systems - Enterprise Computing
Quad/Graphics Inc.
Sussex, Wisconsin
414-566-2375 phone
414-566-4010 pin# 2375 beeper
[EMAIL PROTECTED]
www.QG.com
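One way to approach the monitoring question above: instead of only checking that the dsmserv process exists, periodically run a cheap administrative query through dsmadmc under a hard timeout, and alert when it does not come back. A minimal sketch in Python, assuming dsmadmc is installed on the monitoring host and a dedicated "monitor" admin ID exists; the credentials, timeout, and probe command are placeholders to adapt:

#!/usr/bin/env python3
# Rough sketch: check whether the TSM server actually answers an admin
# query within a time limit, rather than only checking that the dsmserv
# process exists. Assumes dsmadmc is installed locally and an admin ID
# named "monitor" exists (both are assumptions; adjust to your setup).
import subprocess
import sys

TIMEOUT_SECONDS = 120           # a hung server accepts connections but never answers
PROBE_COMMAND = "query status"  # any cheap read-only admin command will do

def tsm_responds():
    cmd = [
        "dsmadmc",
        "-id=monitor",
        "-password=XXXXXXXX",   # placeholder; use a restricted monitoring ID
        "-dataonly=yes",
        PROBE_COMMAND,
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=TIMEOUT_SECONDS)
    except subprocess.TimeoutExpired:
        return False            # server (or its admin interface) is not answering
    except OSError:
        return False            # dsmadmc not found or not executable
    return result.returncode == 0

if __name__ == "__main__":
    if tsm_responds():
        print("TSM server answered the probe")
        sys.exit(0)
    print("TSM server did not answer within %d seconds" % TIMEOUT_SECONDS)
    sys.exit(1)                 # non-zero exit -> raise an alert from the scheduler

Run it from cron on a box other than the TSM server itself, so a wedged server or a full recovery log still gets caught; the exit code can feed whatever alerting you already use for backup and archive failures.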