Sharon,

last night my short script recreated the problem!

Unfortunately I didn't do any logging, so I don't know how long it looped
until the hang occurred. It was running while monitored by STAXMon. STAXMon
was running on the same Linux machine as the STAX service, but the
DISPLAY was redirected to my workplace (Win-XP running Exceed).

You'll find the script attached.

Bodo

________________________________
From: Strösser, Bodo
Sent: Tuesday, August 04, 2009 10:19 PM
To: 'Sharon Lucas'
Cc: 'staf-users@lists.sourceforge.net'
Subject: RE: [staf-users] STAX Job hangs

Sharon,

I tried to create a smaller script that recreates the problem, but have had no
success yet.

Looking at the STAX code, one would think that this condition can never occur:
the <finally>'s state is WAIT_THREAD, but its HardHoldCondition has been removed
from the thread. fFinallyThread's fState field is STATE_COMPLETE and its
fCompletionNotifiees list is empty.
Would it make sense to build a new STAX.jar with DEBUG set in
STAXFinallyAction.java?

Or might this even be a bug in the JVM?
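
One purely hypothetical way to end up in such a state is an ordering problem
between the code that records the waiting state and the callback that signals
completion of the finally thread. The Python sketch below is a toy model, not
STAX source code: the class FinallyModel and its step names are invented for
illustration, and only the two numeric state values are taken from the jdb
dump of the <finally> action. It shows that if the completion callback runs
before the waiting state is recorded, the object is left in WAIT_THREAD with
no hold condition, which is the combination described above.

```python
# Hypothetical model, NOT STAX source code. Only the two state values
# come from the jdb dump; every name below is made up for illustration.
WAIT_THREAD = 2
THREAD_COMPLETE = 3


class FinallyModel:
    """Toy stand-in for a finally action that waits on a child thread."""

    def __init__(self):
        self.state = None
        self.hard_hold = False  # stand-in for the hard hold condition

    def start_child_and_wait(self):
        # The parent blocks itself while the child thread runs.
        self.hard_hold = True

    def mark_waiting(self):
        # The parent records that it is waiting for the child.
        self.state = WAIT_THREAD

    def child_completed(self):
        # Callback from the child: release the parent, advance the state.
        self.hard_hold = False
        self.state = THREAD_COMPLETE


def run(order):
    """Apply the named steps in the given order and return the end state."""
    model = FinallyModel()
    for step in order:
        getattr(model, step)()
    return model.state, model.hard_hold


# Expected order: the callback arrives after the parent marked itself waiting.
good = run(["start_child_and_wait", "mark_waiting", "child_completed"])

# Racy order: the child finishes first, then the later write of
# WAIT_THREAD overwrites THREAD_COMPLETE.
bad = run(["start_child_and_wait", "child_completed", "mark_waiting"])

print(good)
print(bad)
```

Whether STAXFinallyAction can actually interleave like this is an open
question; a DEBUG build or breakpoints in jdb would be needed to confirm or
rule it out.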

Bodo

________________________________
From: Strösser, Bodo
Sent: Tuesday, August 04, 2009 8:33 PM
To: 'Sharon Lucas'
Cc: 'staf-users@lists.sourceforge.net'
Subject: RE: [staf-users] STAX Job hangs

Sharon,


I have played with jdb a bit.
No matter when I stop all the threads inside the JVM, one of the ten "worker
threads" is at the "synchronized (fConditionSet)" on line 1359 in
STAXThread.java:

Thread-11[1] where
  [1] com.ibm.staf.service.stax.STAXThread.execute (STAXThread.java:1.359)
  [2] com.ibm.staf.service.stax.STAXThreadQueue$QueueThread.run (STAXThreadQueue.java:54)
Thread-11[1]

I also found out how to dump object data. For example here is the <finally>:

Thread-11[1] dump this.fActionStack.header.previous.element
 this.fActionStack.header.previous.element = {
    DEBUG: false
    INIT: 0
    TRY_ACTION: 1
    WAIT_THREAD: 2
    THREAD_COMPLETE: 3
    COMPLETE: 4
    INIT_STRING: "INIT"
    TRY_ACTION_STRING: "TRY_ACTION"
    WAIT_THREAD_STRING: "WAIT_THREAD"
    THREAD_COMPLETE_STRING: "THREAD_COMPLETE"
    COMPLETE_STRING: "COMPLETE"
    STATE_UNKNOWN_STRING: "UNKNOWN"
    USE_SAME_PYINTERPRETER: true
    fHardHoldCondition: instance of com.ibm.staf.service.stax.STAXHardHoldThreadCondition(id=3162)
    fState: 2
    fTryAction: instance of com.ibm.staf.service.stax.STAXTryAction(id=3163)
    fFinallyAction: instance of com.ibm.staf.service.stax.STAXIfAction(id=3164)
    fFinallyThread: instance of com.ibm.staf.service.stax.STAXThread(id=3165)
    fSaveConditionList: instance of java.util.ArrayList(id=3166)
    fSavedConditions: false
    fHasInheritableConditions: false
    com.ibm.staf.service.stax.STAXActionDefaultImpl.fElement: "finally"
    com.ibm.staf.service.stax.STAXActionDefaultImpl.fXmlFile: "/home/STAF/emach/EMACH-stax.xml"
    com.ibm.staf.service.stax.STAXActionDefaultImpl.fXmlMachine: "local://local"
    com.ibm.staf.service.stax.STAXActionDefaultImpl.fElementInfo: instance of com.ibm.staf.service.stax.STAXElementInfo(id=3170)
    com.ibm.staf.service.stax.STAXActionDefaultImpl.fLineNumberMap: instance of java.util.HashMap(id=3171)
}
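
The numeric fState in the dump can be mapped back to its name using the
constants listed in the same dump. The following Python sketch does that for
pasted jdb output. The function name decode_state and the sample text are
assumptions made for illustration; the sketch only looks for a line of the
form "fState: <number>".

```python
import re

# State constants copied from the jdb dump above.
STATE_NAMES = {
    0: "INIT",
    1: "TRY_ACTION",
    2: "WAIT_THREAD",
    3: "THREAD_COMPLETE",
    4: "COMPLETE",
}


def decode_state(dump_text):
    """Return the symbolic name for the fState value found in a jdb dump."""
    match = re.search(r"^\s*fState:\s*(\d+)\s*$", dump_text, re.MULTILINE)
    if match is None:
        raise ValueError("no fState field found in dump")
    # Values outside the known range map to the dump's UNKNOWN string.
    return STATE_NAMES.get(int(match.group(1)), "UNKNOWN")


# Shortened sample in the layout of the dump above.
sample = """
    USE_SAME_PYINTERPRETER: true
    fState: 2
    fSavedConditions: false
"""
print(decode_state(sample))  # WAIT_THREAD
```

For the dump above this gives WAIT_THREAD, i.e. the <finally> is waiting for
its fFinallyThread.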

So, maybe it's really possible to dump out what's wrong. I guess I could even
set breakpoints in the code (e.g. in the finally action). So, if you have some
questions, I'll try to find out the answers from jdb.

Regarding the information you sent, please see my comments below in teal.

Bodo



________________________________
From: Sharon Lucas [mailto:luc...@us.ibm.com]
Sent: Tuesday, August 04, 2009 8:01 PM
To: Strösser, Bodo
Cc: 'staf-users@lists.sourceforge.net'
Subject: RE: [staf-users] STAX Job hangs


Bodo,

When a <try> element with a <finally> element is encountered, the <finally> 
element is added to the call stack before the <try> element (as part of the 
code that ensures that the finally element is always run), unlike any other 
STAX element.  So, just because the <finally> element is on the top of the call 
stack doesn't mean that the hang necessarily occurred within the finally.  
Something strange is happening though where the finally element is not being 
removed from the call stack.

Yes, the problem might occur in the <try>, but what is definitely cycling is
the finally, as this.fActionStack.header.previous.element is a <finally> (see
above).

It would be helpful if you added some more <log> elements to debug this problem 
(you don't have to send these to the STAX Monitor, just log them in the STAX 
Job User Log).  For example, to know if the <finally> element started 
execution, add a <log> as the first task in the finally element so that even if 
MyProcessHandle is 0, you'll know if the finally element started execution.  
Also, to know if the <finally> element completed, add a <log> as the last task 
in the finally element.  For example:

      <finally>
        <sequence>
          <log>"Entering Finally block"</log>
          <if expr="MyProcessHandle != 0">
            <sequence>

              <log message="True">
                "        Interaction '%s': Signal '%s' sent due to User Abort" % \
                (Interaction['Name'], Interaction['AbortSignal']) </log>

              ...

              <log message="True">
                "        Signalled interaction is gone"</log>
            </sequence>
          </if>
          <log>"Exiting Finally block"</log>
        </sequence>
      </finally>
    </try>

Yes, I will insert logging as you suggest.


Even though the process is no longer running (as you verified via the ps 
command), you need to verify if both STAF and STAX have been notified that the 
process is no longer running.  You can do this as follows:

1) To see if STAF knows that the process is no longer running:

STAF processMachine PROCESS LIST HANDLES LONG

Is the process handle still in the list?  If it's not in the list, then STAF
knows the process has completed and its process completion information has been
freed.  If the process handle is in the list and its "End Date-Time" and
"Return Code" fields contain a value other than <None>, then the process has
completed but its process completion information has not yet been freed.

The list is empty.

2) To see if STAX knows that the process is no longer running:

STAF staxMachine STAX LIST JOB 15 PROCESSES

Is the process handle in the list?  If it's in the list, then STAX has not been 
notified (or did not receive the notification) that the process has completed.

The list is empty. BTW: I did expect this, as STAXMon removed the gearwheel for 
the process from the screen.
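
The two checks above can be scripted so they are easy to repeat the next time
a job hangs. This is a minimal Python sketch, assuming the STAF executable is
on the PATH; the function names build_checks and run_checks are made up here,
and the output is returned unparsed because its exact layout is not shown in
this thread.

```python
import subprocess


def build_checks(process_machine, stax_machine, job_id):
    """Return the two STAF command lines from the steps above as argv lists."""
    return [
        ["STAF", process_machine, "PROCESS", "LIST", "HANDLES", "LONG"],
        ["STAF", stax_machine, "STAX", "LIST", "JOB", str(job_id), "PROCESSES"],
    ]


def run_checks(process_machine, stax_machine, job_id):
    """Run both queries and return their raw output for manual inspection."""
    outputs = []
    for argv in build_checks(process_machine, stax_machine, job_id):
        # Assumes a STAF executable on the PATH; nothing is parsed here.
        result = subprocess.run(argv, capture_output=True, text=True)
        outputs.append((" ".join(argv), result.returncode, result.stdout))
    return outputs
```

For the hanging job discussed here the call would be something like
run_checks("local", "local", 15), with the machine names adjusted to wherever
the process and the STAX service actually run.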

Also, as a side note, why do you have a <try>/<finally> where the <try> element
contains <nop/> as follows?  It doesn't really make any sense to do this,
as the purpose of the finally element is to ensure that the finally element's
task is executed, no matter whether the try task completes normally or
abnormally.  Since a <nop/> element does nothing (i.e. no operation), it
can't fail, so it doesn't make sense to do that.  You should change this as
follows:

Thank you for the hint, but I did it intentionally. For me, the finally block
guarantees that the included code is executed "atomically". This means it is
a part of the run that may not be interrupted when the user terminates a block.

Change:

    <try>
        <nop/>
      <finally>
        <sequence>
          ...
        </sequence>
      </finally>
    </try>

to:

    <sequence>
      ...
    </sequence>

Let me know when you have an easier recreation scenario (e.g. one that I could 
run on my STAX machine to recreate the problem and debug it).  That's going to 
be the most likely way that this problem will be resolved.

Yes, but from the last time (<hold> in <finally>) I know that this means a lot
of trial and error. By now it is a complex script, and it is not easy to make a
small recreator from it.

--------------------------------------------------------------
Sharon Lucas
IBM Austin,   luc...@us.ibm.com
(512) 286-7313 or Tieline 363-7313



________________________________
From: Strösser, Bodo <bodo.stroes...@ts.fujitsu.com>
Sent: 08/04/2009 10:38 AM
To: Sharon Lucas/Austin/i...@ibmus
Cc: "'staf-users@lists.sourceforge.net'" <staf-users@lists.sourceforge.net>
Subject: RE: [staf-users] STAX Job hangs

Sharon,

for Job 7 it's exactly what I've sent, the part starting on 20090803 up to
20090804 10:53. At 10:53 the hang was implicitly released by the STAF shutdown.
Those logs of Job 7 show that the last started <process> had completed when the
job stopped.

Currently a job is hanging again (10 threads). The logs are appended.
This time, the logs say that the <process> is still running. But that isn't
true. The process is gone (ps command) and is also no longer displayed by
STAXMon (no gearwheel).
STAF local STAX QUERY JOB 15 THREAD 1 says that the job hangs inside
the <finally> on line 1923, as it did when I mailed the first time. So, the
script simply didn't reach the line where the completion message for the
<process> is logged.

This time I didn't terminate any block, so the <process> in <try> came to
its normal end and the following <script> must have reset MyProcessHandle
to 0. Thus, the <if> on line 1924 must be false and all the content of the
<finally> must be skipped. How can it hang in an empty <finally>?

The only thing that is common to all hangs I've looked into is a <finally> on
top of the stack.

The STAX job still hangs and I can connect jdb to the JVM. If you have more
experience using jdb, maybe you could tell me how to get more info from it.

Bodo

BTW: I'll try to strip down my script to have an easy way to recreate the
problem. But that might take a lot of time. If there is a chance to catch the
problem using the current script, it would be better for me.



________________________________
From: Sharon Lucas [mailto:luc...@us.ibm.com]
Sent: Tuesday, August 04, 2009 4:43 PM
To: Strösser, Bodo
Cc: 'staf-users@lists.sourceforge.net'
Subject: Re: [staf-users] STAX Job hangs


Bodo,

What are the contents of the STAX Job Log and the STAX Job User Log when this 
job hangs?

--------------------------------------------------------------
Sharon Lucas
IBM Austin,   luc...@us.ibm.com
(512) 286-7313 or Tieline 363-7313
[attachment "Job_15_User.log" deleted by Sharon Lucas/Austin/IBM]
[attachment "Job_15.log" deleted by Sharon Lucas/Austin/IBM]
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE stax SYSTEM "stax.dtd">

<stax>

  <defaultcall function="EMACH_main"/>

  <signalhandler signal="'STAXProcessStartError'">
    <nop/>
  </signalhandler>


<!-- ####################################################################### -->

  <function name="EMACH_main" scope="global">

  <sequence>

    <loop while="1">
      <sequence>

        <block name="'level 1'">

          <iterate var="n" in="[1, 2, 3, 4, 5]">
            <block name="'level 2 / %d' % n">

              <iterate var="k" in="[10, 20, 30, 40, 50]">
                <try>

                  <stafcmd name="'DELAY %d' % (n + k)">
                    <location>"LOCAL"</location>
                    <service>"DELAY"</service>
                    <request>"DELAY 1s"</request>
                  </stafcmd>

                  <finally>
                    <if expr="0">
                      <nop/>
                    </if>
                  </finally>

                </try>
              </iterate>

            </block>
          </iterate>

        </block>

        <stafcmd>
          <location>"LOCAL"</location>
          <service>"DELAY"</service>
          <request>"DELAY 1s"</request>
        </stafcmd>

      </sequence>
    </loop>

  </sequence>

  </function>

</stax>
_______________________________________________
staf-users mailing list
staf-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/staf-users
