Sharon,
Thank you for your help. Please see my comments below.
Bodo
________________________________
From: Sharon Lucas [mailto:luc...@us.ibm.com]
Sent: Thursday, July 30, 2009 6:39 PM
To: Strösser, Bodo
Cc: staf-users@lists.sourceforge.net
Subject: Re: [staf-users] STAX Job hangs
Bodo,
Were there any errors in the STAX JVM log?
What version of STAX and what version of STAF are you running?
You're doing several things in your <finally> element that we don't recommend
and that can prevent you from stopping your STAX job.
1) First, in the STAX User's Guide (in the documentation for the <finally>
element) it says:
"Note that if you want to have a guaranteed way to stop a finally task, you
should have the first element contained in your finally task be a block or
timer element. For example, if you submit a request to terminate the job, it
will not terminate the job until the finally task(s) complete. But if you
submit a request to terminate a block that is currently running which is
contained within a finally task, then the block will be terminated (it will not
wait until that finally task completes)."
Yes, I know that. But this <finally> is written to wait for the process to
terminate. It must not be aborted by the user, as this could cause the next
test to be started while the previous one is still running.
The part inside <if> runs only if the process started in <try> was terminated
by the user (block termination). If the process ends normally, the next thing
the script does is reset MyProcessHandle.
Thus, I intentionally did not put a <block> inside the <finally>.
So, you should change your <finally> element to contain a <block> as its first
task (and possibly a <timer> element if you know that this process should be
able to be freed within a specified duration). For example:
<finally>
<block name="'FinallyCleanupBlock'">
<if expr="MyProcessHandle != 0">
....
</if>
</block>
</finally>
or, if you know this process should always complete within 30 minutes (or
whatever value), you can also specify a timer element.
<finally>
<block name="'FinallyCleanupBlock'">
<timer duration="'30m'">
<if expr="MyProcessHandle != 0">
....
</if>
</timer>
</block>
</finally>
2) You should be using a <loop> element instead of a <script> element because
STAX cannot stop an infinite loop in Python code, but it can stop a <loop>
element. Also, your loop is using up a lot of CPU as it only waits 0.1 seconds
before it repeats itself. You should have a longer wait interval such as 10
seconds or more depending on how long it takes your process to complete. If it
takes a long time, then your wait interval should be longer. If you have a
long wait interval, then you could use a <stafcmd> to submit a local DELAY
request to the DELAY service instead of using the Python time.sleep() because
STAX cannot stop a Python time.sleep() if you wanted to terminate the finally
block.
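To make the control flow of such a polling loop concrete outside STAX, here is a minimal Python 3 sketch of the same decision logic: RC 0 (Ok) or 5 (HandleDoesNotExist) means done, RC 12 (ProcessNotComplete) means wait and retry, anything else is an error, and an overall deadline keeps it from spinning forever. The names wait_for_process_free and submit_free are hypothetical; inside a real STAX job this logic belongs in <loop> and <stafcmd> elements as described above, because STAX cannot interrupt Python code.

```python
import time

# STAF return codes used by the loop (the original script's comments name 5 and 12).
RC_OK = 0
RC_HANDLE_DOES_NOT_EXIST = 5
RC_PROCESS_NOT_COMPLETE = 12


def wait_for_process_free(submit_free, interval=10.0, timeout=1800.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll PROCESS FREE HANDLE until the handle is gone (sketch only).

    submit_free: hypothetical callable that submits the FREE HANDLE request
    and returns the STAF RC as an int.
    Returns (done, rc): done is True if the handle was freed or no longer
    exists, False on an unexpected RC or when the deadline would be exceeded.
    """
    deadline = clock() + timeout
    while True:
        rc = submit_free()
        if rc in (RC_OK, RC_HANDLE_DOES_NOT_EXIST):
            return True, rc
        if rc != RC_PROCESS_NOT_COMPLETE:
            return False, rc          # unexpected error: caller reports it
        if clock() + interval > deadline:
            return False, rc          # next wait would pass the deadline
        sleep(interval)               # process still running: wait, then retry
```

Injecting sleep and clock lets the retry and timeout behaviour be checked without real delays.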
Yes, I could use a <loop>. But that would not make sense without the <block>,
right?
So I wrote the code entirely in Python, as it would be unkillable anyway.
3) What does EMACHTools.STAFSubmit2() do? Is it using a <stafcmd> to submit a
STAF service request? A STAX job should only use a <stafcmd> element to submit
a STAF service request.
Yes, it submits a STAF service request; the routine is written in Java. I wrote
it to get a better way of handling <testcase>. Using the XML element, it is
quite difficult to make sure that each testcase that is opened gets its result,
even in case of block terminations. So my script uses STAF requests to the STAX
service to start, update and stop a testcase. But I don't like having STAF
service symbols flickering on the screen, so I used Java instead of <stafcmd>.
Is there any reason that this might not work correctly?
So, here's what I think your <finally> block should look like (based on what
little I know that you're trying to do). You should also see if your other
<finally> blocks need to be updated for the reasons I talked about above.
<finally>
<block name="'FinallyCleanupBlock'">
<if expr="MyProcessHandle != 0">
<sequence>
<log message="1">
" Interaction '%s': Signal '%s' sent due to User Abort" % \
(Interaction['Name'], Interaction['AbortSignal'])
</log>
<log message="1">
" Waiting for signalled interaction to exit"
</log>
<script>done = 0</script>
<loop while="not done">
<sequence>
<stafcmd name="'Free process handle %s' % (MyProcessHandle)">
<location>'local'</location>
<service>'PROCESS'</service>
<request>'FREE HANDLE %s' % (MyProcessHandle)</request>
</stafcmd>
<if expr="RC == STAFRC.Ok or RC == STAFRC.HandleDoesNotExist">
<script>done = 1</script>
<elseif expr="RC == STAFRC.ProcessNotComplete">
<stafcmd name="'Delay for 10 seconds while waiting for process to end'">
<location>'local'</location>
<service>'DELAY'</service>
<request>'DELAY 10s'</request>
</stafcmd>
</elseif>
<else>
<script>
FatalMsg = 'PROCESS FREE HANDLE %s failed with RC=%s' % (MyProcessHandle, RC)
done = 1
</script>
</else>
</if>
</sequence>
</loop>
<call function="'EMACH_CheckError'"/>
<log message="1">
" Signalled interaction is gone"
</log>
</sequence>
</if>
</block>
</finally>
I don't know why the <finally> element is the last element shown in the call
stack. If the STAX job was stuck in the infinite loop in the <script> element,
then I would have expected the <script> element to be the last element shown in
the call stack. Perhaps there's another problem in your STAX job.
But, you should first update your finally element(s) as I recommended and see
if that resolves the problem.
Yes, that is exactly the question: why is <finally> the last element on the
stack? If that is true, my job is NOT running in the Python loop.
Also, since I saw that the process had exited, and I even freed the handle
"manually", the loop should have stopped.
BTW: the hung thread is the only one in this STAX job, and the job is the only
one in the STAX service.
Is there any way to get a Java call stack? The job is still looping; I didn't
kill it because I do not know how quickly I can recreate the problem.
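One possible way to get a Java-level stack, not discussed in the thread: on Linux, a HotSpot JVM prints the stacks of all its Java threads to its standard output when it receives SIGQUIT (the same as `kill -3 <pid>`) and keeps running; the JDK's `jstack <pid>` tool gives a similar dump. Where that output ends up for a JVM started by STAF is an assumption to verify; it may be the STAX JVM log Sharon asked about. The sketch below uses hypothetical helper names, picks candidate PIDs out of `ps -eo pid,args` output, and sends the signal. The marker string that identifies the STAX JVM is also an assumption; adjust it to your own ps output.

```python
import os
import signal


def find_java_pids(ps_output, marker):
    """Return PIDs of java processes whose command line contains marker.

    ps_output is assumed to come from `ps -eo pid,args`: a header line,
    then one process per line with the PID first and the full command after it.
    """
    pids = []
    for line in ps_output.splitlines()[1:]:      # skip the header line
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, args = int(parts[0]), parts[1]
        if "java" in args and marker in args:
            pids.append(pid)
    return pids


def request_thread_dump(pid):
    """Send SIGQUIT, like `kill -3 <pid>`.

    A HotSpot JVM answers with a thread dump on its stdout and does not exit.
    Only use this on a JVM: for most other processes SIGQUIT terminates them.
    """
    os.kill(pid, signal.SIGQUIT)
```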
--------------------------------------------------------------
Sharon Lucas
IBM Austin, luc...@us.ibm.com
(512) 286-7313 or Tieline 363-7313
From: Strösser, Bodo <bodo.stroes...@ts.fujitsu.com>
Sent: 07/30/2009 09:57 AM
To: "staf-users@lists.sourceforge.net" <staf-users@lists.sourceforge.net>
Subject: [staf-users] STAX Job hangs
Hi,
Today my STAX job hung. This is the result of a thread query:
# staf local stax query job 14 thread 1
Response
--------
{
  Thread ID : 1
  Parent TID : <None>
  Start Date-Time: 20090730-16:04:03
  Call Stack : [
    function: EMACH_main (Line: 804, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    sequence: 12/12 (Line: 865, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    block: main.Test execution (Line: 1076, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    sequence: 1/1 (Line: 1077, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    finally (Line: 1154, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    try (Line: 1079, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    sequence: 10/10 (Line: 1080, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    iterate: 1/1 {'Name': '2.1.1 CSTA_base_all', 'Tes... (Line: 1124, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    sequence: 1/2 (Line: 1125, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    block: main.Test execution.2:1:1 CSTA_base_all (Line: 1127, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    sequence: 3/4 (Line: 1128, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    function: EMACH_ProcessTestCases (Line: 1331, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    sequence: 1/1 (Line: 1338, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    finally (Line: 1493, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    try (Line: 1340, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    iterate: 1/1 {'Name': 'TC_1', 'SUT': 'PINGUIN', '... (Line: 1342, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    sequence: 1/2 (Line: 1343, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    block: main.Test execution.2:1:1 CSTA_base_all.TC_1 / PINGUIN (Line: 1346, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    sequence: 1/2 (Line: 1347, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    finally (Line: 1379, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    try (Line: 1348, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    sequence: 5/5 (Line: 1349, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    function: EMACH_ProcessInteractions (Line: 1575, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    finally (Line: 1620, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    try (Line: 1582, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    iterate: 1/2 {'Type': 'C', 'ExitPass': '0', 'Outp... (Line: 1583, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    sequence: 1/3 (Line: 1584, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    if: Interaction['Type'] == 'F' (Line: 1586, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    function: EMACH_ProcessCaller (Line: 1781, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    sequence: 1/2 (Line: 1787, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
    finally (Line: 1907, File: /home/EMACH/EMACH-stax.xml, Machine: local://local)
  ]
  Condition Stack: []
}
The part of the job included in the <finally> starting on line 1907 is:
<finally>
<if expr="MyProcessHandle != 0">
<sequence>
<log message="True">
" Interaction '%s': Signal '%s' sent due to User Abort" % \
(Interaction['Name'], Interaction['AbortSignal']) </log>
<log message="True">
" Waiting for signalled interaction to exit"</log>
<script>
while True :
    msg = EMACHTools.STAFsubmit2('LOCAL', 'PROCESS', \
                                 'FREE HANDLE %s' % MyProcessHandle)
    if msg == None :
        break
    RC = msg.split("RC=")[1]
    RC = int(RC.split(",")[0])
    if RC == 5 :    # RC 5 is 'Handle does not exist'
        break
    if RC != 12 :   # RC 12 is 'Process Not Complete'
        FatalMsg = msg
        break
    time.sleep(0.1)
</script>
<call function="'EMACH_CheckError'"/>
<log message="True">
" Signalled interaction is gone"</log>
</sequence>
</if>
</finally>
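The RC extraction in this script assumes every message contains "RC=<n>,": if a message ever lacks "RC=", msg.split("RC=")[1] raises IndexError, and int() fails if anything other than a comma follows the number. A slightly more defensive sketch, still assuming the "RC=<n>" format the script relies on (parse_staf_rc is a hypothetical helper name; the code runs under both Python 2 and 3):

```python
import re

# Matches "RC=<digits>" anywhere in the message, e.g. "...RC=12, ...".
# The format is assumed from the original script, which splits on "RC=" and ",".
_RC_PATTERN = re.compile(r"RC=(-?\d+)")


def parse_staf_rc(msg):
    """Return the integer RC found in msg, or None if msg is None or has no RC."""
    if msg is None:
        return None
    match = _RC_PATTERN.search(msg)
    if match is None:
        return None
    return int(match.group(1))
```

Returning None for "no RC found" leaves the decision to the caller instead of raising inside the <script> element.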
This code waits for the process to terminate after the user has terminated a
block.
I looked for the process the loop waits for and found that it had ended with an
RC. So I freed the handle via STAF, but that did not make the job continue.
# staf local process list
Response
--------
H# Command Start Date-Time End Date-Time RC
-- ----------------------------- ----------------- ----------------- ----------
38 /home/EMACH/EMACHRemoteHelper 20090727-21:02:12 20090728-13:57:26 129
# staf local process free handle 38
Response
--------
# staf local process list
Response
--------
#
Thinking about the call trace: why do I see <finally> as the innermost element?
I would guess the job is looping in <script>, but then shouldn't <if> be the
innermost element? When I look at the JVM with the ps command, I see its CPU
time increasing as fast as real time.
Any help finding the problem is welcome.
Best Regards
Bodo
------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
trial. Simplify your report design, integration and deployment - and focus on
what you do best, core application coding. Discover what's new with
Crystal Reports now.
http://p.sf.net/sfu/bobj-july
_______________________________________________
staf-users mailing list
staf-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/staf-users