Re: APAR OA58438 (was Re: Planned ESQA change and HealthCheck)

Mark Zelden Thu, 03 Oct 2019 14:34:26 -0700

On Thu, 3 Oct 2019 16:18:32 -0500, Mark Zelden <m...@mzelden.com> wrote:


>If you are running z/OS 2.3 and increasing ESQA because of expansion into ECSA
>messages or sudden unexplained growth, check out APAR OA58438.
>
>We had 3 system crashes after migrations to z/OS 2.3 in 2019 and one close
>call after ECSA got to 99% when ESQA expanded into it (only a vendor monitor
>crashed in that case after a failed ECSA getmain).  Stand-alone dumps didn't
>reveal the root cause, other than that we knew it was RPB pool growth related
>to SVC dumps from CICS.  In one case a single SVC dump caused an 80M ESQA
>spike within one or two seconds that crashed a system when it spilled into
>ECSA and also filled up ECSA (typically at about 70% use, but "stable").
>
>We worked with IBM all summer on this.  We had different SLIPs and GTF traces
>put in place, but with the traces going the problem never happened.  But SVC
>dump processing did take over the CPU with the trace + GTF active!   :-)
>
>Meanwhile, we increased ESQA on 30 LPARs via normal IPLs over the summer by
>about 80M, and ECSA a bit, as a "workaround".  These are settings that hadn't
>been touched in god knows how long (certainly not since 64-bit and HVCOMMON
>usage has increased).  So we had to lose about 100M of high private to do
>this.  We also increased real storage on a couple of LPARs that really didn't
>warrant it (based on zero or close to zero demand paging during normal
>operations), but we knew real storage was also involved in the problem (no
>flash memory for SVC dumps on my client's mainframes).
>
>The entire time IBM has said we are the only ones reporting the problem, but
>since we had the problem in big sysplexes, small sysplexes, big LPARs, and
>small LPARs, I know that we can't be the only ones.  I think other shops are
>ignoring the ESQA expansion into ECSA (since that in itself doesn't hurt)
>and / or they have more "white space".  The RPB control blocks are freed
>after about 10 minutes, so anyone looking at their current ESQA (and ECSA)
>usage wouldn't notice the spikes or would just say "oh well, looks good now".
>
>Anyway, IBM was getting close to figuring this out not too long ago.  They
>partially re-created the problem in the lab some weeks ago and just got back
>to us today with the root cause and the APAR that was opened.  It is related
>to being real storage constrained at the time of the SVC dumps (I think all of
>the crashes were during CICS startup time in the wee morning hours).
>
>I really wanted to post something about this earlier but didn't since IBM said
>they had no other reported problems.  So if you have seen this problem since
>migrating to z/OS 2.3, now you know you aren't the only ones.
>


One thing I didn’t mention in the post (well, I did the first time I started to
compose it, but accidentally closed the window) is that one may not even notice
the problem because the RPB pools are released after some period of time
(10-15 minutes?).  So if one looked at any given point in time, ESQA usage
would be “normal”.  The only clue would be Health Checker messages, if Health
Checker was running, or some other monitor tripping an ESQA threshold or
flagging expansion into ECSA.
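
As a rough illustration of why a point-in-time look misses this, the sketch
below compares the latest reading against the peak over a window of periodic
samples. Everything in it is an assumption made for illustration: the Sample
type, the function names, the 5-minute interval and the megabyte values are
made up and do not come from any real monitor or from the dumps discussed
above.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: int          # minutes since the start of the monitoring window
    esqa_used_mb: float  # ESQA in use at that moment (hypothetical monitor output)


def peak_and_latest(samples):
    """Return (peak, latest) so a transient spike is not hidden by a later normal reading."""
    peak = max(samples, key=lambda s: s.esqa_used_mb)
    latest = max(samples, key=lambda s: s.minute)
    return peak, latest


def spikes(samples, baseline_mb, threshold_mb):
    """Return the samples whose usage exceeds the baseline by at least threshold_mb."""
    return [s for s in samples if s.esqa_used_mb - baseline_mb >= threshold_mb]


# Illustrative values only: a short-lived jump that is gone again by minute 15.
SAMPLES = [
    Sample(0, 40.0),
    Sample(5, 120.0),
    Sample(10, 118.0),
    Sample(15, 42.0),
    Sample(20, 40.0),
]

if __name__ == "__main__":
    peak, latest = peak_and_latest(SAMPLES)
    print(f"latest: {latest.esqa_used_mb} MB at minute {latest.minute}")
    print(f"peak:   {peak.esqa_used_mb} MB at minute {peak.minute}")
    for s in spikes(SAMPLES, baseline_mb=40.0, threshold_mb=50.0):
        print(f"spike:  {s.esqa_used_mb} MB at minute {s.minute}")
```

With these made-up numbers the latest reading looks "normal" while the window
peak shows the spike, which is the same reason a threshold alert catches what
a manual look afterwards does not.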


Regards,

Mark
--
Mark Zelden - Zelden Consulting Services - z/OS, OS/390 and MVS
ITIL v3 Foundation Certified
mailto:m...@mzelden.com
Mark's MVS Utilities: http://www.mzelden.com/mvsutil.html

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
