Thanks for providing such a detailed reply.
While I believe in multi-period batch, in most circumstances I would not
recommend 4 periods. The usefulness of additional periods diminishes
very quickly after 2 periods.
I don't really buy the argument that period 1 work will significantly
delay period 2 work. Period 1 work should be a small percentage of the
work in the service class - too small to produce a meaningful delay to
larger jobs, whereas larger jobs CAN significantly delay smaller jobs.
If period 2 is being delayed it is more likely to be by work in other
service classes - which might require other adjustments to the WLM policy.
The issue of over initiation is interesting. If a programmer overloads
the initiators with long running jobs to the point that short running
jobs can't get an initiator, starting more initiators to meet the short
job's goal seems to be exactly what I want WLM to do. Over initiation is
a minor problem compared to having all your programmers sitting around
waiting for their jobs to run because the capacity planner submitted
their suite of monthly reports. (Not that that has ever happened!) :-)
If you don't want WLM to start more initiators to meet the goals, maybe
the goals should be relaxed?
I'm not completely convinced of the evils of over initiation anyway.
I've seen the charts showing increased elapsed time but they didn't seem
to count queue time. With most jobs what counts is the time between job
submission and job end - not starting execution and job end.
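
To make the queue time point concrete, here is a minimal sketch in Python. The function name and all of the timings are hypothetical, chosen only to illustrate the arithmetic; they are not measurements from any system.

```python
from datetime import datetime, timedelta

def job_times(submitted, started, ended):
    """Split a job's life into queue time, execution time and turnaround."""
    queue = started - submitted
    execution = ended - started
    return {"queue": queue, "execution": execution, "turnaround": queue + execution}

# Hypothetical timings, for illustration only.
t0 = datetime(2012, 1, 1, 9, 0)

# Few initiators: the job waits 30 minutes in the queue, then runs for 10.
limited = job_times(t0, t0 + timedelta(minutes=30), t0 + timedelta(minutes=40))

# More initiators: the job starts at once but runs for 25 minutes under contention.
over_initiated = job_times(t0, t0, t0 + timedelta(minutes=25))

print(limited["turnaround"], over_initiated["turnaround"])
```

With these made-up numbers the over-initiated job has the longer execution time but the shorter turnaround, which is the case a chart of execution time alone would miss.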
How much overhead is there really for a job in discretionary which isn't
getting much service, vs. a job in the queue? In the early days of WLM
we were told WLM will drive the system to 100%, and that's OK because
the overhead of having jobs sitting there soaking up the occasional
leftover CPU cycle was negligible. They might run slow but it's faster
than in the input queue. Is the overhead larger now?
Regards
Andrew Rowley
On 7/05/2012 12:13 AM, Cheryl Watson wrote:
I'm writing a series of articles for my Tuning Letter about service level
agreements and mentioned in the last issue that I strongly believe in single
period batch and two-period TSO service classes. One of my readers asked me to
clarify, so I pulled up an old article on multi-period batch. It will soon be
added to our website as part of the z/OS 101 Primer articles that are free to
the public - http://www.watsonwalker.com/articles.html. I've included the
entire article below, but would like to qualify that I consider work like DDF
to be more like TSO, needing two periods, than batch. (I've kept this as plain
text, so it isn't pretty. Sorry.)
Best regards,
Cheryl
======================
Cheryl Watson
Watson & Walker, Inc.
www.watsonwalker.com
======================
Multi-Period Batch
What are the advantages and disadvantages of running batch in single-period
service class versus a multi-period service class?
We must have heard this question at least six times at the latest SHARE.
Although we did provide an answer in our September 1994 TUNING Letter, we think
it's time for an update. We'll address the considerations for both test and
production batch jobs, because they tend to have different requirements.
Test Batch
If your intention is to provide the best turnaround to the most people by
allowing large resource consumers to suffer slightly, then you'll want to use
the typical method of managing test batch jobs. That method simply consists of
getting as many of the small jobs through the system, at a high dispatch
priority, as you can. You would then let the larger jobs run at a lower
priority, and possibly miss their service goals.
This technique is used in almost every data center today. The only difference
is in how it's implemented. Let us describe the two typical methods and the
pros and cons of each.
Priority by Job Classes
The most common technique is to define a set of test batch job classes that
allow a certain set of resources. For example, you might define the following
test batch job classes:
A - Less than 5 seconds CPU time, no tapes - 10 minute turnaround
B - Less than 15 seconds CPU time, 0 to 1 tape - 30 minute turnaround
C - Unlimited CPU time, 0 or 1 tape - 2 hour turnaround
D - Unlimited CPU time, unlimited tapes - overnight
Then you would define some JES initiators to process these jobs. There are
dozens of ways to set up initiators, but a typical scenario might be:
Init 1 - Classes: A
Init 2 - Classes: A
Init 3 - Classes: B
Init 4 - Classes: BA
Init 5 - Classes: CA
Init 6 - Classes: DCBA
You would then set up a service class for each job class (single-period for
classes A and B). As one example:
TSTBATA - 90% within 10 minutes
TSTBATB - 90% within 30 minutes
TSTBATC - period 1 = velocity of 20%; period 2 = discretionary
TSTBATD - discretionary
We're making an assumption that there aren't enough ended class C jobs to allow
a response time goal.
The advantage of this technique is that the initiators will determine the
highest priority jobs to allow into MVS. If the operators feel that the system
is too busy at the moment, they can close down the initiators in order of 6, 5,
4, 3, 2 and 1. When jobs in classes A and B get onto an initiator, they'll go
into a single-period service class and stay at the same dispatch priority while
they're executing. For those job classes, the first jobs on an initiator are
normally the first jobs completed.
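
The initiator setup above can be sketched as a small Python model. It assumes that an initiator searches its class list in the order written (so "BA" prefers class B and falls back to class A) and that jobs within a class are taken first-in, first-out. The queue contents and the select_job helper are hypothetical.

```python
from collections import deque

# Initiator class lists from the example above.
INITIATORS = {1: "A", 2: "A", 3: "B", 4: "BA", 5: "CA", 6: "DCBA"}

def select_job(class_list, queues):
    """Return the next job for an initiator, or None if nothing is eligible."""
    for job_class in class_list:
        queue = queues.get(job_class)
        if queue:  # skip classes with no waiting jobs
            return queue.popleft()
    return None

# Hypothetical input queues: three class A jobs, no class B, one each of C and D.
queues = {
    "A": deque(["A1", "A2", "A3"]),
    "B": deque(),
    "C": deque(["C1"]),
    "D": deque(["D1"]),
}

started = {init: select_job(classes, queues) for init, classes in INITIATORS.items()}
print(started)
```

In this run initiator 3 stays idle because it only serves class B, while initiator 4 falls back to a class A job. Closing initiators 6 and 5 in this model would stop new class C and D work without touching classes A and B.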
Job classes C and D, on the other hand, have unlimited CPU time. They might
need 20 seconds of CPU time or three hours of CPU time - you don't really know.
Therefore, the multi-period batch allows you to push the smaller of these large
jobs through the system by setting the dispatch priority of period one to
provide higher performance.
Priority by Period
Prioritizing test batch jobs by their actual use rather than their anticipated
use is another common technique. In this method, there would be just one test
batch job class. The initiators would be used to manage the number of test jobs
in the system, but wouldn't differentiate between the short jobs or the long
jobs.
A service class for this method might have four periods and look like:
Period 1 - 90% within 10 minutes, duration = 1000 Service Units (SUs)
Period 2 - 90% within 30 minutes, duration = 3000 SUs
Period 3 - velocity of 20%, duration = 10000 SUs
Period 4 - discretionary
All test jobs would enter the system in a first-come, first-served order. As
soon as MVS sees them, they will probably be run at a high dispatch priority
until they've consumed 1000 service units. Those jobs taking less than 4,000
service units (1000 in period one and 3000 in period two) have the next highest
priority and will be completed next. The longer jobs will compete at the same
low priority, with the smaller jobs typically completing first.
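
As a rough illustration of how a job moves through the periods, the sketch below maps accumulated service units to a period number using the durations from the example. The function name is hypothetical, and the treatment of a job that has consumed exactly a boundary value is an assumption.

```python
from itertools import accumulate

# Durations for periods 1 to 3 from the example; period 4 has no duration.
DURATIONS = [1000, 3000, 10000]

# Cumulative service units at which each period ends: 1000, 4000, 14000.
BOUNDARIES = list(accumulate(DURATIONS))

def period_for(service_units):
    """Return the period a job is in after consuming this many service units."""
    for period, boundary in enumerate(BOUNDARIES, start=1):
        if service_units < boundary:
            return period
    return len(BOUNDARIES) + 1  # past every duration: the last period

for consumed in (500, 2500, 9000, 50000):
    print(consumed, period_for(consumed))
```

Because durations accumulate, a job leaves period 2 at 4,000 service units in total, not at 3,000.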
Comparison
The first method using job classes takes more effort on the part of the sysprogs and the
programmers that submit jobs. The sysprogs will need to analyze the current data to
determine appropriate job class groupings, the job class information will need to be
distributed to the users, and the users will need to estimate their job's usage before
they submit the job. If they guess too low, the jobs will ABEND with a time-out for the
job class. If they guess too high, they'll get poorer service than they deserve. If job
class designations change, there may be additional ABENDs because programmers have a
tendency to use "old JCL," change a line or two (seldom the job class) and
submit the job. If the job runs in the wrong job class, you may have an ABEND. This
technique is not too productive for the programmers, but will result in the best
turnaround times for the majority of users. (As you'll see next.)
The second method is very simple to use, but can lead to severe problems. The
programmers don't need to be concerned with which job class to use, and the
sysprogs don't need to do any analysis (other than determine the appropriate
durations for the periods). MVS and SRM will get the shortest jobs out at the
highest priority. This method has a large problem however, and that's the
possibility that a programmer will overload the initiators with a lot of
long-running jobs. If the short jobs can't even get on an initiator, WLM can't
get them completed in time. This technique is very useful when you have plenty
of resources, and can greatly over-initiate the system. If the system is very
constrained, and you have to limit the number of initiators, it may be
difficult or impossible to get the small jobs in and out of the system. The
biggest problem with this technique is that you can't guarantee turnaround
times for your users. It becomes extremely difficult to manage to a set of
service level objectives.
The easiest compromise seems to be to define a very, very small number of batch
job classes, such as three. Make them significantly different enough in terms
of resource usage that it will be easy for the users to choose the correct
class. For service levels, you might simply talk about short, medium, and long
batch (e.g. class A, B, and C), with response goals for the first two. When
setting up your initiators, make sure that classes A and B have at least one
dedicated initiator each (otherwise a bunch of class C jobs could grab the
initiators during slack times and class A jobs couldn't get started). Then
create a multi-period service class for class C. This would give you the
capability of managing your test jobs to provide consistent turnaround times
for classes A and B. Class C users that use the least amount of resources would
tend to get better response than those using more resources.
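
The dedicated-initiator rule in this compromise can be checked mechanically. The following is a minimal Python sketch; the helper name and both initiator layouts are hypothetical, and "dedicated" is taken to mean an initiator whose class list contains only that one class.

```python
def has_dedicated_initiator(initiators, job_class):
    """True if at least one initiator serves only the given job class."""
    return any(set(classes) == {job_class} for classes in initiators.values())

# Hypothetical layout that follows the advice: A and B each have their own initiator.
good = {1: "A", 2: "B", 3: "BA", 4: "CBA"}

# Hypothetical layout where every initiator that serves A can also be taken by other work.
risky = {1: "BA", 2: "CBA"}

for name, layout in (("good", good), ("risky", risky)):
    print(name, {c: has_dedicated_initiator(layout, c) for c in "AB"})
```

In the second layout a burst of class C jobs during a slack period could occupy initiator 2, and class B jobs could occupy initiator 1, leaving class A with nowhere to start.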
WLM managed initiators can have similar problems with multi-period batch
service classes, because the work is run in the same type of service classes.
An additional problem with these initiators is that if all of the current
initiators are blocked with long running jobs and small jobs are missing their
goals, then WLM will open up more initiators. It's quite possible that the
system can become over-initiated. WLM will eventually stop the unnecessary
initiators, but the problem may exist for a period of time.
Production Batch
Production batch jobs present a different problem. The most typical scenario is
that all production batch jobs are placed in a single job class with
TYPRUN=HOLD in the JCL. Then they are released one job at a time by the
operations staff or an automated scheduler. There are no turnaround times
associated with production batch jobs. Although the intention of most
production batch jobs is to complete before the online systems come up in the
morning, there is no way to indicate this to MVS and SRM. Most installations
have solved the problem by identifying critical jobs in the batch cycle and
assigning them to a higher priority service class (one with a higher dispatch
priority).
If you put all production batch in a multi-period service class, you will
generally have problems when resources become constrained. One of the major
jobs in the critical path, for example, might fall into second or third period.
If resources are constrained, other smaller jobs will come in and out of the
system at a higher priority than the critical batch job. In a very constrained
system, the critical job could take three to four times the normal elapsed time
just because it's running at a low priority. Often the solution to this is to
move that particular job to a higher priority, single-period, service class.
But this solution is applied one job at a time as problems are diagnosed.
In general, multi-period production batch is very frustrating to the operators
and schedulers. If they've released a job, it's because they want it to run
(now!). They don't want it to lie around the bottom of the resource pool using
CPU only when nobody else wants it. For production jobs, usually first-in,
first-out is the desired mode of operation. With single-period production batch
service classes, that's what you get. With multi-period service classes, the
job using the most resources will take the longest, sometimes to the detriment
of the critical batch window. This has been one of the primary causes for sites
not meeting their critical batch window. If you're having trouble getting your
batch jobs complete before your online systems come up each day, check to see
if this is the cause.
** pull-quote - Multi-period production batch service classes are one of the
primary causes for missing your batch window goals
One alternative for this is to identify the critical jobs in your batch cycle
and place them in a unique job class assigned to a unique single-period service
class. This service class would run at a higher velocity and importance than
other batch, and would exhibit first-in, first-out characteristics.
--
Andrew Rowley
Black Hill Software Pty. Ltd.
Phone: +61 413 302 386
EasySMF for z/OS: Interactive SMF Reports on Your PC
http://www.smfreports.com