Thanks for providing such a detailed reply.

While I believe in multi-period batch, in most circumstances I would not recommend 4 periods. The usefulness of additional periods diminishes very quickly after 2 periods.

I don't really buy the argument that period 1 work will significantly delay period 2 work. Period 1 work should be a small percentage of the work in the service class - too small to produce a meaningful delay to larger jobs, whereas larger jobs CAN significantly delay smaller jobs. If period 2 is being delayed, it is more likely to be by work in other service classes - which might require other adjustments to the WLM policy.

The issue of over-initiation is interesting. If a programmer overloads the initiators with long-running jobs to the point that short-running jobs can't get an initiator, starting more initiators to meet the short jobs' goal seems to be exactly what I want WLM to do. Over-initiation is a minor problem compared to having all your programmers sitting around waiting for their jobs to run because the capacity planner submitted their suite of monthly reports. (Not that that has ever happened!) :-) If you don't want WLM to start more initiators to meet the goals, maybe the goals should be relaxed?

I'm not completely convinced of the evils of over-initiation anyway. I've seen the charts showing increased elapsed time, but they didn't seem to count queue time. With most jobs, what counts is the time between job submission and job end - not between the start of execution and job end.

How much overhead is there really for a job in discretionary which isn't getting much service, vs. a job sitting in the input queue? In the early days of WLM we were told that WLM would drive the system to 100%, and that that was OK because the overhead of having jobs sitting there soaking up the occasional leftover CPU cycle was negligible. They might run slowly, but it's still faster than waiting in the input queue. Is the overhead larger now?

Regards

Andrew Rowley


On 7/05/2012 12:13 AM, Cheryl Walker wrote:
I'm writing a series of articles for my Tuning Letter about service level 
agreements and mentioned in the last issue that I strongly believe in single 
period batch and two-period TSO service classes. One of my readers asked me to 
clarify, so I pulled up an old article on multi-period batch. It will soon be 
added to our website as part of the z/OS 101 Primer articles that are free to 
the public - http://www.watsonwalker.com/articles.html. I've included the 
entire article below, but would like to qualify that I consider work like DDF 
to be more like TSO, needing two periods, than batch. (I've kept this as plain 
text, so it isn't pretty. Sorry.)

Best regards,
Cheryl

======================
Cheryl Watson
Watson & Walker, Inc.
www.watsonwalker.com
======================

Multi-Period Batch

What are the advantages and disadvantages of running batch in single-period 
service class versus a multi-period service class?

We must have heard this question at least six times at the latest SHARE. 
Although we did provide an answer in our September 1994 TUNING Letter, we think 
it's time for an update. We'll address the considerations for both test batch 
and production batch, because they tend to have different requirements.

Test Batch

If your intention is to provide the best turnaround to the most people by 
allowing large resource consumers to suffer slightly, then you'll want to use 
the typical method of managing test batch jobs. That method simply consists of 
getting as many of the small jobs through the system, at a high dispatch 
priority, as you can. You would then let the larger jobs run at a lower 
priority, and possibly miss their service goals.

This technique is used in almost every data center today. The only difference 
is in how it's implemented. Let us describe the two typical methods and the 
pros and cons of each.

Priority by Job Classes

The most common technique is to define a set of test batch job classes that 
allow a certain set of resources. For example, you might define the following 
test batch job classes:

       A - Less than 5 seconds CPU time, no tapes - 10 minute turnaround
       B - Less than 15 seconds CPU time, 0 to 1 tape - 30 minute turnaround
       C - Unlimited CPU time, 0 or 1 tape - 2 hour turnaround
       D - Unlimited CPU time, unlimited tapes - overnight

Then you would define some JES initiators to process these jobs. There are 
dozens of ways to set up initiators, but a typical scenario might be:

       Init 1 - Classes:  A
       Init 2 - Classes:  A
       Init 3 - Classes:  B
       Init 4 - Classes:  BA
       Init 5 - Classes:  CA
       Init 6 - Classes:  DCBA
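
As a rough sketch, that initiator setup might look like the following in JES2 
initialization parameters. This fragment is illustrative only, not a working 
init deck - the exact syntax varies by JES2 level, and the class list on each 
INIT statement is read in selection-preference order:

```
INITDEF PARTNUM=6             /* six JES2-managed initiators       */
INIT(1) CLASS=A,START=YES     /* short jobs only                   */
INIT(2) CLASS=A,START=YES
INIT(3) CLASS=B,START=YES
INIT(4) CLASS=BA,START=YES    /* prefers B, falls back to A        */
INIT(5) CLASS=CA,START=YES
INIT(6) CLASS=DCBA,START=YES  /* takes anything, longest first     */
```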

You would then set up a single period service class for each job class. As one 
example:

       TSTBATA - 90% within 10 minutes
       TSTBATB - 90% within 30 minutes
       TSTBATC - period 1 = velocity of 20%; period 2 = discretionary
       TSTBATD - discretionary

We're assuming that there aren't enough class C job completions to allow a 
response time goal.

The advantage of this technique is that the initiators will determine the 
highest priority jobs to allow into MVS. If the operators feel that the system 
is too busy at the moment, they can close down the initiators in order of 6, 5, 
4, 3, 2 and 1. When jobs in classes A and B get onto an initiator, they'll go 
into a single-period service class and stay at the same dispatch priority while 
they're executing. For those job classes, the first jobs on an initiator are 
normally the first jobs completed.

Job classes C and D, on the other hand, have unlimited CPU time. They might 
need 20 seconds of CPU time or three hours of CPU time - you don't really know. 
Therefore, a multi-period service class allows you to push the smaller of 
these large jobs through the system, because period one runs at a higher 
dispatch priority.

Priority by Period

Prioritizing test batch jobs by their actual use rather than their anticipated 
use is another common technique. In this method, there would be just one test 
batch job class. The initiators would be used to manage the number of test jobs 
in the system, but wouldn't differentiate between the short jobs or the long 
jobs.

A service class for this method might have four periods and look like:

       Period 1 - 90% within 10 minutes, duration = 1000 Service Units (SUs)
       Period 2 - 90% within 30 minutes, duration = 3000 SUs
       Period 3 - velocity of 20%, duration = 10000 SUs
       Period 4 - discretionary

All test jobs would enter the system in a first-come, first-served order. As 
soon as MVS sees them, they will probably be run at a high dispatch priority 
until they've consumed 1000 service units. Those jobs taking less than 4000 
service units (1000 in period one and 3000 in period two) have the next highest 
priority and will be completed next. The longer jobs will compete at the same 
low priority, with the smaller jobs typically completing first.
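
The period-aging mechanics above can be sketched with a toy single-CPU model. 
Everything here - the quantum, the scheduler, the job names and sizes - is 
invented for illustration; real SRM dispatching is far more sophisticated:

```python
# Toy model of WLM period aging: a job "falls" to a later (lower-priority)
# period as it consumes service units (SUs). The dispatcher always serves
# the job in the earliest period, breaking ties first-come, first-served.

PERIOD_DURATIONS = [1000, 3000, 10000]  # SUs before falling to the next period
QUANTUM = 100                           # SUs delivered per dispatch

def current_period(consumed):
    """Return the period (1-based) a job occupies after consuming `consumed` SUs."""
    boundary = 0
    for i, duration in enumerate(PERIOD_DURATIONS, start=1):
        boundary += duration
        if consumed < boundary:
            return i
    return len(PERIOD_DURATIONS) + 1    # final period (e.g. discretionary)

def simulate(jobs):
    """jobs: dict of name -> total SUs required, in submission order.
    Returns the order in which the jobs complete."""
    consumed = {name: 0 for name in jobs}
    finished = []
    while len(finished) < len(jobs):
        runnable = [n for n in jobs if n not in finished]
        # Dispatch the runnable job in the earliest period (FCFS on ties).
        job = min(runnable, key=lambda n: current_period(consumed[n]))
        consumed[job] += QUANTUM
        if consumed[job] >= jobs[job]:
            finished.append(job)
    return finished

# BIG is submitted first, but ages out of period 1 after 1000 SUs,
# letting the smaller jobs overtake it.
order = simulate({"BIG": 50_000, "SMALL": 800, "MEDIUM": 3_500})
print(order)  # → ['SMALL', 'MEDIUM', 'BIG']
```

Even though BIG grabbed the CPU first, it completes last: once it exhausts 
its period 1 duration it drops behind the jobs still in earlier periods.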

Comparison

The first method, using job classes, takes more effort on the part of the 
sysprogs and the programmers who submit jobs. The sysprogs will need to analyze 
the current data to determine appropriate job class groupings, the job class 
information will need to be distributed to the users, and the users will need 
to estimate their job's usage before they submit the job. If they guess too 
low, the jobs will ABEND with a time-out for the job class. If they guess too 
high, they'll get poorer service than they deserve. If job class designations 
change, there may be additional ABENDs, because programmers have a tendency to 
use "old JCL," change a line or two (seldom the job class) and submit the job. 
If the job runs in the wrong job class, you may have an ABEND. This technique 
is not very productive for the programmers, but will result in the best 
turnaround times for the majority of users. (As you'll see next.)

The second method is very simple to use, but can lead to severe problems. The 
programmers don't need to be concerned with which job class to use, and the 
sysprogs don't need to do any analysis (other than determine the appropriate 
durations for the periods). MVS and SRM will get the shortest jobs out at the 
highest priority. This method has a large problem, however: the possibility 
that a programmer will overload the initiators with a lot of 
long-running jobs. If the short jobs can't even get on an initiator, WLM can't 
get them completed in time. This technique is very useful when you have plenty 
of resources and can greatly over-initiate the system. If the system is very 
constrained, and you have to limit the number of initiators, it may be 
difficult or impossible to get the small jobs in and out of the system. The 
biggest problem with this technique is that you can't guarantee turnaround 
times for your users. It becomes extremely difficult to manage to a set of 
service level objectives.

The easiest compromise seems to be to define a very, very small number of batch 
job classes, such as three. Make them significantly different enough in terms 
of resource usage that it will be easy for the users to choose the correct 
class. For service levels, you might simply talk about short, medium, and long 
batch (e.g. class A, B, and C), with response goals for the first two. When 
setting up your initiators, make sure that classes A and B have at least one 
dedicated initiator each (otherwise a bunch of class C jobs could grab the 
initiators during slack times and class A jobs couldn't get started). Then 
create a multi-period service class for class C. This would give you the 
capability of managing your test jobs to provide consistent turnaround times 
for classes A and B. Class C users that use the least amount of resources would 
tend to get better response than those using more resources.

WLM-managed initiators can have similar problems with multi-period batch 
service classes, because the work runs in the same type of service classes. 
An additional problem with these initiators is that if all of the current 
initiators are blocked with long-running jobs and small jobs are missing their 
goals, then WLM will start more initiators. It's quite possible that the 
system can become over-initiated. WLM will eventually stop the unnecessary 
initiators, but the problem may exist for a period of time.

Production Batch

Production batch jobs present a different problem. The most typical scenario is 
that all production batch jobs are placed in a single job class with 
TYPRUN=HOLD in the JCL. Then they are released one job at a time by the 
operations staff or an automated scheduler. There are no turnaround times 
associated with production batch jobs. Although the intention of most 
production batch jobs is to complete before the online systems come up in the 
morning, there is no way to indicate this to MVS and SRM. Most installations 
have solved the problem by identifying critical jobs in the batch cycle and 
assigning them to a higher priority service class (one with a higher dispatch 
priority).

If you put all production batch in a multi-period service class, you will 
generally have problems when resources become constrained. One of the major 
jobs in the critical path, for example, might fall into second or third period. 
If resources are constrained, other smaller jobs will come in and out of the 
system at a higher priority than the critical batch job. In a very constrained 
system, the critical job could take three to four times the normal elapsed time 
just because it's running at a low priority. Often the solution to this is to 
move that particular job to a higher priority, single-period, service class. 
But this solution is applied one job at a time as problems are diagnosed.

In general, multi-period production batch is very frustrating to the operators 
and schedulers. If they've released a job, it's because they want it to run 
(now!). They don't want it to lie around the bottom of the resource pool using 
CPU only when nobody else wants it. For production jobs, usually first-in, 
first-out is the desired mode of operation. With single-period production batch 
service classes, that's what you get. With multi-period service classes, the 
job using the most resources will take the longest, sometimes to the detriment 
of the critical batch window. This has been one of the primary causes for sites 
not meeting their critical batch window. If you're having trouble getting your 
batch jobs complete before your online systems come up each day, check to see 
if this is the cause.

** pull-quote - Multi-period production batch service classes are one of the 
primary causes for missing your batch window goals

One alternative for this is to identify the critical jobs in your batch cycle 
and place them in a unique job class assigned to a unique single-period service 
class. This service class would run at a higher velocity and importance than 
other batch, and would exhibit first-in, first-out characteristics.

--
Andrew Rowley
Black Hill Software Pty. Ltd.
Phone: +61 413 302 386

EasySMF for z/OS: Interactive SMF Reports on Your PC
http://www.smfreports.com

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
