*sigh*. Oh, how I wish that various e-mail clients would quit re-formatting stuff. My previous response here was so nice & neat & tidy before I hit 'Send'. Reading that response back via IBM-MAIN makes me look like a complete illiterate...

Sean

On Thu, 27 Aug 2020 at 09:01, Sean Gleann <sean.gle...@gmail.com> wrote:

> Hi Attila - thanks for the pointers, but I'm not sure how to act upon them.
>
> The start-up for Cadvisor that I'm using doesn't feature any pointer to a parameter list, and despite much googling I don't see any mention of such a thing. Everything keeps referring back to Prometheus and then on to Grafana. My Cadvisor start-up (taken directly from the IBM Red Book and slightly modified to comply with local restrictions):
>
> docker network create monitoring
> docker run --name cadvisor -v /sys:/sys:ro -v /var/lib/docker/:/var/lib/docker:ro -v /dev/disk:/dev/disk:ro -d --network monitoring ibmcom/cadvisor-s390x:0.33.0
>
> Perhaps I'm looking at things the wrong way, but my current understanding is:
> Cadvisor (and also Nodeexporter) collect various usage stats;
> Prometheus then gathers that data and does some sort of pre-processing of it (it doesn't tell Cadvisor to 'do something' - it just passively makes use of the data that Cadvisor collects);
> Grafana takes the data from Prometheus and uses it to generate various graphs/tables/reports.
>
> My situation is that when I run Cadvisor on its own - no other containers at all - then it floods as many processors as I define in the zcx start.json file.
>
> Whilst Cadvisor is running, I can go to the relevant web page and see that it is producing meters/charts, etc. all on its own. Since that is the case, what is the point of Grafana?
>
> I have a Prometheus.yml file that features the term 'scrape_interval' (but not 'housekeeping'), but that file is for use by Prometheus, isn't it? How does it affect the amount of work that Cadvisor is doing, since I haven't even started that container yet?
>
> Regards
> Sean
>
> On Wed, 26 Aug 2020 at 23:05, Attila Fogarasi <fogar...@gmail.com> wrote:
>
>> Check your values for the housekeeping interval and scrape_interval. Recommended is 15s and 30s (which makes for a 60-second rate window). A small value for the housekeeping interval will cause cAdvisor CPU usage to be high, while scrape_interval affects Prometheus CPU usage. It is entirely possible to cause data collection to use 100% of the z/OS CPU -- remember that on Unix systems the rule of thumb is 40% overhead for uncaptured CPU time, while z/OS is far more efficient and runs well under 10%. You will see this behaviour in zCX containers; it isn't going to measure the same as z/OS workload. The optimizations in Unix have the premise that CPU time is low cost (as is memory), while z/OS considers CPU to be high cost and path length worth saving. Same for the subsystems in z/OS and performance monitors.
>>
>> On Wed, Aug 26, 2020 at 11:43 PM Sean Gleann <sean.gle...@gmail.com> wrote:
>>
>>> Allan - "...count the beans differently..." Yes, I'm beginning to get used to that concept. For instance, with the CPU Utilisation data that I *have* been able to retrieve, the metric given is not 'CPU%', but 'Number of cores'. I'm having to do some rapid re-orienting of my way of thinking.
>>> As for the memory size, I've got "mem-gb" : 2 defined in my start.json file, but I've not seen any indication of paging load at all in my testing.
>>>
>>> Michael - 5 zIIPs? I wish! Nope - these are all general-purpose processors.
>>> The z/OS system I'm using is a z/VM guest on a system run by an external supplier, so I'm not sure if defining zIIPs would actually achieve anything. (Is it possible to dedicate a zIIP engine to a specific z/VM guest? That's a road I've not yet gone down.)
>>> With regard to the WLM definitions, I followed the advice in the red book and I'm reasonably certain I've got it right. Having said that, cross-refer to a thread that I started earlier this week, titled "WLM Query". The response to that led to me defining a resource group to cap the started task to 10 MSU, which resulted in a CPU% Util value of roughly 5% - something I could be happy with.
>>> Under that cap, the started task ran, yes, but it ran like a three-legged dog (my apologies to limb-count-challenged canines). Start-up of the task, from the START command to the "server is listening..." message, took over an hour, and STOP-command-to-task-termination took approx. 30 minutes. (SSH-ing to the task was a bit of a joke, too. Responses to simple commands like 'docker ps -a' could be seen 'painting' across the screen, character-by-character...)
>>> As a result, I've moved away from trying to limit the task for the time being. I'm concentrating on attempting to get cadvisor to be a bit less greedy.
>>>
>>> Regards
>>> Sean
>>>
>>> On Wed, 26 Aug 2020 at 13:49, Michael Babcock <bigironp...@gmail.com> wrote:
>>>
>>>> I can't check my zCX out right now since my internet is down.
>>>>
>>>> You are running these on zIIP engines, correct? Must be nice to have 5 zIIPs! And have the WLM parts in place? Although it probably wouldn't make much difference during startup/shutdown.
>>>>
>>>> On Wed, Aug 26, 2020 at 3:40 AM Sean Gleann <sean.gle...@gmail.com> wrote:
>>>>
>>>>> Can anyone offer advice, please, with regard to monitoring the system resource consumption of a zcx Container task?
>>>>>
>>>>> I've got a zcx Container task running on a 'sandbox' system where - as yet - I'm not collecting any RMF/SMF data. Because of that, my only source of system usage is the SDSF DA panel. I feel that the numbers I see there are... 'questionable' is the best word I can think of.
>>>>>
>>>>> Firstly, the EXCP-count for the task goes up to about 15360 during the initial start-up phase, but then it stays there until the STOP command is issued. At that point, EXCP-count starts rising again, until the task finally terminates. The explanation for that is probably because all the I/O is being handled internally at the 'Linux' level - the task must be doing *some* I/O, right? - but the data isn't getting back to SDSF for some reason. Without the benefit of SMF data to examine, I'm wondering if this is part of a larger problem.
>>>>>
>>>>> The other thing that troubles me is the CPU% busy value. My sandbox system has 5 engines defined, and in the 'start.json' file that controls the zcx Container task, I've specified a 'cpu' value of 4.
>>>>> During the start-up phase for the Container started task, SDSF shows CPU% values of approx. 80%, but when the task is finally initialised, this drops to 'tickover' rates of about 1%. I'm happy with that - the initial start-up of *any* task as complex as a zcx Container is likely to cause high CPU usage, and the subsequent drop to the 1% levels is fine by me.
>>>>>
>>>>> But... once the Container task is started and I've ssh'd into it, I then want to monitor its 'internal' system consumption. I've been using the 'Getting Started...' redbook as my guide throughout this project, and it talks about using "Nodeexporter", "Cadvisor", "Prometheus" and "Grafana" as tools for this. I've got all those things installed and I can start and stop them quite happily, but I've found that using Cadvisor on its own can drive CPU% levels back up to 80% for the entire time it is running. If a system is running flat-out when all it is doing is monitoring itself, well, there's something wrong somewhere... I'm trying to find an idiot's guide to controlling what Cadvisor does, but as yet I've been unsuccessful.
>>>>>
>>>>> Regards
>>>>> Sean
>>>>
>>>> --
>>>> Michael Babcock
>>>> OneMain Financial
>>>> z/OS Systems Programmer, Lead
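
A sketch to go with Attila's housekeeping-interval point: upstream cAdvisor is tuned with command-line flags rather than a config file, and anything placed after the image name on the docker run command is passed straight through to the cAdvisor binary. Assuming the ibmcom/cadvisor-s390x:0.33.0 image accepts the same flags as the upstream 0.33.0 build (worth verifying against the flag list for that build before relying on it), a less greedy start-up could look something like this:

  # Same docker run as in the thread, with cAdvisor's own tuning flags
  # appended after the image name (they go to the cAdvisor binary, not to docker)
  docker run --name cadvisor \
    -v /sys:/sys:ro \
    -v /var/lib/docker/:/var/lib/docker:ro \
    -v /dev/disk:/dev/disk:ro \
    -d --network monitoring \
    ibmcom/cadvisor-s390x:0.33.0 \
    --housekeeping_interval=15s \
    --global_housekeeping_interval=1m \
    --docker_only=true \
    --disable_metrics=disk,network,tcp,udp,percpu,sched

--housekeeping_interval sets how often cAdvisor samples each container, which is what drives its own CPU cost (15s matches the recommendation above); --docker_only and --disable_metrics simply shrink the amount of work done on each housekeeping pass. The metric names accepted by --disable_metrics vary between cAdvisor releases, so trim that list to whatever your build reports.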
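
On the Prometheus side, Sean's reading is right: scrape_interval lives in prometheus.yml and only controls how often Prometheus polls its targets, so it does nothing to cAdvisor's own housekeeping work - and nothing at all until the Prometheus container is started. A minimal sketch using the 30s value suggested above; the job name and the cadvisor:8080 target are assumptions based on cAdvisor's default port and the 'monitoring' docker network, not something taken from the Redbook:

  # prometheus.yml - controls Prometheus's polling, not cAdvisor's sampling
  global:
    scrape_interval: 30s          # how often Prometheus pulls from each target
    evaluation_interval: 30s

  scrape_configs:
    - job_name: 'cadvisor'        # hypothetical job name
      static_configs:
        - targets: ['cadvisor:8080']   # cAdvisor's default listening port

As for "what is the point of Grafana?": cAdvisor's built-in web page only keeps a couple of minutes of history in memory, so Prometheus is there to retain the data over time and Grafana to chart it - neither of them changes how hard cAdvisor itself works.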
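
Finally, the ceiling cAdvisor keeps driving the instance to is simply whatever the zCX start.json allows it. Only two fields are quoted in the thread, so this fragment just restates them with the spellings used above; the actual start.json for an instance has other entries as well, so treat it as a fragment rather than a template:

  {
    "cpu" : 4,
    "mem-gb" : 2
  }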