OK, so the scrape_interval part of your answer is something I can quickly
understand and deal with.
I'll put that to one side for now, because my interest is in cAdvisor and
how to control it.

To take the parallel example for Prometheus from the IBM Redbook, I have
to:
 1. create a 'prometheus.yml' file
 2. create a Dockerfile, which features a COPY command referring to that
'prometheus.yml' file
 3. do a 'docker build' of the Prometheus image
 4. 'run' the resulting image.
With subsequent 'runs' of Prometheus I can skip steps 1-3, as they have
already been done. (A sketch of the four steps appears below.)
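
As a sketch of those four steps (illustrative only - the scrape target,
port and 'my-prometheus' tag are names I've made up, and the config path
is the one used by the upstream prom/prometheus image; the s390x image
from the Redbook may differ):

# prometheus.yml - scrape cAdvisor every 30 seconds
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

# Dockerfile
FROM prom/prometheus
COPY prometheus.yml /etc/prometheus/prometheus.yml

# build once, then re-use the image on every subsequent run
docker build -t my-prometheus .
docker run -d --name prometheus --network monitoring my-prometheus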
For cAdvisor, there is no 'build' to be done, and thus no copying of a yml
file. The only command I know of for cAdvisor is the 'run' command I
detailed earlier in this thread.
(If the cAdvisor image does not exist locally, it is automatically
downloaded before being started.)
In conceptual terms, am I right in thinking that I'm downloading a
'program' that has already been prepared for execution?
If that is true, then the values of any control parameters appear to be
hard-coded within the program.
Given this, I'm not sure I have any control over the container manifest for
cAdvisor.
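
(A thought, based on generic Docker behaviour rather than anything
zCX-specific: arguments placed after the image name on 'docker run' are
passed to the container's entrypoint. So if the s390x image keeps the
standard cAdvisor entrypoint - an assumption on my part - the compiled-in
values are only defaults, e.g.:

docker run --name cadvisor -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  -v /dev/disk:/dev/disk:ro \
  -d --network monitoring ibmcom/cadvisor-s390x:0.33.0 \
  --housekeeping_interval=15s --docker_only=true

Both flags are documented for the open-source cAdvisor binary.)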

Regards
Sean

On Thu, 27 Aug 2020 at 10:15, Attila Fogarasi <fogar...@gmail.com> wrote:

> Housekeeping interval is part of the container manifest as it governs
> normal operation, not just performance metric collection.  As such it is
> specified wherever you have your container manifest defined (for example, a
> .yaml file or by HTTP endpoint or HTTP server).   You can also use the
> command line "kubelet" tool.
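> For instance (an illustration only - these flag names come from the
> open-source kubelet and cAdvisor documentation, so check your version):
>     kubelet --housekeeping-interval=15s
> or, for the standalone cAdvisor binary, the equivalent flag:
>     cadvisor --housekeeping_interval=15s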
> Scrape_interval is the value for how often Prometheus asks cAdvisor for
> data from the collection cache, thus it affects cpu used by cAdvisor to
> prepare and send this data to Prometheus.
> As for how Docker monitoring works, you are right that there is overlap in
> the open-source tools, but the hierarchy is: cAdvisor collects the metrics
> and also does some aggregation and processing.  You can use just cAdvisor.
> Prometheus is a layer on top, getting metrics from cAdvisor and providing
> both better-quality reporting and alerting (which cAdvisor does not).  The
> next layer is Grafana, which is a generalized metric analytics and
> visualization tool (not just for Docker).  For larger-scale, more complex
> container environments you need all 3.
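> A minimal wiring of the three layers might look like this (a sketch - the
> 'my-prometheus' tag is a made-up local build name, grafana/grafana is the
> upstream image, and the s390x equivalents may differ):
>     docker network create monitoring
>     docker run -d --name cadvisor --network monitoring \
>       ibmcom/cadvisor-s390x:0.33.0
>     docker run -d --name prometheus --network monitoring -p 9090:9090 \
>       my-prometheus
>     docker run -d --name grafana --network monitoring -p 3000:3000 \
>       grafana/grafana
> Grafana would then be pointed at http://prometheus:9090 as its data source.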
> In a z/OS context the equivalent of these 3 tools became integrated circa
> 30 years ago, but for Unix they are not.  Splitting the processing like
> this has both good and bad points (for example, Grafana can run in a
> separate Docker container) but definitely burns more cpu (a lot more).  If
> you are not careful, the measurement tooling can cost more than the
> application being measured, even though it is "free".
>
> On Thu, Aug 27, 2020 at 6:02 PM Sean Gleann <sean.gle...@gmail.com> wrote:
>
> > Hi Attila - thanks for the pointers, but I'm not sure how to act upon
> > them.
> >
> > The start-up command for cAdvisor that I'm using doesn't feature any
> > pointer to a parameter list, and despite much googling I don't see any
> > mention of such a thing. Everything keeps referring back to Prometheus
> > and then on to Grafana.
> > My cAdvisor start-up (taken directly from the IBM Redbook and slightly
> > modified to comply with local restrictions) is:
> > docker network create monitoring
> > docker run --name cadvisor \
> >   -v /sys:/sys:ro \
> >   -v /var/lib/docker/:/var/lib/docker:ro \
> >   -v /dev/disk:/dev/disk:ro \
> >   -d --network monitoring \
> >   ibmcom/cadvisor-s390x:0.33.0
> >
> > Perhaps I'm looking at things the wrong way, but my current understanding
> > is:
> > cAdvisor (and also Nodeexporter) collects various usage stats;
> > Prometheus then gathers that data and does some sort of pre-processing of
> > it (it doesn't tell cAdvisor to 'do something' - it just passively makes
> > use of the data that cAdvisor collects);
> > Grafana takes the data from Prometheus and uses it to generate various
> > graphs/tables/reports.
> >
> > My situation is that when I run cAdvisor on its own - no other containers
> > at all - it floods as many processors as I define in the zCX
> > start.json file.
> >
> > Whilst cAdvisor is running, I can go to the relevant web page and see
> > that it is producing meters/charts, etc. all on its own. Since that is
> > the case, what is the point of Grafana?
> >
> > I have a prometheus.yml file that features the term 'scrape_interval'
> > (but not 'housekeeping'), but that file is for use by Prometheus, isn't
> > it? How does it affect the amount of work that cAdvisor is doing, since
> > I haven't even started that container yet?
> >
> > Regards
> > Sean
> >
> > On Wed, 26 Aug 2020 at 23:05, Attila Fogarasi <fogar...@gmail.com>
> > wrote:
> >
> > > Check your values for housekeeping interval and scrape_interval.
> > > Recommended is 15s and 30s (which makes for a 60-second rate window).
> > > A small value for the housekeeping interval will cause cAdvisor cpu
> > > usage to be high, while scrape_interval affects Prometheus cpu usage.
> > > It is entirely possible to cause data collection to use 100% of the
> > > z/OS cpu -- remember that on Unix systems the rule of thumb is 40%
> > > overhead for uncaptured cpu time, while z/OS is far more efficient and
> > > runs well under 10%.  You will see this behaviour in zCX containers;
> > > it isn't going to measure the same as z/OS workload.  The optimizations
> > > in Unix have the premise that cpu time is low cost (as is memory),
> > > while z/OS considers cpu to be high cost and path length worth saving.
> > > Same for the subsystems in z/OS and performance monitors.
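> > > (To make the 60-second window concrete: with scrape_interval at 30s, a
> > > PromQL query such as
> > >     rate(container_cpu_usage_seconds_total[1m])
> > > always has at least two samples in its 1-minute range. The metric name
> > > is the standard cAdvisor CPU counter, used here purely as an
> > > illustration.)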
> > >
> > > On Wed, Aug 26, 2020 at 11:43 PM Sean Gleann <sean.gle...@gmail.com>
> > > wrote:
> > >
> > > > Allan - "...count the beans differently...' Yes, I'm beginning to get
> > > used
> > > > to that concept. For instance, with the CPU Utilisation data that I
> > > *have*
> > > > been able to retrieve, the metric given is not 'CPU%', but 'Number of
> > > > cores'. I'm having to do some rapid re-orienting to my way of
> thinking.
> > > > As for the memory size, I've got "mem-gb" : 2 defined in my
> start.json
> > > > file, but I've not seen any indication of paging load at all in my
> > > testing.
> > > >
> > > > Michael - 5 zIIPs?   I wish!  Nope - these are all general-purpose
> > > > processors.
> > > > The z/OS system I'm using is a z/VM guest on a system run by an
> > external
> > > > supplier, so I'm not sure if defining zIIPs would actually achieve
> > > anything
> > > > (Is it possible to dedicate a zIIP engine to a specific z/VM guest?
> > > That's
> > > > a road I've not yet gone down).
> > > > With regard to the WLM definitions, I followed the advice in the red
> > book
> > > > and I'm reasonably certain I've got it right. Having said that,
> > > cross-refer
> > > > to a thread that I started earlier this week, titled "WLM Query"
> > > > The response to that led to me defining a resource group to cap the
> > > > started task to 10MSU, which resulted in a CPU% Util value of roughly
> > 5%
> > > -
> > > > something I could be happy with.
> > > > Under that cap, the started task ran, yes, but it ran like a
> > three-legged
> > > > dog (my apologies to limb-count-challenged canines).
> > > > Start-up of the task, from the START command to the "server is
> > > > listening..." message took over an hour, and
> > > > STOP-command-to-task-termination took approx. 30 minutes.
> > > > (SSH-ing to the task was a bit of a joke, too. Responses to simple
> > > commands
> > > > like 'docker ps -a' could be seen 'painting' across the screen,
> > > > character-by-character...)
> > > > As a result, I've moved away from trying to limit the task for the
> time
> > > > being. I'm concentrating on attempting to get cadvisor to be a bit
> less
> > > > greedy.
> > > >
> > > > Regards
> > > > Sean
> > > >
> > > > On Wed, 26 Aug 2020 at 13:49, Michael Babcock <bigironp...@gmail.com
> >
> > > > wrote:
> > > >
> > > > > I can’t check my zCX out right now since my internet is down.
> > > > >
> > > > > You are running these on zIIP engines, correct? Must be nice to
> > > > > have 5 zIIPs!  And have the WLM parts in place?  Although it
> > > > > probably wouldn’t make much difference during startup/shutdown.
> > > > >
> > > > > On Wed, Aug 26, 2020 at 3:40 AM Sean Gleann <sean.gle...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Can anyone offer advice, please, with regard to monitoring the
> > > > > > system resource consumption of a zCX Container task?
> > > > > >
> > > > > > I've got a zCX Container task running on a 'sandbox' system where
> > > > > > - as yet - I'm not collecting any RMF/SMF data. Because of that,
> > > > > > my only source of system usage is the SDSF DA panel. I feel that
> > > > > > the numbers I see there are... 'questionable' is the best word I
> > > > > > can think of.
> > > > > >
> > > > > > Firstly, the EXCP count for the task goes up to about 15360
> > > > > > during the initial start-up phase, but then it stays there until
> > > > > > the STOP command is issued. At that point, the EXCP count starts
> > > > > > rising again, until the task finally terminates. The explanation
> > > > > > is probably that all the I/O is being handled internally at the
> > > > > > 'Linux' level - the task must be doing *some* I/O, right? - but
> > > > > > the data isn't getting back to SDSF for some reason. Without the
> > > > > > benefit of SMF data to examine, I'm wondering if this is part of
> > > > > > a larger problem.
> > > > > >
> > > > > > The other thing that troubles me is the CPU% busy value. My
> > > > > > sandbox system has 5 engines defined, and in the 'start.json'
> > > > > > file that controls the zCX Container task, I've specified a 'cpu'
> > > > > > value of 4. During the start-up phase for the Container started
> > > > > > task, SDSF shows CPU% values of approx. 80%, but when the task is
> > > > > > finally initialised, this drops to 'tickover' rates of about 1%.
> > > > > > I'm happy with that - the initial start-up of *any* task as
> > > > > > complex as a zCX Container is likely to cause high CPU usage, and
> > > > > > the subsequent drop to the 1% level is fine by me.
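> > > > > >
> > > > > > (For reference, the relevant entries in my start.json are along
> > > > > > these lines - a trimmed illustration, not the complete file:
> > > > > >
> > > > > >   {
> > > > > >     "cpu" : 4,
> > > > > >     "mem-gb" : 2
> > > > > >   }
> > > > > >
> > > > > > plus the other zCX provisioning settings.)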
> > > > > >
> > > > > > But... Once the Container task is started and I've ssh'd into it,
> > > > > > I then want to monitor its 'internal' system consumption. I've
> > > > > > been using the 'Getting Started...' Redbook as my guide
> > > > > > throughout this project, and it talks about using "Nodeexporter",
> > > > > > "Cadvisor", "Prometheus" and "Grafana" as tools for this. I've
> > > > > > got all those things installed and I can start and stop them
> > > > > > quite happily, but I've found that using cAdvisor on its own can
> > > > > > drive CPU% levels back up to 80% for the entire time it is
> > > > > > running. If a system is running flat-out when all it is doing is
> > > > > > monitoring itself, well, there's something wrong somewhere... I'm
> > > > > > trying to find an idiot's guide to controlling what cAdvisor
> > > > > > does, but as yet I've been unsuccessful.
> > > > > >
> > > > > > Regards
> > > > > > Sean
> > > > >
> > > > > --
> > > > > Michael Babcock
> > > > > OneMain Financial
> > > > > z/OS Systems Programmer, Lead
> > > > >