Hi Jon,

On 11/18/2024 12:58 PM, Jon Turney wrote:
On 13/11/2024 06:21, Mark Geisert wrote:
This program provides an up-to-the-moment load average measurement.  The
user can take 1 sample, or obtain the average of N samples by number or

Sorry about the inordinate time it's take for me to look at this.

No worries, I understand.

So, this seems like two separate things shoved together

* A daemon which calls getloadavg() every 5 seconds
* A tool which exercises the loadavg estimation code

Does it really make sense to bundle them together?

I considered putting the daemon's loop in cygserver as a new thread. That's what one user on the main list suggested; another user suggested a separate program. I thought shoving it into cygserver was less appealing somehow.

The point of the non-daemon mode is in the first line of mine quoted up top. The program outputs a load average number that can be used to gate steps within a multi-step workload. Given the wacky volatility of the counters necessary to calculate a load average, the program can help one decide how many loadavg samples result in a reasonable result given the activity on the user's system.

time duration.  There is a daemon mode to have the global load average
stats updated such that 'uptime' and other tools provide a more reasonable
load average calculation over time.

I'm wondering what's "unreasonable" about the current behavior (estimate is calculated on demand, in a rate-limited fashion)

If it's that working correctly (after your fixes), is this actually needed (i.e. what difference does also having loadavg in daemon mode running make)?

I think the steady-state behaviour of 'uptime' is fairly good. It's the getting it to move up from zero initially that seems to be a problem, per multiple posts on the main list.

I say "fairly good" because the volatility of the counters sabotages the effort. When getloadavg() determines it's time to recalculate, it just reads the counters once. But the counters are being updated at 64Hz; on a system with no compute-bound processes they have a fairly high probability of being zero when read.

BTW that's why I added the "-v" option.  To see the volatility.

A release note is provided for Cygwin 3.5.5.
Documentation for doc/utils.xml is pending.

A man page isn't optional :)

It's now written and waiting to be included in v3 of this patch.

Or, I wonder if it really makes sense to ship this as useful to end- users (as opposed to marking it noinst_ in Makefile.am, so we just keep it around for developers)

I don't think I've seen any Cygwin devs or package maintainers mention load average. They're either cross-building from Linux and/or happy with whatever their systems can provide :-). There was some activity a few weeks ago on the main list and random complaints about 'uptime' reporting zero a couple of times before that, going back a ways. So this tool is a response to the user posts.

Reported-by: Mark Liam Brown <brownmarkl...@gmail.com>
Addresses: https://cygwin.com/pipermail/cygwin/2024-August/256361.html
Signed-off-by: Mark Geisert <m...@maxrnd.com>
Fixes: N/A (new code)

---
  winsup/cygwin/release/3.5.5 |   8 +
  winsup/utils/Makefile.am    |   2 +
  winsup/utils/loadavg.c      | 380 ++++++++++++++++++++++++++++++++++++
  3 files changed, 390 insertions(+)
  create mode 100644 winsup/utils/loadavg.c

[...]
+
+/* estimate the current load based on certain counter values */
+double
+get_load (void)
+{
+  PDH_STATUS ret = PdhCollectQueryData (query);
+  if (ret != ERROR_SUCCESS) {
+    fprintf (stderr, "PdhCollectQueryData %s\n", pdh_status (ret));
+    return 0.0;
+  }
+
+  /* Estimate the number of running processes as
+     (NumberOfProcessors) * (% Processor Time) */
+  PDH_FMT_COUNTERVALUE fmtvalue1;
+  ret = PdhGetFormattedCounterValue (counter1, PDH_FMT_DOUBLE, NULL, &fmtvalue1);
+  if (ret != ERROR_SUCCESS) {
+    fprintf (stderr, "PdhGetFormattedCounterValue#1 %s\n", pdh_status (ret));
+    return 0.0;
+  }
+
+  double ncpus = (double) sysinfo.dwNumberOfProcessors;
+  if (verbose) {
+    fprintf (stdout, "%llu ", GetTickCount64 ());
+
+    fprintf (stdout, "ncpus: %d, %%ptime: %3.2f, ", (int) ncpus,
+             fmtvalue1.doubleValue);
+  }
+  double running = fmtvalue1.doubleValue * ncpus / 100.0;
+
+  /* Estimate the number of runnable threads as
+     (ProcessorQueueLength) / (NumberOfProcessors) */
+  double rql = 0.0;
+  PDH_FMT_COUNTERVALUE fmtvalue2;
+  ret = PdhGetFormattedCounterValue (counter2, PDH_FMT_LONG, NULL, &fmtvalue2);
+  if (ret == ERROR_SUCCESS) {
+    rql = (double) fmtvalue2.longValue;
+    rql /= ncpus; /* make the measure a per-cpu queue length */
+  } else {
+    ++once; /* counter is missing; just print an error message once (below) */

I'd tend to make this static with function scope, rather than a global, but you are entitled to a different opinion.

I like your choice on this.  I'll update.

+  }
+
+  double load = running + rql;
+  if (verbose ) {
+    fprintf (stdout, "running: %3.2f, effrql: %3.2f, load: %3.2f\n",
+             running, rql, load);
+  }
+  if (once == 1)
+    fprintf (stderr, "PdhGetFormattedCounterValue#2 %s\n", pdh_status (ret));
+
+  ++samples;
+  return load;
+}
+
+int
+start_daemon (void)
+{
+  char  buf[132];
+  int   fd = -1;
+  pid_t pid = 0;
+
+top:
+  fd = open (PIDFILE, O_RDWR | O_CREAT | O_EXCL);
+  if (fd == -1) {
+    fd = open (PIDFILE, O_RDONLY);
+    if (fd == -1) {
+      fprintf (stderr, "unable to open pid file\n");
+      return 1;
+    }
+
+    memset (buf, 0, sizeof buf);
+    read (fd, buf, sizeof buf);
+    close (fd);
+
+    pid = atoi (buf);
+    /* does pid still exist? */
+    if (kill (pid, 0) == 0) {
+      fprintf (stderr, "daemon already running as pid %d\n", pid);
+      return 1;
+    }
+    unlink (PIDFILE);
+    goto top;
+  }
+  fchmod (fd, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
+
+  (void) daemon (0, 0);
+
+  /* this code runs in the daemon; the parent process has exited via daemon() */
+  snprintf (buf, sizeof buf, "%d", getpid ());
+  write (fd, buf, strlen (buf));
+  /* don't close(fd).. keep it open to protect against another daemon */
+
+  double loadavg[3];
+  while (1) {
+    (void) getloadavg (loadavg, 3);
+    sleep (5/*seconds*/);
+  }
+  return 0;
+}
+
+int
+stop_daemon (void)
+{
+  char     buf[132];
+  int      fd = -1;
+  ssize_t  len = 0;
+  pid_t    pid = 0;
+  char    *winpath = NULL;
+
+  fd = open (PIDFILE, O_RDONLY);
+  if (fd == -1) {
+    fprintf (stderr, "unable to open pid file\n");
+    return 1;
+  }
+  memset (buf, 0, sizeof buf);
+  read (fd, buf, sizeof buf);
+  close (fd);
+
+  pid = atoi (buf);
+  /* does pid still exist? */
+  if (kill (pid, 0) == -1) {
+    fprintf (stderr, "daemon pid %d already gone\n", pid);
+    unlink (PIDFILE);
+    return 1;
+  }
+
+  /* using kill() to terminate the daemon fails (why?); work around that */

Yeah, why! This is hugely suspicious.

I forget if getloadavg() is sigwrapper wrapped, but if not, that might explain it (or not, the daemon should spend nearly all of it's time in sleep(), right?)

I agree.  I'll try to debug further.  Yes the daemon is mostly in Sleep().
Thanks for your review,

..mark

Reply via email to