Re: vm_pageout_scan badness

2000-12-01 Thread News History File User

Long ago, it was written here on 25 Oct 2000 by Matt Dillon:

> :Consider that a file with a huge number of pages outstanding
> :should probably be stealing pages from its own LRU list, and
> :not the system, to satisfy new requests.  This is particularly
> :true of files that are demanding resources on a resource-bound
> :system.
> :...
> :   Terry Lambert
> :   [EMAIL PROTECTED]
> 
> This isn't exactly what I was talking about.  The issue in regards to
> the filesystem syncer is that it fsync()'s an entire file.  If
> you have a big file (e.g. a USENET news history file) the 
> filesystem syncer can come along and exclusively lock it for
> *seconds* while it is fsync()ing it, stalling all activity on
> the file every 30 seconds.
[...]
> One of the reasons why Yahoo uses MAP_NOSYNC so much (causing the problem
> that Alfred has been talking about) is because the filesystem
> syncer is 'broken' in regards to generating unnecessarily long stalls.
> 
> Personally speaking, I would much rather use MAP_NOSYNC anyway, even with
> a fixed filesystem syncer.   MAP_NOSYNC pages are not restricted by
> the size of the filesystem buffer cache, so you can have a whole
> lot more dirty pages in the system than you would normally be able to
> have.  This 'feature' has had the unfortunate side effect of screwing
> up *THWACK*

Yeah, no kidding -- here's what I see it screwing up.  First, some
background:

I've built three news machines, two transit boxen and one reader box,
with recent INN k0dez, and 4.2-STABLE of a few days ago (having tested
NetBSD, more on that later), and a brief detour into 5-current.

The two transit boxes have roughly 400MB of memory or less; the amount
I've put in the reader box has crept up to a gig as I try to figure out
what's happening.  On all of them I'm using MAP_NOSYNC on the history
database files, to try to match the NetBSD behaviour of barely touching
the history disk, and I've made a couple of other minor tweaks to use
mmap where the INN history code probably should, but doesn't.
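
For reference, the core of that mapping looks roughly like this -- a
minimal sketch with an illustrative path, not the actual lib/dbz.c code;
MAP_NOSYNC is the FreeBSD flag that keeps the filesystem syncer away
from the dirty pages:

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/news/db/history.hash";   /* illustrative path */
    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Shared, writable mapping; with MAP_NOSYNC the dirty pages are
       only written back when the pager is forced to evict them, or
       when msync() is called explicitly. */
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_NOSYNC, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... dbz lookups and updates would go through 'base' here ... */

    munmap(base, (size_t)st.st_size);
    close(fd);
    return 0;
}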

Everything starts out well: the history disk is beaten at startup, but
as time passes the time taken for lookups and writes drops to near zero
and the disk goes quiet.  The transit machines stay that way, while the
reader machine gives me problems after some time.

What I notice is that the amount of memory used keeps increasing, until
it's all used, and the Free amount shown by `top' drops to a meg or so.
Cache and Buf get a bit, but most of it is Active.  Far more than is
accounted for by the processes.

Now, what happens on the reader machine is that after some time of the
Active memory increasing, it runs out and starts to swap out processes,
and the timestamps on the history database files (.index and .hash, this
is the md5-based history) get updated, rather than remaining at the
time INN is started.  Then the previously rapid history times skyrocket
until history activity takes more than a quarter of INN's time.  I don't
see this on the transit boxen
even after days of operation.

Now, what happens when I stop INN and everything news-related is that
some memory is freed up, but there can still be, say, 400MB reported
as Active.  More when I had a full gig in this machine to try to keep
it from swapping, all of which got used...

Then, when I reboot the machine, it gives the kernel messages about
syncing disks; done, and then suddenly the history drive light goes
on and it starts grinding for five minutes or so, before the actual
reboot happens.

No history activity happens when I shut down INN normally, which should
free the MAP_NOSYNC'ed pages and make them available to be written to
disk before rebooting, maybe.
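
If the shutdown path called msync() on the mappings explicitly, those
dirty NOSYNC pages would get pushed out when INN stops instead of at
reboot time.  A minimal sketch, assuming the base/length from the
mmap() above (the function name is mine):

#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>

static void flush_history_map(void *base, size_t len)
{
    /* MS_SYNC blocks until the dirty pages have actually been written,
       so a shutdown path calling this gets the NOSYNC'ed data onto the
       disk before any reboot. */
    if (msync(base, len, MS_SYNC) < 0)
        perror("msync");
}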


I'm also running BerkeleyDB for the reader overview on this machine,
and I just discovered that I had applied MAP_NOSYNC to an earlier
release, but the library actually linked in didn't have it.  I've
fixed that and am running that way now (and see a noticeable
improvement), so when I reboot I may see both the overview database
disk and the history disk get some pre-reboot activity, if what I
think is happening really is happening.

What I think is happening, based on these observations, is that the
data from the history hash files (less than 100MB) gets read into
memory, but the updates to it are not written over the data to be
replaced -- it's simply appended to, up to the limit of the available
memory.  When this limit is reached on the transit machines, then
things stabilize and old pages get recycled (but still, more memory
overall is used than the size of the actual file).

I'm guessing that additional activity on the reader machine causes
jumps in memory usage not seen on the transit machines, enough to
force some of the unwritten dirty pages to be written to the
history file as a few megs of swap get used, which is why it does
not sta

Re: vm_pageout_scan badness

2000-12-01 Thread News History File User

> :> Personally speaking, I would much rather use MAP_NOSYNC anyway,
> :> even with
> :...
> :Everything starts out well, where the history disk is beaten at startup
> :but as time passes, the time taken to do lookups and writes drops down
> :to near-zero levels, and the disk gets quiet.  And actually, the transit
> :...
> :What I notice is that the amount of memory used keeps increasing, until
> :it's all used, and the Free amount shown by `top' drops to a meg or so.
> :Cache and Buf get a bit, but most of it is Active.  Far more than is
> :accounted for by the processes.
> 
> This is to be expected, because the dirty MAP_NOSYNC pages will not
> be written out until they are forced out, or by msync().

I just discovered the user command `fsync', which has revealed a few
things to me and cleared up some mysteries.  I've also watched more
closely what happens to the available memory following a fresh boot...
At the moment, this (reader) machine has been up for half a day, with
performance barely able to keep up with a full feed (and starting to
slip as the overnight burst of binaries arrives); at last look, history
lookups and writes were accounting for more than half (!) of the INN
news process time, with available idle time essentially zero.  So...
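
(For the record, the `fsync' utility is essentially a thin wrapper
around fsync(2).  A rough approximation of what it does -- my sketch,
not the actual source:)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int i, rc = 0;

    for (i = 1; i < argc; i++) {
        int fd = open(argv[i], O_RDONLY);
        if (fd < 0) { perror(argv[i]); rc = 1; continue; }
        /* Force all dirty buffers/pages for this file out to disk. */
        if (fsync(fd) < 0) { perror(argv[i]); rc = 1; }
        close(fd);
    }
    return rc;
}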


> :Now, what happens on the reader machine is that after some time of the
> :Active memory increasing, it runs out and starts to swap out processes,
> :and the timestamps on the history database files (.index and .hash, this
> :is the md5-based history) get updated, rather than remaining at the
> :time INN is started.  Then the rapid history times skyrocket until it
> :takes more than 1/4 of the time.  I don't see this on the transit boxen
> :even after days of operation.
> 
> Hmm.  That doesn't sound right.  Free memory should drop to near zero,
> but then what should happen is the pageout daemon should come along
> and deactivate a big chunk of the 'active' pages... so you should
> see a situation where you have, say, 200MB worth of active pages
> and 200MB worth of inactive pages.  After that the pageout daemon
> should start paging out the inactive pages and increasing the 'cache'.
> The number of 'free' pages will always be near zero, which is to be
> expected.  But it should not be swapping out any process.

Here is what I noticed while watching the `top' values for Active,
Inactive, and Free following this last boot (I didn't pay any attention
to the other fields to notice any wild fluctuations there, next time
maybe), on this machine with 512MB of RAM, if it reveals anything:

Following the boot, things start out with plenty of memory Free, and
something like 4MB Active, which seems reasonable to me.  Then I start
things.

As is to be expected, INN increases in size as it does history lookups
and updates, and the amount of memory shown as Active tracks this,
more or less.  But what's happening to the Free value!  It's going
down at as much as 4MB per `top' interval.  Or should I say, what is
happening to the Inactive value -- it's constantly increasing, and I
observe a rapid migration of all the Free memory to Inactive, until
the value of Inactive peaks out at the time that Free drops to about
996k, beyond which it changes little.  None of the swap space has
been touched yet.

As soon as the value for Free hits bottom and that of Inactive has
reached its max, the migration starts happening from Inactive to
Active -- until this point, the value of Active has been roughly what
I would expect to see, given the size of the history hash/index files
and the BerkeleyDB file (for which I'm now also using MAP_NOSYNC, a
definite improvement in overview access times).

Anyway, I don't remember what values exactly I was seeing for Free
and Inactive or Active, since I was just watching for general trends,
but I seem to recall Active being ~100MB, and Inactive somewhat more.

(Are you saying above that this Inactive value should be migrating to
Cache, which I'm not seeing, rather than to Active, which I do see?
If so, then hmmm.)

Now memory is drifting at a fairly rapid pace from Inactive (the
meaning of which I'm not exactly clear about, although there's some
explanation in the `top' man page that hasn't quite clicked into
understanding yet), over to the Active field, at something like 2MB
or so per `top' interval.  Free remains close to 1MB, but Active is
constantly growing, although no processes are clearly taking up any
of this, apart from INN which only accounts for around 100MB at this
time, and isn't increasing at the rate of increase of Active memory.

Anyway, the Active field continues to increase as Inactive decreases
until finally Inactive bottoms out, down from several hundred MB to
a one or two digit MB value (I don't remember exactly), while Active
has increased to almost 400MB.  This is something like 20 minutes
after the reboot, and now the first bit of swap gets hit.  However,
the value of A

Re: vm_pageout_scan badness

2000-12-02 Thread News History File User

> :but at last look, history lookups and writes are accounting for more
> :than half (!) of the INN news process time, with available idle time
> :being essentially zero.  So...
> 
> No idle time?  That doesn't sound like blocked I/O to me, it sounds
> like the machine has run out of cpu.

Um, I knew I'd be unclear somehow.  The machine itself (with 2 CPUs)
has plenty of idle time -- `top' reports typically 70-80% idle, and
INN takes 20-40% of CPU (this being SMP, a process like `perl' pegging
one CPU shows up as around 98%, unlike a certain other OS that reports
that percentage against the system total rather than a particular CPU).

What I mean is that the INN process timer -- basically Joe Greco's
timer code, which wraps key functions with start/stop calls to show
where INN spends its time -- is showing little to no idle time
(meaning INN couldn't take in more articles no matter how hard I push
them).  Let me show you the timer stats from the time I started things
not long ago on this reader machine, where it's taking in backlogs:


Dec  3 04:33:47 crotchety innd: ME time 300449 idle 376(4577)
 (all times in milliseconds: "ME time" is the ~5-minute elapsed time,
 "idle" the idle time; numbers in parentheses are call counts, which
 matter mainly for artwrite -- articles actually written to spool --
 hiswrite -- unique articles received in this period -- and hishave --
 history lookups done)
 artwrite 52601(6077) artlink 0(0) hiswrite 40200(7035) hissync 11(14)
 (artwrite: ~53 seconds writing articles; hiswrite: ~40 seconds
 updating history)
 sitesend 647(12154) artctrl 2297(308) artcncl 2288(308) hishave 38857(26474)
 (hishave: ~39 seconds doing history lookups)
 hisgrep 70(111) artclean 12264(6930) perl 13819(6838) overv 112176(6077)
 python 0(0) ncread 13818(21287) ncproc 284413(21287)

Dec  3 04:38:48 crotchety innd: ME time 301584 idle 406(5926) artwrite 55774(6402) 
artlink 0(0) hiswrite 25483(7474) hissync 15(15) sitesend 733(12805) artctrl 1257(322) 
artcncl 1245(321) hishave 22114(28196) hisgrep 90(38) artclean 12757(7295) perl 
14696(7191) overv 136855(6402) python 0(0) ncread 14446(23235) ncproc 284767(23235) 

(as time passes and more of the MAP_NOSYNC file is in memory, the time
needed for history writes/lookups drops)
[...]
Dec  3 04:58:49 crotchety innd: ME time 300047 idle 566(6272) artwrite 59850(6071) 
artlink 0(0) hiswrite 11630(6894) hissync 33(14) sitesend 692(12142) artctrl 324(244) 
artcncl 320(244) hishave 13614(24312) hisgrep 0(77) artclean 13232(6800) perl 
14531(6727) overv 156723(6071) python 0(0) ncread 15116(23838) ncproc 281745(23838) 
Dec  3 05:03:49 crotchety innd: ME time 300018 idle 366(5936) artwrite 56956(6620) 
artlink 0(0) hiswrite 8850(7749) hissync 7(15) sitesend 760(13240) artctrl 255(160) 
artcncl 255(160) hishave 9944(25198) hisgrep 0(31) artclean 13441(7753) perl 
15605(7620) overv 164223(6620) python 0(0) ncread 14783(24123) ncproc 282791(24123) 

Most of the time is spent on the BerkeleyDB overview now.  This is
probably because some reader is giving repeated commands pounding
the overview database.  That reader's IP now has a different gateway
address, and won't be bothering me for a while.

Now, for a reference, here are the timings on a transit-only machine
with no readers, after it's been running for a while:


Dec  3 05:22:09 news-feed69 innd: ME time 30 idle 91045(91733)
 (a reasonable amount of idle time)
 artwrite 48083(2096) artlink 0(0) hiswrite 1639(2096) hissync 33(11)
 sitesend 4291(12510) artctrl 0(0) artcncl 0(0) hishave 1600(30129)
 hisgrep 0(0) artclean 25591(2121) perl 79(2096) overv 0(0) python 0(0)
 ncread 69798(147925) ncproc 108624(147919)

A total of just over 3 seconds out of every 300 seconds spent on
history activity.  That's reflected by the timestamps on the NOSYNC'ed
history database (index/hash) files you see here:

-rw-rw-r--  1 news  news  436206889 Dec  3 05:22 history
-rw-rw-r--  1 news  news 67 Dec  3 05:22 history.dir
-rw-rw-r--  1 news  news   8100 Dec  1 01:55 history.hash
-rw-rw-r--  1 news  news   5400 Nov 30 22:49 history.index

However, the timings shown by `top' here show from 10 to 20% idle CPU
time, even though INN itself has capacity to do more work.


The problem is that I'm not seeing this on the reader box.  Or if I
do see it, it doesn't last long.  The timestamps on the above files
are pretty much current, in spite of the files being NOSYNC'ed.


> :As is to be expected, INN increases in size as it does history lookups
> :and updates, and the amount of memory shown as Active tracks this,
> :more or less.  But what's happening to the Free value!  It's going
> :down at as much as 4MB per `top' interval.  Or should I say, what is
> :happening to the Inactive value -- it's constan

Re: vm_pageout_scan badness

2000-12-04 Thread News History File User

> ok, since I got about 6 requests in four hours to be Cc'd, I'm 
> throwing this back onto the list.  Sorry for the double-response that
> some people are going to get!

Ah, good, since I've been deliberately avoiding reading mail in an
attempt to get something useful done in my last days in the country,
and probably wouldn't get around to reading it until I'm without Net
access in a couple weeks...

(Also because your mailer seems to be ignoring the `Reply-To:' header
I've been using; this way I'd get a copy through the Cc: list -- in
case you puzzled over why your previous messages bounced.)


> I am going to include some additional thoughts in the front, then break
> to my originally private email response.

I'll mention that I've discovered the miracle of man pages, and found
the interesting `madvise' capability of `MADV_WILLNEED' that, from the
description, looks very promising.  Pity the results I'm seeing still
don't match my expectations.

Also, in case the amount of system memory on this machine might be
insufficient to do what I want with the size of the history.hash/.index
files, I've just gotten an upgrade to a full gig.  Unfortunately, now
performance is worse than it had been, so it looks like I'll be
butchering the k0deZ to see if I can get my way.

Now, for `madvise' -- this is already used in the INN source in lib/dbz.c
(where one would add MAP_NOSYNC to the MAP__FLAGS) as MADV_RANDOM --
this matches the random access pattern of the history hash table.
Supposedly, MADV_WILLNEED will tell the system to avoid freeing these
pages, which looks to be my holy grail of this week, plus the immediate
mapping that certainly can't hurt.

There's only a single madvise call in the INN source, but I see that the
Diablo code makes two calls to it (although with WILLNEED and, unlike
INN, SEQUENTIAL access -- this could be part of the cause of the apparent
misunderstanding of the INN history file that I see below).  Since it
looks to my non-programmer eyes like I can't combine the behaviours in a
single call, I followed Diablo's example and specified both RANDOM and
the WILLNEED that I thought would improve things.
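
Roughly what I ended up with, as a sketch rather than the actual dbz.c
diff -- the advice values aren't bit flags, so RANDOM and WILLNEED go
in two separate calls over the same mapping:

#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>

static void advise_history_map(void *base, size_t len)
{
    /* Access pattern on the hash table is random... */
    if (madvise(base, len, MADV_RANDOM) < 0)
        perror("madvise(MADV_RANDOM)");
    /* ...and we'd like the whole thing faulted in up front. */
    if (madvise(base, len, MADV_WILLNEED) < 0)
        perror("madvise(MADV_WILLNEED)");
}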

The machine is, of course, as you can see from the timings, not optimized
at all, since I've just thrown something together as a proof of concept
having run into a brick wall with the codes under test with Slowaris.
And because a departmental edict has come down that I must migrate all
services off Free/NetBSD and onto Slowaris, I can't expect to get the
needed hardware to beef up the system -- even though the MAP_NOSYNC
option on the transit machine enabled it to whup the pants off a far
more expensive chunk of Sun hardware.  So I'm trying to be able to say
`Look, see? see what you can do with FreeBSD' as I'm shown out the door.


> I ran a couple of tests with MAP_NOSYNC to make sure that the
> fragmentation issue is real.  It definitely is.  If you create a
> file by ftruncate()ing it to a large size, then mmap() it SHARED +
> NOSYNC, then modify the file via the mmap, massive fragmentation occurs

I've heard it confirmed that even the newer INN does not mmap() the
newly-created files for makehistory or expire.  As reported to the
INN-workers mailing list:

: From: [EMAIL PROTECTED] (Richard Todd)
: Newsgroups: mailing.unix.inn-workers
: Subject: Re: expire/makehistory and mmap/madvise'd dbz filez
: Date: 4 Dec 2000 06:30:47 +0800
: Message-ID: <90ehin$1ndk$[EMAIL PROTECTED]>
: 
: In servalan.mailinglist.inn-workers you write:
: 
: >Moin moin
: 
: >I'm engaged in a discussion on one of the FreeBSD developer lists
: >and I thought I'd verify the present source against my memory of how
: >INN 1.5 runs, to see if I might be having problems...
: 
: >Anyway, the Makefile in the 1.5 expire directory has the following bit,
: >that seems to be absent in present source, and I didn't see any
: >obvious indication in the makedbz source as to how it's initializing
: >the new files, which, if done wrong, could trigger some bugs, at least
: >when `expire' is run.
: 
: ># Build our own version of dbz.o for expire and makehistory, to avoid
: ># any -DMMAP in DBZCFLAGS - using mmap() for dbz in expire can slow it
: ># down really bad, and has no benefits as it pertains to the *new* .pag.
: >dbz.o: ../lib/dbz.c
: >   $(CC) $(CFLAGS) -c ../lib/dbz.c
: 
: >Is this functionality in the newest expire, or do I need to go a hackin'?
: 
: Whether dbz uses mmap or not on a given invocation is controlled by the 
: dbzsetoptions() call; look for that call and setting of the INCORE_MEM 
: option in expire/expire.c and expire/makedbz.c.  Neither expire nor
: makedbz mmaps the new dbz indices it creates. 

The remaining condition I'm not positive about is an overflow, which
ideally would never need to be considered, and isn't happening on this
machine now.


> on the file.  This is easily demonstrated by issuing a sequential read
> on the file and noting that the syste

Re: vm_pageout_scan badness

2000-12-05 Thread News History File User

Howdy,
I'm going to breach all sorts of ethics in the worst way by following
up to my own message, just to throw out some new info...  'kay?


Matt wrote, and I quote --
: > However, I noticed something interesting!

Of course I clipped away the interesting Thing, but note the following
that I saw...


: INN after adding the memory, I did a `cp -p' on both the history.hash
: and history.index files, just to start fresh and clean.  It didn't seem
[...]
: > There is an easy way to test file fragmentation.  Kill off everything
: > and do a 'dd if=history of=/dev/null bs=32k'.  Do the same for 
: > history.hash and history.index.  Look at the iostat on the history
: > drive.  Specifically, do an 'iostat 1' and look at the KB/t (kilobytes
: > per transfer).  You should see 32-64KB/t.  If you see 8K/t the file
: > is severely fragmented.  Go through the entire history file(s) w/ dd...
: 
: Okay, I'm doing this:  The two hash-type files give me between 9 and
: 10K/t; the history text file gives me more like 60KB/t.  Hmmm.  It's

Now, remember what Matt wrote, that partially-cached data played havoc
with read-ahead.  That is apparently what I was seeing here, pulling
some bit of data off the disk proper, but then pulling a chunk of data
that was cached, and so on.

I figured that out as I attempted to copy one of the files to create an
unfragmented copy to test transfer size and saw the expected 64K (well
DUH, that was the write size), and then attempted to `dd' these to /dev/null
and saw ... no disk activity.  The file was in cache.  Bummer.

Oh well, I had to reboot anyway for some reason, and did so.  Immediately
after reboot I `dd'ed the two database files and got the expected 64K/t
of an unfragmented file.  I also made copies of them just to push their
contents into memory, because...


: The actual history lookups and updates that matter are all done within
: the memory taken up by the .index and .hash files.  So, by keeping
: them in memory, one doesn't need to do any disk activity at all for
: lookups, and updates, well, so long as you commit them to the disk at
: shutdown, all should be okay.  That's what I'm attempting to achieve.
: These lookups and updates are bleedin' expensive when disk activity
: rears its ugly head.
: 
: Not to worry, I'm going to keep plugging to see if there is a way for
: me to lock these two files into memory so that they *stay* there, just
: to prove whether or not that's a significant performance improvement.
: I may have to break something, but hey...

I b0rked something.  I `fixed' the mlock operation to allow a lowly user
such as myself to use it, just as proof of concept.  (I still need to do
a bit of tuning, I can see, but hey, I got results)

So I attempt to pass all the madvise suggestions I can for both the
history.index and .hash files, and then I attempt to mlock both of them.
I don't get a failure, although the history.hash file (108MB) doesn't
quite achieve the desired results -- I do see Good Things with the
smaller history.index (72MB, and don't remind me that 1MB really isn't
10^6 bytes).
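
The locking itself is nothing more than mlock() over each mapping,
roughly like this sketch (assuming the base/length from the mmap();
normally this needs root or a big enough memorylocked limit, hence my
`fix'):

#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>

static int wire_history_map(void *base, size_t len)
{
    /* Wire the mapped pages so the pageout daemon leaves them alone;
       they show up under 'Wired' in top once this succeeds. */
    if (mlock(base, len) < 0) {
        perror("mlock");
        return -1;
    }
    return 0;
}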

Anyway, the number of `Wired' Megs in `top' is up from 71MB to 200+,
and after some hours of operation, look at the timestamps of the two
database files (the .n.* files are those I copied after reboot, and
serve as a nice reference for when I started things)

-rw-rw-r--  1 news  news  755280213 Dec  5 19:05 history
-rw-rw-r--  1 news  news 57 Dec  5 19:05 history.dir
-rw-rw-r--  1 news  news  10800 Dec  5 19:05 history.hash
-rw-rw-r--  1 news  news   7200 Dec  5 08:44 history.index
-rw-rw-r--  1 news  news  10800 Dec  5 08:43 history.n.hash
-rw-rw-r--  1 news  news   7200 Dec  5 08:44 history.n.index

So, okay, history.hash still sees disk activity, but look at a handful
of INN timer stats following the boot:


The last two stats with the default vm k0deZ before restart:

Dec  5 08:30:40 crotchety innd: ME time 301532 idle 28002(120753)
 artwrite 70033(2853) artlink 0(0) hiswrite 49396(3097) hissync 28(6)
 (hiswrite: ~49 seconds updating history)
 sitesend 460(5706) artctrl 296(25) artcncl 295(25) hishave 32016(8923)
 (hishave: ~32 seconds of history lookups)
 hisgrep 45(10) artclean 20816(3150) perl 12536(3082) overv 29927(2853)
 python 0(0) ncread 33729(152735) ncproc 227796(152735) 

80 seconds of 300 spent on history activity...  urk...  on a steady-state
system with a few readers that had been running for some hours.

Dec  5 08:35:37 crotchety innd: ME time 300052 idle 16425(136209) artwrite 77811(2726) 
artlink 0(0) hiswrite 35676(2941) hissync 28(6) sitesend 571(5450) artctrl 454(41) 
artcncl 451(41) hishave 33311(7392) hisgrep 55(14) artclean 22778(3000) perl 
14137(2914) overv 28516(2726) python 0(0) ncread 38832(172145) ncproc 226513(172145) 

[REB00T]

Dec  5 08:59:32 crotchety innd: ME time 300059 idle 62840(189385)
 artwrite 68361(5580) artlink 0(0) hiswrite 8782(6567) hissync 104(12

Re: vm_pageout_scan badness

2000-12-06 Thread News History File User

> :The mlock man page refers to some system limit on wired pages; I get no
> :error when mlock()'ing the hash file, and I'm reasonably sure I tweaked
> :the INN source to treat both files identically (and on the other machines
> :I have running, the timestamps of both files remains pretty much unchanged).
> :I'm not sure why I'm not seeing the desired results here with both files
> 
> I think you are on to something here.  It's got to be mlock().  Run
> 'limit' from csh/tcsh and you will see a 'memorylocked' resource.
> Whatever this resource is as of when innd is run -- presumably however
> it is initialized for the 'news' user (see /etc/login.conf) is going

Yep, `unlimited'...  same as the bash `ulimit -a'.  OH NO.  I HAVE IT
SET TO `infinity' IN LOGIN DOT CONF, no wonder it is all b0rken-like.

The weird thing is that mlock() does return success, the amount of
wired memory matches the two files, and I've seen nothing obvious in
the source code as to why it's different, but I'll keep plugging away
at it.
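
One thing I can still do is check, from inside the daemon itself, what
memorylocked limit it actually inherited rather than trusting the
shell; a quick sketch using getrlimit(2):

#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    struct rlimit rl;

    /* RLIMIT_MEMLOCK is the 'memorylocked' resource that csh's
       `limit' and login.conf are talking about. */
    if (getrlimit(RLIMIT_MEMLOCK, &rl) < 0) {
        perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("memorylocked: unlimited\n");
    else
        printf("memorylocked: cur=%ju max=%ju\n",
               (uintmax_t)rl.rlim_cur, (uintmax_t)rl.rlim_max);
    return 0;
}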


> History files are notorious for random I/O... the problem is due
> to the hash table being, well, a hash table.  The hash table 
> lookups are bad enough but this will also result in random-like
> lookups on the main history file.  You get a little better
> locality of reference on the main history file (meaning the system

Ah, but...  This is where the recent history format (based on MD5
hashes), introduced as dbz v6 around the time you were busy with Diablo
and your own history mechanism there, differs from the one you remember
-- and, speaking of your 64-bit CRC history mechanism, whatever happened
to the links that would get you there from the backplane homepage?... --
in this case, you don't do the random-like lookups into the text file to
verify message-ID presence at all.  Everything is done with the data in
the two hash tables.  At least for transit.  I'm not sure whether reader
requests require a hit on the main file -- it'd be worth pointing a
Diablo frontend at such a box to see how it does there, even when the
overview performance for traditional readership is, uh, suboptimal.  I
think they do, but that's a trivial seek to one specific known offset.

I'm sure this is applicable to other databases somehow, for those who
aren't doing news and are bored stiff by this.
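
For the bored-stiff crowd, a loose conceptual sketch of the idea --
emphatically not the real dbz v6 layout: hash the message-ID, probe
only the in-memory table, and never touch the big history text file to
decide presence:

#include <stdint.h>
#include <stddef.h>

struct hash_table {
    const uint32_t *slots;   /* the mmapped .hash contents (illustrative) */
    size_t          nslots;
};

/* Open-addressing presence check keyed on a hash of the message-ID. */
static int history_have(const struct hash_table *ht, uint32_t idhash)
{
    size_t i, start = idhash % ht->nslots;

    for (i = 0; i < ht->nslots; i++) {
        uint32_t slot = ht->slots[(start + i) % ht->nslots];
        if (slot == 0)
            return 0;        /* empty slot: never seen */
        if (slot == idhash)
            return 1;        /* matching hash value: already have it */
    }
    return 0;
}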


> At the moment madvise() MADV_WILLNEED does nothing more then activate
> the pages in question and force them into the process'es mmap.
> You have to call it every so often to keep the pages 'fresh'... calling
> it once isn't going to do anything.  

Well, it definitely does do a Good Thing when I call it once, as you
can see from the initial timer numbers, which approach the long-running
values I'm used to (something I previously tried to simulate by doing
lookups on a small fraction of history entries in the hope of activating
most of the needed pages -- not perfect, but a decent hack).  You can
see from the timestamps of the debugging here that while it slows down
the startup somewhat, the work of reading in the data happens quickly
and is a definite positive tradeoff:

Dec  6 07:32:14 crotchety innd: dbz openhashtable /news/db/history.index
Dec  6 07:32:14 crotchety innd: dbz madvise WILLNEED ok
Dec  6 07:32:14 crotchety innd: dbz madvise RANDOM ok
Dec  6 07:32:14 crotchety innd: dbz madvise NOSYNC ok
Dec  6 07:32:27 crotchety innd: dbz mlock ok
Dec  6 07:32:27 crotchety innd: dbz openhashtable /news/db/history.hash
Dec  6 07:32:27 crotchety innd: dbz madvise WILLNEED ok
Dec  6 07:32:27 crotchety innd: dbz madvise RANDOM ok
Dec  6 07:32:27 crotchety innd: dbz madvise NOSYNC ok
Dec  6 07:32:38 crotchety innd: dbz mlock ok

This happens quickly when the data is still in cache, leading me to
believe it's something else affecting the .hash file (I added the
madvise() MADV_NOSYNC call just in case somehow it wasn't happening
in the mmap() for some reason):

Dec  6 09:29:34 crotchety innd: dbz openhashtable /news/db/history.index
Dec  6 09:29:34 crotchety innd: dbz madvise WILLNEED ok
Dec  6 09:29:34 crotchety innd: dbz madvise RANDOM ok
Dec  6 09:29:34 crotchety innd: dbz madvise NOSYNC ok
Dec  6 09:29:34 crotchety innd: dbz mlock ok
Dec  6 09:29:34 crotchety innd: dbz openhashtable /news/db/history.hash
Dec  6 09:29:34 crotchety innd: dbz madvise WILLNEED ok
Dec  6 09:29:34 crotchety innd: dbz madvise RANDOM ok
Dec  6 09:29:34 crotchety innd: dbz madvise NOSYNC ok
Dec  6 09:29:34 crotchety innd: dbz mlock ok


> You may be able to achieve an effect very similar to mlock(), but
> runnable by the 'news' user without hacking the kernel, by 

Yeah, sounds like a hack, but I figured out what was going on earlier
with my mlock() hack -- INN and the reader daemon now use a dynamically
linked library, so the nnrpd processes were also trying to mlock() the
files.  Hmmm.  Either I can statically compile INN (which I chose to
do) or I can further butcher the source by attempting to