from:"Stephen McKay"

Stuck in "objtrm"

1999-07-02 Thread Stephen McKay


I have an old 486 here that I thrash to death occasionally.  Well, at least
I try to get it to page to death.  I started a make world last week and
forgot about it.

Today I noticed that it's been stuck for most of the week.  Almost everything
is fine, but one cc1 process is stuck in "objtrm".  Oh, and I hung a "cat
/proc/31624/map", too, trying to get some details (now stuck in "thrd_sleep").

So, am I just tripping over some old long-fixed bug?  Or is this a new one
worth investigating?  The kernel is from 1999/06/16 (just before the
vfs_cluster.c commit).

Stephen.


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message

Re: Stuck in "objtrm"

1999-07-05 Thread Stephen McKay

On Friday, 2nd July 1999, Stephen McKay wrote:

>I have an old 486 here that I thrash to death occasionally.  Well, at least
>I try to get it to page to death.  I started a make world last week and
>forgot about it.
>
>Today I noticed that it's been stuck for most of the week.  Almost everything
>is fine, but one cc1 process is stuck in "objtrm".  Oh, and I hung a "cat
>/proc/31624/map", too, trying to get some details (now stuck in "thrd_sleep").
>
>So, am I just tripping over some old long-fixed bug?  Or is this a new one
>worth investigating?  The kernel is from 1999/06/16 (just before the
>vfs_cluster.c commit).

Well, it's happened again, but this time it is a recent -current, less than
a day old.  After a couple hours of heavy paging (yes, this is a slow box),
the make world hangs with cc1 in "objtrm".  All the other processes seem to
be waiting for it to exit.  It's the only cc1 around, by the way, even
though it was a -j5 parallel compile.

All other machine functions are fine.  ps, top, vmstat, et al show normal
looking values.  Does anybody have any hints on how to debug this?  I know
that "objtrm" implies that paging is in progress on some object, even
though there's no paging happening, and so it's probably an accounting
error with object->paging_in_progress.  But other than that, I'm not sure
where to look.

Stephen.


Re: Stuck in "objtrm"

1999-07-06 Thread Stephen McKay


On Tuesday, 6th July 1999, Stephen McKay wrote:

>the make world hangs with cc1 in "objtrm"...

I'm having a fun old conversation with myself here! ;-)

Here's some concrete info:

(kgdb) p/x *(struct vm_object*) 0xc32ea21c
$13 = {object_list = {tqe_next = 0xc3389e58, tqe_prev = 0xc323fdec},
  shadow_head = {tqh_first = 0x0, tqh_last = 0xc32ea224}, shadow_list = {
    tqe_next = 0xc327b8dc, tqe_prev = 0xc32cb734}, memq = {
    tqh_first = 0xc0308e80, tqh_last = 0xc03046ec}, generation = 0x3004,
  type = 0x1, size = 0x2a7, ref_count = 0x0, shadow_count = 0x0,
  pg_color = 0x5, hash_rand = 0xfd9a69d7, flags = 0x21c8,
  paging_in_progress = 0x1, behavior = 0x0, resident_page_count = 0x9,
  backing_object = 0x0, backing_object_offset = 0x0, last_read = 0x14,
  pager_object_list = {tqe_next = 0xc323c438, tqe_prev = 0xc323a424},
  handle = 0x0, un_pager = {vnp = {vnp_size = 0x16}, devp = {devp_pglist = {
        tqh_first = 0x16, tqh_last = 0x0}}, swp = {swp_bcount = 0x16}}}

The high points:
ref_count=0
shadow_count=0
type=1 (OBJT_SWAP)
paging_in_progress=1
resident_page_count=9
flags=0x21c8 (onemapping, mightbedirty, writeable, pipwnt, dead)
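
The flag decode above can be reproduced mechanically.  A minimal sketch, assuming the OBJ_* bit values below (taken from memory of the vm_object.h of that era, so check them against your own tree); the names follow the author's list and decode_flags() is a hypothetical helper:

```python
# Assumed bit values for the five flags named in the post.  They add up to
# 0x21c8, which is consistent with the author's decode, but verify them
# against sys/vm/vm_object.h before relying on them.
OBJ_FLAGS = {
    0x2000: "OBJ_ONEMAPPING",
    0x0100: "OBJ_MIGHTBEDIRTY",
    0x0080: "OBJ_WRITEABLE",
    0x0040: "OBJ_PIPWNT",
    0x0008: "OBJ_DEAD",
}


def decode_flags(flags):
    """Return the names of the known bits set in flags, highest bit first.

    Any bits not in OBJ_FLAGS are reported as one trailing hex string.
    """
    names = []
    known = 0
    for bit in sorted(OBJ_FLAGS, reverse=True):
        known |= bit
        if flags & bit:
            names.append(OBJ_FLAGS[bit])
    leftover = flags & ~known
    if leftover:
        names.append(hex(leftover))
    return names


if __name__ == "__main__":
    print(decode_flags(0x21c8))
```

OBJ_DEAD together with OBJ_PIPWNT is what one would expect for an object whose terminator is asleep waiting for the paging count to drain.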

A typical memory page from this object:

(kgdb) p/x *(struct vm_page*) 0xc02ffd90
$14 = {pageq = {tqe_next = 0xc0317dc0, tqe_prev = 0xc02f1960}, hnext = 0x0, 
  listq = {tqe_next = 0xc0317dc0, tqe_prev = 0xc02f196c}, object = 0xc32ea21c, 
  pindex = 0x2f, phys_addr = 0x4f4000, queue = 0x41, flags = 0x0, pc = 0x34, 
  wire_count = 0x0, hold_count = 0x0, act_count = 0x8, busy = 0x0, 
  valid = 0xff, dirty = 0xff}

The high points:
queue=inactive
flags=0
wire_count=0
hold_count=0
busy=0
valid=ff
dirty=ff

All 9 of them are like that: busy is 0 and PG_BUSY is not set in flags.  No
paging is really in progress after all, so the object's paging_in_progress
count of 1 must be wrong.
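
Why a stale count hangs the exit can be shown with a toy model.  This is only a sketch of the accounting idea, not the kernel code: the class and method names are hypothetical, loosely modelled on the pip add/wakeup pairing.

```python
class ToyVmObject:
    """Toy model of an object's paging_in_progress (PIP) accounting."""

    def __init__(self):
        self.paging_in_progress = 0

    def pip_add(self, n=1):
        # Called when paging I/O starts on the object.
        self.paging_in_progress += n

    def pip_wakeup(self, n=1):
        # Called when the paging I/O completes.
        self.paging_in_progress -= n

    def terminate_would_sleep(self):
        # Termination waits (in "objtrm") until the count drains to zero.
        return self.paging_in_progress != 0


if __name__ == "__main__":
    balanced = ToyVmObject()
    balanced.pip_add()
    balanced.pip_wakeup()
    print("balanced object sleeps:", balanced.terminate_would_sleep())

    leaked = ToyVmObject()
    leaked.pip_add()          # some path forgets the matching wakeup
    print("leaked object sleeps:", leaked.terminate_would_sleep())
```

If any code path increments the count without a matching decrement, nothing is left to issue the wakeup and the terminating process sleeps forever, which matches the symptoms here.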

Who was watching what code changed recently?  Remember I had this problem
on a kernel from 1999/06/16 too.  So it's an "old" problem.

Off to research the next installment...

Stephen.

PS  I haven't worked out yet how to find the stack of the errant process.
Any hints?  The stack trace should be helpful.



Re: Stuck in "objtrm"

1999-07-06 Thread Stephen McKay


On Tuesday, 6th July 1999, Andrew Gallatin wrote:

>Yes.  say 'proc pidhashtbl[PID & pidhash]->lh_first' in kgdb.
>I suspect that it will be in exit() also..

Magic!

It looks like a plain old exit() to me.

(kgdb) proc pidhashtbl[27157&pidhash]->lh_first
(kgdb) bt
#0  mi_switch () at ../../kern/kern_synch.c:827
#1  0xc014a5bd in tsleep (ident=0xc32ea21c, priority=4,
    wmesg=0xc023db84 "objtrm", timo=0) at ../../kern/kern_synch.c:443
#2  0xc01e9741 in vm_object_terminate (object=0xc32ea21c)
    at ../../vm/vm_object.h:230
#3  0xc01e96f1 in vm_object_deallocate (object=0xc32ea21c)
    at ../../vm/vm_object.c:382
#4  0xc01e6acb in vm_map_entry_delete (map=0xc3047440, entry=0xc3240190)
    at ../../vm/vm_map.c:1680
#5  0xc01e6c89 in vm_map_delete (map=0xc3047440, start=0, end=3217022976)
    at ../../vm/vm_map.c:1783
#6  0xc01e6d1d in vm_map_remove (map=0xc3047440, start=0, end=3217022976)
    at ../../vm/vm_map.c:1808
#7  0xc0141d20 in exit1 (p=0xc322f0a0, rv=0) at ../../kern/kern_exit.c:220
#8  0xc0141b24 in exit1 (p=0xc322f0a0, rv=-1021614488)
    at ../../kern/kern_exit.c:106
#9  0xc020e41a in syscall (frame={tf_fs = 47, tf_es = 137297967, 
  tf_ds = -1078001617, tf_edi = 136021320, tf_esi = 0, 
  tf_ebp = -1077947348, tf_isp = -1020915756, tf_ebx = -1, 
  tf_edx = 135690384, tf_ecx = 136200192, tf_eax = 1, tf_trapno = 12, 
  tf_err = 2, tf_eip = 135656524, tf_cs = 31, tf_eflags = 582, 
  tf_esp = -1077947368, tf_ss = 47}) at ../../i386/i386/trap.c:1056
#10 0xc0202cc0 in Xint0x80_syscall ()
error reading /proc/27157/mem
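
The kgdb expression above takes the first entry of the hash chain, so it finds the right process only when that process is at the head of its bucket.  A minimal sketch of the general lookup, assuming a hypothetical mask of 0x3f and modelling each bucket as a plain list (the kernel uses a linked list of struct proc):

```python
def pidhash_bucket(pid, pidhash):
    """Bucket index, as in the expression pidhashtbl[PID & pidhash]."""
    return pid & pidhash


def find_proc(pidhashtbl, pidhash, pid):
    """Walk the whole chain instead of trusting the first entry."""
    for proc in pidhashtbl.get(pidhash_bucket(pid, pidhash), []):
        if proc["p_pid"] == pid:
            return proc
    return None


if __name__ == "__main__":
    PIDHASH = 0x3F  # hypothetical mask; the real value depends on the kernel
    table = {}
    # Two pids that collide in the same bucket under this mask.
    for pid in (27221, 27157):
        bucket = pidhash_bucket(pid, PIDHASH)
        table.setdefault(bucket, []).append({"p_pid": pid})

    bucket = pidhash_bucket(27157, PIDHASH)
    print("first in chain:", table[bucket][0]["p_pid"])
    print("chain walk    :", find_proc(table, PIDHASH, 27157)["p_pid"])
```

In kgdb it is worth printing p_pid of the selected proc before trusting the backtrace.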



Re: Stuck in "objtrm" - live kernel test to run

1999-07-09 Thread Stephen McKay


On Thursday, 8th July 1999, Matthew Dillon wrote:

>There is a way we can find out for sure.  For any of you with processes
>stuck in objtrm, see if you can gdb the kernel and get a backtrace
>of that process to see if it might be in a state where a previous
>call context is holding a PIP count on the object.

Just for completeness, here's mine again, done with your ps trick:

(kgdb) back
#0  mi_switch () at ../../kern/kern_synch.c:827
#1  0xc014a5bd in tsleep (ident=0xc32ea21c, priority=4,
    wmesg=0xc023db84 "objtrm", timo=0) at ../../kern/kern_synch.c:443
#2  0xc01e9741 in vm_object_terminate (object=0xc32ea21c)
    at ../../vm/vm_object.h:230
#3  0xc01e96f1 in vm_object_deallocate (object=0xc32ea21c)
    at ../../vm/vm_object.c:382
#4  0xc01e6acb in vm_map_entry_delete (map=0xc3047440, entry=0xc3240190)
    at ../../vm/vm_map.c:1680
#5  0xc01e6c89 in vm_map_delete (map=0xc3047440, start=0, end=3217022976)
    at ../../vm/vm_map.c:1783
#6  0xc01e6d1d in vm_map_remove (map=0xc3047440, start=0, end=3217022976)
    at ../../vm/vm_map.c:1808
#7  0xc0141d20 in exit1 (p=0xc322f0a0, rv=0) at ../../kern/kern_exit.c:220
#8  0xc0141b24 in exit1 (p=0xc322f0a0, rv=-1021614488)
    at ../../kern/kern_exit.c:106
#9  0xc020e41a in syscall (frame={tf_fs = 47, tf_es = 137297967, 
  tf_ds = -1078001617, tf_edi = 136021320, tf_esi = 0, 
  tf_ebp = -1077947348, tf_isp = -1020915756, tf_ebx = -1, 
  tf_edx = 135690384, tf_ecx = 136200192, tf_eax = 1, tf_trapno = 12, 
  tf_err = 2, tf_eip = 135656524, tf_cs = 31, tf_eflags = 582, 
  tf_esp = -1077947368, tf_ss = 47}) at ../../i386/i386/trap.c:1056
#10 0xc0202cc0 in Xint0x80_syscall ()
error reading /proc/27157/mem

And for extra points:

(kgdb) frame 4
#4  0xc01e6acb in vm_map_entry_delete (map=0xc3047440, entry=0xc3240190)
    at ../../vm/vm_map.c:1680
1680            vm_object_deallocate(entry->object.vm_object);
(kgdb) p/x *entry
$10 = {prev = 0xc3047460, next = 0xc3249e38, start = 0x81c6000,
  end = 0x8458000, avail_ssize = 0x0, object = {vm_object = 0xc32ea21c,
    sub_map = 0xc32ea21c}, offset = 0x15000, eflags = 0x0, protection = 0x7,
  max_protection = 0x7, inheritance = 0x1, wired_count = 0x0}

I haven't made any clever conclusions from this, but you might do better.
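
One small cross-check on these numbers: the map entry and the object dump agree with each other.  A sketch of the arithmetic, assuming the usual 4 KB i386 page size; the values are copied from the two kgdb dumps in this thread:

```python
PAGE_SIZE = 4096  # assumption: i386 page size

# From the vm_map_entry dump.
start, end, offset = 0x81C6000, 0x8458000, 0x15000
# From the vm_object dump and the sample vm_page.
object_size_pages = 0x2A7
sample_pindex = 0x2F

entry_pages = (end - start) // PAGE_SIZE      # pages mapped by this entry
first_pindex = offset // PAGE_SIZE            # first object page it maps
end_pindex = first_pindex + entry_pages       # one past the last page

if __name__ == "__main__":
    print("entry pages :", hex(entry_pages))
    print("first pindex:", hex(first_pindex))
    print("end pindex  :", hex(end_pindex))
    print("covers to end of object:", end_pindex == object_size_pages)
    print("sample page inside entry:",
          first_pindex <= sample_pindex < end_pindex)
```

So the entry maps the object from page 0x15 up to its last page, and the sample page at pindex 0x2f falls inside that range.  That says nothing about who leaked the paging count, but it does suggest the map entry itself is sane.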

>Note:  the process cannot be swapped out, so if you've had a process
>stuck in objtrm for a long time try doing a "ps axfl" to force its
>upages in and then gdb should be able to backtrace it.  The 'f' in the ps
>does that.

Cute.  After the ps axlf, all the swapped out processes went from 0 to 8 KB
resident.  But the stuck process stayed at 0 KB resident.  It wasn't
swapped out anyway, according to the ps flags, so it should have had some
resident pages.  Seems like a contradiction to me.

Stephen.



Re: Stuck in "objtrm" - live kernel test to run

1999-07-11 Thread Stephen McKay

On Saturday, 10th July 1999, Matthew Dillon wrote:

>I'm trying to simulate your 486 setup.  You must love pain!  A make -j5
>buildworld on a 16MB-limited machine pages like hell (200-400 pageins/sec
>AND 200-400 pageouts/sec simultaneously, almost continuously).

Maximal pain, maximal gain! :-)  The only reason I'm using a big, powerful
486 is that my 386 here died and there were none left to replace it.  With
NFS src and obj, make world was taking over a week.  No joking.

>Are you
>using any special sysctls or special kernel config options?

I have been using "sysctl -w vm.swap_async_max=2" for a while.  It seems
to help throughput on this machine, and definitely helps interactive
performance.  I suspect that a few extra I/O limiters, or some sort of
I/O rate quota system would help enforce fairness even on faster machines.
For example, we have a performance anomaly with squid on 3.2 that could
be over-eager pagedaemon behaviour flooding the I/O system.

>Also, try the latest -CURRENT and see if you can still get it stuck in
>objtrm.  I haven't had any luck so far in my simulation.  If you still
>get stuck in objtrm then try Alan's patch and see if that has an effect.

Maybe you should send me your latest patch, the atomic_* fixer, and I'll give
it a whirl.  It hasn't turned up in the cvs-cur CTM patches yet.

Stephen.


Softupdates reliability?

1999-08-22 Thread Stephen McKay


I had a recent crash on my home box that makes me question the reliability
of softupdates.  My home box runs 3.2-R, but as far as I can determine,
there have been no reliability fixes to softupdates since then.  So a failure
here should be relevant to -current.

Hardware: K6-2/300, 64MB ECC SDRAM, Fireport40 (ncr 875j) U/W SCSI,
2xDCAS-34330 + 1xDDRS-39130 disks (all U/W from IBM), Toshiba CD, Exabyte 8200.
This hardware has run happily for a long time, and often experiences high
load.

I was extracting from the Exabyte to the DDRS disk while applying a CTM
update from that disk against one of the DCAS disks when it crashed.  The
Exabyte went wonky (took about 6 goes to get the tape ejected) and the
rest of the disk system locked up.  The SCSI adapter was so confused I
had to power down.

When it recovered, there was an "UNEXPECTED SOFT UPDATE INCONSISTENCY"
which turned out to be a referenced but unallocated inode, and two zero size
directories.  In all, a couple dozen files ended up in lost+found.

So, what do people think is the most likely:

  1) the SCSI adapter told the disk to write crap to a couple places on the
 disk (breaking an inode and some directories)

  2) softupdates can't handle a sudden interruption (leaving many unwritten
 blocks)

If many other people have survived sudden power loss or similar no-sync type
crashes, I'll be happy to believe option 1 caused my problem.  If not, then
perhaps softupdates still has incomplete handling of dependencies.  Of course,
if it is option 1, I'm keen to know what's wrong with the current driver!

Stephen.

By the way, I have a 3.2-stable machine at work on which I installed revision
1.27 of softupdates, instead of the 1.20.2.2 == 1.24 normally included.  It
hasn't crashed ever :-) so I don't know if that will cause me problems.  It
works fine in normal operation.  Perhaps it is time to MFC.



Small fix to netstat argument processing

2000-01-06 Thread Stephen McKay


I've got very used to an alias ns='netstat -f inet' which lets me do all
the things I like to do without annoying me with stuff I don't want to
see.  All the options that don't care about the address family just ignore
that option.  Or, used to.

Recently that changed, and "netstat -f inet -i" in particular changed to
give the -f flag priority over the -i flag.  This makes no sense to me,
so I intend to commit this patch:

--- netstat/main.c.old  Tue Jan  4 16:14:46 2000
+++ netstat/main.c  Thu Jan  6 18:19:24 2000
@@ -460,9 +460,6 @@
 */
 #endif
         if (iflag) {
-                if (af != AF_UNSPEC)
-                        goto protostat;
-
                 kread(0, 0, 0);
                 intpr(interval, nl[N_IFNET].n_value, NULL);
                 exit(0);
@@ -501,7 +498,6 @@
                 exit(0);
         }
 
-  protostat:
         kread(0, 0, 0);
         if (af == AF_INET || af == AF_UNSPEC)
                 for (tp = protox; tp->pr_name; tp++)

It removes the special case that specifically makes "netstat -f inet -i"
act the opposite to the way it used to (and the way I expect).
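
The behaviour change can be modelled in a few lines.  This is a simplified sketch that looks only at the -i flag and the address family and ignores every other netstat option; the function names are hypothetical and the strings merely label which display would be chosen:

```python
AF_UNSPEC = 0  # no -f given
AF_INET = 2    # -f inet


def mode_with_special_case(iflag, af):
    """Dispatch with the 'goto protostat' special case still present."""
    if iflag:
        if af != AF_UNSPEC:
            return "protocol statistics"   # -f wins over -i
        return "interface statistics"
    return "protocol statistics"


def mode_after_patch(iflag, af):
    """Dispatch once the special case is removed: -i always wins."""
    if iflag:
        return "interface statistics"
    return "protocol statistics"


if __name__ == "__main__":
    for label, af in (("netstat -i", AF_UNSPEC), ("netstat -f inet -i", AF_INET)):
        print(label, "|", mode_with_special_case(True, af),
              "->", mode_after_patch(True, af))
```

Only the "-f inet -i" combination changes; plain "netstat -i" behaves the same either way.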

Any problems, folks?  Is there some bizarre IPv6 impact I've not seen?

Hmm, I've just noticed some small misalignment of column headings in the
default output.  I'll fix that too.

Stephen.

PS Roll on 4.0-RELEASE!



Re: Small fix to netstat argument processing

2000-01-06 Thread Stephen McKay

On Thursday, 6th January 2000, Yoshinobu Inoue wrote:

>Do these patches fix your problem, or is another, better
>fix desired? Please give me your opinion.

It passes all my tests.  Please commit it.  Thank you!

And earlier you wrote:

>Because now there is interface statistics display mode, when, e.g.
>
>  netstat -s -I bar0 -f inet6
>
>is specified. (though this is inet6 only now.)

I see where you are going now.  The syntax of netstat, already complex,
is becoming even more complex.  More detail in the man page will be
necessary soon.  Also, the "iflag" variable might have too many uses
now.  But this can wait, now that the immediate difficulties have
been resolved.

Stephen.


Crash from ^T during heavy paging

2000-01-09 Thread Stephen McKay


I'm currently giving 4.0 a thrashing in the best way I know.  I run way too
much stuff and let it page madly all day.  Here's how I killed it:

1) pick a 32MB box
2) make -j20 buildworld
3) lean on ^T and let autorepeat go for it

Soon it dies in calcru() called from ttyinfo().  The stack trace showed
that I caught it part way through a fork().  In calcru(), p->p_stats has
a bad value because it is initialised in vm_fork() sometime *after* the
P_INMEM flag is set, and there are some M_WAITOK mallocs between them.

The problem is that calcru() thinks that P_INMEM means that the proc
structure is fully and accurately populated.  But P_INMEM is one of the
first flags set.

A few places test for p->p_stats == NULL but that doesn't look applicable
since p->p_stats is uninitialised in this case.  Hmm.  I can't see any
use for that test at first glance.

So, calcru() and possibly some other places, are looking at a struct proc
before it's all there.  What's the "proper" way to do it?

Stephen.



Re: Why not a default number of pings?

2000-01-17 Thread Stephen McKay

On Tuesday, 18th January 2000, "Leif Neland" wrote:

>I've been hit by a "forgotten ping" again.
>
>I still do not see a reason for not having a default number of pings, instead 
>of infinite.  The only reason I've seen is "It's always been so".

I find this argument rather odd.  Train yourself to not forget your
running ping.  If you forget ping, then you probably forget to log out,
forget to back up your machine, or forget your car keys.  It's not ping's
fault.

>Even if a default of 4 pings is not acceptable, because windows does it that
>way, why not a large default then?

A large but finite default is a surprise to seasoned users.  That's bad.
A small default is also a surprise, but you get the surprise quickly.

>If somebody _really_ wants to ping forever, let them use -t0, and defend the
>rest of us from our blunders of forgetting a ping, keeping the line open
>infinitely.

alias ping='ping -c4'

What's so hard about this?  Why break ping for the rest of us when you have
total control of your own circumstances?

>How about a MAX_PING=3600 in make.conf or so?

Unnecessary cruft.  We have plenty of cruft already, and don't need any more.

Stephen.


That fix for the ^T crash

2000-01-27 Thread Stephen McKay


Hi, Brian!

I'm concerned that your fix won't make it before the code freeze.  Is
there a problem with it?  I admit I haven't actually tested it. :-(
My excuse is that I assumed you had.

Or should I just do a quick test on your patch (+ bde fixes) and commit
it myself?

Stephen.



Re: Problems installing FreeBSD 4.0 20000125-CURRENT

2000-01-28 Thread Stephen McKay

On Thursday, 27th January 2000, "Rodney W. Grimes" wrote:

>> < 
>said:
>> 
>> >> 3. On the first reboot after installing, the keyboard was in a funny
>> >> state.

>I have seen this on numerous occasions, but have never tracked it down
>to any one specific thing.  All on desktop and servers, but that's
>only because we don't do laptops.
>
>I have not seen it in quite some time (about a month), so I am thinking
>it has probably been unknowingly fixed someplace.  I'll keep an eye
>out for it.

I had this problem on several machines back around version 3.2.  I assumed
it was a problem between X11 and the keyboard driver.  I added a 2 second
delay before starting xdm and had no problems after that.  I've not seen
the problem without X11 being involved.  I admit I just forgot about it
after I got my workstation going. :-(

Stephen.


/dev/random limited to irq < 16

2000-03-01 Thread Stephen McKay


I found out much to my surprise that our SMP box is not collecting ANY
entropy for /dev/random.  All the interesting IRQs are over 16, and
nobody uses the console.

From sys/i386/i386/mem.c 1.79:

/*
 * XXX the data is 16-bit due to a historical botch, so we use
 * magic 16's instead of ICU_LEN and can't support 24 interrupts
 * under SMP.
 */

Why don't we just flip this from a 16 bit to a 32 bit parameter in time
for 4.0-RELEASE?  Should just require a quick fiddle in mem.c and in
rndcontrol.
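
The limitation is easy to see with a bitmask: an IRQ of 16 or more has no bit to occupy in a 16-bit mask.  A minimal sketch, where irq_mask_bit() is a hypothetical helper and not code from mem.c or rndcontrol:

```python
def irq_mask_bit(irq, width):
    """Return the mask bit for irq, or 0 if it does not fit in width bits."""
    return (1 << irq) & ((1 << width) - 1)


if __name__ == "__main__":
    for irq in (5, 15, 16, 17, 23):
        print("irq %2d  16-bit: %#07x  32-bit: %#09x"
              % (irq, irq_mask_bit(irq, 16), irq_mask_bit(irq, 32)))
```

With a 16-bit parameter every IRQ from 16 up maps to an empty mask, so it can never be selected as an entropy source; widening the parameter to 32 bits gives each of the 24 SMP interrupts its own bit.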

Stephen.



Re: Softupdates reliability?

1999-08-24 Thread Stephen McKay

On Tuesday, 24th August 1999, Peter Jeremy wrote:

>The exact order of events is not clear from this.  In general, I'd say
>that if something managed to upset the SCSI bus sufficiently to
>confuse every target on it, then there's a reasonable likelihood that
>data transfers were also corrupted.  A serious bus corruption during a
>disk write (either command or data phase) would have a reasonable
>chance of resulting in corrupt data on the disk (either the wrong data
>in the right place or the right data in the wrong place).

Yes, I can't tell whether the confused SCSI adapter upset the Exabyte and
maybe zeroed some disk sectors, or whether the Exabyte went bananas first
and took out everything else.  This system gets a LOT of use (I'm using it
right now), but the Exabyte obviously isn't used as often as the disks.
I might move the Exabyte on to an aha1542 as a precaution.

>I'm not sure how to go about isolating the problem.  I don't suppose
>you happened to bump one of the cables, or suffer a power glitch?

No power glitch or bumped cables.  All quality gear, no overclocking, good
cooling, surge suppressors, etc.  I don't like "It was just one of those
things".  That's not how computers work.  I've either got bad hardware or
there are bugs.  To counter the bugs, I'm about to go to the latest -stable.
Bad hardware will show itself eventually.  What I really should do is
build a test system with softupdates and crash it a lot.  (Using DDB
to pause, then switch off, so no partial writes.)  Could take a while...

Oh, and Brian wanted to know the processor revision.  I don't know of any
problems with K6-2/300s, but here's the info:

CPU: AMD-K6(tm) 3D processor (300.68-MHz 586-class CPU)
  Origin = "AuthenticAMD"  Id = 0x580  Stepping=0
  Features=0x8001bf

Stephen


K6-2 revisions (was: Re: Softupdates reliability?)

1999-08-24 Thread Stephen McKay

On Tuesday, 24th August 1999, "Brian F. Feldman" wrote:

>On Tue, 24 Aug 1999, Richard Tobin wrote:
>
>> > >   Origin = "AuthenticAMD"  Id = 0x580  Stepping=0
>> 
>> > You have one of the first K6-2s off the line. There were definite problems
>> > with these, and as such, they were specially distinguished by having 66
>> > printed on top.
>> 
>> I have a 0x580 which has had no problems at all.  I'm pretty certain
>> it doesn't have 66 stamped on it.  Are they all supposed to have this,
>> or were they tested and the dodgy ones stamped 66?
>
>It must be the latter. My 0x580 had the 66, so it must be that the dodgy
>ones got labelled 66 and not all the 0x580s were defective.

I think the story went along the lines that AMD were making K6-2/300's for
a while, then went to a less rigorous test procedure for just a short time
until they realised that some of the processors they released wouldn't work
at 100MHz bus speeds, though they were ok at 66MHz.  So they went back to
the better testing procedure for the 100MHz models, but also released some
66MHz only models.

Mine was indeed one of the earliest, but there have been no problems with it,
and during my strange disk crash the CPU kept updating the X11 load graph 
and stuff.  The problem(s) must be elsewhere.

Stephen.


Re: Softupdates reliability?

1999-08-24 Thread Stephen McKay

On Tuesday, 24th August 1999, Wilko Bulte wrote:

>Hmm. I would generally expect SCSI errors etc to occur. Assuming the driver
>reports those one would at least know the bus was whacko.

I saw no errors, but that's not entirely surprising since I was running X11
and by that time xconsole was probably swapped out, and the disk system
was stuck, so it wouldn't have been able to report anything.  I gave up on a
serial console a very long time ago because this machine is so reliable. :-)

Also, I recall (rumour?) that the ncr driver is not as robust in the face
of errors as the adaptec driver, at least with CAM.  Anybody know the facts?
I know, for example, that I can't get bad block lists using my scsi adapter,
but people using adaptecs can.  That shows that the ncr driver is in some
sense incomplete.  I've been meaning to look into that, but you know how
time gets away.

So, after all this, I still don't know if I have any real evidence of anything
at all.  I'll just have to keep at it until it happens again.

Stephen.


SCSI surprise! (was: Softupdates reliability?)

1999-08-30 Thread Stephen McKay


[I'm trying my first crosspost experiment here.  Please follow up to -scsi.]

A week ago I posted my strange crash and subsequent doubts about the proper
functioning of softupdates.  This is more of the story.

I examined the lost+found directory more closely and of the few files that
I traced, they were all temporary files or newly created directories (ports
actually) created in the CTM update process.  So, maybe I didn't really
lose anything.  Maybe fsck just doesn't recognise one of the safe-but-crashed
modes you get when using softupdates.  But unfortunately, I needed a CVS tree
urgently and restored a backup.  To make up for this, I promise to do serious
destruction testing of softupdates soon.

But, I had another crash almost as soon as I started using the machine again.
Again, the Exabyte was being used (but only rewinding at the time), but the
obvious trigger this time was intense disk activity (from "rm").  The active
file system was not using softupdates, and had a number of fsck -p correctable
errors on reboot.  Conclusions:

1) The Exabyte was not to blame for the crash
2) The crash wasn't a "scribble junk" crash (first one probably wasn't either)
3) Regular mounts are still safer than softupdates

I took the lid off anyway hoping to find anything at all weird and noticed
something I had forgotten.  I was using a Seagate ST51080N 1GB disk earlier
for some experimenting and had disconnected the POWER, but not the SCSI CABLE.
(It's a really noisy drive!) When I also unplugged the SCSI cable, all crashes
stopped.  I've now used the machine intensively for several days (copying over
20GB of small and big files, and read and written several tapes) without
incident.  Conclusions:

4) My stepping of K6-2/300 is just fine
5) My Exabyte really is ok :-)
6) It is NOT safe to have a powered down SCSI device attached to a SCSI chain
7) The world really is a wonderful place ;-)

So, apart from being happy at having stable hardware again, I am intensely
curious about this.  Why is a powered down SCSI device so nasty?  For example,
the first crash locked up my SCSI card so that reset didn't fix it, and the
second crash hung one of my disks so that it had to be powered down to even
be recognised!  Is there a standard for this stuff?

Stephen.



Re: optional 'make release' speed-up patch

1999-09-09 Thread Stephen McKay

On Thursday, 9th September 1999, "John W. DeBoskey" wrote:

>   The following patch to /usr/src/release/Makefile allows the
>specification of the variable FASTCLEAN, which instead of doing
>a recursive rm on CHROOTDIR, simply umounts/newfs/mounts.

>+  -device=`df | grep '${CHROOTDIR}' | cut -f1 -d' '` && \
>+  /sbin/umount ${CHROOTDIR} && \
>+  /sbin/newfs  $$device && \
>+  /sbin/mount  ${CHROOTDIR} && \
>+  /bin/df ${CHROOTDIR}

This is going to look like I'm putting the boot in after everyone else has
already expressed a negative opinion, but I want to reinforce why this is
a dangerous option, and I think a bit of unhappiness now will result in
safer future contributions.  I'm really not trying to be a pain.

First up, destroying file systems in a makefile is a very rare event, and
a pretty spectacular trick to use as a performance optimisation.  Admittedly
a make release is heavy stuff already, but destroying file systems is one
step further than expected.  Have you tested this and verified that it is
a major time saving?  I suspect it is not.  Optimising the 10% case instead
of the 90% case just increases the likelihood of bugs, and it is doubly
risky to use the big guns on the small fry.

The proposed code isn't very careful, and would attempt to destroy the
wrong file system if, for example, I had CHROOTDIR set to /d.  (Maybe
I like calling file systems /a, /b, /c, etc like those crazy folks on
freefall.)  I doubt that it would succeed (because of checks for mounted
file systems) but it would try.  So, the code should verify that CHROOTDIR
is a local mounted file system, and of the type you intend to make.
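
To illustrate the /d case: a substring match on the df output picks up every line that mentions /dev, while a check on the "Mounted on" column finds only the intended file system.  A sketch using made-up df output (the device names and sizes are hypothetical):

```python
# Hypothetical df output, for illustration only.
DF_OUTPUT = """\
Filesystem  1K-blocks    Used   Avail Capacity  Mounted on
/dev/da0s1a     63503   30000   28423    51%    /
/dev/da0s1e   2000000 1000000  840000    54%    /usr
/dev/da1s1e   4000000  100000 3580000     3%    /d
procfs              4       4       0   100%    /proc
"""


def grep_style(df_output, chrootdir):
    """Mimic df | grep CHROOTDIR | cut -f1 -d' ': substring match per line."""
    return [line.split()[0]
            for line in df_output.splitlines()
            if chrootdir in line]


def exact_mount_point(df_output, chrootdir):
    """Match only lines whose last column is exactly the mount point."""
    devices = []
    for line in df_output.splitlines()[1:]:   # skip the header line
        fields = line.split()
        if fields and fields[-1] == chrootdir:
            devices.append(fields[0])
    return devices


if __name__ == "__main__":
    print("grep style :", grep_style(DF_OUTPUT, "/d"))
    print("exact match:", exact_mount_point(DF_OUTPUT, "/d"))
```

Even the exact match only identifies the device.  It does not check that the file system is local or of the expected type, which the makefile would still need to verify.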

The code runs newfs on the block device.  It really should run on the
character device.  In a dangerous thing like on-the-fly file system
destruction and creation, precision is important, even if only to instill
confidence in the user when it runs.

Defining FASTCLEAN to destroy a file system is a surprise unless you
are intimately familiar with the makefile.  That's a bear trap on the
nature walk.  For example, I used MACHINE all the time in my .profile
until it started screwing up FreeBSD compiles.  FASTCLEAN is probably
somebody's favourite variable for some unimportant thing.  They might
never run make release, but it's lurking there for them when they do.
The variable name should be more descriptive, and require that it be
set to a particular value before it triggers.

So, what's the upside of all this gloom?  Do I really think this is the
most dangerous thing I've ever seen?  Well, no.  I just think it is
inadvisable.  There is nothing stopping you from creating a script that
runs make release for you.  Then you can newfs your filesystem there,
fully aware of the risks, and fully in control.  For everyone else, the
enormous rm is a useful test of the softupdates code.  Most things have
a silver lining if you know how to look at them. :-)

Stephen.


Fsck follies

1999-11-21 Thread Stephen McKay


I was giving vinum + softupdates a bit of a workout on 4 really old
SCSI disks (Sun shoeboxes, if you must know) attached to an aha1542B.
The rest of the machine is a Pentium 133 with 64MB of parity ram, a
few more disks, and another aha1542B.  It runs -current (about 10 days
old now).

I was copying a newer -current source tree onto the box when I lost power
to my house for maybe half a second.  Being foolish and shortsighted, I
have no UPS.  (An interesting side note: out of the 3 machines in use at
the time, 2 of the keyboards locked up and required a power down to recover.
I was unaware that keyboards could crash.)

When the system came back up, fsck -p didn't like the vinum volume.
No sweat, I ran it manually.  There were many

INCORRECT BLOCK COUNT I= (4 should be 0)

messages.  I assumed this was an artifact of soft updates.  The fsck
completed successfully.

Being paranoid, I reran fsck.  This time it reported a number of
unreferenced inodes (199 to be exact), and linked them in to lost+found.

It is this last item that bothers me.  When the first fsck completed,
the filesystem should have been structurally correct.  But it wasn't.
A third fsck confirmed that 2 runs of fsck were enough.

I seem to recall sage advice from days gone by to always run fsck twice
and sync thrice.  I thought I could forget all that stuff nowadays.

By the way, I saved the broken old source tree and compared it to the
full tree.  There were no unusual differences, except for the broken
one being incomplete.  So, if fsck were a little better, things would
be fine.  As good as you could expect, given a power failure.

Stephen.



Re: Fsck follies

1999-11-21 Thread Stephen McKay

On Sunday, 21st November 1999, Christopher Masto wrote:

>On Sun, Nov 21, 1999 at 10:36:32PM +1000, Stephen McKay wrote:
>> When the system came back up, fsck -p didn't like the vinum volume.
>> No sweat, I ran it manually.  There were many
>> 
>> INCORRECT BLOCK COUNT I= (4 should be 0)
>> 
>> messages.  I assumed this was an artifact of soft updates.  The fsck
>> completed successfully.
>> 
>> Being paranoid, I reran fsck.  This time it reported a number of
>> unreferenced inodes (199 to be exact), and linked them in to lost+found.
>> 
>> It is this last item that bothers me.  When the first fsck completed,
>> the filesystem should have been structurally correct.  But it wasn't.
>> A third fsck confirmed that 2 runs of fsck were enough.
>
>Presumably you are using vinum for mirroring?  I have had to stop
>doing so after trashing several filesystems.  There are some serious
>bugs that allow the plexes to get out of sync; as reads from a mirror
>set are round-robin, this can be very bad.

No, I was just striping them (4 x 660 MB disks, 96 KB interleave).  Vinum
had nothing to do with the problem.  I was just reporting all the facts,
just in case.

I think there is a fault in fsck.  Possibly it is because softupdates
changed the rules.  Having run md5 over the good copy and the broken
(power failure interrupted) copy as well as everything in lost+found,
I can say that no corrupted files survived, and everything in lost+found
was a good copy of some file or other.  So softupdates appears to be
doing the right thing.  But fsck didn't fix everything broken by the
power interruption.

Stephen.

Re: Fsck follies

1999-11-23 Thread Stephen McKay

On Monday, 22nd November 1999, Bernd Walter wrote:

>On Mon, Nov 22, 1999 at 02:57:39PM +1000, Stephen McKay wrote:
>> I think there is a fault in fsck.  Possibly it is because softupdates
>> changed the rules.  Having run md5 over the good copy and the broken
>> (power failure interrupted) copy as well as everything in lost+found,
>> I can say that no corrupted files survived, and everything in lost+found
>> was a good copy of some file or other.  So softupdates appears to be
>> doing the right thing.  But fsck didn't fix everything broken by the
>> power interruption.
>> 
>Sometimes fsck tells you that it needs a rerun.
>See /usr/share/doc/smm/03.fsck/paper.ascii.gz for details about fsck.
>Are you sure that this was not the case?

It should print "PLEASE RERUN FSCK" in that case.  It did not do so.  It
looks like other messages like "FILE SYSTEM MARKED DIRTY" are likely in
this case.  It said "FILE SYSTEM MARKED CLEAN" that evening.

Eventually I'll get the time to do repeated powerdowns of my equipment to
try to reproduce this.  I hope you can see why I'm not rushing into this. :-)

Stephen.

Re: HEADSUP: wd driver will be retired!

1999-12-10 Thread Stephen McKay

On Friday, 10th December 1999, "Kenneth D. Merry" wrote:

>Brad Knowles wrote...
>> At 3:05 PM -0700 1999/12/10, Kenneth D. Merry wrote:
>> 
>> >  I agree that the CAM integration shouldn't be used as a precedent here.
>> >  I don't agree with your characterization of it as a "debacle", though.
>> >
>> >  On the whole, we gained a whole lot and lost very little.
>> 
>>  Long-term, yes I believe we gained a lot.  Short-term, what I 
>> recall having heard from some of the people who lived through it, 
>> well let's just say it was really ugly and nasty for a certain period 
>> of time.
>
>I don't think it was ugly and nasty at all.  You're basing your opinions
>on second hand hearsay.  If you can produce specific examples of why it
>was "really ugly and nasty", fine, but why not avoid making statements you
>can't support?

This must depend on your perspective.  My first hand view is that it was
ugly and nasty.  This is because I lost support for hardware I was actively
using (some temporarily, some permanently), and because I had no control
over the pace of change.  For a bunch of reasons, there was no way I could
keep up (and that meant porting old drivers to keep up).  It sure felt
ugly to me.  The unnecessary renaming of device files made it worse.

But that shouldn't stop us from moving forward with the ata driver.  I
think that a small slowing of the pace, and a bit more understanding toward
those with unusual hardware will help.  And I support PHK's hard line
stance (except for the rushed pace) toward making the kernel break for
users of wd.  It has to be so, or no one will move.  The wd code will
still be in the CVS tree for desperate people to revive to use, and to
port the missing bits into the ata driver.

Stephen.

Re: HEADSUP: wd driver will be retired!

1999-12-10 Thread Stephen McKay

On Friday, 10th December 1999, Mike Smith wrote:

>The same mentality that made the CAM cutover a "debacle" is making the 
>ata cutover a "debacle".  

This "mentality" might be an unavoidable part of human nature.  I found
my first reaction was "How dare they take away something I have now?!"
and it took some careful thinking to see that my loss was actually very
small against future benefits.  It might be that these things have to be
predicted by -core and handled "touchy feely" like:

core:   What if we do X?
public: Um, sounds scary.  When will you do it?  Will I lose anything?
core:   We think a month from now, and you will lose support for Y and Z.
public: We don't use Y and Z any more, so fine.

instead of the current (caricatured for emphasis):

core:   We will do X soon.  Probably today.
public: Oh my God!
core:   It's for your own good.  You always complain and make it difficult!
public: We don't want to change anything, ever!  It's so hard!  You must
support all my hardware for ever and ever!

>Fortunately, the CAM folks persisted despite the criticism, and I'm glad 
>to see that Soren is taking the same stance.

Not everything improved with CAM.  Personally I'm only receiving the
benefits of CAM now, about a year after it replaced the old system.
On the balance it has been good for FreeBSD, but you have to remember
that there will be small pockets of users that will get the short end
of the stick.  How the project deals with the losers in these deals is
important for its long term health.

Stephen.

dc driver and underruns (was: Strangeness with 4.0-S)

2000-07-13 Thread Stephen McKay

On Monday, 10th July 2000, Stefan Esser wrote:
>On 2000-07-09 20:52 +1000, Stephen McKay <[EMAIL PROTECTED]> wrote:
>> On Saturday, 8th July 2000, Stefan Esser wrote:
>> 
>>>Oh, there are renegotiations after each overrun ???

>> The code at the point that an underrun is detected is:
>> 
>>  printf("dc%d: TX underrun -- ", sc->dc_unit);
>>  if (DC_IS_DAVICOM(sc) || DC_IS_INTEL(sc))
>>          dc_init(sc);
>>  
>> After that, it sets the new threshold, or store and forward mode.  That
>> conditional (which resets the DE-500 style cards I own), looks deliberate
>> since it is so specific.  Either that, or Bill was being conservative.
>> When I get a chance, I will experiment with removing it.
>
>Well, the DE Driver (DEC 21x4x) has (relevant lines marked ***):
>
> [SNIP: code showing de driver does not reset chip]

I've now read the 21143 chip manual from Intel.  What the de driver does
is illegal (the transmitter must be idle when the threshold is changed).
I don't know if it works in practice; the de driver didn't work well for
me.  What the dc driver does is overkill.  I will implement some changes,
based on the documentation, and see what happens.

Of course, Bill, if you have direct experience that contradicts the
documentation (as if I've never seen incorrect doco...) then I'm all
ears.  I also have a very limited range of test hardware.

>I agree, that for chips that need to be completely re-initialized, the
>default might be store-and-forward ...

>There are so many DEC 21x4x clones, all slightly different, and it seems
>that at least a few need the chip reset.

There is already a convenient store-and-forward-only flag that is set
for one of the supported chips.  I propose that this flag be set on all
hardware that cannot have the threshold changed without a reset.

>> It hides the problem very well for me.  I really can't see the tiniest
>> of performance loss with store and forward.  Maybe it's something that
>> only shows up on benchmarks.
>
>Guess it will show up if you measure latencies (or your application is
>doing lots of RPCs). But as soon as there is a cheap 100baseT switch in
>the path to the destination, there will be store-and-forward at work ;-)

Does anyone here actually measure these latencies?  I know for a fact
that nothing I've ever done would or could be affected by extra latencies
that are as small as the ones we are discussing.  Does anybody at all
depend on the start-transmitting-before-DMA-completed feature we are
discussing?

Lastly, some people really want to keep the messages.  Is hiding them
behind bootverbose enough?  Or do I have to add a flag/hint?  No, I
haven't looked at the new hint system, so I don't know if I should
be afraid or not. :-)

Stephen.

Re: dc driver and underruns (was: Strangeness with 4.0-S)

2000-07-13 Thread Stephen McKay

On Thursday, 13th July 2000, "Rodney W. Grimes" wrote:

>>On Thu, 13 Jul 2000, Stephen McKay wrote:
>> 
>>>Does anyone here actually measure these latencies?  I know for a fact
>>>that nothing I've ever done would or could be affected by extra latencies
>>>that are as small as the ones we are discussing.  Does anybody at all
>>>depend on the start-transmitting-before-DMA-completed feature we are
>>>discussing?
>> 
>> I don't like the idea of removing that feature.  Perhaps it should be a
>> sysctl or ifconfig option, but it should definitely remain available.
>> Those minute latencies are critical to those of us who use MPI for
>> complex parallel calculations.
>
>I have to agree here.  The store and forward adds an approximate
>11uS (by theory under ideal conditions 1500bytes@132MB/s = 11uS,
>practice actually makes this worse as typical PCI does something
>less than 100MB/s or 15uS) to a 120uS packet time on the wire (again,
>ideal, but here given that switches, and in fact often cut-through
>switches, are used for these types of things, ideal and practice
>are very close.)
>
>I don't think these folks, nor myself, are wanting^H^H^H^H^H^Hilling
>to give up 12.5%.

OK.  It seems that repairing the feature, rather than disabling it, is
the most popular option.  Still, I am quite interested in finding anyone
who actually measures these things, and is affected by them.  These very
same people might be able to trace why we get the underruns in the first
place.  I suspect an interaction between the ATA driver and VIA chipsets,
because other than the network, that's all that is operating when I see
the underruns.  And my Celeron with a ZX chipset is immune.
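
For anyone wanting to check the arithmetic in Rod's estimate quoted above,
here is a small C sketch.  It assumes a 1500 byte frame and 100 Mbit/s
Ethernet, and treats 1 MB/s as one byte per microsecond.  The function names
are made up for illustration and come from no driver.

```c
#include <assert.h>

/* Microseconds needed to move `bytes` across a bus running at `mb_per_s`
 * megabytes per second (1 MB/s is one byte per microsecond). */
double transfer_us(double bytes, double mb_per_s)
{
    assert(mb_per_s > 0.0);
    return bytes / mb_per_s;
}

/* Microseconds needed to serialise `bytes` onto a link of `mbit_per_s`. */
double wire_us(double bytes, double mbit_per_s)
{
    assert(mbit_per_s > 0.0);
    return bytes * 8.0 / mbit_per_s;
}

/* Extra store-and-forward delay as a percentage of the time on the wire. */
double overhead_percent(double bytes, double pci_mb_per_s, double mbit_per_s)
{
    return 100.0 * transfer_us(bytes, pci_mb_per_s)
        / wire_us(bytes, mbit_per_s);
}
```

Under these assumptions transfer_us(1500, 132) comes to roughly 11.4,
wire_us(1500, 100) is 120, and overhead_percent(1500, 100, 100) is 12.5,
so the 12.5% figure corresponds to the "less than 100MB/s" case (15uS
against 120uS), not the ideal 132MB/s one.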

Back to the technical, for a moment.  I have verified that stopping the
transmitter on the 21143 is both sufficient and necessary to enable the
thresholds to be set.  I have code that works on my machine.  I intend
to commit it when I think it looks neat enough.

Getting even more technical, it appears to me that the current driver
instructs the 21143 to poll for transmit packets (ie a small DMA)
every 80us even if there are none to be sent.  I don't know what percentage
of bus time this might be, or even how to calculate it (got some time Rod?)
but it looks unnecessary to me.  I think the transmitter could be turned
off regularly.  At the moment, the driver leaves it on all the time.

And to the non technical: Do the messages go or stay?  I've heard both
sides.  For most people they are just annoying fluff.  For those who
actually care about the latency, they might be informative, and thus
too useful to be hidden behind bootverbose.  Opinions?

Stephen.

Re: dc driver and underruns (was: Strangeness with 4.0-S)

2000-07-14 Thread Stephen McKay

On Friday, 14th July 2000, Matthew Jacob wrote:

>
>> That theory is not correct, I have seen multiple Alpha machines reporting 
>> buffer underruns as well. No ATA disk in sight there..
>
>This has been a reported feature of the tulip chip and alphas (de driver
>usually) forever forever forever.

And there's no guarantee that there is just one cause.  If the dc driver
with BX and ZX chipsets never has an underrun, and the 2 VIA chipsets I've
tried always cause underruns, there might be something we can fix.  Even
if we never manage to fix it on Alphas.

>It's not a bug, per se, IMO.

In the i386 case, there's some sort of PCI bus starvation.  Maybe we can
fix it.  Maybe not.  We can at least try to categorise it.  Maybe it's
as simple as a BIOS option we should tweak.

Stephen.

Re: dc driver and underruns (was: Strangeness with 4.0-S)

2000-07-16 Thread Stephen McKay

On Friday, 14th July 2000, "Rodney W. Grimes" wrote:

>>  I suspect an interaction between the ATA driver and VIA chipsets,
>> because other than the network, that's all that is operating when I see
>> the underruns.  And my Celeron with a ZX chipset is immune.
>
>I've seen them on just about everything, chipset doesn't seem to matter,
>IDE or SCSI doesn't seem to matter.

Well, maybe they are just a fact of life.  But using just my vague knowledge
of how PCI works, it doesn't look inevitable to me.  So I see bugs. :-)

>> Getting even more technical, it appears to me that the current driver
>> instructs the 21143 to poll for transmit packets (ie a small DMA)
>> every 80us even if there are none to be sent.  I don't know what percentage
>> of bus time this might be, or even how to calculate it (got some time Rod?)
>
>I'll have to look at that.  If it is a simple 32 bit read every 80uS
>that's something like .1515% of the PCI bandwidth, something that shouldn't
>matter much.  (I assumed a simple 4 cycle PCI operation).  Just how big
>is this DMA operation every 80uS?

I believe it is just one 32 bit read.  But I don't understand that aspect
of the hardware very well yet.  I also suspect that this polling adds
to the latency, but again, I haven't got to the end of that either.
Sometimes other things can distract you from even the most interesting
technical matter. :-)
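
Rod's .1515% can be reproduced with a few lines of C.  This is only a
sketch: the 33 MHz PCI clock is my assumption (the thread does not state
it, but it is the value that yields his number), the 4 cycles per poll is
his stated guess, and the function name is invented for illustration.

```c
#include <assert.h>

/* Percentage of bus time consumed by a poll that occupies
 * `cycles_per_poll` bus cycles and repeats every `interval_us`
 * microseconds on a bus clocked at `bus_mhz` MHz. */
double poll_bus_percent(double cycles_per_poll, double bus_mhz,
    double interval_us)
{
    double busy_us;

    assert(bus_mhz > 0.0 && interval_us > 0.0);
    busy_us = cycles_per_poll / bus_mhz;    /* one cycle lasts 1/MHz us */
    return 100.0 * busy_us / interval_us;
}
```

poll_bus_percent(4, 33, 80) gives about 0.1515.  A larger DMA per poll
scales the result linearly, so even a 16 cycle burst would stay well
under 1% of the bus.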

Stephen.

Ugly, slow shutdown

2000-08-05 Thread Stephen McKay


I'm off in a few days for a couple months of tourism in Europe (no, no need
for sympathy!), so I'm dumping these couple ideas on you and running.

I think shutdown time has gotten uglier and slower than it needs to be.
I want to apply these patches (well, at least the first one) before I escape
radar range.  Your job is to not object much. :-)

Patch 1 replaces:

  Waiting (max 60 seconds) for system process `bufdaemon' to stop...stopped

with

  Stopping bufdaemon

Also:

  syncing disks... 10 10 3
  done

returns to the traditional

  syncing disks... 10 10 3 done

Patch 2 is smaller and possibly controversial.  Normally bufdaemon and
syncer are sleeping when they are told to suspend.  This delays shutdown
by a few boring seconds.  With this patch, it is zippier.  I expect people
to complain about this shortcut, but every sleeping process should expect
to be woken for no reason at all.  Basic kernel premise.

I've been running these patches on a 4.x machine for a while now.  No
problems except I am now surprised by the slow and ugly shutdown of
unpatched machines. :-)

I apologise that I've not tested these against -current.  That's the bit
that I've skipped because I'm out of time.  There should be no difference
between 4.x and -current in this area though.  These patches will apply
cleanly against both.

Cheers,

Stephen.

Patch 1:
Index: kern_shutdown.c
===
RCS file: /cvs/src/sys/kern/kern_shutdown.c,v
retrieving revision 1.76
diff -u -r1.76 kern_shutdown.c
--- kern_shutdown.c 2000/07/04 11:25:22 1.76
+++ kern_shutdown.c 2000/07/06 15:02:21
@@ -247,7 +247,6 @@
sync(&proc0, NULL);
DELAY(5 * iter);
}
-   printf("\n");
/*
 * Count only busy local buffers to prevent forcing 
 * a fsck if we're just a client of a wedged NFS server
@@ -261,6 +260,8 @@
bp->b_vp->v_mount, mnt_list);
continue;
}
+   if (nbusy == 0)
+   printf("\n");
nbusy++;
 #if defined(SHOW_BUSYBUFS) || defined(DIAGNOSTIC)
printf(
@@ -593,12 +594,11 @@
return;
 
p = (struct proc *)arg;
-   printf("Waiting (max %d seconds) for system process `%s' to stop...",
-   kproc_shutdown_wait, p->p_comm);
+   printf("Stopping %s", p->p_comm);
error = suspend_kproc(p, kproc_shutdown_wait * hz);
 
if (error == EWOULDBLOCK)
-   printf("timed out\n");
+   printf(": timed out\n");
else
-   printf("stopped\n");
+   printf("\n");
 }


Patch 2:
Index: kern_kthread.c
===
RCS file: /cvs/src/sys/kern/kern_kthread.c,v
retrieving revision 1.5
diff -u -r1.5 kern_kthread.c
--- kern_kthread.c  2000/01/10 08:00:58 1.5
+++ kern_kthread.c  2000/08/05 15:32:06
@@ -116,6 +116,12 @@
 */
if ((p->p_flag & P_SYSTEM) == 0)
return (EINVAL);
+   /*
+* The target process is probably just snoozing.  Wake it up so
+* that it will notice that it should suspend itself.
+*/
+   if (p->p_wchan != NULL)
+   wakeup(p->p_wchan);
SIGADDSET(p->p_siglist, SIGSTOP);
return tsleep((caddr_t)&p->p_siglist, PPAUSE, "suspkp", timo);
 }

TheEnd


Re: Ugly, slow shutdown

2000-08-07 Thread Stephen McKay

> * Mike Smith <[EMAIL PROTECTED]> [000807 01:25] wrote:
> > > * Stephen McKay <[EMAIL PROTECTED]> [000805 08:49] wrote:
> > > > 
> > > > ... every sleeping process should expect
> > > > to be woken for no reason at all.  Basic kernel premise.
> > > 
> > > You better bet it's controversial, this isn't "Basic kernel premise"
> > 
> > Actually, that depends.  It is definitely poor programming practice to 
> > not check the condition for which you slept on wakeup.
> 
> Stephen's patches didn't give them that option, the syncer could be
> in some other part of vfs that doesn't expect to be woken up, perhaps
> in uninterruptible sleep... perhaps waiting for a DMA transfer?
> 
> How does one check if the data filled into a buffer is actually from
> the driver and not just stale?

The time honoured standard is:

    raise cpu priority
    while (we do not have exclusive use of some item) {
        set some sort of "I want this item" flag (optional)
        sleep on a variable related to the item
    }
    use the item/data we waited for
    lower cpu priority

A typical example from vfs_subr.c:

    s = splbio();
    while (vp->v_numoutput) {
        vp->v_flag |= VBWAIT;
        error = tsleep((caddr_t)&vp->v_numoutput,
            slpflag | (PRIBIO + 1), "vinvlbuf", slptimeo);
        if (error) {
            splx(s);
            return (error);
        }
    }
    ... the code plays a little with vp here ...
    splx(s);

A simpler example from swap_pager.c:

    s = splbio();

    while ((bp->b_flags & B_DONE) == 0) {
        tsleep(bp, PVM, "swwrt", 0);
    }
    ... code uses bp here ...
    splx(s);

Both of these examples are safe from side effects due to waking up early.
This is how all such code should be.  To do otherwise is to introduce possible
race conditions.
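
The same discipline can be tried out in user space.  What follows is a
rough analogue only, not kernel code: a POSIX mutex stands in for
splbio()/splx(), a condition variable stands in for the wakeup channel,
and struct item and every function name are hypothetical.  POSIX allows
pthread_cond_wait() to return spuriously, so the while loop is required
there for the same reason argued above.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel object such as a buf or vnode. */
struct item {
    pthread_mutex_t lock;           /* loosely, splbio()/splx() */
    pthread_cond_t  chan;           /* loosely, the wakeup channel */
    bool            done;           /* the condition actually awaited */
    int             waiter_saw_done;
};

void item_init(struct item *it)
{
    pthread_mutex_init(&it->lock, NULL);
    pthread_cond_init(&it->chan, NULL);
    it->done = false;
    it->waiter_saw_done = 0;
}

void item_destroy(struct item *it)
{
    pthread_cond_destroy(&it->chan);
    pthread_mutex_destroy(&it->lock);
}

/* Sleep until it->done is set.  Returns the condition as seen on exit. */
int item_wait(struct item *it)
{
    int saw_done;

    pthread_mutex_lock(&it->lock);
    while (!it->done)               /* recheck after every wakeup */
        pthread_cond_wait(&it->chan, &it->lock);
    saw_done = it->done;            /* true here, whatever woke us */
    pthread_mutex_unlock(&it->lock);
    return saw_done;
}

/* A wakeup with no change of state, in the spirit of Patch 2. */
void item_poke(struct item *it)
{
    pthread_mutex_lock(&it->lock);
    pthread_cond_broadcast(&it->chan);
    pthread_mutex_unlock(&it->lock);
}

/* The real event: set the condition, then wake the sleepers. */
void item_complete(struct item *it)
{
    pthread_mutex_lock(&it->lock);
    it->done = true;
    pthread_cond_broadcast(&it->chan);
    pthread_mutex_unlock(&it->lock);
}

static void *waiter(void *arg)
{
    struct item *it = arg;

    it->waiter_saw_done = item_wait(it);
    return NULL;
}

/* Returns 1 if the waiter resumed only with the condition satisfied,
 * -1 if the thread could not be created. */
int demo_early_wakeup(void)
{
    struct item it;
    pthread_t t;
    int ok;

    item_init(&it);
    if (pthread_create(&t, NULL, waiter, &it) != 0) {
        item_destroy(&it);
        return -1;
    }
    item_poke(&it);                 /* stray wakeups, harmless */
    item_poke(&it);
    item_complete(&it);
    pthread_join(t, NULL);
    ok = it.waiter_saw_done;
    item_destroy(&it);
    return ok;
}
```

If the while in item_wait() were replaced by a plain if, one of the
item_poke() calls could let the waiter proceed before done was set,
which is the class of bug under discussion.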

At your prompting, though, I've looked at more code and have found an example
that violates this principle.  I assume it is a bug waiting to bite us.  In
the 4.1.0 source (sorry, that's all I have on operational computers at this
moment) line 581 of vfs_bio.c sleeps without looping.  It would seem that
Alfred's assertion of lurking danger is correct.  This stuff should be fixed.

> > > *boom* *crash* *ow* :)
> > 
> > Doctor:  So don't do that.
> > 
> > In this case, the relevant processes just need to learn to check whether 
> > they've been woken in order to die.
> 
> No, they need to signify that it's safe to wake them up early.

When I return to the land of FreeBSD I'll offer a speedup that does not wake
processes in arbitrary places (to avoid tickling lurking bugs).  To do this
I would make processes that want to use the suspension mechanism call a
routine in kern_kthread.c for their just-loafing-about sleep.  Then that
module will have enough information to do the job quickly.

And back to the simpler bit (the bike shed bit).  Does everyone else actually
*like* the verbose messages currently used?  And the gratuitous extra newline
in the "syncing..." message?

Stephen.

PS My main machine has blown its power supply.  Contact with me will be patchy.

Re: Ugly, slow shutdown

2000-08-11 Thread Stephen McKay


Well, I've failed in my main objective (to deuglify the shutdown messages),
but an interesting debate has resulted instead, so I can't feel too bad.

I did a little research to support my position on sleep/wakeup, and here's
the best I have.  This is pretty long, and unlikely to shake your world view,
so those of you with drooping eyelids can just head over to slashdot, or
something. :-)

Some pseudo code from "The Design of the Unix Operating System", by Maurice
Bach, page 33 shows how sleep() is used:

while (condition is true)
    sleep (event: the condition becomes false);
set condition true;

and the next page shows how wakeup() is used:

set condition false;
wakeup (event: the condition is false);

In the description, it says `Thus, the "while-sleep" loop insures that at most
one process can gain access to a resource.'

Not the most convincing evidence, but on the other hand, he does not mention
the idea of *not* protecting against sudden wakeup.

From "Writing a Unix Device Driver", by Egan and Teixeira, on page 92 we find

It is not uncommon for several processes to sleep on the same
channel.  They may be competing for the same resource, or they
may be waiting for different reasons that have been associated
with the same channel value.  In this situation a single wakeup
call on the common channel will cause all the sleeping processes
to become executable; ...  A driver routine must not assume that
it can proceed after a return from a sleep call.  It should check
to see whether the event it was waiting for has actually occurred;
if it has not it should sleep again, and repeat this cycle until
the awaited event has actually occurred.

The book is oriented rather towards I/O, so perhaps not all possible uses
of routines are covered.  But again, no mention of *not* using a while loop.
Quite the opposite.

Also "Magic Garden Explained" points out that you really want to sleep on
an "event", but all you have is the address of some data.  So, you often
have multiple semantically different events represented by the same integer
wakeup channel.  A good reason to program defensively, I think.

But the best evidence is from kern_synch.c from 4.2 BSD, line 98, in the
header comment of the sleep() routine:

* Callers of this routine must be prepared for
* premature return, and check that the reason for
* sleeping has gone away.

That comment on sleep() is present from 4.0 BSD up to and including 4.3 tahoe,
but disappears in 4.3 reno, when the 4.4 style tsleep() was introduced.  After
a bit of searching through the PUPS archive, I see it is even present in
Edition 6, character for character, in a file called slp.c.

Well, I knew I wasn't a senile old fart yet, and Kirk's BSD CD compendium
and the PUPS archive show that I remember some things correctly still.  For
a considerable portion of Unix history, sleep() could return for no good reason
at all, and was documented to do so (if only in the source code).

Now, how does this relate to the current day?  Nobody in the BSD world uses
plain sleep() any more.  Once tsleep() appeared, the rules seem to have changed.
Perhaps some people had gotten away with ignoring the dire warnings in the
sleep() code, and decided that unexpected wakeups weren't such a useful part
of the API.  I hope Kirk or other BSD veterans can be coaxed into offering
an opinion.  I'd offer at least one beer for this purpose. :-)

Regardless of the history of it all, FreeBSD is full of places where
unexpected wakeups can stuff you right up.  Should we regard tsleep() like
the older sleep() call, as suspect, and program defensively?  Should we
be pragmatic, admit "We've gotten away with it so far", and document the
"no sudden wakeups" behaviour?

I quite like the general principle outlined in one of the earlier replies,
that a while loop can be shown to be correct through a local code reading,
but a simple conditional must be verified by reading all the rest of the
code.  That's close to the same argument I use against global variables.
Their use is too hard to verify as correct.  In short, I'd like to see
all cases where tsleep() is not carefully used in a loop repaired.

Practically speaking, though, I can't see that happening, especially if
we have any major players against the idea (DG for example).  Given that,
I'd like as a minimum a bit more of the history of sleep() in the tsleep()
manual page, and a discussion of when a while-loop protected tsleep() is
mandatory, and when it is optional.  Some sort of pronouncement against
issuing wakeup() calls against arbitrary addresses would help too.

I would do that right now, except I'm escaping computing for a few months.
Almost heresy nowadays, I suppose.  And I won't be the first in line for
a brain implanted net connection either. ;-)

Stephen.

PS By the time you read this, I've probably unsubscr

Fix for broken "burncd msinfo" PR#27593

2001-12-25 Thread Stephen McKay


A number of people have complained that "burncd msinfo" returns the wrong
value when there are already multiple sessions on a CD.  This is true,
and is bug bin/27593.

Since I burn a lot of multisession CDs, and have been working out the mkisofs
-C values by hand with the help of "cdcontrol info", I thought now would be a
good time to fix this bug.

Unfortunately, I've found that burncd won't work with SCSI burners, and
the only ATAPI burner I have is at work, and well, it's Christmas and all
that.  So this is completely untested, though I believe it should work.

I hope this can make it into 4.5.

Stephen.

PS How much work would it be to add the CDRIO* ioctls to the SCSI cd driver?


Index: burncd.c
===
RCS file: /cvs/src/usr.sbin/burncd/burncd.c,v
retrieving revision 1.19
diff -u -r1.19 burncd.c
--- burncd.c 2001/12/24 03:20:10 1.19
+++ burncd.c 2001/12/25 13:45:48
@@ -149,10 +149,14 @@
break;
}
if (!strcasecmp(argv[arg], "msinfo")) {
+   struct ioc_toc_header header;
struct ioc_read_toc_single_entry entry;
 
+   if (ioctl(fd, CDIOREADTOCHEADER, &header) < 0)
+   err(EX_IOERR, "ioctl(CDIOREADTOCHEADER)");
bzero(&entry, sizeof(struct ioc_read_toc_single_entry));
entry.address_format = CD_LBA_FORMAT;
+   entry.track = header.ending_track;
if (ioctl(fd, CDIOREADTOCENTRY, &entry) < 0) 
err(EX_IOERR, "ioctl(CDIOREADTOCENTRY)");
if (ioctl(fd, CDRIOCNEXTWRITEABLEADDR, &addr) < 0) 

Another tweak to "burncd msinfo"

2002-01-05 Thread Stephen McKay


Now that "burncd msinfo" returns the correct values I noticed another small
problem: it displays the result on stderr instead of stdout.

Since very few people (nobody?) would be using this option yet because
of the previous problem, it seems like nobody would be adversely affected
by changing the output to stdout.  Also, removing the whitespace in the
output would help script writers.

Can I commit the obvious patch?

Stephen.

Re: Another tweak to "burncd msinfo"

2002-01-05 Thread Stephen McKay

On Saturday, 5th January 2002, Søren Schmidt wrote:

>It seems Stephen McKay wrote:
>> Now that "burncd msinfo" returns the correct values I noticed another small
>> problem: it displays the result on stderr instead of stdout.
>
>Hmm, that was intentional...

Could you explain why?  The most obvious practical use would be:

$ mkisofs -r -C `burncd msinfo` -M /dev/acd0c -o new.iso goodies

Writing to stderr means this doesn't work, and you have to add 2>&1 to it.
Also the white space means you have to use extra quoting.
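
The effect is easy to demonstrate without a burner.  The sketch below uses
popen(), which, like shell backquotes, collects only the child's standard
output.  The echo commands merely stand in for "burncd msinfo", the numbers
are invented, and capture_stdout() is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

/* Run `cmd` through the shell and copy the first line of its standard
 * output into buf, without the trailing newline.  Returns the number of
 * characters captured, or -1 on error.  Anything the command writes to
 * standard error bypasses the pipe and is not captured. */
int capture_stdout(const char *cmd, char *buf, size_t len)
{
    FILE *p;

    if (len == 0)
        return -1;
    buf[0] = '\0';
    p = popen(cmd, "r");
    if (p == NULL)
        return -1;
    if (fgets(buf, (int)len, p) == NULL)
        buf[0] = '\0';              /* nothing arrived on stdout */
    pclose(p);
    buf[strcspn(buf, "\n")] = '\0'; /* strip the newline, if any */
    return (int)strlen(buf);
}
```

With a command such as "echo 12345,67890" the buffer receives the pair
ready to hand to mkisofs -C.  With "echo 12345,67890 1>&2" the text
goes to the terminal and the buffer stays empty, which is what the
backquoted form above would pass to mkisofs.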

>> Can I commit the obvious patch?
>
>Could you just hang on for now, since I'm doing large changes to
>burncd just now in order to support other things, and keeping
>everybody's changes to the stock sources is not making things 
>easier...

Are these changes intended for 4.5?  I'm hoping the small change I
proposed would be accepted into 4.5, before anybody starts using
"burncd msinfo" in practice.  I think this is sensible, even if
a much improved burncd is scheduled for 4.6.

Regardless of this, I do not intend to commit any unwelcome changes.

Stephen.

Re: Another tweak to "burncd msinfo"

2002-01-05 Thread Stephen McKay

On Saturday, 5th January 2002, Søren Schmidt wrote:

>It seems Stephen McKay wrote:
>> 
>> Are these changes intended for 4.5?  I'm hoping the small change I
>> proposed would be accepted into 4.5, before anybody starts using
>> "burncd msinfo" in practice.  I think this is sensible, even if
>> a much improved burncd is scheduled for 4.6.
>
>You should ask permission from the release engineer to commit it
>to 4.5, but it really should be committed to -current first.

Of course!  But given how simple the change is, just a couple of days
in -current would be sufficient testing.  I am asking your approval
to commit to -current, then I'll ask the REs about -stable.

Does this mean you've decided that it is a beneficial change and won't
interfere with your other work?

Stephen.

Re: Another tweak to "burncd msinfo"

2002-01-05 Thread Stephen McKay


On Saturday, 5th January 2002, Søren Schmidt wrote:

>I forgot to say that I already committed the change to current...

:-)

I try to keep up with -current, but that's too current for me!

I'll hassle the REs tomorrow about permission to merge.

Thanks,

Stephen.

Re: Whatever happened to CTM?

2001-03-21 Thread Stephen McKay

On Tuesday, 20th March 2001, Ulf Zimmermann wrote:

>On Mon, Mar 19, 2001 at 04:53:33PM -0800, John Baldwin wrote:
>> 
>> On 20-Mar-01 Michael C . Wu wrote:
>> > For all connections greater than 9600baud modems, we recommend
>> > using CVSup to get src-all and ports-all updated. At the worst case, 
>> > be able to CVSup a ports-all collection within an hour, with heavy
>> > packet loss and low bandwidth.
>> > 
>> > i.e. CTM sucks, don't use it. :)

On the contrary, I prefer CTM over CVSup, even on a fast connection (which
I don't currently have).  On a slow or intermittent connection, CTM beats
CVSup by a large margin.

>> cvsup is not available via e-mail for those who may only have e-mail access
>> for one reason or another.

Firewalls make CTM style delivery essential.  (No, Stefan, I don't like
your tunneling idea. :-)

>I have been hosting the machine which ran ctm,

And many thanks indeed for your service!

>unfortunately my provider
>cut me off and I just got some access back, but not for the location
>the ctm machine is located at.
>
>At this time I do not know yet when it will have access again.

Surely FreeBSD Inc (or whatever it is that owns the freebsd.org machines)
could spring for a box.  Assuming Ulf is still keen, it shouldn't be too
hard for him to remote administer it.

Stephen.


Re: Whatever happened to CTM?

2001-03-22 Thread Stephen McKay

On Thursday, 22nd March 2001, Bruce Evans wrote:

>On Wed, 21 Mar 2001, Stephen McKay wrote:
>> On the contrary, I prefer CTM over CVSup, even on a fast connection (which
>> I don't currently have).  On a slow or intermittent connection, CTM beats
>> CVSup by a large margin.
>
>I'm not sure about that.  CTM may be faster, but it works less
>automatically, especially when it breaks, and it breaks often, at both
>the server and client levels (mainly downtime problems for the server
>and disk-full problems for the client).  I used to use it until the
>server broke one time too many last year.

CTM's advantages outweigh the disadvantages for me.  I don't run out of
disk space(*), and the server failures have been rare.  Certainly, the
reliability of CTM delivery exceeded the reliability of all of the M$
systems the guys in the neighbouring cubicles managed at my previous
employer.  Until now, of course.

What we need now is someone to supply hardware and some connectivity.  I still
think CTM has sufficient advantages to justify its continued existence.

I think the project should fund it.

Stephen.

(*) The tangle you get into after ctm croaks from lack of disk space was
supposed to have been fixed.  I don't think it has been.  It shouldn't be
too difficult though.  All those md5 checksums make repairs trivial to
automate, in theory.


Implications of stdio changes (was Re: cvs commit: src/include stdio.h src/lib/libc Makefile src/lib)

2001-08-20 Thread Stephen McKay

On Tuesday, 14th August 2001, Daniel Eischen wrote:

>> > So do we allow FILE to be extended only after bumping the library
>> > version once (after 5.0-release)?  And thereafter all extensions to
>> > FILE do not need a version bump?
>> 
>> We've already bumped libc for 5.x.  Assuming this works ok, we shouldn't need
>> any further bumps for extending FILE.
>
>True.  I guess the real problem is the other libraries that reference
>stdin, stdout, stderr.  These need to be rebuilt with the new stdio.h
>and libc in order to avoid any impact from future FILE changes.

I might sound like the harbinger of doom, but you have to bump the major
number on every library that uses stdio to solve the "FILE has changed
size" problem.  It's the same sort of problem that changing errno caused.
That was "solved" by the switch to elf, which caused global recompilation.

People are hoping to do this by just waiting.  Eventually most libraries
will experience a major version bump.  Similarly, most useful programs will
be recompiled (either against bumped libraries, or recompiled old ones).
But some programs will not be recompiled, and will fail in mysterious ways.
I often use really old binaries, so odds are it will happen to me. :-)

To prevent old binaries from going bad, the libraries they link to must
use the old version of stdio.  Definite ideas of the offset in __sF of
stdout and stderr are embedded in both the old programs, and the old
libraries (and of course, the old version of stdio).  If you recompile
the libraries against the new stdio, you break the old binaries.  The
solution is to not do that.

In short, when FILE changes size (and hence __sF offsets change), then
every consumer(*) of stdio must be bumped.  The recent __stdinp (and friends)
addition prevents this problem happening again in the future, but does not
solve the current problem of old binaries and old libraries knowing the
internals of stdio.

Stephen.

(*) OK, technically only uses of "stdout" and "stderr" variables screw up
when FILE changes size.  Uses of macros (like getc variants that are
sometimes macros) will screw up if offsets change, but that's easier
to avoid.
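
The offset problem described above can be sketched in a few lines of C.
This is only an illustration under stated assumptions: old_file and
new_file are hypothetical stand-ins for FILE before and after a field is
added, not the real structures.  The point is that stdout and stderr were
reached as elements of an array, so their byte offsets depend on the
element size that was current when the caller was compiled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for FILE before and after it grows. */
struct old_file { int fd; int flags; };
struct new_file { int fd; int flags; long extra; };

/* Byte offset of element idx in an array whose elements are elem_size
 * bytes.  With the array scheme, this value is fixed at compile time in
 * every program and library that mentions stdout or stderr. */
static size_t element_offset(size_t elem_size, size_t idx)
{
    return elem_size * idx;
}

/* Nonzero when code built against the old layout would compute a
 * different address for element idx than code built against the new one.
 * Only three streams (stdin, stdout, stderr) live in the array. */
static int offsets_disagree(size_t idx)
{
    assert(idx < 3);
    return element_offset(sizeof(struct old_file), idx)
        != element_offset(sizeof(struct new_file), idx);
}
```

Element 0 (stdin) sits at offset zero under either layout, which matches
the footnote: only stdout and stderr, at indexes 1 and 2, go wrong.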


A tiny Perl bug?

2000-11-22 Thread Stephen McKay


I was trying to get FreeBSD 4.2-BETA to compile under FreeBSD 3.4 when
I found that the use of the new setresgid() and setresuid() system
calls were causing the perl5 compile to fail.  I got around this using
NOPERL=yup but while investigating I noticed an apparent bug in the use
of setresgid() and propose this patch:

Index: mg.c
===================================================================
RCS file: /cvs/src/contrib/perl5/mg.c,v
retrieving revision 1.1.1.4
diff -u -r1.1.1.4 mg.c
--- mg.c        2000/08/20 08:42:14     1.1.1.4
+++ mg.c        2000/11/22 12:01:32
@@ -1926,7 +1926,7 @@
             (void)setregid((Gid_t)PL_gid, (Gid_t)-1);
 #else
 #ifdef HAS_SETRESGID
-      (void)setresgid((Gid_t)PL_gid, (Gid_t)-1, (Gid_t) 1);
+      (void)setresgid((Gid_t)PL_gid, (Gid_t)-1, (Gid_t)-1);
 #else
             if (PL_gid == PL_egid)      /* special case $( = $) */
                 (void)PerlProc_setgid(PL_gid);

I assume this was just a typo.  I can't think of any reason to try to
set the saved gid to daemon.  I'd whip in and commit this myself, but
I'm sure there are "vendor branch considerations", and I've never
found out what's involved with that.
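
To make the effect of that one character concrete, here is a toy model of
the three group ids, assuming the usual setresgid() convention that an
argument of -1 leaves the corresponding id alone.  The struct and
model_setresgid() are hypothetical; permission checks are left out, and
this is not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a process's real, effective and saved group ids. */
struct gids { long real, effective, saved; };

/* Model of setresgid() argument handling only: -1 means "leave this id
 * unchanged".  No privilege checks are modelled. */
static void model_setresgid(struct gids *g, long r, long e, long s)
{
    assert(g != NULL);
    if (r != -1) g->real = r;
    if (e != -1) g->effective = e;
    if (s != -1) g->saved = s;
}
```

With a third argument of 1 the saved gid is overwritten with gid 1, while
-1 changes only the real gid, which is what the surrounding code appears
to intend.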

And piggybacking a slightly wider issue:  The cross-tools section of
Makefile.inc1 is supposed to address the use of new system calls and
such in build tools, right?  Can we forget about the old "try to use
the new syscall and do something else if it isn't there" code?  And all
we need to do to fix my migration problem is to MFC marcel's miniperl
cross-build fix?  Right?

Otherwise I have all this blather I was going to say about using fancy
new syscalls in perl just to emulate old syscalls we already have, and
the way that makes upgrading harder.  But I don't have to go on about
that, it seems. :-)

Stephen.



Re: Is compatibility for old aout binaries broken?

2000-12-17 Thread Stephen McKay

On Saturday, 16th December 2000, "Donald J . Maddox" wrote:

>The other day, on a whim, I decided to try running an old binary
>of SimCity (the same one found in the 'commerce' directory on
>many FBSD cds), and it failed in an odd way...

You and I may be the only people in the world that run old binaries.
This has been broken for new users for some time. :-(  Those of us
upgrading from source have been immune to this problem, because we
retain the old a.out ld.so binary.

>/usr/libexec/ld.so: Undefined symbol "___error" called from sim:/usr/X11R6/lib
>/aout/libX11.so.6.1 at 0x20160644

When errno became a function that returns a pointer (previously it was
a simple integer variable), recompiled libraries became incompatible with
old binaries.  So, I hacked the a.out loader (ld.so).  The fix was in 3.0.
Well, Nate called it a horrible hack, so maybe I should say "the hack was
in 3.0".
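
A minimal sketch of that errno change, using hypothetical names (my_error
and MY_ERRNO stand in for the libc function and the errno macro; these
are not the real definitions):

```c
#include <assert.h>

/* Storage that the accessor function hands out a pointer to. */
static int my_errno_storage;

/* Stand-in for the function that returns a pointer to errno. */
static int *my_error(void)
{
    return &my_errno_storage;
}

/* Stand-in for the macro: every use of the name becomes a call. */
#define MY_ERRNO (*my_error())

static int set_and_read(int value)
{
    MY_ERRNO = value;            /* expands to (*my_error()) = value */
    assert(MY_ERRNO == value);   /* reads go through the function too */
    return MY_ERRNO;
}
```

Code compiled against such a header refers to the function rather than
to an integer variable.  That is why a recompiled library wants ___error
(the a.out spelling of __error) from somewhere, and why an old libc that
exports only the variable cannot satisfy it.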

>Am I overlooking something obvious here, or is something actually
>broken with respect to running old aout binaries?

I found that rtld-aout won't compile.  That's kinda broken.
(It's probably something simple.  Looks like the a.out version of
a pic library just isn't around any more).  I'll try harder later.
What's certain is that it isn't compiled by default.

I poked about with my old FreeBSD CD collection and found that
version 3.0 through 3.2 have a fully functioning (fully hack enabled)
ld.so, but an older binary has been substituted in 3.3 and onward,
including 4.0 and 4.1, and most likely 4.2 also.

I can only guess that some anonymous release engineer (nobody we know :-)
picked the wrong CD at some point to get the master copy of ld.so once
it stopped compiling.  (Or at least stopped being easily compiled.)

Ideally, rtld-aout would be compiled fresh for every release.  Until then,
you can repair your system by retrieving ld.so from a 3.3 CD (in the
compat22 section), or from a 3.2 live filesystem CD.

Stephen.


Re: Is compatibility for old aout binaries broken?

2000-12-17 Thread Stephen McKay

On Sunday, 17th December 2000, "Donald J . Maddox" wrote:

>Under the circumstances, it seems silly to have aout compat
>bits installed at all, seeing as how they cannot work.

Old programs that don't depend on recompiled libraries are fine.  I can't
guess at the percentages though.  Also, nearly everybody has recompiled
for elf, where this problem never occurred.

>Like you, I normally upgrade from source --  This box has
>been -current ever since 2.0.5 or so was -current, but I
>had to reinstall from scratch a while back by installing
>4.2-RELEASE and then cvsupping back to -current, so I
>guess I lost my working aout ld.so in the process.  Bummer :(

I expected some build tool expert to say "Just compile with these
options".  But they haven't.  So I'll see if the bits have rotted,
or whether we can keep building ld.so instead of just including
an age old binary.

Stephen.


Re: Is compatibility for old aout binaries broken?

2000-12-18 Thread Stephen McKay

On Monday, 18th December 2000, "Donald J . Maddox" wrote:

>On Mon, Dec 18, 2000 at 04:41:17PM +1000, Stephen McKay wrote:
>> 
>> I expected some build tool expert to say "Just compile with these
>> options".  But they haven't.  So I'll see if the bits have rotted,
>> or whether we can keep building ld.so instead of just including
>> an age old binary.

>Well, if you do manage to uncover the lost magic, please let me know :)

It's getting a little more magic every day to generate a.out stuff,
but not all that bad.  Basically I built lib/csu/i386, gnu/lib/libgcc,
lib/libc and libexec/rtld-aout, in order, with these settings:

NOMAN=yup DESTDIR="" OBJFORMAT=aout MAKEOBJDIRPREFIX=/usr/obj/aout

In each directory, I used make obj, make, make install.  (By the way,
there are a lot of twisty little passages in /usr/share/mk.  One of
them required me to add DESTDIR="", which should be a NOP.)

The generated ld.so has bloated a bit :-) but works fine.  So we could
in principle build ld.so for every release.  It's just a question of
whether we should.  I think we should.  But it might be just as easy
to copy it off the 3.3 CD every time.  It's dead end stuff after all.

Does the release engineer have an opinion?

Stephen.


Re: Is compatibility for old aout binaries broken?

2000-12-18 Thread Stephen McKay


On Tuesday, 19th December 2000, Stephen McKay wrote:

>But it might be just as easy to copy it off the 3.3 CD every time.

Oops!  As I wrote earlier, 3.3 and onward have the broken ld.so.  Good
copies are found on 3.0 through to 3.2.

Sorry for veering off the road there. :-)

Stephen.



Re: Is compatibility for old aout binaries broken?

2000-12-18 Thread Stephen McKay

On Monday, 18th December 2000, Jordan Hubbard wrote:

>> The generated ld.so has bloated a bit :-) but works fine.  So we could
>> in principle build ld.so for every release.  It's just a question of
>> whether we should.  I think we should.  But it might be just as easy
>> to copy it off the 3.3 CD every time.  It's dead end stuff after all.
>> 
>> Does the release engineer have an opinion?
>
>If it's just for the compat3x distribution, I say check it into that
>part of lib/compat and be done with it.  Uudecoding it each time is a
>lot easier than building it.  Or are we talking about ld.so in some
>different context?

I hadn't noticed all the uuencoded things in lib/compat before.  This
is obviously the way to fix it.

By the way, it's the compat22 distribution that needs fixing, and, as
previously noted, it's the 3.2 CD that has the last fully working ld.so.

I'll get onto committing a fix.

Stephen.


Re: No cable modems??

2000-12-19 Thread Stephen McKay

On Tuesday, 19th December 2000, "Donald J . Maddox" wrote:

>Why are you (or your ISP) refusing to accept mail from people
>with cable modems?  Enquiring minds want to know... ;-)

>   - Transcript of session follows -
>... while talking to frmug.org.:
>>>> MAIL From:<[EMAIL PROTECTED]>
><<< 550 no cable modems here
>554 5.0.0 [EMAIL PROTECTED] Service unavailable

It's a spam reduction move.  I'm surprised hub.freebsd.org accepts your
mail!  You should funnel your mail through your ISP's central mail hub.

Followups to -chat, I think.

Stephen.


Re: Is compatibility for old aout binaries broken?

2000-12-20 Thread Stephen McKay

On Wednesday, 20th December 2000, "David O'Brien" wrote:

>On Mon, Dec 18, 2000 at 02:58:16AM +1000, Stephen McKay wrote:
>> This has been broken for new users for some time. :-(  Those of us
>> upgrading from source have been immune to this problem, because we
>> retain the old a.out ld.so binary.
>> 
>> >/usr/libexec/ld.so: Undefined symbol "___error" called from
>> >sim:/usr/X11R6/lib /aout/libX11.so.6.1 at 0x20160644
>>
>> When errno became a function that returns a pointer (previously it was
>> a simple integer variable), recompiled libraries became incompatible with
>> old binaries.  So, I hacked the a.out loader (ld.so).  The fix was in 3.0.
>> Well, Nate called it a horrible hack, so maybe I should say "the hack was
>> in 3.0".
>
>src/lib/libc/sys/__error.c suggests this was the case for 2.2.7+.

No, you want rev 1.10 of sys/sys/errno.h.  That was when it affected all
a.out binaries.  Until then it was just threaded binaries, a vanishingly
small proportion.  Rev 1.10 was in 3.0.  Rev 1.5 was in the 2.2.x releases.

>What is out of sync is the X11 a.out libs.  They are probably built on a
>2.2.7 or 2.2.8 box, thus they refer to `___error' vs. `errno'.  These
>libs are wrong for the SimCity binary.  They are a.out yes, but not
>proper for compat20 use.  Since SimCity needs `libgcc.so.261', I'll
>assume it was built that long ago.

Correcting slightly for your slightly off assumption: The X11 libs were
probably built on a 3.x box.  Their problem is that being newer than
libc.so.2.2 (or was it libc.so.3.0) they use ___error but libc does not
supply it.  My patches to rtld-aout (that first appeared in FreeBSD 3.0)
supply ___error in this case.  This is the only full fix for this situation.
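
The idea of the loader stepping in can be sketched as a symbol lookup
with one special-cased fallback.  Everything here is hypothetical
(struct sym, lookup(), resolve(), fallback_error()); it is not the
rtld-aout source, and how the real ld.so locates the errno variable is
not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*generic_fn)(void);   /* any function address */
typedef int *(*error_fn)(void);     /* shape of the errno accessor */

struct sym { const char *name; generic_fn addr; };

/* Loader-supplied accessor.  Assumption: it hands out a stand-in
 * variable, since the real mechanism is not described here. */
static int fallback_errno;
static int *fallback_error(void) { return &fallback_errno; }

/* Plain lookup in a library's symbol table; NULL when absent. */
static generic_fn lookup(const struct sym *tab, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].name, name) == 0)
            return tab[i].addr;
    return NULL;
}

/* Resolve name against libc first; only if libc lacks ___error does the
 * loader supply its own.  Other missing symbols stay undefined. */
static generic_fn resolve(const struct sym *libc, size_t n, const char *name)
{
    assert(name != NULL);
    generic_fn addr = lookup(libc, n, name);
    if (addr == NULL && strcmp(name, "___error") == 0)
        return (generic_fn)fallback_error;
    return addr;
}
```

A libc that does export ___error still wins, because the fallback is
consulted only after the normal lookup fails.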

>The problem isn't as much ld.so, as it should match the libc.so, et.al.
>you are using from the compat2[01] dist (needed to satisfy ``ldd
>lib/SimCity/res/sim'').  And `ld.so' and the shared libs would be
>consistent on the system the a.out program was built on.

There was an enormous thread in -current (I think) at the time (mid 1998).
The end result was that the ld.so hack was the only solution other than
mandating a major bump to every library in existence.  Nobody liked either
of those solutions :-) but I put the ld.so hack in and the problem disappeared.
Emphasis again: the workaround ld.so was only found in 3.0 and onward, so
just using a 2.2.x ld.so isn't enough.

>What I would feel most comfortable with, is doing a MFC to RELENG_2_2 of
>the rtld-aout changes since then, building a new `ld.so' and putting that
>in the compat2? dists.  Problem is I don't have access to a 2.2-STABLE
>box.

I have built a binary on 4.2-RELEASE.  I think I prefer that because any
security fixes in libc (or whatever) will be reflected in the resulting
ld.so.  In fact, I think we should build ld.so from source until such
time as a.out building capability is removed (5.0 perhaps).

On the other hand, merging back to 2.2.x and rebuilding should provide
a working (and hack enabled) ld.so that has no more problems than the
old binaries it is supporting.

>> I poked about with my old FreeBSD CD collection and found that
>> version 3.0 through 3.2 have a fully functioning (fully hack enabled)
>> ld.so, but an older binary has been substituted in 3.3 and onward,
>> including 4.0 and 4.1, and most likely 4.2 also.
>
>Are you sure?  src/lib/compat/compat2[012]/ld.so.gz.uu are all at
>rev 1.1.  So there has been no change to them over the lifetime of their
>existence.  All three are identical -- having the same MD5 checksum.
>Well, looking at the release tags compat22/ld.so was in 3.2.
>compat2[01]/ld.so was added for 3.3.

This very fact is bothering me a lot.  Get out your 3.2 disks and verify
that they do not match these uuencoded binaries.  Check the 3.0 and 3.1
disk 2 (live file system) and see that they don't match them either.

>> I can only guess that some anonymous release engineer (nobody we know :-)
>> picked the wrong CD at some point to get the master copy of ld.so once
>> it stopped compiling.  (Or at least stopped being easily compiled.)
>
>Not quite.  I seem to remember that JKH was making a tarball of a.out
>libs from whatever was on his box at the time (thus probably the last
>a.out ld.so just before E-day on 3-CURRENT).

Something like this must have happened up to and including the 3.2 release.

>When I committed the
>compat2? bits, I took ld.so from a 2.2.x release as this is the compat2?
>dist, not compat3.aout dist.  Which is what you're suggesting should have
>been done.

You missed the fact that fixes were added to ld.so after those releases
even though the purpose of ld.so is to run binaries that date fr

Re: Is compatibility for old aout binaries broken?

2000-12-20 Thread Stephen McKay

On Wednesday, 20th December 2000, "Donald J . Maddox" wrote:
>> > Looks good.  Can you install the XFree86-aoutlibs port?  You may have
>> > seen where someone posted the a.out libs from 3.3.6 are known to not be
>> > the best to use for compatibility use.
>
>Interesting.  After I installed the XFree86-aoutlibs port, SimCity
>works fine for me (on an 8-bit display)...
>
>It didn't work with the X libs built by the port when aout libs
>are requested, and it didn't work with the X libs from 3.3.6, but
>it works with these.

If the XFree86-aoutlibs libraries are old enough, they will not call ___error.
That is sufficient to solve your particular problem, but not to solve the
general case.  

I'm now wondering if the reason that people don't like the XFree86 3.3.6
a.out libraries is the problem with ___error and the older ld.so supplied
with recent FreeBSD releases.

Stephen.


Re: Is compatibility for old aout binaries broken?

2000-12-20 Thread Stephen McKay

On Wednesday, 20th December 2000, "Donald J . Maddox" wrote:

>On Wed, Dec 20, 2000 at 10:14:09AM -0800, David O'Brien wrote:
>> On Wed, Dec 20, 2000 at 11:15:55PM +1000, Stephen McKay wrote:
>> > Correcting slightly for your slightly off assumption: The X11 libs were
>> > probably built on a 3.x box.  Their problem is that being newer than
>> > libc.so.2.2 (or was it libc.so.3.0) they use ___error but libc does not
>> > supply it.  My patches to rtld-aout (that first appeared in FreeBSD
>> > 3.0) supply ___error in this case.  This is the only full fix for this
>> > situation.
>> 
>> Why is not changing the XFree86-aoutlibs port to offer libs built on
>> 2.2.x not the right fix?
>
>I was under the impression that this was already the case...  The libs
>in the XFree86-aoutlibs port ARE from 2.2.x.  My problem was that I
>was using libs built on 3.x.

(I think I can save a lot of typing by replying to this message.  I'm just
about to leave town.)

My whole point is that generating a.out binaries and libraries didn't stop
the instant that 3.0 hit the streets.  To support the mixture of old binary
plus new library you need a hacked ld.so.  We have to supply it somehow,
or simply say we don't care about certain binaries dying with obscure
error messages.  This XFree86-aoutlibs vs libs built on 3.x example supports
my theme.

I can't reconcile your naming convention (ie compat22 bits originated on
a 2.2.x box) with my version (compat22 is used to support 2.2.x binaries).

I'm also not afraid that a binary generated on 4.2 would have hidden
defects.  I'm more worried that one generated on 2.2.x would have defects
we've forgotten about.

If you don't mind pausing the whole argument for about 4 days, I can
rejoin.  :-)

Stephen.


Fixing a.out compatibility

2000-12-26 Thread Stephen McKay


I'll try to summarise the position so far:

1) Legacy a.out executable support is broken for a subset (size unknown)
of such executables.

2) We can ignore this or repair this.

3) We can build a new binary or just look around on old 3.x CDs until
we find one that works.

4) We can generate a working binary on 4.x or on 2.2.8-stable (after some
fixing).

5) We can generate ld.so anew each release, or generate it (or find it)
once and commit a binary.

I don't think there's any doubt about point 1.  All a.out executables that
use libc.so.2.2 and another recompiled library will fail because of a
missing routine (__error) required by the recompiled library and not
supplied by libc or by executable or by the existing ld.so.

All these executables come from the 2.2.x era or earlier.  Those built in
the 3.x era use libc 3.1 and don't have this problem.  Urk...  Actually,
it's slightly more complicated than that since the libc.so.3.1 built on
2.2.6 (for example) didn't contain __error() but the one built on 2.2.7
did.  (At least according to the cvs logs).  I'm most annoyed that I can't
find my 2.2.6 CDs.  2.2.5 had libc 3.0 (without __error) and 2.2.7 had
libc 3.1 (with __error) but the cvs logs say that 2.2.6 should have had
a different libc 3.1 (without __error).  So, the exact "version" of
version 3.1 of libc could be important.  Yuck.

We don't normally ignore things we can fix, so point 2 is resolved in
favour of fixing this, right?

We need to build a new binary since we (collectively) have forgotten
where the working 3.0 through 3.2 binaries came from. :-(  Can we,
for example, prove that revision 1.57 made it into any release?

It seems feasible to generate a new binary on a recent or an old patched
FreeBSD version.  The question is which is better.  I think the newer
the better.  Otherwise, who is going to build the 2.2.8-stable box
to make this one binary?  I've already built a binary on 4.2-release
that works.

We disagree a bit over point 5.  I think it is feasible and desirable
to build ld.so at each release.  If we don't build it for each release,
how will fixes to rtld-aout and required libraries (eg libc) be incorporated?
I say keep building it fresh until a.out builds are impossible.  Or are
you suggesting that each advance in 4.x and beyond be backported to
2.2.8-stable so that we can build one binary?

So, where to from here?  Despite all my arguments, I could just commit
the binary I have to the lib/compat2* areas and leave it at that.

Stephen.

PS Thanks for all the "old_RELENG_2_2" etc tags now available in rtld-aout.



Re: Fixing a.out compatibility

2000-12-26 Thread Stephen McKay

[Noted that you don't like being cc:'d, David.  On the other hand,
I like to be kept in the cc: list.]

On Tuesday, 26th December 2000, "David O'Brien" wrote:

>On Wed, Dec 27, 2000 at 02:01:24AM +1000, Stephen McKay wrote:
>> I'll try to summarise the position so far:
>> 
>> 1) Legacy a.out executable support is broken for a subset (size unknown)
>> of such executables.
>
>Define "legacy".  I have been speaking specifically about FreeBSD 2.2
>support.  That just happens to be a.out based.
>
>You seem to mean it to be any a.out binary.

Not really.  If I generate an a.out binary right now, it can't suffer
from this problem, even though it uses ld.so when it runs.  Only a
certain set of old a.out binaries are affected.

>From my standpoint only bits generated on a 2.2.x host can go into the
>compat22 distribution.  When compat1x was created (being the first it
>gets to imply the intention of the compat dists) it gave the ability to
>run FreeBSD 1.x binaries; not 2.0 a.out ones, not any binary after the
>last 1.x release.  Thus why I claim compat22 is *not* about being an
>a.out compat dist, but one to properly run 2.2.x binaries on a later
>version of FreeBSD.  If 3.0 had been a.out based, there still would be a
>compat22 dist.

We almost completely agree, but...

The only a.out binaries with problems come from that 2.2/2.1 era.  To
support them we need an ld.so from *after* that era.  I can't see how you
get around this.  That working ld.so was in 3.0 and was certainly not
generated on a 2.2.x host.  I think your restriction on compat contents
is a useful guideline, but to be broken when necessary.

>Thus someone that still has access to a 2.2.8-stable box needs to merge
>the changes in src/libexec/rtld-aout (in -current) to
>src/gnu/usr.bin/ld{,/rtld} and build a new binary for inclusion in the
>compat22 dist.

I'll build one if I have to.  I'm trying to avoid unnecessary work, since
I expect there are few others bothered enough to fix this problem.

>Note, when the bits were CVS repo copied into rtld-aout, all the tags
>were stripped.  I spent the time to add them all back to make the merge
>easier for someone.  Whoever does this should please CVSup before
>starting.

Could very well be me.  But I would be patching the old location, surely?

>> I don't think there's any doubt about point 1.  All a.out executables that
>> use libc.so.2.2 and another recompiled library will fail because of a
>> missing routine (__error) required by the recompiled library and not
>> supplied by libc or by executable or by the existing ld.so.
>
>Agreed, but "and another recompiled library", means this a.out
>executable was not built on a 2.2.x host.  Otherwise there would be no
>way to have this inconsistency.

This is the fundamental point of this problem.  The executable was built
on a 2.2.x or 2.1.x box and originally used libraries compiled then or
earlier.  The whole problem is the fact that libraries were recompiled
later and did not change version numbers.  There was no way to force
external parties to update version numbers, and folks round here didn't
feel like bumping all the FreeBSD library version numbers.

This is why I keep the words "executable" and "library" separate.  The
library is newer than the executable, and this causes the executable
to fail.  This is the fact that I'm not at all sure that you understand.

>Actually one problem is I put the 2.2.8 ld.so in the compat2[01] dist.
>That was wrong of me.  I can correct that.  SimCity (the binary used as
>an example) required me to install the compat20 and compat21 dists.  The
>other problem is we don't have a compat2[01] XFree86 libs dist.  We only
>have an a.out one that is intended to cover all a.out binaries, and it
>doesn't correctly.

We can only install one ld.so.  It has to cover all bases.  Are you
suggesting that each compat2x dist install a different ld.so?  This
is consistent with your claim that "compat2x bits come from 2.x", but
not very useful in practice.  Should I assume you meant to delete
ld.so from all but one compatxx dist?

>> 2.2.5 had libc 3.0 (without __error) and 2.2.7 had libc 3.1 (with
>> __error) but the cvs logs say that 2.2.6 should have had a different
>> libc 3.1 (without __error).  So, the exact "version" of version 3.1 of
>> libc could be important.  Yuck.
>
>The compat22 dist used the 2.2.8 bits, so I don't see how it wasn't the
>``exact "version" of version 3.1 of libc''.

What I was going on about here is that important changes occurred to
libraries without a version bump, and one such library was libc.  It
is making my attempt to describe the boundary of the problem very
difficul

panic: vm_object_qcollapse(): object mismatch

1999-02-04 Thread Stephen McKay

Hardware: 486DX2/66 16Mb ram, aha1542CF, 2x1Gb SCSI disks
Software: 4.0-current 1-2 days old, softupdates
  (vm_map.c is at rev 1.146, for example)

I was running 'make -j5 buildworld'.  It swaps like crazy when I do this. :-)

Here's what gdb -k tells me:

...
#9  0xf01425e0 in panic (
fmt=0xf0225c1f "vm_object_qcollapse(): object mismatch")
at ../../kern/kern_shutdown.c:446
#10 0xf01e0772 in vm_object_qcollapse (object=0xf2f001d0)
at ../../vm/vm_object.c:1011
#11 0xf01e08d6 in vm_object_collapse (object=0xf2f001d0)
at ../../vm/vm_object.c:1102
#12 0xf01ddae2 in vm_map_copy_entry (src_map=0xf2f4aa00, dst_map=0xf2f4ad00, 
src_entry=0xf2ed0e10, dst_entry=0xf2f8edc0) at ../../vm/vm_map.c:2284
#13 0xf01ddd73 in vmspace_fork (vm1=0xf2f4aa00) at ../../vm/vm_map.c:2411
#14 0xf01da833 in vm_fork (p1=0xf2f7db20, p2=0xf2d751e0, flags=20)
at ../../vm/vm_glue.c:231
#15 0xf013d4f0 in fork1 (p1=0xf2f7db20, flags=20) at ../../kern/kern_fork.c:447
#16 0xf013ce65 in fork (p=0xf2f7db20, uap=0xf3021f94)
at ../../kern/kern_fork.c:99
#17 0xf01fe783 in syscall (frame={tf_es = 134807599, tf_ds = -272695249, 
  tf_edi = 134750909, tf_esi = 134935201, tf_ebp = -272643652, 
  tf_isp = -217964572, tf_ebx = 4, tf_edx = 672250004, tf_ecx = 19, 
  tf_eax = 2, tf_trapno = 12, tf_err = 2, tf_eip = 671826564, tf_cs = 31, 
  tf_eflags = 662, tf_esp = -272651296, tf_ss = 47})
at ../../i386/i386/trap.c:1100
#18 0xf01f4e9c in Xint0x80_syscall ()
...
(kgdb) p *p
$1 = {pageq = {tqe_next = 0xf02c5240, tqe_prev = 0xf02e4e00}, hnext = 0x0, 
  listq = {tqe_next = 0xf02e59d0, tqe_prev = 0xf2f69cc8}, object = 0xf2f69cb0, 
  pindex = 30, phys_addr = 15065088, queue = 4, flags = 1, pc = 0, 
  wire_count = 0, hold_count = 0, act_count = 27 '\e', busy = 0 '\000', 
  valid = 255 'ÿ', dirty = 255 'ÿ'}
(kgdb) p object
$2 = (struct vm_object *) 0xf2f001d0
(kgdb) p *object
$3 = {object_list = {tqe_next = 0xf2fdc2b8, tqe_prev = 0xf2f69c3c}, 
  shadow_head = {tqh_first = 0x0, tqh_last = 0xf2f001d8}, shadow_list = {
tqe_next = 0x0, tqe_prev = 0xf2f69cb8}, memq = {tqh_first = 0xf02dbcb0, 
tqh_last = 0xf02cc86c}, generation = 11690, type = OBJT_DEFAULT, 
  size = 32, ref_count = 2, shadow_count = 0, pg_color = 0, 
  hash_rand = -136756254, flags = 8576, paging_in_progress = 0, behavior = 0, 
  resident_page_count = 6, cache_count = 0, wire_count = 0, 
  backing_object = 0xf2f69cb0, backing_object_offset = 0x, 
  last_read = 0, pager_object_list = {tqe_next = 0xf2f69000, 
tqe_prev = 0xf0252f10}, handle = 0x0, un_pager = {vnp = {
  vnp_size = 0x}, devp = {devp_pglist = {tqh_first = 0x0, 
tqh_last = 0x0}}, swp = {swp_bcount = 0}}}
(kgdb) p *(p->object)
$4 = {object_list = {tqe_next = 0xf2f915e4, tqe_prev = 0xf30fd0e8}, 
  shadow_head = {tqh_first = 0xf2f001d0, tqh_last = 0xf2f001e0}, 
  shadow_list = {tqe_next = 0x0, tqe_prev = 0xf30fef04}, memq = {
tqh_first = 0xf02e7170, tqh_last = 0xf02cff5c}, generation = 10219, 
  type = OBJT_SWAP, size = 32, ref_count = 3, shadow_count = 1, pg_color = 0, 
  hash_rand = -136000830, flags = 384, paging_in_progress = 0, behavior = 0, 
  resident_page_count = 4, cache_count = 1, wire_count = 0, 
  backing_object = 0x0, backing_object_offset = 0x, 
  last_read = 29, pager_object_list = {tqe_next = 0xf30fad24, 
tqe_prev = 0xf30f0814}, handle = 0x0, un_pager = {vnp = {
  vnp_size = 0x0001}, devp = {devp_pglist = {tqh_first = 0x1, 
tqh_last = 0x0}}, swp = {swp_bcount = 1}}}


I'll keep this dump around.  What other details do people want?

I'm not likely to even get to look at this let alone solve it.  Bummer.

Stephen.


Re: Possible fix for rc.conf

1999-03-21 Thread Stephen McKay

On Sunday, 21st March 1999, Richard Wackerbarth wrote:

>Why do we need to have ANY of the file inclusion in /etc/defaults/rc.conf?
>Shouldn't that file simply be definitions of variables?
>IMHO, the "logic" should be in "rc" itself.

Yeah!  What he said!

Having code in rc.conf sucks.  If there is no logic, there can be no
recursion.  If you are going to mix code into rc.conf you may as well
just suck it back into /etc/rc and get rid of it entirely. (*)

Stephen.

(*) Which is silly, of course.


Fatal trap 1: privileged instruction fault while in kernel mode

1999-04-03 Thread Stephen McKay

I've just got what seems an unlikely panic.  How could I get a privileged
instruction fault while in kernel mode?

This is from a week-old 4.0-current kernel on a 16MB 486.  It has an AHA1542CF,
a slow SCSI-1 disk, and a rebadged TDC4200 (2GB QIC).  I run soft updates
but nothing else fancy.  I was doing a make buildworld, and rewinding a
tape at the time.  It seemed like the panic occurred when the tape stopped,
but I wasn't actually watching at the time.

The fatal instruction is:
0xc016abc9 :movl   $0xc023355c,0xffdc(%ebp)
which looks pretty ordinary.

Can this be a software bug?  Or has my hardware gone funny?  It has done a
good number of make worlds in the last few months with only the normal
(software) troubles you expect from -current.

Stephen.

Here are the gory bits, in case anyone can offer any hints:


IdlePTD 2834432
initial pcb at 248774
panicstr: from debugger
panic messages:
---
Fatal trap 1: privileged instruction fault while in kernel mode
instruction pointer = 0x8:0xc016abc9
stack pointer   = 0x10:0xc30a6c20
frame pointer   = 0x10:0xc30a6c54
code segment= base 0x0, limit 0xf, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags= interrupt enabled, resume, IOPL = 0
current process = 12245 (sh)
interrupt mask  = bio 
panic: from debugger
panic: from debugger

dumping to dev 30401, offset 163840
dump 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 
---
#0  boot (howto=260) at ../../kern/kern_shutdown.c:287
287 dumppcb.pcb_cr3 = rcr3();
(kgdb) where
#0  boot (howto=260) at ../../kern/kern_shutdown.c:287
#1  0xc0148095 in panic (fmt=0xc021bda8 "from debugger")
at ../../kern/kern_shutdown.c:448
#2  0xc0129575 in db_panic (addr=-1072256055, have_addr=0, count=-1, 
modif=0xc30a6acc "") at ../../ddb/db_command.c:432
#3  0xc0129515 in db_command (last_cmdp=0xc023558c, cmd_table=0xc02353ec, 
aux_cmd_tablep=0xc0245f8c) at ../../ddb/db_command.c:332
#4  0xc01295da in db_command_loop () at ../../ddb/db_command.c:454
#5  0xc012b95b in db_trap (type=1, code=0) at ../../ddb/db_trap.c:71
#6  0xc01f9bea in kdb_trap (type=1, code=0, regs=0xc30a6be4)
at ../../i386/i386/db_interface.c:157
#7  0xc0203a00 in trap_fatal (frame=0xc30a6be4, eva=0)
at ../../i386/i386/trap.c:938
#8  0xc02034b0 in trap (frame={tf_es = 16, tf_ds = -1023541232, 
  tf_edi = -1023478144, tf_esi = 0, tf_ebp = -1022727084, 
  tf_isp = -1022727156, tf_ebx = -1024550784, tf_edx = 12245, 
  tf_ecx = -1023478144, tf_eax = 0, tf_trapno = 1, tf_err = 0, 
  tf_eip = -1072256055, tf_cs = -1072300024, tf_eflags = 66194, 
  tf_esp = -1024550784, tf_ss = -1024177152}) at ../../i386/i386/trap.c:586
#9  0xc016abc9 in vclean (vp=0xc2ee9880, flags=8, p=0xc2fef680)
at vnode_if.h:835
#10 0xc016adb7 in vgonel (vp=0xc2ee9880, p=0xc2fef680)
at ../../kern/vfs_subr.c:1830
#11 0xc01698b1 in getnewvnode (tag=VT_UFS, mp=0xc05c5e00, vops=0xc05aa000, 
vpp=0xc30a6d04) at ../../kern/vfs_subr.c:467
#12 0xc01d4e69 in ffs_vget (mp=0xc05c5e00, ino=20442, vpp=0xc30a6d84)
at ../../ufs/ffs/ffs_vfsops.c:1082
#13 0xc01d8b1a in ufs_lookup (ap=0xc30a6ddc) at ../../ufs/ufs/ufs_lookup.c:538
#14 0xc01dd38d in ufs_vnoperate (ap=0xc30a6ddc)
at ../../ufs/ufs/ufs_vnops.c:2309
#15 0xc0166978 in vfs_cache_lookup (ap=0xc30a6e38) at vnode_if.h:55
#16 0xc01dd38d in ufs_vnoperate (ap=0xc30a6e38)
at ../../ufs/ufs/ufs_vnops.c:2309
#17 0xc0168dc1 in lookup (ndp=0xc30a6eb8) at vnode_if.h:31
#18 0xc0168894 in namei (ndp=0xc30a6eb8) at ../../kern/vfs_lookup.c:152
#19 0xc016e124 in stat (p=0xc2fef680, uap=0xc30a6f94)
at ../../kern/vfs_syscalls.c:1651
#20 0xc0203c83 in syscall (frame={tf_es = 134873135, tf_ds = -1078001617, 
  tf_edi = 0, tf_esi = 134890328, tf_ebp = -1077946844, 
  tf_isp = -1022726172, tf_ebx = 134890372, tf_edx = -1077946944, 
  tf_ecx = 134890388, tf_eax = 188, tf_trapno = 22, tf_err = 2, 
  tf_eip = 134656096, tf_cs = 31, tf_eflags = 518, tf_esp = -1077946968, 
  tf_ss = 47}) at ../../i386/i386/trap.c:1101
#21 0xc01fa53c in Xint0x80_syscall ()
#22 0x804b5f1 in ?? ()
#23 0x804a879 in ?? ()
#24 0x804a7f3 in ?? ()
#25 0x804a7f3 in ?? ()
#26 0x804a6fe in ?? ()
#27 0x804a7f3 in ?? ()
#28 0x804ab8a in ?? ()
#29 0x804a7ab in ?? ()
#30 0x804aa23 in ?? ()
#31 0x804a812 in ?? ()
#32 0x8051257 in ?? ()
#33 0x8051183 in ?? ()
#34 0x80480e9 in ?? ()
(kgdb) 
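
Editorial note: kgdb prints the trap frame fields in frame #8 as signed
decimals, which makes them hard to compare with the hex values in the panic
message.  The following is a minimal sketch of the conversion, assuming a
32-bit i386 trap frame; the helper names are made up for illustration and
are not part of any kernel or debugger API.

```c
#include <assert.h>
#include <stdint.h>

/* kgdb shows trap frame fields as signed 32-bit decimals; reinterpret
 * the same bits as an unsigned 32-bit address. */
uint32_t tf_to_addr(int32_t field)
{
    return (uint32_t)field;
}

/* Only the low 16 bits of a saved segment register hold the selector;
 * the upper half of the stored word is not meaningful. */
uint16_t tf_to_selector(int32_t field)
{
    return (uint16_t)((uint32_t)field & 0xffffu);
}

/* Cross-check against the values printed in the panic message. */
void check_against_panic_message(void)
{
    assert(tf_to_addr(-1072256055) == 0xc016abc9u); /* tf_eip vs instruction pointer */
    assert(tf_to_addr(-1022727084) == 0xc30a6c54u); /* tf_ebp vs frame pointer */
    assert(tf_to_selector(-1072300024) == 0x8);     /* tf_cs vs code selector 0x8 */
}
```

Read this way, tf_eip and tf_ebp agree with the instruction pointer and
frame pointer lines of the panic message, and the odd-looking tf_cs value
is just selector 0x8 with junk in the upper half.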

The End.



Re: EGCS breaks what(1)

1999-04-05 Thread Stephen McKay

On Monday, 5th April 1999, Matthew Dillon wrote:

>:char sccs[] = { '@', '(', '#', ')' };
>:char version[] = blahhhfoo;
>:Was contiguous.

>'what' is broken.  C does not impose any sort of address ordering
>restriction on globals or autos that are declared next to each other.   

Well, it's really an abuse of 'what', and not anything wrong with 'what'
itself.  It will continue to work fine doing the job it was designed to do.

The NetBSD folks faced this problem some time ago, and I believe their
solution was to duplicate the version information.  So, version[] is the
same as it used to be, and sccs[] is 4 bytes longer than version[] to hold
a complete copy, and the @(#) prefix.  This is then completely portable.
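
Editorial note: a minimal sketch of the duplicated-string approach described
above, not the actual NetBSD change.  VERSION_TEXT and sccs_text() are
hypothetical names introduced here; the point is only that the marker and
the text sit in one array, so nothing depends on how the compiler orders
two separate globals.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical version text; a real program would put its own here. */
#define VERSION_TEXT "example 1.0 (1999-04-05)"

/* Unchanged: the plain version string used by the program itself. */
char version[] = VERSION_TEXT;

/* Self-contained copy for what(1): the "@(#)" marker and the text live
 * in a single array, 4 bytes longer than version[]. */
char sccs[] = "@(#)" VERSION_TEXT;

/* Return the text that follows the "@(#)" marker inside sccs[]. */
const char *sccs_text(void)
{
    assert(strncmp(sccs, "@(#)", 4) == 0);
    return sccs + 4;
}
```

The cost is one extra copy of the string in the binary, which is the
duplication the paragraph above mentions.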

Alternately, we could jimmy around with the current hack, and prefix it
with 4 NULs, and see what happened.  Sorry, I haven't tested this idea, as
I've not yet made the EGCS jump.

Stephen.


Slightly wonky auto memory probe + fix

1999-04-06 Thread Stephen McKay

[I posted this to -current because the technology is the same in -current
even though this box will never run -current.  Bear with me.]

We've just got a new Dell PowerEdge (very nice) with 512MB of RAM.  By
default, 3.1-stable sees only 64MB.  Looking carefully, it sees 8KB less
than 64MB, so it doesn't probe for the rest.

I applied this patch, which fiddles the "Hmm got 64MB so probe for the
rest" heuristic.  With this patch, it found all 512MB, to the exact byte.
Unfortunately, it kinda changes it from a "heuristic" to a "hack". :-(


--- machdep.c   Fri Feb 19 15:31:36 1999
+++ /tmp/sgm/machdep.c  Tue Apr  6 23:40:36 1999
@@ -1428,7 +1428,7 @@
 * the MAXMEM option or the npx0 "msize", then don't do the speculative
 * memory probe.
 */
-   if (Maxmem >= 0x4000)
+   if (Maxmem >= 0x3f00)
speculative_mprobe = TRUE;
else
speculative_mprobe = FALSE;
@@ -1538,7 +1538,7 @@
if (phys_avail[pa_indx] == target_page) {
phys_avail[pa_indx] += PAGE_SIZE;
if (speculative_mprobe == TRUE &&
-   phys_avail[pa_indx] >= (64*1024*1024))
+   phys_avail[pa_indx] >= (63*1024*1024))
Maxmem++;
} else {
pa_indx++;
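
Editorial note: the arithmetic behind the two thresholds can be sketched as
follows.  This assumes Maxmem is counted in 4096-byte pages, which is what
makes 0x4000 correspond to 64MB; the helper names and SKETCH_PAGE_SIZE are
invented for this example and do not appear in machdep.c.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096UL          /* assumed i386 page size */
#define MB (1024UL * 1024UL)

/* Convert a byte count to whole pages, the unit assumed for Maxmem. */
unsigned long bytes_to_pages(unsigned long bytes)
{
    assert(bytes % SKETCH_PAGE_SIZE == 0);
    return bytes / SKETCH_PAGE_SIZE;
}

/* The heuristic from the patch, parameterised on its threshold (pages). */
bool wants_speculative_probe(unsigned long maxmem_pages,
                             unsigned long threshold_pages)
{
    return maxmem_pages >= threshold_pages;
}
```

With these assumptions, 64MB less 8KB is 0x3ffe pages: just under the
original 0x4000 threshold, so the probe is skipped, but comfortably over
the patched 0x3f00 (63MB) threshold.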

Stephen.



Re: have live system with NFS client cache problems what do i do?

1999-04-11 Thread Stephen McKay

On Sunday, 11th April 1999, Alfred Perlstein wrote:

>On Sun, 11 Apr 1999, Matthew Dillon wrote:
>
>> doing a 'file cd9660_bmap.o' on laptop (NFS client) gives me a 
>> cd9660_bmap.o: MS Windows COFF Unknown CPU
>> 
>> An MS Windows binary?  Do you have any msdos mounts on
>> the client or server?  How is /usr/obj mounted?

>no i have no msdos mounted filesystems, i do however have an
>unmounted win98 partition and a cdrom with joliet extentions mounted
>however the cdrom only contains mp3s.

This is a red herring:

$ dd if=/dev/zero of=foo count=1
1+0 records in
1+0 records out
512 bytes transferred in 0.000114 secs (4487949 bytes/sec)
$ file foo
foo: MS Windows COFF Unknown CPU
$

Look for the usual pack-of-nulls corruption instead.
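
Editorial note: one way to look for that kind of corruption is to measure
the longest run of NUL bytes in a suspect file's contents.  This is only a
sketch; longest_nul_run() is a hypothetical helper, and what run length
counts as suspicious (a whole 512-byte block, say) is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Return the length of the longest run of consecutive NUL bytes in buf. */
size_t longest_nul_run(const unsigned char *buf, size_t len)
{
    size_t best = 0, run = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0) {
            run++;
            if (run > best)
                best = run;
        } else {
            run = 0;    /* a non-NUL byte ends the current run */
        }
    }
    return best;
}
```

Object files legitimately contain some zero padding, so a short run proves
nothing; it is a block-sized or page-sized run in the middle of a file that
deserves a closer look.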

Stephen.



Re: have live system with NFS client cache problems what do i do?

1999-04-11 Thread Stephen McKay

On Sunday, 11th April 1999, Brian Feldman wrote:

>This has nothing to do with DOS. In case you didn't get my other hint:
>{"/home/green"}$ dd if=/dev/zero count=1 2>/dev/null | file -
>standard input:  MS Windows COFF Unknown CPU

Don't ya just hate it when your mail is slow!  Sigh...

Stephen.



Re: ctm-mail cvs-cur.5292.gz 18/82

1999-05-04 Thread Stephen McKay

On Sunday, 2nd May 1999, Chuck Robey wrote:

>On Mon, 3 May 1999, Jean-Marc Zucconi wrote:
>
>> This one did not arrive in my mailbox. Can someone send it to me? I
>> would like to avoid downloading 6Mbytes again.
>
>I'm going to mail it to you separately, but it might not look like it
>came from me.

I also did not receive part 18.  Are the individual parts kept anywhere
for anonymous ftp access?

Failures are rare, but they hit the big updates disproportionately and
have a bigger effect on bigger updates, so it's a double lose.
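
Editorial note: the "disproportionately" part can be illustrated with a
simple model.  The sketch assumes each mail part is lost independently with
the same probability, which is a simplification, and the 1% loss rate used
in the discussion below is a made-up figure, not a measurement.

```c
#include <assert.h>

/* Probability that all `parts` mails arrive, assuming each one is lost
 * independently with probability `loss` (a hypothetical model). */
double all_parts_arrive(unsigned parts, double loss)
{
    double p = 1.0;

    assert(loss >= 0.0 && loss <= 1.0);
    for (unsigned i = 0; i < parts; i++)
        p *= 1.0 - loss;    /* every part must get through */
    return p;
}
```

Under that model a single-part update with a 1% loss rate arrives 99% of
the time, while an 82-part update arrives intact less than half the time,
and the recovery download is larger too.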

Stephen.


Re: Uncommitted dc0 fixes ...

2002-09-09 Thread Stephen McKay

On Wednesday, 4th September 2002, Martin Blapp wrote:

>And this patch here together with patch III made the annoying messages (dc0:
>failed to force tx and rx to idle mode) go away. And I can use now my card
>without to replug the cable over again)

I've been meaning to remove the annoying message for ages.  Sorry about that.

>+   if (DC_IS_INTEL(sc)) {
>+   for (i = 0; i < DC_TIMEOUT; i++) {
>+   isr = CSR_READ_4(sc, DC_ISR);
>+   if (isr & DC_ISR_TX_IDLE &&
>+   (isr & DC_ISR_RX_STATE)
>+   == DC_RXSTATE_STOPPED)
>+   break;
>+   DELAY(10);
>+   }
>+   }

Conditionalising on DC_IS_INTEL() means most cards no longer wait until
the TX and RX are idle.  I don't have enough different if_dc cards to
know if this is safe.

On the other hand, every test I've done on my Intel and Macronix cards
shows zero calls to DELAY() in this loop.  The loop may as well not be
there for those card types.

Indeed, it isn't there at all in if_de and in a Linux driver I looked
at.  From this I'm guessing that no 21143 (real or clone) needs this check,
though I've got no real proof.

Out of all this fuzzy evidence, I guess the most sensible option is the
patch you've proposed.  If nobody else is interested, I'll commit this part
of your patch cluster on the weekend.  I suppose I could do the ADMtek
auto tx underrun recover patch too, as it seems harmless to other cards.
The other stuff I can't test at all.

This driver represents a counterintuitive state of affairs.  I was impressed
when Bill Paul managed to support so many clone cards with one driver.  But
now nobody has enough hardware on hand to test any change properly.  There's
some sort of lesson to be learnt here.

Stephen.


Re: dc(4) patch

2002-09-20 Thread Stephen McKay

On Thursday, 19th September 2002, John Baldwin wrote:

>--- if_dc.c 4 Sep 2002 18:14:17 -   1.77
>+++ if_dc.c 19 Sep 2002 20:57:03 -
>@@ -1366,7 +1370,8 @@
>for (i = 0; i < DC_TIMEOUT; i++) {
>isr = CSR_READ_4(sc, DC_ISR);
>if (isr & DC_ISR_TX_IDLE &&
>-   (isr & DC_ISR_RX_STATE) == DC_RXSTATE_STOPPED)
>+   ((isr & DC_ISR_RX_STATE) == DC_RXSTATE_STOPPED ||
>+(isr & DC_ISR_RX_STATE) == DC_RXSTATE_WAIT))
>break;
>DELAY(10);
>}

Sadly this change is insufficient to satisfy all cards.

The PNIC 82c169 does not idle the transmitter (stays in DC_TXSTATE_WAITEND),   
though the receiver goes idle OK.  The Davicom DM9102 does not idle the 
receiver when asked (seems to get stuck in DC_RXSTATE_ENDCHECK) though it 
stops the transmitter OK.  Your card does yet another thing.
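
Editorial note: the behaviour described above can be simulated without any
hardware.  In this sketch the register bit values are assumptions for
illustration and should be checked against if_dcreg.h; the isr_* functions
are stand-ins for CSR_READ_4(sc, DC_ISR) on three kinds of chip, and
ready_to_reconfigure() is a hypothetical name for the "check only on Intel"
idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative register layout (assumed values, not taken from the post). */
#define DC_ISR_TX_IDLE      0x00000002u
#define DC_ISR_RX_STATE     0x000e0000u
#define DC_RXSTATE_STOPPED  0x00000000u
#define DC_RXSTATE_ENDCHECK 0x00040000u
#define SKETCH_TIMEOUT      1000

typedef uint32_t (*isr_reader)(void);

/* Simulated chips: each returns the value its status register settles at. */
uint32_t isr_cooperative(void)   { return DC_ISR_TX_IDLE | DC_RXSTATE_STOPPED; }
uint32_t isr_tx_never_idle(void) { return DC_RXSTATE_STOPPED; }                 /* PNIC-like */
uint32_t isr_rx_stuck(void)      { return DC_ISR_TX_IDLE | DC_RXSTATE_ENDCHECK; } /* Davicom-like */

/* Poll until both TX and RX report idle; return false on timeout. */
bool wait_for_idle(isr_reader read_isr)
{
    assert(read_isr != NULL);
    for (int i = 0; i < SKETCH_TIMEOUT; i++) {
        uint32_t isr = read_isr();

        if ((isr & DC_ISR_TX_IDLE) &&
            (isr & DC_ISR_RX_STATE) == DC_RXSTATE_STOPPED)
            return true;
        /* A driver would DELAY(10) here; the simulation just retries. */
    }
    return false;
}

/* Only insist on the idle check for genuine Intel chips. */
bool ready_to_reconfigure(bool is_intel, isr_reader read_isr)
{
    return is_intel ? wait_for_idle(read_isr) : true;
}
```

The two uncooperative models always run the loop to its timeout, which is
where the "failed to force tx and rx to idle mode" style of complaint comes
from; skipping the loop for non-Intel chips avoids that at the price of not
checking them at all.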

I know these things through 3rd party reports, not because I have any
hardware to test.

So at this point I think the best idea is to do the checks only on Intel
hardware.  At least I can verify that works on a real card I can see with
my own eyes.

Another valid option is to send me one of every dc(4) supported card,
except genuine Intel and the Macronix 98715AEC.

Stephen.

PS The Intel manual says that one should check bit 8, not the receiver
state bits, to see if the receiver is idle.  That makes the test:

(isr & DC_ISR_TX_IDLE && isr & DC_ISR_RX_READ)

It doesn't help though since the uncooperative cards don't set that bit
either.  Also, I think DC_ISR_RX_READ should be spelled as DC_ISR_RX_IDLE.
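
Editorial note: written out as a predicate, the test from the PS looks like
the sketch below.  SK_RX_IDLE is bit 8 as stated above; the position of the
TX idle bit (bit 1) is an assumption made for the example, and the SK_
names are invented here rather than taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_TX_IDLE (1u << 1)   /* assumed position of the TX idle bit */
#define SK_RX_IDLE (1u << 8)   /* "bit 8", spelled DC_ISR_RX_READ in the driver */

/* The test from the PS: both summary bits must be set. */
bool idle_by_status_bits(uint32_t isr)
{
    return (isr & SK_TX_IDLE) != 0 && (isr & SK_RX_IDLE) != 0;
}

/* A few spot checks of the predicate. */
void check_examples(void)
{
    assert(idle_by_status_bits(0x102));    /* both bits set */
    assert(!idle_by_status_bits(0x002));   /* RX bit missing */
    assert(!idle_by_status_bits(0x100));   /* TX bit missing */
}
```

A chip that never sets bit 8 fails this test forever, just as it fails the
receiver state test, which is why the change does not rescue the
uncooperative cards.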


Re: dc(4) patch

2002-09-20 Thread Stephen McKay

On Friday, 20th September 2002, John Baldwin wrote:

>On 20-Sep-2002 Stephen McKay wrote:
>> Sadly this change is insufficient to satisfy all cards.
>
>Well.  I think we can keep the check for TX going idle and just not do
>the check for RX going idle.  The original code basically did this until
>you submitted a patch to wpaul@ that fixed a logic bug (used || above
>instead of &&) that effectively didn't do the RX idle check.

Not quite.  Davicom cards (and your card) fail to idle the receiver.
PNIC cards fail to idle the transmitter.  So it makes just as much
sense as any other idea to check those bits only on cards that document
that you have to check those bits.  My documentation only covers Intel. :-)

>Perhaps we should do the same here?  This would be similar to what we do in
>dc_tx_underrun() where we only make sure the TX is idle.

Except that the documentation states you have to idle the TX and RX to
change the full duplex bit, whereas you only have to idle the TX to
change the transmit fifo threshold.  And in dc_tx_underrun() only
the genuine Intel chips are treated specially.  Clones seem to work
without idling the transmitter.  Except the poor Davicom, which gets
reset on every underrun (if anyone has one, and it gets underruns, you
could try including it with the DC_IS_INTEL(sc) case and see what happens).

Stephen.


Re: dc(4) patch

2002-09-20 Thread Stephen McKay

On Friday, 20th September 2002, John Baldwin wrote:

>On 20-Sep-2002 Stephen McKay wrote:
>> Not quite.  Davicom cards (and your card) fail to idle the receiver.
>> PNIC cards fail to idle the transmitter.  So it makes just as much
>> sense as any other idea to check those bits only on cards that document
>> that you have to check those bits.  My documentation only covers Intel. :-)
>
>Hmm, what if we went back then to waiting until at least one of either
>TX or RX went idle?  Did only waiting for one actually break any 21143
>cards?

Well that's the funny thing.  It's documented to be necessary on Intel
21143 chips, but I've never seen a non-zero delay between asking for
the TX and RX to idle, and observing them to be idle.  So we could
probably delete the test-and-delay loop entirely.

Waiting for just one of them to go idle, like we have in -stable, is just
silly.  Would you test for condition "A" and assume that means "B" is OK in
any other part of the kernel?  It's really hoping that idling the TX and RX
take about the same time when there's no reason to believe that.  I think
the test in -stable is pretty much equivalent to having no test at all.
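
Editorial note: the difference between the two exit conditions is small
enough to enumerate.  This sketch reduces the status register to two
booleans; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* -stable style: stop waiting as soon as either side is idle. */
bool done_waiting_either(bool tx_idle, bool rx_idle)
{
    return tx_idle || rx_idle;
}

/* Documented requirement for the 21143: both sides must be idle. */
bool done_waiting_both(bool tx_idle, bool rx_idle)
{
    return tx_idle && rx_idle;
}

/* Count the TX/RX combinations where the loose test stops waiting
 * although the strict one would keep going. */
int unsafe_early_exits(void)
{
    int count = 0;

    for (int tx = 0; tx <= 1; tx++)
        for (int rx = 0; rx <= 1; rx++)
            if (done_waiting_either(tx, rx) && !done_waiting_both(tx, rx))
                count++;
    return count;
}

/* Two of the four states slip through the loose test. */
void check_examples(void)
{
    assert(unsafe_early_exits() == 2);
}
```

In both of those states exactly one side is still busy, so passing the
loose test says nothing about the side that was not checked.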

The only solid documentation I've got demands *both* must be idle.  But
that's from Intel and describes the original chips.  Hence, my view that
we should test the bits on Intel chips and forget about it on the clones.
Clones tend not to bother implementing all the limitations of the original
anyway.  If we find a clone that turns out to need the tests, we can enable
them for that clone too.

Stephen.


Re: cvs commit: src/sys/pci if_dc.c

2002-09-22 Thread Stephen McKay

On Friday, 20th September 2002, Martin Blapp wrote:

>I think we would have to test all cases with all cards. What cards
>do you have Stephen, with which clone Chipsets ? Can you make a list
>of them ?

I've only got DE500 (genuine Intel 21143) and Macronix 98715AEC cards.
Nothing PCMCIA or CardBus.  Not a very big selection, I know.  A lot
of us will have to band together to test changes.

>I've got somewhere another dc card which made problems. I guess
>it was PNIC.

PNIC is still a problem with the -current driver.

Stephen.


Re: cvs commit: src/sys/pci if_dc.c

2002-09-22 Thread Stephen McKay

On Friday, 20th September 2002, Martin Blapp wrote:

>mbr 2002/09/20 08:18:13 PDT
>
>  Modified files:
>sys/pci  if_dc.c 
>  Log:
>  Fix the support for the AN985/983 chips, which do not set the
>  RXSTATE to STOPPED, but to WAIT. This should fix hangs which
>  could only be solved by replugging the cable.

John's already mentioned we are still thinking about the right way to
handle this but...

>  MFC after:  2 weeks

... I thought I should explicitly mention that merging this particular
change as it stands is a bad idea because PNIC and Davicom cards (at least)
are not yet correctly handled.  The code in -stable is the old broken but
apparently harmless code.  This new code is attempting to be more correct
but breaks support for some cards.  Odd situation, no?

Stephen.


if_dc broken in -current

2002-03-22 Thread Stephen McKay


It's been quite a while since I updated my -current box, but when I did,
I was surprised to find that my DE500 network card (21143 chip) had stopped
working.  The switch showed no link.  Ifconfig showed "no carrier".

After some fiddling, I reverted revision 1.56 (removal of mii_pollstat call)
of sys/pci/if_dc.c and the DE500 went back to normal.  It auto-negotiated
100Mbit full duplex, and now works fine.

I expect the problem is actually in mii/dcphy.c but since I have very little
understanding of how this mii stuff is supposed to work, I have to leave that
to others.  If no one is available to give me a hand here, I'll have to
go with plan B which is to simply back out rev 1.56 of if_dc.c.  (That's
not such a bad plan really, just slightly inefficient.)

On a different dc driver note, I'm interested in knowing if anyone is using
either a PNIC or Davicom with -current.  There is a slight difference between
-current and -stable, and the code in -current caused problems with PNIC and
Davicom cards when it was briefly in -stable.  I'm assuming that nobody is
using such cards, and the little bit of code is going to annoy a few people
when they try the 5.0 prerelease.  I'd like to fix this before it causes
too much trouble.

For those who are curious, the troublesome piece of code is lines 1339 and
1340 (in rev 1.69):

if (isr & DC_ISR_TX_IDLE &&
(isr & DC_ISR_RX_STATE) == DC_RXSTATE_STOPPED)

which waits for confirmation that the transmitter and receiver are both
idle before some configuration registers are fiddled with.  With PNIC
and Davicom cards, one or the other of these conditions never occurs.
Or at least that was the trouble when this was in -stable, back in August.
Could this problem have "magically" gone away?

Stephen.


Re: if_dc broken in -current

2002-03-25 Thread Stephen McKay

On Friday, 22nd March 2002, "Ilmar S. Habibulin" wrote:

>On Sat, 23 Mar 2002, Stephen McKay wrote:
>
>> It's been quite a while since I updated my -current box, but when I did,
>> I was surprised to find that my DE500 network card (21143 chip) had stopped
>> working.  The switch showed no link.  Ifconfig showed "no carrier".
>
>I've had the simular problem. Now i have media option set to needed value
>in ifconfig_dc0 variable. This helped.

What sort of card do you have?  The output of dmesg would help.  Have you
tried 4.5 on this machine?

Of course the dc driver should autonegotiate (and does so when I revert
rev 1.56).  Your info could help trace this problem.

Stephen.

PS I'm now assuming the number of -current users that use PNIC and Davicom
cards with the dc driver is exactly zero.  Oh well.


Re: if_dc broken in -current

2002-03-26 Thread Stephen McKay

On Monday, 25th March 2002, "Ilmar S. Habibulin" wrote:

>On Mon, 25 Mar 2002, Stephen McKay wrote:
>
>> What sort of card do you have?  The output of dmesg would help.  Have you
>> tried 4.5 on this machine?
>I have some noname nic with Intel 21143 chip. dmesg attached. I'm using
>only trustedbsd_mac branch on my ws.

Yours seems to be the same as mine (from a chip and phy point of view)
although mine has a DEC assigned ethernet address and yours is from
Telebit.  I don't think that difference matters.

>> Of course the dc driver should autonegotiate (and does so when I revert
>> rev 1.56).  Your info could help trace this problem.
>Well, i don't think this is the problem. Hardware became too much
>inteligent now a days, so one have to use his own hands to make this
>hardware work like user wants it to work. Maybe just put some FAQ about
>dc(4) and autoconfigurable hubs/switches?

Some things can be blamed on attempted intelligence gone wrong.  But not
this one.  This is a simple bug.  My card works perfectly under 4.5.0
on the same machine.  It fails with -current.  But with one change
reverted, it works again.  Now all I have to do is work out what is
the real underlying cause, since the current code looks right at first
glance.  At least I have the old DEC datasheets, and some info on some
of the clones.

Stephen.


Re: if_dc broken in -current

2002-03-26 Thread Stephen McKay

On Monday, 25th March 2002, Robert Watson wrote:

>I think I have an identical problem involving a Linksys ethernet card
>using if_dc.  I have to force it to negotiate 10mbps, since it fails to
>negotiate anything higher with my 10/100 switch.  No idea why at all.
>
>dc0:  port 0xe800-0xe8ff mem
>0xfebfff00-0xfebf irq 10 at device 19.0 on pci0
>dc0: Ethernet address: 00:a0:cc:35:3e:56
>miibus0:  on dc0
>dcphy0:  on miibus0
>dcphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
>
>dc0: flags=8843 mtu 1500
>inet6 fe80::2a0:ccff:fe35:3e56%dc0 prefixlen 64 scopeid 0x1 
>inet 192.168.11.150 netmask 0xff00 broadcast 192.168.11.255
>ether 00:a0:cc:35:3e:56 
>media: Ethernet 10baseT/UTP
>status: active
>
>If I set it to auto-negotiate or hard-set to 100mbps, no packets go back
>or forth.  I've had this problem for at least a year, if not longer.  I
>have the same problem with 4.4-STABLE using an identical card on different
>hardware: if it tries to negotiate 100mbps, then it simply doesn't work.
>If I force it to 10, it's fine.

After careful consideration, I think this has to be a different problem.

My problem is that auto-negotiation doesn't start at boot (when an address
is assigned to dc0).  If I explicitly set a speed, that speed works.  Most
bizarrely, if I misspell the media option, that causes a successful
autonegotiation!  I mean, I type "ifconfig dc0 media 10baset" immediately
after boot, and autonegotiation takes over.  (If I spell it "10baset/utp"
it goes into 10Mbit half-duplex mode, like you expect.)  So it's just a
hair's breadth away from working properly, and reverting rev 1.56 is enough
for full operation to be restored.

Since you explicitly set 100Mbit half-duplex and it doesn't work, then that
must be something else.  We could have a go at finding that bug too, but
it will be harder, since I don't have a PNIC II here.  I do have some info
on the Macronix 98715A, which Bill Paul says is almost the same.  Maybe
we can get lucky.

Stephen.


Re: Enhancing the user experience with tcsh

2012-02-10 Thread Stephen McKay

On Friday, 10th February 2012, Eitan Adler wrote:

>-alias la   ls -a
>+alias la   ls -aF
> alias lf   ls -FA
>-alias ll   ls -lA
>+alias ll   ls -lAF
>+alias ls   ls -F
>
>Two people didn't like these changes but didn't explain why. This is
>incredibly helpful, especially for a new user.  If you dislike the
>alias change please explain what bothers you about it?

You should never, ever alias over a standard command in a default profile.
It will only train new users incorrectly.  Having to use \ls to get the
real ls is not an answer.  If you think -F should be the default behaviour
of ls, commit it directly to the ls source.  Then run away fast! :-)

As for the other ls aliases, I don't see the point given "lf" already
exists.  My only advice for your overall .cshrc changes is to be minimal
and aim low.  You may have a chance at consensus then.  Good luck!

By the way, one of the nice things about FreeBSD vs Linux is that less
shell configuration is set up by default, so less work is needed to
undo it all before you can get your own settings done.  Every "helpful"
thing that is set in /.cshrc or any other global config file is something
someone somewhere will have to discover and turn off.  Try not to make
it too hard for them.

Stephen.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
