32/64bit KSE issues?

2007-03-30 Thread David E. Cross
I recently ran into a problem where the 32bit JVM won't run on a 64bit 
host.  I, and at least one other person in -java, think it has to do with 
32 bit KSE on a 64bit kernel (I have a vague memory of this from somewhere WAY 
back).  Is this still the issue?  Could someone point me in the general 
direction of the specifics of the problem (if they exist; if not, I may 
try to create a simpler test case than Java)?

I tried a few searches, but nothing matching what I remembered came up.

David E. Cross

USFS (User Space File System)

1999-07-17 Thread David E. Cross
I am looking at a project that will require a user-space process to interact
with the system as if it were a filesystem.  The traditional way I have seen
this done is to have the system NFS-mount itself (a la amd).  I would really like
a cleaner approach to this.  What I am interested in is a 'User Space
File System' that would interact with a user process in a similar manner
to how nfsds do.  A process would issue a mount (ok, this is different from
nfsds), then it would make a special system call with a structure; that
call would return whenever a request was pending, with the structure filled in
with the appropriate information.  The user process would fulfill the request,
pack the return data into the structure and call the kernel again.
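
The request loop described above can be sketched in a few lines.  This is only
a toy model in Python, not kernel code: FakeKernel, wait_request() and
post_reply() are hypothetical names standing in for the proposed system call,
and the "filesystem" is just a dictionary.

```python
import queue


class FakeKernel:
    """Hypothetical stand-in for the kernel side of the proposed interface."""

    def __init__(self):
        self.pending = queue.Queue()   # requests waiting for the user process
        self.replies = {}              # request id -> data handed back

    def submit(self, req_id, op, path):
        self.pending.put({"id": req_id, "op": op, "path": path})

    def wait_request(self):
        # Models the "special system call": blocks until a request is pending.
        return self.pending.get()

    def post_reply(self, req_id, data):
        self.replies[req_id] = data


def serve(kernel, files, count):
    """User-process loop: fetch a request, fulfill it, hand the result back."""
    for _ in range(count):
        req = kernel.wait_request()
        # None models "no such file" or an unsupported operation.
        data = files.get(req["path"]) if req["op"] == "read" else None
        kernel.post_reply(req["id"], data)


kernel = FakeKernel()
kernel.submit(1, "read", "/hello")
kernel.submit(2, "read", "/missing")
serve(kernel, {"/hello": b"hi\n"}, count=2)
print(kernel.replies)
```

Carrying the request id in the structure is what lets the reply be matched to
the caller that is waiting on it.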

I have a number of questions on more specific ideas (like caching, inode/vnode
interaction, etc.).  But I am just feeling around for what people think
about this.  Any ideas/comments?

David Cross   | email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer | Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, | Ph: 518.276.2860
Department of Computer Science| Fax: 518.276.4033
I speak only for myself.  | WinNT:Linux::Linux:FreeBSD

Re: USFS (User Space File System)

1999-07-18 Thread David E. Cross
> :
> :Look into the portal filesystem. This is what you want :)
> :
> : Brian Fundakowski Feldman
> : gr...@freebsd.org
> Actually, it isn't quite.  All the portal filesystem will allow you
> to do is pass back a descriptor.  It does not allow you to simulate
> a filesystem.
> But something similar to what the portal filesystem does would be
> cool -- maybe a real protocol to pass the VOP requests down to a 
> user process and get responses & data.

Portal FS did give me a couple of starting points.. It looks interesting.
Just for my own clarification... how would this be different than NFS
(specifically local NFS)?


Re: PAM & LDAP in FreeBSD, and userfs too.

1999-07-19 Thread David E. Cross
I thought now would be a good time to chime in on some of my wild schemes...

The reason I am interested in 'userfs' is to enable me to write a version
of 'nsd'.  Those of you familiar with Irix will recognize it.  For others,
what it does is present the name-space on a machine as filespace.
The advantage of this is that we can greatly simplify our libc to use the
file/namespace that nsd provides.  For example, 'getpwent()' now becomes
file accesses to /ns/.local/passwd/NAME.  Another advantage that this
abstraction provides is that it allows transparent alteration of the
databases in use, even to the extent of NOT having to restart each client
that may be using a specific database.
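
As a rough illustration of a lookup turning into a file access, here is a
small Python sketch.  It assumes each name is a file holding one
passwd(5)-style line; the temporary directory stands in for /ns/.local, and
ns_lookup(), parse_passwd() and the sample entry are made up for the example
(this is not how Irix nsd is actually implemented).

```python
import os
import tempfile


def ns_lookup(root, table, key):
    """Return the record stored at <root>/<table>/<key>, or None if absent."""
    try:
        with open(os.path.join(root, table, key)) as f:
            return f.read().rstrip("\n")
    except FileNotFoundError:
        return None


def parse_passwd(line):
    """Split a seven-field passwd(5)-style line into a dict."""
    name, _pw, uid, gid, gecos, home, shell = line.split(":")
    return {"name": name, "uid": int(uid), "gid": int(gid),
            "gecos": gecos, "home": home, "shell": shell}


# Throwaway tree standing in for /ns/.local (hypothetical contents).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "passwd"))
with open(os.path.join(root, "passwd", "crossd"), "w") as f:
    f.write("crossd:*:1001:1001:David Cross:/home/crossd:/bin/sh\n")

entry = parse_passwd(ns_lookup(root, "passwd", "crossd"))
print(entry["uid"], entry["home"])
```

Because the client only opens a path, the daemon behind the namespace can swap
its backing database without the client noticing.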


Re: PAM & LDAP in FreeBSD

1999-07-19 Thread David E. Cross
> Mike Smith wrote:
> > On Mon, Jul 19, 1999 at 06:13:51PM +0200, Dag-Erling Smorgrav wrote:
> > > Oscar Bonilla  writes:
> > > > the idea is to have an entry in the /etc/passwd enabling LDAP lookups.
> > > > the Entry would be of the form
> > > > 
> > > > ldap:*:389:389:o=My Organization, c=BR:uid:ldap.myorg.com
> > > 
> > > Horrible idea.
> > > 
> > 
> > suggestions?
> Use PAM.

PAM isn't going to cut it.  This is outside of its realm.  Things like ps,
top, ls, chown, chmod, lpr, rcmd, who, w, (the list goes on) need to be able
to pull 'passwd' entries from the LDAP server, and unless we PAM all of those
(I think that is a very bad idea), then a person will be able to login but
will be dead in the water without a UID <->Username mapping.
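
To make the point concrete: tools such as ls resolve a numeric UID through the
passwd database, not through PAM.  The Python sketch below mimics that with an
injectable lookup; the local_db table and UID 4242 are hypothetical, and the
default argument is the real pwd.getpwuid(), which raises KeyError for an
unknown UID.

```python
import pwd
from types import SimpleNamespace


def owner_name(uid, getpwuid=pwd.getpwuid):
    """Return the user name for uid, or the number as a string if unmapped."""
    try:
        return getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


# Hypothetical local database: the LDAP-only user 4242 is missing from it.
local_db = {0: SimpleNamespace(pw_name="root")}


def local_only(uid):
    return local_db[uid]   # raises KeyError for unknown UIDs, like pwd.getpwuid


print(owner_name(0, local_only))      # resolved locally
print(owner_name(4242, local_only))   # falls back to the bare number
```

A user who authenticated some other way would show up as a bare number in
every such tool unless the lookup path itself knows about the LDAP source.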


Re: PAM & LDAP in FreeBSD, and userfs too.

1999-07-19 Thread David E. Cross
> > Lovely.  Sounds like a much better way to do the Solaris/Linux (and
> > NetBSD?) /etc/nsswitch.conf stuff.  On Solaris at least, this is
> > implemented using masses of weird shared objects...
>The plan for NetBSD is that things will also be handled with dynamic
>modules, but those dynamic modules will be glued into a `nscd'[*] (if you
>use Solaris, you're familiar with the name :-).
>[*] We are planning on not having all of the problems that the Solaris
>nscd has, and that people often complain about.
>This will allow libc to simply make a call to nscd (or fallback onto
>traditional `files' lookup), and nscd will handle all but the `files'
>case.  This allows system-wide caching, and puts all of the complexity
>in one place.
>Involving one or more user mode file systems seems like ... the wrong
>approach for a name service switch.
Tomato, Tomatoe.

The difference between the 2 methods is in their interaction with the database
itself.  You will be providing a socket-ish interface to the cache, my plan
is for a filesystem interface, heck it could probably do both.

I personally prefer the FS approach, having dealt with both on Solaris and Irix.
What Irix does well, Irix does very well.  The FS method also allows more
complex permission checking on access to various databases, like shadow, 
because a node in a directory has the added granularity of being group
readable.  It also gives you the flexibility of the shell, or a web-browser,
to get at the data.  Another idea I had considered was placing something
a la 'rc.conf' into a database to allow easy distribution throughout many
computers (this would obviously be configuration later in the boot process). 
Having a FS style interface makes it much easier for programs to get at the
data in a clear, consistent manner.

These are just my ramblings, and it seems we are quickly converging on the
same basic idea with slightly different (but perhaps compatible)


Re: PAM & LDAP in FreeBSD

1999-07-20 Thread David E. Cross
> Couldn't we do this with /etc/auth.conf? What's the real purpose of this
> file? From the man page: "auth.conf contains various attributes important to 
> the authentication code, most notably kerberos(5) for the time being."
> Isn't this what PAM is about? authentication? or does auth.conf cover the 
> "other" part of authentication, basically the getpw* stuff?

This is bigger than just authentication.  This is about the various databases
that the machine needs to keep in touch with.. hosts, passwd, ethers, services,
protocols, group, etc...   For example using auth.conf how would one [cleanly]
instruct the system that for group information it should use NIS, for hosts,
DNS, and for passwords NIS (for the passwd entry) and Kerberos (for the
password).  What you would have when you are done would be very similar to
'nsswitch.conf'.  With the exception that even nsswitch.conf cannot do
everything, you still need auth.conf (shouldn't this really be pam.conf?) to
tell the system to use kerberos (or whatever) to authenticate the user.

BTW: To clear up some possible misunderstanding from earlier, I am 100% 
in support of /etc/nsswitch.conf for FreeBSD.  My "FreeNSD" ;)  'nsd' server
would read /etc/nsswitch.conf for its configuration, just like the Irix
version does.
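
For readers who have not seen the format, here is a minimal Python sketch of
reading an nsswitch.conf-style file into a per-database list of sources.  The
sample contents mirror the NIS/DNS split mentioned above and are illustrative
only; real files also allow criteria such as [NOTFOUND=return], which this
sketch does not handle.

```python
def parse_nsswitch(text):
    """Map each database name to its ordered list of sources."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        database, _, sources = line.partition(":")
        table[database.strip()] = sources.split()
    return table


example = """
# database: sources, tried left to right
passwd: files nis
group:  nis
hosts:  files dns
"""

config = parse_nsswitch(example)
print(config["hosts"])
```

Note that nothing here says how a password is verified; that part stays with
the authentication configuration.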


wcs stuffs...

1999-07-20 Thread David E. Cross
Yes, I am still working on it, don't despair ;)

This is a case of project creep... I am now working on the 'isw*()'
functions, and I have a couple of questions regarding locale support in
FreeBSD.  Namely, how the heck do I get access to the database?  I see
that the LC_* databases have all the information I want in them, but I have
0 clue on how to cleanly access the data.  Any pointers (esp. pointers to man
pages) would be greatly appreciated.  Thanks :)


Re: amandad zombies (fwd)

1999-07-20 Thread David E. Cross
We had a similar problem here.  We had meant to submit a PR for it but forgot.
In our case it was because inetd had only the amanda line in it (inetd was
not responsible for any other services).  Our guess was that it is an off-by-one
error in inetd somewhere, but we never traced it down further.  Our 
workaround was to enable a second service.

Let me know if this was your problem.

linking question...

1999-07-20 Thread David E. Cross
I have a program (part of CDE)... we will call it 'foo', 

"foo" has library dependencies: libtt.so, libX11.so, libXt.so, libXext.so, and
libwcs.so (this last one is mine).

libtt.so depends on iswalpha() and iswspace()  (which are defined in libwcs.so)

If I link with all of those I get an error that iswspace and iswalpha are
undefined, yet:  nm /usr/local/lib/libwcs.so | grep isw returns:
> 1358 T iswalpha
> 13b0 T iswprint
> 1384 T iswspace

And if I change the '-lwcs' line to '/usr/local/lib/libwcs.a' it links fine.



Re: amandad zombies (fwd)

1999-07-20 Thread David E. Cross
Nope, that is all that we had time to track down.  We were fighting NFS panics
around the same time, stuff got lost in the shuffle :)


Re: linking question...

1999-07-20 Thread David E. Cross
> > I have a program (part of CDE)... we will call it 'foo',
> > 
> > "foo" has library dependencies: libtt.so, libX11.so, libXt.so, libXext.so, and
> > libwcs.so(this last one is mine).
> > 
> > libtt.so depends on iswalpha() and iswspace()  (which are defined in libwcs.so)
> > 
> > If I link with all of those I get an error that iswspace and iswalpha are
> > undefined, yet:  nm /usr/local/lib/libwcs.so | grep isw returns:
> > > 1358 T iswalpha
> > > 13b0 T iswprint
> > > 1384 T iswspace
> > 
> > And if I change the '-lwcs' line to '/usr/local/lib/libwcs.a' it links fine.
> Did you remember -L/usr/local/lib so the linker will know to search for
> libraries there?  Do you have a bogus libwcs.a in /usr/lib?

The problem was indeed conflicting libraries (in /usr/X11R6/lib).  However,
I did place a -L/usr/local/lib on the line *immediately* before the -lwcs,
yet the linker appeared to take the /usr/X11R6/lib version (which was in a
previous -L statement) instead.  Is this correct?
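
For what it is worth, GNU ld documents that -L directories are searched in the
order they appear on the command line, and that every -L applies to every -l
regardless of where the two sit relative to each other.  Assuming that is the
linker in play, the small Python model below shows why the earlier directory
wins; resolve_library() and the directory contents are hypothetical.

```python
def resolve_library(name, search_dirs, available):
    """Return the first lib<name>.so found, scanning search_dirs in order."""
    filename = "lib%s.so" % name
    for directory in search_dirs:
        if filename in available.get(directory, ()):
            return directory + "/" + filename
    return None


# Hypothetical layout matching the situation in the message.
available = {
    "/usr/X11R6/lib": {"libX11.so", "libwcs.so"},
    "/usr/local/lib": {"libwcs.so"},
}

# Models: -L/usr/X11R6/lib ... -L/usr/local/lib -lwcs
order = ["/usr/X11R6/lib", "/usr/local/lib"]
print(resolve_library("wcs", order, available))
```

Under that model, putting -L/usr/local/lib first on the command line (or
naming the library by full path, as with the .a file) selects the intended
copy.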


UDMA broken in -CURRENT/-STABLE? *CRITICAL*!!!!!

1999-07-21 Thread David E. Cross
I updated a system to -CURRENT last night and got a panic with a lot of
messages about UDMA failing (I don't have the exact messages, I can get
them if need be).  I backed down the wdc0/wdc1 controller flags from
0xa0ffa0ff to 0x0 and everything is happy.  I figured it's -CURRENT, and that
is to be expected.  

I updated another system to -STABLE as of earlier today, and I got the same
thing... *eeak*.  Again, backing down from 0xa0ffa0ff to 0x0 works like a

The messages came right after init(8) started, and before any of the 
filesystems were mounted r/w (it happened most during the fsck).

I hope someone else has seen this (sorry I am so skimpy on the details, I
will be able to provide more soonish.)

uname -a:
FreeBSD phoenix.cs.rpi.edu 3.2-STABLE FreeBSD 3.2-STABLE #0: Wed Jul 21 
15:17:27 EDT 1999 r...@phoenix.cs.rpi.edu:/usr/src/sys/compile/PHOENIX  i386

ide_pci0:  rev 0x00 on pci0.7.1


Re: UDMA broken in -CURRENT/-STABLE? *CRITICAL*!!!!!

1999-07-21 Thread David E. Cross
> I was in the UDMA code yesterday
> (but mostly in the CYRIX code... (changes elsewhere should have been
> mostly cosmetic).
> can you get the exact error message?
> julian

I got it...  I happened to be working on something else at the time
and I let it sit unattended for a while.. it ate my disk partition rather

Here is the error I see:
wd0: DMA failure, DMA status 5

Thank goodness I tested this -STABLE on my desk machine before I was going to
place it on the home directory server for the department...


Filesystem question...

1999-07-22 Thread David E. Cross
Since I am planning on writing userfs in order to implement 'nsd' (and
some other ideas I have hatching too :), I need to know how filesystem
accesses work.  Can they be queued up, and responded to out of order?

For example... I have a request come in (via the filesystem), that request
is going to take a while, so I thread off a handler.  Now another filesystem
request comes in; will that be delivered to me, or will that block waiting
for the previous request to be honored first?
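
The scenario can be modelled outside the kernel.  The Python sketch below does
not say what the FreeBSD VFS actually does; it only shows the pattern being
asked about, where each request carries an id so that a later request can be
answered while an earlier one is still being worked on.  The handler and the
event used to hold back the slow request are made up for the example.

```python
import threading

replies = []                         # request ids, in the order replies were produced
lock = threading.Lock()
slow_may_finish = threading.Event()  # held back to model a long-running lookup


def handle(req_id, wait_for=None):
    if wait_for is not None:
        wait_for.wait()              # this request "takes a while"
    with lock:
        replies.append(req_id)


slow = threading.Thread(target=handle, args=(1, slow_may_finish))
fast = threading.Thread(target=handle, args=(2,))
slow.start()                         # request 1 arrives first
fast.start()                         # request 2 arrives second...
fast.join()                          # ...and is answered while 1 is still pending
slow_may_finish.set()
slow.join()
print(replies)
```

Whether the second request is ever delivered while the first is outstanding is
exactly the open question; the model only shows that the user-side daemon can
cope if it is.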


mbuf leakage in NFSv3 writes, possible?

1999-07-22 Thread David E. Cross
I have 2 NFS servers.  One is primarily read-only, the other read-write; they
service the same clients (the read-only one services more).  They are (were) of
the same build.  I have a problem on the read/write server where it chews
through mbuf clusters (it goes through about 3k in a day).  Even late
at night, when the machine is not busy, and right now, when it is also not
busy, it goes through a few mbuf clusters every minute or so.  The rate is about 108
minutes for 300 clusters.  Does it sound reasonable that there is an mbuf leak
in the NFS code somewhere?
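
As a quick sanity check on those two figures, the measured rate can be scaled
to a day.  This is plain arithmetic in Python; the function name is made up.

```python
def clusters_per_day(clusters, minutes):
    """Scale an observed consumption rate to a 24-hour day."""
    return clusters * 24 * 60 / minutes


rate = clusters_per_day(300, 108)
print(rate)
```

That works out to about 4000 clusters a day at the idle rate, the same order
of magnitude as the "about 3k in a day" estimate.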


Re: mbuf leakage in NFSv3 writes, possible?

1999-07-22 Thread David E. Cross
Well, I just -STABLED the server to see if it fixed it, but I was certainly
running out.  The server had only 3000-ish mbuf chains, and it would go through
them all in a day.


Re: mbuf leakage in NFSv3 writes, possible?

1999-07-23 Thread David E. Cross
Ok, here are some real stats

"w" is the read-only machine, it services everything that "s" (the
read-write machine) does... in fact it services more.

*w crossd $ strings -a /kernel | grep \^___maxusers
___maxusers 96
*w crossd $ uname -a
FreeBSD w.cs.rpi.edu 3.2-STABLE FreeBSD 3.2-STABLE #1: Tue Jun 29 09:36:32 EDT 
1999 r...@w.cs.rpi.edu:/usr/src/sys/compile/WOBBLE  i386
*w crossd $ uptime
 1:43PM  up 24 days,  2:08, 3 users, load averages: 0.00, 0.00, 0.00
*w crossd $ netstat -m
106/2688 mbufs in use:
85 mbufs allocated to data
21 mbufs allocated to packet headers
64/426/2048 mbuf clusters in use (current/peak/max)
1188 Kbytes allocated to network (11% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

*s crossd $ uname -a
FreeBSD s.cs.rpi.edu 3.2-STABLE FreeBSD 3.2-STABLE #0: Thu Jul 22 18:12:21 EDT 
1999 r...@phoenix.cs.rpi.edu:/usr/src/sys/compile/STAGGER  i386
*s crossd $ strings -a /kernel | grep \^___maxusers
___maxusers 512
*s crossd $ uptime
 1:43PM  up 19:23, 2 users, load averages: 0.02, 0.01, 0.00
*s crossd $ netstat -m
3629/4096 mbufs in use:
3621 mbufs allocated to data
8 mbufs allocated to packet headers
3550/3660/8704 mbuf clusters in use (current/peak/max)
7832 Kbytes allocated to network (96% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
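
To compare the two machines at a glance, the cluster line of each netstat -m
listing can be pulled apart mechanically.  A small Python sketch; the regular
expression assumes the exact "current/peak/max" wording shown above, and
cluster_usage() is a made-up helper name.

```python
import re


def cluster_usage(netstat_output):
    """Extract (current, peak, max) from the 'mbuf clusters in use' line."""
    m = re.search(r"(\d+)/(\d+)/(\d+) mbuf clusters in use", netstat_output)
    if m is None:
        raise ValueError("no mbuf cluster line found")
    return tuple(int(g) for g in m.groups())


w = "64/426/2048 mbuf clusters in use (current/peak/max)"
s = "3550/3660/8704 mbuf clusters in use (current/peak/max)"

for host, line in (("w", w), ("s", s)):
    current, peak, maximum = cluster_usage(line)
    print(host, current, peak, "%.1f%% of max" % (100.0 * current / maximum))
```

The telling detail is that on "s" the current count sits just under the peak
after less than a day of uptime, while on "w" it is far below the peak after
24 days.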


Re: mbuf leakage in NFSv3 writes, possible?

1999-07-23 Thread David E. Cross
Well, backing out now is not really an option...  But given my past history
with NFS, and knowledge of this site I think I have a fair idea where the
leak is...  I think it is in the nfsv3 "commit" handler.  

Why do I think this?  Simple, this problem started when a user started running
a large job on our Origin 2k; prior to that our server had been up for 30-ish
days sans any problems, since his start it requires a boot-a-day (mbuf
clusters are up to 8k).  Also supporting this is the fact that the clusters
are used at a fairly constant rate.  Now (following that hunch), I did a 
tcpdump against that host for tcp traffic, and noticed a fairly steady
stream of "commit" NFS traffic.

I realize none of this is a smoking gun, but that is where my "hunch" lies.
How is mbuf cluster cleanup done?  If I knew I might have a shot in heck
at locating this problem.

BTW: updated netstat -m for the machine:
4855/5344 mbufs in use:
4848 mbufs allocated to data
7 mbufs allocated to packet headers
4774/4850/8704 mbuf clusters in use (current/peak/max)
10368 Kbytes allocated to network (97% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

That's a lot of buffer ;)


mbuf leakage

1999-07-23 Thread David E. Cross
Well, it doesn't appear to be commit() :(.

Any-who, is there a way I can get a look at the raw mbuf/mbuf-clusters?
I have a feeling that seeing the data in them would speak volumes of
information.  Preferably a way to see them without DDB/panic would be ideal.


mbuf leak found... for real this time.

1999-07-23 Thread David E. Cross
I found it... our favorite function... nfsrv_create()!!! :)

The problem was/is a create of an already existing file (with O_EXCL|O_CREAT,
I would bet, but I don't have any way to tell) returns *nothing* to the sender.
The last time I had this problem it was because nfsrv_create() was not clearing
error before its return (signalling to the caller that there was a more
severe error and the packet should not be responded to).  I have looked
through the code, and around line 1759 there should be a 
"goto nfsmreply0".   Clearly we need to set error to 0 before we depart from
this function with this kind of condition, I am just not sure of the
'correct' way to do it.

Any-who, I am not able to reproduce this reliably since the OS (all OSes I 
have tried, including the troubled machine) issues a getattr() to see if the
file exists as a first stage, not even attempting the create call for the
first try.  This looks like a race condition waiting for us to lose it.

As another aside... I really do think that on returning with an error
condition, it may be a good idea to free those mbuf/mbuf-clusters.  I cannot
see a reason to keep them lying around.
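
The failure mode is easier to see in a toy model.  The Python below is not the
nfs_serv.c code; Server, create_buggy() and create_fixed() are hypothetical
stand-ins that only reproduce the shape of the problem: a handler that has
already encoded the error into its reply but still returns non-zero, so the
dispatcher neither sends the reply nor releases it.

```python
import errno


class Server:
    def __init__(self):
        self.sent = []    # replies actually transmitted
        self.held = 0     # reply buffers built but never sent or freed

    def dispatch(self, handler):
        reply, error = handler()
        if error == 0:
            self.sent.append(reply)
        else:
            self.held += 1   # models the leaked mbufs


def create_buggy():
    reply = {"status": errno.EEXIST}
    return reply, errno.EEXIST   # status is in the reply, but error is left set


def create_fixed():
    reply = {"status": errno.EEXIST}
    return reply, 0              # error cleared once the reply carries the status


server = Server()
server.dispatch(create_buggy)
server.dispatch(create_fixed)
print(len(server.sent), server.held)
```

In the model, the client of the buggy handler gets no answer and retries,
which would account for a steady, load-independent leak.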


Re: mbuf leak found... for real this time.

1999-07-24 Thread David E. Cross
PS: I was down to only 3k mbuf-clusters free on the server, so I 'rm'-ed
the troublesome file and the create went through and no more mbuf-leaking.

On the downside, I cannot reproduce this problem any longer with any


Re: mbuf leak found... for real this time.

1999-07-24 Thread David E. Cross
> Hmmm.  Interesting.  An EEXIST error occurring at that point for an
> NFSV3 mount will execute the correct nfsm_reply(), but since it is 
> NFSV3 the nfsm_reply() macro will not jump to a return(0) ... when
> it finishes constructing the reply it falls through instead.
> In this case I believe the nfsm_reply() call on line 1761 is correct,
> but that we are failing to clear the error afterwards and this is
> resulting in a non-zero return().  Thus the reply packet is being
> properly formatted but not being transmitted.
> I think all we need to do is set error = 0 for the NFSV3 case after
> the nfsm_srvwcc_data() call.  See the enclosed patch.  We definitely
> do not want to call nfsm_reply(0), because we already correctly call 
> nfsm_reply() on line 1761 (in STABLE).
> I really appreciate the effort you've put into tracking down these 
> problems, Dave!  You are virtually the only one who has enough of a 
> mix of NFS clients to truly test the server-side code.  The only
> testing I can do is between FreeBSD boxes!
> In any case, please try the enclosed patch.   The patch, if correct,
> should be applied to all branches.
> And if there is anyone else up on NFS I would appreciate a review of
> the patch!  Remember that nfsm_reply() deals with errors differently
> between NFSv2 and NFSv3.
Yes, I concur with your patch whole-heartedly.  Apparently last night I
was too tired, and not intoxicated enough, to understand the nfs_serv.c code :)

Alas, I will not be able to test it.  The machine is up and stable with 3k
mbufs in reserve.. maybe later :)

As an aside, what about getting rid of that mbuf leak if a nfs-service
routine returns with error!=0?


userfs help needed.

1999-07-28 Thread David E. Cross
I am wading through the portalfs and nullfs source, but I am desperately
lost.  I would love to be able to find out who would be willing to help out
with questions.  I feel I would be spamming far too many people by just sending
to -hackers.  Some of the topics I am curious about are general fs-style
questions, such as what the various vop/vfs calls do.  Also I would like to know how
to set up a shared memory segment between kernel and user space (as Matt
Dillon suggested).  Finally I would like to know how the buffer-cache interacts
with the filesystem layer.


Re: So, back on the topic of enabling bpf in GENERIC...

1999-07-30 Thread David E. Cross
Here is a pro vote for enabling BPF in GENERIC:

It will let us use a dhcp client in the install programs, this is of tremendous
use to many people as DHCP starts to become much more popular.  I cannot
net install a machine at home since that is on a DHCP cable modem service.

Also, if root is compromised on a system, even if you don't have bpf installed
you would be a fool to believe that they are not sniffing packets/passwords.
At the very least Mr. Pragmatic(sp?) has shown the world the power and 
flexibility of KLDs... I am sure someone could write a KLD to implement the
functionality of a packet sniffer.  Also, an attacker, once obtaining root,
could certainly trojan ftpd/sshd/telnetd/login/whatever.  I think disabling
bpf for "security reasons" gives a false sense of security.


host byte order in networking routines?!?

1999-08-07 Thread David E. Cross

A friend writing some portable network tunneling software ran into an
interesting thing... when you specify "IP_HDRINCL" with SOCK_RAW and
IPPROTO_RAW you need to construct the outgoing packet in host byte order.

This seems wonderfully inconsistent with all of the other socket-based
networking interfaces in FreeBSD, and it is also inconsistent with other
operating systems.   Would it be possible to get this changed?  I can provide
diffs if need be.
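
To show what the difference looks like on the wire, the Python sketch below
packs one 16-bit header field both ways.  The message does not say which
fields are affected; as far as I know, on BSD stacks of this era it was the
IP total length and fragment offset, so treat the choice of ip_len here (and
its value of 40) as an assumption made for illustration.

```python
import struct
import sys


def pack_u16(value, order):
    """Pack a 16-bit field in 'network' (big-endian) or 'host' byte order."""
    fmt = "!H" if order == "network" else "=H"
    return struct.pack(fmt, value)


ip_len = 40  # hypothetical: 20-byte IP header plus 20 bytes of payload

print(sys.byteorder,
      pack_u16(ip_len, "network").hex(),
      pack_u16(ip_len, "host").hex())
```

On a little-endian machine such as the i386 the two encodings differ, so
portable code written for network order everywhere would hand the kernel a
length of 10240 instead of 40.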


FreeBSD 3.2 on a ThinkPad 360c [keyboard not working]

1999-08-12 Thread David E. Cross
I am attempting to get FreeBSD 3.2 and/or 4.0 to go on a TP 360c.  The 
problem I am having is that the keyboard works all the way up to sysinstall.
I can use the keyboard in the visual kernel config/etc.  I searched and found
under 2.2 they suggested setting flags 0x10 on syscons.  0x10 isn't documented
to do anything under 3/4 but I tried anyway, nothing.  I also noticed that
flags 0x04 and 0x02 may be of some use (on atkbdc).  I tried 0x4, 0x2, and 0x6 to
no avail.  Help?


Re: FreeBSD 3.2 on a ThinkPad 360c [keyboard not working]

1999-08-12 Thread David E. Cross
> I am attempting to get FreeBSD 3.2 and/or 4.0 to go on a TP 360c.  The 
> problem I am having is that the keyboard works all the way up to sysinstall.
> I can use the keyboard in the visual kernel config/etc.  I searched and found
> under 2.2 they suggested setting flags 0x10 on syscons.  0x10 isn't documented
> to do anything under 3/4 but I tried anyway, nothing.  I also noticed that
> flags 0x04 and 0x02 may be of some use (on atkbdc).  I tried 0x4, 0x2, and 0x6 to
> no avail.  Help?

Here are some additional details... I tried the 2.2.8-RELEASE install with
the flags of '0x10' on sc0.  That worked OK.  I dug through the CVS repo
and I have discovered that those are the XT keyboard options (flags 0x04
on atkbd), so I went into the CLI config on the 3.2-STABLE bootdisk and
turned those flags on for BOTH atkbd0 and atkbdc0 (just in case); still no luck.
I have looked at the source for 2.2 syscons and 3.2 atkbd and I cannot see
what the difference is in the codeset initialization and keyboard translation
for the 2 types.  I would like to try 3.0-RELEASE, but I cannot find anything
that old ;)



Re: FreeBSD 3.2 on a ThinkPad 360c [keyboard not working]

1999-08-12 Thread David E. Cross
> You are quite right that the code in question was just moved from sc
> to atkbd and there is essentially no difference between the two
> versions.
> This is the first time that I hear the flag 0x10 for sc works in 2.X,
> but the flag 0x4 for atkbd does not in 3.1 or later :-(  I think
> I heard just last month that the flag works for ThinkPad 360CE...
> You say the keyboard works the kernel config menu and up to sysinstall,
> but it does not work in sysinstall and you cannot install the OS.
> Would you see if hitting the CAPS LOCK key changes the CAPS LED light?
> Kazu

I have tried all of the keys, none of them function as labeled (not even
Caps Lock).  The left shift seems to be a double-enter or similar.


Re: Kerberos 5 integration.

1999-08-17 Thread David E. Cross
I offered (to Theodore Ts'o) before our (Computer Science Department at RPI)
resources to set up a RO CVS repo for Kerberos V.  He accepted our offer
but things stagnated after that on setting up the details.  My fault mostly
for not taking the torch that had been passed.  I am [now] offering
again, and I think we can do it.  If someone can contact me we can get this
set up ASAP.

David Cross   | email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer | Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, | Ph: 518.276.2860
Department of Computer Science| Fax: 518.276.4033
I speak only for myself.  | WinNT:Linux::Linux:FreeBSD

Re: Kerberos 5 integration.

1999-08-17 Thread David E. Cross
I am terribly sorry.  I had 2 messages about Kerberos 5 come in at the same
time (one from -hackers, one from MIT), and I replied to the wrong one.

David Cross   | email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer | Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, | Ph: 518.276.2860
Department of Computer Science| Fax: 518.276.4033
I speak only for myself.  | WinNT:Linux::Linux:FreeBSD

CDROM boot on a ThinkPad 600E...

1999-08-19 Thread David E. Cross
I have been attempting to track down why cdrom boots will not work with
/boot/loader, but do just fine with the boot-block.  I have come to the 
following wild speculation, and stab in the dark.  /boot/loader uses some
int13 stuff, which I found while reading in the boot0inst man page may cause
trouble on certain machines.  I believe this may be our smoking gun, but I lack
the time and experience to actually track it down.

I could use boot0cfg to set 'packet' mode, but I am afraid that by doing so
I may lose all access to my machine and need to attack it with a boot floppy.

David Cross   | email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer | Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, | Ph: 518.276.2860
Department of Computer Science| Fax: 518.276.4033
I speak only for myself.  | WinNT:Linux::Linux:FreeBSD

PCI programming woes.

1999-08-19 Thread David E. Cross
I am trying to write a very kludgey/monolithic driver for a CardBus ethernet
adapter.  I have run into a bit of a stumbling block on some issues.  One such
issue is the attach (I need to map some registers of the adapter into memory
space so I can read/write values).  Anyway, if someone could explain some
of the following I would be very thankful.

Take your average run-of-the-mill PCI network driver... like FPA or FXP.  Now
look for the attach routines... there are *2* of them, with the exact same
function name, and different arguments?!?!

Huh, what is going on here?  Help?

David Cross   | email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer | Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, | Ph: 518.276.2860
Department of Computer Science| Fax: 518.276.4033
I speak only for myself.  | WinNT:Linux::Linux:FreeBSD

1999-08-20 Thread David E. Cross
I have been writing a nasty kludge to treat a CardBus bridge as a standard
PCI bridge (with static config).  I have it to the point where I can (after
the system is booted) 'pciconf -r pci5:0:0 0' and get scan information (neat,
huh :).  Well, I thought it would then just be a simple matter of
'device_add_child(dev, "pci", 5, 0);' to get the bus to show up at PCI5: at
bootup, but it seems to ignore it.  Following pcisupport.c, I also tried to
'bus_generic_attach()' it after device_add_child() finished.  No go.  Any
suggestions?

David Cross   | email: cro...@cs.rpi.edu 
Rensselaer Polytechnic Institute, | Ph: 518.276.2860
Department of Computer Science| Fax: 518.276.4033
I speak only for myself.  | WinNT:Linux::Linux:FreeBSD

Tulip device driver question

1999-08-31 Thread David E. Cross
I am modifying the tulip device driver to support this Xircom card.  I have it
almost entirely working, *except* that it goes into infinite renegotiation
loops.  The card probes correctly at bootup, but any attempt to change
information via ifconfig ("ifconfig de0 inet ...", "ifconfig de0 up",
and "ifconfig de0 media 10baseTX" will all do it) results in it probing,
then resetting, then probing again, over and over in an infinite loop.

David Cross   | email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer | Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, | Ph: 518.276.2860
Department of Computer Science| Fax: 518.276.4033
I speak only for myself.  | WinNT:Linux::Linux:FreeBSD

panic.. 3.2-STABLE-OLD...

1999-09-08 Thread David E. Cross
Well, it has been a long time since I have needed to write an email with that
tagline.  Our primary NFS server had been up for almost 2 months with no
panics.  We did need to reboot it for a network change, but it was up for 28
days at that point.  Anyway here are the details:

dev = 0x20014, block = 2096, fs = /exports/home3
panic: ffs_blkfree: freeing free block
#0  0xc014b6cb in boot ()
#1  0xc014b950 in at_shutdown ()
#2  0xc01d50ef in ffs_blkfree ()
#3  0xc01d992c in indir_trunc ()
#4  0xc01d9688 in handle_workitem_freeblocks ()
#5  0xc01d7c28 in softdep_process_worklist ()
#6  0xc016f9b4 in sched_sync ()
#7  0xc013e56a in kproc_start ()
#8  0xc020328a in fork_trampoline ()

0 0 0   0 -18  0 00 sched  DLs   ??0:00.00  (swapper)
0 1 0   0  10  0   4960 wait   Is??0:00.00  (init)
0 2 0   0 -18  0 00 -  RL??0:00.00  (pagedaemon
0 3 0   0  18  0 00 psleep DL??0:00.00  (vmdaemon)
0 4 0 272  -6  0 00 -  RL??0:00.00  (syncer)
042 1   0  10  0 2629000 mfsidl ILs   ??0:00.00  (mount_mfs)
0   140 1   0   2  0   8160 -  Rs??0:00.00  (syslogd)
David Cross   | email: cro...@cs.rpi.edu 
Rensselaer Polytechnic Institute, | Ph: 518.276.2860
Department of Computer Science| Fax: 518.276.4033
I speak only for myself.  | WinNT:Linux::Linux:FreeBSD

softupdates panic in 3.3-RC

1999-09-13 Thread David E. Cross
Our ftp server crashed early this morning with what appears to be a softupdates
panic:

> Sep 13 09:56:19 stumble /kernel: pid 41477 (perl), uid 0 on 
> /exports/share3/ftp/.2: file system full
> panic: softdep_write_inodeblock: indirect pointer #0 mismatch 0 != 15597568
> syncing disks... panic: softdep_lock: locking against myself

'perl' would have been the nightly mirror(1) run to sync up our ftp site.

What additional details would be useful?  We didn't have crashdumps enabled
on this machine, so a backtrace is not fully possible, although it would seem
the contextual evidence for what went wrong is strong.

David Cross   | email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer | Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, | Ph: 518.276.2860
Department of Computer Science| Fax: 518.276.4033
I speak only for myself.  | WinNT:Linux::Linux:FreeBSD

perl strangeness on 3.3-RC

1999-09-15 Thread David E. Cross
We have a very heterogeneous environment here (even among the FreeBSD boxes).
Each PC tends to be just a little bit different.  This especially causes
problems since we wish to have XDM on each machine on boot and have X
on an NFS partition.  To alleviate this we invented a simple Perl script
to replace /usr/X11R6/bin/X to run the correct program on each machine:

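#!/usr/bin/perl -w
use Sys::Hostname;

# NOTE: several lines of the original script were lost in this archive.
# Lines marked "reconstructed" are inferred from the error output below.
read_servers();          # reconstructed: %server must be filled first
$host = hostname;        # reconstructed: the call the error output points at
$screen = '';            # placeholder: the original assignment is not preserved
print STDERR "host is $host\n";
$display= $host . $screen;
print STDERR "display is $display\n";
# Prefer an entry for host+screen, fall back to an entry for the host alone.
if ($server{$display}) {
 $commandline = join (' ', $server{$display}, @ARGV);
} elsif ($server{$host}) {
 $commandline = join (' ',$server{$host}, @ARGV);
}
exec $commandline;

# Each line of the list is "<key> <server command line>".
sub read_servers {
 open (XSERVERLIST, "/usr/local/etc/xservers");
 while ($hostline = <XSERVERLIST>) {
  chomp ($hostline);
  @fields = split ' ', $hostline, 2;
  $server{$fields[0]} = $fields[1];
 }
}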

This worked fine up until about 3.3-RC, then it stopped with the following

Use of uninitialized value at /usr/libdata/perl/5.00503/Sys/Hostname.pm line
100, <XSERVERLIST> chunk 13.
Use of uninitialized value at /usr/libdata/perl/5.00503/Sys/Hostname.pm line
109, <XSERVERLIST> chunk 13.
Can't exec "/com/host": No such file or directory at
/usr/libdata/perl/5.00503/Sys/Hostname.pm line 115, <XSERVERLIST> chunk 13.
Cannot get host name of local machine at /usr/X11R6/bin/X line 6

Note that this is *only* a problem when this script is run by an xdm that was
started from init.  If I run the script by hand, or I run xdm by hand, it works
OK.  Also consider the results of a ktrace on init (PID 1) when xdm was started
(a lot of stuff has been deleted to ease readability):

   445 perl RET   execve 0

   445 perl forks and execs 'hostname'

   447 hostname RET   execve 0
   447 hostname CALL  __sysctl(0xbfbfdcd0,0x2,0xbfbfdcf8,0xbfbfdcd8,0,0)
   447 hostname RET   __sysctl 0
   447 hostname CALL  fstat(0x1,0xbfbfd9e4)
   447 hostname RET   fstat 0
   447 hostname CALL  readlink(0x8050a1c,0xbfbfd9e4,0x3f)
   447 hostname NAMI  "/etc/malloc.conf"
   447 hostname RET   readlink -1 errno 2 No such file or directory
   447 hostname CALL  mmap(0,0x1000,0x3,0x1002,0x,0,0,0)
   447 hostname RET   mmap 671424512/0x28052000
   447 hostname CALL  break(0x8055000)
   447 hostname RET   break 0
   447 hostname CALL  break(0x8059000)
   447 hostname RET   break 0
   447 hostname CALL  write(0x1,0x8055000,0x10)
   447 hostname GIO   fd 1 wrote 16 bytes
   445 perl GIO   fd 1 read 16 bytes
   445 perl RET   read 16/0x10
   445 perl CALL  read(0x1,0x80e,0x4000)
   447 hostname RET   write 16/0x10
   447 hostname CALL  exit(0)
   446 sh   RET   wait4 447/0x1bf
   446 sh   CALL  exit(0)
   445 perl GIO   fd 1 read 0 bytes
   445 perl RET   read 0
   445 perl CALL  close(0x1)
   445 perl RET   close 0
   445 perl GIO   fd 2 wrote 106 bytes
   "Use of uninitialized value at /usr/libdata/perl/5.00503/Sys/Hostname.p\
m line 100, <XSERVERLIST> chunk 13.
   448 sh   NAMI  "/bin/uname"
   448 sh   RET   stat -1 errno 2 No such file or directory
   448 sh   CALL  stat(0x809ccb8,0xbfbfdc54)
   448 sh   NAMI  "/usr/bin/uname"
   448 sh   RET   stat 0
   448 sh   CALL  break(0x80a3000)


   449 uname    RET   execve 0
   449 uname    GIO   fd 1 wrote 16 bytes
   445 perl GIO   fd 1 read 16 bytes
   448 sh   RET   wait4 449/0x1c1
   448 sh   CALL  exit(0)
   445 perl GIO   fd 1 read 0 bytes
   445 perl RET   read 0
   445 perl CALL  close(0x1)
   445 perl RET   close 0
   445 perl GIO   fd 2 wrote 106 bytes
   "Use of uninitialized value at /usr/libdata/perl/5.00503/Sys/Hostname.p\
m line 109, <XSERVERLIST> chunk 13.
   450 perl CALL  execve(0x80dd140,0x808c2e0,0x8076e80)
   450 perl NAMI  "/com/host"
   450 perl RET   execve -1 errno 2 No such file or directory
   450 perl CALL  write(0x2,0x8070e00,0x81)
   450 perl GIO   fd 2 wrote 129 bytes
   "Can't exec "/com/host": No such file or directory at /usr/libdata/perl\
/5.00503/Sys/Hostname.pm line 115, <XSERVERLIST> chunk 13.
   450 perl RET   write 129/0x81
   450 perl CALL  exit(0x1)
   "Cannot get host name of local machine at /usr/X11R6/bin/X line 6

Any ideas what is going on here?  It looks like it gets what it wants and then
just ignores it?!?

David Cross   | email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer | Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, | Ph: 518.276.2860
Department of Computer Science| Fax: 518.276.4033
I speak only for myself.  | WinNT:Linux::Linux:FreeBSD

Re: softupdates panic in 3.3-RC

1999-09-15 Thread David E. Cross
> Softupdates has known bugs relating to filesystem full conditions which
> I believe Kirk is working on.  There isn't much you can do until then
> other than either disable softupdates or work to avoid the disk-full 
> condition.  The panic does not occur very frequently so working
> to avoid the disk-full condition is what I would recommend.

Ok, sort-a what I figured.  Thanks.

David Cross   | email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer | Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, | Ph: 518.276.2860
Department of Computer Science| Fax: 518.276.4033
I speak only for myself.  | WinNT:Linux::Linux:FreeBSD

Re: perl strangeness on 3.3-RC

1999-09-15 Thread David E. Cross
> Umm, you can edit /usr/X11R6/lib/X11/xdm/Xservers to configure xdm to
> run say /usr/config/X (which would be stored on the local machine's hard
> drive) instead of /usr/X11R6/bin/X.  This is a much simpler solution.
> :)  (Just symlink /usr/config/X to /usr/X11R6/bin/XF86_Whatever.)

Simpler?  It requires modifications on each machine rather than in one file in
a central location.  Plus many things expect to find /usr/X11R6/bin/X
(ala startx and xinit), and we use this on multiple architectures, and on some
network/diskless booting systems where multiple machines share the same
root partition.  This is kinda moot however, since I am more interested in
what caused it to stop working in the first place.  It really seems to be
a bug.  Glancing through the perl module that does this (Sys/Hostname.pm),
it would appear that there is no PATH environment variable set when init is
run, and that is causing the last statement in each function block to fail,
thus making the whole block fail.  It is interesting however that the syscall
method isn't working.  FreeBSD doesn't have a gethostname _system_ call, but
it does have the gethostname() library call (which uses sysctl(2)).  Any
ideas how to get perl to use this?
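
To illustrate the PATH theory above, here is a minimal sketch in Python (not
Perl, and not the code from Sys/Hostname.pm).  It shows that a bare command
name cannot be resolved when PATH is empty, while an absolute path or a plain
library call still works.  The helper names resolve_command and local_hostname
are hypothetical, and /bin/sh is only assumed to exist on the machine where
this is run.

```python
import shutil
import socket


def resolve_command(name, search_path):
    """Return the file that `name` resolves to under `search_path`, or None."""
    # shutil.which() mimics the usual lookup: a name containing a slash is
    # checked directly, a bare name is searched for in each directory
    # listed in search_path.
    return shutil.which(name, path=search_path)


def local_hostname():
    """Ask the C library for the host name; no external program is run."""
    return socket.gethostname()


if __name__ == "__main__":
    print("bare name, empty PATH:    ", resolve_command("sh", ""))
    print("absolute path, empty PATH:", resolve_command("/bin/sh", ""))
    print("library call:             ", local_hostname())
```

If the theory is right, anything that shells out to a bare `hostname` or
`uname` under init would behave like the first case, which is why a library
call that needs no PATH would be the more robust route.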

David Cross   | email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer | Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, | Ph: 518.276.2860
Department of Computer Science| Fax: 518.276.4033
I speak only for myself.  | WinNT:Linux::Linux:FreeBSD

Re: 1GB, kvm issues.

1999-05-14 Thread David E. Cross
> It's been noted on several occasions that with large (> 256MB) of RAM, one
> has to be "careful" with the configuration (NMBCLUSTERS, MAXUSERS) to
> prevent the box from falling over every few days due to kvm problems.
> Can somebody be more specific?  I'm just about to order a really, really
> expensive machine and I want to be sure I can get it to work .. :) 
> Chuck Youse 

This was fixed somewhere in the 3.1-STABLE branch; you will not need to worry
about this at all with 3.2-BETA/3.2-RELEASE.

David Cross   | email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer |  Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, |  Ph: 518.276.2860
Department of Computer Science|  Fax: 518.276.4033
I speak only for myself.  |  WinNT:Linux::Linux:FreeBSD

Re: 1GB, kvm issues.

1999-05-14 Thread David E. Cross
> > > has to be "careful" with the configuration (NMBCLUSTERS, MAXUSERS) to
> > > prevent the box from falling over every few days due to kvm problems.
> > > 
> > > Can somebody be more specific?  I'm just about to order a really, really
> > > expensive machine and I want to be sure I can get it to work .. :) 
> > > 
> > > Chuck Youse 
> > 
> > This was fixed somewhere in the 3.1-STABLE branch, you will not need to
> > worry about this at all with 3.2-BETA/3.2-RELEASE.
> > 
> I thought so as well, however I added a 64 meg dimm and ran into the same
> problems he is describing. After I remembered the problem, I lowered the
> value of maxusers and everything is back to normal. This is on
> 4.0-Current, btw.
> 1 - 128 meg dimm
> 1 - 64  meg dimm
> and for the record
> 750 meg of swap space

Well, this is my current config: Dual P2-400, 256M RAM, 256M SWAP, 3.2-BETA,
maxusers 256.  I have yet to have any weirdness.  Note we also run servers
based off of the late 3.1-STABLE branch and we *used* to see KVA problems
all of the time.  For a while we rolled in the KVA patches by hand, but since
the change was MFCed we have not done that; we have made 0 patches and
everything continues to work great.

I am 99.% positive that the changes were made to -current
(I know I saw the CVS logs go through on our mirror).  I am also 99.99% sure
-STABLE is patched too... If not we should really take care of that now.

David Cross   | email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer |  Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, |  Ph: 518.276.2860
Department of Computer Science|  Fax: 518.276.4033
I speak only for myself.  |  WinNT:Linux::Linux:FreeBSD

Re: ifconfig: changing mac address

1999-05-16 Thread David E. Cross
> > > > It seems there's a need, and the possibility.  Would somebody like to 
> > > > suggest a syntax?
> > >
> > > The precedent would be the socket ioctls SIOCGIFHWADDR and
> > > SIOCSIFHWADDR.  The Linux emulator supports the get-only version
> > > already.
> >It's already been mentioned that some adapters support multiple  
> > unicast media addresses (the DEC parts; the on-board enets for (most)  
> > PowerPC Macs; ...).  It would be good to support aliases for media  
> > addresses as well.  The 'alias' keyword for ifconfig could be  
> > overloaded for this, no?  The driver could fail the request if it  
> > didn't support it; or if it has run out of slots for aliases.  There  
> > should also be (I think) a way to tell the driver to go to  
> > promiscuous mode to emulate this (an "I really want this" request?),  
> > but I'm not sure it should be the default response to the "set  
> > hardware address" request.
> An alias would be nice. A standby system must be reachable before it
> will be active and will need another MAC to be.
> But I don't see any sense in having more than one MAC on one IP-Address.
> So talking on IP it should be an optional argument to the ip-alias.

The Linux method reeks of layer violation.  I would suggest implementing
AF_LINK and using the already supported SIOC*ADDR routines on the AF_LINK
socket.  With those ioctls there is already support for getting, setting
and adding aliases.

We do obviously need to support the other in linux-compat mode.

David Cross   | email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer |  Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, |  Ph: 518.276.2860
Department of Computer Science|  Fax: 518.276.4033
I speak only for myself.  |  WinNT:Linux::Linux:FreeBSD

1999-05-16 Thread David E. Cross
I dug through the archives and found people with similar problems to
what I am experiencing, but I didn't find any answers that have worked
for me.  Here are the problems I am having:

1: The built-in SCSI ROM is v2.01, there was mention of BIOS 1008 including
   2.11.  I applied the 1008 flash and I am still v2.01  (I don't know if
   this matters at all)

2: Top doesn't work.  I see all of my processes gaining CPU time, but none
   show any percentage of CPU in top.  When I *first* boot my system I see
   percentages for a short time, then they degrade to 0.00%.  I also see
   0.00% of my CPU is in use (the very head of the top screen) 0.00% idle,
   0.00% system, 0.00% nice, 0.00% user.  Again, this works briefly after
   a reboot.  The suggestion I found was that this was fixed some time 
   after 3.1-STABLE.  I am running 3.2-BETA from earlier today.

3: Performance.  It seems after awhile my second CPU stops responding.
   I run setiathome v1.1 and check the output of each, they start out in
   sync, but eventually one of them stops doing work; yet it still
   accumulates CPU time?!?  

4: Responsiveness.  It sucks.  I am convinced this has to do with the
   earlier problems... my processes get sent to the stuck CPU for awhile.

It also seems that the secondary CPU is running much cooler to the touch
than the first; this would seem to support the theory that it is doing less
work, although it does have *slightly* better circulation.

Of course as I say this I cannot prove that one of the CPUs is stuck
(both my SETIs are in sync and staying that way).  The system still
feels very much slower than it should, with long pauses often.

Below are my dmesg and config.

Copyright (c) 1992-1999 FreeBSD Inc.
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California. All rights reserved.
FreeBSD 3.2-STABLE #0: Wed May 17 06:26:30 EDT 2000
Timecounter "i8254"  frequency 1193182 Hz
CPU: Pentium II/Xeon/Celeron (686-class CPU)
  Origin = "GenuineIntel"  Id = 0x652  Stepping=2
real memory  = 268435456 (262144K bytes)
avail memory = 257888256 (251844K bytes)
Programming 24 pins in IOAPIC #0
FreeBSD/SMP: Multiprocessor motherboard
 cpu0 (BSP): apic id:  1, version: 0x00040011, at 0xfee0
 cpu1 (AP):  apic id:  0, version: 0x00040011, at 0xfee0
 io0 (APIC): apic id:  2, version: 0x00170011, at 0xfec0
Preloaded elf kernel "kernel" at 0xc02e8000.
Probing for devices on PCI bus 0:
chip0:  rev 0x03 on pci0.0.0
chip1:  rev 0x03 on pci0.1.0
chip2:  rev 0x02 on pci0.4.0
ide_pci0:  rev 0x01 on pci0.4.1
chip3:  rev 0x02 on pci0.4.3
ahc0:  rev 0x00 int a irq 19 on pci0.6.0
ahc0: aic7890/91 Wide Channel A, SCSI Id=7, 16/255 SCBs
xl0: <3Com 3c900-TPO Etherlink XL> rev 0x00 int a irq 18 on pci0.10.0
xl0: Ethernet address: 00:60:08:a9:db:e2
xl0: selecting 10baseT transceiver, half duplex
Probing for devices on PCI bus 1:
vga0:  rev 0x04 int a irq 16 on pci1.0.0
Probing for devices on the ISA bus:
sc0 on isa
sc0: VGA color <12 virtual consoles, flags=0x0>
atkbdc0 at 0x60-0x6f on motherboard
atkbd0 irq 1 on isa
psm0 irq 12 on isa
psm0: model Generic PS/2 mouse, device ID 0
fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
fdc0: FIFO enabled, 8 bytes threshold
fd0: 1.44MB 3.5in
wdc0 at 0x1f0-0x1f7 irq 14 flags 0xa0ffa0ff on isa
wdc0: unit 0 (wd0): , DMA, 32-bit, multi-block-16
wd0: 6149MB (12594960 sectors), 13328 cyls, 15 heads, 63 S/T, 512 B/S
wdc0: unit 1 (wd1): , DMA, 32-bit, multi-block-16
wd1: 4924MB (10085040 sectors), 10672 cyls, 15 heads, 63 S/T, 512 B/S
wdc1 at 0x170-0x177 irq 15 flags 0xa0ffa0ff on isa
wdc1: unit 0 (atapi): , removable, intr, iordis
acd0: drive speed 689KB/sec, 128KB cache
acd0: supported read types:
acd0: Audio: play, 255 volume levels
acd0: Mechanism: ejectable tray
acd0: Medium: no/blank disc inside, unlocked
wdc1: unit 1 (atapi): , removable, dma, iordy
acd1: drive speed 0KB/sec
acd1: supported read types:
acd1: Mechanism: caddy
acd1: Medium: CD-ROM unknown medium
ppc0 at 0x378 irq 7 on isa
ppc0: SMC-like chipset (ECP/EPP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/9 bytes threshold
ppb0: IEEE1284 device found /NIBBLE
Probing for PnP devices on ppbus0:
1 3C5x9 board(s) on ISA found at 0x310
ep0 at 0x310-0x31f irq 11 on isa
ep0: utp[*UTP*] address 00:a0:24:12:8d:10
npx0 on motherboard
npx0: INT 16 interface
vga0 at 0x3b0-0x3df maddr 0xa msize 131072 on isa
APIC_IO: Testing 8254 interrupt delivery
APIC_IO: routing 8254 via pin 2
IP packet filtering initialized, divert enabled, rule-based forwarding 
disabled, unlimited logging
Waiting 2 seconds for SCSI devices to settle
SMP: AP CPU #1 Launched!
da0 at ahc0 bus 0 target 0 lun 0
da0:  Fixed Direct Access SCSI-2 device 
da0: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da0: 8683MB (17783250 512 byte sectors: 255H 63S/T 1106C)
ffs_mountfs: superblock updated for 

Repeatable kernel panic for 3.2-RELEASE NFS server

1999-05-17 Thread David E. Cross
First, I would like to take this opportunity to thank Matt Dillon for
his excellent work with NFS/TCP.  Wow, way to go :)

Now on to the real problem :)

One of our users was able to reliably crash an NFS server 3 times today.
I have since copied his program and have reliably crashed a separate and
unloaded machine with the exact same panic, "lockmgr: locking against
myself".  I checked the recent DG patches that went in after -RELEASE and they
do not appear to affect this part of the code.  I have a full debugging
kernel compiled, yet when I issue a 'gdb -k kernel.0 vmcore.0' (where
kernel.0 is either the debugging or strip-debug kernel), I receive
an unresolved symbol error for "gd_curpcb", so I cannot provide additional
information at this time.

In the morning I will try to distill the code down to a more potent
David Cross   |  email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer |  Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, |  Ph: 518.276.2860
Department of Computer Science|  Fax: 518.276.4033
I speak only for myself.  |  WinNT:Linux::Linux:FreeBSD

Re: Repeatable kernel panic for 3.2-RELEASE NFS server

1999-05-21 Thread David E. Cross
> > One of our users was able to reliably crash an NFS server 3 times today.
> > I have since copied his program and have reliably crashed a separate and
> > unloaded machine with the exact same panic, "lockmgr: locking against
> > myself".  I check the recent DG patches that went in after -RELEASE and they
> Are you sure this is NFS related?
> I can certainly reliably reproduce that and other panics (reported in
> kern/11629, includes a fix).

Ok, well, it just happened again.  I am certain that NFS is tripping this,
as it is the only access to the box in question (yes, the panic may be
elsewhere, but it comes through the NFS subsystem).  After I received your
email I applied the patch and attempted my situation again.  *wham*.

It would be really good to get a stack trace from these crashes, but I cannot
due to the error mentioned before:

gdb -k kernel.1 vmcore.1
GNU gdb 4.18
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-unknown-freebsd"...
(no debugging symbols found)...
IdlePTD 2998272

kernel symbol `gd_curpcb' not found.

Any ideas?

David Cross   |  email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer |  Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, |  Ph: 518.276.2860
Department of Computer Science|  Fax: 518.276.4033
I speak only for myself.  |  WinNT:Linux::Linux:FreeBSD

Re: Repeatable kernel panic for 3.2-RELEASE NFS server

1999-05-21 Thread David E. Cross
> > gdb -k kernel.1 vmcore.1
> > GNU gdb 4.18
> > Copyright 1998 Free Software Foundation, Inc.
> > GDB is free software, covered by the GNU General Public License, and you are
> > welcome to change it and/or distribute copies of it under certain 
> > conditions.
> > Type "show copying" to see the conditions.
> > There is absolutely no warranty for GDB.  Type "show warranty" for details.
> > This GDB was configured as "i386-unknown-freebsd"...
> > (no debugging symbols found)...
> > IdlePTD 2998272
> > 
> > kernel symbol `gd_curpcb' not found.
> > (kgdb) 
> > 
> > Any ideas?
> > 
> You need to use the gdb that comes with 3.2-RELEASE.
I have tried gdb from 3.2-BETA, 3.2-RELEASE, and 3.2-STABLE, as well as the 
gdb that was built from the exact same CVS checkout as the kernel was from;
they all give the same error.

David Cross   |  email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer |  Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, |  Ph: 518.276.2860
Department of Computer Science|  Fax: 518.276.4033
I speak only for myself.  |  WinNT:Linux::Linux:FreeBSD

Re: Repeatable kernel panic for 3.2-RELEASE NFS server

1999-05-21 Thread David E. Cross
> Another possibility re: debugging.  If you compile up a kernel with 
> options DDB and options BREAK_TO_DEBUGGER, the kernel will break into
> DDB when the panic occurs.  You can then issue a 'trace' command to get
> a backtrace.
> This may be good enough to determine what the problem is because
> the cause of a 'lockmgr: locking against myself' panic tends to be
> entirely contained within the current stack trace.
All my kernels are now DDB kernels :)  But since I do almost all of
my work remotely they are DDB_UNATTENDED, and the machine I am panic-ing
is not on the serial console server (sorry).  I do have another question
about DDB: I installed -STABLE as of today (from releng3.fre...) and I
compiled the kernel with DDB and DDB_UNATTENDED, per usual.  Now when I
C-A-E to get into the debugger and type 'panic' it drops me at another
debugging prompt.  If I type panic from that I get the real thing.  Any ideas?

My next email will hopefully have the stack trace for this panic.
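
For anyone following along who has not met this panic before: "locking
against myself" is reported when the holder of an exclusive lock asks for that
same lock again.  The Python sketch below is only a toy model of that check,
not the kernel's lockmgr code; the class OwnedLock and the function
lookup_twice are hypothetical names made up for the illustration.

```python
import threading


class OwnedLock:
    """Toy non-recursive exclusive lock that remembers who holds it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None  # thread ident of the current holder, if any

    def acquire(self):
        me = threading.get_ident()
        # A second exclusive request from the holder can never succeed,
        # so report it instead of deadlocking (the kernel panics here).
        if self._owner == me:
            raise RuntimeError("locking against myself")
        self._lock.acquire()
        self._owner = me

    def release(self):
        if self._owner != threading.get_ident():
            raise RuntimeError("releasing a lock we do not hold")
        self._owner = None
        self._lock.release()


def lookup_twice(lock):
    """Model a code path that locks an object it has already locked."""
    lock.acquire()
    try:
        lock.acquire()  # same owner, second exclusive request
    except RuntimeError as err:
        return str(err)
    finally:
        lock.release()
    return "no error"
```

Both acquisitions happen on the same call path, which is why a single
backtrace is usually enough to see where the second request came from.
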
David Cross   |  email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer |  Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, |  Ph: 518.276.2860
Department of Computer Science|  Fax: 518.276.4033
I speak only for myself.  |  WinNT:Linux::Linux:FreeBSD

Re: ISA LM78 driver help

1999-05-22 Thread David E. Cross
I have done simple drivers before.  I would be interested in working with you
on this (it would benefit me as well).  If you could provide a web site with
more information that would help too.

David Cross   |  email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer |  Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, |  Ph: 518.276.2860
Department of Computer Science|  Fax: 518.276.4033
I speak only for myself.  |  WinNT:Linux::Linux:FreeBSD

Re: Need help recovering from major mistake

1999-05-22 Thread David E. Cross
> Using my FreeBSD CD-ROMs, I've been able to go into fixit mode and mount
> the root filesystem of the drive, but I'm not sure where to go from there.
> How can I figure out what my old disklabel was? Is there some way I can
> search the raw disk for the locations of the file systems?
> Any help will be GREATLY appreciated. Please email me directly with your
> responses as I'm not subscribed to the FreeBSD mailing lists.
I had NT kindly overwrite my disklabel once.  I got around the problem by
scanning the disk for the magic numbers that signify the start of a
FreeBSD sub-partition.  You then have to do some math based on the raw block
numbers to figure out the start and length.  You are lucky in that FreeBSD
will tell you if you get the length wrong (you do need to get the start
correct); it tells you the correct length.  After you have that information go
into 'disklabel -e disk' and re-enter the values.

I no longer have the program, but the magic values to look for are easily
gotten by examining the first block or 2 from a subpartition.
'dd bs=512 if=/dev/rdsk count=2 | less' did the trick for me.
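
A rough sketch of the kind of scan described above, in Python.  This is not
the original program.  It assumes the magic being searched for is the UFS1
superblock magic, and the three layout constants (superblock 8 KiB into the
partition, magic field 1372 bytes into the superblock, value 0x011954 stored
little-endian) are assumptions that should be checked against a known-good
partition first.  The function name find_ffs_starts is hypothetical, and the
sketch works on an image already read into memory rather than on a raw device.

```python
import struct

SECTOR = 512
SB_OFFSET = 8192      # assumed: superblock starts 8 KiB into the partition
MAGIC_OFFSET = 1372   # assumed: offset of the magic field in the superblock
FS_MAGIC = 0x011954   # assumed: UFS1 magic number, little-endian on i386


def find_ffs_starts(image):
    """Return sector numbers at which a partition may begin."""
    hits = []
    rel = SB_OFFSET + MAGIC_OFFSET
    # Try every sector boundary as a candidate partition start and look
    # for the magic where that partition's superblock would place it.
    for start in range(0, len(image) - rel - 3, SECTOR):
        (magic,) = struct.unpack_from("<I", image, start + rel)
        if magic == FS_MAGIC:
            hits.append(start // SECTOR)
    return hits


if __name__ == "__main__":
    # Synthetic 64 KiB "disk" with one fake partition starting at sector 16.
    disk = bytearray(64 * 1024)
    struct.pack_into("<I", disk, 16 * SECTOR + SB_OFFSET + MAGIC_OFFSET, FS_MAGIC)
    print(find_ffs_starts(bytes(disk)))
```

On a real disk, backup superblocks carry the same magic, so expect extra
candidates; each hit is only a starting point for the block arithmetic
mentioned above.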

David Cross   |  email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer |  Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, |  Ph: 518.276.2860
Department of Computer Science|  Fax: 518.276.4033
I speak only for myself.  |  WinNT:Linux::Linux:FreeBSD

still problems with kernel debugging

1999-05-24 Thread David E. Cross
I have just installed 3.2-19990521-STABLE from releng3.freebsd.org.  I
compiled a debugging kernel, forced a panic with DDB, and created a crashdump,
on bootup it saved it to /var/crash/[kernel|vmcore].0.  I then ran a
'gdb -k kernel.0 vmcore.0' and received the happy fun 'gd_curpcb' symbol
not found error.  What am I doing wrong?

David Cross   |  email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer |  Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, |  Ph: 518.276.2860
Department of Computer Science|  Fax: 518.276.4033
I speak only for myself.  |  WinNT:Linux::Linux:FreeBSD

Re: Repeatable kernel panic for 3.2-RELEASE NFS server

1999-05-24 Thread David E. Cross
Is there anything I can do on my end?  (We have a complete CVS repository,
synced every 4 hours.)  I would like to use the *exact* same gdb that we
compiled the world from if possible; I am afraid that using a more recent
gdb will result in not being able to read the core (I am unable to use
a 3.1-STABLE gdb to read the core from a 3.2-STABLE system).

David Cross   |  email: cro...@cs.rpi.edu 
Systems Administrator/Research Programmer |  Web: http://www.cs.rpi.edu/~crossd 
Rensselaer Polytechnic Institute, |  Ph: 518.276.2860
Department of Computer Science|  Fax: 518.276.4033
I speak only for myself.  |  WinNT:Linux::Linux:FreeBSD

repeatable 3.2 panic, new and improved with backtrace

1999-05-24 Thread David E. Cross
Here it is:
(I am not sure that the source didn't get updated since the kernel was
compiled, so the line numbers may be meaningless.  The second backtrace is
from an earlier kernel and has 0 line number information in it.)

IdlePTD 2985984
initial pcb at 264eac
panicstr: lockmgr: locking against myself
panic messages:
panic: lockmgr: locking against myself

syncing disks... 77 65 47 26 7 done

dumping to dev 20001, offset 303392
dump 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 
70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 
44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 
18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 
#0  boot (howto=256) at ../../kern/kern_shutdown.c:285
285 dumppcb.pcb_cr3 = rcr3();
(kgdb) bt
#0  boot (howto=256) at ../../kern/kern_shutdown.c:285
#1  0xc014b3f4 in at_shutdown (
<__set_sysuninit_set_sym_M_KTRACE_uninit_sys_uninit+154>, arg=0x10002, 
queue=-949495424) at ../../kern/kern_shutdown.c:446
#2  0xc01470f8 in lockmgr (lkp=0xc14ab000, flags=16842754, 
interlkp=0xc767d9f0, p=0xc743eb20) at ../../kern/kern_lock.c:326
#3  0xc016cfbc in vop_stdlock (ap=0xc7482a64) at ../../kern/vfs_default.c:209
#4  0xc01e4fad in ufs_vnoperate (ap=0xc7482a64)
at ../../ufs/ufs/ufs_vnops.c:2299
#5  0xc0175d97 in vn_lock (vp=0xc767d980, flags=65538, p=0xc743eb20)
at vnode_if.h:811
#6  0xc016f93f in vget (vp=0xc767d980, flags=2, p=0xc743eb20)
at ../../kern/vfs_subr.c:1274
#7  0xc016bac7 in vfs_cache_lookup (ap=0xc7482b24)
at ../../kern/vfs_cache.c:439
#8  0xc01e4fad in ufs_vnoperate (ap=0xc7482b24)
at ../../ufs/ufs/ufs_vnops.c:2299
#9  0xc016e079 in lookup (ndp=0xc7482d94) at vnode_if.h:31
#10 0xc01b769c in nfs_namei (ndp=0xc7482d94, fhp=0xc7482d0c, len=0, 
slp=0xc0f7e600, nam=0xc1688fc0, mdp=0xc7482c48, dposp=0xc7482c44, 
retdirp=0xc7482c2c, p=0xc743eb20, kerbflag=0, pubflag=0)
at ../../nfs/nfs_subs.c:1642
#11 0xc01a068f in nfsrv_lookup (nfsd=0xc1542b00, slp=0xc0f7e600, 
procp=0xc743eb20, mrq=0xc7482e34) at ../../nfs/nfs_serv.c:396
#12 0xc01b90f6 in nfssvc_nfsd (nsd=0xc7482e94, argp=0x8071af4 "", p=0xc743eb20)
at ../../nfs/nfs_syscalls.c:656
#13 0xc01b8a11 in nfssvc (p=0xc743eb20, uap=0xc7482f94)
at ../../nfs/nfs_syscalls.c:342
#14 0xc020bd5f in syscall (frame={tf_es = 39, tf_ds = 39, tf_edi = 8, 
  tf_esi = 0, tf_ebp = -1077944892, tf_isp = -951570460, tf_ebx = 0, 
  tf_edx = -1077945288, tf_ecx = 0, tf_eax = 155, tf_trapno = 12, 
  tf_err = 2, tf_eip = 134518940, tf_cs = 31, tf_eflags = 642, 
  tf_esp = -1077945280, tf_ss = 39}) at ../../i386/i386/trap.c:1100
#15 0xc0202b8c in Xint0x80_syscall ()
#16 0x80480e9 in ?? ()

Here is the dump from a previous build, no line numbers in this one, sorry.

panic: lockmgr: locking against myself

syncing disks... 64 46 31 13 2 done

dumping to dev 20001, offset 303392
dump 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 
70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 
44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 
18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 
#0  0xc014ae43 in boot ()
(kgdb) bt
#0  0xc014ae43 in boot ()
#1  0xc014b0c8 in at_shutdown ()
#2  0xc0146dbf in lockmgr ()
#3  0xc016ccec in vop_stdlock ()
#4  0xc01e4c8d in ufs_vnoperate ()
#5  0xc0175ac7 in vn_lock ()
#6  0xc016f66f in vget ()
#7  0xc016b7f7 in vfs_cache_lookup ()
#8  0xc01e4c8d in ufs_vnoperate ()
#9  0xc016dda9 in lookup ()
#10 0xc01b73cc in nfs_namei ()
#11 0xc01a03bf in nfsrv_lookup ()
#12 0xc01b8e26 in nfssvc_nfsd ()
#13 0xc01b8741 in nfssvc ()
#14 0xc020ba4f in syscall ()
#15 0xc020287c in Xint0x80_syscall ()
#16 0x80480e9 in ?? ()

kernel debugging assistance

1999-05-25 Thread David E. Cross
I am trying to trace down the cause of the recursive lock and I stumbled upon this:

(kgdb) bt
#0  boot (howto=256) at ../../kern/kern_shutdown.c:285
#1  0xc014b3f4 in at_shutdown (
<__set_sysuninit_set_sym_M_KTRACE_uninit_sys_uninit+154>, arg=0x10002, 
queue=-951064448) at ../../kern/kern_shutdown.c:446
#2  0xc01470f8 in lockmgr (lkp=0xc10d8f00, flags=16842754, 
interlkp=0xc74fe8f0, p=0xc743eb20) at ../../kern/kern_lock.c:326
#3  0xc016cfbc in vop_stdlock (ap=0xc7482a64) at ../../kern/vfs_default.c:209
#4  0xc01e4fad in ufs_vnoperate (ap=0xc7482a64)
at ../../ufs/ufs/ufs_vnops.c:2299
#5  0xc0175d97 in vn_lock (vp=0xc74fe880, flags=65538, p=0xc743eb20)
at vnode_if.h:811
#6  0xc016f93f in vget (vp=0xc74fe880, flags=2, p=0xc743eb20)
at ../../kern/vfs_subr.c:1274
#7  0xc016bac7 in vfs_cache_lookup (ap=0xc7482b24)
at ../../kern/vfs_cache.c:439
#8  0xc01e4fad in ufs_vnoperate (ap=0xc7482b24)
at ../../ufs/ufs/ufs_vnops.c:2299
#9  0xc016e079 in lookup (ndp=0xc7482d94) at vnode_if.h:31
#10 0xc01b769c in nfs_namei (ndp=0xc7482d94, fhp=0xc7482d0c, len=3, 
slp=0xc0f7e600, nam=0xc0f54200, mdp=0xc7482c48, dposp=0xc7482c44, 
retdirp=0xc7482c2c, p=0xc743eb20, kerbflag=0, pubflag=0)
at ../../nfs/nfs_subs.c:1642
#11 0xc01a068f in nfsrv_lookup (nfsd=0xc1185700, slp=0xc0f7e600, 
procp=0xc743eb20, mrq=0xc7482e34) at ../../nfs/nfs_serv.c:396
#12 0xc01b90f6 in nfssvc_nfsd (nsd=0xc7482e94, argp=0x8071af4 "", p=0xc743eb20)
at ../../nfs/nfs_syscalls.c:656
#13 0xc01b8a11 in nfssvc (p=0xc743eb20, uap=0xc7482f94)
at ../../nfs/nfs_syscalls.c:342
#14 0xc020bd5f in syscall (frame={tf_es = 39, tf_ds = 39, tf_edi = 8, 
  tf_esi = 0, tf_ebp = -1077944892, tf_isp = -951570460, tf_ebx = 0, 
  tf_edx = -1077945288, tf_ecx = 0, tf_eax = 155, tf_trapno = 12, 
  tf_err = 2, tf_eip = 134518940, tf_cs = 31, tf_eflags = 642, 
  tf_esp = -1077945280, tf_ss = 39}) at ../../i386/i386/trap.c:1100
#15 0xc0202b8c in Xint0x80_syscall ()
#16 0x80480e9 in ?? ()
(kgdb) up 3
#3  0xc016cfbc in vop_stdlock (ap=0xc7482a64) at ../../kern/vfs_default.c:209
209 return (lockmgr(l, ap->a_flags, &ap->a_vp->v_interlock, 
(kgdb) print ap
$1 = (struct vop_lock_args *) 0x0

That doesn't seem to be possible, especially since ap->FOO is used in the same
line... wouldn't that cause a fault at this point?  (It doesn't; the panic
occurs later, as you can see.)

Re: kernel debugging assistance

1999-05-27 Thread David E. Cross
> I don't think that this dump is useful for debugging this problem. Perhaps, 
> if 
> you compile the kernel with DEBUG_LOCKS, you will get more useful info.
> Dima

I checked through the source for DEBUG_LOCKS; it doesn't appear to do anything
other than print out information that I already have access
to by way of this dump file.  I will turn it on regardless so it makes my
life a bit more simple.

In looking through this, and at the program that used to cause this problem
reliably (it no longer does, even though nothing changed on the client or
workstation; I am guessing that it is a race condition that happens 5% of
the time and I filled my quota for the next 20 years ;) I have a theory about
what is going on...  NFS service is entirely in the kernel for FreeBSD,
excepting the NFSDs which mostly sit around to give the kernel contexts to
pass requests into.  NFS uses its own namei mechanism which requests a lock
on what it is looking up.  What if it gets 2 requests at about the same
time for the same file?  That would certainly seem a likely cause for this
problem.  I note that all the files that are causing this crash are
files that would be accessed in the aforementioned manner: netscape
cache files, .Xauthority-c, and the data file for the test program, which
is accessed rapidly and repeatedly.

Does this seem like a reasonable theory to anyone?
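
For anyone following along, here is a rough user-space model of what the panic string itself suggests.  It is only a sketch: toy_lock, toy_lock_exclusive and the result names are made up for illustration and are not the kernel's lockmgr interface.  The point it tries to show is that a second requester finding the lock held would normally just see it as busy, while "locking against myself" is the case where the requester is already recorded as the holder.

```c
/* Toy stand-in for an exclusive lock; NOT the kernel's struct lock. */
struct toy_lock {
    int held;    /* nonzero while exclusively held */
    int holder;  /* id of the holder, meaningful only while held */
};

enum toy_result { TOY_OK, TOY_BUSY, TOY_SELF };

/*
 * Try to take the lock exclusively on behalf of `who`.
 * TOY_SELF models the "locking against myself" case: the requester
 * already holds the lock, so waiting for it could never succeed.
 */
enum toy_result
toy_lock_exclusive(struct toy_lock *lk, int who)
{
    if (lk->held)
        return (lk->holder == who) ? TOY_SELF : TOY_BUSY;
    lk->held = 1;
    lk->holder = who;
    return TOY_OK;
}

/* Release the lock; the caller is trusted to be the holder. */
void
toy_unlock(struct toy_lock *lk)
{
    lk->held = 0;
}
```

In this model two different requesters racing for the same file only ever produce TOY_BUSY; TOY_SELF needs an earlier acquisition by the same requester that was never paired with toy_unlock().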

lockmgr: locking against myself (again)

1999-05-31 Thread David E. Cross
Well, I have a debugging kernel compiled with DEBUG_LOCKS turned on; below is
the result.  This was cvsup-ed and built about May 27, 14:00 EDT.

gdb -k /usr/src/sys/compile/STAGGER/kernel.debug vmcore.10 (yes, that's right, 
#10 in a little over one week)
GNU gdb 4.18
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-unknown-freebsd"...
IdlePTD 2990080
initial pcb at 2663b8
panicstr: lockmgr: locking against myself
panic messages:
panic: lockmgr: locking against myself

syncing disks... 26 19 11 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 giving up

dumping to dev 20001, offset 303392
dump 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 
70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 
44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 
18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 
#0  boot (howto=256) at ../../kern/kern_shutdown.c:285
285 dumppcb.pcb_cr3 = rcr3();
(kgdb) bt
#0  boot (howto=256) at ../../kern/kern_shutdown.c:285
#1  0xc014baa4 in at_shutdown (
<__set_sysuninit_set_sym_M_KTRACE_uninit_sys_uninit+179>, arg=0xc75b9b40, 
queue=65538) at ../../kern/kern_shutdown.c:446
#2  0xc0147794 in debuglockmgr (lkp=0xc1276200, flags=16842754, 
interlkp=0xc75b9bb0, p=0xc743f780, name=0xc0237d57 "vop_stdlock", 
file=0xc023804b "../../kern/vfs_subr.c", line=1274)
at ../../kern/kern_lock.c:326
#3  0xc016d6c1 in vop_stdlock (ap=0xc7453a5c) at ../../kern/vfs_default.c:211
#4  0xc01e5bfd in ufs_vnoperate (ap=0xc7453a5c)
at ../../ufs/ufs/ufs_vnops.c:2299
#5  0xc0176635 in debug_vn_lock (vp=0xc75b9b40, flags=65538, p=0xc743f780, 
filename=0xc023804b "../../kern/vfs_subr.c", line=1274) at vnode_if.h:811
#6  0xc01700d1 in vget (vp=0xc75b9b40, flags=2, p=0xc743f780)
at ../../kern/vfs_subr.c:1274
#7  0xc016c1b3 in vfs_cache_lookup (ap=0xc7453b24)
at ../../kern/vfs_cache.c:439
#8  0xc01e5bfd in ufs_vnoperate (ap=0xc7453b24)
at ../../ufs/ufs/ufs_vnops.c:2299
#9  0xc016e7c1 in lookup (ndp=0xc7453d94) at vnode_if.h:31
#10 0xc01b8260 in nfs_namei (ndp=0xc7453d94, fhp=0xc7453d0c, len=3, 
slp=0xc0f7e400, nam=0xc0f54370, mdp=0xc7453c48, dposp=0xc7453c44, 
retdirp=0xc7453c2c, p=0xc743f780, kerbflag=0, pubflag=0)
at ../../nfs/nfs_subs.c:1642
#11 0xc01a123f in nfsrv_lookup (nfsd=0xc108f100, slp=0xc0f7e400, 
procp=0xc743f780, mrq=0xc7453e34) at ../../nfs/nfs_serv.c:396
#12 0xc01b9cba in nfssvc_nfsd (nsd=0xc7453e94, argp=0x8071af4 "", p=0xc743f780)
at ../../nfs/nfs_syscalls.c:656
#13 0xc01b95d5 in nfssvc (p=0xc743f780, uap=0xc7453f94)
at ../../nfs/nfs_syscalls.c:342
#14 0xc020d02f in syscall (frame={tf_es = 39, tf_ds = 39, tf_edi = 8, 
  tf_esi = 0, tf_ebp = -1077944892, tf_isp = -951762972, tf_ebx = 0, 
  tf_edx = -1077945288, tf_ecx = 0, tf_eax = 155, tf_trapno = 12, 
  tf_err = 2, tf_eip = 134518940, tf_cs = 31, tf_eflags = 642, 
  tf_esp = -1077945280, tf_ss = 39}) at ../../i386/i386/trap.c:1100
#15 0xc0203e5c in Xint0x80_syscall ()
#16 0x80480e9 in ?? ()

3.2-STABLE, 11th panic

1999-06-02 Thread David E. Cross
Here is a backtrace from our latest.  Does anyone have any ideas to try?  Is
there any way to look at who has the other lock on the file?  (Yes, I know
it is the kernel who has it, but it is requested on behalf of an NFSd, no?)

(kgdb) bt
#0  boot (howto=256) at ../../kern/kern_shutdown.c:285
#1  0xc014baa4 in at_shutdown (
<__set_sysuninit_set_sym_M_KTRACE_uninit_sys_uninit+179>, arg=0xc74e9080, 
queue=65538) at ../../kern/kern_shutdown.c:446
#2  0xc0147794 in debuglockmgr (lkp=0xc1528400, flags=16842754, 
interlkp=0xc74e90f0, p=0xc743eb20, name=0xc0237d57 "vop_stdlock", 
file=0xc023804b "../../kern/vfs_subr.c", line=1274)
at ../../kern/kern_lock.c:326
#3  0xc016d6c1 in vop_stdlock (ap=0xc7482a5c) at ../../kern/vfs_default.c:211
#4  0xc01e5bfd in ufs_vnoperate (ap=0xc7482a5c)
at ../../ufs/ufs/ufs_vnops.c:2299
#5  0xc0176635 in debug_vn_lock (vp=0xc74e9080, flags=65538, p=0xc743eb20, 
filename=0xc023804b "../../kern/vfs_subr.c", line=1274) at vnode_if.h:811
#6  0xc01700d1 in vget (vp=0xc74e9080, flags=2, p=0xc743eb20)
at ../../kern/vfs_subr.c:1274
#7  0xc016c1b3 in vfs_cache_lookup (ap=0xc7482b24)
at ../../kern/vfs_cache.c:439
#8  0xc01e5bfd in ufs_vnoperate (ap=0xc7482b24)
at ../../ufs/ufs/ufs_vnops.c:2299
#9  0xc016e7c1 in lookup (ndp=0xc7482d94) at vnode_if.h:31
#10 0xc01b8260 in nfs_namei (ndp=0xc7482d94, fhp=0xc7482d0c, len=3, 
slp=0xc0f7e400, nam=0xc0f541a0, mdp=0xc7482c48, dposp=0xc7482c44, 
retdirp=0xc7482c2c, p=0xc743eb20, kerbflag=0, pubflag=0)
at ../../nfs/nfs_subs.c:1642
#11 0xc01a123f in nfsrv_lookup (nfsd=0xc13d3400, slp=0xc0f7e400, 
procp=0xc743eb20, mrq=0xc7482e34) at ../../nfs/nfs_serv.c:396
#12 0xc01b9cba in nfssvc_nfsd (nsd=0xc7482e94, argp=0x8071af4 "", p=0xc743eb20)
at ../../nfs/nfs_syscalls.c:656
#13 0xc01b95d5 in nfssvc (p=0xc743eb20, uap=0xc7482f94)
at ../../nfs/nfs_syscalls.c:342
#14 0xc020d02f in syscall (frame={tf_es = 39, tf_ds = 39, tf_edi = 8, 
  tf_esi = 0, tf_ebp = -1077944892, tf_isp = -951570460, tf_ebx = 0, 
  tf_edx = -1077945288, tf_ecx = 0, tf_eax = 155, tf_trapno = 12, 
  tf_err = 2, tf_eip = 134518940, tf_cs = 31, tf_eflags = 642, 
  tf_esp = -1077945280, tf_ss = 39}) at ../../i386/i386/trap.c:1100
#15 0xc0203e5c in Xint0x80_syscall ()
#16 0x80480e9 in ?? ()

3.2-stable, panic #12

1999-06-03 Thread David E. Cross
Our home directory NFS server went down again today, "same bat-panic".
This time it went down on ".Maillock" (usually it goes down on a netscape
cache file or .Xauthority-c).  Piecing some more together, I modified my old
"crash_patoot.c" file (which didn't cause any problems) into the new and
improved version that does.

This is our environment:
FreeBSD NFS server running 3.2-STABLE from 1.5 to 2 weeks ago. Multiple
client machines of multiple architectures (Solaris 2.6, Irix 6.5.2+, FreeBSD
3.2+).  These crashes were all reproduced with a Solaris client; I do not know
if it is reproducible with other clients.  Below is the short code segment
that will cause the crash.  The additions I made to it to cause the crash
were rename(2) and unlink(2); without those I could not get a crash.

Also, available upon request is a packet dump of all traffic to/from that
machine leading to the crash (it is only 198336 bytes long, it was 
captured with '-s 1500' with tcpdump).

Without further ado crash_patoot.c:
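#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd;
    int counter;
    char newfilename[1024];

    /* Assumed: the line that filled in newfilename is missing from this copy. */
    snprintf(newfilename, sizeof(newfilename), "%s.new", argv[1]);

    for(counter=0;counter<100;counter++) {
        fd=open(argv[1], O_CREAT, 600);
        write(fd, &counter,4);
        rename(argv[1], newfilename);
        rename(newfilename, argv[1]);
        fd=open(newfilename, O_CREAT,600);
        fd=open(newfilename, O_CREAT,600);
    }
    return 0;
}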

If you are able to reproduce this panic please let me know.  I want to be
assured I am not going out of my mind.  I am attempting to dig through the NFS
code to try to find the bug myself, but it is a daunting task.

Re: 3.2-stable, panic #12

1999-06-03 Thread David E. Cross
Just need to make a small amendment to the instructions for the previous
program.  I am unable to cause a panic by running it on just one machine; I
need to have it run on 2 machines accessing the same file.

client1% ./crash_patoot cp1
client2% ./crash_patoot cp1

Re: 3.2-stable, panic #12 (simplified)

1999-06-03 Thread David E. Cross
I had the hunch that the problem I am dealing with is related to the unlink
portion of NFS... So I have simplified the code down to this tiny snippet which
will reliably crash the system (I left it running by accident and it brought
my test machine down 3 times before I remembered to kill it :).  This is only
3 lines of code, and a for loop to iterate it.

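#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd;
    int counter;
    char newfilename[1024];

    for(counter=0;counter<100;counter++) {
        fd=open(argv[1], O_CREAT, 600);
        close(fd);       /* assumed; missing from this copy */
        unlink(argv[1]); /* assumed from the description; missing from this copy */
    }
    return 0;
}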

Again, this appears to need to be run from multiple machines at once to cause
the problem (running from 2 dual-ultra 2s running solaris 2.6 in this case).
I will attempt to reproduce it with FreeBSD clients later today.  In the
meantime I am getting down and dirty with the NFS kernel routines.

PS: Please give Matt his privs back.  You *really* don't want me sending
patches to the NFS code ;)

Re: 3.2-stable, panic #12 (simplified)

1999-06-04 Thread David E. Cross
Yes, I am *very* certain that it is the server that is crashing.  The
test server in question has no user login-able accounts at all, and is only
running nfsd/mountd/portmap.

Re: 3.2-stable, panic #12

1999-06-04 Thread David E. Cross
> > fd=open(argv[1], O_CREAT, 600);
> Since this opens the file so that it cannot be written to, not
> to mention the really weird mode it will get if it's created by
> that open(), the rest of the thing doesn't deserve to work.
> Generally speaking, it's a good idea to make sure that test code
> is at least decent before starting to puzzle over what it does.

The code does exactly what it is supposed to.  The "600" was a typo that
should be "0600".  If it makes you feel better s/O_CREAT/O_CREAT|O_RDWR/;
the results are the same, the server crashes with a lockmgr: locking against
myself panic, a panic that we have seen !13! times in the past 10 days.  I
would certainly hope and expect it to work (what do you know, it does),
considering "the rest" is a simple unlink of the file that _I_ just created.  I
do not understand why you believe it "does not deserve to work".  

The ONLY purpose of this code was to create a regular file, unlink it, and then
create it again as quickly as possible under the same name.  For this purpose
the mode is irrelevant, and the R/W status of the FD is irrelevant; the code
fulfills its purpose, and in doing so exhibits the aforementioned bug.

Generally speaking, you should test the code to see if it works as advertised
before trying to fix it.  Did this code crash your system when run from 2
solaris 2.6 machines?
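
As a side note on the "600" versus "0600" typo discussed above, a small sketch of what the two constants mean as permission bits.  mode_string is a hypothetical helper written for this note, not part of the test program; it only renders the nine rwx bits plus the sticky bit.

```c
/*
 * Render the nine permission bits (and the sticky bit) of `mode` as an
 * ls(1)-style string such as "rw-------".  buf must hold 10 bytes.
 * Setuid/setgid bits are ignored; this is illustration only.
 */
void
mode_string(unsigned int mode, char buf[10])
{
    static const char rwx[] = "rwxrwxrwx";
    int i;

    for (i = 0; i < 9; i++)
        buf[i] = (mode & (0400u >> i)) ? rwx[i] : '-';
    /* Show the sticky bit the way ls does: 't' with other-execute, else 'T'. */
    if (mode & 01000u)
        buf[8] = (mode & 0001u) ? 't' : 'T';
    buf[9] = '\0';
}
```

Decimal 600 is octal 01130, so before the umask is applied it asks for "--x-wx--T", whereas 0600 asks for "rw-------".  Either way the file gets created, which is all the test needs.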

Re: restrict connection

1999-06-10 Thread David E. Cross
We have similar restrictions for a certain number of our machines; we have
solved this problem by using FreeBSD's built-in firewall
(just add 'options IPFIREWALL' to your kernel config script).  Here is
a *very* simple firewall config to do some such restrictions.
You may note that there are multiple accept lines for this; this is done
to allow good security and logging for each connection.  Without these multiple
steps an attacker could determine what services you are running by forging
TCP packets to make it look like connections are already there...

add 100 allow all from any to any via lo0  #standard rule for lo0

#log and allow standard telnet from IP1 *only*, and only from IF1 interface
add 200 allow log tcp from IP1 to HOSTIP 23 setup in recv IF1
add 200 allow tcp from IP1 to HOSTIP 23 in recv IF1
add 200 allow tcp from HOSTIP 23 to IP1 out xmit IF1
add 200 deny log tcp from any to HOSTIP 23

#Allow ssh connects from a secured subnet
add 300 allow log tcp from NET1/NET1MASK to HOSTIP 22 setup in recv IF1
add 300 allow tcp from NET1/NET1MASK to HOSTIP 22 in recv IF1
add 300 allow tcp from HOSTIP 22 to NET1/NET1MASK out xmit IF1
add 300 deny log tcp from any to HOSTIP 22

Wash, rinse, and repeat for your other services.  FTP will be a bit tricky
since it is a 2-way communication, but I have done it: you just open up a
set of ports in the 4000-5000 range, make sure nothing ever runs on them
and that they are used for outbound connections only, and accept only
"established" packets in on them.  The OS will bind to the first port that it
can, and for me so far that has taken into account firewall rules.

High syscall overhead?

1999-06-11 Thread David E. Cross
Just doing some performance testing and I noticed something rather

Here is the test program:
int main (void)
{
int count=0;
for(count=0;count <1000;++count)
/* the loop body (the system call being timed) is missing from this copy */ ;

return 0;
}

The time on linux for this program is ~5 seconds (linux "time" reports 3.x, but
a wall clock clearly shows 5.x, go fig).  FreeBSD reports 18.x seconds?!  I
have a dual processor system and decided to run two copies in parallel... it took
52!?! seconds; linux on the same machine was again about 5.  Looking through
exception.s it appears that on entry to the kernel an MP lock is obtained...
I thought we had splX(); to protect concurrency in the kernel.

I am just curious what's the story with this.  On some of my other tests it is
clear that FreeBSD is handling concurrency much better than linux (by an equal
factor actually, and on "real" tasks like real I/O handling).

Re: High syscall overhead?

1999-06-11 Thread David E. Cross
Oops, here is some additional information from my system:

bash-2.02$ cat sc.c
int main (void)
int count=0;
for(count=0;count <1000;++count)

return 0;
bash-2.02$ cc -o sc sc.c
bash-2.02$ uptime
11:19AM  up 3 hrs, 4 users, load averages: 0.01, 0.01, 0.00
bash-2.02$ time ./sc

sys 0m18.521s
bash-2.02$ (date;time ./sc;date) & (date;time ./sc;date) &
[3] 516
[4] 517
Fri Jun 11 11:21:51 EDT 1999
[1]   Donetime ./sc
Fri Jun 11 11:21:51 EDT 1999
[2]   Donetime ./sc
sys 0m52.056s

sys 0m52.053s
Fri Jun 11 11:22:43 EDT 1999
Fri Jun 11 11:22:43 EDT 1999

bash-2.02$ dmesg | head
Copyright (c) 1992-1999 FreeBSD Inc.
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California. All rights reserved.
FreeBSD 3.2-STABLE #0: Fri Jun 11 08:18:12 EDT 1999
Timecounter "i8254"  frequency 1193182 Hz
CPU: Pentium II/Xeon/Celeron (686-class CPU)
  Origin = "GenuineIntel"  Id = 0x652  Stepping=2
real memory  = 268435456 (262144K bytes)
avail memory = 257925120 (251880K bytes)
Programming 24 pins in IOAPIC #0
FreeBSD/SMP: Multiprocessor motherboard
 cpu0 (BSP): apic id:  1, version: 0x00040011, at 0xfee0
 cpu1 (AP):  apic id:  0, version: 0x00040011, at 0xfee0
 io0 (APIC): apic id:  2, version: 0x00170011, at 0xfec0
Preloaded elf kernel "kernel" at 0xc02df000.
Probing for devices on PCI bus 0:
chip0:  rev 0x03 on pci0.0.0
chip1:  rev 0x03 on pci0.1.0

It is -STABLE from June 7th, mid-day.

-STABLE, panic #15

1999-06-11 Thread David E. Cross
Yes, it has happened again (twice in fact).

I am desperately trying to find the source of this.  So far my conclusions
have led me to a race in unlink and NFS somewhere (still have no clue where).
And it is only from Sun clients to date.  Also, this started happening in
earnest around when we put the latest patches on our Suns (this hadn't been
mentioned before).  Seeing how I can reliably reproduce this panic (I am trying
today's STABLE now to see if I still can), does anyone have anything they
would like me to try?  This is reaching critical proportions for us.

Re: -STABLE, panic #15

1999-06-11 Thread David E. Cross
Yes, I have determined (just today) that the PANIC is only Solaris, and only
with NFSv3.  (It may be possible with NFSv2, but my program doesn't do it
as quickly.)  I have an NFS traffic dump of a mere 19K of all nfs traffic to
the machine before the panic.  Also, it does NOT ALWAYS cause a panic.  95%
of the time it does; the rest of the time it just stops serving NFS.  I
had it happen again now, and I looked and noticed that all NFSds were in
disk-wait, in channel "inode".  Things are starting to smell a bit
sweeter for me.  I continue to look around the kernel source for clues, but
it is difficult.

a copy of the packet dump is at:

Please help.

Re: -STABLE, panic #15

1999-06-11 Thread David E. Cross
Update, even smaller... 6.5K file, patoot.2 now exists in the same location.

Re: -STABLE, panic #15

1999-06-11 Thread David E. Cross
Ok, I am hot on the trail... I have found a commonality between at least
2 of the panics (the 2 I listed)...

it is as follows:

request: create cp1
request: create cp1
reply: ok
reply: error, file exists
request: lookup
request: lookup
(never any response to those.)

I am guessing that the second create gets a lock but never releases it under
these conditions.  The lookup comes along and *wham*.  Note that almost all of
these panics occur within the nfs_lookup routines.  I have *one* panic out
of all of these (with lockmgr) that doesn't; after I nail this down, I
will try to trace that down if it is a separate panic.

I am very close on this, I would feel bad and great at the same time if
someone beats me to this.

Re: do softupdates work on SMP -stable and -current now?

1999-06-14 Thread David E. Cross
I certainly hope they are working under SMP; I am running
a 4-way Pentium-III Xeon box using vinum and softupdates.  So far it has
been a champ.

3.2-STABLE panic #17 (NFS -- additional information)

1999-06-14 Thread David E. Cross
Ok, now that I have hopefully gotten the critical people's attention, I
will proceed with the details:

1: My test program can only reproduce this with NFSv3/UDP from a recently
patched Solaris system to a FreeBSD server.  I have not tested older
Solaris patches, I suspect that they will not cause the panic.

2: Recent packet traces of the NFS traffic reveal that it is likely *NOT*
unlink that is causing the problem.  I now have 6 packet traces and all fail
within a couple of packets of the NFS server responding to a create request
with "ERROR: File exists" (that is TCPDUMP terminology).  Looking through
the create call, I think I can see it, but it is tough.

2a: only *one* "ERROR: File exists" has ever been seen per single crash.

3: Based on #1 and #2 I can surmise that other OSs will eventually trip this, 
but the access pattern is sufficiently different that my test program will not
do it.

4: It is not a concurrency issue as far as I can tell.  I wrapped all of
nfssrv_create() (or whatever it is called) in a spl_softclock(), splx() pairing
and I could still cause the panic.

5: I tried compiling with "MAX_PERF" which in effect comments out the panic, 
this caused all access to the directory to result in (D)iskwait, with WCHAN of
"inode".  Effectively bringing the NFS server down, but without the advantage
of having the machine come back by itself.

I am still digging around, but this is significantly above my head.  It is
great fun, and I wouldn't mind continuing except this is becoming a difficult
issue.  We have backed everything down to NFSv2, but existing mounts are
difficult to get rid of.

3.2-STABLE, panic #end. Problem found..

1999-06-14 Thread David E. Cross
I think I found the problem, and I have a pseudo-fix... (the machine no longer panics).

This is a bleeding edge development, I have not had time to refine this code
any.  The problem is that nfs_create for NFSv3 does not release the lock
for "vp" with a vput() before it exits.  My crude "patch" follows (read
"hacked, with an axe").
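
To make the shape of the bug concrete, here is a toy user-space model.  It is a sketch only: toy_vnode and both functions are invented for illustration and are not the code in nfs_serv.c.  The leaky version returns on the "file exists" path while still holding the lock; the fixed version releases it on every path, which is the job vput() does in the real code.

```c
#include <errno.h>

/* Toy vnode with a single exclusive lock; not the kernel's struct vnode. */
struct toy_vnode {
    int locked;
    int exists;  /* nonzero once the file has been created */
};

/* Buggy shape: the EEXIST path returns while still holding the lock. */
int
toy_create_leaky(struct toy_vnode *vp)
{
    vp->locked = 1;     /* the name lookup hands back a locked vnode */
    if (vp->exists)
        return EEXIST;  /* lock is never dropped on this path */
    vp->exists = 1;
    vp->locked = 0;
    return 0;
}

/* Fixed shape: every exit path goes through the same release. */
int
toy_create_fixed(struct toy_vnode *vp)
{
    int error = 0;

    vp->locked = 1;
    if (vp->exists)
        error = EEXIST;
    else
        vp->exists = 1;
    vp->locked = 0;     /* single release point for all paths */
    return error;
}
```

With the leaky version, a second create of the same name leaves the toy vnode locked, which matches the create, create, "file exists", lookup sequence seen in the packet traces.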

oops, here's the patch

1999-06-14 Thread David E. Cross
*** nfs_serv.c  Tue Jun  8 15:53:11 1999
--- /cs/crossd/nfs_serv.c   Mon Jun 14 16:05:45 1999
*** 1343,1348 
--- 1343,1349 
fhandle_t *fhp;
u_quad_t frev, tempsize;
u_char cverf[NFSX_V3CREATEVERF];
+   int eexistdebug=0;
  #ifndef nolint
rdev = 0;
*** 1380,1385 
--- 1381,1387 
if (nd.ni_vp) {
error = EEXIST;
+   eexistdebug=1;
*** 1489,1497 
zfree(namei_zone, nd.ni_cnd.cn_pnbuf);
vp = nd.ni_vp;
!   if (nd.ni_dvp == vp)
!   else
VOP_ABORTOP(nd.ni_dvp, &nd.ni_cnd);
if (vap->va_size != -1) {
--- 1491,1499 
zfree(namei_zone, nd.ni_cnd.cn_pnbuf);
vp = nd.ni_vp;
!   if (nd.ni_dvp == vp) 
!   else 
VOP_ABORTOP(nd.ni_dvp, &nd.ni_cnd);
if (vap->va_size != -1) {
*** 1505,1513 
error = VOP_SETATTR(vp, vap, cred,
!   if (error)
if (!error) {
bzero((caddr_t)fhp, sizeof(nfh));
--- 1507,1516 
error = VOP_SETATTR(vp, vap, cred,
!   if (error) 
+   if (eexistdebug)  vput(vp);
if (!error) {
bzero((caddr_t)fhp, sizeof(nfh));

To Unsubscribe: send mail to majord...@freebsd.org
with "unsubscribe freebsd-hackers" in the body of the message


1999-06-14 Thread David E. Cross
I have been looking at the code for UMAPfs... I am trying to understand 
conceptually why it is so unstable...  It looks straightforward enough as
simply passing the calls it receives on to the FS below it, almost like it
didn't exist at all.  Why does this cause problems?  Isn't the only difference
between a UMAP/UNION FS and a "native" FS an additional stack frame in the

(As I am starting to wrap up this FS adventure, I am looking to start another:)

NFSv3 fixes...

1999-06-14 Thread David E. Cross
Sorry about that everyone, I 'repl'ied to the wrong message.

> Ack, you may have opened up a can of worms here.  I don't even think
> that nfs_namei() does the right thing when it returns an error... it
> doesn't look like it clears the ndp->ni_vp either in some error cases.
Who, me?  Open a can of worms?  ;)

> We are going to have to instrument the code - basically means NULLing
> out ni_vp and any local vnode pointer when the vnode in question is
> released so we can keep track of it and putting KASSERT()s in strategic
> places.  nfs_namei() in nfs/nfs_subs.c and just about all the subroutines
> defined in nfs/nfs_serv.c.

That was along the lines of my thoughts too... it became painfully obvious
that this sort of bug could be (and probably is) everywhere in the nfs
server code.  I will be happy to follow your lead on this (honored one
may say).  I am hoping to have some time to deal with this tonight, but I did
just get my CD-RW drive.  We should probably take the time to document the
code some more while we are at it... simple things like commenting what
braces go to what would have greatly eased my trace through the code :)
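
The instrumentation idea quoted above (NULL the pointer when the vnode is released, then assert in strategic places) could look roughly like this in user-space C.  This is only a sketch: toy_vnode and toy_vrele are hypothetical names, and plain assert() stands in for KASSERT().

```c
#include <assert.h>
#include <stddef.h>

/* Toy reference-counted object standing in for a vnode. */
struct toy_vnode {
    int refs;
};

/*
 * Drop one reference and clear the caller's pointer, so that any later
 * use of the stale pointer trips an assertion (or faults on NULL)
 * instead of silently touching a released object.
 */
void
toy_vrele(struct toy_vnode **vpp)
{
    assert(*vpp != NULL);       /* catches a double release */
    assert((*vpp)->refs > 0);
    (*vpp)->refs--;
    *vpp = NULL;
}
```

Taking a pointer to the pointer is the design choice that matters here: the release routine itself clears the caller's copy, so nobody has to remember to do it by hand.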

Re: umapfs...

1999-06-15 Thread David E. Cross
>> I have been looking at the code for UMAPfs... I am trying to understand 
>> conceptually why it is so unstable...
>You're looking in the wrong place. It's unstable because of
>infrastructure problems which require fairly substantial amounts of
>work to correct.

I guess that is what I am asking... What is different between the following:

int foo(void){
return 0;
}

int foo_prime(void) {
return foo();
}

That is my interpretation of the code.  It would *seem* to just pass the
call off to the next FS layer as if the VFS system of the kernel had done it
directly.  Conceptually I must be missing something.
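
Spelled out a little further, the mental model above is a stacked layer that forwards each operation to the layer below it.  A minimal sketch, with an invented one-operation interface (toy_layer, getuid_of and both functions are hypothetical, not the kernel's VFS):

```c
/* Hypothetical one-operation "filesystem" interface, for illustration only. */
struct toy_layer {
    int (*getuid_of)(struct toy_layer *self, int file);
    struct toy_layer *lower;  /* next layer down, NULL at the bottom */
};

/* Bottom layer: pretend file N is owned by uid 1000 + N. */
int
native_getuid_of(struct toy_layer *self, int file)
{
    (void)self;
    return 1000 + file;
}

/* Stacked layer: adds nothing but one extra call frame. */
int
passthru_getuid_of(struct toy_layer *self, int file)
{
    return self->lower->getuid_of(self->lower, file);
}
```

In this toy the stacked and native layers give identical answers.  It deliberately models nothing about locking, reference counts or caching, so it says nothing about where the real instability comes from.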

Re: Holy cow - path component freeing a mess? (was Re: D'oh!)

1999-06-15 Thread David E. Cross
> :Umm, okay but I'm a little confused about how the zfree I'm adding to
> :nfs_nget falls under this. Am I being really stupid here?
> it's unrelated.  I was starting a new thread.
> I have finished fixing up nfs_serv.c and am now testing it.  Most of
> the procedures required significant adjustments to catch all the 
> problems - mainly due to the various NFS macros in nfsm_subs.h doing
> 'goto nfsmout;'.

Way to go!  I was hoping this would happen... it is the miracle of Open Source.
I am a bit sad that I'm not doing any of the stuff now though :(, you guys
are just too gosh darn quick.

Seriously though... when are we likely to see this stuff hit -STABLE?  I would
like to dig through your nfs_serv.c at some point before it gets committed
too.  There are a couple of other NFSv3 bugs that I have been tracking and I
would like to see if this addresses those.

Re: Holy cow - path component freeing a mess? (was Re: D'oh!)

1999-06-15 Thread David E. Cross
> The differences between -current and -stable for nfs_serv.c and nfs_subs.c
> are relatively minor.  Once we've life tested the hell out of it in 
> current, it should be easy to MFC into stable.  Maybe 3 weeks total.

Hmm... that is a bit long for us... 3 weeks, 21 days at 1.7 day/panic =
0.59 panic/day ==  21 (day) * 0.59 (panic/day) [remember to check your units ;]
that's another 12 panics for us (if they keep up at the current rate).
Luckily since we have backed down to NFSv2 we are a bit more stable.  The only
reason the server went down today is because the IDE disk decided to flake out.
It was amazing, even though the OS disk was dead it still continued to serve
some NFS requests (from different partitions of course :)
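
The back-of-the-envelope estimate above, written out as a sketch (expected_panics is just a hypothetical helper name):

```c
/* Expected number of panics over `days`, given the mean days between panics. */
double
expected_panics(double days, double days_per_panic)
{
    return days / days_per_panic;
}
```

With 21 days and 1.7 days per panic this comes to a little over 12, in line with the figure quoted.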

Don't get me wrong, I agree this is the correct procedure, but I plan to roll
my own nfs_serv.c until it gets MFC-ed... and I'll be able to provide you
with some real world test results of the new code on -STABLE.

vinum performance

1999-06-17 Thread David E. Cross
I have a drive that is rated at ~16 Meg/second, and indeed it delivers on the
order of 15+ Meg/second.  If I use Vinum to create a concatenated device
of 2 such units performance drops to 2.5 Meg/sec.  This seems like a
drastic drop in performance.  Any ideas what I am doing incorrectly?

Re: vinum performance

1999-06-18 Thread David E. Cross
> > I have a drive that is rated at ~16 Meg/second, and indeed it delivers on
> > the order of 15+ Meg/second.  If I use Vinum to create a concatenated device
> > of 2 such units performance drops to 2.5 Meg/sec.  This seems like a
> > drastic drop in performance.
> Indeed, if you're comparing apples with apples.
> > Any ideas what I am doing incorrectly?
> No.  You haven't really given any details.  
> Most of the performance testing I have done has been with striped
> plexes (which offer the potential for better performance), and I've
> found that in massively concurrent situations the performance is
> roughly what you would expect (almost n * normal disk performance,
> where n is the number of disks in the stripe set).  I'd expect
> performance of a concatenated plex to be pretty close to that of the
> raw disk.  How are you measuring performance?  I'd recommend rawio
> (ftp://ftp.lemis.com/pub/rawio.tar.gz).

Ok, I am terribly sorry I didn't provide more information.  I was very tired
(it has been a long week; after the NFS work the main NFS server that has
been having all of the problems decided that its main OS partition was going
to have a hardware failure...)  Anyway, here is some more information...

bash-2.03$ df
Filesystem         1K-blocks     Used    Avail Capacity  Mounted on
/dev/da0s1a            99183    21526    69723    24%    /
/dev/da0s1e          2032623  1062270   807744    57%    /usr
/dev/da0s1f           198399     3466   179062     2%    /var
/dev/vinum/concat   29077993   252757 26498997     1%    /mnt
bash-2.03$ cd /var/tmp
bash-2.03$ df -k .
Filesystem   1K-blocks     Used    Avail Capacity  Mounted on
/dev/da0s1f     198399     3466   179062     2%    /var
bash-2.03$ dd bs=64k if=/dev/zero of=foo count=2048
2048+0 records in
2048+0 records out
134217728 bytes transferred in 10.218804 secs (13134387 bytes/sec)
bash-2.03$ cd /mnt
bash-2.03$ df -k .
Filesystem         1K-blocks     Used    Avail Capacity  Mounted on
/dev/vinum/concat   29077993   252757 26498997     1%    /mnt
bash-2.03$ dd bs=64k if=/dev/zero of=foo count=2048
2048+0 records in
2048+0 records out
134217728 bytes transferred in 59.653922 secs (2249940 bytes/sec)
bash-2.03$ vinum info
Can't open history file /var/tmp/vinum_history: Permission denied (13)
Can't open /dev/vinum/Control: Permission denied
bash-2.03$ su -
hostname# vinum info
Flags: 0x80204
Total of 21 blocks malloced, total memory: 9552
Maximum allocs: 1264, malloc table at 0xc3583ad4
hostname# vinum printconfig
# Vinum configuration of hostname, saved at Fri Jun 18 16:08:56 1999
drive drive1 device /dev/da0s1h
drive drive2 device /dev/da1s1h
volume concat
plex name concat.p0 org concat vol concat 
sd name concat.p0.s0 drive drive1 plex concat.p0 len 27597000b driveoffset 265b plexoffset 0b
sd name concat.p0.s1 drive drive2 plex concat.p0 len 32405704b driveoffset 265b plexoffset 27597000b

If you need anything else it can probably be provided.  Oh, this is 3.2-STABLE
from last week.

Re: Microsoft performance (was: ...)

1999-06-24 Thread David E. Cross
I certainly hope you have applied the recent NFS patches.  That should solve
your problem.

Re: Microsoft performance (was: ...)

1999-06-24 Thread David E. Cross
> A simple start would be to explicitly put a macro or call in each 
> syscall to push down the lock.  That way people can move that
> macro farther and farther down in the syscall code path, hopefully
> removing it entirely in some cases.  I think having the call at
> the beginning of each syscall would motivate people into doing that
> sort of work.
> "Hey, y'know getppid() is safe, i'll just take the lock out."
> "this function xxx() is safe until this point I can process a lot
> before actually needing this lock..."
> "y'know I just have a structure that's not accessible to any other calls
> that i'm going to fill in, i'll just lift the lock right here"
> "if I just do this something here, I really am re-entrant and safe.."
> Providing a simple api for spinlocks and mutexes would be very nice.
> If some of the FreeBSD gods (core) said something along the lines
> of we'd like to see the process table have XXX method of access
> and locking people will code it, the same way with the many other
> subsystems.
> Things like pmap and UFS and INET will be a royal pain to get
> SMP safe, however baby steps towards lifting the lock for
> simpler subsystems will lead the way.  FreeBSD has the
> most intelligent people in the industry working together, 
> all that is needed is a starting point.

I think mutex is the way to go.  I am 100% for it, and I think now that this
problem is getting a good deal of light we should start to do something about
it.

One of the problems with locks that doesn't seem to have been mentioned
(although I am sure many have thought of it) is deadlock.  You get A waiting
for B and B waiting for A.  With mutexes (mutexi?) you would lock just the
resource that you are currently working on, and you would be guaranteed to
release it (if the programmers do it right, of course ;).  The advantage of a
mutex is that you don't need to be as omnipotent to use it.
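
Per-resource mutexes can still deadlock if two threads take the same two
locks in opposite order, so the usual remedy is a fixed lock order.  Below is
a minimal sketch of that idea using userland pthreads, not any kernel API;
struct resource, lock_pair, unlock_pair and move_value are names made up for
this example.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical resource protected by its own mutex. */
struct resource {
	pthread_mutex_t lock;
	long value;
};

/*
 * Lock two resources in a fixed (address) order so that two threads can
 * never each hold one lock while waiting for the other.
 */
void
lock_pair(struct resource *a, struct resource *b)
{
	if (a == b) {
		pthread_mutex_lock(&a->lock);
	} else if ((uintptr_t)a < (uintptr_t)b) {
		pthread_mutex_lock(&a->lock);
		pthread_mutex_lock(&b->lock);
	} else {
		pthread_mutex_lock(&b->lock);
		pthread_mutex_lock(&a->lock);
	}
}

/* Release whatever lock_pair() took. */
void
unlock_pair(struct resource *a, struct resource *b)
{
	pthread_mutex_unlock(&a->lock);
	if (a != b)
		pthread_mutex_unlock(&b->lock);
}

/* Move `amount` from one resource to the other while holding both locks. */
void
move_value(struct resource *from, struct resource *to, long amount)
{
	lock_pair(from, to);
	from->value -= amount;
	to->value += amount;
	unlock_pair(from, to);
}
```

With this, one thread calling move_value(&x, &y, n) and another calling
move_value(&y, &x, n) both take the locks in the same order, so the
"A waiting for B and B waiting for A" case cannot arise between them.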

bug in latests NFS patches for -stable

1999-06-29 Thread David E. Cross
There is a small but critical error in the latest patches which causes the
server to never transmit a response packet back to the client under certain
conditions on an NFS create RPC.  Below is the updated NFS3 patch.  If
jullian could take this for review and place it at the "official" unofficial
URL it would be much appreciated :)  It is a one line patch.


Re: Redundant Remote Webserver clustering

1999-06-29 Thread David E. Cross
> Miguel Gilly wrote:
> > 
> > Bonsai Studio: Web Design and More
> > http://www.bonsai-studio.com
> > 
> > Dear Sirs,
> > 
> > I would find it extremely helpful if FreeBSD could offer redundant
> > clustering capabilities for ISP applications.
> > 
> > Nowadays I feel that it is a far better choice to choose a x86 Unix cluster
> > over the expensive Sun/SGI SMP servers.
> > 
> > I found some affordable tools for Linux, but almost nothing for FreeBSD. I
> > feel such an ability  would raise the value a lot of FreeBSD.
> Define clustering.  If you mean a bunch of boxes that serve up HTTP
> requests and the lot of them continue working in the face of a 
> failure on one, you CAN do this with FreeBSD, and the "Beowulf"
> software you're probably thinking of for Linux WILL NOT do this.
I have looked into the "Beowulf" system a lot recently.  It is nothing but
a glorified COW design.  And it uses "off the shelf" software components
that often run under FreeBSD as well as or better than under Linux.  I used to
think it was a big deal.  Not any more :I  This is a tangent though :)

> You do this on FreeBSD (or Linux or Solaris) by creating a "layer 4
> router" or HTTP switch that directs traffic evenly among your several
> web servers, and stops sending traffic to servers that have failed.
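
The selection step of the "layer 4 router" described in the quote above can
be sketched roughly as follows.  This is only an illustration of round-robin
with failed servers skipped, not any particular product; struct backend,
pick_backend and the healthy flag are assumptions made for the example, and
how the flag gets updated (health checks) is left out.

```c
#include <stddef.h>

/* Hypothetical description of one web server behind the balancer. */
struct backend {
	const char *name;	/* label only, for illustration */
	int healthy;		/* nonzero if the last health check passed */
};

/*
 * Round-robin over the healthy backends.  *cursor holds the index of the
 * previous pick and is advanced on success.  Returns the chosen index,
 * or -1 if no backend is currently healthy.
 */
int
pick_backend(const struct backend *pool, size_t n, size_t *cursor)
{
	size_t i;

	for (i = 1; i <= n; i++) {
		size_t idx = (*cursor + i) % n;

		if (pool[idx].healthy) {
			*cursor = idx;
			return (int)idx;
		}
	}
	return -1;
}
```

Note that this does nothing about the balancer itself being a single point
of failure; it only keeps traffic away from web servers marked as failed.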
Where could someone find information on setting this up, and what software
to use?  I have someone who would be very interested in this.  Isn't the
"layer 4 router" a SPoF though?

3.2-19990630-STABLE: ATAPI 1.1: unknown phase

1999-06-30 Thread David E. Cross
The error message in the subject (atapi 1.1: unknown phase) has plagued me
for some time... everything still works, it just displays that error on
the first access to the disk... until today.  Today I am trying to install
FreeBSD 3.2  (19990630) from CDROM.  It hangs on probing devices (likely
acd0), however if I boot to the system already there (this is an upgrade,
the system was originally installed via NFS) I can access the CDROM
with only the aforementioned "warning".  Any ideas?

Re: Kernel Drivers

1999-07-11 Thread David E. Cross
Hmm... perhaps if Anthony is willing we can use his experience to help us
further document the procedure for writing a FreeBSD PCI device driver?

brooktree 848 OEM card/no sound :(

1999-07-14 Thread David E. Cross
I am helping a friend install FreeBSD on his machine
(it is running 4.0-CURRENT now).  Everything works flawlessly, except his
OEM BrookTree 848 based soundcard.  The card itself is transplanted from
his gateway machine (where it also had the same problems).  Here are some
details:

Machine is a Dual-PII-400 with a SB-AWE64 PCI soundcard.  The Bt848 is 
unrecognized, and I need to sysctl -w hw.bt848.tuner=[49] to get it to 
work (either one works as far as I can tell with no difference).

Copyright (c) 1992-1999 The FreeBSD Project.
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California. All rights reserved.
FreeBSD 4.0-CURRENT #0: Tue Jul 13 20:33:07 EDT 1999
Timecounter "i8254"  frequency 1193182 Hz
CPU: Pentium II/Xeon/Celeron (686-class CPU)
  Origin = "GenuineIntel"  Id = 0x652  Stepping = 2
real memory  = 268435456 (262144K bytes)
avail memory = 258150400 (252100K bytes)
Programming 24 pins in IOAPIC #0
FreeBSD/SMP: Multiprocessor motherboard
 cpu0 (BSP): apic id:  0, version: 0x00040011, at 0xfee0
 cpu1 (AP):  apic id:  1, version: 0x00040011, at 0xfee0
 io0 (APIC): apic id:  2, version: 0x00170011, at 0xfec0
Preloaded elf kernel "kernel" at 0xc02fa000.
Pentium Pro MTRR support enabled, default memory type is uncacheable
Probing for PnP devices:
CSN 1 Vendor ID: CTL00e4 [0xe4008c0e] Serial 0x01c574c8 Comp ID: PNPb02f 
pcm1 (SB16pnp  sn 0x01c574c8) at 0x220-0x22f irq 5 drq 1 flags 0x15 on isa
npx0:  on motherboard
npx0: INT 16 interface
apm0:  on motherboard
apm: found APM BIOS version 1.2
pcib0:  on motherboard
pci0:  on pcib0
WARNING: "bktr" is usurping "bktr"'s cdevsw[]
chip0:  at device 0.0 on pci0
pcib1:  at device 1.0 on pci0
pci1:  on pcib1
vga-pci0:  irq 17 at device 0.0 on pci1
isab0:  at device 7.0 on pci0
ide_pci0:  at device 7.1 on pci0
uhci0:  at device 7.2 on pci0
uhci0: could not map ports
device_probe_and_attach: uhci0 attach returned 6
chip1:  at device 7.3 on pci0
vx0: <3COM 3C590 Etherlink III PCI> irq 17 at device 16.0 on pci0
utp/aui/bnc[*utp*]: disable 'auto select' with DOS util! address 
Warning! Defective early revision adapter!
isa0:  on motherboard
fdc0:  at port 0x3f0-0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> at fdc0 drive 0
wdc0 at port 0x1f0-0x1f7 irq 14 flags 0xa0ffa0ff on isa0
wdc0: unit 0 (wd0): , DMA, 32-bit, multi-block-16
wd0: 13783MB (28229040 sectors), 28005 cyls, 16 heads, 63 S/T, 512 B/S
wdc1 at port 0x170-0x177 irq 15 flags 0xa0ffa0ff on isa0
wdc1: unit 0 (atapi): , removable, accel, ovlap, dma, iordis
wcd0: drive speed 5512KB/sec, 128KB cache
wcd0: supported read types: CD-R, CD-RW, CD-DA, packet track
wcd0: Audio: play, 256 volume levels
wcd0: Mechanism: ejectable tray
wcd0: Medium: CD-ROM 120mm audio disc loaded, unlocked
atkbdc0:  at port 0x60-0x6f on isa0
atkbd0:  irq 1 on atkbdc0
psm0:  irq 12 on atkbdc0
psm0: model Generic PS/2 mouse, device ID 0
vga0:  at port 0x3b0-0x3df iomem 0xa-0xb on isa0
sc0:  on isa0
sc0: VGA <16 virtual consoles, flags=0x200>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
ppc0 at port 0x378-0x37f irq 7 flags 0x40 on isa0
ppc0: SMC-like chipset (ECP/EPP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/9 bytes threshold
plip0:  on ppbus 0
lpt0:  on ppbus 0
lpt0: Interrupt-driven port
ppi0:  on ppbus 0
vx0 XXX: driver didn't set ifq_maxlen
changing root device to wd0s1a
SMP: AP CPU #1 Launched!
acd0: read_toc failed

(kernel config)
machine         i386
cpu             I686_CPU
ident           MABROOK
options         INET                    #InterNETworking
options         FFS                     #Berkeley Fast Filesystem
options         FFS_ROOT                #FFS usable as root device [keep this!]
options         MFS                     #Memory Filesystem
options         NFS                     #Network Filesystem
options         MSDOSFS                 #MSDOS Filesystem
options         CD9660                  #ISO 9660 Filesystem
options         PROCFS                  #Process filesystem
options         COMPAT_43               #Compatible with BSD 4.3 [KEEP THIS!]
options         UCONSOLE                #Allow users to grab the console
options         USERCONFIG              #boot -c editor
options         VISUAL_USERCONFIG       #visual boot -c editor
options         KTRACE                  #ktrace(1) syscall trace support
options         SYSVSHM                 #SYSV-style shared memory
options         SYSVMSG                 #SYSV-style message queues
options         SYSVSEM                 #SYSV-style semaphores
options         SMP                     # Symmetric MultiProcessor Kernel
options         APIC_IO                 # Symmetric (APIC) I/O

Re: Swap overcommit

1999-07-15 Thread David E. Cross
> > No, wait, I got that wrong I think.
> > 
> > Oh yah, I remember now.  Hmm.  How odd.  I came across a case where
> > read() could return -1 and not set errno properly if errno
> > was already set, but a perusal of the kernel code seems to indicate
> > that this can't happen.  Very weird.
> > 
> I thought I saw this somewhere too, but I thought it was more of a case that
> it was somewhere *inside* read that errno had to be preserved. i.e. errno
> gets set somewhere at the top of the code, and if it was already set at a
> certain point, failure was expected, and to pass along the original errno,
> not the new one.
> Or perhaps we're sharing a hallucination. :)
Well, set/getpriority(2), certainly can return "-1"  and not be an error.
You would need to clear out errno before that call and check it on return.
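
As a minimal sketch of that pattern: clear errno, call getpriority(2), and
treat -1 as an error only if errno was set.  The wrapper name get_nice and
its return convention are made up for this example.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <errno.h>

/*
 * Fetch a priority (nice value) without confusing a legitimate value of
 * -1 with an error.  Returns 0 and stores the value in *prio on success,
 * or -1 with errno set on failure.
 */
int
get_nice(int which, id_t who, int *prio)
{
	int value;

	errno = 0;			/* clear any stale error first */
	value = getpriority(which, who);
	if (value == -1 && errno != 0)
		return -1;		/* a real error */
	*prio = value;			/* -1 here is a valid priority */
	return 0;
}
```

For example, get_nice(PRIO_PROCESS, 0, &p) asks for the priority of the
calling process.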

This is where exceptions would be a great gain.  They could also be used to
force programmers to check their system calls more closely.  Oops, you didn't
handle exception foo?  SIGBADPROGRAMMER.

Device Driver Question (bus_set_resource)

2001-01-17 Thread David E. Cross

I am writing a simple, I/O only device driver (no lectures about /dev/io
please ;).  It has no PnP abilities, and I have run into the following
problem with bus_set_resource():  

static int das1400adc_isa_probe(device_t dev)
{
    struct das1400adc_softc *sc = device_get_softc(dev);
    int unit = device_get_unit(dev);
    int pnperror;

    pnperror=ISA_PNP_PROBE(device_get_parent(dev), dev, das1400adc_pnp_ids);
    if (pnperror != ENXIO)
        return pnperror;

    if (bus_set_resource(dev, SYS_RES_IOPORT, /*rid*/sc->port0_rid,
        sc->port0, 3) < 0)
        return ENXIO;
/*  if (bus_set_resource(dev, SYS_RES_IOPORT, sc->port1_rid,
        sc->port1, 1) < 0)
        return ENXIO;
    if (bus_set_resource(dev, SYS_RES_IOPORT, sc->port2_rid,
        sc->port2, 1) < 0)
        return ENXIO;
*/
    device_set_desc(dev, "CIO-DAS1400-ADC");
    return 0; /* all is good */
}

static int das1400adc_isa_attach(device_t dev)
{
    struct das1400adc_softc *sc = device_get_softc(dev);

    sc->port0_r = bus_alloc_resource(dev, SYS_RES_IOPORT,
        &sc->port0_rid, /*start*/0, /*end*/ ~0, /*count*/ 0,

/*  sc->port1_r = bus_alloc_resource(dev, SYS_RES_IOPORT,
        &sc->port1_rid, 0, ~0, 0,

    sc->port2_r = bus_alloc_resource(dev, SYS_RES_IOPORT,
        &sc->port2_rid, 0, ~0, 0,
*/
    if (sc->port0_r == NULL )
/* || sc->port1_r == NULL )
    sc->port2_r == NULL)
*/
        return ENXIO;

    sc->md_dev=make_dev(&das1400adc_cdevsw, 0, 0, 0, 0600, "adc0");
    return 0;
}

Given that code, I get the following attach messages from the kernel:

"das1400adc2:  at port 0x310-0x312 irq 5 drq 1,5 on isa0"
Uhm... I set neither the IRQ nor the DRQ... where does it get these from, and
how can I get it to "do the right thing"?  Also, if I uncomment the settings
for the additional ranges "really weird things" start to happen.  An example
of 'weirdness' is that the exact same code, when kldload-ed, will attach a
totally different device.

Oh yeah, this is under 4.2-STABLE from 20010103.

Re: Device Driver Question (bus_set_resource)

2001-01-17 Thread David E. Cross

Thank you...

After a couple of hours, Jon Chen and I have figured out most of what you 
just said :P :)

How would one use hints with a kld?

Re: Device Driver Question (bus_set_resource)

2001-01-17 Thread David E. Cross

> > Thank you...
> > 
> > After a couple of hours, Jon Chen and I have figured out most of what you 
> > just said :P :)
> > 
> > How would one use hints with a kld?
> Badly. 8(  You can only really set them with the loader right now.
> There are a couple of kernel datastores that need some tweaking; the 
> environment is one of them.

That is what I thought... given my recent performance on "what I thought" WRT
the kernel, I thought I would double check ;)

Re: Device Driver Question (bus_set_resource)

2001-01-18 Thread David E. Cross

> > Thank you...
> > 
> > After a couple of hours, Jon Chen and I have figured out most of what you 
> > just said :P :)
> > 
> > How would one use hints with a kld?
> Badly. 8(  You can only really set them with the loader right now.
> There are a couple of kernel datastores that need some tweaking; the 
> environment is one of them.

Ok, everything is working well, except for one last thing... 
I load the driver the first time, and it loads correctly.  I unload it and
reload it and it attempts to attach twice (with the exact same resource
values).  I unload and reload it and it attempts to load 3 times, unload/reload
... 4 times (you see the pattern).  If I load the second module I am working
on (exact same type as this module), it tries to re-attach the old module
"N" times (depending on the number of previous unload/reloads).

I am obviously building up state somewhere in the kernel... how can I get
rid of this "state"?

Using multiple Malloc-Disks

2001-01-29 Thread David E. Cross

I need to use multiple malloc disks for a custom net-boot image I am working
on.  The problem is that whenever I access /dev/md1 from the disk it gives
me a 'device not configured' error.  I originally thought that this was an
error in how a preloaded image interfaced with the system, but I also get
this on a disk-booted machine.

Consider the following test:

> dd bs=512 if=/dev/md0c of=/dev/null
> 2 Blocks in
> 2 Blocks out
> dd bs=512 if=/dev/md1c of=/dev/null
> Device not configured.

Yet, according to the manpage:
> The md driver uses the ``almost-clone'' convention, whereby opening
> device number N creates device instance number N+1.

What is wrong here?   A quick look through the source finds there is no code
in the open() routine to create a new instance; though I am not entirely sure
that is where it would be located.

gif(4) question

2001-03-21 Thread David E. Cross

I recently tried (for the first time) to get gif running under FreeBSD
4.3-BETA (cvsup-ed yesterday).  I noticed the following:

gifconfig gif0 inet
ifconfig gif0 netmask 0xff00

and then I 'ping', it will try to route the packet instead of
replying directly.  I need to 'route add' to have it
reply to the packet directly.  I don't need to do this for other types
of interfaces... did I mess something up, or is this how it is supposed to
be (it doesn't seem to be documented as such)?

2001-04-12 Thread David E. Cross

Well, I am able to reproduce the crash pretty reliably, I don't know what is
causing it yet, I just kill all the other ypservs on a subnet except for this
one and it crashes about once every 5 minutes.  I have some questions/theories
that I'd like to bounce off of people:

1)  In the yp_all function it calls yp_fork() to fork a new ypserv, the
parent then calls return(NULL); and the child handles the request.
Looking at the ktraces, I notice that the parent does not close
the socket connection, but after the child finishes the transaction
the parent gets a read() return value of 0 (EOF) for that socket and 
then closes it.  Since this is a yp_all request there _shouldn't_ be 
any more read data on the socket until the close event (which is a read
of 0), but that socket is still open in both the parent and the child,
and the child is making calls against it... is there a possibility
of some shared data corruption within the RPC code that anyone could
think of?

2)  The RPC code itself has a lot of checks against blocking... is the forking
of ypserv even needed at all?
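
For comparison, here is a minimal sketch of the more conservative
fork-per-request pattern, where the parent gives up its copy of the
connection immediately so only the child ever touches it.  This is not the
ypserv code: serve_in_child and demo_fork_handoff are made-up names, and a
socketpair stands in for the client's TCP connection.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <string.h>
#include <unistd.h>

/*
 * Hand one connected descriptor to a forked child.  The child writes the
 * reply and exits; the parent closes its own copy right away instead of
 * sharing it.  Returns the child's pid, or -1 on error.
 */
pid_t
serve_in_child(int conn, const char *reply)
{
	pid_t pid = fork();

	if (pid == -1)
		return -1;
	if (pid == 0) {
		/* Child: sole user of conn from here on. */
		if (write(conn, reply, strlen(reply)) < 0)
			_exit(1);
		close(conn);
		_exit(0);
	}
	/* Parent: drop the descriptor rather than keep it open. */
	close(conn);
	return pid;
}

/*
 * Demo: returns the number of reply bytes the "client" end read before
 * seeing EOF, or -1 on error.
 */
int
demo_fork_handoff(void)
{
	int sv[2], total = 0;
	char buf[64];
	ssize_t n;
	pid_t pid;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
		return -1;
	pid = serve_in_child(sv[1], "all-records");
	if (pid == -1) {
		close(sv[0]);
		return -1;
	}
	/* EOF arrives once the child, the last holder of sv[1], closes it. */
	while ((n = read(sv[0], buf, sizeof(buf))) > 0)
		total += (int)n;
	close(sv[0]);
	(void)waitpid(pid, NULL, 0);
	return n == 0 ? total : -1;
}
```

Whether the descriptor that stays open in the parent is actually what
corrupts things in ypserv is not established by this; the sketch only shows
the shape of a handoff with no sharing.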

ypserv (on -RC/-STABLE)... almost there

2001-04-13 Thread David E. Cross

I have traced the problem in ypserv down to the RPC dispatch routines..
I am digging further and I hope to have it found and eliminated today 
(in time for -RELEASE ;)

If anyone has any idea how it could be tripping up here, please let me
know.  My 2 guesses are a corrupted svc_callback entry (no idea how
it is getting corrupted, yet) or something walking on stuff
within ypprog_1 or ypprog_2.  (I don't know yet if the segfault is in
the (sc->dispatch) call in svc.c in the rpc library, or if it is within the
function that sc->dispatch calls; the next seg-fault will let me know this.)

a bug in ypserv found

2001-04-15 Thread David E. Cross

I have found _a_ bug in ypserv (I think I may be stumbling over multiple
different bugs, but this one is very reproducible).

It is dying in the yp_testflags routine, in the for loop that goes through
the CIRCLEQ.  The loop dies with qptr pointing to a struct that is all NULL
(my reading of CIRCLEQ suggests this isn't supposed to be possible), *and*
qhead (the global variable representing the CIRCLEQ_HEAD) pointing to a
structure that is all NULL (also not supposed to be possible).  The fact that
&qptr != qhead suggests to me that there was data there when it started, but
that it got ripped out from under it.  I am not sure how though:
qhead is a "static" global variable, and the only async entry into the
routine is called from the signal handler for SIGHUP; the problem is that no
SIGHUP is being delivered.

(Aside: this has been a real pain to track down... I traced it into the
RPC library and back out the other side... NOT FUN)

still more ypserv woes

2001-04-16 Thread David E. Cross

Ok... I am coming to the conclusion that there is some sort of kernel
issue that is causing this problem.  Here is what I have done and discovered
to date (this is all with 4.3-RC2 FWIW):

At some point the 'qhead' CIRCLEQ structure in yp_dblookup.c gets corrupted.
This is declared as a static, and no handles are passed back out of the
function, so aside from data-segment smashing, all accesses to that
structure _must_ happen within yp_dblookup.c.  To date, _almost_ every
single segfault has been in the for loop of yp_testflags (this is a bit
odd in and of itself given that the CIRCLEQ is being mangled; I do not
recall the exact situation for the one not in yp_testflags).  So I
wrote a function called 'queue_verify()' whose only job is to travel
once down the CIRCLEQ, assert that the number of entries in the CIRCLEQ is
the same as numdbs, and exit.  I placed this function after every
Berkeley DB function call and at other random points in the function calls
in "yp_dblookup.c".  Right now I am only seeing seg-faults in the
queue_verify() that I placed before the for loop in yp_testflags (*very*
strange; one would think with the number I have placed everywhere that
it would get tripped up somewhere else too).  I also notice that it
always dies very shortly after it fork()s a child to handle a YP_ALL request
(one of the things the child does is delete its copy of the CIRCLEQ).
Is it possible that a copy-on-write is somehow getting mangled and causing
this?
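
For reference, a consistency check of this kind can be sketched as below.
This is not the actual queue_verify() from my tree: it uses TAILQ from
<sys/queue.h> instead of CIRCLEQ (which is not available everywhere), it
returns a status instead of asserting, and struct db_entry, struct db_queue
and demo_queue_verify are made up for the example.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical cache entry; the real one would hold a DB handle. */
struct db_entry {
	int id;
	TAILQ_ENTRY(db_entry) links;
};
TAILQ_HEAD(db_queue, db_entry);

/*
 * Walk the queue once and compare the number of entries with the count
 * the caller believes is correct.  Returns 1 if they agree, 0 otherwise.
 * Bailing out as soon as the count is exceeded also keeps a corrupted,
 * looping list from spinning forever.
 */
int
queue_verify(struct db_queue *head, int expected)
{
	struct db_entry *e;
	int n = 0;

	TAILQ_FOREACH(e, head, links) {
		if (++n > expected)
			return 0;
	}
	return n == expected;
}

/* Demo: three entries verify against 3 but not against a stale count of 2. */
int
demo_queue_verify(void)
{
	struct db_queue head;
	struct db_entry a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };

	TAILQ_INIT(&head);
	TAILQ_INSERT_TAIL(&head, &a, links);
	TAILQ_INSERT_TAIL(&head, &b, links);
	TAILQ_INSERT_TAIL(&head, &c, links);
	return queue_verify(&head, 3) && !queue_verify(&head, 2);
}
```

A check like this only catches count mismatches and runaway lists; an entry
whose pointers are already garbage can still fault inside the walk itself,
which is consistent with the seg-faults landing in the verify call.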

FWIW: this system is a single CPU PentiumPro acting as a firewall/gateway with
1 FXP, 2 dc, and 2 xl interfaces (the fxp and one each of the dc and xl are

Any ideas?  Any clue where to look next?  I am running out of ideas here.

ypserv: a resolution (i think)

2001-04-16 Thread David E. Cross

After some more intensive debugging, and a leap of faith, I _think_ I have
the problem licked, but I would appreciate some more brains to examine the

The original cause of ypserv's problems was the sharing of DBPs between
the parent and child.  The resolution to this was to close all of them in
the child.  This appears to be where the problem lies: it was assumed to
be safe to call dbclose in the child, but apparently dbclose does some
stuff that is still dangerous.  So, my solution is to move the close routine
_before_ the fork (so far *crossed fingers* this is working).  However,
since yp_all is called fairly frequently, this is bad(TM) for the parent.
My second solution was to have the child call yp_init_dbs() instead
of yp_flush_all()  (the former would just nuke the references to the FDs, but
actually keep them open).  This didn't work.  Can anyone provide any clues
as to why?  Does the DB library keep its own cache, so that unless they are
"really" closed it will just loop back to the open ones anyway?   The
current solution is suboptimal since in many cases it removes the DBCACHE
entirely, but I don't know what other solution exists.
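
One well-known hazard of sharing open handles across fork() is that parent
and child share the same file offset.  Whether that is what bites ypserv is
only a guess on my part, but the effect itself is easy to show with a plain
file descriptor.  In the sketch below demo_shared_offset and the temp file
name are made up, and no DB library is involved.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Write "abcdef" to a scratch file, rewind, and fork.  The child reads
 * three bytes through the inherited descriptor; the parent then reads one
 * byte.  Returns the byte the parent sees, or -1 on error.
 */
int
demo_shared_offset(void)
{
	char path[] = "/tmp/shared_offset_XXXXXX";
	char buf[3], c;
	pid_t pid;
	int fd, result = -1;

	fd = mkstemp(path);
	if (fd == -1)
		return -1;
	unlink(path);		/* the file goes away when fd is closed */
	if (write(fd, "abcdef", 6) != 6 || lseek(fd, 0, SEEK_SET) == -1) {
		close(fd);
		return -1;
	}
	pid = fork();
	if (pid == -1) {
		close(fd);
		return -1;
	}
	if (pid == 0) {
		/* Child: consume three bytes via the shared descriptor. */
		_exit(read(fd, buf, sizeof(buf)) == 3 ? 0 : 1);
	}
	if (waitpid(pid, NULL, 0) != -1 && read(fd, &c, 1) == 1)
		result = c;	/* offset moved although the parent never read */
	close(fd);
	return result;
}
```

The parent gets 'd', not 'a': the child's reads moved the offset the parent
is using.  A library that caches its own idea of where it is in the file
would be out of step after that.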

I know some others who use ypserv heavily have run into these problems, if
you need the patch, I can provide it if you are willing to give it a test ;)

JKH:  I think this _really_ needs to get into 4.3-RELEASE, this has been 
a vexing bug for over a year.  The current solution may be sub-optimal, but
it is more optimal than:

pid 75351 (ypserv), uid 0: exited on signal 11 (core dumped)
pid 75364 (ypserv), uid 0: exited on signal 11 (core dumped)
pid 75365 (ypserv), uid 0: exited on signal 11 (core dumped)
pid 75370 (ypserv), uid 0: exited on signal 11 (core dumped)
pid 75377 (ypserv), uid 0: exited on signal 11 (core dumped)
pid 75379 (ypserv), uid 0: exited on signal 11 (core dumped)
pid 75215 (ypserv), uid 0: exited on signal 11 (core dumped)


sigh... ypserv bug still very much alive

2001-04-09 Thread David E. Cross

The ypserv bug (the one where ypserv randomly stops responding or
just seg-faults) is still very much alive.  I had to restart it
about 11 times in the course of 20 minutes this morning.  That's
the bad news, the good news is that I started it each time with
'ktrace -i'.  

Going back a bit, Matt Dillon suggested that the problem may have been
in the signal handler for sigchld.  I looked at the signal handler and
it does not appear to be doing anything dangerous at all (just a
child_count--;).  Is it doing something dangerous that I am just not seeing?

Also, in the last 200 lines of kdump output for each and every crash there
is the sequence of calls "select();  gettimeofday();"... that sequence of
calls never appears in the ypserv source code, but does appear in svc_tcp.c
in librpc... my question is: "ypserv defines its own svc_run, and for
TCP connections specifically handles things itself very carefully, how is
the svc_tcp.c code getting called at all?"  I think the answer to that is
the source of the problem (it should also be noted that in the case where
ypserv hasn't died and I have collected ktrace information -- up to 8 gig
of it -- the "select(); gettimeofday();" sequence is _never_ called.)

One of my ktrace-s is _very_ small, only 330K, from fork()/exec() to 
SIG_DFL/SEGV, so I am hoping this will provide easily digestible information.
I did not include context-switch information in the ktrace for the following
reasons:
  1) It didn't appear to be useful, and since I did specify the -i, it is
     obvious where context switches occur (to the only thing that could affect
     anything: the children)
  2) It caused ypserv to act strangely... instead of dying, it just got
     very slow, and didn't respond.
Anyone interested in helping me track this one down?

David Cross   | email: [EMAIL PROTECTED] 
