Robert Watson wrote:
On Tue, 2 Sep 2008, Luigi Rizzo wrote:
The real problem is that these data structures are dynamic and
potentially large, so the following approach (used e.g. in ipfw)
enter kernel;
get shared lock on the structure;
navigate through the structure and make a linearized copy;
unlock;
copyout the linearized copy;
is extremely expensive and has the potential to block other activities
for a long time.
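To make the shape of that concrete, here is a rough sketch of the
pattern (the names are made up for illustration -- struct rule,
rule_lock, rule_list and so on are not the actual ipfw structures):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <sys/lock.h>
#include <sys/rwlock.h>
#include <sys/queue.h>

struct rule {                           /* hypothetical table entry */
        LIST_ENTRY(rule) r_link;
        uint64_t         r_key;         /* stable, ordered identifier */
        uint32_t         r_data;
};

struct rule_xport {                     /* what userland actually sees */
        uint64_t rx_key;
        uint32_t rx_data;
};

static LIST_HEAD(, rule) rule_list;     /* hypothetical global table */
static struct rwlock     rule_lock;     /* protects rule_list; rw_init'ed at setup */
static u_int             rule_count;

static int
export_rules(void *uaddr, size_t ulen)
{
        struct rule *r;
        struct rule_xport *buf;
        size_t len;
        u_int i;
        int error;

        rw_rlock(&rule_lock);                   /* shared lock held across the walk */
        len = rule_count * sizeof(*buf);
        buf = malloc(len, M_TEMP, M_NOWAIT);    /* can't sleep under an rwlock */
        if (buf == NULL) {
                rw_runlock(&rule_lock);
                return (ENOMEM);
        }
        i = 0;
        LIST_FOREACH(r, &rule_list, r_link) {   /* linearize the whole table */
                buf[i].rx_key = r->r_key;
                buf[i].rx_data = r->r_data;
                i++;
        }
        rw_runlock(&rule_lock);

        error = copyout(buf, uaddr, MIN(len, ulen));
        free(buf, M_TEMP);
        return (error);
}

The whole table is linearized while the shared lock is held, so both
the cost and the blocking time grow with the size of the structure,
which is exactly the problem.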
Sockets, sysctl, kmem, etc., are all really just I/O mechanisms, at
varying levels of abstraction, for pushing data, and all fundamentally
suffer from the lack of a general export abstraction.
What we'd need is some internal representation that could give us
individual entries of the data structure on each call, together with
extra info (a pointer, if we can guarantee that it doesn't go stale;
something more if we cannot make that guarantee) to allow the
navigation to occur.
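As a rough sketch of what I mean, continuing with the made-up rule
list from the sketch above, and assuming the list is kept sorted by
r_key and the module bumps a generation counter on every change:

static uint64_t rule_gen;       /* bumped on every insert or delete */

struct export_cursor {          /* handed back and forth across calls */
        uint64_t ec_gen;        /* generation when the walk started */
        uint64_t ec_key;        /* key of the last entry returned */
};

static int
export_next(struct export_cursor *ec, struct rule_xport *out)
{
        struct rule *r;
        int error;

        error = ENOENT;                 /* nothing left, unless we find one */
        rw_rlock(&rule_lock);
        if (ec->ec_gen != rule_gen) {
                /* The table changed; the caller decides whether to restart. */
                rw_runlock(&rule_lock);
                return (ESTALE);
        }
        LIST_FOREACH(r, &rule_list, r_link) {
                if (r->r_key > ec->ec_key) {    /* first entry past the cursor */
                        out->rx_key = r->r_key;
                        out->rx_data = r->r_data;
                        ec->ec_key = r->r_key;
                        error = 0;
                        break;
                }
        }
        rw_runlock(&rule_lock);
        return (error);
}

Each call only holds the lock long enough to find one entry, and the
generation number is the "extra info" that tells the caller when the
navigation state has gone stale.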
In some code I have seen (and some I have written) there are two
levels of storage in some modules: one that contains the majority of
the information, and one that contains the changes that have occurred
since the main container was locked.
So, for example, the routing tables might be locked, and if a routing
change is requested thereafter, it gets stored in transactional form
on the side structure. A routing lookup during the period that the
structure is locked (if it's a read lock) simply goes ahead, and at
completion checks whether there is a better answer in the waiting
list. A write request is stored as a transaction request on the
waiting list. I'm not saying it works for everything, but if we had a
kernel written in a high enough level language, such methods could be
broadly used.. oh well.
using reader-writer locking mitigates a lot of this..
I think there are necessarily implementation-specific details to all of
these steps for any given kernel subsystem -- we have data structures,
synchronization models, etc, that are all tuned to their common use
requirements, and monitoring is very much an edge case. I don't think
this is bad: this is an OS kernel, after all, but it does make things a
bit more tricky. Even if we can't share code, sharing approaches across
subsystems is a good idea.
For an example of what you have in mind, take a look at the sysctl
monitoring for UNIX domain sockets. First, we allocate an array of
pointers sized to the number of unpcb's we have. Then we walk the list,
bumping the references and adding pointers to the array. Then we
release the global locks, and proceed to lock, externalize, unlock, and
copy out each individual entry, using a generation number to manage
staleness. Finally, we walk the list dropping the refcounts and free
the array. This avoids both holding global locks for a long time and
the stale data issue. It's not ideal in other ways -- long lists,
reference count complexity, etc., but as I mentioned, it is very much an
edge case, and much of that mechanism (especially refcounts) is
something we need anyway for any moderately complex kernel data structure.
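For reference, here is a condensed and heavily simplified sketch of
that two-pass pattern, with invented names (struct pcb, pcb_list_lock,
and so on); the real handler for UNIX domain sockets in uipc_usrreq.c
is the authoritative version and differs in detail:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/rwlock.h>
#include <sys/queue.h>
#include <sys/refcount.h>
#include <sys/sysctl.h>

struct pcb {                            /* stand-in for a real unpcb */
        LIST_ENTRY(pcb) p_link;
        struct mtx      p_mtx;
        volatile u_int  p_refcount;
        uint64_t        p_gencnt;       /* generation at creation time */
        uint32_t        p_data;
};
struct pcb_xport { uint32_t px_data; }; /* externalized form */

static LIST_HEAD(, pcb) pcb_list;
static struct rwlock    pcb_list_lock;  /* the "global" lock */
static uint64_t         pcb_gencnt;
static u_int            pcb_count;

static void
pcb_destroy(struct pcb *p)              /* hypothetical destructor */
{
        mtx_destroy(&p->p_mtx);
        free(p, M_TEMP);
}

static int
sysctl_pcblist(SYSCTL_HANDLER_ARGS)
{
        struct pcb **pcbs, *p;
        struct pcb_xport xp;
        uint64_t gen;
        u_int i, n;
        int error;

        /* Pass 1: under the global lock, snapshot the membership. */
        rw_rlock(&pcb_list_lock);
        gen = pcb_gencnt;
        n = pcb_count;
        pcbs = malloc(n * sizeof(*pcbs), M_TEMP, M_NOWAIT);
        if (pcbs == NULL) {
                rw_runlock(&pcb_list_lock);
                return (ENOMEM);
        }
        i = 0;
        LIST_FOREACH(p, &pcb_list, p_link) {
                if (i == n)
                        break;
                refcount_acquire(&p->p_refcount);       /* keep it alive */
                pcbs[i++] = p;
        }
        rw_runlock(&pcb_list_lock);
        n = i;

        /* Pass 2: per-entry lock, externalize, copy out; skip newcomers. */
        error = 0;
        for (i = 0; i < n && error == 0; i++) {
                p = pcbs[i];
                mtx_lock(&p->p_mtx);
                if (p->p_gencnt <= gen) {       /* existed at snapshot time */
                        xp.px_data = p->p_data;
                        mtx_unlock(&p->p_mtx);
                        error = SYSCTL_OUT(req, &xp, sizeof(xp));
                } else
                        mtx_unlock(&p->p_mtx);
        }

        /* Pass 3: drop the references and free the pointer array. */
        for (i = 0; i < n; i++)
                if (refcount_release(&pcbs[i]->p_refcount))
                        pcb_destroy(pcbs[i]);
        free(pcbs, M_TEMP);
        return (error);
}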
Refcounts are, unfortunately, a really bad thing for multiprocessors.
Refcounts, if they are actually incremented now and then, are usually
not sitting in any given CPU's cache, forcing lots of cache flushes and
real reads from memory. There are some elaborate MP-refcount schemes we
really should look at, but most require a lot of memory.
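One of the classic schemes is a per-CPU counter array, along these
lines (names invented; it also glosses over the hard part, which is
deciding safely when the summed count has really reached zero):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/pcpu.h>
#include <sys/smp.h>

struct pcpu_refslot {
        long    prs_count;
        char    prs_pad[CACHE_LINE_SIZE - sizeof(long)];  /* avoid false sharing */
} __aligned(CACHE_LINE_SIZE);

struct pcpu_ref {
        /* One padded slot per CPU: this is where the memory goes. */
        struct pcpu_refslot pr_slot[MAXCPU];
};

static inline void
pcpu_ref_acquire(struct pcpu_ref *pr)
{
        critical_enter();                       /* stay on this CPU */
        pr->pr_slot[curcpu].prs_count++;        /* touches only the local line */
        critical_exit();
}

static inline void
pcpu_ref_release(struct pcpu_ref *pr)
{
        critical_enter();
        pr->pr_slot[curcpu].prs_count--;        /* may go negative locally */
        critical_exit();
}

/*
 * The true count is the sum over all CPUs; computing it is the rare,
 * expensive operation, and knowing when it has really reached zero
 * (without racing a concurrent acquire) is what this sketch does not
 * solve.
 */
static long
pcpu_ref_sum(struct pcpu_ref *pr)
{
        long sum;
        int i;

        sum = 0;
        CPU_FOREACH(i)
                sum += pr->pr_slot[i].prs_count;
        return (sum);
}

The per-CPU padding is where the memory cost comes from: one cache
line per CPU for every refcounted object.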