On Fri, 28 Dec 2007, Bruce Evans wrote:
In previous mail, you (Mark) wrote:
# With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
# kern.ipc.maxsockbuf=2048, then use setsockopt() with SO_RCVBUF
# in the application. If packets w
On Fri, 28 Dec 2007, Bruce Evans wrote:
In previous mail, you (Mark) wrote:
# With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
# kern.ipc.maxsockbuf=2048, then use setsockopt() with SO_RCVBUF
# in the application. If packets were dropped they would show up
# with nets
On Sat, 22 Dec 2007, Mark Fullmer wrote:
On Dec 22, 2007, at 12:08 PM, Bruce Evans wrote:
I still don't understand the original problem, that the kernel is not
even preemptible enough for network interrupts to work (except in 5.2
where Giant breaks things). Perhaps I misread the problem, and
Mark Fullmer wrote:
On Dec 25, 2007, at 12:27 AM, Kostik Belousov wrote:
What fs do you use ? If FFS, are softupdates turned on ? Please, show the
total time spent in the softdepflush process.
Also, try to add the FULL_PREEMPTION kernel config option and report
whether it helps.
FFS with soft updates on all
On Dec 25, 2007, at 12:27 AM, Kostik Belousov wrote:
What fs do you use ? If FFS, are softupdates turned on ? Please, show the
total time spent in the softdepflush process.
Also, try to add the FULL_PREEMPTION kernel config option and report
whether it helps.
FFS with soft updates on all
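For reference, the option Kostik suggests is a one-line kernel configuration
change followed by a rebuild; the config file path below is illustrative:

```
# added to the kernel config file, e.g. /sys/i386/conf/MYKERNEL
options		FULL_PREEMPTION
```

Then rebuild and install the kernel as usual (config MYKERNEL, make, reboot).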
On Mon, Dec 24, 2007 at 08:16:50PM -0500, Mark Fullmer wrote:
>
> On Dec 24, 2007, at 8:19 AM, Kostik Belousov wrote:
>
> >
> >Mark, could you, please, retest the patch below in your setup ?
> >I want to put a change or some version of it into the 7.0 release, and
> >we need to move fast to do th
On Dec 24, 2007, at 8:19 AM, Kostik Belousov wrote:
Mark, could you, please, retest the patch below in your setup ?
I want to put a change or some version of it into the 7.0 release, and
we need to move fast to do this.
It's building now. The testing will run overnight.
Your patch to ffs_s
On Mon, 24 Dec 2007, Kostik Belousov wrote:
On Sun, Dec 23, 2007 at 10:20:31AM +1100, Bruce Evans wrote:
On Sat, 22 Dec 2007, Kostik Belousov wrote:
Ok, since you talked about this first :). I already made the following
patch, but did not publish it since I still have not inspected all
callers
On Sun, Dec 23, 2007 at 10:20:31AM +1100, Bruce Evans wrote:
> On Sat, 22 Dec 2007, Kostik Belousov wrote:
> >Ok, since you talked about this first :). I already made the following
> >patch, but did not publish it since I still have not inspected all
> >callers of MNT_VNODE_FOREACH() for safety of
* David G Lawrence <[EMAIL PROTECTED]> [071221 23:31] wrote:
> > > > Can you use a placeholder vnode as a place to restart the scan?
> > > > you might have to mark it special so that other threads/things
> > > > (getnewvnode()?) don't molest it, but it can provide for a convenient
> > > > restart p
On Sat, 22 Dec 2007, Kostik Belousov wrote:
On Sun, Dec 23, 2007 at 04:08:09AM +1100, Bruce Evans wrote:
On Sat, 22 Dec 2007, Kostik Belousov wrote:
Yes, rewriting the syncer is the right solution. It probably cannot be done
quickly enough. If the yield workaround provides mitigation for now, it
shall go in.
On Sun, Dec 23, 2007 at 04:08:09AM +1100, Bruce Evans wrote:
> On Sat, 22 Dec 2007, Kostik Belousov wrote:
> >Yes, rewriting the syncer is the right solution. It probably cannot be done
> >quickly enough. If the yield workaround provides mitigation for now, it
> >shall go in.
>
> I don't think rewr
This appears to work. No packet loss with vfs.numvnodes
at 32132, 16K PPS test with 1 million packets.
I'll run some additional tests bringing vfs.numvnodes
closer to kern.maxvnodes.
On Dec 22, 2007, at 2:03 AM, Kostik Belousov wrote:
As Bruce Evans noted, there is a vfs_msync() that does almost the same
On Dec 22, 2007, at 12:08 PM, Bruce Evans wrote:
I still don't understand the original problem, that the kernel is not
even preemptible enough for network interrupts to work (except in 5.2
where Giant breaks things). Perhaps I misread the problem, and it is
actually that networking works but u
On Sat, 22 Dec 2007, Kostik Belousov wrote:
On Fri, Dec 21, 2007 at 05:43:09PM -0800, David Schwartz wrote:
I'm just an observer, and I may be confused, but it seems to me that this is
motion in the wrong direction (at least, it's not going to fix the actual
problem). As I understand the probl
> > > Can you use a placeholder vnode as a place to restart the scan?
> > > you might have to mark it special so that other threads/things
> > > (getnewvnode()?) don't molest it, but it can provide for a convenient
> > > restart point.
> >
> >That was one of the solutions that I considered and
> As Bruce Evans noted, there is a vfs_msync() that does almost the same
> traversal of the vnodes. It was missed in the previous patch. Try this one.
I forgot to comment on that when Bruce pointed that out. My solution
has been to comment out the call to vfs_msync. :-) It comes into play
when yo
> I'm just an observer, and I may be confused, but it seems to me that this is
> motion in the wrong direction (at least, it's not going to fix the actual
> problem). As I understand the problem, once you reach a certain point, the
> system slows down *every* 30.999 seconds. Now, it's possible for
On Sat, Dec 22, 2007 at 01:28:31AM -0500, Mark Fullmer wrote:
>
> On Dec 22, 2007, at 12:36 AM, Kostik Belousov wrote:
> >Lets check whether the syncer is the culprit for you.
> >Please, change the value of syncdelay in sys/kern/vfs_subr.c
> >around line 238 from 30 to some other value.
> >Which patch did you use ?
>
> This is hand applied from the diff you sent December 19, 2007 1:24:48
> PM EST
Mark, try the previous patch from Kostik - the one that does the one
tick msleep. I think you'll find that that one does work. The likely
problem with the second version is that ui
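For context, the syncdelay experiment suggested earlier in the thread amounts
to changing a single initializer in sys/kern/vfs_subr.c. The exact line and
comment vary by branch, and 45 below is just an arbitrary test value:

```c
/* sys/kern/vfs_subr.c (RELENG_6-era), around line 238 */
static int syncdelay = 45;	/* was 30; max time to delay syncing data */
```

If the periodic stall tracks the new value instead of recurring every ~31
seconds, the syncer is confirmed as the culprit.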
On Dec 22, 2007, at 12:36 AM, Kostik Belousov wrote:
On Fri, Dec 21, 2007 at 10:30:51PM -0500, Mark Fullmer wrote:
The uio_yield() idea did not work. Still have the same 31 second
interval packet loss.
Which patch did you use ?
This is hand applied from the diff you sent December 19, 2007
On Fri, Dec 21, 2007 at 04:24:32PM -0800, Alfred Perlstein wrote:
> * David G Lawrence <[EMAIL PROTECTED]> [071221 15:42] wrote:
> > > >Unfortunately, the version of the patch that I sent out isn't going
> > > > to
> > > > help your problem. It needs to yield at the top of the loop, but vp
>
On Fri, Dec 21, 2007 at 10:30:51PM -0500, Mark Fullmer wrote:
> The uio_yield() idea did not work. Still have the same 31 second
> interval packet loss.
Which patch did you use ?
Lets check whether the syncer is the culprit for you.
Please, change the value of syncdelay in sys/kern/vfs
On Fri, Dec 21, 2007 at 05:43:09PM -0800, David Schwartz wrote:
>
>
> I'm just an observer, and I may be confused, but it seems to me that this is
> motion in the wrong direction (at least, it's not going to fix the actual
> problem). As I understand the problem, once you reach a certain point, t
The uio_yield() idea did not work. Still have the same 31 second
interval packet loss.
Is it safe to assume the vp will be valid after a msleep() or
uio_yield()? If so can we do something a little different:
Currently:
/* this takes too long when list is large */
MNT_VNODE_FOREACH(vp, mp
I'm just an observer, and I may be confused, but it seems to me that this is
motion in the wrong direction (at least, it's not going to fix the actual
problem). As I understand the problem, once you reach a certain point, the
system slows down *every* 30.999 seconds. Now, it's possible for the co
* David G Lawrence <[EMAIL PROTECTED]> [071221 15:42] wrote:
> > >Unfortunately, the version of the patch that I sent out isn't going to
> > > help your problem. It needs to yield at the top of the loop, but vp isn't
> > > necessarily valid after the wakeup from the msleep. That's a problem tha
> >Unfortunately, the version of the patch that I sent out isn't going to
> > help your problem. It needs to yield at the top of the loop, but vp isn't
> > necessarily valid after the wakeup from the msleep. That's a problem that
> > I'm having trouble figuring out a solution to - the solutions
* David G Lawrence <[EMAIL PROTECTED]> [071219 09:12] wrote:
> > >Try it with "find / -type f >/dev/null" to duplicate the problem
> > >almost instantly.
> >
> > I was able to verify last night that (cd /; tar -cpf -) > all.tar would
> > trigger the problem. I'm working getting a test runn
Thanks, I'll test this later on today.
On Dec 19, 2007, at 1:11 PM, Kostik Belousov wrote:
On Wed, Dec 19, 2007 at 09:13:31AM -0800, David G Lawrence wrote:
Try it with "find / -type f >/dev/null" to duplicate the problem
almost instantly.
I was able to verify last night that (cd /; tar -cpf
On Wed, Dec 19, 2007 at 12:06:59PM -0500, Mark Fullmer wrote:
>Thanks for the other info on timer resolution, I overlooked
>clock_gettime().
If you have a UP system with a usable TSC (or equivalent) then
using rdtsc() (or equivalent) is a much cheaper way to measure
short durations with high resol
On Wed, Dec 19, 2007 at 11:44:00AM -0800, Julian Elischer wrote:
> David G Lawrence wrote:
> >>> In any case, it appears that my patch is a no-op, at least for the
> >>>problem I was trying to solve. This has me confused, however, because at
> >>>one point the problem was mitigated with it. The pat
David G Lawrence wrote:
In any case, it appears that my patch is a no-op, at least for the
problem I was trying to solve. This has me confused, however, because at
one point the problem was mitigated with it. The patch has gone through
several iterations, however, and it could be that it was made
On Wed, 19 Dec 2007, David G Lawrence wrote:
The patch should work fine. IIRC, it yields voluntarily so that other
things can run. I committed a similar hack for uiomove(). It was
It patches the bottom of the loop, which is only reached if the vnode
is dirty. So it will only help if there
Just to confirm the patch did not change the behavior. I ran with it
last night and double checked this morning to make sure.
It looks like if you put the check at the top of the loop and the
next node is changed during msleep(), SLIST_NEXT will walk into the
trash. I'm in over my head here..
On Wed, Dec 19, 2007 at 08:11:59PM +0200, Kostik Belousov wrote:
> On Wed, Dec 19, 2007 at 09:13:31AM -0800, David G Lawrence wrote:
> > > >Try it with "find / -type f >/dev/null" to duplicate the problem
> > > >almost instantly.
> > >
> > > I was able to verify last night that (cd /; tar
On Wed, Dec 19, 2007 at 09:13:31AM -0800, David G Lawrence wrote:
> > >Try it with "find / -type f >/dev/null" to duplicate the problem
> > >almost instantly.
> >
> > I was able to verify last night that (cd /; tar -cpf -) > all.tar would
> > trigger the problem. I'm working getting a test
On Thu, 20 Dec 2007, Bruce Evans wrote:
On Wed, 19 Dec 2007, David G Lawrence wrote:
Considering that the CPU clock cycle time is on the order of 300ps, I
would say 125ns to do a few checks is pathetic.
As I said, 125 nsec is a short time in this context. It is approximately
the time for a
On Wed, 19 Dec 2007, David G Lawrence wrote:
Try it with "find / -type f >/dev/null" to duplicate the problem almost
instantly.
FreeBSD used to have some code that would cause vnodes with no cached
pages to be recycled quickly (which would have made a simple find
ineffective without reading
> >Try it with "find / -type f >/dev/null" to duplicate the problem
> >almost instantly.
>
> I was able to verify last night that (cd /; tar -cpf -) > all.tar would
> trigger the problem. I'm working getting a test running with
> David's ffs_sync() workaround now, adding a few counters there
On Dec 19, 2007, at 9:54 AM, Bruce Evans wrote:
On Tue, 18 Dec 2007, Mark Fullmer wrote:
A little progress.
I have a machine with a KTR enabled kernel running.
Another machine is running David's ffs_vfsops.c's patch.
I left two other machines (GENERIC kernels) running the packet
loss tes
> > In any case, it appears that my patch is a no-op, at least for the
> >problem I was trying to solve. This has me confused, however, because at
> >one point the problem was mitigated with it. The patch has gone through
> >several iterations, however, and it could be that it was made to the top
On Wed, 19 Dec 2007, David G Lawrence wrote:
Debugging shows that the problem is like I said. The loop really does
take 125 ns per iteration. This time is actually not very much. The
Considering that the CPU clock cycle time is on the order of 300ps, I
would say 125ns to do a few checks i
David G Lawrence wrote:
Try it with "find / -type f >/dev/null" to duplicate the problem almost
instantly.
FreeBSD used to have some code that would cause vnodes with no cached
pages to be recycled quickly (which would have made a simple find
ineffective without reading the files at lea
>In any case, it appears that my patch is a no-op, at least for the
> problem I was trying to solve. This has me confused, however, because at
> one point the problem was mitigated with it. The patch has gone through
> several iterations, however, and it could be that it was made to the top
> o
> Try it with "find / -type f >/dev/null" to duplicate the problem almost
> instantly.
FreeBSD used to have some code that would cause vnodes with no cached
pages to be recycled quickly (which would have made a simple find
ineffective without reading the files at least a little bit). I guess
th
> On Tue, 18 Dec 2007, David G Lawrence wrote:
>
> >>>I got an almost identical delay (with 64000 vnodes).
> >>>
> >>>Now, 17ms isn't much.
> >>
> >> Says you. On modern systems, trying to run a pseudo real-time application
> >>on an otherwise quiescent system, 17ms is just short of an e
On Tue, 18 Dec 2007, Mark Fullmer wrote:
A little progress.
I have a machine with a KTR enabled kernel running.
Another machine is running David's ffs_vfsops.c's patch.
I left two other machines (GENERIC kernels) running the packet loss test
overnight. At ~ 32480 seconds of uptime the proble
On Tue, 18 Dec 2007, David G Lawrence wrote:
I got an almost identical delay (with 64000 vnodes).
Now, 17ms isn't much.
Says you. On modern systems, trying to run a pseudo real-time application
on an otherwise quiescent system, 17ms is just short of an eternity. I agree
that the syncer sho
A little progress.
I have a machine with a KTR enabled kernel running.
Another machine is running David's ffs_vfsops.c's patch.
I left two other machines (GENERIC kernels) running the packet loss test
overnight. At ~ 32480 seconds of uptime the problem starts. This is
really close to a 16 b
> > I got an almost identical delay (with 64000 vnodes).
> >
> > Now, 17ms isn't much.
>
>Says you. On modern systems, trying to run a pseudo real-time application
> on an otherwise quiescent system, 17ms is just short of an eternity. I agree
> that the syncer should be preemptable (which is
> I got an almost identical delay (with 64000 vnodes).
>
> Now, 17ms isn't much.
Says you. On modern systems, trying to run a pseudo real-time application
on an otherwise quiescent system, 17ms is just short of an eternity. I agree
that the syncer should be preemptable (which is what my bandaid
On Tue, 18 Dec 2007, David G Lawrence wrote:
I didn't say it caused any bogus disk I/O. My original problem
(after a day or two of uptime) was an occasional large scheduling delay
for a process that needed to process VoIP frames in real-time. It was
happening every 31 seconds and was causing v
> On Tue, 18 Dec 2007, David G Lawrence wrote:
>
> >>Thanks. Have a kernel building now. It takes about a day of uptime
> >>after reboot before I'll see the problem.
> >
> > You may also wish to try to get the problem to occur sooner after boot
> >on a non-patched system by doing a "tar cf /dev
On Tue, 18 Dec 2007, David G Lawrence wrote:
Thanks. Have a kernel building now. It takes about a day of uptime
after reboot before I'll see the problem.
You may also wish to try to get the problem to occur sooner after boot
on a non-patched system by doing a "tar cf /dev/null /" (note: su
> Thanks. Have a kernel building now. It takes about a day of uptime
> after reboot before I'll see the problem.
You may also wish to try to get the problem to occur sooner after boot
on a non-patched system by doing a "tar cf /dev/null /" (note: substitute
/dev/zero instead of /dev/null, i
> >Right, it's a non-optimal loop when N is very large, and that's a fairly
> >well understood problem. I think what DG was getting at, though, is
> >that this massive flush happens every time the syncer runs, which
> >doesn't seem correct. Sure, maybe you just rsynced 100,000 files 20
> >seconds
On Mon, 17 Dec 2007, Scott Long wrote:
Bruce Evans wrote:
On Mon, 17 Dec 2007, David G Lawrence wrote:
One more comment on my last email... The patch that I included is not
meant as a real fix - it is just a bandaid. The real problem appears to
be that a very large number of vnodes (all of
On Mon, 17 Dec 2007, Mark Fullmer wrote:
Thanks. Have a kernel building now. It takes about a day of uptime after
reboot before I'll see the problem.
Yes run "find / >/dev/null" to see the problem if it is the syncer one.
At least the syscall latency problem does seem to be this. Under ~5.
Bruce Evans wrote:
On Mon, 17 Dec 2007, David G Lawrence wrote:
One more comment on my last email... The patch that I included is not
meant as a real fix - it is just a bandaid. The real problem appears to
be that a very large number of vnodes (all of them?) are getting synced
(i.e. calling f
On Mon, 17 Dec 2007, David G Lawrence wrote:
One more comment on my last email... The patch that I included is not
meant as a real fix - it is just a bandaid. The real problem appears to
be that a very large number of vnodes (all of them?) are getting synced
(i.e. calling ffs_syncvnode()) ever
On Mon, 17 Dec 2007, David G Lawrence wrote:
While trying to diagnose a packet loss problem in a RELENG_6 snapshot dated
November 8, 2007 it looks like I've stumbled across a broken driver or
kernel routine which stops interrupt processing long enough to severely
degrade network performance every
Thanks. Have a kernel building now. It takes about a day of uptime
after reboot before I'll see the problem.
--
mark
On Dec 17, 2007, at 5:24 AM, David G Lawrence wrote:
While trying to diagnose a packet loss problem in a RELENG_6 snapshot dated
November 8, 2007 it looks like I've stumbled
Back to back test with no ethernet switch between two em interfaces,
same result. The receiving side has been up > 1 day and exhibits
the problem. These are also two different servers. The small
gettimeofday() syscall tester also shows the same ~30 second pattern
of high latency between syscall
> While trying to diagnose a packet loss problem in a RELENG_6 snapshot dated
> November 8, 2007 it looks like I've stumbled across a broken driver or
> kernel routine which stops interrupt processing long enough to severely
> degrade network performance every 30.99 seconds.
I noticed this a
One more comment on my last email... The patch that I included is not
meant as a real fix - it is just a bandaid. The real problem appears to
be that a very large number of vnodes (all of them?) are getting synced
(i.e. calling ffs_syncvnode()) every time. This should normally only
happen for di
While trying to diagnose a packet loss problem in a RELENG_6 snapshot dated
November 8, 2007 it looks like I've stumbled across a broken driver or
kernel routine which stops interrupt processing long enough to severely
degrade network performance every 30.99 seconds.
Packets appear to make it as