On Thu, Jun 20, 2002 at 12:25:58PM -0400, Andrew Gallatin wrote:
[...]
> > > Do you think it would be feasible to glue in a new jumbo (10K?)
> > > allocator on top of the existing mbuf and mcl allocators using the
> > > existing mechanisms and the existing MCLBYTES > PAGE_SIZE support
> > > (but broken out into separate functions and macros)?
> >
> > Assuming that you can still play those VM tricks with the pages spit
> > out by mb_alloc (kern/subr_mbuf.c in -CURRENT), then this wouldn't be a
> > problem at all. It's easy to add a new fixed-size type allocation to
> > mb_alloc. In fact, it would be beneficial. mb_alloc uses per-CPU
> > caches and also makes mbuf and cluster allocations share the same
> > per-CPU lock. What could be done is that the jumbo buffer allocations
> > could share the same lock as well (since they will likely usually be
> > allocated right after an mbuf is). This would give us jumbo-cluster
> > support, but it would only be useful for devices clued enough to break
> > up the cluster into PAGE_SIZE chunks and do scatter/gather. For most
> > worthy gigE devices, I don't think this should be a problem.
>
> I'm a bit worried about other devices. Traditionally, mbufs have
> never crossed page boundaries, so most drivers never bother to check
> for a transmit mbuf crossing a page boundary. Using physically
> discontiguous mbufs could lead to a lot of subtle data corruption.

I assume here that when you say "mbuf" you mean "jumbo buffer attached
to an mbuf." In that case, yeah, all we need to make sure of is that
the driver knows it's dealing with non-physically-contiguous pages.
As for regular 2K mbuf clusters and the 256-byte mbufs themselves,
they never cross page boundaries, so this should not be a problem for
drivers that do not use jumbo clusters.

> One question. I've observed some really anomalous behaviour under
> -stable with my Myricom GM driver (2Gb/s + 2Gb/s link speed, Dual 1GHz
> pIII). When I use 4K mbufs for receives, the best speed I see is
> about 1300Mb/sec. However, if I use private 9K physically contiguous
> buffers I see 1850Mb/sec (iperf TCP).
>
> The obvious conclusion is that there's a lot of overhead in setting up
> the DMA engines, but that's not the case; we have a fairly quick chain
> dma engine. I've provided a "control" by breaking my contiguous
> buffers down into 4K chunks so that I do the same number of DMAs in
> both cases and I still see ~1850 Mb/sec for the 9K buffers.
>
> A coworker suggested that the problem was that when doing copyouts to
> userspace, the PIII was doing speculative reads and loading the cache
> with the next page. However, we then start copying from a totally
> different address using discontiguous buffers, so we effectively take
> 2x the number of cache misses we'd need to. Does that sound
> reasonable to you? I need to try malloc'ing virtually contiguous and
> physically discontiguous buffers & see if I get the same (good)
> performance...

I believe that the Intel chips do "virtual page caching" and that the
logic that does the virtual -> physical address translation sits
between the L2 cache and RAM. If that is indeed the case, then your
idea of testing with virtually contiguous pages is a good one.
Unfortunately, I don't know if the PIII is doing speculative
cache-loads, but it could very well be the case.
If it is, and if the chip does in fact cache based on virtual
addresses, then providing it with virtually contiguous address space
may yield better results. If you try this, please let me know. I'm
extremely interested in seeing the results!

> Cheers,
>
> Drew

Regards,
-- 
Bosko Milekic
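
To make the page-boundary concern above concrete, here is a rough sketch of the check a driver would need before handing a jumbo buffer to a scatter/gather DMA engine: walk the buffer and emit one segment per page so that no segment crosses a page boundary. This is only an illustration in plain C, not code from any driver or from mb_alloc. The names `sg_seg`, `sg_split` and `SKETCH_PAGE_SIZE` are hypothetical, the 4096-byte page size is an assumption, and a real driver would still have to translate each segment's virtual address to a physical one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch (hypothetical constant). */
#define SKETCH_PAGE_SIZE 4096u

/* One scatter/gather segment (hypothetical type). */
struct sg_seg {
	uintptr_t addr;	/* start address of the segment */
	size_t    len;	/* length in bytes, never crossing a page */
};

/*
 * Split the range [addr, addr + len) into segments that each stay
 * within a single page.  Returns the number of segments written to
 * segs[], or -1 if more than max_segs segments would be needed.
 */
static int
sg_split(uintptr_t addr, size_t len, struct sg_seg *segs, int max_segs)
{
	int n = 0;

	assert(segs != NULL && max_segs > 0);

	while (len > 0) {
		/* Bytes left before the next page boundary. */
		size_t room = SKETCH_PAGE_SIZE -
		    (size_t)(addr & (SKETCH_PAGE_SIZE - 1));
		size_t chunk = (len < room) ? len : room;

		if (n == max_segs)
			return (-1);	/* caller's segment list is too small */

		segs[n].addr = addr;
		segs[n].len = chunk;
		n++;

		addr += chunk;
		len -= chunk;
	}
	return (n);
}
```

With these assumptions, a page-aligned 9K buffer splits into three segments (4096 + 4096 + 1024 bytes), while the same buffer starting 256 bytes before a page boundary needs four. A driver that assumes one segment per buffer would DMA across the boundary into whatever physical page happens to follow, which is the subtle corruption being worried about.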