Casper Dik, Yes, I am familiar with Bonwick's slab allocator and tried it in a wirespeed test of 64-byte pieces, first for 1Gb Ethernet, then 100Mb, and lastly 10Mb. My results were not encouraging. I assume it has improved over time.
First, let me ask: what happens to the FS if the allocs in the intent log code are sleeping, waiting for memory?

IMO, the general problem with memory allocators is:
- getting memory from a "cache" of one's own size/type costs orders of magnitude more than just taking some off one's own freelist,
- there is a built-in latency to recuperate/steal memory from other processes,
- this stealing forces a sleep and context switches,
- the amount of time to sleep is indeterminate with a single call per struct. How long can you sleep for? 100ms, 250ms, or more?
- no process can guarantee a working set.

When memory was expensive, maybe a global sharing mechanism made sense, but now that memory is fairly plentiful and cheap, a two-stage implementation makes sense: preallocate a working set, then fall back to normal allocation with its added latency.

So, it makes sense to pre-allocate a working set with a single alloc call, break that allocation up into the needed sizes, and then alloc from your own freelist. Only if that freelist empties do you take the extra overhead of the kmem call. Consider that an expected cost of exceeding a certain watermark. (A rough sketch of the pre-alloc follows below, before the quoted message.)

But otherwise, I bet that if I give you some code for the pre-alloc, 10 allocs from the freelist can be done in the time of one kmem_alloc call, and at least 100 to 10k allocs if a sleep occurs on your side. Actually, I think it is so bad that you might as well time 1 kmem_free against grabbing elements off the freelist. However, don't trust me: I will drop a snapshot of the code to you tomorrow if you want, and you can make a single-CPU benchmark comparison.

Your multiple-CPU issue forces me to ask: is it a common occurrence that 2 or more CPUs are simultaneously requesting memory for the intent log? If it is, then there should be a freelist with a low-watermark set of elements per CPU. However, one thing at a time.

So, do you want that code? It will do a single alloc of X units and place them on a freelist. You then time how long it takes to remove Y elements from the freelist versus 1 kmem_alloc with a NO_SLEEP arg, and report the numbers. Then I would suggest timing the call with the smallest sleep possible: how many freelist allocs could be done in that time? 25k, 35k, more? (A sketch of that timing loop is included below as well.)

Oh, the reason we aren't timing the initial kmem_alloc call for the freelist is that I expect it to occur during init, which does not proceed until the memory is alloc'ed.
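To make the pre-alloc concrete, here is a rough sketch of the two-stage scheme described above. Everything in it -- the dva_node_t layout, the pool name, the DVA_POOL_COUNT working-set size -- is an illustrative placeholder rather than actual ZFS code, and the overflow handling is simplified:

/*
 * Sketch of the two-stage scheme: one KM_SLEEP allocation at init,
 * carved into nodes on a private freelist; kmem_alloc(KM_NOSLEEP)
 * only as overflow.  All names here are placeholders.
 */
#include <sys/types.h>
#include <sys/kmem.h>
#include <sys/mutex.h>

#define	DVA_POOL_COUNT	1024		/* working-set size, tune as needed */

typedef struct dva_node {
	struct dva_node	*dn_next;	/* freelist linkage */
	/* ... payload fields ... */
} dva_node_t;

typedef struct dva_pool {
	kmutex_t	dp_lock;	/* protects the freelist */
	dva_node_t	*dp_free;	/* head of the private freelist */
	void		*dp_chunk;	/* the one big allocation */
} dva_pool_t;

static dva_pool_t dva_pool;

/*
 * One KM_SLEEP allocation at init time; carve it into nodes and
 * thread them onto the private freelist.  Init may block; the
 * intent-log path later should not.
 */
void
dva_pool_init(void)
{
	dva_node_t *dn;
	int i;

	mutex_init(&dva_pool.dp_lock, NULL, MUTEX_DEFAULT, NULL);
	dva_pool.dp_chunk = kmem_alloc(DVA_POOL_COUNT * sizeof (dva_node_t),
	    KM_SLEEP);
	dn = (dva_node_t *)dva_pool.dp_chunk;
	for (i = 0; i < DVA_POOL_COUNT; i++, dn++) {
		dn->dn_next = dva_pool.dp_free;
		dva_pool.dp_free = dn;
	}
}

/*
 * Fast path: pop a node off the private freelist.  Only when the
 * working set is exhausted do we pay for a kmem_alloc(), and even
 * then with KM_NOSLEEP so the caller decides how to handle NULL.
 */
dva_node_t *
dva_node_get(void)
{
	dva_node_t *dn;

	mutex_enter(&dva_pool.dp_lock);
	if ((dn = dva_pool.dp_free) != NULL)
		dva_pool.dp_free = dn->dn_next;
	mutex_exit(&dva_pool.dp_lock);

	if (dn == NULL)
		dn = kmem_alloc(sizeof (dva_node_t), KM_NOSLEEP);
	return (dn);
}

void
dva_node_put(dva_node_t *dn)
{
	/*
	 * For simplicity every node goes back on the freelist; a real
	 * version would have to remember which nodes came from the
	 * init chunk and which from the overflow kmem_alloc().
	 */
	mutex_enter(&dva_pool.dp_lock);
	dn->dn_next = dva_pool.dp_free;
	dva_pool.dp_free = dn;
	mutex_exit(&dva_pool.dp_lock);
}

And a sketch of the single-CPU comparison proposed above, building on that pool: time Y gets from the private freelist against one kmem_alloc() with KM_NOSLEEP. The loop count and reporting are placeholders, and the popped nodes are deliberately not returned, just to keep the sketch short:

/* Sketch of the proposed single-CPU benchmark, using the pool above. */
#include <sys/time.h>
#include <sys/cmn_err.h>

void
dva_pool_bench(int y)
{
	hrtime_t t0, t_freelist, t_kmem;
	void *buf;
	int i;

	/* Time Y gets from the private freelist. */
	t0 = gethrtime();
	for (i = 0; i < y; i++)
		(void) dva_node_get();
	t_freelist = gethrtime() - t0;

	/* Time a single kmem_alloc with KM_NOSLEEP. */
	t0 = gethrtime();
	buf = kmem_alloc(sizeof (dva_node_t), KM_NOSLEEP);
	t_kmem = gethrtime() - t0;
	if (buf != NULL)
		kmem_free(buf, sizeof (dva_node_t));

	cmn_err(CE_NOTE, "%d freelist gets: %lld ns; 1 kmem_alloc: %lld ns",
	    y, (longlong_t)t_freelist, (longlong_t)t_kmem);
}

For the multiple-CPU question, the natural extension of the same idea is one such pool per CPU, each refilled when it drops below a low watermark, so that two CPUs filling the intent log never contend on a single freelist lock.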
Mitchell Erblich
------------------------

[EMAIL PROTECTED] wrote:
>
> > at least one location:
> >
> > When adding a new dva node into the tree, a kmem_alloc is done with
> > a KM_SLEEP argument.
> >
> > thus, this process thread could block waiting for memory.
> >
> > I would suggest adding a pre-allocated pool of dva nodes.
>
> This is how the Solaris memory allocator works.  It keeps pools of
> "pre-allocated" nodes about until memory conditions are low.
>
> > When a new dva node is needed, first check this pre-allocated
> > pool and allocate from there.
>
> There are two reasons why this is a really bad idea:
>
> - the system will run out of memory even sooner if people
>   start building their own free-lists
>
> - a single freelist does not scale; at two CPUs it becomes
>   the allocation bottleneck (I've measured and removed two
>   such bottlenecks from Solaris 9)
>
> You might want to learn about how the Solaris memory allocator works;
> it pretty much works like you want, except that it is all part of the
> framework.  And, just as in your case, it does run out sometimes, but
> a private freelist does not help against that.
>
> > Why? This would eliminate a possible sleep condition if memory
> > is not immediately available. The pool would add a working
> > set of dva nodes that could be monitored. Per alloc latencies
> > could be amortized over a chunk allocation.
>
> That's how the Solaris memory allocator already works.
>
> Casper
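For reference, the framework Casper points at is the stock kernel object-cache interface, kmem_cache_create(9F) and friends. A minimal sketch of using it for the same node type (dva_node_t is again just the placeholder from the sketches above):

#include <sys/kmem.h>

static kmem_cache_t *dva_node_cache;

void
dva_cache_init(void)
{
	/* No constructor/destructor/reclaim callbacks in this sketch. */
	dva_node_cache = kmem_cache_create("dva_node_cache",
	    sizeof (dva_node_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
}

dva_node_t *
dva_node_alloc(void)
{
	/*
	 * KM_NOSLEEP returns NULL instead of blocking when memory is
	 * tight; the caller must handle the failure.
	 */
	return (kmem_cache_alloc(dva_node_cache, KM_NOSLEEP));
}

void
dva_node_free(dva_node_t *dn)
{
	kmem_cache_free(dva_node_cache, dn);
}

The cache hands back constructed buffers from per-CPU magazines, which is how it addresses both the preallocation and the two-CPU scaling concerns discussed above.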