On Wed, 4 Oct 2006, Erblichs wrote:
Casper Dik,
Yes, I am familiar with Bonwick's slab allocators and tried
one for a wirespeed test of 64-byte pieces, first on 1Gb
Ethernet, then 100Mb, and lastly 10Mb. My results were not
encouraging. I assume it has improved over time.
First, let me ask: what happens to the FS if the allocs
in the intent log code are sleeping, waiting for memory?
The same as would happen to the FS with your proposed additional allocator
layer if that "freelist" of yours runs out - it'll wait, and you'll see a
latency bubble.
You seem to think it's likely that a kmem_alloc(..., KM_SLEEP) will sleep.
It's not. Anything but. See below.
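(For readers following along, the two policies under discussion look roughly
like this in DDI-style code. This is only a sketch assuming the Solaris
<sys/kmem.h> interfaces; example_node_t and the function names are made up
for illustration and are not the actual ZFS/intent-log code.)

#include <sys/types.h>
#include <sys/kmem.h>

/* example_node_t stands in for whatever the intent log allocates. */
typedef struct example_node { uint64_t payload[8]; } example_node_t;

static example_node_t *
example_node_get(boolean_t can_block)
{
	if (can_block) {
		/*
		 * KM_SLEEP may block until memory is available, but it is
		 * normally satisfied from the per-CPU magazine or the slab
		 * cache without sleeping.
		 */
		return (kmem_alloc(sizeof (example_node_t), KM_SLEEP));
	}
	/* KM_NOSLEEP returns NULL instead of blocking; the caller must cope. */
	return (kmem_alloc(sizeof (example_node_t), KM_NOSLEEP));
}

static void
example_node_rele(example_node_t *np)
{
	kmem_free(np, sizeof (example_node_t));
}

The point being argued below is that the KM_SLEEP variant almost never
actually sleeps in practice.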
IMO, the general problem with memory allocators is:
- getting memory from a "cache" of one's own size/type
costs orders of magnitude more than just getting some
off one's own freelist,
This is why the kernel memory allocator in Solaris has two such freelists:
- the per-CPU kmem magazines (you say below 'one step at a time',
but that step is already done in Solaris kmem)
- the slab cache
- there is a built-in latency to recuperate/steal memory
from other processes,
Stealing ("reclaim" in Solaris kmem terms) happens if the following three
conditions are true:
- nothing in the per-CPU magazines
- nothing in the slab cache
- nothing in the quantum caches
- on the attempt to grow the quantum cache, the request to the
vmem backend finds no readily-available heap to satisfy the
growth demand immediately
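(A toy userland model of that lookup order, purely to illustrate the
layering; nothing below is Solaris code, and the structure and field names
are invented.)

/*
 * Toy model: try the cheap layers first, and only fall through to the
 * backend (where reclaim/sleeping would live) when every layer is empty.
 */
#include <stdio.h>
#include <stdlib.h>

enum { MAG_SLOTS = 4, SLAB_SLOTS = 16 };

struct toy_cache {
	void *magazine[MAG_SLOTS];   /* models the per-CPU magazine      */
	int   mag_count;
	void *slab[SLAB_SLOTS];      /* models the slab/quantum layers   */
	int   slab_count;
};

static void *
toy_alloc(struct toy_cache *cp, size_t size)
{
	if (cp->mag_count > 0)                  /* cheapest path */
		return (cp->magazine[--cp->mag_count]);
	if (cp->slab_count > 0)                 /* next layer */
		return (cp->slab[--cp->slab_count]);
	/*
	 * Backend growth: in the kernel this is where the vmem request
	 * happens, and only if that cannot be satisfied immediately does
	 * reclaim (and therefore sleeping) enter the picture.
	 */
	return (malloc(size));
}

int
main(void)
{
	struct toy_cache c = { .mag_count = 0, .slab_count = 0 };
	void *p = toy_alloc(&c, 64);

	printf("both caches empty, fell through to the backend: %p\n", p);
	free(p);
	return (0);
}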
- this stealing forces a sleep and context switches,
- the amount of time to sleep is indeterminate with a single
call per struct. How long can you sleep for? 100ms, or
250ms, or more..
- no process can guarantee a working set,
Yes and no. If your working set is small, use the stack.
Back when memory was expensive, maybe global
sharing mechanisms made sense, but when the amount
of memory is somewhat plentiful and cheap,
*** it then makes sense to use a two-stage implementation:
preallocation of a working set, and then normal allocation
with the added latency.
So, it makes sense to pre-allocate a working set of allocs
with a single alloc call, break that allocation up into the needed sizes,
and then alloc from your own freelist,
See above - all of that _IS_ already done in Solaris kmem/vmem, with more
parallelism and more intermediate caching layers designed to bring down
allocation latency than your simple freelist approach would achieve.
-----> if that freelist then empties, maybe then take the extra
overhead of the kmem call. Consider this an expected cost of exceeding
a certain watermark.
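(For concreteness, a minimal userland sketch of that two-stage scheme: one
big allocation carved into fixed-size pieces at init, O(1) pops from a
private freelist afterwards, and a fallback allocator call once the
watermark is exceeded. All names are illustrative, and malloc merely stands
in for kmem_alloc.)

#include <stdlib.h>

#define NODE_SIZE   64     /* size of each carved-up piece    */
#define PREALLOC_N  1024   /* working-set size chosen at init */

struct freenode { struct freenode *next; };

/* Single-threaded sketch: no locking, which is exactly the MP scalability
 * problem raised elsewhere in this thread. */
static struct freenode *freelist;

/* Stage 0: one big allocation at init, carved into NODE_SIZE pieces. */
static int
pool_init(void)
{
	char *chunk = malloc((size_t)NODE_SIZE * PREALLOC_N);
	int i;

	if (chunk == NULL)
		return (-1);
	for (i = 0; i < PREALLOC_N; i++) {
		struct freenode *n =
		    (struct freenode *)(chunk + (size_t)i * NODE_SIZE);
		n->next = freelist;
		freelist = n;
	}
	return (0);
}

/* Stage 1: pop from the private freelist; stage 2: past the watermark,
 * pay the cost of a real allocator call. */
static void *
pool_alloc(void)
{
	if (freelist != NULL) {
		struct freenode *n = freelist;
		freelist = n->next;
		return (n);
	}
	return (malloc(NODE_SIZE));
}

int
main(void)
{
	void *a;

	if (pool_init() != 0)
		return (1);
	a = pool_alloc();      /* served from the private freelist */
	return (a == NULL);
}

Note that freeing is the awkward part: pieces carved from the chunk cannot
be handed back to free(), so a pool_free() would have to know where each
pointer came from, and on an MP system that single list head needs a lock
or per-CPU instances.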
But otherwise, I bet that if I give you some code for the pre-alloc, 10
allocs from the freelist can be done in the time of one kmem_alloc call, and
at least 100 to 10k allocs if a sleep occurs on your side.
The same statistics can be made for Solaris kmem - you satisfy the request
from the per-CPU magazine, you satisfy the request from the slab cache,
you satisfy the request via immediate vmem backend allocation and a growth
of the slab cache. All of these with increased latency but without
sleeping. Sleeping only comes in if you're so tight on memory that you
need to perform coalescing in the backend, and purge least-recently-used
things from other kmem caches in favour of new backend requests. Just
because you chose to say kmem_alloc(...,KM_SLEEP) doesn't mean you _will_
sleep. Normally you won't.
Actually, I think it is so bad that you should time 1 kmem_free
versus grabbing elements off the freelist.
However, don't trust me; I will drop a snapshot of the code to you
tomorrow if you want, and you can make a single-CPU benchmark comparison.
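(If anyone wants to reproduce that kind of single-CPU comparison, a rough
userland harness might look like the following. It obviously does not
capture kernel kmem behaviour, per-CPU magazines, or lock contention;
malloc merely stands in for the allocator call, and the node size and
counts are arbitrary.)

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define N   100000   /* number of allocations to time     */
#define SZ  64       /* object size, matching the example */

struct node { struct node *next; };

static long
elapsed_ns(struct timespec a, struct timespec b)
{
	return ((b.tv_sec - a.tv_sec) * 1000000000L +
	    (b.tv_nsec - a.tv_nsec));
}

int
main(void)
{
	struct node *freelist = NULL;
	struct timespec t0, t1;
	uintptr_t sink = 0;
	int i;

	/* Pre-fill the freelist; this cost is deliberately not timed. */
	for (i = 0; i < N; i++) {
		struct node *n = malloc(SZ);

		if (n == NULL)
			return (1);
		n->next = freelist;
		freelist = n;
	}

	/* Time N pops from the private freelist (popped nodes are simply
	 * dropped; this is a timing toy, not an allocator). */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < N; i++)
		freelist = freelist->next;
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("freelist pops  : %ld ns total\n", elapsed_ns(t0, t1));

	/* Time N individual allocator calls of the same size. */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < N; i++)
		sink += (uintptr_t)malloc(SZ);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("allocator calls: %ld ns total (sink %lx)\n",
	    elapsed_ns(t0, t1), (unsigned long)sink);

	return (0);
}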
Your multiple-CPU issue forces me to ask: is it a common
occurrence that 2 or more CPUs are simultaneously requesting
memory for the intent log? If it is, then there should be a
freelist with a low-watermark set of elements per CPU. However,
one thing at a time..
Of course it's common - have two or more threads do filesystem I/O at the
same time and you're already there. Which is why, one thing at a time,
Solaris kmem had the magazine layer for, I think (predates my time at
Sun), around 12 years now, to get SMP scalability. Been there done that ...
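(For reference, getting that per-CPU magazine behaviour from the existing
allocator is simply a matter of creating a kmem object cache for the
fixed-size object. A hedged sketch against the Solaris DDI follows;
zil_node_t and the other identifiers are placeholders, not actual ZFS
code.)

#include <sys/types.h>
#include <sys/kmem.h>

/* zil_node_t is a placeholder for whatever fixed-size object is needed. */
typedef struct zil_node { uint64_t blk[8]; } zil_node_t;

static kmem_cache_t *zil_node_cache;

void
example_cache_init(void)
{
	/*
	 * One object cache per type/size. The per-CPU magazine layer and
	 * the slab layer come with it; no hand-rolled freelist is needed.
	 */
	zil_node_cache = kmem_cache_create("example_zil_node_cache",
	    sizeof (zil_node_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
}

zil_node_t *
example_node_alloc(void)
{
	return (kmem_cache_alloc(zil_node_cache, KM_SLEEP));
}

void
example_node_free(zil_node_t *np)
{
	kmem_cache_free(zil_node_cache, np);
}

The per-CPU magazines give each CPU its own small stock of objects, so the
common-case alloc/free path avoids any shared lock - which is the SMP
scalability point being made here.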
So, do you want that code? It will be a single alloc of X units,
which are then placed on a freelist. You then time how long it takes to
remove Y elements from the freelist versus 1 kmem_alloc with
a KM_NOSLEEP arg and report the numbers. Then I would suggest the
call with the smallest sleep possible. How many allocs can then
be done? 25k, 35k, more...
Oh, the reason why we aren't timing the initial kmem_alloc call
for the freelist is that I expect it to occur during init,
with nothing proceeding until the memory is alloc'ed.
Can you provide timing measurements under various loads that show the
benefit of your change to ZFS vs. using Solaris kmem as-is? On single-CPU
machines as well as on 100-CPU machines? On single disks as well as on
100-disk pools? We're very interested there. Performance characterization
of ZFS has, in a way, just started, as people begin using it for their own
purposes, coming up with their own numbers; changes that improve "speed"
will obviously be welcome!
Best wishes,
FrankH.
Mitchell Erblich
------------------------
[EMAIL PROTECTED] wrote:
In at least one location: when adding a new dva node into the tree, a
kmem_alloc is done with a KM_SLEEP argument.
Thus, this process thread could block waiting for memory.
I would suggest adding a pre-allocated pool of dva nodes.
This is how the Solaris memory allocator works. It keeps pools of
"pre-allocated" nodes around until memory conditions are low.
When a new dva node is needed, first check this pre-allocated
pool and allocate from there.
There are two reasons why this is a really bad idea:
- the system will run out of memory even sooner if people
start building their own free-lists
- a single freelist does not scale; at two CPUs it becomes
the allocation bottleneck (I've measured and removed two
such bottlenecks from Solaris 9)
You might want to learn about how the Solaris memory allocator works;
it pretty much works like you want, except that it is all part of the
framework. And, just as in your case, it does sometimes run out, but
a private freelist does not help against that.
Why? This would eliminate a possible sleep condition if memory
is not immediately available. The pool would add a working
set of dva nodes that could be monitored. Per alloc latencies
could be amortized over a chunk allocation.
That's how the Solaris memory allocator already works.
Casper
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss