On Wed, 4 Oct 2006, Erblichs wrote:

Casper Dik,

        Yes, I am familiar with Bonwick's slab allocator and tried it
        for a wirespeed test of 64-byte pieces, first on 1Gb Ethernet,
        then 100Mb, and lastly 10Mb. My results were not encouraging.
        I assume it has improved over time.

        First, let me ask: what happens to the FS if the allocs
        in the intent log code are sleeping, waiting for memory?

The same as would happen to the FS with your proposed additional allocator layer if that "freelist" of yours runs out - it'll wait, and you'll see a latency bubble.

You seem to think it's likely that a kmem_alloc(..., KM_SLEEP) will sleep. It's not. Anything but. See below.
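
For reference, this is roughly what the two flavours look like from the caller's side - a minimal sketch using the documented kmem interfaces; the fallback policy shown is just one possible choice for illustration, not what ZFS does:

#include <sys/types.h>
#include <sys/kmem.h>

/*
 * Sketch only.  KM_NOSLEEP fails immediately rather than waiting;
 * KM_SLEEP may block but never returns NULL for a non-zero size.
 */
static void *
alloc_sketch(size_t size)
{
    void *buf;

    /* Non-blocking attempt: returns NULL rather than waiting. */
    buf = kmem_alloc(size, KM_NOSLEEP);
    if (buf == NULL) {
        /*
         * Blocking variant: in practice this only sleeps under
         * real memory pressure, as discussed above.
         */
        buf = kmem_alloc(size, KM_SLEEP);
    }
    return (buf);
}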


        IMO, the general problem with memory allocators is:

        - getting memory from a "cache" of one's own size/type
          costs orders of magnitude more than just getting some
          off one's own freelist,

This is why the kernel memory allocator in Solaris has two such freelists (see the sketch right after this list):

        - the per-CPU kmem magazines (you say below 'one thing at a time',
          but that step is already done in Solaris kmem)
        - the slab cache
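
To make that concrete, here is a minimal consumer-side sketch - the object type and cache name are made up for illustration, the calls are the standard <sys/kmem.h> object-cache interfaces:

#include <sys/types.h>
#include <sys/kmem.h>

typedef struct my_node {                /* hypothetical object type */
    uint64_t        mn_id;
    struct my_node  *mn_next;
} my_node_t;

static kmem_cache_t *my_node_cache;

void
my_node_init(void)
{
    /*
     * One object cache per type/size.  The per-CPU magazine layer and
     * the slab layer sit behind this single handle; the consumer never
     * manages either freelist itself.
     */
    my_node_cache = kmem_cache_create("my_node_cache",
        sizeof (my_node_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
}

my_node_t *
my_node_get(void)
{
    /* Usually satisfied from this CPU's magazine, without locking. */
    return (kmem_cache_alloc(my_node_cache, KM_SLEEP));
}

void
my_node_put(my_node_t *np)
{
    /* Goes back into the magazine, ready for the next allocation. */
    kmem_cache_free(my_node_cache, np);
}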


        - there is a built-in latency to reclaim/steal memory
          from other processes,

Stealing ("reclaim" in Solaris kmem terms) happens if the following three conditions are true:

        - nothing in the per-CPU magazines
        - nothing in the slab cache
        - nothing in the quantum caches
        - on the attempt to grow the quantum cache, the request to the
          vmem backend finds no readily-available heap to satisfy the
          growth demand immediately
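
Purely as a conceptual illustration of that ordering - this is pseudocode, the helper names below are invented stand-ins for the allocator's internal layers, not the actual Solaris implementation:

#include <sys/types.h>

/* Invented stand-ins for the internal layers, for illustration only. */
extern void *magazine_alloc(size_t);        /* per-CPU magazine */
extern void *slab_cache_alloc(size_t);      /* slab cache */
extern void *quantum_cache_alloc(size_t);   /* quantum caches */
extern void *vmem_backend_grow(size_t);     /* readily-available heap */
extern void *reclaim_and_retry(size_t);     /* coalesce/purge, may sleep */

void *
conceptual_alloc(size_t size)
{
    void *buf;

    if ((buf = magazine_alloc(size)) != NULL)
        return (buf);
    if ((buf = slab_cache_alloc(size)) != NULL)
        return (buf);
    if ((buf = quantum_cache_alloc(size)) != NULL)
        return (buf);
    if ((buf = vmem_backend_grow(size)) != NULL)
        return (buf);
    /* Only now does "stealing" - and possibly sleeping - come into play. */
    return (reclaim_and_retry(size));
}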


        - this stealing forces a sleep and context switches,

        - the amount of time to sleep is indeterminate with a single
          call per struct. How long can you sleep for? 100ms or
          250ms or more...

        - no process can guarantee a working set,

Yes and no. If your working set is small, use the stack.


        In the time when memory was expensive, maybe global sharing
        mechanisms made sense, but now that the amount of memory is
        somewhat plentiful and cheap,

        *** it then makes sense to use a two-stage implementation:
            preallocation of a working set, followed by normal
            allocation with its added latency.

        So it makes sense to pre-allocate a working set with a single
        alloc call, break that allocation up into the needed sizes,
        and then allocate from your own freelist,

See above - all of that _IS_ already done in Solaris kmem/vmem, with more parallelism and more intermediate caching layers designed to bring down allocation latency than your simple freelist approach would achieve.
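
For comparison, the kind of private freelist being proposed would look roughly like this - an illustrative sketch of the idea only, with made-up names, not code from ZFS or from the original poster:

#include <sys/types.h>
#include <sys/kmem.h>
#include <sys/mutex.h>

#define PREALLOC_COUNT  1024    /* pieces carved out at init time */
#define PIECE_SIZE      64      /* fixed allocation size */

typedef struct piece {
    struct piece    *p_next;
} piece_t;

static kmutex_t freelist_lock;
static piece_t  *freelist;

void
freelist_init(void)
{
    char    *chunk;
    int     i;

    mutex_init(&freelist_lock, NULL, MUTEX_DEFAULT, NULL);

    /* One KM_SLEEP allocation at init time, carved into pieces. */
    chunk = kmem_alloc((size_t)PREALLOC_COUNT * PIECE_SIZE, KM_SLEEP);
    for (i = 0; i < PREALLOC_COUNT; i++) {
        piece_t *p = (piece_t *)(chunk + (size_t)i * PIECE_SIZE);
        p->p_next = freelist;
        freelist = p;
    }
}

void *
freelist_get(void)
{
    piece_t *p;

    /* A single lock - the SMP bottleneck noted elsewhere in this thread. */
    mutex_enter(&freelist_lock);
    p = freelist;
    if (p != NULL)
        freelist = p->p_next;
    mutex_exit(&freelist_lock);

    /* Watermark exceeded: fall back to the regular allocator. */
    return (p != NULL ? (void *)p : kmem_alloc(PIECE_SIZE, KM_SLEEP));
}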


        -----> if that freelist then empties, maybe then take the extra
        overhead of the kmem call. Consider this an expected cost of
        exceeding a certain watermark.

        But otherwise, I bet that if I give you some code for the
        pre-alloc, 10 allocs from the freelist can be done in the time
        of one kmem_alloc call, and at least 100 to 10k allocs if a
        sleep occurs on your side.

The same statistics can be quoted for Solaris kmem - you satisfy the request from the per-CPU magazine, or from the slab cache, or via an immediate vmem backend allocation and a growth of the slab cache. All of these come with increased latency but without sleeping. Sleeping only comes in if you're so tight on memory that you need to perform coalescing in the backend and purge least-recently-used things from other kmem caches in favour of new backend requests. Just because you chose to say kmem_alloc(..., KM_SLEEP) doesn't mean you _will_ sleep. Normally you won't.


        Actually, I think it is so bad that you should time even a
        single kmem_free against grabbing elements off the freelist.

        However, don't trust me; I will drop a snapshot of the code to
        you tomorrow if you want, and you can make a single-CPU
        benchmark comparison.

        Your multiple-CPU issue forces me to ask: is it a common
        occurrence that 2 or more CPUs are simultaneously requesting
        memory for the intent log? If it is, then there should be a
        freelist with a low-watermark set of elements per CPU. However,
        one thing at a time...

Of course it's common - have two or more threads do filesystem I/O at the same time and you're already there. Which is why, one thing at a time, Solaris kmem has had the magazine layer for, I think (it predates my time at Sun), around 12 years now, to get SMP scalability. Been there, done that ...


        So, do you want that code? It will do a single alloc of X units
        and then place them on a freelist. You then time how long it
        takes to remove Y elements from the freelist versus 1 kmem_alloc
        with a KM_NOSLEEP arg and report the numbers. Then I would
        suggest the call with the smallest sleep possible. How many
        allocs can then be done? 25k, 35k, more...

        Oh, the reason we aren't timing the initial kmem_alloc call for
        the freelist is that I expect it to occur during init, which
        would not proceed until the memory is allocated.
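
A minimal sketch of that measurement - freelist_get() is the hypothetical helper from the earlier freelist sketch, gethrtime() and cmn_err() are the standard kernel interfaces; the numbers would of course have to come from a real run:

#include <sys/types.h>
#include <sys/time.h>
#include <sys/kmem.h>
#include <sys/cmn_err.h>

extern void *freelist_get(void);    /* hypothetical helper from above */

#define NTRIES  100000
#define SIZE    64

void
compare_alloc_paths(void)
{
    hrtime_t    start, t_freelist, t_kmem;
    int         i;

    start = gethrtime();
    for (i = 0; i < NTRIES; i++)
        (void) freelist_get();
    t_freelist = gethrtime() - start;

    start = gethrtime();
    for (i = 0; i < NTRIES; i++) {
        void *buf = kmem_alloc(SIZE, KM_NOSLEEP);
        if (buf != NULL) {
            /*
             * Freed only to keep the loop from leaking; strictly,
             * the proposal above times the alloc alone.
             */
            kmem_free(buf, SIZE);
        }
    }
    t_kmem = gethrtime() - start;

    cmn_err(CE_NOTE, "freelist: %lld ns, kmem_alloc: %lld ns",
        (longlong_t)t_freelist, (longlong_t)t_kmem);
}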

Can you provide timing measurements under various loads that show the benefit of your change to ZFS vs. using Solaris kmem as-is? On single-CPU machines as well as on 100-CPU machines? On single disks as well as on 100-disk pools? We're very interested there. Performance characterization of ZFS has, in a way, just started, as people begin using it for their own purposes and coming up with their own numbers; changes that improve "speed" will obviously be welcome!

Best wishes,
FrankH.



        Mitchell Erblich
        ------------------------

Casper Dik wrote:

      at least one location:

      When adding a new dva node into the tree, a kmem_alloc is done with
      a KM_SLEEP argument.

      Thus, this process thread could block waiting for memory.

      I would suggest adding a pre-allocated pool of dva nodes.

This is how the Solaris memory allocator works.  It keeps pools of
"pre-allocated" nodes about until memory conditions are low.

      When a new dva node is needed, first check this pre-allocated
      pool and allocate from there.

There are two reasons why this is a really bad idea:

        - the system will run out of memory even sooner if people
          start building their own free-lists

        - a single freelist does not scale; at two CPUs it becomes
          the allocation bottleneck (I've measured and removed two
          such bottlenecks from Solaris 9)

You might want to learn about how the Solaris memory allocator works;
it pretty much works like you want, except that it is all part of the
framework.  And, just as in your case, it does run out sometimes, but
a private freelist does not help against that.

      Why? This would eliminate a possible sleep condition if memory
           is not immediately available. The pool would add a working
           set of dva nodes that could be monitored. Per-alloc latencies
           could be amortized over a chunk allocation.

That's how the Solaris memory allocator already works.

Casper