Casper Dik, Yes, I am familiar with Bonwick's slab allocator and tried it in a wirespeed test of 64-byte pieces, first for 1Gb Ethernet, then 100Mb, and lastly 10Mb. My results were not encouraging. I assume it has improved over time.
First, let me ask: what happens to the FS if the allocs in the intent log code are sleeping, waiting for memory?

IMO, the general problem with memory allocators is:
- getting memory from a "cache" of one's own size/type costs orders of magnitude more than just taking some off one's own freelist,
- there is a built-in latency to recuperate/steal memory from other processes,
- this stealing forces a sleep and context switches,
- the amount of time to sleep is indeterminate with a single call per struct. How long can you sleep for? 100ms, 250ms, or more?
- no process can guarantee a working set.

When memory was expensive, maybe a global sharing mechanism made sense, but now that memory is fairly plentiful and cheap, a two-stage implementation makes sense: preallocate a working set, then fall back to normal allocation with its added latency.

So, it makes sense to pre-allocate a working set with a single alloc call, break that allocation up into the needed sizes, and then alloc from your own freelist. Only if that freelist empties do you take the extra overhead of the kmem call. Consider that an expected cost of exceeding a certain watermark. (A rough sketch of the pre-alloc follows below, before the quoted message.)

But otherwise, I bet that if I give you some code for the pre-alloc, 10 allocs from the freelist can be done in the time of one kmem_alloc call, and at least 100 to 10k allocs if a sleep occurs on your side. Actually, I think it is so bad that you might as well time 1 kmem_free against grabbing elements off the freelist. However, don't trust me: I will drop a snapshot of the code to you tomorrow if you want, and you can make a single-CPU benchmark comparison.

Your multiple-CPU issue forces me to ask: is it a common occurrence that 2 or more CPUs are simultaneously requesting memory for the intent log? If it is, then there should be a freelist with a low-watermark set of elements per CPU. However, one thing at a time.

So, do you want that code? It will do a single alloc of X units and place them on a freelist. You then time how long it takes to remove Y elements from the freelist versus 1 kmem_alloc with a NO_SLEEP arg, and report the numbers. Then I would suggest timing the call with the smallest sleep possible: how many freelist allocs could be done in that time? 25k, 35k, more? (A sketch of that timing loop is included below as well.)

Oh, the reason we aren't timing the initial kmem_alloc call for the freelist is that I expect it to occur during init, which does not proceed until the memory is alloc'ed.
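To make the pre-alloc concrete, here is a rough sketch of the two-stage scheme described above. Everything in it -- the dva_node_t layout, the pool name, the DVA_POOL_COUNT working-set size -- is an illustrative placeholder rather than actual ZFS code, and the overflow handling is simplified:

/*
 * Sketch of the two-stage scheme: one KM_SLEEP allocation at init,
 * carved into nodes on a private freelist; kmem_alloc(KM_NOSLEEP)
 * only as overflow.  All names here are placeholders.
 */
#include <sys/types.h>
#include <sys/kmem.h>
#include <sys/mutex.h>

#define	DVA_POOL_COUNT	1024		/* working-set size, tune as needed */

typedef struct dva_node {
	struct dva_node	*dn_next;	/* freelist linkage */
	/* ... payload fields ... */
} dva_node_t;

typedef struct dva_pool {
	kmutex_t	dp_lock;	/* protects the freelist */
	dva_node_t	*dp_free;	/* head of the private freelist */
	void		*dp_chunk;	/* the one big allocation */
} dva_pool_t;

static dva_pool_t dva_pool;

/*
 * One KM_SLEEP allocation at init time; carve it into nodes and
 * thread them onto the private freelist.  Init may block; the
 * intent-log path later should not.
 */
void
dva_pool_init(void)
{
	dva_node_t *dn;
	int i;

	mutex_init(&dva_pool.dp_lock, NULL, MUTEX_DEFAULT, NULL);
	dva_pool.dp_chunk = kmem_alloc(DVA_POOL_COUNT * sizeof (dva_node_t),
	    KM_SLEEP);
	dn = (dva_node_t *)dva_pool.dp_chunk;
	for (i = 0; i < DVA_POOL_COUNT; i++, dn++) {
		dn->dn_next = dva_pool.dp_free;
		dva_pool.dp_free = dn;
	}
}

/*
 * Fast path: pop a node off the private freelist.  Only when the
 * working set is exhausted do we pay for a kmem_alloc(), and even
 * then with KM_NOSLEEP so the caller decides how to handle NULL.
 */
dva_node_t *
dva_node_get(void)
{
	dva_node_t *dn;

	mutex_enter(&dva_pool.dp_lock);
	if ((dn = dva_pool.dp_free) != NULL)
		dva_pool.dp_free = dn->dn_next;
	mutex_exit(&dva_pool.dp_lock);

	if (dn == NULL)
		dn = kmem_alloc(sizeof (dva_node_t), KM_NOSLEEP);
	return (dn);
}

void
dva_node_put(dva_node_t *dn)
{
	/*
	 * For simplicity every node goes back on the freelist; a real
	 * version would have to remember which nodes came from the
	 * init chunk and which from the overflow kmem_alloc().
	 */
	mutex_enter(&dva_pool.dp_lock);
	dn->dn_next = dva_pool.dp_free;
	dva_pool.dp_free = dn;
	mutex_exit(&dva_pool.dp_lock);
}

And a sketch of the single-CPU comparison proposed above, building on that pool: time Y gets from the private freelist against one kmem_alloc() with KM_NOSLEEP. The loop count and reporting are placeholders, and the popped nodes are deliberately not returned, just to keep the sketch short:

/* Sketch of the proposed single-CPU benchmark, using the pool above. */
#include <sys/time.h>
#include <sys/cmn_err.h>

void
dva_pool_bench(int y)
{
	hrtime_t t0, t_freelist, t_kmem;
	void *buf;
	int i;

	/* Time Y gets from the private freelist. */
	t0 = gethrtime();
	for (i = 0; i < y; i++)
		(void) dva_node_get();
	t_freelist = gethrtime() - t0;

	/* Time a single kmem_alloc with KM_NOSLEEP. */
	t0 = gethrtime();
	buf = kmem_alloc(sizeof (dva_node_t), KM_NOSLEEP);
	t_kmem = gethrtime() - t0;
	if (buf != NULL)
		kmem_free(buf, sizeof (dva_node_t));

	cmn_err(CE_NOTE, "%d freelist gets: %lld ns; 1 kmem_alloc: %lld ns",
	    y, (longlong_t)t_freelist, (longlong_t)t_kmem);
}

For the multiple-CPU question, the natural extension of the same idea is one such pool per CPU, each refilled when it drops below a low watermark, so that two CPUs filling the intent log never contend on a single freelist lock.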
Mitchell Erblich
------------------------

[EMAIL PROTECTED] wrote:
>
> > at least one location:
> >
> > When adding a new dva node into the tree, a kmem_alloc is done with
> > a KM_SLEEP argument.
> >
> > thus, this process thread could block waiting for memory.
> >
> > I would suggest adding a pre-allocated pool of dva nodes.
>
> This is how the Solaris memory allocator works.  It keeps pools of
> "pre-allocated" nodes about until memory conditions are low.
>
> > When a new dva node is needed, first check this pre-allocated
> > pool and allocate from there.
>
> There are two reasons why this is a really bad idea:
>
> - the system will run out of memory even sooner if people
>   start building their own free-lists
>
> - a single freelist does not scale; at two CPUs it becomes
>   the allocation bottleneck (I've measured and removed two
>   such bottlenecks from Solaris 9)
>
> You might want to learn about how the Solaris memory allocator works;
> it pretty much works like you want, except that it is all part of the
> framework.  And, just as in your case, it does run out sometimes, but
> a private freelist does not help against that.
>
> > Why? This would eliminate a possible sleep condition if memory
> > is not immediately available. The pool would add a working
> > set of dva nodes that could be monitored. Per alloc latencies
> > could be amortized over a chunk allocation.
>
> That's how the Solaris memory allocator already works.
>
> Casper
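For reference, the framework Casper points at is the stock kernel object-cache interface, kmem_cache_create(9F) and friends. A minimal sketch of using it for the same node type (dva_node_t is again just the placeholder from the sketches above):

#include <sys/kmem.h>

static kmem_cache_t *dva_node_cache;

void
dva_cache_init(void)
{
	/* No constructor/destructor/reclaim callbacks in this sketch. */
	dva_node_cache = kmem_cache_create("dva_node_cache",
	    sizeof (dva_node_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
}

dva_node_t *
dva_node_alloc(void)
{
	/*
	 * KM_NOSLEEP returns NULL instead of blocking when memory is
	 * tight; the caller must handle the failure.
	 */
	return (kmem_cache_alloc(dva_node_cache, KM_NOSLEEP));
}

void
dva_node_free(dva_node_t *dn)
{
	kmem_cache_free(dva_node_cache, dn);
}

The cache hands back constructed buffers from per-CPU magazines, which is how it addresses both the preallocation and the two-CPU scaling concerns discussed above.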