>Casper Dik,
>
>       Yes, I am familiar with Bonwick's slab allocators and tried
>       it for wirespeed test of 64byte pieces for a 1Gb and then
>       100Mb Eths and lastly 10Mb Eth. My results were not 
>       encouraging. I assume it has improved over time.
Nothing that tries to send 64-byte pieces over 1Gb or 100Mb Ethernet
will give encouraging results.

>       First, let me ask what happens to the FS if the allocs
>       in the intent log code are sleeping waiting for memory????

How are you going to guarantee that there is *always* memory available?

I think that's barking up the wrong tree.  A proper solution is not
a way of preventing memory from ever running out, but rather a way of
dealing with the case where it does run out.

If KM_SLEEP is used in a path where it is causing problems, then no
amount of freelists is going to solve that.  There needs to be a solution
which does not sleep.
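
In Solaris kernel terms that means KM_NOSLEEP and handling the NULL
return.  A rough sketch (the xx_ names are made up for illustration, not
actual ZFS/ZIL code):

        /*
         * Assumes xx_record_cache was set up elsewhere with
         * kmem_cache_create(); the point is only that the allocation
         * never sleeps, so the caller must cope with a NULL return.
         */
        static xx_record_t *
        xx_record_get(void)
        {
                xx_record_t *rec;

                rec = kmem_cache_alloc(xx_record_cache, KM_NOSLEEP);
                if (rec == NULL) {
                        /* Memory is tight: degrade or defer, don't block. */
                        return (NULL);
                }
                bzero(rec, sizeof (xx_record_t));
                return (rec);
        }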

>       - getting memory from a "cache" of ones own size/type
>         is orders of magnitude higher than just getting some
>         off one's own freelist,

Actually, that's not true; Bonwick's allocator is *FASTER* by a *wide*
margin than your own freelist.

Believe me, I've measured this; I've seen "my own freelist" collapse
on the floor when confronted with as few as two CPUs.

At a minimum, you will need *per-CPU* free lists.

And that's precisely what the kernel memory allocator gives you.
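
To make that concrete, using it looks roughly like this (a sketch with
made-up xx_ names, not code from any actual driver); the per-CPU caching
comes along for free:

        static kmem_cache_t *xx_buf_cache;

        void
        xx_buf_init(void)
        {
                /* One fixed-size cache for all xx_buf_t allocations. */
                xx_buf_cache = kmem_cache_create("xx_buf_cache",
                    sizeof (xx_buf_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
        }

        xx_buf_t *
        xx_buf_get(void)
        {
                /* Usually satisfied from this CPU's magazine; no lock. */
                return (kmem_cache_alloc(xx_buf_cache, KM_SLEEP));
        }

        void
        xx_buf_put(xx_buf_t *buf)
        {
                /* Goes back into the per-CPU magazine, not a global list. */
                kmem_cache_free(xx_buf_cache, buf);
        }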

>       In the time when memory was expensive, maybe a global
>       sharing mechanisms would make sense, but when  the amount
>       of memory is somewhat plentiful and cheap,

Not if all bits of the system are going to keep their own freelists * #CPUs.

Then you are suddenly faced with a *MUCH* higher memory demand.  The
Bonwick allocator already keeps quite a bit cached, and thus already
keeps a fair amount of memory unavailable.

>       *** It then makes sense for a 2 stage implementation of
>           preallocation of a working set and then normal allocation
>           with the added latency. 

But the normal Bonwick allocation *is* two-stage; you are proposing to
add a 3rd stage.

>       So, it makes sense to pre-allocate a working set of allocs
>       by a single alloc call, break up the alloc into needed sizes,
>       and then alloc from your own free list,

That's what the Bonwick allocator does; so why are you duplicating this?

Apart from the questionable performance gain (I believe there to be none),
the loss of the kernel memory allocator's debugging functionality is severe:

        - you can no longer track where the individual blocks are allocated
        - you can no longer track buffer overruns
        - buffers run into one another, so one overrun buffer corrupts another
          without a trace

>       -----> if that freelist then empties, maybe then take the extra
>       overhead with the kmem call. Consider this a expected cost to exceed
>       a certain watermark.

This is exactly how the magazine layer works.
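
For the curious, the allocation path looks roughly like the toy model
below; this is simplified pseudocode in the spirit of the magazine design,
not the actual kmem source:

        #define MAG_ROUNDS      15              /* buffers per magazine */

        typedef struct magazine {
                void    *mag_round[MAG_ROUNDS];
        } magazine_t;

        typedef struct cpu_cache {              /* one of these per CPU */
                magazine_t      *cc_loaded;     /* currently loaded magazine */
                int             cc_rounds;      /* free buffers left in it */
                magazine_t      *cc_prev;       /* previously loaded magazine */
                int             cc_prev_rounds;
        } cpu_cache_t;

        static void *
        cpu_cache_alloc(cpu_cache_t *ccp)
        {
                /* Common case: pop from this CPU's magazine; no lock. */
                if (ccp->cc_rounds > 0)
                        return (ccp->cc_loaded->mag_round[--ccp->cc_rounds]);

                /* Loaded magazine empty: exchange it with the previous one. */
                if (ccp->cc_prev_rounds > 0) {
                        magazine_t *mp = ccp->cc_loaded;

                        ccp->cc_loaded = ccp->cc_prev;
                        ccp->cc_rounds = ccp->cc_prev_rounds;
                        ccp->cc_prev = mp;
                        ccp->cc_prev_rounds = 0;
                        return (ccp->cc_loaded->mag_round[--ccp->cc_rounds]);
                }

                /*
                 * Both magazines empty: only now is a lock taken, to
                 * refill from the depot / slab layer (not shown).
                 */
                return (NULL);
        }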

>       But otherwise, I bet if I give you some code for the pre-alloc, I bet 10
>       allocs from the freelist can be done versus the kmem_alloc call, and
>       at least 100 to 10k allocs if sleep occurs on your side.

I hope you're not designing this with a single lock per queue.

I have eradicated code in Solaris 9 which looked like this:

        struct au_buff *
        au_get_buff(void)
        {
                au_buff_t *buffer = NULL;

                /* One global lock serializes every allocation. */
                mutex_enter(&au_free_queue_lock);

                if (au_free_queue == NULL) {
                        if (au_get_chunk(1)) {
                                mutex_exit(&au_free_queue_lock);
                                return (NULL);
                        }
                }

                /* Pop the head of the single, shared freelist. */
                buffer = au_free_queue;
                au_free_queue = au_free_queue->next_buf;
                mutex_exit(&au_free_queue_lock);
                buffer->next_buf = NULL;
                return (buffer);
        }

(with a corresponding free routine which never returned memory to the
system but kept it in the freelist)

This was replaced with essentially:

        buffer = kmem_cache_alloc(au_buf_cache, KM_SLEEP);

The first bit of code stopped scaling at 1 CPU (the performance
with two CPUs was slightly worse than with one CPU).

The second bit of code was both FASTER in the single CPU case and
scaled to the twelve CPUs I had for testing.

>       Actually, I think it is so bad, that why don't you time 1 kmem_free
>       versus grabbing elements off the freelist,

I did, it's horrendous.

Don't forget that in the typical case, when the magazine layer is properly
sized after the system has been running for a while, no locks need to be
grabbed to get memory, as the magazines are per-CPU.

But with your single freelist, you must grab a lock.  Somewhere in
the grab/release lock cycle there's at least one atomic operation
and memory barrier.

Those are perhaps cheap on single CPU systems but run in the hundreds
of cycles on larger systems.

Here's something else from Sun's "collective" memory which I think
illustrates why we think private freelists are bad, /on principle/.

When Jeff Bonwick redid the memory allocator, he did so because he noticed
that its performance was bad, so bad in fact that people generally
avoided the allocator if they could (hence the use of private freelists
in particular parts of the system, such as the auditing example above).
This is particularly poor software engineering for two important
reasons: if a core bit of software does not work or perform properly, it
needs to be rewritten, not worked around; and if a core bit of software
is worked around, any future improvements to it will go unnoticed.

The auditing example and the resulting poor performance of the auditing
code amply demonstrate both points; there's even an additional similarity
between it and the intent log: both allocate buffers to be written to
disk.

The object lesson of the Bonwick memory allocator, and of the addition of
the magazine layer in a later release, is this: if the tool is not
up to the job, it must be improved, not worked around.

So *if* the memory allocator is not able to deliver the QoS required
for a particular part of the system, it stands to reason that it
should be improved with QoS features to serve the needs of those
consumers.

(But I'm not sure that is even the case; then again, I'm not sufficiently
familiar with how the ZIL or ZFS work and what the exact issues are.)

Sorry for the lecture; my experience with single freelists of
buffers to be written to disk is just something I thought I needed to
share in this discussion.


Casper