Łukasz wrote:
> After a few hours with dtrace and source code browsing, I found that in 
> my space map there are no 128K blocks left. 

Actually, you may have free space segments of 128k or more, but 
alignment requirements will not allow them to be allocated. Consider 
the following example:

1. The space map starts at 0 and its size is 256KB.
2. Two 512-byte blocks are allocated from the space map - one at the
beginning, another at the end - so the space map contains exactly one
free space segment with start 512 and size 255k.

Let's try to allocate a 128k block from such a space map. avl_find() 
will return this space segment, and then we calculate the offset at 
which to allocate:

align = size & -size = 128k & -128k = 0x20000 & 0xfffffffffffe0000 = 
0x20000 = 128k

offset = P2ROUNDUP(ss->ss_start, align) = P2ROUNDUP(512, 128k) =
-(-(512) & -(128k)) = -(0xfffffffffffffe00 & 0xfffffffffffe0000) =
-(0xfffffffffffe0000) = -(-128k) = 128k

Then we check whether offset + size is less than or equal to the space 
segment end, which is not true in this case:
offset + size = 128k + 128k = 256k > 255.5k = 512 + 255k = ss->ss_end.
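
For reference, here is a minimal standalone program reproducing this
arithmetic (P2ROUNDUP is copied from <sys/sysmacros.h>; variable names
are illustrative):

#include <stdio.h>
#include <stdint.h>

#define P2ROUNDUP(x, align)     (-(-(x) & -(align)))

int
main(void)
{
    uint64_t size = 128ULL * 1024;              /* 128k block */
    uint64_t ss_start = 512;                    /* segment start */
    uint64_t ss_end = 512 + 255ULL * 1024;      /* segment end, 255.5k */

    uint64_t align = size & -size;              /* lowest set bit = 128k */
    uint64_t offset = P2ROUNDUP(ss_start, align);

    printf("align = 0x%llx, offset = 0x%llx\n",
        (unsigned long long)align, (unsigned long long)offset);
    printf("fits: %s\n", offset + size <= ss_end ? "yes" : "no");
    return (0);
}

It prints align = 0x20000, offset = 0x20000 and "fits: no" - the only
128k-aligned offset inside the segment leaves just 127.5k before ss_end.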

So even though we have 255k of free space in a contiguous segment, we 
cannot allocate a 128k block out of it due to alignment requirements.

What is the reason for such alignment requirements? I can see at least two:
a) it reduces the number of candidate locations for big blocks, and thus 
the number of iterations in the 'while' loop inside metaslab_ff_alloc();
b) since we are using cursors to keep the location where the last 
allocated block of each size ended, it gives allocations of smaller 
sizes a chance not to loop in the 'while' loop inside 
metaslab_ff_alloc().

There may be other reasons also.
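
To make (a) and (b) concrete, here is a simplified standalone sketch of
a cursor-based first-fit search. It walks a plain sorted array instead
of the AVL tree, and the names only mimic space_seg_t - it is an
illustration, not the literal metaslab_ff_alloc() code:

#include <stdio.h>
#include <stdint.h>

#define P2ROUNDUP(x, align)     (-(-(x) & -(align)))

/* stand-in for ZFS's space_seg_t */
typedef struct seg {
    uint64_t ss_start;
    uint64_t ss_end;
} seg_t;

static uint64_t
ff_alloc_sketch(seg_t *segs, int nsegs, uint64_t *cursor, uint64_t size)
{
    uint64_t align = size & -size;
    int i;

    for (i = 0; i < nsegs; i++) {
        seg_t *ss = &segs[i];
        /* (b) resume from the cursor kept for this block size */
        uint64_t base = ss->ss_start > *cursor ? ss->ss_start : *cursor;
        /* (a) alignment leaves one candidate offset per segment */
        uint64_t offset = P2ROUNDUP(base, align);

        if (offset + size <= ss->ss_end) {
            *cursor = offset + size;
            return (offset);
        }
    }
    return ((uint64_t)-1);      /* no aligned fit - allocation fails */
}

int
main(void)
{
    /* the example above: a single free segment [512, 255.5k) */
    seg_t segs[] = { { 512, 512 + 255ULL * 1024 } };
    uint64_t cursor = 0;

    printf("128k alloc -> %lld\n",
        (long long)ff_alloc_sketch(segs, 1, &cursor, 128ULL * 1024));
    printf("  8k alloc -> %lld\n",
        (long long)ff_alloc_sketch(segs, 1, &cursor, 8ULL * 1024));
    return (0);
}

The 128k allocation returns -1 even though 255k is free, while the 8k
allocation succeeds at offset 8k.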

Bug 6495013 "Loops and recursion in metaslab_ff_alloc can kill 
performance, even on a pool with lots of free data" fixes a rather 
nasty race condition that further degrades the performance of 
metaslab_ff_alloc() on a fragmented pool:

http://src.opensolaris.org/source/diff/onnv/onnv-gate/usr/src/uts/common/fs/zfs/metaslab.c?r2=3848&r1=3713

But the loops (and the recursion) are still there.

> Try this on your ZFS. 
>   dtrace -n 'fbt::metaslab_group_alloc:return /arg1 == -1/ {}'
> 
> If you get probe firings, then you also have the same problem.
> Allocating from the space map works like this:
> 1. metaslab_group_alloc wants to allocate a 128K block
> 2. for (all metaslabs) {
>        read the space map and look for a 128K block
>        if there is none, remove the METASLAB_ACTIVE_MASK flag
>    }
> 3. unload the space maps for all metaslabs without METASLAB_ACTIVE_MASK
> 
> That is why spa_sync takes so much time.
> 
> Now the workaround:
>  zfs set recordsize=8K pool
Good idea, but it may have some drawbacks.

> Now the spa_sync function takes 1-2 seconds, the processor is idle, 
> and only a few metaslab space maps are loaded:
>> 00000600103ee500::walk metaslab |::print struct metaslab ms_map.sm_loaded ! 
>> grep -c "0x"
> 3
> 
> But now I have another question.
> How will 8k blocks impact performance?
First of all, you will need more block pointers to address the same 
amount of data, which is not good if your files are big and mostly 
static. If files change frequently, this may increase fragmentation 
even further...
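
To put a rough number on the first point (counting only data block
pointers and ignoring the extra levels of indirect blocks, which make
it worse):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint64_t file = 1ULL << 30;     /* a 1GB file */

    printf("128K records: %llu block pointers\n",
        (unsigned long long)(file / (128 * 1024)));  /* 8192 */
    printf("  8K records: %llu block pointers\n",
        (unsigned long long)(file / (8 * 1024)));    /* 131072 */
    return (0);
}

That is 16 times as many block pointers for the same amount of data.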

Victor.