On 14/07/17 18:43, Ruben Rodriguez wrote:
> How to reproduce...
I'll provide more concise details on how to test this behavior:
Ceph config:
[client]
rbd readahead max bytes = 0 # we don't want forced readahead to fool us
rbd cache = true
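To confirm that the running client actually picked these up, you can query
its admin socket (a sketch; substitute your client's real socket path):
ceph --admin-daemon /var/run/ceph/ceph-client.[...].asok config show \
  | grep -E 'rbd_cache|rbd_readahead'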
Start a QEMU VM with an RBD image attached via virtio-scsi:
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <auth username='libvirt'>
    <secret type='ceph' uuid='...'/>
  </auth>
  <source protocol='rbd' name='libvirt-pool/test'>
    <host name='cephmon1' port='6789'/>
    <host name='cephmon2' port='6789'/>
    <host name='cephmon3' port='6789'/>
  </source>
  <blockio logical_block_size='512' physical_block_size='512'/>
  <target dev='sdb' bus='scsi'/>
  <boot order='2'/>
  <address type='drive' controller='0' bus='0' target='0' unit='1'/>
</disk>
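To attach this to an already-running domain instead of editing its XML by
hand, something like the following should work (the domain name and file
name are placeholders):
# save the <disk> element above as rbd-disk.xml, then:
virsh attach-device mydomain rbd-disk.xml --live --config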
Block device parameters inside the VM:
NAME  ALIGN   MIN-IO   OPT-IO  PHY-SEC  LOG-SEC  ROTA  SCHED  RQ-SIZE    RA  WSAME
sdb       0  4194304  4194304      512      512     1  noop       128  4096     2G
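The same topology values can be cross-checked from sysfs inside the guest
(assuming the disk shows up as sdb, as in the libvirt XML above):
cat /sys/block/sdb/queue/minimum_io_size  # 4194304, matching the 4MB RBD object size
cat /sys/block/sdb/queue/optimal_io_size  # 4194304
cat /sys/block/sdb/queue/read_ahead_kb    # kernel readahead for this device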
Collect performance statistics from librbd, using command:
$ ceph --admin-daemon /var/run/ceph/ceph-client.[...].asok perf dump
Note the values for the following counters (a filtering sketch follows the
list):
- rd: number of read operations issued by QEMU
- rd_bytes: total length of the read requests issued by QEMU
- cache_ops_hit: read operations that hit the cache
- cache_ops_miss: read operations that missed the cache
- data_read: data read from the cache
- op_r: number of objects fetched from the OSDs
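A quick way to pull just these counters out of the perf dump JSON, assuming
jq is available (the librbd section name embeds the pool/image and varies
per client, hence the prefix match; as far as I can tell op_r lives in the
objecter section, so treat the exact paths as an assumption):
asok=/var/run/ceph/ceph-client.[...].asok  # your client's admin socket
ceph --admin-daemon "$asok" perf dump | jq '
  (to_entries[] | select(.key | startswith("librbd")) | .value
   | {rd, rd_bytes, cache_ops_hit, cache_ops_miss, data_read}),
  {op_r: .objecter.op_r}'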
Perform one small read at a 4MB boundary, away from the start of the image
(udev may have read the first objects already); skip=41943040 is exactly 10
RBD objects in:
dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes
Run the perf dump command again and note the deltas. Then read again 5000
bytes further in, so the reads do not overlap but still land in the same
4MB object:
dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes
Run the perf dump command once more.
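The whole check can be scripted from the hypervisor, assuming the guest is
reachable over ssh and jq is installed (GUEST, the socket path and /dev/sdb
are illustrative assumptions):
asok=/var/run/ceph/ceph-client.[...].asok  # your client's admin socket
opr() { ceph --admin-daemon "$asok" perf dump | jq '.objecter.op_r'; }
read1='dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes of=/dev/null'
read2='dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes of=/dev/null'
a=$(opr); ssh "$GUEST" "$read1"; b=$(opr); ssh "$GUEST" "$read2"; c=$(opr)
echo "OSD object reads: first=$((b-a)) second=$((c-b))"  # 1 and 1 = same object fetched twice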
If you compare the op_r values at each step, you should see a cache miss
and an object read each time: the same object is fetched twice.
IMPACT:
Let's take a look at how the op_r value increases by doing some common
operations:
- Booting a VM: this needs (in my case) ~70MB to be read, which includes
the kernel, the initrd and all files read by systemd and its daemons, until
a command prompt appears. Values read:
"rd": 2524,
"rd_bytes": 69685248,
"cache_ops_hit": 228,
"cache_ops_miss": 2268,
"cache_bytes_hit": 90353664,
"cache_bytes_miss": 63902720,
"data_read": 69186560,
"op": 2295,
"op_r": 2279,
That is 2,279 objects fetched from the OSDs to read ~69MB. Note the ratio:
63902720 missed bytes / 2268 misses ≈ 28KB per miss, so each miss triggers
a small OSD read instead of pulling in the whole 4MB object.
- Grepping through the Linux kernel source tree (833MB) takes almost 3
minutes. The values increase to:
"rd": 65127,
"rd_bytes": 1081487360,
"cache_ops_hit": 228,
"cache_ops_miss": 64885,
"cache_bytes_hit": 90353664,
"cache_bytes_miss": 1075672064,
"data_read": 1080988672,
"op_r": 64896,
That is over 60,000 objects fetched to read <1GB, with *0* new cache hits
(cache_ops_hit is unchanged at 228) and only ~16KB retrieved per miss
(1075672064 / 64885). Optimized, this should take ~10 seconds and fetch
~700 objects.
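For reference, a cold-cache way to time the grep test inside the guest, so
every read actually reaches librbd (the pattern and path are placeholders;
the drop_caches write needs root):
sync && echo 3 > /proc/sys/vm/drop_caches  # drop the guest page cache
time grep -r SOME_PATTERN /path/to/linux-source > /dev/null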
Is my QEMU setup completely broken, or is this expected? Please help!
--
Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
https://fsf.org | https://gnu.org