Hello Romero,

I am still a beginner with Ceph, but as far as I understand, Ceph is not designed to lose 33% of the cluster at once and recover rapidly. That is what is happening here: by losing 1 rack out of 3, you are losing a third of the cluster, and it will take a very long time to recover before you reach HEALTH_OK.

Can you check with ceph -w how long it takes for Ceph to converge to a healthy cluster after you switch off the switch in Rack-A?
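Something along these lines should show whether recovery is actually making progress (illustrative commands only; the exact output varies between releases):

# Stream cluster events and recovery/backfill progress in real time
ceph -w

# One-shot summaries: overall health and the PG states behind it
ceph status
ceph health detail

# List PGs stuck in a non-clean state (peering, degraded, ...)
ceph pg dump_stuck unclean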
Saverio

2015-06-24 14:44 GMT+02:00 Romero Junior <r.jun...@global.leaseweb.com>:
> Hi,
>
> We are setting up a test environment using Ceph as the main storage
> solution for our QEMU-KVM virtualization platform, and everything works
> fine except for the following:
>
> When I simulate a failure by powering off the switches on one of our
> three racks, my virtual machines get into a weird state. This illustration
> might help you fully understand what is going on:
> http://i.imgur.com/clBApzK.jpg
>
> The PGs are distributed across racks; we are not using the default CRUSH
> rules.
>
> The number of PGs is the following:
>
> root@srv003:~# ceph osd pool ls detail
> pool 11 'libvirt-pool' replicated size 2 min_size 1 crush_ruleset 0
> object_hash rjenkins pg_num 16000 pgp_num 16000 last_change 14544 flags
> hashpspool stripe_width 0
>
> QEMU talks directly to Ceph through librbd; the disk is configured as
> follows:
>
> <disk type='network' device='disk'>
>   <driver name='qemu' type='raw' cache='writeback'/>
>   <auth username='libvirt'>
>     <secret type='ceph' uuid='0d32bxxxyyyzzz47073a965'/>
>   </auth>
>   <source protocol='rbd' name='libvirt-pool/ceph-vm-automated'>
>     <host name='10.XX.YY.1' port='6789'/>
>     <host name='10.XX.YY.2' port='6789'/>
>     <host name='10.XX.YY.2' port='6789'/>
>   </source>
>   <target dev='vda' bus='virtio'/>
>   <alias name='virtio-disk25'/>
>   <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
> </disk>
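One detail in the pool dump above stands out to me (just a guess from a beginner, not a diagnosis): with size 2 and min_size 1, every PG has only two copies, so a whole rack going down leaves many PGs with a single surviving copy while the cluster re-peers. If you have the capacity, it might be worth repeating the test with three replicas, so that two copies survive a rack failure:

# Keep three copies per PG, and require at least two to serve I/O
ceph osd pool set libvirt-pool size 3
ceph osd pool set libvirt-pool min_size 2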
> As mentioned, it's not a real read-only state: I can "touch" files and
> even log in on the affected virtual machines (by the way, all of them
> are affected); however, a simple dd (count=10 bs=1MB conv=fdatasync)
> hangs forever. If a 3 GB file download starts (via wget/curl), it
> usually crashes after the first few hundred megabytes and resumes as
> soon as I power the "failed" rack back on. Everything goes back to
> normal as soon as the rack is powered on again.
>
> For reference, each rack contains 33 nodes, and each node contains 3
> OSDs (1.5 TB each).
>
> On the virtual machines, after recovering the rack, I can see the
> following messages in /var/log/kern.log:
>
> [163800.444146] INFO: task jbd2/vda1-8:135 blocked for more than 120 seconds.
> [163800.444260] Not tainted 3.13.0-55-generic #94-Ubuntu
> [163800.444295] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [163800.444346] jbd2/vda1-8 D ffff88007fd13180 0 135 2 0x00000000
> [163800.444354] ffff880036d3bbd8 0000000000000046 ffff880036a4b000 ffff880036d3bfd8
> [163800.444386] 0000000000013180 0000000000013180 ffff880036a4b000 ffff88007fd13a18
> [163800.444390] ffff88007ffc69d0 0000000000000002 ffffffff811efa80 ffff880036d3bc50
> [163800.444396] Call Trace:
> [163800.444420] [<ffffffff811efa80>] ? generic_block_bmap+0x50/0x50
> [163800.444426] [<ffffffff817279bd>] io_schedule+0x9d/0x140
> [163800.444432] [<ffffffff811efa8e>] sleep_on_buffer+0xe/0x20
> [163800.444437] [<ffffffff81727e42>] __wait_on_bit+0x62/0x90
> [163800.444442] [<ffffffff811efa80>] ? generic_block_bmap+0x50/0x50
> [163800.444447] [<ffffffff81727ee7>] out_of_line_wait_on_bit+0x77/0x90
> [163800.444455] [<ffffffff810ab300>] ? autoremove_wake_function+0x40/0x40
> [163800.444461] [<ffffffff811f0dba>] __wait_on_buffer+0x2a/0x30
> [163800.444470] [<ffffffff8128be4d>] jbd2_journal_commit_transaction+0x185d/0x1ab0
> [163800.444477] [<ffffffff8107562f>] ? try_to_del_timer_sync+0x4f/0x70
> [163800.444484] [<ffffffff8129017d>] kjournald2+0xbd/0x250
> [163800.444490] [<ffffffff810ab2c0>] ? prepare_to_wait_event+0x100/0x100
> [163800.444496] [<ffffffff812900c0>] ? commit_timeout+0x10/0x10
> [163800.444502] [<ffffffff8108b702>] kthread+0xd2/0xf0
> [163800.444507] [<ffffffff8108b630>] ? kthread_create_on_node+0x1c0/0x1c0
> [163800.444513] [<ffffffff81733ca8>] ret_from_fork+0x58/0x90
> [163800.444517] [<ffffffff8108b630>] ? kthread_create_on_node+0x1c0/0x1c0
>
> A few theories for this behavior were mentioned on #ceph (OFTC):
>
> [14:09] <Be-El> RomeroJnr: i think the problem is the fact that you write to parts of the rbd that have not been accessed before
> [14:09] <Be-El> RomeroJnr: ceph does thin provisioning; each rbd is striped into chunks of 4 mb. each stripe is put into one pg
> [14:10] <Be-El> RomeroJnr: if you access formerly unaccessed parts of the rbd, a new stripe is created. and this probably fails if one of the racks is down
> [14:10] <Be-El> RomeroJnr: but that's just a theory...maybe some developer can comment on this later
> [14:21] <Be-El> smerz: creating an object in a pg might be different than writing to an object
> [14:21] <Be-El> smerz: with one rack down ceph cannot satisfy the pg requirements in RomeroJnr's case
> [14:22] <smerz> i can only agree with you. that i would expect other behaviour
>
> The question is: is this behavior indeed expected?
>
> Kind regards,
>
> Romero Junior
> Hosting Engineer
> LeaseWeb Global Services B.V.
> T: +31 20 316 0230
> M: +31 6 2115 9310
> E: r.jun...@global.leaseweb.com
> W: www.leaseweb.com
> Luttenbergweg 8, 1101 EC Amsterdam, Netherlands
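On the striping theory above: you can at least confirm the 4 MB chunk size Be-El mentions directly from the image (the command is standard rbd; the output shown is only an illustration, not your actual image):

rbd info libvirt-pool/ceph-vm-automated
# Expect a line similar to:
#   order 22 (4096 kB objects)
# order 22 means 2^22-byte (4 MB) objects, i.e. the stripe size he refers to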