Oh yeah, for the iscsi fio full write test, did you experiment with bs and numjobs? For just 10 GB over iscsi, I think numjobs > 1 (around 4 is where I stop seeing benefits) and bs < 1 MB (somewhere in the 64K to 256K range) work better.
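Something roughly like this is what I have in mind -- an untested sketch, with the device path and size as placeholders for your setup:

  fio --name=iscsiwrite --filename=/dev/sdX --size=10G \
      --rw=write --bs=128k --numjobs=4 --iodepth=32 \
      --direct=1 --ioengine=libaio --end_fsync=1 --group_reporting

Sweeping --bs through 64k/128k/256k and --numjobs through 1/2/4/8 should show pretty quickly where the target stops scaling.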
On 12/08/2014 05:22 PM, Mike Christie wrote:
> Some distros have LIO setup by default to use 64 for the session wide (default_cmdsn_depth) so increase that (session wide means all LUs accessed through that session will be limited by a total of 64 requests across all LUs).
>
> If you are using linux on the initiator side, increase node.session.queue_depth and node.session.cmds_max.
>
> MTU should be increased too.
>
> Unless you are using a single processor, changing /sys/block/sdX/queue/rq_affinity to 2 might help if you are noticing only one CPU is getting bogged down with all the IO processing.
>
> For the random IO tests, the iscsi initiator is probably at some fault for the poor performance. It does not do well with small IO like 4K. However, there is probably something else in play, because it should not be as low as you are seeing.
>
> For the iscsi+krbd setup, is the storage being used the WD disks in the RAID0 config?
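(Putting rough commands next to Mike's suggestions in case it saves someone a lookup -- the IQN, NIC and device names below are placeholders, not David's setup, and the syntax is from memory so double-check against your versions:

  # LIO target side: per-TPG attribute, often shipped at 64
  targetcli /iscsi/iqn.2014-12.com.example:target1/tpg1 set attribute default_cmdsn_depth=128

  # open-iscsi initiator side, in /etc/iscsi/iscsid.conf (takes effect on next login)
  node.session.cmds_max = 1024
  node.session.queue_depth = 128

  # jumbo frames on both ends of the 10 Gbps link
  ip link set dev eth0 mtu 9000

  # spread block-layer completion work across CPUs
  echo 2 > /sys/block/sdX/queue/rq_affinity
)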
>
> On 12/08/2014 01:21 PM, David Moreau Simard wrote:
>> Not a bad idea.. which reminds me there might be some benefits to toying with the MTU settings as well..
>>
>> I'll check when I have a chance.
>> --
>> David Moreau Simard
>>
>>> On Dec 8, 2014, at 2:13 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>>>
>>> Hi David,
>>>
>>> This is a long shot, but have you checked the max queue depth on the iscsi side. I've got a feeling that LIO might be set at 32 as default.
>>>
>>> This would definitely have an effect at the high queue depths you are testing with.
>>>
>>> On 8 Dec 2014 16:53, David Moreau Simard <dmsim...@iweb.com> wrote:
>>>>
>>>> Haven't tried other iSCSI implementations (yet).
>>>>
>>>> LIO/targetcli makes it very easy to implement/integrate/wrap/automate around so I'm really trying to get this right.
>>>>
>>>> PCI-E SSD cache tier in front of spindles-backed erasure coded pool in 10 Gbps across the board yields results slightly better or very similar to two spindles in hardware RAID-0 with writeback caching.
>>>> With that in mind, the performance is not outright awful by any means, there's just a lot of overhead we have to be reminded about.
>>>>
>>>> What I'd like to further test but am unable to right now is to see what happens if you scale up the cluster. Right now I'm testing on only two nodes.
>>>> Does the IOPS scale linearly with an increasing amount of OSDs/servers ? Or is it more about a capacity thing ?
>>>>
>>>> Perhaps if someone else can chime in, I'm really curious.
>>>> --
>>>> David Moreau Simard
>>>>
>>>>> On Dec 6, 2014, at 11:18 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>>>>>
>>>>> Hi David,
>>>>>
>>>>> Very strange, but I'm glad you managed to finally get the cluster working normally. Thank you for posting the benchmark figures, it's interesting to see the overhead of LIO over pure RBD performance.
>>>>>
>>>>> I should have the hardware for our cluster up and running early next year, I will be in a better position to test the iSCSI performance then. I will report back once I have some numbers.
>>>>>
>>>>> Just out of interest, have you tried any of the other iSCSI implementations to see if they show the same performance drop?
>>>>>
>>>>> Nick
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of David Moreau Simard
>>>>> Sent: 05 December 2014 16:03
>>>>> To: Nick Fisk
>>>>> Cc: ceph-users@lists.ceph.com
>>>>> Subject: Re: [ceph-users] Poor RBD performance as LIO iSCSI target
>>>>>
>>>>> I've flushed everything - data, pools, configs - and reconfigured the whole thing.
>>>>>
>>>>> I was particularly careful with cache tiering configurations (almost leaving defaults when possible) and it's not locking anymore.
>>>>> It looks like the cache tiering configuration I had was causing the problem ? I can't put my finger on exactly what/why and I don't have the luxury of time to do this lengthy testing again.
>>>>>
>>>>> Here's what I dumped as far as config goes before wiping:
>>>>> ========
>>>>> # for var in size min_size pg_num pgp_num crush_ruleset erasure_code_profile; do ceph osd pool get volumes $var; done
>>>>> size: 5
>>>>> min_size: 2
>>>>> pg_num: 7200
>>>>> pgp_num: 7200
>>>>> crush_ruleset: 1
>>>>> erasure_code_profile: ecvolumes
>>>>>
>>>>> # for var in size min_size pg_num pgp_num crush_ruleset hit_set_type hit_set_period hit_set_count target_max_objects target_max_bytes cache_target_dirty_ratio cache_target_full_ratio cache_min_flush_age cache_min_evict_age; do ceph osd pool get volumecache $var; done
>>>>> size: 2
>>>>> min_size: 1
>>>>> pg_num: 7200
>>>>> pgp_num: 7200
>>>>> crush_ruleset: 4
>>>>> hit_set_type: bloom
>>>>> hit_set_period: 3600
>>>>> hit_set_count: 1
>>>>> target_max_objects: 0
>>>>> target_max_bytes: 100000000000
>>>>> cache_target_dirty_ratio: 0.5
>>>>> cache_target_full_ratio: 0.8
>>>>> cache_min_flush_age: 600
>>>>> cache_min_evict_age: 1800
>>>>>
>>>>> # ceph osd erasure-code-profile get ecvolumes
>>>>> directory=/usr/lib/ceph/erasure-code
>>>>> k=3
>>>>> m=2
>>>>> plugin=jerasure
>>>>> ruleset-failure-domain=osd
>>>>> technique=reed_sol_van
>>>>> ========
>>>>>
>>>>> And now:
>>>>> ========
>>>>> # for var in size min_size pg_num pgp_num crush_ruleset erasure_code_profile; do ceph osd pool get volumes $var; done
>>>>> size: 5
>>>>> min_size: 3
>>>>> pg_num: 2048
>>>>> pgp_num: 2048
>>>>> crush_ruleset: 1
>>>>> erasure_code_profile: ecvolumes
>>>>>
>>>>> # for var in size min_size pg_num pgp_num crush_ruleset hit_set_type hit_set_period hit_set_count target_max_objects target_max_bytes cache_target_dirty_ratio cache_target_full_ratio cache_min_flush_age cache_min_evict_age; do ceph osd pool get volumecache $var; done
>>>>> size: 2
>>>>> min_size: 1
>>>>> pg_num: 2048
>>>>> pgp_num: 2048
>>>>> crush_ruleset: 4
>>>>> hit_set_type: bloom
>>>>> hit_set_period: 3600
>>>>> hit_set_count: 1
>>>>> target_max_objects: 0
>>>>> target_max_bytes: 150000000000
>>>>> cache_target_dirty_ratio: 0.5
>>>>> cache_target_full_ratio: 0.8
>>>>> cache_min_flush_age: 0
>>>>> cache_min_evict_age: 1800
>>>>>
>>>>> # ceph osd erasure-code-profile get ecvolumes
>>>>> directory=/usr/lib/ceph/erasure-code
>>>>> k=3
>>>>> m=2
>>>>> plugin=jerasure
>>>>> ruleset-failure-domain=osd
>>>>> technique=reed_sol_van
>>>>> ========
>>>>>
>>>>> Crush map hasn't really changed before and after.
>>>>>
>>>>> FWIW, the benchmarks I pulled out of the setup: https://gist.github.com/dmsimard/2737832d077cfc5eff34
>>>>> Definite overhead going from krbd to krbd + LIO...
>>>>> --
>>>>> David Moreau Simard
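(Side note for anyone replaying this at home: the values that changed between the two dumps are ordinary pool settings and could be applied in place with ceph osd pool set, for example:

  ceph osd pool set volumes min_size 3
  ceph osd pool set volumecache target_max_bytes 150000000000
  ceph osd pool set volumecache cache_min_flush_age 0

The pg_num drop from 7200 to 2048 is the exception -- pg_num can't be reduced on an existing pool, which is presumably part of why everything was recreated from scratch.)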
>>>>>
>>>>>> On Nov 20, 2014, at 4:14 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>>>>>>
>>>>>> Here you go:-
>>>>>>
>>>>>> Erasure Profile
>>>>>> k=2
>>>>>> m=1
>>>>>> plugin=jerasure
>>>>>> ruleset-failure-domain=osd
>>>>>> ruleset-root=hdd
>>>>>> technique=reed_sol_van
>>>>>>
>>>>>> Cache Settings
>>>>>> hit_set_type: bloom
>>>>>> hit_set_period: 3600
>>>>>> hit_set_count: 1
>>>>>> target_max_objects: 0
>>>>>> target_max_bytes: 1000000000
>>>>>> cache_target_dirty_ratio: 0.4
>>>>>> cache_target_full_ratio: 0.8
>>>>>> cache_min_flush_age: 0
>>>>>> cache_min_evict_age: 0
>>>>>>
>>>>>> Crush Dump
>>>>>> # begin crush map
>>>>>> tunable choose_local_tries 0
>>>>>> tunable choose_local_fallback_tries 0
>>>>>> tunable choose_total_tries 50
>>>>>> tunable chooseleaf_descend_once 1
>>>>>>
>>>>>> # devices
>>>>>> device 0 osd.0
>>>>>> device 1 osd.1
>>>>>> device 2 osd.2
>>>>>> device 3 osd.3
>>>>>>
>>>>>> # types
>>>>>> type 0 osd
>>>>>> type 1 host
>>>>>> type 2 chassis
>>>>>> type 3 rack
>>>>>> type 4 row
>>>>>> type 5 pdu
>>>>>> type 6 pod
>>>>>> type 7 room
>>>>>> type 8 datacenter
>>>>>> type 9 region
>>>>>> type 10 root
>>>>>>
>>>>>> # buckets
>>>>>> host ceph-test-hdd {
>>>>>>     id -5    # do not change unnecessarily
>>>>>>     # weight 2.730
>>>>>>     alg straw
>>>>>>     hash 0   # rjenkins1
>>>>>>     item osd.1 weight 0.910
>>>>>>     item osd.2 weight 0.910
>>>>>>     item osd.0 weight 0.910
>>>>>> }
>>>>>> root hdd {
>>>>>>     id -3    # do not change unnecessarily
>>>>>>     # weight 2.730
>>>>>>     alg straw
>>>>>>     hash 0   # rjenkins1
>>>>>>     item ceph-test-hdd weight 2.730
>>>>>> }
>>>>>> host ceph-test-ssd {
>>>>>>     id -6    # do not change unnecessarily
>>>>>>     # weight 1.000
>>>>>>     alg straw
>>>>>>     hash 0   # rjenkins1
>>>>>>     item osd.3 weight 1.000
>>>>>> }
>>>>>> root ssd {
>>>>>>     id -4    # do not change unnecessarily
>>>>>>     # weight 1.000
>>>>>>     alg straw
>>>>>>     hash 0   # rjenkins1
>>>>>>     item ceph-test-ssd weight 1.000
>>>>>> }
>>>>>>
>>>>>> # rules
>>>>>> rule hdd {
>>>>>>     ruleset 0
>>>>>>     type replicated
>>>>>>     min_size 0
>>>>>>     max_size 10
>>>>>>     step take hdd
>>>>>>     step chooseleaf firstn 0 type osd
>>>>>>     step emit
>>>>>> }
>>>>>> rule ssd {
>>>>>>     ruleset 1
>>>>>>     type replicated
>>>>>>     min_size 0
>>>>>>     max_size 4
>>>>>>     step take ssd
>>>>>>     step chooseleaf firstn 0 type osd
>>>>>>     step emit
>>>>>> }
>>>>>> rule ecpool {
>>>>>>     ruleset 2
>>>>>>     type erasure
>>>>>>     min_size 3
>>>>>>     max_size 20
>>>>>>     step set_chooseleaf_tries 5
>>>>>>     step take hdd
>>>>>>     step chooseleaf indep 0 type osd
>>>>>>     step emit
>>>>>> }
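(For reference, a layout like Nick's -- a small EC pool on the hdd root with a writeback cache tier pinned to the ssd rule -- would be built roughly like this. A sketch only: pool names and PG counts are made up, and ruleset-root=hdd / crush_ruleset 1 assume the roots and rules from his dump:

  ceph osd erasure-code-profile set ec21 k=2 m=1 plugin=jerasure ruleset-failure-domain=osd ruleset-root=hdd
  ceph osd pool create ecpool 256 256 erasure ec21
  ceph osd pool create cachepool 256 256
  ceph osd pool set cachepool crush_ruleset 1
  ceph osd tier add ecpool cachepool
  ceph osd tier cache-mode cachepool writeback
  ceph osd tier set-overlay ecpool cachepool
  ceph osd pool set cachepool hit_set_type bloom
  ceph osd pool set cachepool target_max_bytes 1000000000
)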
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of David Moreau Simard
>>>>>> Sent: 20 November 2014 20:03
>>>>>> To: Nick Fisk
>>>>>> Cc: ceph-users@lists.ceph.com
>>>>>> Subject: Re: [ceph-users] Poor RBD performance as LIO iSCSI target
>>>>>>
>>>>>> Nick,
>>>>>>
>>>>>> Can you share more details on the configuration you are using ? I'll try and duplicate those configurations in my environment and see what happens.
>>>>>> I'm mostly interested in:
>>>>>> - Erasure code profile (k, m, plugin, ruleset-failure-domain)
>>>>>> - Cache tiering pool configuration (ex: hit_set_type, hit_set_period, hit_set_count, target_max_objects, target_max_bytes, cache_target_dirty_ratio, cache_target_full_ratio, cache_min_flush_age, cache_min_evict_age)
>>>>>>
>>>>>> The crush rulesets would also be helpful.
>>>>>>
>>>>>> Thanks,
>>>>>> --
>>>>>> David Moreau Simard
>>>>>>
>>>>>>> On Nov 20, 2014, at 12:43 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>> I've just finished running the 75GB fio test you posted a few days back on my new test cluster.
>>>>>>>
>>>>>>> The cluster is as follows:-
>>>>>>>
>>>>>>> Single server with 3x hdd and 1 ssd
>>>>>>> Ubuntu 14.04 with 3.16.7 kernel
>>>>>>> 2+1 EC pool on hdds below a 10G ssd cache pool. SSD is also partitioned to provide journals for hdds.
>>>>>>> 150G RBD mapped locally
>>>>>>>
>>>>>>> The fio test seemed to run without any problems. I want to run a few more tests with different settings to see if I can reproduce your problem. I will let you know if I find anything.
>>>>>>>
>>>>>>> If there is anything you would like me to try, please let me know.
>>>>>>>
>>>>>>> Nick
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of David Moreau Simard
>>>>>>> Sent: 19 November 2014 10:48
>>>>>>> To: Ramakrishna Nishtala (rnishtal)
>>>>>>> Cc: ceph-users@lists.ceph.com; Nick Fisk
>>>>>>> Subject: Re: [ceph-users] Poor RBD performance as LIO iSCSI target
>>>>>>>
>>>>>>> Rama,
>>>>>>>
>>>>>>> Thanks for your reply.
>>>>>>>
>>>>>>> My end goal is to use iSCSI (with LIO/targetcli) to export rbd block devices.
>>>>>>>
>>>>>>> I was encountering issues with iSCSI which are explained in my previous emails.
>>>>>>> I ended up being able to reproduce the problem at will on various kernel and OS combinations, even on raw RBD devices - thus ruling out the hypothesis that it was a problem with iSCSI but rather with Ceph. I'm even running 0.88 now and the issue is still there.
>>>>>>>
>>>>>>> I haven't isolated the issue just yet.
>>>>>>> My next tests involve disabling the cache tiering.
>>>>>>>
>>>>>>> I do have client krbd cache as well, I'll try to disable it too if cache tiering isn't enough.
>>>>>>> --
>>>>>>> David Moreau Simard
>>>>>>>
>>>>>>>> On Nov 18, 2014, at 8:10 PM, Ramakrishna Nishtala (rnishtal) <rnish...@cisco.com> wrote:
>>>>>>>>
>>>>>>>> Hi Dave
>>>>>>>> Did you say iscsi only? The tracker issue does not say though.
>>>>>>>> I am on giant, with both client and ceph on RHEL 7 and it seems to work ok, unless I am missing something here. RBD on bare metal with kmod-rbd and caching disabled.
>>>>>>>>
>>>>>>>> [root@compute4 ~]# time fio --name=writefile --size=100G --filesize=100G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>>>>>>> writefile: (g=0): rw=write, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=200
>>>>>>>> fio-2.1.11
>>>>>>>> Starting 1 process
>>>>>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/853.0MB/0KB /s] [0/853/0 iops] [eta 00m:00s] ...
>>>>>>>> Disk stats (read/write):
>>>>>>>>   rbd0: ios=184/204800, merge=0/0, ticks=70/16164931, in_queue=16164942, util=99.98%
>>>>>>>>
>>>>>>>> real    1m56.175s
>>>>>>>> user    0m18.115s
>>>>>>>> sys     0m10.430s
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Rama
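(Back-of-the-envelope: Rama's run pushes 100 GiB in roughly 116 seconds, i.e. around 880 MB/s sustained, which lines up with the ~853 MB/s fio shows near the end -- so the raw krbd write path on his giant cluster looks perfectly healthy.)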
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of David Moreau Simard
>>>>>>>> Sent: Tuesday, November 18, 2014 3:49 PM
>>>>>>>> To: Nick Fisk
>>>>>>>> Cc: ceph-users@lists.ceph.com
>>>>>>>> Subject: Re: [ceph-users] Poor RBD performance as LIO iSCSI target
>>>>>>>>
>>>>>>>> Testing without the cache tiering is the next test I want to do when I have time..
>>>>>>>>
>>>>>>>> When it's hanging, there is no activity at all on the cluster. Nothing in "ceph -w", nothing in "ceph osd pool stats".
>>>>>>>>
>>>>>>>> I'll provide an update when I have a chance to test without tiering.
>>>>>>>> --
>>>>>>>> David Moreau Simard
>>>>>>>>
>>>>>>>>> On Nov 18, 2014, at 3:28 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>>>>>>>>>
>>>>>>>>> Hi David,
>>>>>>>>>
>>>>>>>>> Have you tried on a normal replicated pool with no cache? I've seen a number of threads recently where caching is causing various things to block/hang.
>>>>>>>>> It would be interesting to see if this still happens without the caching layer, at least it would rule it out.
>>>>>>>>>
>>>>>>>>> Also, is there any sign that as the test passes ~50GB the cache might start flushing to the backing pool, causing slow performance?
>>>>>>>>>
>>>>>>>>> I am planning a deployment very similar to yours so I am following this with great interest. I'm hoping to build a single node test "cluster" shortly, so I might be in a position to work with you on this issue and hopefully get it resolved.
>>>>>>>>>
>>>>>>>>> Nick
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of David Moreau Simard
>>>>>>>>> Sent: 18 November 2014 19:58
>>>>>>>>> To: Mike Christie
>>>>>>>>> Cc: ceph-users@lists.ceph.com; Christopher Spearman
>>>>>>>>> Subject: Re: [ceph-users] Poor RBD performance as LIO iSCSI target
>>>>>>>>>
>>>>>>>>> Thanks guys. I looked at http://tracker.ceph.com/issues/8818 and chatted with "dis" on #ceph-devel.
>>>>>>>>>
>>>>>>>>> I ran a LOT of tests on a LOT of combinations of kernels (sometimes with tunables legacy). I haven't found a magical combination in which the following test does not hang:
>>>>>>>>> fio --name=writefile --size=100G --filesize=100G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>>>>>>>>
>>>>>>>>> Either directly on a mapped rbd device, on a mounted filesystem (over rbd), exported through iSCSI.. nothing.
>>>>>>>>> I guess that rules out a potential issue with iSCSI overhead.
>>>>>>>>>
>>>>>>>>> Now, something I noticed out of pure luck is that I am unable to reproduce the issue if I drop the size of the test to 50GB. Tests will complete in under 2 minutes.
>>>>>>>>> 75GB will hang right at the end and take more than 10 minutes.
>>>>>>>>>
>>>>>>>>> TL;DR of tests:
>>>>>>>>> - 3x fio --name=writefile --size=50G --filesize=50G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>>>>>>>> -- 1m44s, 1m49s, 1m40s
>>>>>>>>>
>>>>>>>>> - 3x fio --name=writefile --size=75G --filesize=75G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>>>>>>>> -- 10m12s, 10m11s, 10m13s
>>>>>>>>>
>>>>>>>>> Details of tests here: http://pastebin.com/raw.php?i=3v9wMtYP
>>>>>>>>>
>>>>>>>>> Does that ring you guys a bell ?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> David Moreau Simard
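(Worth noting in hindsight: with target_max_bytes at 100000000000 and cache_target_dirty_ratio at 0.5 on the cache pool in the dump further up the thread, flushing to the EC pool would be expected to kick in right around the 50GB mark, which lines up suspiciously well with where the runs start to drag. Watching the tier while the test runs should confirm or rule that out -- something like this, using the pool names from that dump:

  watch -n 5 'ceph df detail'        # per-pool usage; recent releases also show dirty object counts
  ceph osd pool stats volumecache    # flush/evict and client IO activity, if any
)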
>>>>>>>>>
>>>>>>>>>> On Nov 13, 2014, at 3:31 PM, Mike Christie <mchri...@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/13/2014 10:17 AM, David Moreau Simard wrote:
>>>>>>>>>>> Running into weird issues here as well in a test environment. I don't have a solution either but perhaps we can find some things in common..
>>>>>>>>>>>
>>>>>>>>>>> Setup in a nutshell:
>>>>>>>>>>> - Ceph cluster: Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 (OSDs with separate public/cluster network in 10 Gbps)
>>>>>>>>>>> - iSCSI Proxy node (targetcli/LIO): Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 (10 Gbps)
>>>>>>>>>>> - Client node: Ubuntu 12.04, Kernel 3.11 (10 Gbps)
>>>>>>>>>>>
>>>>>>>>>>> Relevant cluster config: Writeback cache tiering with NVME PCI-E cards (2 replica) in front of an erasure coded pool (k=3,m=2) backed by spindles.
>>>>>>>>>>>
>>>>>>>>>>> I'm following the instructions here: http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices
>>>>>>>>>>> No issues with creating and mapping a 100GB RBD image and then creating the target.
>>>>>>>>>>>
>>>>>>>>>>> I'm interested in finding out the overhead/performance impact of re-exporting through iSCSI so the idea is to run benchmarks.
>>>>>>>>>>> Here's a fio test I'm trying to run on the client node on the mounted iscsi device:
>>>>>>>>>>> fio --name=writefile --size=100G --filesize=100G --filename=/dev/sdu --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>>>>>>>>>>
>>>>>>>>>>> The benchmark will eventually hang towards the end of the test for some long seconds before completing.
>>>>>>>>>>> On the proxy node, the kernel complains with iscsi portal login timeout: http://pastebin.com/Q49UnTPr and I also see irqbalance errors in syslog: http://pastebin.com/AiRTWDwR
>>>>>>>>>>>
>>>>>>>>>> You are hitting a different issue. German Anders is most likely correct and you hit the rbd hang. That then caused the iscsi/scsi command to timeout which caused the scsi error handler to run. In your logs we see the LIO error handler has received a task abort from the initiator and that timed out, which caused the escalation (iscsi portal login related messages).
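(Not a fix for the underlying rbd stall, but for anyone else hitting the same escalation chain Mike describes: the initiator starts sending task aborts once the per-device SCSI command timeout expires, so raising that timeout on the client can at least keep the error handler from escalating to session teardown during a long stall. A sketch, with sdX being the iSCSI disk on the client:

  cat /sys/block/sdX/device/timeout     # usually 30 seconds by default
  echo 120 > /sys/block/sdX/device/timeout
)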
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com