[ceph-users] Firefly 0.80.9 OSD issues with connect claims to be...wrong node
Hello, seeing issues with OSDs stalling and error messages such as:

2015-06-04 06:48:17.119618 7fc932d59700 0 -- 10.80.4.15:6820/3501 >> 10.80.4.30:6811/3003603 pipe(0xb6b4000 sd=19 :33085 s=1 pgs=311 cs=4 l=0 c=0x915c6e0).connect claims to be 10.80.4.30:6811/4106 not 10.80.4.30:6811/3003603 - wrong node!
2015-06-04 06:48:32.119941 7fc932d59700 0 -- 10.80.4.15:6820/3501 >> 10.80.4.30:6811/3003603 pipe(0xb6b4000 sd=19 :33086 s=1 pgs=311 cs=4 l=0 c=0x915c6e0).connect claims to be 10.80.4.30:6811/4106 not 10.80.4.30:6811/3003603 - wrong node!
2015-06-04 06:48:47.120291 7fc932d59700 0 -- 10.80.4.15:6820/3501 >> 10.80.4.30:6811/3003603 pipe(0xb6b4000 sd=19 :33087 s=1 pgs=311 cs=4 l=0 c=0x915c6e0).connect claims to be 10.80.4.30:6811/4106 not 10.80.4.30:6811/3003603 - wrong node!
2015-06-04 06:49:02.120645 7fc932d59700 0 -- 10.80.4.15:6820/3501 >> 10.80.4.30:6811/3003603 pipe(0xb6b4000 sd=19 :33088 s=1 pgs=311 cs=4 l=0 c=0x915c6e0).connect claims to be 10.80.4.30:6811/4106 not 10.80.4.30:6811/3003603 - wrong node!
2015-06-04 06:49:17.121030 7fc932d59700 0 -- 10.80.4.15:6820/3501 >> 10.80.4.30:6811/3003603 pipe(0xb6b4000 sd=19 :33089 s=1 pgs=311 cs=4 l=0 c=0x915c6e0).connect claims to be 10.80.4.30:6811/4106 not 10.80.4.30:6811/3003603 - wrong node!
2015-06-04 06:49:32.121354 7fc932d59700 0 -- 10.80.4.15:6820/3501 >> 10.80.4.30:6811/3003603 pipe(0xb6b4000 sd=19 :33090 s=1 pgs=311 cs=4 l=0 c=0x915c6e0).connect claims to be 10.80.4.30:6811/4106 not 10.80.4.30:6811/3003603 - wrong node!

There is no IP duplication, and the OSD nodes have multiple IPs: Ceph cluster, Ceph public, and a management IP.

Thank you,
Alex
Re: [ceph-users] New Ceph cluster - cannot add additional monitor
I wonder if your issue is related to: http://tracker.ceph.com/issues/5195

"I had to add the new monitor to the local ceph.conf file and push that with "ceph-deploy --overwrite-conf config push " to all cluster hosts, and I had to issue "ceph mon add " on one of the existing cluster monitors"

Regards,
Alex G
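As a rough sketch of that workaround (the host and monitor names node1/node2/node3/mon3 and the IP are made up for illustration, not from the thread):

ceph-deploy --overwrite-conf config push node1 node2 node3
ceph mon add mon3 10.80.4.13:6789          # run against an existing cluster monitor
ceph-deploy mon add mon3                   # deploy and start the new monitor
ceph quorum_status --format json-pretty    # verify the new monitor joined quorum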
Re: [ceph-users] Combining MON & OSD Nodes
I would not do this. MONs are very important, and any load or stability issues on OSD nodes would interfere with cluster uptime. I found it acceptable to run MONs on virtual machines with local storage. But since MONs oversee OSD nodes, I believe combining them is a recipe for disaster, FWIW.

Regards,
Alex

On Thu, Jun 25, 2015 at 12:37 PM, Shane Gibson wrote:
>
> For a small deployment this might be ok - but as mentioned, mon logging might be an issue. Consider the following:
>
> * disk resources for mon logging (maybe dedicate a disk to logging, to avoid disk IO contention for OSDs)
> * CPU resources - some filesystem types for OSDs can eat a lot of CPU (that's good, they're doing hard work, you're using those resources to gain performance!!)
> * consider memory pressure of both mons and OSDs - filesystem cache in memory is a good thing; are you going to be impacting that by commingling mons?
>
> If you have fairly decent machines with more cores/HTs than OSD disks, you probably don't have a huge CPU issue to worry about (...probably...).
>
> ~~shane
>
> On 6/25/15, 9:23 AM, "ceph-users on behalf of Quentin Hartman" <qhart...@direwolfdigital.com> wrote:
>
> The biggest downside that I've found is the log volume that mons create eats a lot of IO. I was running mons on my OSDs previously, but in my current deployment I've moved them to other hardware and noticed a perceptible load reduction on those nodes that were formerly running mons.
>
> QH
>
> On Thu, Jun 25, 2015 at 10:21 AM, Lazuardi Nasution wrote:
>>
>> Hi,
>>
>> I'm looking for the pros and cons of combining MON and OSD functionality on the same nodes. The mostly recommended configuration is to have dedicated, odd-numbered MON nodes. What I'm thinking of is more like a single-node deployment but consisting of more than one node: if we have 3 nodes, we have 3 MONs with 3 OSDs. Since a MON will only consume small resources, I think MON load will not degrade OSD performance significantly. If we have an odd number of nodes, we can still maintain MON quorum this way. Any idea?
>>
>> Best regards,
[ceph-users] Redundant networks in Ceph
The current network design in Ceph (http://ceph.com/docs/master/rados/configuration/network-config-ref) uses nonredundant networks for both cluster and public communication. Ideally, in a high-load environment these will be 10 or 40+ GbE networks. For cost reasons, most such installations will use the same switch hardware and separate Ceph traffic using VLANs.

Networking is complex, and situations are possible where switches and routers drop traffic. We ran into one of those at one of our sites - connections to hosts stay up (so bonding NICs does not help), yet OSD communication gets disrupted, client IO hangs, and failures cascade to client applications. My understanding is that if OSDs cannot connect for some time over the cluster network, IO will hang and time out.

The document states: "If you specify more than one IP address and subnet mask for either the public or the cluster network, the subnets within the network must be capable of routing to each other." In the real world this means a complicated Layer 3 routing setup and is not practical in many configurations.

What if there was an option for "cluster 2" and "public 2" networks, to which OSDs and MONs would go in either active/backup or active/active mode (cluster 1 and cluster 2 exist separately and do not route to each other)?

The difference between this setup and bonding is that here the decision to fail over and try the other network is made at the OSD/MON level, and it brings resilience to faults within the switch core, which are really only detectable at the application layer.

Am I missing an already existing feature? Please advise.

Best regards,
Alex Gorbachev
Intelligent Systems Services Inc.
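For reference, the single (non-redundant) network split being discussed is configured in ceph.conf roughly as follows; the subnets here are illustrative only:

[global]
public network  = 10.80.4.0/24    # clients and MONs
cluster network = 10.80.8.0/24    # OSD replication and recovery traffic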
Re: [ceph-users] Redundant networks in Ceph
Hi Nick,

Thank you for writing back:

> I think the answer is you do 1 of 2 things. You either design your network so that it is fault tolerant in every way so that network interruption is not possible. Or go with non-redundant networking, but design your crush map around the failure domains of the network.

We'll redesign the network shortly - the general problem is that I am finding it is possible, even in well designed redundant networks, for packet loss to occur for various reasons (maintenance, cables, protocol issues, etc.). So while there is not an interruption (defined as 100% service loss), there may be occasional packet loss and high latency situations, even when the backbone is very fast.

The CRUSH map idea sounds interesting. But there are still concerns, such as massive East-West data relocations (between racks in a leaf-spine architecture such as https://community.mellanox.com/docs/DOC-1475), should there be an outage in the spine. Plus such issues are enormously hard to troubleshoot.

> I'm interested in your example of where OSD's were unable to communicate. What happened? Would it be possible to redesign the network to stop this happening?

Our SuperCore design uses Ceph OSD nodes to provide storage to LIO target iSCSI nodes, which then deliver it to ESXi hosts. LIO is sensitive to hangs, and often we see an RBD hang translate into an iSCSI timeout, which causes ESXi to abort connections, hang, and crash applications. This only happens at one site, where it is likely there is a switch issue somewhere. These issues are sporadic and come and go in storms - so far all Ceph analysis has pointed to network disruptions, from which the RBD client is unable to recover. The network vendor still cannot find anything wrong.

We'll replace the whole network, but having seen such issues at a few other sites, I was thinking whether a "B-bus" for networking would be a good design for OSDs. This approach is commonly used in traditional SANs, where the "A bus" and "B bus" are not connected, so they cannot possibly cross-contaminate in any way.

Another reference is multipathing, where IO can be sent via redundant paths - most storage vendors recommend using application (higher) level multipathing (aka MPIO) over network redundancy (such as bonding). We find this to be a valid recommendation, as clients run into fewer issues. Somewhat related: http://serverfault.com/questions/510882/why-mpio-instead-of-802-3ad-team-for-iscsi - to quote, "MPIO detects and handles path failures, whereas 802.3ad can only compensate for a link failure". I see OSD connections as paths, rather than links, as these are higher level object storage exchanges.

Thank you,
Alex
Re: [ceph-users] Redundant networks in Ceph
Hi Nick,

> I know what you mean, no matter how hard you try something unexpected always happens. That said I think OSD timeouts should be higher than HSRP and spanning tree convergence times, so I think it should survive most incidents that I can think of.

So for high speed networking (40+ GbE) it seems the MLAG/bond solutions are the best option, due to their nonblocking nature and the requirement that Ceph networks stay connected - i.e. a Layer 2 VLAN for ceph public, another Layer 2 VLAN for ceph cluster, and yet another Layer 2 VLAN for iSCSI.

I think what we saw was a physically defective port creating hangs in the switch, or something similar, so there are some timeouts from time to time. ARP proxies and routing issues had created similar incidents in the past at other sites.

> > The CRUSH map idea sounds interesting. But there are still concerns, such as massive data relocations East-West (between racks in a leaf-spine architecture such as https://community.mellanox.com/docs/DOC-1475), should there be an outage in the spine. Plus such issues are enormously hard to troubleshoot.
>
> You can set the maximum crush grouping that will allow OSD's to be marked out. You can use this to stop unwanted data movement from occurring during outages.

Do you have a CRUSH map example by any chance?

> Ah, yeah, been there with LIO and esxi and gave up on it. I found any pause longer than around 10 seconds would send both of them into a death spiral. I know you currently only see it due to some networking blip, but you will most likely also see it when disks fail...etc. For me I couldn't have all my Datastores going down every time something blipped or got a little slow. There are discussions ongoing about it on the Target mailing list and Mike Christie from Redhat is looking into the problem, so hopefully it will get sorted at some point. For what it's worth, both SCST and TGT seem to be immune from this.

Odd thing is I can fail drives, switches, and connections in a lab under a 16-stream workload from two VMs and never get a timeout like this. In our POC cloud environment, though (which does have larger drives and more of them, and 9 VM hosts in 2 clusters vs. 2 VM hosts in 1 cluster for the lab), we do see these "abort-APD-PDL" storms that propagate to hostd hangs and all kinds of unpleasant consequences. I saw many patches slated for kernel 4.2 on the target-devel list and I have provided a lot of diagnostic data there, but can see RBD hangs at times in osdc like this:

root@roc-4r-scd214:/sys/kernel/debug/ceph/8d5c925a-f6b9-4064-9ea7-f4770eca7247.client1615259# cat osdc
22143 osd11 11.e4c3492 rbd_data.319f32ae8944a.000b read
23771 osd31 21.dda0f5af rbd_data.b6ce62ae8944a.e7a8 set-alloc-hint,write
23782 osd31 11.e505a6ea rbd_data.3dd222ae8944a.00c9 read
26228 osd2 11.ec37db43 rbd_data.319f32ae8944a.0006 read
26260 osd31 21.dda0f5af rbd_data.b6ce62ae8944a.e7a8 set-alloc-hint,write
26338 osd31 11.e505a6ea rbd_data.3dd222ae8944a.00c9 read
root@roc-4r-scd214:/sys/kernel/debug/ceph/8d5c925a-f6b9-4064-9ea7-f4770eca7247.client1615259#

So the ongoing discussion is: "is this Ceph being slow or LIO being not resilient enough".

Ref:
http://www.spinics.net/lists/target-devel/msg09311.html
http://www.spinics.net/lists/target-devel/msg09682.html

And especially for discussion about allowing iSCSI login during another hang:
http://www.spinics.net/lists/target-devel/msg09687.html and http://www.spinics.net/lists/target-devel/msg09688.html

> > We'll replace the whole network, but I was thinking, having seen such issues at a few other sites, if a "B-bus" for networking would be a good design for OSDs. This approach is commonly used in traditional SANs, where the "A bus" and "B bus" are not connected, so they cannot possibly cross contaminate in any way.
>
> Probably implementing something like multipathTCP would be the best bet to mirror the traditional dual fabric SAN design.

Assuming http://www.multipath-tcp.org/ and http://lwn.net/Articles/544399/ - looks very interesting.
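A note on the "maximum crush grouping" Nick mentions: this appears to correspond to the mon_osd_down_out_subtree_limit option, which stops Ceph from automatically marking out an entire CRUSH subtree at once. A hedged example (value illustrative only):

[mon]
# do not auto-mark-out OSDs when a whole host (or larger unit) goes down,
# so a node or switch outage does not trigger mass data movement
mon osd down out subtree limit = host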
[ceph-users] OSD crashes
Hello, we are experiencing severe OSD timeouts, OSDs are not taken out and we see the following in syslog on Ubuntu 14.04.2 with Firefly 0.80.9. Thank you for any advice. Alex Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.261899] BUG: unable to handle kernel paging request at 0019001c Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.261923] IP: [] find_get_entries+0x66/0x160 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.261941] PGD 1035954067 PUD 0 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.261955] Oops: [#1] SMP Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.261969] Modules linked in: xfs libcrc32c ipmi_ssif intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp co retemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac edac_core lpc_ich joy dev mei_me mei ioatdma wmi 8021q ipmi_si garp 8250_fintek mrp ipmi_msghandler stp llc bonding mac_hid lp parport mlx4_en vxlan ip6_udp_tunnel udp_tunnel hid_ generic usbhid hid igb ahci mpt2sas mlx4_core i2c_algo_bit libahci dca raid_class ptp scsi_transport_sas pps_core arcmsr Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262182] CPU: 10 PID: 8711 Comm: ceph-osd Not tainted 4.1.0-040100-generic #201506220235 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262197] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a 12/05/2013 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262215] task: 8800721f1420 ti: 880fbad54000 task.ti: 880fbad54000 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262229] RIP: 0010:[] [] find_get_entries+0x66/0x160 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262248] RSP: 0018:880fbad571a8 EFLAGS: 00010246 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262258] RAX: 880004000158 RBX: 000e RCX: Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262303] RDX: 880004000158 RSI: 880fbad571c0 RDI: 0019 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262347] RBP: 880fbad57208 R08: 00c0 R09: 00ff Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262391] R10: R11: 0220 R12: 00b6 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262435] R13: 880fbad57268 R14: 000a R15: 880fbad572d8 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262479] FS: 7f98cb0e0700() GS:88103f48() knlGS: Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262524] CS: 0010 DS: ES: CR0: 80050033 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262551] CR2: 0019001c CR3: 001034f0e000 CR4: 000407e0 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262596] Stack: Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262618] 880fbad571f8 880cf6076b30 880bdde05da8 00e6 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262669] 0100 880cf6076b28 00b5 880fbad57258 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262721] 880fbad57258 880fbad572d8 880cf6076b28 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262772] Call Trace: Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262801] [] pagevec_lookup_entries+0x22/0x30 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262831] [] truncate_inode_pages_range+0xf4/0x700 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262862] [] truncate_inode_pages+0x15/0x20 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262891] [] truncate_inode_pages_final+0x5f/0xa0 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262949] [] xfs_fs_evict_inode+0x3c/0xe0 [xfs] Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262981] [] evict+0xb8/0x190 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263009] [] dispose_list+0x41/0x50 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263037] [] prune_icache_sb+0x4f/0x60 Jul 3 03:42:06 roc-4r-sca020 kernel: 
[554036.263067] [] super_cache_scan+0x155/0x1a0 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263096] [] do_shrink_slab+0x13f/0x2c0 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263126] [] ? shrink_lruvec+0x330/0x370 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263157] [] ? isolate_migratepages_block+0x299/0x5c0 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263188] [] shrink_slab+0xd8/0x110 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263217] [] shrink_zone+0x2cf/0x300 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263246] [] ? compact_zone+0x7d/0x4f0 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263275] [] shrink_zones+0x104/0x2a0 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263304] [] ? compact_zone_order+0x5d/0x70 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263336] [] ? ktime_get+0x46/0xb0 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263365] [] do_try_to_free_pages+0xd7/0x160 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263396] [] try_to_free_pages+0xb7/0x170 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263427] [] __alloc_pages_nodemask+0
Re: [ceph-users] OSD crashes
Thanks Jan. /proc/sys/vm/min_free_kbytes was set to 32M; I set it to 256M, with the system having 64 GB RAM. Also, my swappiness was set to 0 - no problems in lab tests, but I wonder if we hit some limit during 24/7 OSD operation. I will update after some days of running with these parameters.

Best regards,
Alex

On Fri, Jul 3, 2015 at 6:27 AM, Jan Schermer wrote:
> What's the value of /proc/sys/vm/min_free_kbytes on your system? Increase it to 256M (better do it if there's lots of free memory) and see if it helps.
> It can also be set too high, hard to find any formula how to set it correctly...
>
> Jan
>
> On 03 Jul 2015, at 10:16, Alex Gorbachev wrote:
>
> Hello, we are experiencing severe OSD timeouts, OSDs are not taken out and we see the following in syslog on Ubuntu 14.04.2 with Firefly 0.80.9.
>
> [...]
Re: [ceph-users] Block Storage Image Creation Process
Hi Jiwan,

On Sat, Jul 11, 2015 at 4:44 PM, Jiwan N wrote:
> Hi Ceph-Users,
>
> I am quite new to Ceph Storage (storage tech in general). I have been investigating Ceph to understand the precise process clearly.
>
> *Q: What actually happens when I create a block image of a certain size?*
>
> The ceph documentation at http://docs.ceph.com/docs/v0.67.9/man/8/rbd/ says the block devices (images) are striped over objects and stored in a RADOS object store.
>
> This is the process of storage allocation. So, do the objects that the image is broken into allocate space (themselves) where they reside? Or is this object (state and behavior) information kept (scattered) across the OSDs (eventually on the disks)? Or does the monitor server keep this info and allocate those positions (locations) on the object storage disks so that later on some data can be put into those locations?
>
> Any precise documentation or explanation will help.

I found this description of the internal architecture most helpful: http://ceph.com/docs/master/architecture/

Best regards,
Alex

> --
> Sincerely,
> Jiwan Ninglekhu
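To see the striping the rbd man page describes, one can create an image and inspect its objects; a minimal sketch with arbitrary pool/image names:

rbd create --size 10240 rbd/testimg   # 10 GB image, 4 MB objects by default (order 22)
rbd info rbd/testimg                  # shows the object size and block name prefix
rados -p rbd ls | head                # backing objects appear only as data is written (thin provisioning)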
Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster
FWIW. Based on the excellent research by Mark Nelson ( http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/) we have dropped SSD journals altogether, and instead went for the battery protected controller writeback cache. Benefits: - No negative force multiplier with one SSD failure taking down multiple OSDs - OSD portability: move OSD drives across nodes - OSD recovery: stick them into a surviving OSD node and they keep working I agree on size=3, seems to be safest in all situations. Regards, Alex On Thu, Jul 9, 2015 at 6:38 PM, Quentin Hartman < qhart...@direwolfdigital.com> wrote: > So, I was running with size=2, until we had a network interface on an > OSD node go faulty, and start corrupting data. Because ceph couldn't tell > which copy was right it caused all sorts of trouble. I might have been able > to recover more gracefully had I caught the problem sooner and been able to > identify the root right away, but as it was, we ended up labeling every VM > in the cluster suspect destroying the whole thing and restoring from > backups. I didn't end up managing to find the root of the problem until I > was rebuilding the cluster and noticed one node "felt weird" when I was > ssh'd into it. It was painful. > > We are currently running "important" vms from a ceph pool with size=3, and > more disposable ones from a size=2 pool, and that seems to be a reasonable > tradeoff so far, giving us a bit more IO overhead tha nwe would have > running 3 for everything, but still having safety where we need it. > > QH > > On Thu, Jul 9, 2015 at 3:46 PM, Götz Reinicke < > goetz.reini...@filmakademie.de> wrote: > >> Hi Warren, >> >> thanks for that feedback. regarding the 2 or 3 copies we had a lot of >> internal discussions and lots of pros and cons on 2 and 3 :) … and finally >> decided to give 2 copies in the first - now called evaluation cluster - a >> chance to prove. >> >> I bet in 2016 we will see, if that was a good decision or bad and data >> los is in that scenario ok. We evaluate. :) >> >> Regarding one P3700 for 12 SATA disks I do get it right, that if that >> P3700 fails all 12 OSDs are lost… ? So that looks like a bigger risk to me >> from my current knowledge. Or are the P3700 so much more reliable than the >> eg. S3500 or S3700? >> >> Or is the suggestion with the P3700 if we go in the direction of 20+ >> nodes and till than stay without SSDs for journaling. >> >> I really appreciate your thoughts and feedback and I’m aware of the fact >> that building a ceph cluster is some sort of knowing the specs, >> configuration option, math, experience, modification and feedback from best >> practices real world clusters. Finally all clusters are unique in some way >> and what works for one will not work for an other. >> >> Thanks for feedback, 100 kowtows . Götz >> >> >> >> > Am 09.07.2015 um 16:58 schrieb Wang, Warren < >> warren_w...@cable.comcast.com>: >> > >> > You'll take a noticeable hit on write latency. Whether or not it's >> tolerable will be up to you and the workload you have to capture. Large >> file operations are throughput efficient without an SSD journal, as long as >> you have enough spindles. >> > >> > About the Intel P3700, you will only need 1 to keep up with 12 SATA >> drives. The 400 GB is probably okay if you keep the journal sizes small, >> but the 800 is probably safer if you plan on leaving these in production >> for a few years. Depends on the turnover of data on the servers. 
>> > >> > The dual disk failure comment is pointing out that you are more exposed >> for data loss with 2 copies. You do need to understand that there is a >> possibility for 2 drives to fail either simultaneously, or one before the >> cluster is repaired. As usual, this is going to be a decision you need to >> decide if it's acceptable or not. We have many clusters, and some are 2, >> and others are 3. If your data resides nowhere else, then 3 copies is the >> safe thing to do. That's getting harder and harder to justify though, when >> the price of other storage solutions using erasure coding continues to >> plummet. >> > >> > Warren >> > >> > -Original Message- >> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf >> Of Götz Reinicke - IT Koordinator >> > Sent: Thursday, July 09, 2015 4:47 AM >> > To: ceph-users@lists.ceph.com >> > Subject: Re: [ceph-users] Real world benefit from SSD Journals for a >> more read than write cluster >> > >> > Hi Christian, >> > Am 09.07.15 um 09:36 schrieb Christian Balzer: >> >> >> >> Hello, >> >> >> >> On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator >> wrote: >> >> >> >>> Hi again, >> >>> >> >>> time is passing, so is my budget :-/ and I have to recheck the >> >>> options for a "starter" cluster. An expansion next year for may be an >> >>> openstack installation or more performance if the demands rise is >> >>> possible. The "starter" could always be used as test or slow dark >> archive.
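For completeness on the size=2 vs size=3 discussion, the replica count is a per-pool setting that can be checked and changed at any time; a sketch using a pool named rbd as an example:

ceph osd pool get rbd size
ceph osd pool set rbd size 3        # keep three copies
ceph osd pool set rbd min_size 2    # block IO if fewer than two copies are available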
Re: [ceph-users] Deadly slow Ceph cluster revisited
May I suggest checking also the error counters on your network switch? Check speed and duplex. Is bonding in use? Is flow control on? Can you swap the network cable? Can you swap a NIC with another node and does the problem follow? Hth, Alex On Friday, July 17, 2015, Steve Thompson wrote: > On Fri, 17 Jul 2015, J David wrote: > > f16 inbound: 6Gbps >> f16 outbound: 6Gbps >> f17 inbound: 6Gbps >> f17 outbound: 6Gbps >> f18 inbound: 6Gbps >> f18 outbound: 1.2Mbps >> > > Unless the network was very busy when you did this, I think that 6 Gb/s > may not be very good either. Usually iperf will give you much more than > that. For example, between two of my OSD's, I get 9.4 Gb/s, or up to 9.9 > Gb/s when nothing else is happening. > > Steve > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
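A sketch of those checks on an OSD or client node, assuming a hypothetical interface eth2 and peer 10.0.0.2:

ethtool eth2                              # negotiated speed and duplex
ethtool -S eth2 | grep -iE 'err|drop'     # NIC error and drop counters
cat /proc/net/bonding/bond0               # bonding state, if bonding is used
iperf -s                                  # on the peer node
iperf -c 10.0.0.2 -P 4 -t 30              # multi-stream throughput test from this node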
Re: [ceph-users] OSD crashes
We have been error-free for almost 3 weeks now. The following settings were changed on all OSD nodes:

vm.swappiness=1
vm.min_free_kbytes=262144

My discussion on the XFS list is here: http://www.spinics.net/lists/xfs/msg33645.html

Thanks,
Alex

On Fri, Jul 3, 2015 at 6:27 AM, Jan Schermer wrote:
> What's the value of /proc/sys/vm/min_free_kbytes on your system? Increase it to 256M (better do it if there's lots of free memory) and see if it helps.
> It can also be set too high, hard to find any formula how to set it correctly...
>
> Jan
>
> On 03 Jul 2015, at 10:16, Alex Gorbachev wrote:
>
> Hello, we are experiencing severe OSD timeouts, OSDs are not taken out and we see the following in syslog on Ubuntu 14.04.2 with Firefly 0.80.9.
>
> [...]
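For anyone applying the same change, these are plain sysctls; applying them live and persisting them across reboots looks roughly like this:

sysctl -w vm.swappiness=1
sysctl -w vm.min_free_kbytes=262144

# /etc/sysctl.conf (or a file under /etc/sysctl.d/)
vm.swappiness = 1
vm.min_free_kbytes = 262144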
Re: [ceph-users] How to improve single thread sequential reads?
Hi Nick, On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk wrote: >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Nick Fisk >> Sent: 13 August 2015 18:04 >> To: ceph-users@lists.ceph.com >> Subject: [ceph-users] How to improve single thread sequential reads? >> >> Hi, >> >> I'm trying to use a RBD to act as a staging area for some data before > pushing >> it down to some LTO6 tapes. As I cannot use striping with the kernel > client I >> tend to be maxing out at around 80MB/s reads testing with DD. Has anyone >> got any clever suggestions of giving this a bit of a boost, I think I need > to get it >> up to around 200MB/s to make sure there is always a steady flow of data to >> the tape drive. > > I've just tried the testing kernel with the blk-mq fixes in it for full size > IO's, this combined with bumping readahead up to 4MB, is now getting me on > average 150MB/s to 200MB/s so this might suffice. > > On a personal interest, I would still like to know if anyone has ideas on > how to really push much higher bandwidth through a RBD. Some settings in our ceph.conf that may help: osd_op_threads = 20 osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k filestore_queue_max_ops = 9 filestore_flusher = false filestore_max_sync_interval = 10 filestore_sync_flush = false Regards, Alex > >> >> Rbd-fuse seems to top out at 12MB/s, so there goes that option. >> >> I'm thinking mapping multiple RBD's and then combining them into a mdadm >> RAID0 stripe might work, but seems a bit messy. >> >> Any suggestions? >> >> Thanks, >> Nick >> > > > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
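The readahead bump mentioned above is a per-block-device setting; a sketch assuming the RBD is mapped as /dev/rbd0 (device name hypothetical):

blockdev --getra /dev/rbd0                        # current readahead, in 512-byte sectors
blockdev --setra 8192 /dev/rbd0                   # 8192 sectors = 4 MB
echo 4096 > /sys/block/rbd0/queue/read_ahead_kb   # equivalent setting expressed in KB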
Re: [ceph-users] any recommendation of using EnhanceIO?
What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change device name, unlike bcache, flashcache etc. Best regards, Alex On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz wrote: > I did some (non-ceph) work on these, and concluded that bcache was the best > supported, most stable, and fastest. This was ~1 year ago, to take it with > a grain of salt, but that's what I would recommend. > > Daniel > > > > From: "Dominik Zalewski" > To: "German Anders" > Cc: "ceph-users" > Sent: Wednesday, July 1, 2015 5:28:10 PM > Subject: Re: [ceph-users] any recommendation of using EnhanceIO? > > > Hi, > > I’ve asked same question last weeks or so (just search the mailing list > archives for EnhanceIO :) and got some interesting answers. > > Looks like the project is pretty much dead since it was bought out by HGST. > Even their website has some broken links in regards to EnhanceIO > > I’m keen to try flashcache or bcache (its been in the mainline kernel for > some time) > > Dominik > > On 1 Jul 2015, at 21:13, German Anders wrote: > > Hi cephers, > >Is anyone out there that implement enhanceIO in a production environment? > any recommendation? any perf output to share with the diff between using it > and not? > > Thanks in advance, > > German > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] any recommendation of using EnhanceIO?
HI Jan, On Tue, Aug 18, 2015 at 5:00 AM, Jan Schermer wrote: > I already evaluated EnhanceIO in combination with CentOS 6 (and backported > 3.10 and 4.0 kernel-lt if I remember correctly). > It worked fine during benchmarks and stress tests, but once we run DB2 on it > it panicked within minutes and took all the data with it (almost literally - > files that werent touched, like OS binaries were b0rked and the filesystem > was unsalvageable). Out of curiosity, were you using EnhanceIO in writeback mode? I assume so, as a read cache should not hurt anything. Thanks, Alex > If you disregard this warning - the performance gains weren't that great > either, at least in a VM. It had problems when flushing to disk after > reaching dirty watermark and the block size has some not-well-documented > implications (not sure now, but I think it only cached IO _larger_than the > block size, so if your database keeps incrementing an XX-byte counter it will > go straight to disk). > > Flashcache doesn't respect barriers (or does it now?) - if that's ok for you > than go for it, it should be stable and I used it in the past in production > without problems. > > bcache seemed to work fine, but I needed to > a) use it for root > b) disable and enable it on the fly (doh) > c) make it non-persisent (flush it) before reboot - not sure if that was > possible either. > d) all that in a customer's VM, and that customer didn't have a strong > technical background to be able to fiddle with it... > So I haven't tested it heavily. > > Bcache should be the obvious choice if you are in control of the environment. > At least you can cry on LKML's shoulder when you lose data :-) > > Jan > > >> On 18 Aug 2015, at 01:49, Alex Gorbachev wrote: >> >> What about https://github.com/Frontier314/EnhanceIO? Last commit 2 >> months ago, but no external contributors :( >> >> The nice thing about EnhanceIO is there is no need to change device >> name, unlike bcache, flashcache etc. >> >> Best regards, >> Alex >> >> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz wrote: >>> I did some (non-ceph) work on these, and concluded that bcache was the best >>> supported, most stable, and fastest. This was ~1 year ago, to take it with >>> a grain of salt, but that's what I would recommend. >>> >>> Daniel >>> >>> >>> >>> From: "Dominik Zalewski" >>> To: "German Anders" >>> Cc: "ceph-users" >>> Sent: Wednesday, July 1, 2015 5:28:10 PM >>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO? >>> >>> >>> Hi, >>> >>> I’ve asked same question last weeks or so (just search the mailing list >>> archives for EnhanceIO :) and got some interesting answers. >>> >>> Looks like the project is pretty much dead since it was bought out by HGST. >>> Even their website has some broken links in regards to EnhanceIO >>> >>> I’m keen to try flashcache or bcache (its been in the mainline kernel for >>> some time) >>> >>> Dominik >>> >>> On 1 Jul 2015, at 21:13, German Anders wrote: >>> >>> Hi cephers, >>> >>> Is anyone out there that implement enhanceIO in a production environment? >>> any recommendation? any perf output to share with the diff between using it >>> and not? 
>>> >>> Thanks in advance, >>> >>> German >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >>> >>> >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >>> >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] any recommendation of using EnhanceIO?
> IE, should we be focusing on IOPS? Latency? Finding a way to avoid journal overhead for large writes? Are there specific use cases where we should specifically be focusing attention? General iSCSI? S3? Databases directly on RBD? etc. There's tons of different areas that we can work on (general OSD threading improvements, different messenger implementations, newstore, client side bottlenecks, etc) but all of those things tackle different kinds of problems.

Mark, my take is definitely write latency. Based on this discussion, there is no real safe solution for write caching outside Ceph.
Re: [ceph-users] Bad performances in recovery
> > Just to update the mailing list, we ended up going back to default > ceph.conf without any additional settings than what is mandatory. We are > now reaching speeds we never reached before, both in recovery and in > regular usage. There was definitely something we set in the ceph.conf > bogging everything down. Could you please share the old and new ceph.conf, or the section that was removed? Best regards, Alex > > > On 2015-08-20 4:06 AM, Christian Balzer wrote: >> >> Hello, >> >> from all the pertinent points by Somnath, the one about pre-conditioning >> would be pretty high on my list, especially if this slowness persists and >> nothing else (scrub) is going on. >> >> This might be "fixed" by doing a fstrim. >> >> Additionally the levelDB's per OSD are of course sync'ing heavily during >> reconstruction, so that might not be the favorite thing for your type of >> SSDs. >> >> But ultimately situational awareness is very important, as in "what" is >> actually going and slowing things down. >> As usual my recommendations would be to use atop, iostat or similar on all >> your nodes and see if your OSD SSDs are indeed the bottleneck or if it is >> maybe just one of them or something else entirely. >> >> Christian >> >> On Wed, 19 Aug 2015 20:54:11 + Somnath Roy wrote: >> >>> Also, check if scrubbing started in the cluster or not. That may >>> considerably slow down the cluster. >>> >>> -Original Message- >>> From: Somnath Roy >>> Sent: Wednesday, August 19, 2015 1:35 PM >>> To: 'J-P Methot'; ceph-us...@ceph.com >>> Subject: RE: [ceph-users] Bad performances in recovery >>> >>> All the writes will go through the journal. >>> It may happen your SSDs are not preconditioned well and after a lot of >>> writes during recovery IOs are stabilized to lower number. This is quite >>> common for SSDs if that is the case. >>> >>> Thanks & Regards >>> Somnath >>> >>> -Original Message- >>> From: J-P Methot [mailto:jpmet...@gtcomm.net] >>> Sent: Wednesday, August 19, 2015 1:03 PM >>> To: Somnath Roy; ceph-us...@ceph.com >>> Subject: Re: [ceph-users] Bad performances in recovery >>> >>> Hi, >>> >>> Thank you for the quick reply. However, we do have those exact settings >>> for recovery and it still strongly affects client io. I have looked at >>> various ceph logs and osd logs and nothing is out of the ordinary. >>> Here's an idea though, please tell me if I am wrong. >>> >>> We use intel SSDs for journaling and samsung SSDs as proper OSDs. As was >>> explained several times on this mailing list, Samsung SSDs suck in ceph. >>> They have horrible O_dsync speed and die easily, when used as journal. >>> That's why we're using Intel ssds for journaling, so that we didn't end >>> up putting 96 samsung SSDs in the trash. >>> >>> In recovery though, what is the ceph behaviour? What kind of write does >>> it do on the OSD SSDs? Does it write directly to the SSDs or through the >>> journal? >>> >>> Additionally, something else we notice: the ceph cluster is MUCH slower >>> after recovery than before. Clearly there is a bottleneck somewhere and >>> that bottleneck does not get cleared up after the recovery is done. >>> >>> >>> On 2015-08-19 3:32 PM, Somnath Roy wrote: If you are concerned about *client io performance* during recovery, use these settings.. osd recovery max active = 1 osd max backfills = 1 osd recovery threads = 1 osd recovery op priority = 1 If you are concerned about *recovery performance*, you may want to bump this up, but I doubt it will help much from default settings.. 
Thanks & Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of J-P Methot Sent: Wednesday, August 19, 2015 12:17 PM To: ceph-us...@ceph.com Subject: [ceph-users] Bad performances in recovery Hi, Our setup is currently comprised of 5 OSD nodes with 12 OSD each, for a total of 60 OSDs. All of these are SSDs with 4 SSD journals on each. The ceph version is hammer v0.94.1 . There is a performance overhead because we're using SSDs (I've heard it gets better in infernalis, but we're not upgrading just yet) but we can reach numbers that I would consider "alright". Now, the issue is, when the cluster goes into recovery it's very fast at first, but then slows down to ridiculous levels as it moves forward. You can go from 7% to 2% to recover in ten minutes, but it may take 2 hours to recover the last 2%. While this happens, the attached openstack setup becomes incredibly slow, even though there is only a small fraction of objects still recovering (less than 1%). The settings that may affect recovery speed are very low, as they are by default, yet they still affect client io speed way more than it should. Why would ceph recovery become so slow as it progress and affect cl
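As an aside, the recovery throttles quoted earlier in the thread can be changed on a running cluster without restarting OSDs; a sketch using the conservative values Somnath lists:

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'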
[ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs
Hello, this is an issue we have been suffering from and researching along with a good number of other Ceph users, as evidenced by the recent posts. In our specific case, these issues manifest themselves in a RBD -> iSCSI LIO -> ESXi configuration, but the problem is more general. When there is an issue on OSD nodes (examples: network hangs/blips, disk HBAs failing, driver issues, page cache/XFS issues), some OSDs respond slowly or with significant delays. ceph osd perf does not show this, neither does ceph osd tree, ceph -s / ceph -w. Instead, the RBD IO hangs to a point where the client times out, crashes or displays other unsavory behavior - operationally this crashes production processes. Today in our lab we had a disk controller issue, which brought an OSD node down. Upon restart, the OSDs started up and rejoined into the cluster. However, immediately all IOs started hanging for a long time and aborts from ESXi -> LIO were not succeeding in canceling these IOs. The only warning I could see was: root@lab2-mon1:/var/log/ceph# ceph health detail HEALTH_WARN 30 requests are blocked > 32 sec; 1 osds have slow requests 30 ops are blocked > 2097.15 sec 30 ops are blocked > 2097.15 sec on osd.4 1 osds have slow requests However, ceph osd perf is not showing high latency on osd 4: root@lab2-mon1:/var/log/ceph# ceph osd perf osd fs_commit_latency(ms) fs_apply_latency(ms) 0 0 13 1 00 2 00 3 172 208 4 00 5 00 6 01 7 00 8 174 819 9 6 10 10 01 11 01 12 35 13 01 14 7 23 15 01 16 00 17 59 18 01 1910 18 20 00 21 00 22 01 23 5 10 SMART state for osd 4 disk is OK. The OSD in up and in: root@lab2-mon1:/var/log/ceph# ceph osd tree ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY -80 root ssd -7 14.71997 root platter -3 7.12000 host croc3 22 0.89000 osd.22 up 1.0 1.0 15 0.89000 osd.15 up 1.0 1.0 16 0.89000 osd.16 up 1.0 1.0 13 0.89000 osd.13 up 1.0 1.0 18 0.89000 osd.18 up 1.0 1.0 8 0.89000 osd.8 up 1.0 1.0 11 0.89000 osd.11 up 1.0 1.0 20 0.89000 osd.20 up 1.0 1.0 -4 0.47998 host croc2 10 0.06000 osd.10 up 1.0 1.0 12 0.06000 osd.12 up 1.0 1.0 14 0.06000 osd.14 up 1.0 1.0 17 0.06000 osd.17 up 1.0 1.0 19 0.06000 osd.19 up 1.0 1.0 21 0.06000 osd.21 up 1.0 1.0 9 0.06000 osd.9 up 1.0 1.0 23 0.06000 osd.23 up 1.0 1.0 -2 7.12000 host croc1 7 0.89000 osd.7 up 1.0 1.0 2 0.89000 osd.2 up 1.0 1.0 6 0.89000 osd.6 up 1.0 1.0 1 0.89000 osd.1 up 1.0 1.0 5 0.89000 osd.5 up 1.0 1.0 0 0.89000 osd.0 up 1.0 1.0 4 0.89000 osd.4 up 1.0 1.0 3 0.89000 osd.3 up 1.0 1.0 How can we proactively detect this condition? Is there anything I can run that will output all slow OSDs? Regards, Alex ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
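One way to get more per-OSD detail than "ceph osd perf" is the OSD admin socket; a sketch using osd.4 from the output above (run on the node hosting osd.4):

ceph daemon osd.4 dump_ops_in_flight    # requests currently blocked in this OSD
ceph daemon osd.4 dump_historic_ops     # recent slowest ops with per-stage timings
# or via the socket path directly:
ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok dump_historic_ops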
Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs
> This can be tuned in the iSCSI initiation on VMware - look in advanced > settings on your ESX hosts (at least if you use the software initiator). Thanks, Jan. I asked this question of Vmware as well, I think the problem is specific to a given iSCSI session, so wondering if that's strictly the job of the target? Do you know of any specific SCSI settings that mitigate this kind of issue? Basically, give up on a session and terminate it and start a new one should an RBD not respond? As I understand, RBD simply never gives up. If an OSD does not respond but is still technically up and in, Ceph will retry IOs forever. I think RBD and Ceph need a timeout mechanism for this. Best regards, Alex > Jan > > >> On 23 Aug 2015, at 21:28, Nick Fisk wrote: >> >> Hi Alex, >> >> Currently RBD+LIO+ESX is broken. >> >> The problem is caused by the RBD device not handling device aborts properly >> causing LIO and ESXi to enter a death spiral together. >> >> If something in the Ceph cluster causes an IO to take longer than 10 >> seconds(I think!!!) ESXi submits an iSCSI abort message. Once this happens, >> as you have seen it never recovers. >> >> Mike Christie from Redhat is doing a lot of work on this currently, so >> hopefully in the future there will be a direct RBD interface into LIO and it >> will all work much better. >> >> Either tgt or SCST seem to be pretty stable in testing. >> >> Nick >> >>> -Original Message- >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >>> Alex Gorbachev >>> Sent: 23 August 2015 02:17 >>> To: ceph-users >>> Subject: [ceph-users] Slow responding OSDs are not OUTed and cause RBD >>> client IO hangs >>> >>> Hello, this is an issue we have been suffering from and researching along >>> with a good number of other Ceph users, as evidenced by the recent posts. >>> In our specific case, these issues manifest themselves in a RBD -> iSCSI >> LIO -> >>> ESXi configuration, but the problem is more general. >>> >>> When there is an issue on OSD nodes (examples: network hangs/blips, disk >>> HBAs failing, driver issues, page cache/XFS issues), some OSDs respond >>> slowly or with significant delays. ceph osd perf does not show this, >> neither >>> does ceph osd tree, ceph -s / ceph -w. Instead, the RBD IO hangs to a >> point >>> where the client times out, crashes or displays other unsavory behavior - >>> operationally this crashes production processes. >>> >>> Today in our lab we had a disk controller issue, which brought an OSD node >>> down. Upon restart, the OSDs started up and rejoined into the cluster. >>> However, immediately all IOs started hanging for a long time and aborts >> from >>> ESXi -> LIO were not succeeding in canceling these IOs. The only warning >> I >>> could see was: >>> >>> root@lab2-mon1:/var/log/ceph# ceph health detail HEALTH_WARN 30 >>> requests are blocked > 32 sec; >>> 1 osds have slow requests 30 ops are blocked > 2097.15 sec >>> 30 ops are blocked > 2097.15 sec on osd.4 >>> 1 osds have slow requests >>> >>> However, ceph osd perf is not showing high latency on osd 4: >>> >>> root@lab2-mon1:/var/log/ceph# ceph osd perf osd fs_commit_latency(ms) >>> fs_apply_latency(ms) >>> 0 0 13 >>> 1 00 >>> 2 00 >>> 3 172 208 >>> 4 00 >>> 5 00 >>> 6 01 >>> 7 00 >>> 8 174 819 >>> 9 6 10 >>> 10 01 >>> 11 01 >>> 12 35 >>> 13 01 >>> 14 7 23 >>> 15 01 >>> 16 00 >>> 17 59 >>> 18 01 >>> 1910 18 >>> 20 00 >>> 21
Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs
HI Jan, On Mon, Aug 24, 2015 at 12:40 PM, Jan Schermer wrote: > I never actually set up iSCSI with VMware, I just had to research various > VMware storage options when we had a SAN-probelm at a former job... But I can > take a look at it again if you want me to. Thank you, I don't want to waste your time as I have asked Vmware TAP to research that - I will communicate back anything with which they respond. > > Is it realy deadlocked when this issue occurs? > What I think is partly responsible for this situation is that the iSCSI LUN > queues fill up and that's what actually kills your IO - VMware lowers queue > depth to 1 in that situation and it can take a really long time to recover > (especially if one of the LUNs on the target constantly has problems, or > when heavy IO hammers the adapter) - you should never fill this queue, ever. > iSCSI will likely be innocent victim in the chain, not the cause of the > issues. Completely agreed, so iSCSI's job then is to properly communicate to the initiator that it cannot do what it is asked to do and quit the IO. > > Ceph should gracefully handle all those situations, you just need to set the > timeouts right. I have it set so that whatever happens the OSD can only delay > work for 40s and then it is marked down - at that moment all IO start flowing > again. What setting in ceph do you use to do that? is that mon_osd_down_out_interval? I think stopping slow OSDs is the answer to the root of the problem - so far I only know to do "ceph osd perf" and look at latencies. > > You should take this to VMware support, they should be able to tell whether > the problem is in iSCSI target (then you can take a look at how that behaves) > or in the initiator settings. Though in my experience after two visits from > their "foremost experts" I had to google everything myself because they were > clueless - YMMV. I am hoping the TAP Elite team can do better...but we'll see... > > The root cause is however slow ops in Ceph, and I have no idea why you'd have > them if the OSDs come back up - maybe one of them is really deadlocked or > backlogged in some way? I found that when OSDs are "dead but up" they don't > respond to "ceph tell osd.xxx ..." so try if they all respond in a timely > manner, that should help pinpoint the bugger. I think I know in this case - there are some PCIe AER/Bus errors and TLP Header messages strewing across the console of one OSD machine - ceph osd perf showing latencies aboce a second per OSD, but only when IO is done to those OSDs. I am thankful this is not production storage, but worried of this situation in production - the OSDs are staying up and in, but their latencies are slowing clusterwide IO to a crawl. I am trying to envision this situation in production and how would one find out what is slowing everything down without guessing. Regards, Alex > > Jan > > >> On 24 Aug 2015, at 18:26, Alex Gorbachev wrote: >> >>> This can be tuned in the iSCSI initiation on VMware - look in advanced >>> settings on your ESX hosts (at least if you use the software initiator). >> >> Thanks, Jan. I asked this question of Vmware as well, I think the >> problem is specific to a given iSCSI session, so wondering if that's >> strictly the job of the target? Do you know of any specific SCSI >> settings that mitigate this kind of issue? Basically, give up on a >> session and terminate it and start a new one should an RBD not >> respond? >> >> As I understand, RBD simply never gives up. 
If an OSD does not >> respond but is still technically up and in, Ceph will retry IOs >> forever. I think RBD and Ceph need a timeout mechanism for this. >> >> Best regards, >> Alex >> >>> Jan >>> >>> >>>> On 23 Aug 2015, at 21:28, Nick Fisk wrote: >>>> >>>> Hi Alex, >>>> >>>> Currently RBD+LIO+ESX is broken. >>>> >>>> The problem is caused by the RBD device not handling device aborts properly >>>> causing LIO and ESXi to enter a death spiral together. >>>> >>>> If something in the Ceph cluster causes an IO to take longer than 10 >>>> seconds(I think!!!) ESXi submits an iSCSI abort message. Once this happens, >>>> as you have seen it never recovers. >>>> >>>> Mike Christie from Redhat is doing a lot of work on this currently, so >>>> hopefully in the future there will be a direct RBD interface into LIO and >>>> it >>>> will all work much better. >>>> >>>> Either tgt or SCST seem to be pretty stable in te
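For reference, the Hammer-era options that govern how quickly a stalled OSD gets marked down and how long slow ops are tolerated look roughly like the following; the values are illustrative defaults and not the ones Jan described:
# /etc/ceph/ceph.conf -- illustrative values only
[osd]
# peers report an OSD down after this many seconds of missed heartbeats
osd heartbeat grace = 20
# ops older than this are logged as slow requests
osd op complaint time = 30
[mon]
# a down OSD is automatically marked out after this many seconds
mon osd down out interval = 300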
Re: [ceph-users] 1 hour until Ceph Tech Talk
Hi Patrick, On Thu, Aug 27, 2015 at 12:00 PM, Patrick McGarry wrote: > Just a reminder that our Performance Ceph Tech Talk with Mark Nelson > will be starting in 1 hour. > > If you are unable to attend there will be a recording posted on the > Ceph YouTube channel and linked from the page at: > > http://ceph.com/ceph-tech-talks/ > > That is an excellent talk and I am wondering if there's any more info on compiling Ceph with jemalloc. Is that something that you would discourage at the moment or OK to try on test systems? Regards, Alex ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
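For anyone wanting to experiment, the Hammer-era autotools build had a switch for linking against jemalloc instead of tcmalloc; something along these lines (the flag name is from memory, so verify against your source tree before relying on it):
sudo apt-get install libjemalloc-dev
./autogen.sh
./configure --with-jemalloc    # build against jemalloc instead of tcmalloc
make -j"$(nproc)"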
[ceph-users] ESXi/LIO/RBD repeatable problem, hang when cloning VM
e have experienced a repeatable issue when performing the following: Ceph backend with no issues, we can repeat any time at will in lab and production. Cloning an ESXi VM to another VM on the same datastore on which the original VM resides. Practically instantly, the LIO machine becomes unresponsive, Pacemaker fails over to another LIO machine and that too becomes unresponsive. Both running Ubuntu 14.04, kernel 4.1 (4.1.0-040100-generic x86_64), Ceph Hammer 0.94.2, and have been able to take quite a workoad with no issues. output of /var/log/syslog below. I also have a screen dump of a frozen system - attached. Thank you, Alex Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886254] CPU: 22 PID: 18130 Comm: kworker/22:1 Tainted: G C OE 4.1.0-040100-generic #201506220235 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886303] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a 12/05/2013 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886364] Workqueue: xcopy_wq target_xcopy_do_work [target_core_mod] Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886395] task: 8810441c3250 ti: 88105bb4 task.ti: 88105bb4 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886440] RIP: 0010:[] [] sbc_check_prot+0x49/0x210 [target_core_mod] Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886498] RSP: 0018:88105bb43b88 EFLAGS: 00010246 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886525] RAX: 0400 RBX: 8810589eb008 RCX: 0400 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886554] RDX: 8810589eb0f8 RSI: RDI: Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886584] RBP: 88105bb43bc8 R08: R09: 0001 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886613] R10: R11: R12: Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886643] R13: 88084860c000 R14: c02372c0 R15: 0400 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886673] FS: () GS:88105f48() knlGS: Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886719] CS: 0010 DS: ES: CR0: 80050033 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886747] CR2: 0010 CR3: 01e0f000 CR4: 001407e0 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886777] Stack: Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886798] 000b 000c 8810589eb0f8 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886851] 8810589eb008 88084860c000 c02372c0 0400 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886904] 88105bb43c28 c03e528a 000c 0004000c Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886957] Call Trace: Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886989] [] sbc_parse_cdb+0x66a/0xa20 [target_core_mod] Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887022] [] iblock_parse_cdb+0x15/0x20 [target_core_iblock] Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887077] [] target_setup_cmd_from_cdb+0x1c0/0x260 [target_core_mod] Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887133] [] target_xcopy_setup_pt_cmd+0x8d/0x170 [target_core_mod] Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887188] [] target_xcopy_read_source.isra.12+0x126/0x220 [target_core_mod] Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887243] [] ? sched_clock+0x9/0x10 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887279] [] target_xcopy_do_work+0xf1/0x370 [target_core_mod] Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887329] [] ? __switch_to+0x1e6/0x580 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887361] [] process_one_work+0x144/0x490 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887390] [] worker_thread+0x11e/0x460 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887418] [] ? 
create_worker+0x1f0/0x1f0 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887449] [] kthread+0xc9/0xe0 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887477] [] ? flush_kthread_worker+0x90/0x90 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887510] [] ret_from_fork+0x42/0x70 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887538] [] ? flush_kthread_worker+0x90/0x90 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.890342] Code: 7d f8 49 89 fd 4c 89 65 e0 44 0f b6 62 01 41 89 cf 48 8b be 80 00 00 00 41 8b b5 18 04 00 00 41 c0 ec 05 48 83 bb f0 01 00 00 00 <8b> 4f 10 41 89 f6 74 0a 8b 83 f8 01 00 00 85 c0 75 14 45 84 e4 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.890580] RIP [] sbc_check_prot+0x49/0x210 [target_core_mod] Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.890636] RSP Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.890659] CR2: 0010 Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.890956] ---[ end trace 894b2880b8116889 ]--- Sep 2 12:12:04 roc-4r-scd214 kernel: [86833.204150] BUG: unable to handle kernel paging request at ffd8 Sep 2 12:12:04 roc-4r-scd214 kernel: [86833.204291] IP: [] kthread_data+0x10/0x20 Sep
Re: [ceph-users] ESXi/LIO/RBD repeatable problem, hang when cloning VM
On Thu, Sep 3, 2015 at 6:58 AM, Jan Schermer wrote: > EnhanceIO? I'd say get rid of that first and then try reproducing it. Jan, EnhanceIO has not been used in this case, in fact we have never had a problem with it in read cache mode. Thank you, Alex > > Jan > >> On 03 Sep 2015, at 03:14, Alex Gorbachev wrote: >> >> e have experienced a repeatable issue when performing the following: >> >> Ceph backend with no issues, we can repeat any time at will in lab and >> production. Cloning an ESXi VM to another VM on the same datastore on >> which the original VM resides. Practically instantly, the LIO machine >> becomes unresponsive, Pacemaker fails over to another LIO machine and >> that too becomes unresponsive. >> >> Both running Ubuntu 14.04, kernel 4.1 (4.1.0-040100-generic x86_64), >> Ceph Hammer 0.94.2, and have been able to take quite a workoad with no >> issues. >> >> output of /var/log/syslog below. I also have a screen dump of a >> frozen system - attached. >> >> Thank you, >> Alex >> >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886254] CPU: 22 PID: >> 18130 Comm: kworker/22:1 Tainted: G C OE >> 4.1.0-040100-generic #201506220235 >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886303] Hardware name: >> Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a >> 12/05/2013 >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886364] Workqueue: >> xcopy_wq target_xcopy_do_work [target_core_mod] >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886395] task: >> 8810441c3250 ti: 88105bb4 task.ti: 88105bb4 >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886440] RIP: >> 0010:[] [] >> sbc_check_prot+0x49/0x210 [target_core_mod] >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886498] RSP: >> 0018:88105bb43b88 EFLAGS: 00010246 >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886525] RAX: >> 0400 RBX: 8810589eb008 RCX: 0400 >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886554] RDX: >> 8810589eb0f8 RSI: RDI: >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886584] RBP: >> 88105bb43bc8 R08: R09: 0001 >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886613] R10: >> R11: R12: >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886643] R13: >> 88084860c000 R14: c02372c0 R15: 0400 >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886673] FS: >> () GS:88105f48() >> knlGS: >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886719] CS: 0010 DS: >> ES: CR0: 80050033 >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886747] CR2: >> 0010 CR3: 01e0f000 CR4: 001407e0 >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886777] Stack: >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886798] 000b >> 000c 8810589eb0f8 >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886851] 8810589eb008 >> 88084860c000 c02372c0 0400 >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886904] 88105bb43c28 >> c03e528a 000c 0004000c >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886957] Call Trace: >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.886989] >> [] sbc_parse_cdb+0x66a/0xa20 [target_core_mod] >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887022] >> [] iblock_parse_cdb+0x15/0x20 [target_core_iblock] >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887077] >> [] target_setup_cmd_from_cdb+0x1c0/0x260 >> [target_core_mod] >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887133] >> [] target_xcopy_setup_pt_cmd+0x8d/0x170 >> [target_core_mod] >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887188] >> [] target_xcopy_read_source.isra.12+0x126/0x220 >> [target_core_mod] >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887243] >> [] ? 
sched_clock+0x9/0x10 >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887279] >> [] target_xcopy_do_work+0xf1/0x370 [target_core_mod] >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887329] >> [] ? __switch_to+0x1e6/0x580 >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887361] >> [] process_one_work+0x144/0x490 >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887390] >> [] worker_thread+0x11e/0x460 >> Sep 2 12:11:55 roc-4r-scd214 kernel: [86831.887418] >> [] ? create_worker+0x1f0/0x1f0 >> Sep 2 12:11:55
Re: [ceph-users] ESXi/LIO/RBD repeatable problem, hang when cloning VM
On Thu, Sep 3, 2015 at 3:20 AM, Nicholas A. Bellinger wrote: > (RESENDING) > > On Wed, 2015-09-02 at 21:14 -0400, Alex Gorbachev wrote: >> e have experienced a repeatable issue when performing the following: >> >> Ceph backend with no issues, we can repeat any time at will in lab and >> production. Cloning an ESXi VM to another VM on the same datastore on >> which the original VM resides. Practically instantly, the LIO machine >> becomes unresponsive, Pacemaker fails over to another LIO machine and >> that too becomes unresponsive. >> >> Both running Ubuntu 14.04, kernel 4.1 (4.1.0-040100-generic x86_64), >> Ceph Hammer 0.94.2, and have been able to take quite a workoad with no >> issues. >> >> output of /var/log/syslog below. I also have a screen dump of a >> frozen system - attached. >> >> Thank you, >> Alex >> > > The bug-fix patch to address this NULL pointer dereference with >= v4.1 > sbc_check_prot() sanity checks + EXTENDED_COPY I/O emulation has been > sent-out with your Reported-by. > > Please verify with your v4.1 environment that it resolves the original > ESX VAAI CLONE regression with a proper Tested-by tag. > > For now, it has also been queued to target-pending.git/for-next with a > stable CC'. Thank you for providing the patch - I will apply to git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git target-pending.git Best regards, Alex > > Thanks for reporting! > > --nab > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ESXi/LIO/RBD repeatable problem, hang when cloning VM
On Thu, Sep 3, 2015 at 3:20 AM, Nicholas A. Bellinger wrote: > (RESENDING) > > On Wed, 2015-09-02 at 21:14 -0400, Alex Gorbachev wrote: >> e have experienced a repeatable issue when performing the following: >> >> Ceph backend with no issues, we can repeat any time at will in lab and >> production. Cloning an ESXi VM to another VM on the same datastore on >> which the original VM resides. Practically instantly, the LIO machine >> becomes unresponsive, Pacemaker fails over to another LIO machine and >> that too becomes unresponsive. >> >> Both running Ubuntu 14.04, kernel 4.1 (4.1.0-040100-generic x86_64), >> Ceph Hammer 0.94.2, and have been able to take quite a workoad with no >> issues. >> >> output of /var/log/syslog below. I also have a screen dump of a >> frozen system - attached. >> >> Thank you, >> Alex >> > > The bug-fix patch to address this NULL pointer dereference with >= v4.1 > sbc_check_prot() sanity checks + EXTENDED_COPY I/O emulation has been > sent-out with your Reported-by. > > Please verify with your v4.1 environment that it resolves the original > ESX VAAI CLONE regression with a proper Tested-by tag. > > For now, it has also been queued to target-pending.git/for-next with a > stable CC'. > > Thanks for reporting! Thank you for the patch. I have compiled the kernel and tried the cloning - it completed successfully this morning. I will now try to build a package and deploy it on the larger systems where the failures occurred. Once completed I will learn about the Tested-by tag (never done it before) and submit the results. Best regards, Alex > > --nab > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] OSD crash
Hello, We have run into an OSD crash this weekend with the following dump. Please advise what this could be. Best regards, Alex 2015-09-07 14:55:01.345638 7fae6c158700 0 -- 10.80.4.25:6830/2003934 >> 10.80.4.15:6813/5003974 pipe(0x1dd73000 sd=257 :6830 s=2 pgs=14271 cs=251 l=0 c=0x10d34580).fault with nothing to send, going to standby 2015-09-07 14:56:16.948998 7fae643e8700 -1 *** Caught signal (Segmentation fault) ** in thread 7fae643e8700 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3) 1: /usr/bin/ceph-osd() [0xacb3ba] 2: (()+0x10340) [0x7faea044e340] 3: (tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)+0x103) [0x7faea067fac3] 4: (tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned long)+0x1b) [0x7faea067fb7b] 5: (operator delete(void*)+0x1f8) [0x7faea068ef68] 6: (std::_Rb_tree > >, std::_Select1st > > >, std::less, std::allocator > > > >::_M_erase(std::_Rb_tree_node > > >*)+0x58) [0xca2438] 7: (std::_Rb_tree > >, std::_Select1st > > >, std::less, std::allocator > > > >::erase(int const&)+0xdf) [0xca252f] 8: (Pipe::writer()+0x93c) [0xca097c] 9: (Pipe::Writer::entry()+0xd) [0xca40dd] 10: (()+0x8182) [0x7faea0446182] 11: (clone()+0x6d) [0x7fae9e9b100d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. --- begin dump of recent events --- -1> 2015-08-20 05:32:32.454940 7fae8e897700 0 -- 10.80.4.25:6830/2003934 >> 10.80.4.15:6806/4003754 pipe(0x1992d000 sd=142 :6830 s=0 pgs=0 cs=0 l=0 c=0x12bf5700).accept connect_seq 816 vs existing 815 state standby ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD crash
Hi Brad, This occurred on a system under moderate load - has not happened since and I do not know how to reproduce. Thank you, Alex On Tue, Sep 22, 2015 at 7:29 PM, Brad Hubbard wrote: > - Original Message - > > > From: "Alex Gorbachev" > > To: "ceph-users" > > Sent: Wednesday, 9 September, 2015 6:38:50 AM > > Subject: [ceph-users] OSD crash > > > Hello, > > > We have run into an OSD crash this weekend with the following dump. > Please > > advise what this could be. > > Hello Alex, > > As you know I created http://tracker.ceph.com/issues/13074 for this issue > but > the developers working on it would like any additional information you can > provide about the nature of the issue. Could you take a look? > > Cheers, > Brad > > > Best regards, > > Alex > > > 2015-09-07 14:55:01.345638 7fae6c158700 0 -- 10.80.4.25:6830/2003934 >> > > 10.80.4.15:6813/5003974 pipe(0x1dd73000 sd=257 :6830 s=2 pgs=14271 > cs=251 > > l=0 c=0x10d34580).fault with nothing to send, going to standby > > 2015-09-07 14:56:16.948998 7fae643e8700 -1 *** Caught signal > (Segmentation > > fault) ** > > in thread 7fae643e8700 > > > ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3) > > 1: /usr/bin/ceph-osd() [0xacb3ba] > > 2: (()+0x10340) [0x7faea044e340] > > 3: > > > (tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, > > unsigned long, int)+0x103) [0x7faea067fac3] > > 4: (tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, > > unsigned long)+0x1b) [0x7faea067fb7b] > > 5: (operator delete(void*)+0x1f8) [0x7faea068ef68] > > 6: (std::_Rb_tree > std::allocator > >, std::_Select1st > std::list > > >, std::less, > > std::allocator > std::allocator > > > > >::_M_erase(std::_Rb_tree_node > const, std::list > > >*)+0x58) > [0xca2438] > > 7: (std::_Rb_tree > std::allocator > >, std::_Select1st > std::list > > >, std::less, > > std::allocator > std::allocator > > > >::erase(int const&)+0xdf) [0xca252f] > > 8: (Pipe::writer()+0x93c) [0xca097c] > > 9: (Pipe::Writer::entry()+0xd) [0xca40dd] > > 10: (()+0x8182) [0x7faea0446182] > > 11: (clone()+0x6d) [0x7fae9e9b100d] > > NOTE: a copy of the executable, or `objdump -rdS ` is needed > to > > interpret this. > > > --- begin dump of recent events --- > > -1> 2015-08-20 05:32:32.454940 7fae8e897700 0 -- > 10.80.4.25:6830/2003934 > > >> 10.80.4.15:6806/4003754 pipe(0x1992d000 sd=142 :6830 s=0 pgs=0 cs=0 > l=0 > > c=0x12bf5700).accept connect_seq 816 vs existing 815 state standby > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
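For the record, the dump's note about needing the executable or an objdump output can be satisfied on Ubuntu roughly as follows; the debug package name is an assumption for the 0.94 packaging:
sudo apt-get install ceph-dbg              # debugging symbols for ceph-osd (package name assumed)
objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump
gzip ceph-osd.objdump                      # compress and attach to the tracker issue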
Re: [ceph-users] Diffrent OSD capacity & what is the weight of item
Please review http://docs.ceph.com/docs/master/rados/operations/crush-map/ regarding weights Best regards, Alex On Wed, Sep 23, 2015 at 3:08 AM, wikison wrote: > Hi, > I have four storage machines to build a ceph storage cluster as > storage nodes. Each of them is attached a 120 GB HDD and a 1 TB HDD. Is it > OK to think that those storage devices are same when write a ceph.conf? > For example, when setting *osd pool default pg num* , I thought: *osd > pool default pg num* = (100 * 8 ) / 3 = 266, where *osd pool default > size* = 3 and the number of OSDs is 8 (one Daemon per device). > > And, when Add the OSD to the CRUSH map so that it can begin > receiving data. You may also decompile the CRUSH map, add the OSD to the > device list, add the host as a bucket (if it’s not already in the CRUSH > map), add the device as an item in the host, assign it a weight, recompile > it and set it. > ceph [--cluster {cluster-name}] osd crush add {id-or-name} > {weight} [{bucket-type}={bucket-name} ...] > What is the meaning of weight? How should I set it to satisfy my > hardware condition ? > > > > -- > Zhen Wang > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
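To make the linked documentation concrete: the usual convention is a CRUSH weight of roughly 1.0 per TB of raw capacity, so for the 120 GB + 1 TB mix described above the commands would look something like this (OSD IDs and host name are examples):
# roughly 1.0 of CRUSH weight per TB of capacity is the common convention
ceph osd crush add osd.0 0.12 host=storage1    # 120 GB disk
ceph osd crush add osd.1 0.91 host=storage1    # 1 TB disk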
Re: [ceph-users] Potential OSD deadlock?
We had multiple issues with 4TB drives and delays. Here is the configuration that works for us fairly well on Ubuntu (but we are about to significantly increase the IO load so this may change). NTP: always use NTP and make sure it is working - Ceph is very sensitive to time being precise /etc/default/grub: GRUB_CMDLINE_LINUX_DEFAULT="elevator=noop nomodeset splash=silent vga=normal net.ifnames=0 biosdevname=0 scsi_mod.use_blk_mq=Y" blk_mq really helps with spreading the IO load over multiple cores. I used to use intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll, but it seems allowing idle states actually can improve performance by running CPUs cooler, so will likely remove this soon. chmod -x /etc/init.d/ondemand - in order to prevent CPU throttling use Mellanox OFED on pre-4.x kernels check your flow control settings on server and switch using ethtool test network performance with iperf disable firewall rules or just uninstall firewall (e.g. ufw) Turn off in BIOS any virtualization technology VT-d etc., and (see note above re C-states) maybe also disable power saving features /etc/sysctl.conf: kernel.pid_max = 4194303 vm.swappiness=1 vm.min_free_kbytes=1048576 Hope this helps. Alex On Sun, Oct 4, 2015 at 2:16 AM, Josef Johansson wrote: > Hi, > > I don't know what brand those 4TB spindles are, but I know that mine are > very bad at doing write at the same time as read. Especially small read > write. > > This has an absurdly bad effect when doing maintenance on ceph. That being > said we see a lot of difference between dumpling and hammer in performance > on these drives. Most likely due to hammer able to read write degraded PGs. > > We have run into two different problems along the way, the first was > blocked request where we had to upgrade from 64GB mem on each node to > 256GB. We thought that it was the only safe buy make things better. > > I believe it worked because more reads were cached so we had less mixed > read write on the nodes, thus giving the spindles more room to breath. Now > this was a shot in the dark then, but the price is not that high even to > just try it out.. compared to 6 people working on it. I believe the IO on > disk was not huge either, but what kills the disk is high latency. How much > bandwidth are the disk using? We had very low.. 3-5MB/s. > > The second problem was defragmentations hitting 70%, lowering that to 6% > made a lot of difference. Depending on IO pattern this increases different. > > TL;DR read kills the 4TB spindles. > > Hope you guys clear out of the woods. > /Josef > On 3 Oct 2015 10:10 pm, "Robert LeBlanc" wrote: > >> -BEGIN PGP SIGNED MESSAGE- >> Hash: SHA256 >> >> We are still struggling with this and have tried a lot of different >> things. Unfortunately, Inktank (now Red Hat) no longer provides >> consulting services for non-Red Hat systems. If there are some >> certified Ceph consultants in the US that we can do both remote and >> on-site engagements, please let us know. >> >> This certainly seems to be network related, but somewhere in the >> kernel. We have tried increasing the network and TCP buffers, number >> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle >> on the boxes, the disks are busy, but not constantly at 100% (they >> cycle from <10% up to 100%, but not 100% for more than a few seconds >> at a time). There seems to be no reasonable explanation why I/O is >> blocked pretty frequently longer than 30 seconds. We have verified >> Jumbo frames by pinging from/to each node with 9000 byte packets. 
The >> network admins have verified that packets are not being dropped in the >> switches for these nodes. We have tried different kernels including >> the recent Google patch to cubic. This is showing up on three cluster >> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie >> (from CentOS 7.1) with similar results. >> >> The messages seem slightly different: >> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 : >> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for > >> 100.087155 secs >> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 : >> cluster [WRN] slow request 30.041999 seconds old, received at >> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862 >> rbd_data.13fdcb2ae8944a.0001264f [read 975360~4096] >> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag >> points reached >> >> I don't know what "no flag points reached" means. >> >> The problem is most pronounced when we have to reboot an OSD node (1 >> of 13), we will have hundreds of I/O blocked for some times up to 300 >> seconds. It takes a good 15 minutes for things to settle down. The >> production cluster is very busy doing normally 8,000 I/O and peaking >> at 15,000. This is all 4TB spindles with SSD journals and the disks >> are between 25-50% full. We are currently splitting PGs to distribute >> the load better across the disks, but w
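On the diagnostic side, beyond "ceph osd perf" the commands typically used to see which OSDs are holding up IO are roughly these (osd.12 is just an example ID; the daemon command is run on that OSD's host):
ceph osd perf                        # per-OSD commit/apply latency
ceph health detail                   # lists which OSDs currently have blocked/slow requests
ceph daemon osd.12 dump_historic_ops # recent slowest ops with per-stage timings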
Re: [ceph-users] Ceph RBD LIO ESXi Advice?
Hi Timofey, With Nick's, Jan's, RedHat's and others' help we have a stable and, in my best judgement, well performing system using SCST as the iSCSI delivery framework. SCST allows the use of Linux page cache when utilizing the vdisk_fileio backend. LIO should be able to do this to using FILEIO backstore and the block device name as file name, but I have not tried that due to having switched to SCST for stability. The page cache will improve latency due to the reads and writes first occurring in RAM. Naturally, all the usual considerations apply as to the loss of dirty pages on machine crash. So tuning the vm.dirty* parameters is quite important. This setting was critically important to avoid hangs and major issues due to some problem with XFS and page cache on OSD nodes: sysctl vm.min_free_kbytes=1048576 (reserved memory when using vm.swappiness = 1) 10 GbE networking seems to be helping a lot, it could be just the superior switch response on a higher end switch. Using blk_mq scheduler, it's been reported to improve performance on random IO. Good luck! -- Alex Gorbachev Storcium On Sun, Nov 8, 2015 at 5:07 PM, Timofey Titovets wrote: > Big thanks Nick, any way > Now i catch hangs of ESXi and Proxy =_='' > /* Proxy VM: Ubuntu 15.10/Kernel 4.3/LIO/Ceph 0.94/ESXi 6.0 Software > iSCSI*/ > I've moved to NFS-RBD proxy and now try to make it HA > > 2015-11-07 18:59 GMT+03:00 Nick Fisk : > > Hi Timofey, > > > > You are most likely experiencing the effects of Ceph's write latency in > combination with the sync write behaviour of ESXi. You will probably > struggle to get much under 2ms write latency with Ceph, assuming a minimum > of 2 copies in Ceph. This will limit you to around 500iops for a QD of 1. > Because of this you will also experience slow file/VM copies, as ESXi moves > the blocks of data around in 64kb sync IO's. 500x64kb = ~30MB/s. > > > > Moving to 10GB end to end may get you a reasonable boost in performance > as you will be removing a 1ms or so of latency from the network for each > write. Also search the mailing list for small performance tweaks you can > do, like disabling logging. > > > > Other than that the only thing I have found that has chance of giving > you performance similar to other products and/or legacy SAN's is to use > some sort of RBD caching with something like flashcache/enhanceio/bcache o > nyour proxy nodes. However this brings its on challenges and I still > haven't got to a point that I'm happy to deploy it. > > > > I'm surprised you are also not seeing LIO hangs, which several people > including me experience when using RBD+LIO+ESXi, although I haven't checked > recently to see if this is now working better. I would be interesting in > hearing your feedback on this. They normally manifest themselves when an > OSD drops out and IO is suspended for more than 5-10s. > > > > Sorry I couldn't be of more help. > > > > Nick > > > >> -Original Message- > >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > Of > >> Timofey Titovets > >> Sent: 07 November 2015 11:44 > >> To: ceph-users@lists.ceph.com > >> Subject: [ceph-users] Ceph RBD LIO ESXi Advice? 
> >> > >> Hi List, > >> I Searching for advice from somebody, who use Legacy client like ESXi > with > >> Ceph > >> > >> I try to build High-performance fault-tolerant storage with Ceph 0.94 > >> > >> In production i have 50+ TB of VMs (~800 VMs) > >> 8 NFS servers each: > >> 2xIntel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz 12xSeagate ST2000NM0023 > >> 1xLSI Nytro™ MegaRAID® NMR 8110-4i > >> 96 GB of RAM > >> 4x 1 GBE links in Balance-ALB mode (I don't have problem with network > >> throughput) > >> > >> Now in lab. i have build 3 node cluster like: > >> Kernel 4.2 > >> Intel(R) Xeon(R) CPU 5130 @ 2.00GHz > >> 16 Gb of RAM > >> 6xSeagate ST2000NM0033 > >> 2x 1GBE in Balance-ALB > >> i.e. each node is a MON and 6 OSDs > >> > >> > >> Config like: > >> osd journal size = 16384 > >> osd pool default size = 2 > >> osd pool default min size = 2 > >> osd pool default pg num = 256 > >> osd pool default pgp num = 256 > >> osd crush chooseleaf type = 1 > >> filestore max sync interval = 180 > >> > >> For attach RBD Storage to ESXi i create a 2 VMs: > >> 2 cores > >> 2 GB RAM > >> Kernel 4.3 > >> Each vm map big RBD volume and proxy it by LIO to ESXi ESXi see VMs like >
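To put the vm.dirty* point above into concrete terms, a conservative starting point on the iSCSI proxy nodes could look like the following; only min_free_kbytes and swappiness are values actually mentioned in this thread, the dirty ratios are illustrative and should be sized to the node's RAM:
# /etc/sysctl.conf on the iSCSI proxy node -- dirty ratios are illustrative
vm.min_free_kbytes = 1048576
vm.swappiness = 1
# start background writeback early and block writers well before memory fills with dirty pages
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
In scst.conf this pairs with a vdisk_fileio DEVICE entry whose filename points at the mapped /dev/rbd device, which is what routes the IO through the page cache in the first place.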
Re: [ceph-users] network failover with public/custer network - is that possible
On Wednesday, November 25, 2015, Götz Reinicke - IT Koordinator < goetz.reini...@filmakademie.de> wrote: > Hi, > > discussing some design questions we came across the failover possibility > of Ceph's network configuration. > > If I just have a public network, all traffic is crossing that LAN. > > With public and cluster network I can separate the traffic and get some > benefits. > > What if one of the networks fails? e.g. just on one host or the whole > network for all nodes? > > Is there some sort of auto failover to use the other network for all > traffic then? > > How does that work in real life? :) Or do I have to interact by hand? We have successfully used multiple bonding interfaces, which work correctly with high speed NICs, at least in Ubuntu with 4.x kernels. In combination with MLAG (multiple switch chassis link aggregation) this provides at least good physical redundancy. I expect this to improve further as Software Defined Networking solutions become more popular and make it easier to create such redundant setups. We have not delved into layer 3 solutions, such as OSPF, but these should be helpful as well to add robustness to the Ceph networking backend. Best regards, Alex > > Thanks for feedback and regards. Götz > > > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
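As a sketch of the bonding setup described above, an Ubuntu ifupdown fragment for an LACP (802.3ad) bond that pairs with MLAG on the switch side would look roughly like this; interface names and addresses are made up and the ifenslave package is assumed to be installed:
# /etc/network/interfaces fragment -- names and addresses are examples only
auto bond0
iface bond0 inet static
    address 192.168.100.11
    netmask 255.255.255.0
    bond-slaves enp3s0f0 enp3s0f1
    bond-mode 802.3ad
    bond-miimon 100
    bond-lacp-rate 1
    bond-xmit-hash-policy layer3+4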
[ceph-users] rbd merge-diff error
When trying to merge two results of rbd export-diff, the following error occurs: iss@lab2-b1:~$ rbd export-diff --from-snap autosnap120720151500 spin1/scrun1@autosnap120720151502 /data/volume1/scrun1-120720151502.bck iss@lab2-b1:~$ rbd export-diff --from-snap autosnap120720151504 spin1/scrun1@autosnap120720151504 /data/volume1/scrun1-120720151504.bck iss@lab2-b1:~$ rbd merge-diff /data/volume1/scrun1-120720151502.bck /data/volume1/scrun1-120720151504.bck /data/volume1/mrg-scrun1-0204.bck Merging image diff: 11% complete...failed. rbd: merge-diff error That's all the output and I have found this link http://tracker.ceph.com/issues/12911 but not sure if the patch should have already been in hammer or how to get it? System: ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43) Ubuntu 14.04.3 kernel 4.2.1-040201-generic Thank you -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rbd merge-diff error
Hi Josh, On Mon, Dec 7, 2015 at 6:50 PM, Josh Durgin wrote: > On 12/07/2015 03:29 PM, Alex Gorbachev wrote: > >> When trying to merge two results of rbd export-diff, the following error >> occurs: >> >> iss@lab2-b1:~$ rbd export-diff --from-snap autosnap120720151500 >> spin1/scrun1@autosnap120720151502 /data/volume1/scrun1-120720151502.bck >> >> iss@lab2-b1:~$ rbd export-diff --from-snap autosnap120720151504 >> spin1/scrun1@autosnap120720151504 /data/volume1/scrun1-120720151504.bck >> >> iss@lab2-b1:~$ rbd merge-diff /data/volume1/scrun1-120720151502.bck >> /data/volume1/scrun1-120720151504.bck /data/volume1/mrg-scrun1-0204.bck >> Merging image diff: 11% complete...failed. >> rbd: merge-diff error >> >> That's all the output and I have found this link >> http://tracker.ceph.com/issues/12911 but not sure if the patch should >> have already been in hammer or how to get it? >> > > That patch fixed a bug that was only present after hammer, due to > parallelizing export-diff. You're likely seeing a different (possibly > new) issue. > > Unfortunately there's not much output we can enable for export-diff in > hammer. Could you try running the command via gdb to figure out where > and why it's failing? Make sure you have librbd-dbg installed, then > send the output from gdb doing: > > gdb --args rbd merge-diff /data/volume1/scrun1-120720151502.bck \ > /data/volume1/scrun1-120720151504.bck /data/volume1/mrg-scrun1-0204.bck > break rbd.cc:1931 > break rbd.cc:1935 > break rbd.cc:1967 > break rbd.cc:1985 > break rbd.cc:1999 > break rbd.cc:2008 > break rbd.cc:2021 > break rbd.cc:2053 > break rbd.cc:2098 > run > # (it will run now, stopping when it hits the error) > info locals Will do - how does one load librbd-dbg? I have the following on the system: librbd-dev - RADOS block device client library (development files) librbd1-dbg - debugging symbols for librbd1 is librbd1-dbg sufficient? Also a question - the merge-diff really stitches the to diff files together, not really merges, correct? For example, in the following workflow: export-diff from full image - 10GB export-diff from snap1 - 2 GB export-diff from snap2 - 1 GB My resulting merge export file would be 13GB, correct? Thank you, Alex > > > Josh > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
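To make the stitching question concrete, the chain under discussion is roughly the following; the import-diff restore step is an assumption of how the merged file would be replayed (the target image must already exist at the right size), not something confirmed above:
# export the initial diff and an incremental on top of it
rbd export-diff spin1/scrun1@snap1 /data/volume1/base-to-snap1.bck
rbd export-diff --from-snap snap1 spin1/scrun1@snap2 /data/volume1/snap1-to-snap2.bck
# stitch the two files into a single diff covering base..snap2
rbd merge-diff /data/volume1/base-to-snap1.bck /data/volume1/snap1-to-snap2.bck /data/volume1/merged.bck
# replay onto a restore image (created beforehand, e.g. with rbd create)
rbd import-diff /data/volume1/merged.bck backup/scrun1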
Re: [ceph-users] Starting a cluster with one OSD node
On Friday, May 13, 2016, Mike Jacobacci wrote: > Hello, > > I have a quick and probably dumb question… We would like to use Ceph for > our storage, I was thinking of a cluster with 3 Monitor and OSD nodes. I > was wondering if it was a bad idea to start a Ceph cluster with just one > OSD node (10 OSDs, 2 SSDs), then add more nodes as our budget allows? We > want to spread out the purchases of the OSD nodes over a month or two but I > would like to start moving data over ASAP. Hi Mike, Production or test? I would strongly recommend against one OSD node in production. Not only risk of hang and data loss due to e.g. Filesystem issue or kernel, but also as you add nodes the data movement will introduce a good deal of overhead. Regards, Alex > > Cheers, > Mike > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Starting a cluster with one OSD node
> On Friday, May 13, 2016, Mike Jacobacci wrote: > Hello, > > I have a quick and probably dumb question… We would like to use Ceph > for our storage, I was thinking of a cluster with 3 Monitor and OSD > nodes. I was wondering if it was a bad idea to start a Ceph cluster > with just one OSD node (10 OSDs, 2 SSDs), then add more nodes as our > budget allows? We want to spread out the purchases of the OSD nodes > over a month or two but I would like to start moving data over ASAP. Hi Mike, Production or test? I would strongly recommend against one OSD node in production. Not only risk of hang and data loss due to e.g. Filesystem issue or kernel, but also as you add nodes the data movement will introduce a good deal of overhead. >> On May 14, 2016, at 9:56 AM, Christian Balzer wrote: >> >> On Sat, 14 May 2016 09:46:23 -0700 Mike Jacobacci wrote: >> >> >> Hello, >> >>> Hi Alex, >>> >>> Thank you for your response! Yes, this is for a production >>> environment... Do you think the risk of data loss due to the single node >>> be different than if it was an appliance or a Linux box with raid/zfs? >> Depends. >> >> Ceph by default distributes 3 replicas amongst the storage nodes, giving >> you fault tolerances along the lines of RAID6. >> So (again by default), the smallest cluster you want to start with is 3 >> nodes. >> >> OF course you could modify the CRUSH rules to place 3 replicas based on >> OSDs, not nodes. >> >> However that only leaves you with 3 disks worth of capacity in your case >> and still the data movement Alex mentioned when adding more nodes AND >> modifying the CRUSH rules. >> >> Lastly I personally wouldn't deploy anything that's a SPoF in production. >> >> Christian On Sat, May 14, 2016 at 1:08 PM, Mike Jacobacci wrote: > Hi Christian, > > Thank you, I know what I am asking isn't a good idea... I am just trying to > avoid waiting for all three nodes before I began virtualizing our > infrastructure. > > Again thanks for the responses! Hi Mike, I generally do not build production environments on one node for storage, although my group has built really good test/training environments with a single box. What we do there is forgo ceph altogether and install a hardware RAID with your favorite RAID HBA vendor - LSI/Avago, Areca, Adaptec, etc. and export it using SCST as iSCSI. This setup has worked really well so far for its intended use. We tested a two box setup with LSI/Avago SyncroCS, which works well too and there are some good howtos on the web for this - but it seems SyncroCS has been put on ice by Avago, unfortunately. Regarding building a one node setup in ceph and then expanding it, I would not do this. It is easier to do things right up front than to redo later. What you may want to do is use this one node to become familiar with the ceph architecture and do a dry run - however, I would wipe clean and recreate the environment rather than promote it to production. Silly operator errors have come up in the past, like leaving OSD level redundancy instead of setting node redundancy. Also, big data migrations are hard on clients (you can see IO timeouts), as discussed often on this list. So YMMV, but I personally would not rush. Best regards, Alex ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
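For completeness, Christian's "replicas across OSDs instead of hosts" option for a single-node start is done either up front in ceph.conf or later by editing the CRUSH rule; a rough sketch (again, not a recommendation for production):
# ceph.conf before cluster creation: choose replica leaves at the OSD level
[global]
osd crush chooseleaf type = 0
# or change an existing ruleset by hand
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
#   edit the rule:  step chooseleaf firstn 0 type host  ->  step chooseleaf firstn 0 type osd
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new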
[ceph-users] Pacemaker Resource Agents for Ceph by Andreas Kurz
Following a conversation with Sage in NYC, I would like to share links to the excellent resource agents for Pacemaker, developed by Andreas Kurz to present Ceph images to iSCSI and FC fabrics. We are using these as part of the Storcium solution, and these RAs have withstood quite a few beatings by clients' IO load. https://github.com/akurz/resource-agents/blob/SCST/heartbeat/SCSTLogicalUnit https://github.com/akurz/resource-agents/blob/SCST/heartbeat/SCSTTarget https://github.com/akurz/resource-agents/blob/SCST/heartbeat/iscsi-scstd -- Alex Gorbachev http://www.iss-integration.com Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Must host bucket name be the same with hostname ?
On Tuesday, June 7, 2016, Christian Balzer wrote: > > Hello, > > you will want to read: > > https://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/ > > especially section III and IV. > > Another approach w/o editing the CRUSH map is here: > https://elkano.org/blog/ceph-sata-ssd-pools-server-editing-crushmap/ I wonder if this task can be further automated using this feature http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-October/025030.html http://blog-fromsomedude.rhcloud.com/2016/03/30/CRUSH-location-hook-by-example/ Alex > > Christian > > On Wed, 8 Jun 2016 10:54:36 +0800 秀才 wrote: > > > Hi all, > > > > There are SASes & SSDs in my nodes at the same time. > > Now i want divide them into 2 groups, one composed of SASes and one > > only contained SSDs. When i configure CRUSH rulesets, segment below: > > > > > > # buckets > > host robert-a { > > id -2 # do not change unnecessarily > > # weight 1.640 > > alg straw > > hash 0 # rjenkins1 > > item osd.0 weight 0.250#SAS > > item osd.1 weight 0.250#SAS > > item osd.2 weight 0.250#SSD > > item osd.3 weight 0.250#SSD > > > > } > > > > > > So, i am not sure must host bucket name be the same with hostname. > > > > > > Or host bucket name does no matter? > > > > > > > > Best regards, > > > > Xiucai > > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten > Communications > http://www.gol.com/ > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
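The hook approach from the second link amounts to pointing ceph.conf at a small executable that prints the CRUSH location for each OSD as key=value pairs; a minimal sketch, where the SSD-detection logic, marker file and script path are placeholders for however you identify the SSD-backed OSDs:
# ceph.conf
[osd]
osd crush location hook = /usr/local/bin/ceph-crush-location-hook

# /usr/local/bin/ceph-crush-location-hook (mode 0755) -- minimal sketch
#!/bin/sh
# Invoked per OSD with arguments like: --cluster ceph --id 3 --type osd
# and must print space-separated key=value CRUSH location pairs.
ID=$(echo "$@" | sed -n 's/.*--id \([0-9][0-9]*\).*/\1/p')
if [ -e "/var/lib/ceph/osd/ceph-$ID/is_ssd" ]; then    # placeholder marker file
    echo "host=$(hostname -s)-ssd root=ssd"
else
    echo "host=$(hostname -s) root=default"
fi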
Re: [ceph-users] Disaster recovery and backups
On Sunday, June 5, 2016, Gandalf Corvotempesta < gandalf.corvotempe...@gmail.com> wrote: > Let's assume that everything went very very bad and i have to manually > recover a cluster with an unconfigured ceph. > > 1. How can i recover datas directly from raw disks? Is this possible? There have been a few threads on this here, but all look complicated, time consuming and not guaranteed to work, to be used only as last resort. > 2. How can i restore a ceph cluster (and have data back) by using > existing disks? Back up either using RBD export/export-diff or an OS tool. Jewel has rbd mirroring to another cluster. > 3. How do you manage backups for ceph, in huge clusters? We are building an appliance that relies on rbd export-diff. There are well documented commands for it, and my early testing was able to demonstrate successful restore. Best regards, Alex > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
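A minimal sketch of an export-diff based backup loop of the kind described above, with pool, image and path names made up; it keeps one snapshot per day and ships either a full or an incremental diff:
#!/bin/sh
# daily incremental backup sketch -- names and paths are examples
POOL=rbd
IMG=vm01
DEST=/backup
TODAY=$(date +%Y%m%d)
PREV=$(date -d yesterday +%Y%m%d)
rbd snap create $POOL/$IMG@bk-$TODAY
if rbd snap ls $POOL/$IMG | grep -q "bk-$PREV"; then
    rbd export-diff --from-snap bk-$PREV $POOL/$IMG@bk-$TODAY $DEST/$IMG-$PREV-to-$TODAY.diff
else
    rbd export-diff $POOL/$IMG@bk-$TODAY $DEST/$IMG-full-to-$TODAY.diff
fi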
[ceph-users] Is anyone seeing issues with task_numa_find_cpu?
After upgrading to kernel 4.4.13 on Ubuntu, we are seeing a few of these issues where an OSD would fail with the stack below. I logged a bug at https://bugzilla.kernel.org/show_bug.cgi?id=121101 and there is a similar description at https://lkml.org/lkml/2016/6/22/102, but the odd part is we have turned off CFQ and blk-mq/scsi-mq and are using just the noop scheduler. Does the ceph kernel code somehow use the fair scheduler code block? Thanks -- Alex Gorbachev Storcium Jun 28 09:46:41 roc04r-sca090 kernel: [137912.684974] CPU: 30 PID: 10403 Comm: ceph-osd Not tainted 4.4.13-040413-generic #201606072354 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.684991] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685009] task: 880f79df8000 ti: 880f79fb8000 task.ti: 880f79fb8000 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685024] RIP: 0010:[] [] task_numa_find_cpu+0x22e/0x6f0 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685051] RSP: 0018:880f79fbb818 EFLAGS: 00010206 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685063] RAX: RBX: 880f79fbb8b8 RCX: Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685076] RDX: RSI: RDI: 8810352d4800 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685107] RBP: 880f79fbb880 R08: 0001020cf87c R09: 00ff00ff Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685150] R10: 0009 R11: 0006 R12: 8807c3adc4c0 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685194] R13: 0006 R14: 033e R15: fec7 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685238] FS: 7f30e46b8700() GS:88105f58() knlGS: Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685283] CS: 0010 DS: ES: CR0: 80050033 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685310] CR2: 1321a000 CR3: 000853598000 CR4: 000406e0 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685354] Stack: Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685374] 813d050f 000d 0045 880f79df8000 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685426] 033f 00016b00 033f Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685477] 880f79df8000 880f79fbb8b8 01f4 0054 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685528] Call Trace: Jun 28 09:46:41 roc04r-sca090 kernel: [137912.68] [] ? cpumask_next_and+0x2f/0x40 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685584] [] task_numa_migrate+0x43e/0x9b0 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685613] [] ? update_cfs_shares+0xbc/0x100 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685642] [] numa_migrate_preferred+0x79/0x80 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685672] [] task_numa_fault+0x7f4/0xd40 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685700] [] ? timerqueue_del+0x24/0x70 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685729] [] ? should_numa_migrate_memory+0x55/0x130 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685762] [] handle_mm_fault+0xbc0/0x1820 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685793] [] ? __hrtimer_init+0x90/0x90 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685822] [] ? remove_wait_queue+0x4d/0x60 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685853] [] ? poll_freewait+0x4a/0xa0 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685882] [] __do_page_fault+0x197/0x400 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685910] [] do_page_fault+0x22/0x30 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685939] [] page_fault+0x28/0x30 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685967] [] ? copy_page_to_iter_iovec+0x5f/0x300 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685997] [] ? 
select_task_rq_fair+0x625/0x700 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686026] [] copy_page_to_iter+0x16/0xa0 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686056] [] skb_copy_datagram_iter+0x14d/0x280 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686087] [] tcp_recvmsg+0x613/0xbe0 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686117] [] inet_recvmsg+0x7e/0xb0 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686146] [] sock_recvmsg+0x3b/0x50 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686173] [] SYSC_recvfrom+0xe1/0x160 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686202] [] ? ktime_get_ts64+0x45/0xf0 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686230] [] SyS_recvfrom+0xe/0x10 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686259] [] entry_SYSCALL_64_fastpath+0x16/0x71 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686287] Code: 55 b0 4c 89 f7 e8 53 cd ff ff 48 8b 55 b0 49 8b 4e 78 48 8b 82 d8 01 00 00 48 83 c1 01 31 d2 49 0f af 86 b0 00 00 00 4c 8b 73 78 <48> f7 f1 48 8b 4b 20 49 89 c0 48 29 c1 48 8b 45 d0 4c 03 43 48 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.68651
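For reference, the "noop and no blk-mq" state mentioned above can be confirmed on a running node along these lines (sda is an example device):
cat /sys/block/sda/queue/scheduler                 # expect something like: [noop] deadline cfq
cat /sys/module/scsi_mod/parameters/use_blk_mq     # N means scsi-mq is disabled
grep -o 'elevator=[a-z]*' /proc/cmdline            # confirms the boot-time elevator setting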
Re: [ceph-users] Is anyone seeing issues with task_numa_find_cpu?
Hi Stefan, On Tue, Jun 28, 2016 at 1:46 PM, Stefan Priebe - Profihost AG wrote: > Please be aware that you may need even more patches. Overall this needs 3 > patches. Where the first two try to fix a bug and the 3rd one fixes the > fixes + even more bugs related to the scheduler. I've no idea on which patch > level Ubuntu is. Stefan, would you be able to please point to the other two patches beside https://lkml.org/lkml/diff/2016/6/22/102/1 ? Thank you, Alex > > Stefan > > Excuse my typo sent from my mobile phone. > > Am 28.06.2016 um 17:59 schrieb Tim Bishop : > > Yes - I noticed this today on Ubuntu 16.04 with the default kernel. No > useful information to add other than it's not just you. > > Tim. > > On Tue, Jun 28, 2016 at 11:05:40AM -0400, Alex Gorbachev wrote: > > After upgrading to kernel 4.4.13 on Ubuntu, we are seeing a few of > > these issues where an OSD would fail with the stack below. I logged a > > bug at https://bugzilla.kernel.org/show_bug.cgi?id=121101 and there is > > a similar description at https://lkml.org/lkml/2016/6/22/102, but the > > odd part is we have turned off CFQ and blk-mq/scsi-mq and are using > > just the noop scheduler. > > > Does the ceph kernel code somehow use the fair scheduler code block? > > > Thanks > > -- > > Alex Gorbachev > > Storcium > > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.684974] CPU: 30 PID: > > 10403 Comm: ceph-osd Not tainted 4.4.13-040413-generic #201606072354 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.684991] Hardware name: > > Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 > > 03/04/2015 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685009] task: > > 880f79df8000 ti: 880f79fb8000 task.ti: 880f79fb8000 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685024] RIP: > > 0010:[] [] > > task_numa_find_cpu+0x22e/0x6f0 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685051] RSP: > > 0018:880f79fbb818 EFLAGS: 00010206 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685063] RAX: > > RBX: 880f79fbb8b8 RCX: > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685076] RDX: > > RSI: RDI: 8810352d4800 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685107] RBP: > > 880f79fbb880 R08: 0001020cf87c R09: 00ff00ff > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685150] R10: > > 0009 R11: 0006 R12: 8807c3adc4c0 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685194] R13: > > 0006 R14: 033e R15: fec7 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685238] FS: > > 7f30e46b8700() GS:88105f58() > > knlGS: > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685283] CS: 0010 DS: > > ES: CR0: 80050033 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685310] CR2: > > 1321a000 CR3: 000853598000 CR4: 000406e0 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685354] Stack: > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685374] > > 813d050f 000d 0045 880f79df8000 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685426] > > 033f 00016b00 033f > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685477] > > 880f79df8000 880f79fbb8b8 01f4 0054 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685528] Call Trace: > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.68] > > [] ? cpumask_next_and+0x2f/0x40 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685584] > > [] task_numa_migrate+0x43e/0x9b0 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685613] > > [] ? 
update_cfs_shares+0xbc/0x100 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685642] > > [] numa_migrate_preferred+0x79/0x80 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685672] > > [] task_numa_fault+0x7f4/0xd40 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685700] > > [] ? timerqueue_del+0x24/0x70 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685729] > > [] ? should_numa_migrate_memory+0x55/0x130 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685762] > > [] handle_mm_fault+0xbc0/0x1820 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685793] > > [] ? __hrtimer_init+0x90/0x90 > > Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685822] > > [] ? remove_wait_queue+0x4d/0x60 > > Jun 28 09:46:41 roc04r-sca090
Re: [ceph-users] Is anyone seeing issues with task_numa_find_cpu?
> Thank you Stefan and Campbell for the info - hope 4.7rc5 resolves this > for us - please note that my workload is purely RBD, no QEMU/KVM. > Also, we do not have CFQ turned on, neither scsi-mq and blk-mq, so I > am surmising ceph-osd must be using something from the fair scheduler. > I read that its IO has been switched to blk-mq internally, so maybe > there is a relationship there. If the OSD code is compiled against the source from a buggy fair scheduler code, then that would be an OSD code issue, correct? > > We had no such problems with kernel 4.2.x, but had other issues with > XFS, which do not seem to happen now. > > Regards, > Alex > >> >> Stefan >> >> Am 29.06.2016 um 11:41 schrieb Campbell Steven: >>> Hi Alex/Stefan, >>> >>> I'm in the middle of testing 4.7rc5 on our test cluster to confirm >>> once and for all this particular issue has been completely resolved by >>> Peter's recent patch to sched/fair.c refereed to by Stefan above. For >>> us anyway the patches that Stefan applied did not solve the issue and >>> neither did any 4.5.x or 4.6.x released kernel thus far, hopefully it >>> does the trick for you. We could get about 4 hours uptime before >>> things went haywire for us. >>> >>> It's interesting how it seems the CEPH workload triggers this bug so >>> well as it's quite a long standing issue that's only just been >>> resolved, another user chimed in on the lkml thread a couple of days >>> ago as well and again his trace had ceph-osd in it as well. >>> >>> https://lkml.org/lkml/headers/2016/6/21/491 >>> >>> Campbell >>> >>> On 29 June 2016 at 18:29, Stefan Priebe - Profihost AG >>> wrote: >>>> >>>> Am 29.06.2016 um 04:30 schrieb Alex Gorbachev: >>>>> Hi Stefan, >>>>> >>>>> On Tue, Jun 28, 2016 at 1:46 PM, Stefan Priebe - Profihost AG >>>>> wrote: >>>>>> Please be aware that you may need even more patches. Overall this needs 3 >>>>>> patches. Where the first two try to fix a bug and the 3rd one fixes the >>>>>> fixes + even more bugs related to the scheduler. I've no idea on which >>>>>> patch >>>>>> level Ubuntu is. >>>>> >>>>> Stefan, would you be able to please point to the other two patches >>>>> beside https://lkml.org/lkml/diff/2016/6/22/102/1 ? >>>> >>>> Sorry sure yes: >>>> >>>> 1. 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a >>>> bounded value") >>>> >>>> 2.) 40ed9cba24bb7e01cc380a02d3f04065b8afae1d ("sched/fair: Fix >>>> post_init_entity_util_avg() serialization") >>>> >>>> 3.) the one listed at lkml. >>>> >>>> Stefan >>>> >>>>> >>>>> Thank you, >>>>> Alex >>>>> >>>>>> >>>>>> Stefan >>>>>> >>>>>> Excuse my typo sent from my mobile phone. >>>>>> >>>>>> Am 28.06.2016 um 17:59 schrieb Tim Bishop : >>>>>> >>>>>> Yes - I noticed this today on Ubuntu 16.04 with the default kernel. No >>>>>> useful information to add other than it's not just you. >>>>>> >>>>>> Tim. >>>>>> >>>>>> On Tue, Jun 28, 2016 at 11:05:40AM -0400, Alex Gorbachev wrote: >>>>>> >>>>>> After upgrading to kernel 4.4.13 on Ubuntu, we are seeing a few of >>>>>> >>>>>> these issues where an OSD would fail with the stack below. I logged a >>>>>> >>>>>> bug at https://bugzilla.kernel.org/show_bug.cgi?id=121101 and there is >>>>>> >>>>>> a similar description at https://lkml.org/lkml/2016/6/22/102, but the >>>>>> >>>>>> odd part is we have turned off CFQ and blk-mq/scsi-mq and are using >>>>>> >>>>>> just the noop scheduler. >>>>>> >>>>>> >>>>>> Does the ceph kernel code somehow use the fair scheduler code block? 
>>>>>> >>>>>> >>>>>> Thanks >>>>>> >>>>>> -- >>>>>> >>>>>> Alex Gorbachev >>>>>> >>>>>> Storcium >
Re: [ceph-users] Is anyone seeing issues with task_numa_find_cpu?
On Wed, Jun 29, 2016 at 5:41 AM, Campbell Steven wrote: > Hi Alex/Stefan, > > I'm in the middle of testing 4.7rc5 on our test cluster to confirm > once and for all this particular issue has been completely resolved by > Peter's recent patch to sched/fair.c refereed to by Stefan above. For > us anyway the patches that Stefan applied did not solve the issue and > neither did any 4.5.x or 4.6.x released kernel thus far, hopefully it > does the trick for you. We could get about 4 hours uptime before > things went haywire for us. > > It's interesting how it seems the CEPH workload triggers this bug so > well as it's quite a long standing issue that's only just been > resolved, another user chimed in on the lkml thread a couple of days > ago as well and again his trace had ceph-osd in it as well. > > https://lkml.org/lkml/headers/2016/6/21/491 > > Campbell Campbell, any luck with testing 4.7rc5? rc6 came out just now, and I am having trouble booting it on an ubuntu box due to some other unrelated problem. So dropping to kernel 4.2.0 for now, which does not seem to have this load related problem. I looked at the fair.c code in kernel source tree 4.4.14 and it is quite different than Peter's patch (assuming 4.5.x source), so the patch does not apply cleanly. Maybe another 4.4.x kernel will get the update. Thanks, Alex > > On 29 June 2016 at 18:29, Stefan Priebe - Profihost AG > wrote: >> >> Am 29.06.2016 um 04:30 schrieb Alex Gorbachev: >>> Hi Stefan, >>> >>> On Tue, Jun 28, 2016 at 1:46 PM, Stefan Priebe - Profihost AG >>> wrote: >>>> Please be aware that you may need even more patches. Overall this needs 3 >>>> patches. Where the first two try to fix a bug and the 3rd one fixes the >>>> fixes + even more bugs related to the scheduler. I've no idea on which >>>> patch >>>> level Ubuntu is. >>> >>> Stefan, would you be able to please point to the other two patches >>> beside https://lkml.org/lkml/diff/2016/6/22/102/1 ? >> >> Sorry sure yes: >> >> 1. 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a >> bounded value") >> >> 2.) 40ed9cba24bb7e01cc380a02d3f04065b8afae1d ("sched/fair: Fix >> post_init_entity_util_avg() serialization") >> >> 3.) the one listed at lkml. >> >> Stefan >> >>> >>> Thank you, >>> Alex >>> >>>> >>>> Stefan >>>> >>>> Excuse my typo sent from my mobile phone. >>>> >>>> Am 28.06.2016 um 17:59 schrieb Tim Bishop : >>>> >>>> Yes - I noticed this today on Ubuntu 16.04 with the default kernel. No >>>> useful information to add other than it's not just you. >>>> >>>> Tim. >>>> >>>> On Tue, Jun 28, 2016 at 11:05:40AM -0400, Alex Gorbachev wrote: >>>> >>>> After upgrading to kernel 4.4.13 on Ubuntu, we are seeing a few of >>>> >>>> these issues where an OSD would fail with the stack below. I logged a >>>> >>>> bug at https://bugzilla.kernel.org/show_bug.cgi?id=121101 and there is >>>> >>>> a similar description at https://lkml.org/lkml/2016/6/22/102, but the >>>> >>>> odd part is we have turned off CFQ and blk-mq/scsi-mq and are using >>>> >>>> just the noop scheduler. >>>> >>>> >>>> Does the ceph kernel code somehow use the fair scheduler code block? 
>>>> >>>> >>>> Thanks >>>> >>>> -- >>>> >>>> Alex Gorbachev >>>> >>>> Storcium >>>> >>>> >>>> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.684974] CPU: 30 PID: >>>> >>>> 10403 Comm: ceph-osd Not tainted 4.4.13-040413-generic #201606072354 >>>> >>>> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.684991] Hardware name: >>>> >>>> Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 >>>> >>>> 03/04/2015 >>>> >>>> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685009] task: >>>> >>>> 880f79df8000 ti: 880f79fb8000 task.ti: 880f79fb8000 >>>> >>>> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685024] RIP: >>>> >>>> 0010:[] [] >>>> >>>> task_numa_find_cpu+0x22e/0x6f0 >>>> >>>> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685051] RSP:
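[Editor's note: a sketch of how to repeat the check Alex describes against any stable tag — whether the two prerequisite commits Stefan listed are already present, and whether the lkml patch applies cleanly. The local patch filename is hypothetical.]

```bash
# Sketch only: assumes a full clone of linux-stable and the lkml patch saved
# locally as task_numa_find_cpu.patch (hypothetical name).
git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
cd linux-stable

# Are the two prerequisite sched/fair fixes already contained in v4.4.14?
git merge-base --is-ancestor 2b8c41daba32 v4.4.14 \
    && echo "util_avg init fix present" || echo "util_avg init fix missing"
git merge-base --is-ancestor 40ed9cba24bb v4.4.14 \
    && echo "serialization fix present" || echo "serialization fix missing"

# Dry-run the third (lkml) patch against that tag to see whether it applies cleanly.
git checkout v4.4.14
git apply --check ../task_numa_find_cpu.patch
```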
Re: [ceph-users] suse_enterprise_storage3_rbd_LIO_vmware_performance_bad
HI Nick, On Fri, Jul 1, 2016 at 2:11 PM, Nick Fisk wrote: > However, there are a number of pain points with iSCSI + ESXi + RBD and they > all mainly centre on write latency. It seems VMFS was designed around the > fact that Enterprise storage arrays service writes in 10-100us, whereas Ceph > will service them in 2-10ms. > > 1. Thin Provisioning makes things slow. I believe the main cause is that when > growing and zeroing the new blocks, metadata needs to be updated and the > block zero'd. Both issue small IO which would normally not be a problem, but > with Ceph it becomes a bottleneck to overall IO on the datastore. > > 2. Snapshots effectively turn all IO into 64kb IO's. Again a traditional SAN > will coalesce these back into a stream of larger IO's before committing to > disk. However with Ceph each IO takes 2-10ms and so everything seems slow. > The future feature of persistent RBD cache may go a long way to helping with > this. Are you referring to ESXi snapshots? Specifically, if a VM is running off a snapshot (https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1015180), its IO will drop to 64KB "grains"? > 3. >2TB VMDK's with snapshots use a different allocation mode, which happens > in 4kb chunks instead of 64kb ones. This makes the problem 16 times worse > than above. > > 4. Any of the above will also apply when migrating machines around, so VM's > can takes hours/days to move. > > 5. If you use FILEIO, you can't use thin provisioning. If you use BLOCKIO, > you get thin provisioning, but no pagecache or readahead, so performance can > nose dive if this is needed. Would not FILEIO also leverage the Linux scheduler to do IO coalescing and help with (2) ? Since FILEIO also uses the dirty flush mechanism in page cache (and makes IO somewhat crash-unsafe at the same time). > 6. iSCSI is very complicated (especially ALUA) and sensitive. Get used to > seeing APD/PDL even when you think you have finally got everything working > great. We were used to seeing APD/PDL all the time with LIO, but pretty much have not seen any with SCST > 3.1. Most of the ESXi problems are with just with high latency periods, which are not a problem for the hypervisor itself, but rather for the databases or applications inside VMs. Thanks, Alex > > > Normal IO from eager zeroed VM's with no snapshots, however should perform > ok. So depends what your workload is. > > > And then comes NFS. It's very easy to setup, very easy to configure for HA, > and works pretty well overall. You don't seem to get any of the IO size > penalties when using snapshots. If you mount with discard, thin provisioning > is done by Ceph. You can defragment the FS on the proxy node and several > other things that you can't do with VMFS. Just make sure you run the server > in sync mode to avoid data loss. > > The only downside is that every IO causes an IO to the FS and one to the FS > journal, so you effectively double your IO. But if your Ceph backend can > support it, then it shouldn't be too much of a problem. > > Now to the original poster, assuming the iSCSI node is just kernel mounting > the RBD, I would run iostat on it, to try and see what sort of latency you > are seeing at that point. Also do the same with esxtop +u, and look at the > write latency there, both whilst running the fio in the VM. This should > hopefully let you see if there is just a gradual increase as you go from hop > to hop or if there is an obvious culprit. > > Can you also confirm your kernel version? 
> > With 1GB networking I think you will struggle to get your write latency much > below 10-15ms, but from your example ~30ms is still a bit high. I wonder if > the default queue depths on your iSCSI target are too low as well? > > Nick > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Oliver Dzombic >> Sent: 01 July 2016 09:27 >> To: ceph-users@lists.ceph.com >> Subject: Re: [ceph-users] >> suse_enterprise_storage3_rbd_LIO_vmware_performance_bad >> >> Hi, >> >> my experience: >> >> ceph + iscsi ( multipath ) + vmware == worst >> >> Better you search for another solution. >> >> vmware + nfs + vmware might have a much better performance. >> >> >> >> If you are able to get vmware run with iscsi and ceph, i would be >> >>very<< intrested in what/how you did that. >> >> -- >> Mit freundlichen Gruessen / Best regards >> >> Oliver Dzombic >> IP-Interactive >> >> mailto:i...@ip-interactive.de >> >> Anschrift: >> >> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3 >> 63571 Gelnhausen >> >> HRB 93402 beim Amtsgericht Hanau >> Geschäftsführung: Oliver Dzombic >> >> Steuer Nr.: 35 236 3622 1 >> UST ID: DE274086107 >> >> >> Am 01.07.2016 um 07:04 schrieb mq: >> > Hi list >> > I have tested suse enterprise storage3 using 2 iscsi gateway attached >> > to vmware. The performance is bad. I have turn off VAAI followin
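[Editor's note: a minimal sketch of the latency check Nick describes — iostat on the proxy node plus esxtop while fio runs in the guest. The device name, guest disk, and fio parameters are placeholders, not values taken from this thread.]

```bash
# On the iSCSI proxy node: per-device latency (await/w_await columns) of the
# kernel-mapped RBD, sampled every second while the guest test runs.
iostat -x rbd0 1

# Inside the test VM: a simple synchronous 4k write job to expose write latency.
fio --name=synclat --filename=/dev/sdb --rw=write --bs=4k --iodepth=1 \
    --direct=1 --sync=1 --runtime=60 --time_based

# On the ESXi host: run esxtop, press 'u' for the disk-device view, and watch
# DAVG/KAVG/GAVG for the LUN backing the datastore over the same interval.
```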
Re: [ceph-users] Backing up RBD snapshots to a different cloud service
Hi Brendan, On Friday, July 8, 2016, Brendan Moloney wrote: > Hi, > > We have a smallish Ceph cluster for RBD images. We use snapshotting for > local incremental backups. I would like to start sending some of these > snapshots to an external cloud service (likely Amazon) for disaster > recovery purposes. > > Does anyone have advice on how to do this? I suppose I could just use the > rbd export/diff commands but some of our RBD images are quite large > (multiple terabytes) so I can imagine this becoming quite inefficient. We > would either need to keep all snapshots indefinitely and retrieve every > single snapshot to recover or we would have to occasionally send a new full > disk image. > > I guess doing the backups on the object level could potentially avoid > these issues, but I am not sure how to go about that. > We are currently rolling out a solution that utilizes merge-diff command to continuously create synthetic fulls at the remote site. The remote site needs to be more than just storage, e.g. a Linux VM or such, but as long as the continuity of snapshots is maintained, you should be able to recover from just the one image. Detecting start and end snapshot of a diff export file is not hard, I asked details earlier on this list, and would be happy to send you code stubs in Perl if you are interested. Another option, which we have not yet tried with RBD exports is the borgbackup project, which offers excellent deduplication. HTH, Alex > > > Any advice is greatly appreciated. > > Thanks, > Brendan > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
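[Editor's note: a rough outline of the synthetic-full workflow Alex describes, using rbd export-diff and merge-diff. Pool, image, snapshot, and host names are hypothetical; it assumes daily date-named snapshots and a remote Linux host "backup" reachable over ssh with the rbd CLI installed. Treat it as a sketch, not tested code.]

```bash
#!/bin/bash
POOL=rbd IMG=vm-disk
TODAY=$(date +%Y%m%d)
YESTERDAY=$(date -d yesterday +%Y%m%d)

rbd snap create "$POOL/$IMG@$TODAY"

if ! ssh backup "test -f /backup/$IMG.base"; then
    # First run: an export-diff with no --from-snap is effectively a full export.
    rbd export-diff "$POOL/$IMG@$TODAY" - | ssh backup "cat > /backup/$IMG.base"
else
    # Later runs: ship only the delta since yesterday's snapshot...
    rbd export-diff --from-snap "$YESTERDAY" "$POOL/$IMG@$TODAY" - \
        | ssh backup "cat > /backup/$IMG.$TODAY.diff"
    # ...and fold it into the baseline so the remote side always holds a single
    # synthetic full whose end snapshot matches the latest local snapshot.
    ssh backup "rbd merge-diff /backup/$IMG.base /backup/$IMG.$TODAY.diff /backup/$IMG.base.new \
                && mv /backup/$IMG.base.new /backup/$IMG.base"
fi
```

Recovery should then need only the single baseline file (an rbd import-diff into a freshly created image), which matches the "recover from just the one image" point above.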
Re: [ceph-users] ceph + vmware
Hi Oliver, On Friday, July 8, 2016, Oliver Dzombic wrote: > Hi, > > does anyone have experience how to connect vmware with ceph smart ? > > iSCSI multipath does not really worked well. > NFS could be, but i think thats just too much layers in between to have > some useable performance. > > Systems like ScaleIO have developed a vmware addon to talk with it. > > Is there something similar out there for ceph ? > > What are you using ? We use RBD with SCST, Pacemaker and EnhanceIO (for read only SSD caching). The HA agents are open source, there are several options for those. Currently running 3 VMware clusters with 15 hosts total, and things are quite decent. Regards, Alex Gorbachev Storcium > > Thank you ! > > -- > Mit freundlichen Gruessen / Best regards > > Oliver Dzombic > IP-Interactive > > mailto:i...@ip-interactive.de > > Anschrift: > > IP Interactive UG ( haftungsbeschraenkt ) > Zum Sonnenberg 1-3 > 63571 Gelnhausen > > HRB 93402 beim Amtsgericht Hanau > Geschäftsführung: Oliver Dzombic > > Steuer Nr.: 35 236 3622 1 > UST ID: DE274086107 > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Is anyone seeing issues with task_numa_find_cpu?
On Mon, Jul 18, 2016 at 4:41 AM, Василий Ангапов wrote: > Guys, > > This bug is hitting me constantly, may be once per several days. Does > anyone know is there a solution already? I see there is a fix available, and am waiting for a backport to a longterm kernel: https://lkml.org/lkml/2016/7/12/919 https://lkml.org/lkml/2016/7/12/297 -- Alex Gorbachev Storcium > > 2016-07-05 11:47 GMT+03:00 Nick Fisk : >>> -Original Message- >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >>> Alex Gorbachev >>> Sent: 04 July 2016 20:50 >>> To: Campbell Steven >>> Cc: ceph-users ; Tim Bishop >> li...@bishnet.net> >>> Subject: Re: [ceph-users] Is anyone seeing iissues with >>> task_numa_find_cpu? >>> >>> On Wed, Jun 29, 2016 at 5:41 AM, Campbell Steven >>> wrote: >>> > Hi Alex/Stefan, >>> > >>> > I'm in the middle of testing 4.7rc5 on our test cluster to confirm >>> > once and for all this particular issue has been completely resolved by >>> > Peter's recent patch to sched/fair.c refereed to by Stefan above. For >>> > us anyway the patches that Stefan applied did not solve the issue and >>> > neither did any 4.5.x or 4.6.x released kernel thus far, hopefully it >>> > does the trick for you. We could get about 4 hours uptime before >>> > things went haywire for us. >>> > >>> > It's interesting how it seems the CEPH workload triggers this bug so >>> > well as it's quite a long standing issue that's only just been >>> > resolved, another user chimed in on the lkml thread a couple of days >>> > ago as well and again his trace had ceph-osd in it as well. >>> > >>> > https://lkml.org/lkml/headers/2016/6/21/491 >>> > >>> > Campbell >>> >>> Campbell, any luck with testing 4.7rc5? rc6 came out just now, and I am >>> having trouble booting it on an ubuntu box due to some other unrelated >>> problem. So dropping to kernel 4.2.0 for now, which does not seem to have >>> this load related problem. >>> >>> I looked at the fair.c code in kernel source tree 4.4.14 and it is quite >> different >>> than Peter's patch (assuming 4.5.x source), so the patch does not apply >>> cleanly. Maybe another 4.4.x kernel will get the update. >> >> I put in a new 16.04 node yesterday and went straight to 4.7.rc6. It's been >> backfilling for just under 24 hours now with no drama. Disks are set to use >> CFQ. >> >>> >>> Thanks, >>> Alex >>> >>> >>> >>> > >>> > On 29 June 2016 at 18:29, Stefan Priebe - Profihost AG >>> > wrote: >>> >> >>> >> Am 29.06.2016 um 04:30 schrieb Alex Gorbachev: >>> >>> Hi Stefan, >>> >>> >>> >>> On Tue, Jun 28, 2016 at 1:46 PM, Stefan Priebe - Profihost AG >>> >>> wrote: >>> >>>> Please be aware that you may need even more patches. Overall this >>> >>>> needs 3 patches. Where the first two try to fix a bug and the 3rd >>> >>>> one fixes the fixes + even more bugs related to the scheduler. I've >>> >>>> no idea on which patch level Ubuntu is. >>> >>> >>> >>> Stefan, would you be able to please point to the other two patches >>> >>> beside https://lkml.org/lkml/diff/2016/6/22/102/1 ? >>> >> >>> >> Sorry sure yes: >>> >> >>> >> 1. 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a >>> >> bounded value") >>> >> >>> >> 2.) 40ed9cba24bb7e01cc380a02d3f04065b8afae1d ("sched/fair: Fix >>> >> post_init_entity_util_avg() serialization") >>> >> >>> >> 3.) the one listed at lkml. >>> >> >>> >> Stefan >>> >> >>> >>> >>> >>> Thank you, >>> >>> Alex >>> >>> >>> >>>> >>> >>>> Stefan >>> >>>> >>> >>>> Excuse my typo sent from my mobile phone. 
>>> >>>> >>> >>>> Am 28.06.2016 um 17:59 schrieb Tim Bishop : >>> >>>> >>> >>>> Yes - I noticed this today on Ubuntu 16.04 with the default kernel. >>> >>>> No
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
Hi Vlad, On Mon, Jul 25, 2016 at 10:44 PM, Vladislav Bolkhovitin wrote: > Hi, > > I would suggest to rebuild SCST in the debug mode (after "make 2debug"), then > before > calling the unmap command enable "scsi" and "debug" logging for scst and > scst_vdisk > modules by 'echo add scsi >/sys/kernel/scst_tgt/trace_level; echo "add scsi" >>/sys/kernel/scst_tgt/handlers/vdisk_fileio/trace_level; echo "add debug" >>/sys/kernel/scst_tgt/handlers/vdisk_fileio/trace_level', then check, if for >>the unmap > command vdisk_unmap_range() is reporting running blkdev_issue_discard() in > the kernel > logs. > > To double check, you might also add trace statement just before > blkdev_issue_discard() > in vdisk_unmap_range(). With the debug settings on, I am seeing the below output - this means that discard is being sent to the backing (RBD) device, correct? Including the ceph-users list to see if there is a reason RBD is not processing this discard/unmap. Thank you, -- Alex Gorbachev Storcium Jul 26 08:23:38 e1 kernel: [ 858.324715] [20426]: scst: scst_cmd_done_local:2272:cmd 88201b552940, status 0, msg_status 0, host_status 0, driver_status 0, resp_data_len 0 Jul 26 08:23:38 e1 kernel: [ 858.324740] [20426]: vdisk_parse_offset:2930:cmd 88201b552c00, lba_start 0, loff 0, data_len 24 Jul 26 08:23:38 e1 kernel: [ 858.324743] [20426]: vdisk_unmap_range:3810:Unmapping lba 61779968 (blocks 8192) Jul 26 08:23:38 e1 kernel: [ 858.336218] [20426]: scst: scst_cmd_done_local:2272:cmd 88201b552c00, status 0, msg_status 0, host_status 0, driver_status 0, resp_data_len 0 Jul 26 08:23:38 e1 kernel: [ 858.336232] [20426]: vdisk_parse_offset:2930:cmd 88201b552ec0, lba_start 0, loff 0, data_len 24 Jul 26 08:23:38 e1 kernel: [ 858.336234] [20426]: vdisk_unmap_range:3810:Unmapping lba 61788160 (blocks 8192) Jul 26 08:23:38 e1 kernel: [ 858.351446] [20426]: scst: scst_cmd_done_local:2272:cmd 88201b552ec0, status 0, msg_status 0, host_status 0, driver_status 0, resp_data_len 0 Jul 26 08:23:38 e1 kernel: [ 858.351468] [20426]: vdisk_parse_offset:2930:cmd 88201b553180, lba_start 0, loff 0, data_len 24 Jul 26 08:23:38 e1 kernel: [ 858.351471] [20426]: vdisk_unmap_range:3810:Unmapping lba 61796352 (blocks 8192) Jul 26 08:23:38 e1 kernel: [ 858.373407] [20426]: scst: scst_cmd_done_local:2272:cmd 88201b553180, status 0, msg_status 0, host_status 0, driver_status 0, resp_data_len 0 Jul 26 08:23:38 e1 kernel: [ 858.373422] [20426]: vdisk_parse_offset:2930:cmd 88201b553440, lba_start 0, loff 0, data_len 24 Jul 26 08:23:38 e1 kernel: [ 858.373424] [20426]: vdisk_unmap_range:3810:Unmapping lba 61804544 (blocks 8192) Jul 26 08:24:04 e1 kernel: [ 884.170201] [6290]: scst_cmd_init_done:829:CDB: Jul 26 08:24:04 e1 kernel: [ 884.170202] (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F Jul 26 08:24:04 e1 kernel: [ 884.170205]0: 42 00 00 00 00 00 00 00 18 00 00 00 00 00 00 00 B... 
Jul 26 08:24:04 e1 kernel: [ 884.170268] [6290]: scst: scst_parse_cmd:1312:op_name (cmd 88201b556300), direction=1 (expected 1, set yes), lba=0, bufflen=24, data len 24, out_bufflen=0, (expected len data 24, expected len DIF 0, out expected len 0), flags=0x80260, internal 0, naca 0 Jul 26 08:24:04 e1 kernel: [ 884.173983] [20426]: scst: scst_cmd_done_local:2272:cmd 88201b556b40, status 0, msg_status 0, host_status 0, driver_status 0, resp_data_len 0 Jul 26 08:24:04 e1 kernel: [ 884.173998] [20426]: vdisk_parse_offset:2930:cmd 88201b556e00, lba_start 0, loff 0, data_len 24 Jul 26 08:24:04 e1 kernel: [ 884.174001] [20426]: vdisk_unmap_range:3810:Unmapping lba 74231808 (blocks 8192) Jul 26 08:24:04 e1 kernel: [ 884.174224] [6290]: scst: scst_cmd_init_done:828:NEW CDB: len 16, lun 16, initiator iqn.1995-05.com.vihl2.ibft, target iqn.2008-10.net.storcium:scst.1, queue_type 1, tag 4005936 (cmd 88201b5565c0, sess 880ffa2c) Jul 26 08:24:04 e1 kernel: [ 884.174227] [6290]: scst_cmd_init_done:829:CDB: Jul 26 08:24:04 e1 kernel: [ 884.174228] (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F Jul 26 08:24:04 e1 kernel: [ 884.174231]0: 42 00 00 00 00 00 00 00 18 00 00 00 00 00 00 00 B... Jul 26 08:24:04 e1 kernel: [ 884.174256] [6290]: scst: scst_parse_cmd:1312:op_name (cmd 88201b5565c0), direction=1 (expected 1, set yes), lba=0, bufflen=24, data len 24, out_bufflen=0, (expected len data 24, expected len DIF 0, out expected len 0), flags=0x80260, internal 0, naca 0 > > Alex Gorbachev wrote on 07/23/2016 08:48 PM: >> Hi Nick, Vlad, SCST Team, >> >>>>> I have been looking at using the rbd-nbd tool, so that the caching is >>>> provided by librbd and then use BLOCKIO with SCST. This will however need >>>> some work on the SCST resource agents to ensure the librbd cache is >
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
One other experiment: just running blkdiscard against the RBD block device completely clears it, to the point where the rbd-diff method reports 0 blocks utilized. So to summarize: - ESXi sending UNMAP via SCST does not seem to release storage from RBD (BLOCKIO handler that is supposed to work with UNMAP) - blkdiscard does release the space -- Alex Gorbachev Storcium On Wed, Jul 27, 2016 at 11:55 AM, Alex Gorbachev wrote: > Hi Vlad, > > On Mon, Jul 25, 2016 at 10:44 PM, Vladislav Bolkhovitin wrote: >> Hi, >> >> I would suggest to rebuild SCST in the debug mode (after "make 2debug"), >> then before >> calling the unmap command enable "scsi" and "debug" logging for scst and >> scst_vdisk >> modules by 'echo add scsi >/sys/kernel/scst_tgt/trace_level; echo "add scsi" >>>/sys/kernel/scst_tgt/handlers/vdisk_fileio/trace_level; echo "add debug" >>>/sys/kernel/scst_tgt/handlers/vdisk_fileio/trace_level', then check, if for >>>the unmap >> command vdisk_unmap_range() is reporting running blkdev_issue_discard() in >> the kernel >> logs. >> >> To double check, you might also add trace statement just before >> blkdev_issue_discard() >> in vdisk_unmap_range(). > > With the debug settings on, I am seeing the below output - this means > that discard is being sent to the backing (RBD) device, correct? > > Including the ceph-users list to see if there is a reason RBD is not > processing this discard/unmap. > > Thank you, > -- > Alex Gorbachev > Storcium > > Jul 26 08:23:38 e1 kernel: [ 858.324715] [20426]: scst: > scst_cmd_done_local:2272:cmd 88201b552940, status 0, msg_status 0, > host_status 0, driver_status 0, resp_data_len 0 > Jul 26 08:23:38 e1 kernel: [ 858.324740] [20426]: > vdisk_parse_offset:2930:cmd 88201b552c00, lba_start 0, loff 0, > data_len 24 > Jul 26 08:23:38 e1 kernel: [ 858.324743] [20426]: > vdisk_unmap_range:3810:Unmapping lba 61779968 (blocks 8192) > Jul 26 08:23:38 e1 kernel: [ 858.336218] [20426]: scst: > scst_cmd_done_local:2272:cmd 88201b552c00, status 0, msg_status 0, > host_status 0, driver_status 0, resp_data_len 0 > Jul 26 08:23:38 e1 kernel: [ 858.336232] [20426]: > vdisk_parse_offset:2930:cmd 88201b552ec0, lba_start 0, loff 0, > data_len 24 > Jul 26 08:23:38 e1 kernel: [ 858.336234] [20426]: > vdisk_unmap_range:3810:Unmapping lba 61788160 (blocks 8192) > Jul 26 08:23:38 e1 kernel: [ 858.351446] [20426]: scst: > scst_cmd_done_local:2272:cmd 88201b552ec0, status 0, msg_status 0, > host_status 0, driver_status 0, resp_data_len 0 > Jul 26 08:23:38 e1 kernel: [ 858.351468] [20426]: > vdisk_parse_offset:2930:cmd 88201b553180, lba_start 0, loff 0, > data_len 24 > Jul 26 08:23:38 e1 kernel: [ 858.351471] [20426]: > vdisk_unmap_range:3810:Unmapping lba 61796352 (blocks 8192) > Jul 26 08:23:38 e1 kernel: [ 858.373407] [20426]: scst: > scst_cmd_done_local:2272:cmd 88201b553180, status 0, msg_status 0, > host_status 0, driver_status 0, resp_data_len 0 > Jul 26 08:23:38 e1 kernel: [ 858.373422] [20426]: > vdisk_parse_offset:2930:cmd 88201b553440, lba_start 0, loff 0, > data_len 24 > Jul 26 08:23:38 e1 kernel: [ 858.373424] [20426]: > vdisk_unmap_range:3810:Unmapping lba 61804544 (blocks 8192) > > Jul 26 08:24:04 e1 kernel: [ 884.170201] [6290]: scst_cmd_init_done:829:CDB: > Jul 26 08:24:04 e1 kernel: [ 884.170202] > (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F > Jul 26 08:24:04 e1 kernel: [ 884.170205]0: 42 00 00 00 00 00 00 > 00 18 00 00 00 00 00 00 00 B... 
> Jul 26 08:24:04 e1 kernel: [ 884.170268] [6290]: scst: > scst_parse_cmd:1312:op_name (cmd 88201b556300), > direction=1 (expected 1, set yes), lba=0, bufflen=24, data len 24, > out_bufflen=0, (expected len data 24, expected len DIF 0, out expected > len 0), flags=0x80260, internal 0, naca 0 > Jul 26 08:24:04 e1 kernel: [ 884.173983] [20426]: scst: > scst_cmd_done_local:2272:cmd 88201b556b40, status 0, msg_status 0, > host_status 0, driver_status 0, resp_data_len 0 > Jul 26 08:24:04 e1 kernel: [ 884.173998] [20426]: > vdisk_parse_offset:2930:cmd 88201b556e00, lba_start 0, loff 0, > data_len 24 > Jul 26 08:24:04 e1 kernel: [ 884.174001] [20426]: > vdisk_unmap_range:3810:Unmapping lba 74231808 (blocks 8192) > Jul 26 08:24:04 e1 kernel: [ 884.174224] [6290]: scst: > scst_cmd_init_done:828:NEW CDB: len 16, lun 16, initiator > iqn.1995-05.com.vihl2.ibft, target iqn.2008-10.net.storcium:scst.1, > queue_type 1, tag 4005936 (cmd 88201b5565c0, sess > 880ffa2c) > Jul 26 08:24:04 e1 kernel: [ 884.174227] [6290]: scst_cmd_init_done:829:CDB: > Jul 26 08:24:04 e1
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
Hi Vlad, On Wednesday, July 27, 2016, Vladislav Bolkhovitin wrote: > > Alex Gorbachev wrote on 07/27/2016 10:33 AM: > > One other experiment: just running blkdiscard against the RBD block > > device completely clears it, to the point where the rbd-diff method > > reports 0 blocks utilized. So to summarize: > > > > - ESXi sending UNMAP via SCST does not seem to release storage from > > RBD (BLOCKIO handler that is supposed to work with UNMAP) > > > > - blkdiscard does release the space > > How did you run blkdiscard? It might be that blkdiscard discarded big > areas, while ESXi > sending UNMAP commands for areas smaller, than min size, which could be > discarded, or > not aligned as needed, so those discard requests just ignored. I indeed ran blkdiscard on the whole device. So the question to the Ceph list is below what length discard is ignored? I saw at least one other user post a similar issue with ESXi-SCST-RBD. > > For completely correct test you need to run blkdiscard for exactly the > same areas, both > start and size, as the ESXi UNMAP requests you are seeing on the SCST > traces. I am running a test with the debug settings you provided, and will keep this thread updated with results. Much appreciate the guidance. Alex > > Vlad > > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
> > On Wednesday, July 27, 2016, Vladislav Bolkhovitin wrote: >> >> >> Alex Gorbachev wrote on 07/27/2016 10:33 AM: >> > One other experiment: just running blkdiscard against the RBD block >> > device completely clears it, to the point where the rbd-diff method >> > reports 0 blocks utilized. So to summarize: >> > >> > - ESXi sending UNMAP via SCST does not seem to release storage from >> > RBD (BLOCKIO handler that is supposed to work with UNMAP) >> > >> > - blkdiscard does release the space >> >> How did you run blkdiscard? It might be that blkdiscard discarded big >> areas, while ESXi >> sending UNMAP commands for areas smaller, than min size, which could be >> discarded, or >> not aligned as needed, so those discard requests just ignored. Here is the output of the debug, many more of these statements before and after. Is it correct to state then that SCST is indeed executing the discard and the RBD device is ignoring it (since the used size in ceph is not diminishing)? Jul 30 21:08:46 e1 kernel: [ 3032.199972] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570716160, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.202622] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570724352, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.207214] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570732544, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.210395] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570740736, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.212951] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570748928, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.216187] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570757120, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.219299] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570765312, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.222658] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570773504, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.225948] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570781696, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.230092] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570789888, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.234153] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570798080, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.238001] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570806272, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.240876] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570814464, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.242771] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570822656, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.244943] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570830848, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.247506] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570839040, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.250090] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570847232, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.253229] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570855424, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.256001] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570863616, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.259204] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570871808, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.261368] [22016]: 
vdisk_unmap_range:3830:Discarding (start_sector 57088, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.264025] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570888192, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.266737] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570896384, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.270143] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570904576, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.273975] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570912768, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.278163] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570920960, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.282250] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570929152, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.285932] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570937344, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.289736] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570945536, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.292506] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570953728, nr_sects 8192) Jul 30 21:08:46 e1 kernel: [ 3032.294706] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570961920, nr_sects
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
RBD illustration showing RBD ignoring discard until a certain threshold - why is that? This behavior is unfortunately incompatible with ESXi discard (UNMAP) behavior. Is there a way to lower the discard sensitivity on RBD devices? root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }' 819200 KB root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28 root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }' 819200 KB root@e1:/var/log# blkdiscard -o 0 -l 40960 /dev/rbd28 root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }' 819200 KB root@e1:/var/log# blkdiscard -o 0 -l 409600 /dev/rbd28 root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }' 819200 KB root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28 root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }' 819200 KB root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28 root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }' 782336 KB -- Alex Gorbachev Storcium On Sat, Jul 30, 2016 at 9:11 PM, Alex Gorbachev wrote: >> >> On Wednesday, July 27, 2016, Vladislav Bolkhovitin wrote: >>> >>> >>> Alex Gorbachev wrote on 07/27/2016 10:33 AM: >>> > One other experiment: just running blkdiscard against the RBD block >>> > device completely clears it, to the point where the rbd-diff method >>> > reports 0 blocks utilized. So to summarize: >>> > >>> > - ESXi sending UNMAP via SCST does not seem to release storage from >>> > RBD (BLOCKIO handler that is supposed to work with UNMAP) >>> > >>> > - blkdiscard does release the space >>> >>> How did you run blkdiscard? It might be that blkdiscard discarded big >>> areas, while ESXi >>> sending UNMAP commands for areas smaller, than min size, which could be >>> discarded, or >>> not aligned as needed, so those discard requests just ignored. > > Here is the output of the debug, many more of these statements before > and after. Is it correct to state then that SCST is indeed executing > the discard and the RBD device is ignoring it (since the used size in > ceph is not diminishing)? 
> > Jul 30 21:08:46 e1 kernel: [ 3032.199972] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570716160, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.202622] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570724352, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.207214] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570732544, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.210395] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570740736, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.212951] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570748928, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.216187] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570757120, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.219299] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570765312, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.222658] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570773504, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.225948] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570781696, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.230092] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570789888, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.234153] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570798080, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.238001] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570806272, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.240876] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570814464, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.242771] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570822656, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.244943] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570830848, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.247506] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570839040, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.250090] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570847232, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.253229] [22016]: > vdi
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
Hi Ilya, On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov wrote: > On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev > wrote: >> RBD illustration showing RBD ignoring discard until a certain >> threshold - why is that? This behavior is unfortunately incompatible >> with ESXi discard (UNMAP) behavior. >> >> Is there a way to lower the discard sensitivity on RBD devices? >> >> >> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28 >> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { >> print SUM/1024 " KB" }' >> 819200 KB >> >> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28 >> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { >> print SUM/1024 " KB" }' >> 782336 KB > > Think about it in terms of underlying RADOS objects (4M by default). > There are three cases: > > discard range | command > - > whole object| delete > object's tail | truncate > object's head | zero > > Obviously, only delete and truncate free up space. In all of your > examples, except the last one, you are attempting to discard the head > of the (first) object. > > You can free up as little as a sector, as long as it's the tail: > > OffsetLength Type > 0 4194304 data > > # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28 > > OffsetLength Type > 0 4193792 data Looks like ESXi is sending in each discard/unmap with the fixed granularity of 8192 sectors, which is passed verbatim by SCST. There is a slight reduction in size via rbd diff method, but now I understand that actual truncate only takes effect when the discard happens to clip the tail of an image. So far looking at https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513 ...the only variable we can control is the count of 8192-sector chunks and not their size. Which means that most of the ESXi discard commands will be disregarded by Ceph. Vlad, is 8192 sectors coming from ESXi, as in the debug: Aug 1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector 1342099456, nr_sects 8192) Thank you, Alex > > Thanks, > > Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
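[Editor's note: to make the delete/truncate/zero rule quoted above concrete, here is a toy calculator (bash arithmetic only, assuming the default 4M object size) of how much backing space a given discard can actually return. It models only the rule as stated, not any internal librbd logic.]

```bash
OBJ=$((4 * 1024 * 1024))    # default RADOS object size

freed_by_discard() {        # args: offset-in-bytes length-in-bytes
    local off=$1 len=$2 end=$((off + len)) freed=0 o
    for ((o = off / OBJ; o <= (end - 1) / OBJ; o++)); do
        local ostart=$((o * OBJ)) oend=$(( (o + 1) * OBJ ))
        local s=$(( off > ostart ? off : ostart ))
        local e=$(( end < oend ? end : oend ))
        if (( s == ostart && e == oend )); then
            freed=$((freed + OBJ))      # whole object   -> delete
        elif (( e == oend )); then
            freed=$((freed + e - s))    # object's tail  -> truncate
        fi                              # head or middle -> zeroed, nothing freed
    done
    echo "$freed bytes freed"
}

freed_by_discard $((1024 * 1024)) $((4 * 1024 * 1024))   # 1M into an object: only the 3M tail is freed
freed_by_discard 0 $((4 * 1024 * 1024))                  # object-aligned: the whole 4M object is freed
```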
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin wrote: > Alex Gorbachev wrote on 08/01/2016 04:05 PM: >> Hi Ilya, >> >> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov wrote: >>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev >>> wrote: >>>> RBD illustration showing RBD ignoring discard until a certain >>>> threshold - why is that? This behavior is unfortunately incompatible >>>> with ESXi discard (UNMAP) behavior. >>>> >>>> Is there a way to lower the discard sensitivity on RBD devices? >>>> >> >>>> >>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28 >>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { >>>> print SUM/1024 " KB" }' >>>> 819200 KB >>>> >>>> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28 >>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { >>>> print SUM/1024 " KB" }' >>>> 782336 KB >>> >>> Think about it in terms of underlying RADOS objects (4M by default). >>> There are three cases: >>> >>> discard range | command >>> - >>> whole object| delete >>> object's tail | truncate >>> object's head | zero >>> >>> Obviously, only delete and truncate free up space. In all of your >>> examples, except the last one, you are attempting to discard the head >>> of the (first) object. >>> >>> You can free up as little as a sector, as long as it's the tail: >>> >>> OffsetLength Type >>> 0 4194304 data >>> >>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28 >>> >>> OffsetLength Type >>> 0 4193792 data >> >> Looks like ESXi is sending in each discard/unmap with the fixed >> granularity of 8192 sectors, which is passed verbatim by SCST. There >> is a slight reduction in size via rbd diff method, but now I >> understand that actual truncate only takes effect when the discard >> happens to clip the tail of an image. >> >> So far looking at >> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513 >> >> ...the only variable we can control is the count of 8192-sector chunks >> and not their size. Which means that most of the ESXi discard >> commands will be disregarded by Ceph. >> >> Vlad, is 8192 sectors coming from ESXi, as in the debug: >> >> Aug 1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector >> 1342099456, nr_sects 8192) > > Yes, correct. However, to make sure that VMware is not (erroneously) enforced > to do this, you need to perform one more check. > > 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here correct > granularity and alignment (4M, I guess?) This seems to reflect the granularity (4194304), which matches the 8192 pages (8192 x 512 = 4194304). However, there is no alignment value. Can discard_alignment be specified with RBD? > > 2. Connect to the this iSCSI device from a Linux box and run sg_inq -p 0xB0 > /dev/ > > SCST should correctly report those values for unmap parameters (in blocks). > > If in both cases you see correct the same values, then this is VMware issue, > because it is ignoring what it is told to do (generate appropriately sized > and aligned UNMAP requests). If either Ceph, or SCST doesn't show correct > numbers, then the broken party should be fixed. > > Vlad > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov wrote: > On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev > wrote: >> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin wrote: >>> Alex Gorbachev wrote on 08/01/2016 04:05 PM: >>>> Hi Ilya, >>>> >>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov wrote: >>>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev >>>>> wrote: >>>>>> RBD illustration showing RBD ignoring discard until a certain >>>>>> threshold - why is that? This behavior is unfortunately incompatible >>>>>> with ESXi discard (UNMAP) behavior. >>>>>> >>>>>> Is there a way to lower the discard sensitivity on RBD devices? >>>>>> >>>> >>>>>> >>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28 >>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { >>>>>> print SUM/1024 " KB" }' >>>>>> 819200 KB >>>>>> >>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28 >>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { >>>>>> print SUM/1024 " KB" }' >>>>>> 782336 KB >>>>> >>>>> Think about it in terms of underlying RADOS objects (4M by default). >>>>> There are three cases: >>>>> >>>>> discard range | command >>>>> - >>>>> whole object| delete >>>>> object's tail | truncate >>>>> object's head | zero >>>>> >>>>> Obviously, only delete and truncate free up space. In all of your >>>>> examples, except the last one, you are attempting to discard the head >>>>> of the (first) object. >>>>> >>>>> You can free up as little as a sector, as long as it's the tail: >>>>> >>>>> OffsetLength Type >>>>> 0 4194304 data >>>>> >>>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28 >>>>> >>>>> OffsetLength Type >>>>> 0 4193792 data >>>> >>>> Looks like ESXi is sending in each discard/unmap with the fixed >>>> granularity of 8192 sectors, which is passed verbatim by SCST. There >>>> is a slight reduction in size via rbd diff method, but now I >>>> understand that actual truncate only takes effect when the discard >>>> happens to clip the tail of an image. >>>> >>>> So far looking at >>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513 >>>> >>>> ...the only variable we can control is the count of 8192-sector chunks >>>> and not their size. Which means that most of the ESXi discard >>>> commands will be disregarded by Ceph. >>>> >>>> Vlad, is 8192 sectors coming from ESXi, as in the debug: >>>> >>>> Aug 1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector >>>> 1342099456, nr_sects 8192) >>> >>> Yes, correct. However, to make sure that VMware is not (erroneously) >>> enforced to do this, you need to perform one more check. >>> >>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here correct >>> granularity and alignment (4M, I guess?) >> >> This seems to reflect the granularity (4194304), which matches the >> 8192 pages (8192 x 512 = 4194304). However, there is no alignment >> value. >> >> Can discard_alignment be specified with RBD? > > It's exported as a read-only sysfs attribute, just like > discard_granularity: > > # cat /sys/block/rbd0/discard_alignment > 4194304 Ah thanks Ilya, it is indeed there. Vlad, your email says to look for discard_alignment in /sys/block//queue, but for RBD it's in /sys/block/ - could this be the source of the issue? 
Here is what I get querying the iscsi-exported RBD device on Linux: root@kio1:/sys/block/sdf# sg_inq -p 0xB0 /dev/sdf VPD INQUIRY: Block limits page (SBC) Maximum compare and write length: 255 blocks Optimal transfer length granularity: 8 blocks Maximum transfer length: 16384 blocks Optimal transfer length: 1024 blocks Maximum prefetch, xdread, xdwrite transfer length: 0 blocks Maximum unmap LBA count: 8192 Maximum unmap block descriptor count: 4294967295 Optimal unmap granularity: 8192 Unmap granularity alignment valid: 1 Unmap granularity alignment: 8192 > > Thanks, > > Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
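[Editor's note: when comparing the two outputs, keep in mind that sg_inq reports the UNMAP limits in logical blocks while the rbd sysfs attributes are in bytes. A quick conversion (sketch, assuming 512-byte logical blocks and rbd0 as the mapped device) puts them on the same scale.]

```bash
BLK=512    # assumed logical block size of the exported LUN
echo "sg_inq unmap granularity: $((8192 * BLK)) bytes"   # 8192 blocks -> 4194304
echo "sg_inq unmap alignment:   $((8192 * BLK)) bytes"   # 8192 blocks -> 4194304
cat /sys/block/rbd0/queue/discard_granularity            # 4194304 per the earlier output
cat /sys/block/rbd0/discard_alignment                    # 4194304 per the earlier output
```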
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitin wrote: > Alex Gorbachev wrote on 08/02/2016 07:56 AM: >> On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov wrote: >>> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev >>> wrote: >>>> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin >>>> wrote: >>>>> Alex Gorbachev wrote on 08/01/2016 04:05 PM: >>>>>> Hi Ilya, >>>>>> >>>>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov wrote: >>>>>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev >>>>>>> wrote: >>>>>>>> RBD illustration showing RBD ignoring discard until a certain >>>>>>>> threshold - why is that? This behavior is unfortunately incompatible >>>>>>>> with ESXi discard (UNMAP) behavior. >>>>>>>> >>>>>>>> Is there a way to lower the discard sensitivity on RBD devices? >>>>>>>> >>>>>> >>>>>>>> >>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28 >>>>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { >>>>>>>> print SUM/1024 " KB" }' >>>>>>>> 819200 KB >>>>>>>> >>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28 >>>>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { >>>>>>>> print SUM/1024 " KB" }' >>>>>>>> 782336 KB >>>>>>> >>>>>>> Think about it in terms of underlying RADOS objects (4M by default). >>>>>>> There are three cases: >>>>>>> >>>>>>> discard range | command >>>>>>> - >>>>>>> whole object| delete >>>>>>> object's tail | truncate >>>>>>> object's head | zero >>>>>>> >>>>>>> Obviously, only delete and truncate free up space. In all of your >>>>>>> examples, except the last one, you are attempting to discard the head >>>>>>> of the (first) object. >>>>>>> >>>>>>> You can free up as little as a sector, as long as it's the tail: >>>>>>> >>>>>>> OffsetLength Type >>>>>>> 0 4194304 data >>>>>>> >>>>>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28 >>>>>>> >>>>>>> OffsetLength Type >>>>>>> 0 4193792 data >>>>>> >>>>>> Looks like ESXi is sending in each discard/unmap with the fixed >>>>>> granularity of 8192 sectors, which is passed verbatim by SCST. There >>>>>> is a slight reduction in size via rbd diff method, but now I >>>>>> understand that actual truncate only takes effect when the discard >>>>>> happens to clip the tail of an image. >>>>>> >>>>>> So far looking at >>>>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513 >>>>>> >>>>>> ...the only variable we can control is the count of 8192-sector chunks >>>>>> and not their size. Which means that most of the ESXi discard >>>>>> commands will be disregarded by Ceph. >>>>>> >>>>>> Vlad, is 8192 sectors coming from ESXi, as in the debug: >>>>>> >>>>>> Aug 1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector >>>>>> 1342099456, nr_sects 8192) >>>>> >>>>> Yes, correct. However, to make sure that VMware is not (erroneously) >>>>> enforced to do this, you need to perform one more check. >>>>> >>>>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here >>>>> correct granularity and alignment (4M, I guess?) >>>> >>>> This seems to reflect the granularity (4194304), which matches the >>>> 8192 pages (8192 x 512 = 4194304). However, there is no alignment >>>> value. >>>> >>>> Can discard_alignment be specified with RBD? >>> >>> It's exported as a read-only sysfs attribute, just like >>> discard_granularity: >>> >>> # cat /sys/block/rbd0/discard_alignment >>> 4194304 >> >> Ah thanks Ilya, it is indeed there. 
Vlad, your email says to look for >> discard_alignment in /sys/block//queue, but for RBD it's in >> /sys/block/ - could this be the source of the issue? > > No. As you can see below, the alignment reported correctly. So, this must be > VMware > issue, because it is ignoring the alignment parameter. You can try to align > your VMware > partition on 4M boundary, it might help. Is this not a mismatch: - From sg_inq: Unmap granularity alignment: 8192 - From "cat /sys/block/rbd0/discard_alignment": 4194304 I am compiling the latest SCST trunk now. Thanks, Alex > >> Here is what I get querying the iscsi-exported RBD device on Linux: >> >> root@kio1:/sys/block/sdf# sg_inq -p 0xB0 /dev/sdf >> VPD INQUIRY: Block limits page (SBC) >> Maximum compare and write length: 255 blocks >> Optimal transfer length granularity: 8 blocks >> Maximum transfer length: 16384 blocks >> Optimal transfer length: 1024 blocks >> Maximum prefetch, xdread, xdwrite transfer length: 0 blocks >> Maximum unmap LBA count: 8192 >> Maximum unmap block descriptor count: 4294967295 >> Optimal unmap granularity: 8192 >> Unmap granularity alignment valid: 1 >> Unmap granularity alignment: 8192 > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Wed, Aug 3, 2016 at 9:59 AM, Alex Gorbachev wrote: > On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitin wrote: >> Alex Gorbachev wrote on 08/02/2016 07:56 AM: >>> On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov wrote: >>>> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev >>>> wrote: >>>>> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin >>>>> wrote: >>>>>> Alex Gorbachev wrote on 08/01/2016 04:05 PM: >>>>>>> Hi Ilya, >>>>>>> >>>>>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov wrote: >>>>>>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev >>>>>>>> wrote: >>>>>>>>> RBD illustration showing RBD ignoring discard until a certain >>>>>>>>> threshold - why is that? This behavior is unfortunately incompatible >>>>>>>>> with ESXi discard (UNMAP) behavior. >>>>>>>>> >>>>>>>>> Is there a way to lower the discard sensitivity on RBD devices? >>>>>>>>> >>>>>>> >>>>>>>>> >>>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28 >>>>>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { >>>>>>>>> print SUM/1024 " KB" }' >>>>>>>>> 819200 KB >>>>>>>>> >>>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28 >>>>>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { >>>>>>>>> print SUM/1024 " KB" }' >>>>>>>>> 782336 KB >>>>>>>> >>>>>>>> Think about it in terms of underlying RADOS objects (4M by default). >>>>>>>> There are three cases: >>>>>>>> >>>>>>>> discard range | command >>>>>>>> - >>>>>>>> whole object| delete >>>>>>>> object's tail | truncate >>>>>>>> object's head | zero >>>>>>>> >>>>>>>> Obviously, only delete and truncate free up space. In all of your >>>>>>>> examples, except the last one, you are attempting to discard the head >>>>>>>> of the (first) object. >>>>>>>> >>>>>>>> You can free up as little as a sector, as long as it's the tail: >>>>>>>> >>>>>>>> OffsetLength Type >>>>>>>> 0 4194304 data >>>>>>>> >>>>>>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28 >>>>>>>> >>>>>>>> OffsetLength Type >>>>>>>> 0 4193792 data >>>>>>> >>>>>>> Looks like ESXi is sending in each discard/unmap with the fixed >>>>>>> granularity of 8192 sectors, which is passed verbatim by SCST. There >>>>>>> is a slight reduction in size via rbd diff method, but now I >>>>>>> understand that actual truncate only takes effect when the discard >>>>>>> happens to clip the tail of an image. >>>>>>> >>>>>>> So far looking at >>>>>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513 >>>>>>> >>>>>>> ...the only variable we can control is the count of 8192-sector chunks >>>>>>> and not their size. Which means that most of the ESXi discard >>>>>>> commands will be disregarded by Ceph. >>>>>>> >>>>>>> Vlad, is 8192 sectors coming from ESXi, as in the debug: >>>>>>> >>>>>>> Aug 1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector >>>>>>> 1342099456, nr_sects 8192) >>>>>> >>>>>> Yes, correct. However, to make sure that VMware is not (erroneously) >>>>>> enforced to do this, you need to perform one more check. >>>>>> >>>>>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should re
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Wed, Aug 3, 2016 at 10:54 AM, Alex Gorbachev wrote: > On Wed, Aug 3, 2016 at 9:59 AM, Alex Gorbachev > wrote: >> On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitin wrote: >>> Alex Gorbachev wrote on 08/02/2016 07:56 AM: >>>> On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov wrote: >>>>> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev >>>>> wrote: >>>>>> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin >>>>>> wrote: >>>>>>> Alex Gorbachev wrote on 08/01/2016 04:05 PM: >>>>>>>> Hi Ilya, >>>>>>>> >>>>>>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov >>>>>>>> wrote: >>>>>>>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev >>>>>>>>> wrote: >>>>>>>>>> RBD illustration showing RBD ignoring discard until a certain >>>>>>>>>> threshold - why is that? This behavior is unfortunately incompatible >>>>>>>>>> with ESXi discard (UNMAP) behavior. >>>>>>>>>> >>>>>>>>>> Is there a way to lower the discard sensitivity on RBD devices? >>>>>>>>>> >>>>>>>> >>>>>>>>>> >>>>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28 >>>>>>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { >>>>>>>>>> print SUM/1024 " KB" }' >>>>>>>>>> 819200 KB >>>>>>>>>> >>>>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28 >>>>>>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { >>>>>>>>>> print SUM/1024 " KB" }' >>>>>>>>>> 782336 KB >>>>>>>>> >>>>>>>>> Think about it in terms of underlying RADOS objects (4M by default). >>>>>>>>> There are three cases: >>>>>>>>> >>>>>>>>> discard range | command >>>>>>>>> - >>>>>>>>> whole object| delete >>>>>>>>> object's tail | truncate >>>>>>>>> object's head | zero >>>>>>>>> >>>>>>>>> Obviously, only delete and truncate free up space. In all of your >>>>>>>>> examples, except the last one, you are attempting to discard the head >>>>>>>>> of the (first) object. >>>>>>>>> >>>>>>>>> You can free up as little as a sector, as long as it's the tail: >>>>>>>>> >>>>>>>>> OffsetLength Type >>>>>>>>> 0 4194304 data >>>>>>>>> >>>>>>>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28 >>>>>>>>> >>>>>>>>> OffsetLength Type >>>>>>>>> 0 4193792 data >>>>>>>> >>>>>>>> Looks like ESXi is sending in each discard/unmap with the fixed >>>>>>>> granularity of 8192 sectors, which is passed verbatim by SCST. There >>>>>>>> is a slight reduction in size via rbd diff method, but now I >>>>>>>> understand that actual truncate only takes effect when the discard >>>>>>>> happens to clip the tail of an image. >>>>>>>> >>>>>>>> So far looking at >>>>>>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513 >>>>>>>> >>>>>>>> ...the only variable we can control is the count of 8192-sector chunks >>>>>>>> and not their size. Which means that most of the ESXi discard >>>>>>>> commands will be disregarded by Ceph. >>>>>>>> >>>>>>>> Vlad, is 8192 sectors coming from ESXi, as in the debug: >>>>>>>> >>>>>>>> Aug 1 19:01:36 e1 kernel: [168220.570332] Discarding
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Tuesday, August 2, 2016, Ilya Dryomov wrote: > On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev > wrote: > > On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin > wrote: > >> Alex Gorbachev wrote on 08/01/2016 04:05 PM: > >>> Hi Ilya, > >>> > >>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov > wrote: > >>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev < > a...@iss-integration.com > wrote: > >>>>> RBD illustration showing RBD ignoring discard until a certain > >>>>> threshold - why is that? This behavior is unfortunately incompatible > >>>>> with ESXi discard (UNMAP) behavior. > >>>>> > >>>>> Is there a way to lower the discard sensitivity on RBD devices? > >>>>> > >>> > >>>>> > >>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28 > >>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { > >>>>> print SUM/1024 " KB" }' > >>>>> 819200 KB > >>>>> > >>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28 > >>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { > >>>>> print SUM/1024 " KB" }' > >>>>> 782336 KB > >>>> > >>>> Think about it in terms of underlying RADOS objects (4M by default). > >>>> There are three cases: > >>>> > >>>> discard range | command > >>>> - > >>>> whole object| delete > >>>> object's tail | truncate > >>>> object's head | zero > >>>> > >>>> Obviously, only delete and truncate free up space. In all of your > >>>> examples, except the last one, you are attempting to discard the head > >>>> of the (first) object. > >>>> > >>>> You can free up as little as a sector, as long as it's the tail: > >>>> > >>>> OffsetLength Type > >>>> 0 4194304 data > >>>> > >>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28 > >>>> > >>>> OffsetLength Type > >>>> 0 4193792 data > >>> > >>> Looks like ESXi is sending in each discard/unmap with the fixed > >>> granularity of 8192 sectors, which is passed verbatim by SCST. There > >>> is a slight reduction in size via rbd diff method, but now I > >>> understand that actual truncate only takes effect when the discard > >>> happens to clip the tail of an image. > >>> > >>> So far looking at > >>> https://kb.vmware.com/selfservice/microsites/search. > do?language=en_US&cmd=displayKC&externalId=2057513 > >>> > >>> ...the only variable we can control is the count of 8192-sector chunks > >>> and not their size. Which means that most of the ESXi discard > >>> commands will be disregarded by Ceph. > >>> > >>> Vlad, is 8192 sectors coming from ESXi, as in the debug: > >>> > >>> Aug 1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector > >>> 1342099456, nr_sects 8192) > >> > >> Yes, correct. However, to make sure that VMware is not (erroneously) > enforced to do this, you need to perform one more check. > >> > >> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here > correct granularity and alignment (4M, I guess?) > > > > This seems to reflect the granularity (4194304), which matches the > > 8192 pages (8192 x 512 = 4194304). However, there is no alignment > > value. > > > > Can discard_alignment be specified with RBD? > > It's exported as a read-only sysfs attribute, just like > discard_granularity: > > # cat /sys/block/rbd0/discard_alignment > 4194304 > Is there a way to perhaps increase the discard granularity? The way I see it based on the discussion so far, here is why discard/unmap is failing to work with VMWare: - RBD provides space in 4MB blocks, which must be discarded entirely, or at least hitting the tail. 
- SCST communicates to ESXi that discard alignment is 4MB and discard granularity is also 4MB - ESXi's VMFS5 is aligned on 1MB, so 4MB discards never actually free anything What if it were possible to make a 6MB discard granularity? Thank you, Alex > > > Thanks, > > Ilya > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
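A quick way to see the discard attributes under discussion is to read them straight from sysfs on the client that maps the image; a minimal check (rbd28 is just the device from the examples above, any mapped RBD works):

  cat /sys/block/rbd28/queue/discard_granularity   # bytes; matches the RADOS object size
  cat /sys/block/rbd28/discard_alignment           # bytes
  cat /sys/block/rbd28/queue/discard_max_bytes     # largest single discard the device accepts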
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
> I'm confused. How can a 4M discard not free anything? It's either > going to hit an entire object or two adjacent objects, truncating the > tail of one and zeroing the head of another. Using rbd diff: > > $ rbd diff test | grep -A 1 25165824 > 25165824 4194304 data > 29360128 4194304 data > > # a 4M discard at 1M into a RADOS object > $ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0 > > $ rbd diff test | grep -A 1 25165824 > 25165824 1048576 data > 29360128 4194304 data I have tested this on a small RBD device with such offsets and indeed, the discard works as you describe, Ilya. Looking more into why ESXi's discard is not working. I found this message in kern.log on Ubuntu on creation of the SCST LUN, which shows unmap_alignment 0: Aug 6 22:02:33 e1 kernel: [300378.136765] virt_id 33 (p_iSCSILun_sclun945) Aug 6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin provisioning for device /dev/rbd/spin1/unmap1t Aug 6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192, unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1 Aug 6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI target virtual disk p_iSCSILun_sclun945 (file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512, nblocks=838860800, cyln=409600) Aug 6 22:02:33 e1 kernel: [300378.136847] [4682]: scst_alloc_add_tgt_dev:5287:Device p_iSCSILun_sclun945 on SCST lun=32 Aug 6 22:02:33 e1 kernel: [300378.136853] [4682]: scst: scst_alloc_set_UA:12711:Queuing new UA 8810251f3a90 (6:29:0, d_sense 0) to tgt_dev 88102583ad00 (dev p_iSCSILun_sclun945, initiator copy_manager_sess) even though: root@e1:/sys/block/rbd29# cat discard_alignment 4194304 So somehow the discard_alignment is not making it into the LUN. Could this be the issue? Thanks, Alex Aug 6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin provisioning for device /dev/rbd/spin1/unmap1t Aug 6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192, unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1 Aug 6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI target virtual disk p_iSCSILun_sclun945 (file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512, nblocks=838860800, cyln=409600) > > Thanks, > > Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Friday, August 5, 2016, matthew patton wrote: > > - ESXI's VMFS5 is aligned on 1MB, so 4MB discards never actually free > anything > > the proper solution here is to: > * quit worrying about it and buy sufficient disk in the first place, it's > not exactly expensive > I would do that for one or a couple environments, or if I sold drives :). However, the two use cases that are of importance to my group still warrant figuring this out: - Large medical image collections or frequently modified database files (quite a few deletes and creates) - Passing VMWare certification. It means a lot to people without deep dtorage knowledge, to make a decision on adopting a technology > * ask VMware to have the decency to add a flag to vmkfstools to specify > the offset > Preaching to the choir! I will ask. Hope someone will listen. > > * create a small dummy VMFS on the block device that allows you to create > a second filesystem behind it that's aligned on a 4MB boundary. Or perhaps > simpler, create a thick-zeroed VMDK (3+minimum size + extra) on the VMFS > such that the next VMDK created falls on the desired boundary. > I wonder how to do this for the test, or use a small partition like Vlad described. I will try that with one of their unmap tests > > * use NFS like *deity* intended like any other sane person, nobody uses > block storage anymore for precisely these kinds of reasons. > > Working in that direction too. A bit concerned of the double writes of the backing filesystem, then double writes for RADOS. Per Nick, this still works better than block. But having gone through 95% of certification for block, I feel like I should finish it before jumping on to the next thing. Thank you for your input, it is very practical and helpful long term. Alex > > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Mon, Aug 8, 2016 at 7:56 AM, Ilya Dryomov wrote: > On Sun, Aug 7, 2016 at 7:57 PM, Alex Gorbachev > wrote: >>> I'm confused. How can a 4M discard not free anything? It's either >>> going to hit an entire object or two adjacent objects, truncating the >>> tail of one and zeroing the head of another. Using rbd diff: >>> >>> $ rbd diff test | grep -A 1 25165824 >>> 25165824 4194304 data >>> 29360128 4194304 data >>> >>> # a 4M discard at 1M into a RADOS object >>> $ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0 >>> >>> $ rbd diff test | grep -A 1 25165824 >>> 25165824 1048576 data >>> 29360128 4194304 data >> >> I have tested this on a small RBD device with such offsets and indeed, >> the discard works as you describe, Ilya. >> >> Looking more into why ESXi's discard is not working. I found this >> message in kern.log on Ubuntu on creation of the SCST LUN, which shows >> unmap_alignment 0: >> >> Aug 6 22:02:33 e1 kernel: [300378.136765] virt_id 33 (p_iSCSILun_sclun945) >> Aug 6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin >> provisioning for device /dev/rbd/spin1/unmap1t >> Aug 6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192, >> unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1 >> Aug 6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI >> target virtual disk p_iSCSILun_sclun945 >> (file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512, >> nblocks=838860800, cyln=409600) >> Aug 6 22:02:33 e1 kernel: [300378.136847] [4682]: >> scst_alloc_add_tgt_dev:5287:Device p_iSCSILun_sclun945 on SCST lun=32 >> Aug 6 22:02:33 e1 kernel: [300378.136853] [4682]: scst: >> scst_alloc_set_UA:12711:Queuing new UA 8810251f3a90 (6:29:0, >> d_sense 0) to tgt_dev 88102583ad00 (dev p_iSCSILun_sclun945, >> initiator copy_manager_sess) >> >> even though: >> >> root@e1:/sys/block/rbd29# cat discard_alignment >> 4194304 >> >> So somehow the discard_alignment is not making it into the LUN. Could >> this be the issue? > > No, if you are not seeing *any* effect, the alignment is pretty much > irrelevant. Can you do the following on a small test image? > > - capture "rbd diff" output > - blktrace -d /dev/rbd0 -o - | blkparse -i - -o rbd0.trace > - issue a few discards with blkdiscard > - issue a few unmaps with ESXi, preferrably with SCST debugging enabled > - capture "rbd diff" output again > > and attach all of the above? (You might need to install a blktrace > package.) > Latest results from VMWare validation tests: Each test creates and deletes a virtual disk, then calls ESXi unmap for what ESXi maps to that volume: Test 1: 10GB reclaim, rbd diff size: 3GB, discards: 4829 Test 2: 100GB reclaim, rbd diff size: 50GB, discards: 197837 Test 3: 175GB reclaim, rbd diff size: 47 GB, discards: 197824 Test 4: 250GB reclaim, rbd diff size: 125GB, discards: 197837 Test 5: 250GB reclaim, rbd diff size: 80GB, discards: 197837 At the end, the compounded used size via rbd diff is 608 GB from 775GB of data. So we release only about 20% via discards in the end. Thank you, Alex ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
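A rough sketch of how the before/after numbers above can be gathered on a test image (pool and image names are examples; the awk accounting follows the one used earlier in the thread, here converted to MB):

  # terminal 1: capture the discard pattern while the test runs
  blktrace -d /dev/rbd0 -o - | blkparse -i - -o rbd0.trace
  # terminal 2: used space before, then run the ESXi unmap (or a few blkdiscard calls), then used space after
  rbd diff spin1/unmap1t | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'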
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Sat, Aug 13, 2016 at 12:36 PM, Alex Gorbachev wrote: > On Mon, Aug 8, 2016 at 7:56 AM, Ilya Dryomov wrote: >> On Sun, Aug 7, 2016 at 7:57 PM, Alex Gorbachev >> wrote: >>>> I'm confused. How can a 4M discard not free anything? It's either >>>> going to hit an entire object or two adjacent objects, truncating the >>>> tail of one and zeroing the head of another. Using rbd diff: >>>> >>>> $ rbd diff test | grep -A 1 25165824 >>>> 25165824 4194304 data >>>> 29360128 4194304 data >>>> >>>> # a 4M discard at 1M into a RADOS object >>>> $ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0 >>>> >>>> $ rbd diff test | grep -A 1 25165824 >>>> 25165824 1048576 data >>>> 29360128 4194304 data >>> >>> I have tested this on a small RBD device with such offsets and indeed, >>> the discard works as you describe, Ilya. >>> >>> Looking more into why ESXi's discard is not working. I found this >>> message in kern.log on Ubuntu on creation of the SCST LUN, which shows >>> unmap_alignment 0: >>> >>> Aug 6 22:02:33 e1 kernel: [300378.136765] virt_id 33 (p_iSCSILun_sclun945) >>> Aug 6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin >>> provisioning for device /dev/rbd/spin1/unmap1t >>> Aug 6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192, >>> unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1 >>> Aug 6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI >>> target virtual disk p_iSCSILun_sclun945 >>> (file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512, >>> nblocks=838860800, cyln=409600) >>> Aug 6 22:02:33 e1 kernel: [300378.136847] [4682]: >>> scst_alloc_add_tgt_dev:5287:Device p_iSCSILun_sclun945 on SCST lun=32 >>> Aug 6 22:02:33 e1 kernel: [300378.136853] [4682]: scst: >>> scst_alloc_set_UA:12711:Queuing new UA 8810251f3a90 (6:29:0, >>> d_sense 0) to tgt_dev 88102583ad00 (dev p_iSCSILun_sclun945, >>> initiator copy_manager_sess) >>> >>> even though: >>> >>> root@e1:/sys/block/rbd29# cat discard_alignment >>> 4194304 >>> >>> So somehow the discard_alignment is not making it into the LUN. Could >>> this be the issue? >> >> No, if you are not seeing *any* effect, the alignment is pretty much >> irrelevant. Can you do the following on a small test image? >> >> - capture "rbd diff" output >> - blktrace -d /dev/rbd0 -o - | blkparse -i - -o rbd0.trace >> - issue a few discards with blkdiscard >> - issue a few unmaps with ESXi, preferrably with SCST debugging enabled >> - capture "rbd diff" output again >> >> and attach all of the above? (You might need to install a blktrace >> package.) >> > > Latest results from VMWare validation tests: > > Each test creates and deletes a virtual disk, then calls ESXi unmap > for what ESXi maps to that volume: > > Test 1: 10GB reclaim, rbd diff size: 3GB, discards: 4829 > > Test 2: 100GB reclaim, rbd diff size: 50GB, discards: 197837 > > Test 3: 175GB reclaim, rbd diff size: 47 GB, discards: 197824 > > Test 4: 250GB reclaim, rbd diff size: 125GB, discards: 197837 > > Test 5: 250GB reclaim, rbd diff size: 80GB, discards: 197837 > > At the end, the compounded used size via rbd diff is 608 GB from 775GB > of data. So we release only about 20% via discards in the end. Ilya has analyzed the discard pattern, and indeed the problem is that ESXi appears to disregard the discard alignment attribute. Therefore, discards are shifted by 1M, and are not hitting the tail of objects. Discards work much better on the EagerZeroedThick volumes, likely due to contiguous data. 
I will proceed with the rest of testing, and will post any tips or best practice results as they become available. Thank you for everyone's help and advice! Alex ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Sat, Aug 13, 2016 at 4:51 PM, Alex Gorbachev wrote: > On Sat, Aug 13, 2016 at 12:36 PM, Alex Gorbachev > wrote: >> On Mon, Aug 8, 2016 at 7:56 AM, Ilya Dryomov wrote: >>> On Sun, Aug 7, 2016 at 7:57 PM, Alex Gorbachev >>> wrote: >>>>> I'm confused. How can a 4M discard not free anything? It's either >>>>> going to hit an entire object or two adjacent objects, truncating the >>>>> tail of one and zeroing the head of another. Using rbd diff: >>>>> >>>>> $ rbd diff test | grep -A 1 25165824 >>>>> 25165824 4194304 data >>>>> 29360128 4194304 data >>>>> >>>>> # a 4M discard at 1M into a RADOS object >>>>> $ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0 >>>>> >>>>> $ rbd diff test | grep -A 1 25165824 >>>>> 25165824 1048576 data >>>>> 29360128 4194304 data >>>> >>>> I have tested this on a small RBD device with such offsets and indeed, >>>> the discard works as you describe, Ilya. >>>> >>>> Looking more into why ESXi's discard is not working. I found this >>>> message in kern.log on Ubuntu on creation of the SCST LUN, which shows >>>> unmap_alignment 0: >>>> >>>> Aug 6 22:02:33 e1 kernel: [300378.136765] virt_id 33 (p_iSCSILun_sclun945) >>>> Aug 6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin >>>> provisioning for device /dev/rbd/spin1/unmap1t >>>> Aug 6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192, >>>> unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1 >>>> Aug 6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI >>>> target virtual disk p_iSCSILun_sclun945 >>>> (file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512, >>>> nblocks=838860800, cyln=409600) >>>> Aug 6 22:02:33 e1 kernel: [300378.136847] [4682]: >>>> scst_alloc_add_tgt_dev:5287:Device p_iSCSILun_sclun945 on SCST lun=32 >>>> Aug 6 22:02:33 e1 kernel: [300378.136853] [4682]: scst: >>>> scst_alloc_set_UA:12711:Queuing new UA 8810251f3a90 (6:29:0, >>>> d_sense 0) to tgt_dev 88102583ad00 (dev p_iSCSILun_sclun945, >>>> initiator copy_manager_sess) >>>> >>>> even though: >>>> >>>> root@e1:/sys/block/rbd29# cat discard_alignment >>>> 4194304 >>>> >>>> So somehow the discard_alignment is not making it into the LUN. Could >>>> this be the issue? >>> >>> No, if you are not seeing *any* effect, the alignment is pretty much >>> irrelevant. Can you do the following on a small test image? >>> >>> - capture "rbd diff" output >>> - blktrace -d /dev/rbd0 -o - | blkparse -i - -o rbd0.trace >>> - issue a few discards with blkdiscard >>> - issue a few unmaps with ESXi, preferrably with SCST debugging enabled >>> - capture "rbd diff" output again >>> >>> and attach all of the above? (You might need to install a blktrace >>> package.) >>> >> >> Latest results from VMWare validation tests: >> >> Each test creates and deletes a virtual disk, then calls ESXi unmap >> for what ESXi maps to that volume: >> >> Test 1: 10GB reclaim, rbd diff size: 3GB, discards: 4829 >> >> Test 2: 100GB reclaim, rbd diff size: 50GB, discards: 197837 >> >> Test 3: 175GB reclaim, rbd diff size: 47 GB, discards: 197824 >> >> Test 4: 250GB reclaim, rbd diff size: 125GB, discards: 197837 >> >> Test 5: 250GB reclaim, rbd diff size: 80GB, discards: 197837 >> >> At the end, the compounded used size via rbd diff is 608 GB from 775GB >> of data. So we release only about 20% via discards in the end. > > Ilya has analyzed the discard pattern, and indeed the problem is that > ESXi appears to disregard the discard alignment attribute. Therefore, > discards are shifted by 1M, and are not hitting the tail of objects. 
> > Discards work much better on the EagerZeroedThick volumes, likely due > to contiguous data. > > I will proceed with the rest of testing, and will post any tips or > best practice results as they become available. > > Thank you for everyone's help and advice! Testing completed - the discards definitely follow the alignment pattern: - 4MB objects and VMFS5 - only some discards due to 1MB discard not often hitting the tail of object - 1MB objects - practically 100% space reclaim I have not tried shifting the VMFS5 filesystem, as the test will not work with that. Also not sure how to properly incorporate into VMWare routine operation. So, as a best practice: If you want efficient ESXi space reclaim with RBD and VMFS5, use 1 MB object size in Ceph Best regards, -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
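A minimal sketch of that best practice at image-creation time (names and size are examples; the object size of an existing image cannot be changed, so existing volumes would need to be recreated or migrated):

  # newer rbd CLI: give the object size directly
  rbd create spin1/esxi-lun1 --size 409600 --object-size 1M
  # older syntax: order 20 means 2^20 bytes = 1 MB objects
  rbd create spin1/esxi-lun1 --size 409600 --order 20
  rbd info spin1/esxi-lun1   # look for "order 20 (1024 kB objects)"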
Re: [ceph-users] Is anyone seeing iissues with task_numa_find_cpu?
On Tue, Jul 19, 2016 at 12:04 PM, Alex Gorbachev wrote: > On Mon, Jul 18, 2016 at 4:41 AM, Василий Ангапов wrote: >> Guys, >> >> This bug is hitting me constantly, may be once per several days. Does >> anyone know is there a solution already? > > > I see there is a fix available, and am waiting for a backport to a > longterm kernel: > > https://lkml.org/lkml/2016/7/12/919 > > https://lkml.org/lkml/2016/7/12/297 > > -- > Alex Gorbachev > Storcium No more issues on the latest kernel builds. Alex > > > > >> >> 2016-07-05 11:47 GMT+03:00 Nick Fisk : >>>> -Original Message- >>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >>>> Alex Gorbachev >>>> Sent: 04 July 2016 20:50 >>>> To: Campbell Steven >>>> Cc: ceph-users ; Tim Bishop >>> li...@bishnet.net> >>>> Subject: Re: [ceph-users] Is anyone seeing iissues with >>>> task_numa_find_cpu? >>>> >>>> On Wed, Jun 29, 2016 at 5:41 AM, Campbell Steven >>>> wrote: >>>> > Hi Alex/Stefan, >>>> > >>>> > I'm in the middle of testing 4.7rc5 on our test cluster to confirm >>>> > once and for all this particular issue has been completely resolved by >>>> > Peter's recent patch to sched/fair.c refereed to by Stefan above. For >>>> > us anyway the patches that Stefan applied did not solve the issue and >>>> > neither did any 4.5.x or 4.6.x released kernel thus far, hopefully it >>>> > does the trick for you. We could get about 4 hours uptime before >>>> > things went haywire for us. >>>> > >>>> > It's interesting how it seems the CEPH workload triggers this bug so >>>> > well as it's quite a long standing issue that's only just been >>>> > resolved, another user chimed in on the lkml thread a couple of days >>>> > ago as well and again his trace had ceph-osd in it as well. >>>> > >>>> > https://lkml.org/lkml/headers/2016/6/21/491 >>>> > >>>> > Campbell >>>> >>>> Campbell, any luck with testing 4.7rc5? rc6 came out just now, and I am >>>> having trouble booting it on an ubuntu box due to some other unrelated >>>> problem. So dropping to kernel 4.2.0 for now, which does not seem to have >>>> this load related problem. >>>> >>>> I looked at the fair.c code in kernel source tree 4.4.14 and it is quite >>> different >>>> than Peter's patch (assuming 4.5.x source), so the patch does not apply >>>> cleanly. Maybe another 4.4.x kernel will get the update. >>> >>> I put in a new 16.04 node yesterday and went straight to 4.7.rc6. It's been >>> backfilling for just under 24 hours now with no drama. Disks are set to use >>> CFQ. >>> >>>> >>>> Thanks, >>>> Alex >>>> >>>> >>>> >>>> > >>>> > On 29 June 2016 at 18:29, Stefan Priebe - Profihost AG >>>> > wrote: >>>> >> >>>> >> Am 29.06.2016 um 04:30 schrieb Alex Gorbachev: >>>> >>> Hi Stefan, >>>> >>> >>>> >>> On Tue, Jun 28, 2016 at 1:46 PM, Stefan Priebe - Profihost AG >>>> >>> wrote: >>>> >>>> Please be aware that you may need even more patches. Overall this >>>> >>>> needs 3 patches. Where the first two try to fix a bug and the 3rd >>>> >>>> one fixes the fixes + even more bugs related to the scheduler. I've >>>> >>>> no idea on which patch level Ubuntu is. >>>> >>> >>>> >>> Stefan, would you be able to please point to the other two patches >>>> >>> beside https://lkml.org/lkml/diff/2016/6/22/102/1 ? >>>> >> >>>> >> Sorry sure yes: >>>> >> >>>> >> 1. 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a >>>> >> bounded value") >>>> >> >>>> >> 2.) 40ed9cba24bb7e01cc380a02d3f04065b8afae1d ("sched/fair: Fix >>>> >> post_init_entity_util_avg() serialization") >>>> >> >>>> >> 3.) 
the one listed at lkml. >>>> >> >>>> >> Stefan >>>> >>
Re: [ceph-users] Ceph + VMware + Single Thread Performance
Hi Nick, On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk wrote: >> -Original Message- >> From: w...@globe.de [mailto:w...@globe.de] >> Sent: 21 July 2016 13:23 >> To: n...@fisk.me.uk; 'Horace Ng' >> Cc: ceph-users@lists.ceph.com >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance >> >> Okay and what is your plan now to speed up ? > > Now I have come up with a lower latency hardware design, there is not much > further improvement until persistent RBD caching is implemented, as you will > be moving the SSD/NVME closer to the client. But I'm happy with what I can > achieve at the moment. You could also experiment with bcache on the RBD. Reviving this thread, would you be willing to share the details of the low latency hardware design? Are you optimizing for NFS or iSCSI? Thank you, Alex > >> >> Would it help to put in multiple P3700 per OSD Node to improve performance >> for a single Thread (example Storage VMotion) ? > > Most likely not, it's all the other parts of the puzzle which are causing the > latency. ESXi was designed for storage arrays that service IO's in 100us-1ms > range, Ceph is probably about 10x slower than this, hence the problem. > Disable the BBWC on a RAID controller or SAN and you will the same behaviour. > >> >> Regards >> >> >> Am 21.07.16 um 14:17 schrieb Nick Fisk: >> >> -Original Message- >> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf >> >> Of w...@globe.de >> >> Sent: 21 July 2016 13:04 >> >> To: n...@fisk.me.uk; 'Horace Ng' >> >> Cc: ceph-users@lists.ceph.com >> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance >> >> >> >> Hi, >> >> >> >> hmm i think 200 MByte/s is really bad. Is your Cluster in production >> >> right now? >> > It's just been built, not running yet. >> > >> >> So if you start a storage migration you get only 200 MByte/s right? >> > I wish. My current cluster (not this new one) would storage migrate at >> > ~10-15MB/s. Serial latency is the problem, without being able to >> > buffer, ESXi waits on an ack for each IO before sending the next. Also it >> > submits the migrations in 64kb chunks, unless you get VAAI >> working. I think esxi will try and do them in parallel, which will help as >> well. >> > >> >> I think it would be awesome if you get 1000 MByte/s >> >> >> >> Where is the Bottleneck? >> > Latency serialisation, without a buffer, you can't drive the devices >> > to 100%. With buffered IO (or high queue depths) I can max out the >> > journals. >> > >> >> A FIO Test from Sebastien Han give us 400 MByte/s raw performance from >> >> the P3700. >> >> >> >> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your >> >> -ssd-is-suitable-as-a-journal-device/ >> >> >> >> How could it be that the rbd client performance is 50% slower? >> >> >> >> Regards >> >> >> >> >> >> Am 21.07.16 um 12:15 schrieb Nick Fisk: >> >>> I've had a lot of pain with this, smaller block sizes are even worse. >> >>> You want to try and minimize latency at every point as there is no >> >>> buffering happening in the iSCSI stack. This means:- >> >>> >> >>> 1. Fast journals (NVME or NVRAM) >> >>> 2. 10GB or better networking >> >>> 3. Fast CPU's (Ghz) >> >>> 4. Fix CPU c-state's to C1 >> >>> 5. Fix CPU's Freq to max >> >>> >> >>> Also I can't be sure, but I think there is a metadata update >> >>> happening with VMFS, particularly if you are using thin VMDK's, this >> >>> can also be a major bottleneck. 
For my use case, I've switched over to >> >>> NFS as it has given much more performance at scale and >> less headache. >> >>> >> >>> For the RADOS Run, here you go (400GB P3700): >> >>> >> >>> Total time run: 60.026491 >> >>> Total writes made: 3104 >> >>> Write size: 4194304 >> >>> Object size:4194304 >> >>> Bandwidth (MB/sec): 206.842 >> >>> Stddev Bandwidth: 8.10412 >> >>> Max bandwidth (MB/sec): 224 >> >>> Min bandwidth (MB/sec): 180 >> >>> Average IOPS: 51 >> >>> Stddev IOPS:2 >> >>> Max IOPS: 56 >> >>> Min IOPS: 45 >> >>> Average Latency(s): 0.0193366 >> >>> Stddev Latency(s): 0.00148039 >> >>> Max latency(s): 0.0377946 >> >>> Min latency(s): 0.015909 >> >>> >> >>> Nick >> >>> >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On >> Behalf Of Horace >> Sent: 21 July 2016 10:26 >> To: w...@globe.de >> Cc: ceph-users@lists.ceph.com >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance >> >> Hi, >> >> Same here, I've read some blog saying that vmware will frequently >> verify the locking on VMFS over iSCSI, hence it will have much slower >> performance than NFS (with different locking mechanism). >> >> Regards, >> Horace Ng >> >> - Original Message - >> From: w...@globe.de >> To: ceph-users@list
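For anyone wanting to reproduce numbers along the lines of Nick's on their own pool, a sketch of a single-threaded RADOS write/read test (pool name and runtime are examples, not his exact invocation):

  rados bench -p rbd 60 write -b 4194304 -t 1 --no-cleanup   # 4MB writes, queue depth 1
  rados bench -p rbd 60 seq -t 1                             # sequential reads of the same objects
  rados -p rbd cleanup                                       # remove the benchmark objects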
Re: [ceph-users] Ceph + VMware + Single Thread Performance
On Sunday, August 21, 2016, Wilhelm Redbrake wrote: > Hi Nick, > i understand all of your technical improvements. > But: why do you Not use a simple for example Areca Raid Controller with 8 > gb Cache and Bbu ontop in every ceph node. > Configure n Times RAID 0 on the Controller and enable Write back Cache. > That must be a latency "Killer" like in all the prop. Storage arrays or > Not ?? > > Best Regards !! What we saw specifically with Areca cards is that performance is excellent in benchmarking and for bursty loads. However, once we started loading with more constant workloads (we replicate databases and files to our Ceph cluster), this looks to have saturated the relatively small Areca NVDIMM caches and we went back to pure drive based performance. So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That worked, but now the overall latency is really high at times, not always. Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with too many IOPS, which get their latency sky high. Overall we are functioning fine, but I sure would like storage vmotion and other large operations faster. I am thinking I will test a few different schedulers and readahead settings to see if we can improve this by parallelizing reads. Also will test NFS, but need to determine whether to do krbd/knfsd or something more interesting like CephFS/Ganesha. Thanks for your very valuable info on analysis and hw build. Alex > > > > Am 21.08.2016 um 09:31 schrieb Nick Fisk >: > > >> -Original Message- > >> From: Alex Gorbachev [mailto:a...@iss-integration.com ] > >> Sent: 21 August 2016 04:15 > >> To: Nick Fisk > > >> Cc: w...@globe.de ; Horace Ng >; ceph-users > > >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance > >> > >> Hi Nick, > >> > >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk > wrote: > >>>> -Original Message- > >>>> From: w...@globe.de [mailto:w...@globe.de ] > >>>> Sent: 21 July 2016 13:23 > >>>> To: n...@fisk.me.uk ; 'Horace Ng' > > >>>> Cc: ceph-users@lists.ceph.com > >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance > >>>> > >>>> Okay and what is your plan now to speed up ? > >>> > >>> Now I have come up with a lower latency hardware design, there is not > much further improvement until persistent RBD caching is > >> implemented, as you will be moving the SSD/NVME closer to the client. > But I'm happy with what I can achieve at the moment. You > >> could also experiment with bcache on the RBD. > >> > >> Reviving this thread, would you be willing to share the details of the > low latency hardware design? Are you optimizing for NFS or > >> iSCSI? > > > > Both really, just trying to get the write latency as low as possible, as > you know, vmware does everything with lots of unbuffered small io's. Eg > when you migrate a VM or as thin vmdk's grow. > > > > Even storage vmotions which might kick off 32 threads, as they all > roughly fall on the same PG, there still appears to be a bottleneck with > contention on the PG itself. > > > > These were the sort of things I was trying to optimise for, to make the > time spent in Ceph as minimal as possible for each IO. > > > > So onto the hardware. Through reading various threads and experiments on > my own I came to the following conclusions. > > > > -You need highest possible frequency on the CPU cores, which normally > also means less of them. > > -Dual sockets are probably bad and will impact performance. 
> > -Use NVME's for journals to minimise latency > > > > The end result was OSD nodes based off of a 3.5Ghz Xeon E3v5 with an > Intel P3700 for a journal. I used the SuperMicro X11SSH-CTF board which has > 10G-T onboard as well as 8SATA and 8SAS, so no expansion cards required. > Actually this design as well as being very performant for Ceph, also works > out very cheap as you are using low end server parts. The whole lot + > 12x7.2k disks all goes into a 1U case. > > > > During testing I noticed that by default c-states and p-states slaughter > performance. After forcing max cstate to 1 and forcing the CPU frequency up > to max, I was seeing 600us latency for a 4kb write to a 3xreplica pool, or > around 1600IOPs, this is at QD=1. > > > > Few other observations: > > 1. Power usage is around 150-200W for this config with 12x7.2k disks > > 2. CPU u
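A sketch of the c-state and frequency pinning described above (the exact mechanism varies by distro and CPU; these are generic examples, not the settings from the original post):

  # kernel command line: cap C-states at C1
  #   intel_idle.max_cstate=1 processor.max_cstate=1
  # runtime: force the performance governor (cpupower ships in the linux-tools packages on Ubuntu)
  cpupower frequency-set -g performance
  # verify
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  grep MHz /proc/cpuinfo | head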
Re: [ceph-users] Kernel mounted RBD's hanging
On Thu, Jun 29, 2017 at 10:30 AM Nick Fisk wrote: > Hi All, > > Putting out a call for help to see if anyone can shed some light on this. > > Configuration: > Ceph cluster presenting RBD's->XFS->NFS->ESXi > Running 10.2.7 on the OSD's and 4.11 kernel on the NFS gateways in a > pacemaker cluster > Both OSD's and clients are go into a pair of switches, single L2 domain (no > sign from pacemaker that there is network connectivity issues) > > Symptoms: > - All RBD's on a single client randomly hang for 30s to several minutes, > confirmed by pacemaker and ESXi hosts complaining > - Cluster load is minimal when this happens most times > - All other clients with RBD's are not affected (Same RADOS pool), so its > seems more of a client issue than cluster issue > - It looks like pacemaker tries to also stop RBD+FS resource, but this also > hangs > - Eventually pacemaker succeeds in stopping resources and immediately > restarts them, IO returns to normal > - No errors, slow requests, or any other non normal Ceph status is reported > on the cluster or ceph.log > - Client logs show nothing apart from pacemaker > > Things I've tried: > - Different kernels (potentially happened less with older kernels, but > can't > be 100% sure) > - Disabling scrubbing and anything else that could be causing high load > - Enabling Kernel RBD debugging (Problem maybe happens a couple of times a > day, debug logging was not practical as I can't reproduce on demand) > > Anyone have any ideas? Nick, are you using any network aggregation, LACP? Can you drop to a simplest possible configuration to make sure there's nothing on the network switch side? Do you check the ceph.log for any anomalies? Any occurrences on OSD nodes, anything in their OSD logs or syslogs? Aany odd page cache settings on the clients? Alex > > Thanks, > Nick > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
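On the "enabling kernel RBD debugging" point, a sketch of how krbd debug output and blocked-task state can be captured around a hang (standard dynamic-debug and sysrq interfaces, nothing specific to this setup; requires CONFIG_DYNAMIC_DEBUG):

  # turn on kernel rbd/libceph debug messages (very verbose)
  echo 'module rbd +p'     > /sys/kernel/debug/dynamic_debug/control
  echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
  # while the hang is in progress, dump blocked tasks to the kernel log
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 200
  # turn debugging back off afterwards
  echo 'module rbd -p'     > /sys/kernel/debug/dynamic_debug/control
  echo 'module libceph -p' > /sys/kernel/debug/dynamic_debug/control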
Re: [ceph-users] Multi Tenancy in Ceph RBD Cluster
On Mon, Jun 26, 2017 at 2:00 AM Mayank Kumar wrote: > Hi Ceph Users > I am relatively new to Ceph and trying to Provision CEPH RBD Volumes using > Kubernetes. > > I would like to know what are the best practices for hosting a multi > tenant CEPH cluster. Specifically i have the following questions:- > > - Is it ok to share a single Ceph Pool amongst multiple tenants ? If yes, > how do you guarantee that volumes of one Tenant are not > accessible(mountable/mapable/unmappable/deleteable/mutable) to other > tenants ? > - Can a single Ceph Pool have multiple admin and user keyrings generated > for rbd create and rbd map commands ? This way i want to assign different > keyrings to each tenant > > - can a rbd map command be run remotely for any node on which we want to > mount RBD Volumes or it must be run from the same node on which we want to > mount ? Is this going to be possible in the future ? > > - In terms of ceph fault tolerance and resiliency, is one ceph pool per > customer a better design or a single pool must be shared with mutiple > customers > - In a single pool for all customers, how can we get the ceph statistics > per customer ? Is it possible to somehow derive this from the RBD volumes ? > Is this post helpful? https://blog-fromsomedude.rhcloud.com/2016/04/26/Allowing-a-RBD-client-to-map-only-one-RBD > Thanks for your responses > Mayank > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
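The linked post restricts a keyring to a single image by object prefix; the general shape of the two usual approaches is roughly as follows (client names, pool names and the image-id placeholder are examples, so check the post and your own image ids before relying on this):

  # simplest: one pool per tenant, one key per pool
  ceph auth get-or-create client.tenant1 mon 'allow r' osd 'allow rwx pool=tenant1-rbd'
  # finer grained, per the blog post: limit a key to one image's objects in a shared pool
  ceph auth get-or-create client.tenant1-vol1 mon 'allow r' \
    osd 'allow rwx pool=rbd object_prefix rbd_data.<image-id>, allow rwx pool=rbd object_prefix rbd_header.<image-id>, allow rx pool=rbd object_prefix rbd_id.vol1'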
Re: [ceph-users] Kernel mounted RBD's hanging
On Fri, Jun 30, 2017 at 8:12 AM Nick Fisk wrote: > *From:* Alex Gorbachev [mailto:a...@iss-integration.com] > *Sent:* 30 June 2017 03:54 > *To:* Ceph Users ; n...@fisk.me.uk > > > *Subject:* Re: [ceph-users] Kernel mounted RBD's hanging > > > > > > On Thu, Jun 29, 2017 at 10:30 AM Nick Fisk wrote: > > Hi All, > > Putting out a call for help to see if anyone can shed some light on this. > > Configuration: > Ceph cluster presenting RBD's->XFS->NFS->ESXi > Running 10.2.7 on the OSD's and 4.11 kernel on the NFS gateways in a > pacemaker cluster > Both OSD's and clients are go into a pair of switches, single L2 domain (no > sign from pacemaker that there is network connectivity issues) > > Symptoms: > - All RBD's on a single client randomly hang for 30s to several minutes, > confirmed by pacemaker and ESXi hosts complaining > - Cluster load is minimal when this happens most times > - All other clients with RBD's are not affected (Same RADOS pool), so its > seems more of a client issue than cluster issue > - It looks like pacemaker tries to also stop RBD+FS resource, but this also > hangs > - Eventually pacemaker succeeds in stopping resources and immediately > restarts them, IO returns to normal > - No errors, slow requests, or any other non normal Ceph status is reported > on the cluster or ceph.log > - Client logs show nothing apart from pacemaker > > Things I've tried: > - Different kernels (potentially happened less with older kernels, but > can't > be 100% sure) > - Disabling scrubbing and anything else that could be causing high load > - Enabling Kernel RBD debugging (Problem maybe happens a couple of times a > day, debug logging was not practical as I can't reproduce on demand) > > Anyone have any ideas? > > > > Nick, are you using any network aggregation, LACP? Can you drop to a > simplest possible configuration to make sure there's nothing on the network > switch side? > > > > Hi Alex, > > > > The OSD nodes are active/backup bond and the active Nic on each one, all > goes into the same switch. The NFS gateways are currently VM’s, but again > the hypervisor is using the Nic on the same switch. The cluster and public > networks are vlans on the same Nic and I don’t get any alerts from > monitoring/pacemaker to suggest there are comms issues. But I will look > into getting some ping logs done to see if they reveal anything. > Any chance this could be a hypervisor or VM-related issue? Any possibility to run one gateway temporarily as a physical machine? > > > > Do you check the ceph.log for any anomalies? > > > > Yep, completely clean > > > > Any occurrences on OSD nodes, anything in their OSD logs or syslogs? > > > > Not that I can see. I’m using cache tiering, so all IO travels through a > few OSD’s. I guess this might make it easier to try and see whats going on. > But the random nature of it, means it’s not always easy to catch. > > > > Aany odd page cache settings on the clients? > > > > The only customizations on the clients are readahead, some TCP tunings and > min free kbytes. 
> > > > Alex > > > > > > Thanks, > Nick > > _______ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > <http://xo4t.mj.am/lnk/AEUAMGSsyuUAAFhNkjYAADNJBWwAAACRXwBZVkBBEimV6rRsR9ueEOKOWc4YEwAAlBI/1/KaykvSTe4bVbKn7nnq-msA/aHR0cDovL2xpc3RzLmNlcGguY29tL2xpc3RpbmZvLmNnaS9jZXBoLXVzZXJzLWNlcGguY29t> > > -- > > -- > > Alex Gorbachev > > Storcium > > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] iSCSI production ready?
On Sat, Jul 15, 2017 at 11:02 PM Alvaro Soto wrote: > Hi guys, > does anyone know any news about in what release iSCSI interface is going > to be production ready, if not yet? > > I mean without the use of a gateway, like a different endpoint connector > to a CEPH cluster. > We very successfully use SCST with Pacemaker HA. > Thanks in advance. > Best. > > -- > > ATTE. Alvaro Soto Escobar > > -- > Great people talk about ideas, > average people talk about things, > small people talk ... about other people. > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD encryption options?
On Mon, Aug 21, 2017 at 9:03 PM Daniel K wrote: > Are there any client-side options to encrypt an RBD device? > > Using latest luminous RC, on Ubuntu 16.04 and a 4.10 kernel > > I assumed adding client site encryption would be as simple as using > luks/dm-crypt/cryptsetup after adding the RBD device to /etc/ceph/rbdmap > and enabling the rbdmap service -- but I failed to consider the order of > things loading and it appears that the RBD gets mapped too late for > dm-crypt to recognize it as valid.It just keeps telling me it's not a valid > LUKS device. > > I know you can run the OSDs on an encrypted drive, but I was hoping for > something client side since it's not exactly simple(as far as I can tell) > to restrict client access to a single(or group) of RBDs within a shared > pool. > Daniel, we used info from here for single or multiple RBD mappings to client https://blog-fromsomedude.rhcloud.com/2016/04/26/Allowing-a-RBD-client-to-map-only-one-RBD Also, I ran into the race condition with zfs, and wound up putting zfs and rbdmap into rc.local. It should work for dm-crypt as well. Regards, Alex > Any suggestions? > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
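A sketch of the rc.local ordering workaround mentioned above, adapted for dm-crypt (pool, image, key file and mount point are made-up examples):

  #!/bin/sh -e
  # map the image first, then open LUKS, then mount - avoids the boot-order race
  rbd map rbd/secure-vol --id admin
  cryptsetup luksOpen /dev/rbd/rbd/secure-vol secure-vol --key-file /root/secure-vol.key
  mount /dev/mapper/secure-vol /mnt/secure
  exit 0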
[ceph-users] PCIe journal benefit for SSD OSDs
We are planning a Jewel filestore based cluster for a performance sensitive healthcare client, and the conservative OSD choice is Samsung SM863A. I am going to put an 8GB Areca HBA in front of it to cache small metadata operations, but was wondering if anyone has seen a positive impact from also using PCIe journals (e.g. Intel P3700 or even the older 910 series) in front of such SSDs? Thanks for any info you can share. -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph release cadence
On Wed, Sep 6, 2017 at 11:23 AM Sage Weil wrote: > Hi everyone, > > Traditionally, we have done a major named "stable" release twice a year, > and every other such release has been an "LTS" release, with fixes > backported for 1-2 years. > > With kraken and luminous we missed our schedule by a lot: instead of > releasing in October and April we released in January and August. > > A few observations: > > - Not a lot of people seem to run the "odd" releases (e.g., infernalis, > kraken). This limits the value of actually making them. It also means > that those who *do* run them are running riskier code (fewer users -> more > bugs). > > - The more recent requirement that upgrading clusters must make a stop at > each LTS (e.g., hammer -> luminous not supported, must go hammer -> jewel > -> lumninous) has been hugely helpful on the development side by reducing > the amount of cross-version compatibility code to maintain and reducing > the number of upgrade combinations to test. > > - When we try to do a time-based "train" release cadence, there always > seems to be some "must-have" thing that delays the release a bit. This > doesn't happen as much with the odd releases, but it definitely happens > with the LTS releases. When the next LTS is a year away, it is hard to > suck it up and wait that long. > > A couple of options: > > * Keep even/odd pattern, and continue being flexible with release dates > > + flexible > - unpredictable > - odd releases of dubious value > > * Keep even/odd pattern, but force a 'train' model with a more regular > cadence > > + predictable schedule > - some features will miss the target and be delayed a year > > * Drop the odd releases but change nothing else (i.e., 12-month release > cadence) > > + eliminate the confusing odd releases with dubious value > > * Drop the odd releases, and aim for a ~9 month cadence. This splits the > difference between the current even/odd pattern we've been doing. > > + eliminate the confusing odd releases with dubious value > + waiting for the next release isn't quite as bad > - required upgrades every 9 months instead of ever 12 months > > * Drop the odd releases, but relax the "must upgrade through every LTS" to > allow upgrades across 2 versions (e.g., luminous -> mimic or luminous -> > nautilus). Shorten release cycle (~6-9 months). > > + more flexibility for users > + downstreams have greater choice in adopting an upstrema release > - more LTS branches to maintain > - more upgrade paths to consider > > Other options we should consider? Other thoughts? As a mission critical system user, I am in favor of dropping odd releases and going to a 9 month cycle. We never run the odd releases as too risky. A good deal if functionality comes in updates, and usually the Ceph team brings them in gently, with the more experimental features off by default. I suspect the 9 month even cycle will also make it easier to perform more incremental upgrades, i.e. small jumps rather than big leaps. > > Thanks! > sage > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] mon health status gone from display
In Jewel and prior there was a health status for MONs in ceph -s JSON output, this seems to be gone now. Is there a place where a status of a given monitor is shown in Luminous? Thank you -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
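In case it helps others looking for the same thing, a few commands that still expose per-monitor state on Luminous (output field names are from memory, so treat them as approximate):

  ceph quorum_status -f json-pretty   # quorum members, leader, monmap
  ceph mon stat                       # one-line quorum summary
  ceph mon dump                       # monmap with each mon's name and address
  ceph status -f json-pretty          # the monmap/quorum_names sections replace the old per-mon health array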
Re: [ceph-users] BlueStore questions about workflow and performance
Hi Sam, On Mon, Oct 2, 2017 at 6:01 PM Sam Huracan wrote: > Anyone can help me? > > On Oct 2, 2017 17:56, "Sam Huracan" wrote: > >> Hi, >> >> I'm reading this document: >> >> http://storageconference.us/2017/Presentations/CephObjectStore-slides.pdf >> >> I have 3 questions: >> >> 1. BlueStore writes both data (to raw block device) and metadata (to >> RockDB) simultaneously, or sequentially? >> >> 2. From my opinion, performance of BlueStore can not compare to FileStore >> using SSD Journal, because performance of raw disk is less than using >> buffer. (this is buffer purpose). How do you think? >> >> 3. Do setting Rock DB and Rock DB Wal in SSD only enhance write, read >> performance? or both? >> >> Hope your answer, >> > I am researching the same thing, but recommend you look at http://ceph.com/community/new-luminous-bluestore And also search for Bluestore cache to answer some questions. My test Luminous cluster so far is not as performant as I would like, but I have not yet put a serious effort into tuning it, and it does seem stable. Hth, Alex
Re: [ceph-users] BlueStore questions about workflow and performance
Hi Mark, great to hear from you! On Tue, Oct 3, 2017 at 9:16 AM Mark Nelson wrote: > > > On 10/03/2017 07:59 AM, Alex Gorbachev wrote: > > Hi Sam, > > > > On Mon, Oct 2, 2017 at 6:01 PM Sam Huracan > <mailto:nowitzki.sa...@gmail.com>> wrote: > > > > Anyone can help me? > > > > On Oct 2, 2017 17:56, "Sam Huracan" > <mailto:nowitzki.sa...@gmail.com>> wrote: > > > > Hi, > > > > I'm reading this document: > > > http://storageconference.us/2017/Presentations/CephObjectStore-slides.pdf > > > > I have 3 questions: > > > > 1. BlueStore writes both data (to raw block device) and metadata > > (to RockDB) simultaneously, or sequentially? > > > > 2. From my opinion, performance of BlueStore can not compare to > > FileStore using SSD Journal, because performance of raw disk is > > less than using buffer. (this is buffer purpose). How do you > think? > > > > 3. Do setting Rock DB and Rock DB Wal in SSD only enhance > > write, read performance? or both? > > > > Hope your answer, > > > > > > I am researching the same thing, but recommend you look > > at http://ceph.com/community/new-luminous-bluestore > > > > And also search for Bluestore cache to answer some questions. My test > > Luminous cluster so far is not as performant as I would like, but I have > > not yet put a serious effort into tuning it, amd it does seem stable. > > > > Hth, Alex > > Hi Alex, > > If you see anything specific please let us know. There are a couple of > corner cases where bluestore is likely to be slower than filestore > (specifically small sequential reads/writes with no client side cache or > read ahead). I've also seen some cases where filestore has higher read > throughput potential (4MB seq reads with multiple NVMe drives per OSD > node). In many other cases bluestore is faster (and sometimes much > faster) than filestore in our tests. Writes in general tend to be > faster and high volume object creation is much faster with much lower > tail latencies (filestore really suffers in this test due to PG splitting). I have two pretty well tuned filestore Jewel clusters running SATA HDDs on dedicated hardware. For the Luminous cluster, I wanted to do a POC on a VMWare fully meshed (trendy moniker: hyperconverged) setup, using only SSDs, Luminous and Bluestore. Our workloads are unusual in that RBDs are exported via iSCSI or NFS back to VMWare and consumed by e.g. Windows VMs (we support heathcare and corporate business systems), or Linux VMs direct from Ceph. What I did so far is dedicate a hardware JBOD with an Areca HBA (you turned me on to those a few years ago :) to each OSD VM. Using 6 Smartstorage SSD OSDs per each OSD VM with 3 of these VMs total and 2x 20 Gb shared network uplinks, I am getting about a third of performance of my hardware Jewel cluster with 24 Lenovo enterprise SATA drives, measured as 4k block reads and writes in single and 32 multiple streams. Not apples to apples definitely, so I plan to play with Bluestore cache. One question: does Bluestore distinguish between SSD and HDD based on CRUSH class assignment? I will check the effect of giving a lot of RAM and CPU cores to OSD VMs, as well as increasing spindles and using different JBODs. Thank you for reaching out. 
Regards, Alex > > Mark > > > > > > > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > -- > > -- > > Alex Gorbachev > > Storcium > > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
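For the cache tuning mentioned above, a sketch of the Luminous BlueStore cache knobs (values are illustrative only, not recommendations; as far as I know, the ssd/hdd split is keyed off the data device's rotational flag as seen by the kernel, not off the CRUSH device class):

  [osd]
  # per-OSD BlueStore cache when the data device is flash (4 GB here)
  bluestore_cache_size_ssd = 4294967296
  # per-OSD BlueStore cache when the data device is rotational (1 GB here)
  bluestore_cache_size_hdd = 1073741824
  # or set a single value for both:
  # bluestore_cache_size = 2147483648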
[ceph-users] RBD Mirror between two separate clusters named ceph
I am testing rbd mirroring, and have two existing clusters named ceph in their ceph.conf. Each cluster has a separate fsid. On one cluster, I renamed ceph.conf into remote-mirror.conf and ceph.client.admin.keyring to remote-mirror.client.admin.keyring, but it looks like this is not sufficient: root@lab2-mon3:/etc/ceph# rbd --cluster remote-mirror mirror pool peer add spin2 client.admin@remote-mirror rbd: error adding mirror peer 2017-10-05 19:40:52.003289 7f290935c100 -1 librbd: Cannot add self as remote peer Short of creating a whole new cluster, are there any options to make such configuration work? Thank you, -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD Mirror between two separate clusters named ceph
> > On Thu, Oct 5, 2017 at 7:45 PM, Alex Gorbachev > wrote: >> I am testing rbd mirroring, and have two existing clusters named ceph >> in their ceph.conf. Each cluster has a separate fsid. On one >> cluster, I renamed ceph.conf into remote-mirror.conf and >> ceph.client.admin.keyring to remote-mirror.client.admin.keyring, but >> it looks like this is not sufficient: >> >> root@lab2-mon3:/etc/ceph# rbd --cluster remote-mirror mirror pool peer >> add spin2 client.admin@remote-mirror >> rbd: error adding mirror peer >> 2017-10-05 19:40:52.003289 7f290935c100 -1 librbd: Cannot add self as >> remote peer >> >> Short of creating a whole new cluster, are there any options to make >> such configuration work? On Thu, Oct 5, 2017 at 8:13 PM, Jason Dillaman wrote: > The "cluster" name is really just the name of the configuration file. > The only issue with your command-line is that you should connect to > the "local" cluster to add a peer as a remote cluster: > > rbd --cluster ceph mirror pool peer add spin2 client.admin@remote-mirror Thank you Jason, works perfectly now. I used this link to get a bit of context on local vs. remote https://cloud.garr.it/support/kb/ceph/ceph-enabling-rbd-mirror/ Summary: It's OK to have both local and remote clusters named ceph, just need to copy and rename the .conf and keyring files. Best regards, Alex ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
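A sketch of the full sequence, for anyone repeating this with two clusters that are both called "ceph" (host, pool and file names are examples):

  # on the node that will run rbd-mirror for the local cluster:
  scp remote-mon:/etc/ceph/ceph.conf                 /etc/ceph/remote-mirror.conf
  scp remote-mon:/etc/ceph/ceph.client.admin.keyring /etc/ceph/remote-mirror.client.admin.keyring
  # enable pool-mode mirroring on both clusters
  rbd --cluster ceph          mirror pool enable spin2 pool
  rbd --cluster remote-mirror mirror pool enable spin2 pool
  # register the remote cluster as a peer of the local pool
  rbd --cluster ceph mirror pool peer add spin2 client.admin@remote-mirror
  # then start the rbd-mirror daemon on this node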
Re: [ceph-users] Backup VM (Base image + snapshot)
On Sat, Oct 14, 2017 at 12:25 PM, Oscar Segarra wrote: > Hi, > > In my VDI environment I have configured the suggested ceph > design/arquitecture: > > http://docs.ceph.com/docs/giant/rbd/rbd-snapshot/ > > Where I have a Base Image + Protected Snapshot + 100 clones (one for each > persistent VDI). > > Now, I'd like to configure a backup script/mechanism to perform backups of > each persistent VDI VM to an external (non ceph) device, like NFS or > something similar... > > Then, some questions: > > 1.- Does anybody have been able to do this kind of backups? Yes, we have been using export-diff successfully (note this is off a snapshot and not a clone) to back up and restore ceph images to non-ceph storage. You can use merge-diff to create "synthetic fulls" and even do some basic replication to another cluster. http://ceph.com/geen-categorie/incremental-snapshots-with-rbd/ http://docs.ceph.com/docs/master/dev/rbd-export/ http://cephnotes.ksperis.com/blog/2014/08/12/rbd-replication -- Alex Gorbachev Storcium > 2.- Is it possible to export BaseImage in qcow2 format and snapshots in > qcow2 format as well as "linked clones" ? > 3.- Is it possible to export the Base Image in raw format, snapshots in raw > format as well and, when recover is required, import both images and > "relink" them? > 4.- What is the suggested solution for this scenario? > > Thanks a lot everybody! > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
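A condensed sketch of that workflow (image, snapshot and file names are examples; see the linked posts for the full treatment):

  # initial full, exported as a diff from the beginning of time
  rbd snap create rbd/vdi-vm1@base
  rbd export-diff rbd/vdi-vm1@base /nfs/backup/vdi-vm1.base
  # later, an incremental relative to the previous snapshot
  rbd snap create rbd/vdi-vm1@daily1
  rbd export-diff --from-snap base rbd/vdi-vm1@daily1 /nfs/backup/vdi-vm1.base-daily1
  # optional "synthetic full" combining the two
  rbd merge-diff /nfs/backup/vdi-vm1.base /nfs/backup/vdi-vm1.base-daily1 /nfs/backup/vdi-vm1.full
  # restore into an empty image of the same size
  rbd create rbd/vdi-vm1-restore --size 40960
  rbd import-diff /nfs/backup/vdi-vm1.full rbd/vdi-vm1-restore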
Re: [ceph-users] Changing device-class using crushtool
Hi Wido, On Wed, Jan 10, 2018 at 11:09 AM, Wido den Hollander wrote: > Hi, > > Is there a way to easily modify the device-class of devices on a offline > CRUSHMap? > > I know I can decompile the CRUSHMap and do it, but that's a lot of work in a > large environment. > > In larger environments I'm a fan of downloading the CRUSHMap, modifying it > to my needs, testing it and injecting it at once into the cluster. > > crushtool can do a lot, you can also run tests using device classes, but > there doesn't seem to be a way to modify the device-class using crushtool, > is that correct? This is how we do it in Storcium, based on http://docs.ceph.com/docs/master/rados/operations/crush-map/ : run ceph osd crush rm-device-class <osd-name>, then ceph osd crush set-device-class <class> <osd-name> -- Best regards, Alex Gorbachev Storcium > > Wido > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
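Concretely, on a live Luminous cluster that looks something like this (osd id and class are examples):

  ceph osd crush rm-device-class osd.12
  ceph osd crush set-device-class nvme osd.12
  ceph osd tree | grep -w osd.12    # the CLASS column should now read "nvme"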
Re: [ceph-users] Ceph Future
Hi Massimiliano, On Thu, Jan 11, 2018 at 6:15 AM, Massimiliano Cuttini wrote: > Hi everybody, > > i'm always looking at CEPH for the future. > But I do see several issue that are leaved unresolved and block nearly > future adoption. > I would like to know if there are some answear already: > > 1) Separation between Client and Server distribution. > At this time you have always to update client & server in order to match the > same distribution of Ceph. > This is ok in the early releases but in future I do expect that the > ceph-client is ONE, not many for every major version. > The client should be able to self determinate what version of the protocol > and what feature are enabable and connect to at least 3 or 5 older major > version of Ceph by itself. > > 2) Kernel is old -> feature mismatch > Ok, kernel is old, and so? Just do not use it and turn to NBD. > And please don't let me even know, just virtualize under the hood. > > 3) Management complexity > Ceph is amazing, but is just too big to have everything under control (too > many services). > Now there is a management console, but as far as I read this management > console just show basic data about performance. > So it doesn't manage at all... it's just a monitor... > > In the end You have just to manage everything by your command-line. > In order to manage by web it's mandatory: > > create, delete, enable, disable services > If I need to run ISCSI redundant gateway, do I really need to cut&paste > command from your online docs? > Of course no. You just can script it better than what every admin can do. > Just give few arguments on the html forms and that's all. > > create, delete, enable, disable users > I have to create users and keys for 24 servers. Do you really think it's > possible to make it without some bad transcription or bad cut&paste of the > keys across all servers. > Everybody end by just copy the admin keys across all servers, giving very > unsecure full permission to all clients. > > create MAPS (server, datacenter, rack, node, osd). > This is mandatory to design how the data need to be replicate. > It's not good create this by script or shell, it's needed a graph editor > which can dive you the perpective of what will be copied where. > > check hardware below the hood > It's missing the checking of the health of the hardware below. > But Ceph was born as a storage software that ensure redundacy and protect > you from single failure. > So WHY did just ignore to check the healths of disks with SMART? > FreeNAS just do a better work on this giving lot of tools to understand > which disks is which and if it will fail in the nearly future. > Of course also Ceph could really forecast issues by itself and need to start > to integrate with basic hardware IO. > For example, should be possible to enable disable UID on the disks in order > to know which one need to be replace. As a technical note, we ran into this need with Storcium, and it is pretty easy to utilize UID indicators using both Areca and LSI/Avago HBAs. You will need the standard control tools available from their web sites, as well as hardware that supports SGPIO (most enterprise JBODs and drives do). There's likely similar options to other HBAs. 
Areca: UID on: cli64 curctrl=1 set password= cli64 curctrl= disk identify drv= UID OFF: cli64 curctrl=1 set password= cli64 curctrl= disk identify drv=0 LSI/Avago: UID on: sas2ircu locate : ON UID OFF: sas2ircu locate : OFF HTH, Alex Gorbachev Storcium > I guess this kind of feature are quite standard across all linux > distributions. > > The management complexity can be completly overcome with a great Web > Manager. > A Web Manager, in the end is just a wrapper for Shell Command from the > CephAdminNode to others. > If you think about it a wrapper is just tons of time easier to develop than > what has been already developed. > I do really see that CEPH is the future of storage. But there is some > quick-avoidable complexity that need to be reduced. > > If there are already some plan for these issue I really would like to know. > > Thanks, > Max > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
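Expanded forms of the above for reference (controller number, password and drive/slot values are placeholders — verify against the cli64 and sas2ircu documentation for your particular firmware):

Areca:
cli64 curctrl=1 set password=<controller password>
cli64 curctrl=1 disk identify drv=<drive number>    (UID on)
cli64 curctrl=1 disk identify drv=0                 (UID off, per the note above)

LSI/Avago:
sas2ircu <controller number> locate <Enclosure:Bay> ON
sas2ircu <controller number> locate <Enclosure:Bay> OFF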
[ceph-users] CRUSH map cafe or CRUSH map generator
I would love to do this, but presently do not have the resources. Would there be (or would anyone be interested in starting) a CRUSH map cafe site for sharing common CRUSH maps, or a PG-calc style CRUSH map generator for common use cases? There seem to be a lot of discussions about best practices and simple use cases, which could be automated this way. -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Two datacenter resilient design with a quorum site
I found a few WAN RBD cluster design discussions, but not a local one, so I was wondering if anyone has experience with a resilience-oriented short-distance (<10 km, redundant fiber connections) cluster in two datacenters with a third site for quorum purposes only? I can see two types of scenarios: 1. Two (or another even number of) OSD nodes at each site, 4x replication (size 4, min_size 2). Three MONs, one at each site to handle split brain. Question: How does the cluster handle the loss of communication between the OSD sites A and B, while both can communicate with the quorum site C? It seems one of the sites should suspend, as OSDs will not be able to communicate between sites. 2. 3x replication for performance or cost (size 3, min_size 2 - or even min_size 1 and strict monitoring). Two replicas and two MONs at one site, and one replica and one MON at the other site. Question: in case of a permanent failure of the main site (with two replicas), how to manually force the other site (with one replica and MON) to provide storage? I would think a CRUSH map change and modifying ceph.conf to include just one MON, then build two more MONs locally and add? -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
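A sketch of what the CRUSH rule for scenario 1 (size 4 spread as 2+2 across the two OSD sites) might look like — the rule id and the datacenter bucket instances are assumptions and must match buckets actually defined in the map (older maps show "ruleset" where newer ones show "id"):

rule two_site_replicated {
        id 1
        type replicated
        min_size 2
        max_size 4
        step take default
        step choose firstn 2 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}

With size 4 and min_size 2, this keeps two copies at each OSD site, which is the layout described above.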
Re: [ceph-users] Two datacenter resilient design with a quorum site
On Tue, Jan 16, 2018 at 2:17 PM, Gregory Farnum wrote: > On Tue, Jan 16, 2018 at 6:07 AM Alex Gorbachev > wrote: >> >> I found a few WAN RBD cluster design discussions, but not a local one, >> so was wonderinng if anyone has experience with a resilience-oriented >> short distance (<10 km, redundant fiber connections) cluster in two >> datacenters with a third site for quorum purposes only? >> >> I can see two types of scenarios: >> >> 1. Two (or even number) of OSD nodes at each site, 4x replication >> (size 4, min_size 2). Three MONs, one at each site to handle split >> brain. >> >> Question: How does the cluster handle the loss of communication >> between the OSD sites A and B, while both can communicate with the >> quorum site C? It seems, one of the sites should suspend, as OSDs >> will not be able to communicate between sites. > > > Sadly this won't work — the OSDs on each side will report their peers on the > other side down, but both will be able to connect to a live monitor. > (Assuming the quorum site holds the leader monitor, anyway — if one of the > main sites holds what should be the leader, you'll get into a monitor > election storm instead.) You'll need your own netsplit monitoring to shut > down one site if that kind of connection cut is a possibility. What about running a split brain aware too, such as Pacemaker, and running a copy of the same VM as a mon at each site? In case of a split brain network separation, Pacemaker would (aware via third site) stop the mon on site A and bring up the mon on site B (or whatever the rules are set to). I read earlier that a mon with the same IP, name and keyring would just look to the ceph cluster as a very old mon, but still able to vote for quorum. Vincent Godin also described an HSRP based method, which would accomplish this goal via network routing. That seems like a good approach too, I just need to check on HSRP availability. > >> >> >> 2. 3x replication for performance or cost (size 3, min_size 2 - or >> even min_size 1 and strict monitoring). Two replicas and two MONs at >> one site and one replica and one MON at the other site. >> >> Question: in case of a permanent failure of the main site (with two >> replicas), how to manually force the other site (with one replica and >> MON) to provide storage? I would think a CRUSH map change and >> modifying ceph.conf to include just one MON, then build two more MONs >> locally and add? > > > Yep, pretty much that. You won't need to change ceph.conf to just one mon so > much as to include the current set of mons and update the monmap. I believe > that process is in the disaster recovery section of the docs. Thank you. Alex > -Greg > >> >> >> -- >> Alex Gorbachev >> Storcium >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
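For reference, the monmap editing Greg mentions is roughly the following sketch (mon names and paths are examples; the authoritative procedure is the monitor disaster recovery section of the docs), run on the surviving monitor with its daemon stopped:

ceph-mon -i c --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap
monmaptool --rm a /tmp/monmap
monmaptool --rm b /tmp/monmap
ceph-mon -i c --inject-monmap /tmp/monmap

Then start the surviving mon, let it form a quorum of one, and add the new local mons as usual.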
Re: [ceph-users] Ideal Bluestore setup
Hi Ean, I don't have any experience with less than 8 drives per OSD node, and the setup heavily depends on what you want to use it for. Assuming small proof of concept with not much requirement for performance (due to low spindle count), I would do this: On Mon, Jan 22, 2018 at 1:28 PM, Ean Price wrote: > Hi folks, > > I’m not sure the ideal setup for bluestore given the set of hardware I have > to work with so I figured I would ask the collective wisdom of the ceph > community. It is a small deployment so the hardware is not all that > impressive, but I’d still like to get some feedback on what would be the > preferred and most maintainable setup. > > We have 5 ceph OSD hosts with the following setup: > > 16 GB RAM > 1 PCI-E NVRAM 128GB > 1 SSD 250 GB > 2 HDD 1 TB each > > I was thinking to put: > > OS on NVRAM with 2x20 GB partitions for bluestore’s WAL and rocksdb I would put the OS on the SSD and not colocate with WAL/DB. I would also put WAL/DB on the NVMe drive as the fastest. > And either use bcache with the SSD to cache the 2x HDDs or possibly use > Ceph’s built in cache tiering. Ceph cache tiering is likely out of the range of this setup, and requires a very clear understanding of the workload. I would not use it. No experience with bcache, but again seems to be a bit of overkill for a small setup like this. Simple = stable. > > My questions are: > > 1) is a 20GB logical volume adequate for the WAL and db with a 1TB HDD or > should it be larger? I believe so, yes. If it spills over, the data will just go onto the drives. > > 2) or - should I put the rocksdb on the SSD and just leave the WAL on the > NVRAM device? You are likely better off with WAL and DB on the NVRAM > > 3) Lastly, what are the downsides of bcache vs Ceph’s cache tiering? I see > both are used in production so I’m not sure which is the better choice for us. > > Performance is, of course, important but maintainability and stability are > definitely more important. I would avoid both bcache and tiering to simplify the configuration, and seriously consider larger nodes if possible, and more OSD drives. HTH, -- Alex Gorbachev Storcium > > Thanks in advance for your advice! > > Best, > Ean > > > > > > -- > __ > > This message contains information which may be confidential. Unless you > are the addressee (or authorized to receive for the addressee), you may not > use, copy, or disclose to anyone the message or any information contained > in the message. If you have received the message in error, please advise > the sender by reply e-mail or contact the sender at Price Paper & Twine > Company by phone at (516) 378-7842 and delete the message. Thank you very > much. > > __ > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
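As a sketch of that layout with ceph-volume (device names are examples; if the WAL and DB share the same fast device, --block.db alone is enough since the WAL then lives with the DB):

ceph-volume lvm create --bluestore --data /dev/sdb \
    --block.db /dev/nvram0p1 --block.wal /dev/nvram0p2

Repeat per HDD, pointing each OSD's --block.db/--block.wal at its own partition or LV on the NVRAM card.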
Re: [ceph-users] Newbie question: stretch ceph cluster
n 3 be skipped so that for each piece > of data, there are 3 replicas - original one, replica in same room, and > replica in other room, in order to save some space? > > Besides, would also like to ask if it's correct that the cluster will > continue to work (degraded) if one room is lost? > > Will there be any better way to setup such 'stretched' cluster between 2 > DCs? They're extension instead of real DR site... > > Sorry for the newbie questions and we'll proceed to have more study and > experiment on this. > > Thanks a lot. > > > > > > > So that any one of following failure won't affect the cluster's > operation and data availability: > > any one component in either data center failure of either one of the > data center > > > Is it possible? > > In general this is possible, but I would consider that replica=2 is > not a good idea. In case of a failure scenario or just maintenance and > one DC is powered off and just one single disk fails on the other DC, > this can already lead to data loss. My advice here would be, if anyhow > possible, please don't do replica=2. > > In case one data center failure case, seems replication can't occur any > more. Any CRUSH rule can achieve this purpose? > > > Sorry for the newbie question. > > > Thanks a lot. > > Regards > > /st wong > > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > -- > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, > HRB > 21284 (AG Nürnberg) > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > > On Sunday, August 21, 2016, Wilhelm Redbrake wrote: > > Hi Nick, > i understand all of your technical improvements. > But: why do you Not use a simple for example Areca Raid Controller with 8 > gb Cache and Bbu ontop in every ceph node. > Configure n Times RAID 0 on the Controller and enable Write back Cache. > That must be a latency "Killer" like in all the prop. Storage arrays or > Not ?? > > Best Regards !! > > > > What we saw specifically with Areca cards is that performance is excellent > in benchmarking and for bursty loads. However, once we started loading with > more constant workloads (we replicate databases and files to our Ceph > cluster), this looks to have saturated the relatively small Areca NVDIMM > caches and we went back to pure drive based performance. > > > > Yes, I think that is a valid point. Although low latency, you are still > having to write to the disks twice (journal+data), so once the cache’s on > the cards start filling up, you are going to hit problems. > > > > > > So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per > 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That > worked, but now the overall latency is really high at times, not always. > Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS > drives with too many IOPS, which get their latency sky high. Overall we are > functioning fine, but I sure would like storage vmotion and other large > operations faster. > > > > > > Yeah this is the biggest pain point I think. Normal VM ops are fine, but > if you ever have to move a multi-TB VM, it’s just too slow. > > > > If you use iscsi with vaai and are migrating a thick provisioned vmdk, > then performance is actually quite good, as the block sizes used for the > copy are a lot bigger. > > > > However, my use case required thin provisioned VM’s + snapshots and I > found that using iscsi you have no control over the fragmentation of the > vmdk’s and so the read performance is then what suffers (certainly with > 7.2k disks) > > > > Also with thin provisioned vmdk’s I think I was seeing PG contention with > the updating of the VMFS metadata, although I can’t be sure. > > > > > > I am thinking I will test a few different schedulers and readahead > settings to see if we can improve this by parallelizing reads. Also will > test NFS, but need to determine whether to do krbd/knfsd or something more > interesting like CephFS/Ganesha. > > > > As you know I’m on NFS now. I’ve found it a lot easier to get going and a > lot less sensitive to making config adjustments without suddenly everything > dropping offline. The fact that you can specify the extent size on XFS > helps massively with using thin vmdks/snapshots to avoid fragmentation. > Storage v-motions are a bit faster than iscsi, but I think I am hitting PG > contention when esxi tries to write 32 copy threads to the same object. > There is probably some tuning that could be done here (RBD striping???) but > this is the best it’s been for a long time and I’m reluctant to fiddle any > further. > > > > But as mentioned above, thick vmdk’s with vaai might be a really good fit. > Any chance thin vs. thick difference could be related to discards? I saw zillions of them in recent testing. > > > Thanks for your very valuable info on analysis and hw build. 
> > > > Alex > > > > > > > Am 21.08.2016 um 09:31 schrieb Nick Fisk : > > >> -Original Message- > >> From: Alex Gorbachev [mailto:a...@iss-integration.com] > >> Sent: 21 August 2016 04:15 > >> To: Nick Fisk > >> Cc: w...@globe.de; Horace Ng ; ceph-users < > ceph-users@lists.ceph.com> > >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance > >> > >> Hi Nick, > >> > >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk wrote: > >>>> -Original Message- > >>>> From: w...@globe.de [mailto:w...@globe.de] > >>>> Sent: 21 July 2016 13:23 > >>>> To: n...@fisk.me.uk; 'Horace Ng' > >>>> Cc: ceph-users@lists.ceph.com > >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance > >>>> > >>>> Okay and what is your plan now to speed up ? > >>> > >>> Now I have come up with a lower latency hardware design, there is not > much further improvement until persistent RBD caching is > >> implemented, as you will be moving the SSD/NVME closer to the client. > But I'm happy with what I can achieve at the mo
Re: [ceph-users] udev rule to set readahead on Ceph RBD's
On Mon, Aug 22, 2016 at 3:29 PM, Wido den Hollander wrote: > >> Op 22 augustus 2016 om 21:22 schreef Nick Fisk : >> >> >> > -Original Message- >> > From: Wido den Hollander [mailto:w...@42on.com] >> > Sent: 22 August 2016 18:22 >> > To: ceph-users ; n...@fisk.me.uk >> > Subject: Re: [ceph-users] udev rule to set readahead on Ceph RBD's >> > >> > >> > > Op 22 augustus 2016 om 15:17 schreef Nick Fisk : >> > > >> > > >> > > Hope it's useful to someone >> > > >> > > https://gist.github.com/fiskn/6c135ab218d35e8b53ec0148fca47bf6 >> > > >> > >> > Thanks for sharing. Might this be worth adding it to ceph-common? >> >> Maybe, Ilya kindly set the default for krbd to 4MB last year in the kernel, >> but maybe having this available would be handy if people ever want a >> different default. It could be set to 4MB as well, with a note somewhere to >> point people at its direction if they need to change it. >> > > I think it might be handy to have the udev file as redundancy. That way it > can easily be changed by users. The udev file is already present, they just > have to modify it. > >> > >> > And is 16MB something we should want by default or does this apply to your >> > situation better? >> >> It sort of applies to me. With a 4MB readahead you will probably struggle to >> get much more than around 50-80MB/s sequential reads, as the read ahead will >> only ever hit 1 object at a time. If you want to get nearer 200MB/s then you >> need to set either 16 or 32MB readahead. I need it to stream to LTO6 tape. >> Depending on what you are doing this may or may not be required. >> > > Ah, yes. I a kind of similar use-case I went for using 64MB objects > underneath a RBD device. We needed high sequential Write and Read performance > on those RBD devices since we were storing large files on there. > > Different approach, kind of similar result. Question: what scheduler were you guys using to facilitate the readahead on the RBD client? Have you noticed any difference between different elevators and have you tried blk-mq/scsi-mq? Thank you. -- Alex Gorbachev Storcium > > Wido > >> > >> > Wido >> > >> > > >> > > ___ >> > > ceph-users mailing list >> > > ceph-users@lists.ceph.com >> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
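For reference, a rule along these lines (an approximation only — the filename, match and 16384 KB value are examples, not a copy of Nick's gist) can be dropped into /etc/udev/rules.d/ and will apply the readahead to krbd devices as they appear:

# /etc/udev/rules.d/99-rbd-readahead.rules
KERNEL=="rbd*", ACTION=="add|change", ATTR{queue/read_ahead_kb}="16384"

Running udevadm control --reload followed by udevadm trigger (or simply remapping the RBDs) picks it up without a reboot.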
[ceph-users] Storcium has been certified by VMWare
I wanted to share that we have passed testing and received VMWare HCL certification for the ISS STORCIUM solution using Ceph Hammer as back end and SCST with Pacemaker as iSCSI delivery HA gateway. Thank you for all of your hard and continuous work on these projects. We will make sure that we continue to promote, improve, support, and deploy open source storage and compute solutions for healthcare and business applications. http://www.vmware.com/resources/compatibility/detail.php?deviceCategory=san&productid=41781&deviceCategory=san&details=1&keyword=41781&isSVA=0&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph + VMware + Single Thread Performance
HI Nick, On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk wrote: > *From:* Alex Gorbachev [mailto:a...@iss-integration.com] > *Sent:* 21 August 2016 15:27 > *To:* Wilhelm Redbrake > *Cc:* n...@fisk.me.uk; Horace Ng ; ceph-users < > ceph-users@lists.ceph.com> > *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance > > > > > > On Sunday, August 21, 2016, Wilhelm Redbrake wrote: > > Hi Nick, > i understand all of your technical improvements. > But: why do you Not use a simple for example Areca Raid Controller with 8 > gb Cache and Bbu ontop in every ceph node. > Configure n Times RAID 0 on the Controller and enable Write back Cache. > That must be a latency "Killer" like in all the prop. Storage arrays or > Not ?? > > Best Regards !! > > > > What we saw specifically with Areca cards is that performance is excellent > in benchmarking and for bursty loads. However, once we started loading with > more constant workloads (we replicate databases and files to our Ceph > cluster), this looks to have saturated the relatively small Areca NVDIMM > caches and we went back to pure drive based performance. > > > > Yes, I think that is a valid point. Although low latency, you are still > having to write to the disks twice (journal+data), so once the cache’s on > the cards start filling up, you are going to hit problems. > > > > > > So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per > 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That > worked, but now the overall latency is really high at times, not always. > Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS > drives with too many IOPS, which get their latency sky high. Overall we are > functioning fine, but I sure would like storage vmotion and other large > operations faster. > > > > > > Yeah this is the biggest pain point I think. Normal VM ops are fine, but > if you ever have to move a multi-TB VM, it’s just too slow. > > > > If you use iscsi with vaai and are migrating a thick provisioned vmdk, > then performance is actually quite good, as the block sizes used for the > copy are a lot bigger. > > > > However, my use case required thin provisioned VM’s + snapshots and I > found that using iscsi you have no control over the fragmentation of the > vmdk’s and so the read performance is then what suffers (certainly with > 7.2k disks) > > > > Also with thin provisioned vmdk’s I think I was seeing PG contention with > the updating of the VMFS metadata, although I can’t be sure. > > > > > > I am thinking I will test a few different schedulers and readahead > settings to see if we can improve this by parallelizing reads. Also will > test NFS, but need to determine whether to do krbd/knfsd or something more > interesting like CephFS/Ganesha. > > > > As you know I’m on NFS now. I’ve found it a lot easier to get going and a > lot less sensitive to making config adjustments without suddenly everything > dropping offline. The fact that you can specify the extent size on XFS > helps massively with using thin vmdks/snapshots to avoid fragmentation. > Storage v-motions are a bit faster than iscsi, but I think I am hitting PG > contention when esxi tries to write 32 copy threads to the same object. > There is probably some tuning that could be done here (RBD striping???) but > this is the best it’s been for a long time and I’m reluctant to fiddle any > further. 
> We have moved ahead and added NFS support to Storcium, and are now able to run NFS servers with Pacemaker in HA mode (all agents are public at https://github.com/akurz/resource-agents/tree/master/heartbeat). I can confirm that VM performance is definitely better and benchmarks are smoother (in Windows we can see a lot of choppiness with iSCSI; NFS is choppy on writes, but smooth on reads, likely due to the bursty nature of OSD filesystems when dealing with that small IO size). Were you using extsz=16384 at creation time for the filesystem? I saw kernel memory deadlock messages during vmotion, such as:
XFS: nfsd(102545) possible memory allocation deadlock size 40320 in kmem_alloc (mode:0x2400240)
And analyzing fragmentation:
root@roc-5r-scd218:~# xfs_db -r /dev/rbd21
xfs_db> frag -d
actual 0, ideal 0, fragmentation factor 0.00%
xfs_db> frag -f
actual 1863960, ideal 74, fragmentation factor 100.00%
Just from two vmotions. Are you seeing anything similar? Thank you, Alex > > > But as mentioned above, thick vmdk’s with vaai might be a really good fit. > > > > Thanks for your very valuable info on analysis and hw build. > > > > Alex > > > > > > > Am 21.08.2016 um 09:31 schrieb Nick
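For reference, the extent size hint under discussion can also be set after mkfs on the directory backing the export (path and value are examples; new files under that directory inherit the hint — mkfs.xfs's extszinherit option is the at-creation-time equivalent, if available in your xfsprogs):

xfs_io -c "extsize 16m" /srv/nfs/vmware
xfs_io -c "extsize" /srv/nfs/vmware        # prints the current hint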
Re: [ceph-users] Ceph + VMware + Single Thread Performance
On Saturday, September 3, 2016, Alex Gorbachev wrote: > HI Nick, > > On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk > wrote: > >> *From:* Alex Gorbachev [mailto:a...@iss-integration.com >> ] >> *Sent:* 21 August 2016 15:27 >> *To:* Wilhelm Redbrake > > >> *Cc:* n...@fisk.me.uk ; >> Horace Ng > >; ceph-users < >> ceph-users@lists.ceph.com >> > >> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance >> >> >> >> >> >> On Sunday, August 21, 2016, Wilhelm Redbrake > > wrote: >> >> Hi Nick, >> i understand all of your technical improvements. >> But: why do you Not use a simple for example Areca Raid Controller with 8 >> gb Cache and Bbu ontop in every ceph node. >> Configure n Times RAID 0 on the Controller and enable Write back Cache. >> That must be a latency "Killer" like in all the prop. Storage arrays or >> Not ?? >> >> Best Regards !! >> >> >> >> What we saw specifically with Areca cards is that performance is >> excellent in benchmarking and for bursty loads. However, once we started >> loading with more constant workloads (we replicate databases and files to >> our Ceph cluster), this looks to have saturated the relatively small Areca >> NVDIMM caches and we went back to pure drive based performance. >> >> >> >> Yes, I think that is a valid point. Although low latency, you are still >> having to write to the disks twice (journal+data), so once the cache’s on >> the cards start filling up, you are going to hit problems. >> >> >> >> >> >> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per >> 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That >> worked, but now the overall latency is really high at times, not always. >> Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS >> drives with too many IOPS, which get their latency sky high. Overall we are >> functioning fine, but I sure would like storage vmotion and other large >> operations faster. >> >> >> >> >> >> Yeah this is the biggest pain point I think. Normal VM ops are fine, but >> if you ever have to move a multi-TB VM, it’s just too slow. >> >> >> >> If you use iscsi with vaai and are migrating a thick provisioned vmdk, >> then performance is actually quite good, as the block sizes used for the >> copy are a lot bigger. >> >> >> >> However, my use case required thin provisioned VM’s + snapshots and I >> found that using iscsi you have no control over the fragmentation of the >> vmdk’s and so the read performance is then what suffers (certainly with >> 7.2k disks) >> >> >> >> Also with thin provisioned vmdk’s I think I was seeing PG contention with >> the updating of the VMFS metadata, although I can’t be sure. >> >> >> >> >> >> I am thinking I will test a few different schedulers and readahead >> settings to see if we can improve this by parallelizing reads. Also will >> test NFS, but need to determine whether to do krbd/knfsd or something more >> interesting like CephFS/Ganesha. >> >> >> >> As you know I’m on NFS now. I’ve found it a lot easier to get going and a >> lot less sensitive to making config adjustments without suddenly everything >> dropping offline. The fact that you can specify the extent size on XFS >> helps massively with using thin vmdks/snapshots to avoid fragmentation. >> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG >> contention when esxi tries to write 32 copy threads to the same object. >> There is probably some tuning that could be done here (RBD striping???) 
but >> this is the best it’s been for a long time and I’m reluctant to fiddle any >> further. >> > > We have moved ahead and added NFS support to Storcium, and now able ti run > NFS servers with Pacemaker in HA mode (all agents are public at > https://github.com/akurz/resource-agents/tree/master/heartbeat). I can > confirm that VM performance is definitely better and benchmarks are more > smooth (in Windows we can see a lot of choppiness with iSCSI, NFS is choppy > on writes, but smooth on reads, likely due to the bursty nature of OSD > filesystems when dealing with that small IO size). > > Were you using extsz=16384 at creation time for the filesystem? I saw > kernel memory deadlock messages during vmotion, such as: > > XFS: nfsd(102545) possible memory allocation deadlock size 40320 in >
[ceph-users] Ubuntu latest ceph-deploy fails to install hammer
This problem seems to occur with the latest ceph-deploy version 1.5.35 [lab2-mon3][DEBUG ] Fetched 5,382 kB in 4s (1,093 kB/s) [lab2-mon3][DEBUG ] Reading package lists... [lab2-mon3][INFO ] Running command: env DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get --assume-yes -q --no-install-recommends install -o Dpkg::Options::=--force-confnew ceph-osd ceph-mds ceph-mon radosgw [lab2-mon3][DEBUG ] Reading package lists... [lab2-mon3][DEBUG ] Building dependency tree... [lab2-mon3][DEBUG ] Reading state information... [lab2-mon3][WARNIN] E: Unable to locate package ceph-osd [lab2-mon3][WARNIN] E: Unable to locate package ceph-mon [lab2-mon3][ERROR ] RuntimeError: command returned non-zero exit status: 100 [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: env DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get --assume-yes -q --no-install-recommends install -o Dpkg::Options::=--force-confnew ceph-osd ceph-mds ceph-mon radosgw -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ubuntu latest ceph-deploy fails to install hammer
Confirmed - older version of ceph-deploy is working fine. Odd as there is a large number of Hammer users out there. Thank you for the explanation and fix. -- Alex Gorbachev Storcium On Fri, Sep 9, 2016 at 12:15 PM, Vasu Kulkarni wrote: > There is a known issue with latest ceph-deploy with *hammer*, the > package split in later releases after *hammer* is the root cause, > If you use ceph-deploy 1.5.25 (older version) it will work. you can > get 1.5.25 from pypi > > http://tracker.ceph.com/issues/17128 > > On Fri, Sep 9, 2016 at 8:28 AM, Shain Miley wrote: >> Alex, >> I ran into this issue yesterday as well. >> >> I ended up just installing ceph via apt-get locally on the new server. >> >> I have not been able to get an actual osd added to the cluster at this point >> though (see my emails over the last 2 days or so). >> >> Please let me know if you end up able to add an osd properly with 1.5.35. >> >> Thanks, >> >> Shain >> >>> On Sep 9, 2016, at 11:12 AM, Alex Gorbachev >>> wrote: >>> >>> This problem seems to occur with the latest ceph-deploy version 1.5.35 >>> >>> [lab2-mon3][DEBUG ] Fetched 5,382 kB in 4s (1,093 kB/s) >>> [lab2-mon3][DEBUG ] Reading package lists... >>> [lab2-mon3][INFO ] Running command: env >>> DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get >>> --assume-yes -q --no-install-recommends install -o >>> Dpkg::Options::=--force-confnew ceph-osd ceph-mds ceph-mon radosgw >>> [lab2-mon3][DEBUG ] Reading package lists... >>> [lab2-mon3][DEBUG ] Building dependency tree... >>> [lab2-mon3][DEBUG ] Reading state information... >>> [lab2-mon3][WARNIN] E: Unable to locate package ceph-osd >>> [lab2-mon3][WARNIN] E: Unable to locate package ceph-mon >>> [lab2-mon3][ERROR ] RuntimeError: command returned non-zero exit status: 100 >>> [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: env >>> DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get >>> --assume-yes -q --no-install-recommends install -o >>> Dpkg::Options::=--force-confnew ceph-osd ceph-mds ceph-mon radosgw >>> >>> -- >>> Alex Gorbachev >>> Storcium >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
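For anyone else hitting this, pinning the older release from PyPI is enough (version number per Vasu's note above):

pip install ceph-deploy==1.5.25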
Re: [ceph-users] Ceph + VMware + Single Thread Performance
Confirming again much better performance with ESXi and NFS on RBD using the XFS hint Nick uses, below. I saw high load averages on the NFS server nodes, corresponding to iowait, but it does not seem to cause too much trouble so far. Here are HDtune Pro testing results from some recent runs. The puzzling part is better random IO performance with a 16 mb object size on both iSCSI and NFS. In my thinking this should be slower; however, this has been confirmed by the timed vmotion tests and more random IO tests by my coworker as well:

Test_type     read MB/s   write MB/s   read iops   write iops   read multi iops   write multi iops
NFS 1mb             460          103        8753           66             47466               1616
NFS 4mb             441          147        8863           82             47556                764
iSCSI 1mb           117           76         326           90               672                938
iSCSI 4mb           275           60         205           24              2015               1212
NFS 16mb            455          177        7761          119             36403               3175
iSCSI 16mb          300           65        1117          237             12389               1826

( prettier view at http://storcium.blogspot.com/2016/09/latest-tests-on-nfs-vs.html ) Alex > > From: Alex Gorbachev [mailto:a...@iss-integration.com] > Sent: 04 September 2016 04:45 > To: Nick Fisk > Cc: Wilhelm Redbrake ; Horace Ng ; > ceph-users > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance > > > > > > On Saturday, September 3, 2016, Alex Gorbachev > wrote: > > HI Nick, > > On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk wrote: > > From: Alex Gorbachev [mailto:a...@iss-integration.com] > Sent: 21 August 2016 15:27 > To: Wilhelm Redbrake > Cc: n...@fisk.me.uk; Horace Ng ; ceph-users > > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance > > > > > > On Sunday, August 21, 2016, Wilhelm Redbrake wrote: > > Hi Nick, > i understand all of your technical improvements. > But: why do you Not use a simple for example Areca Raid Controller with 8 gb > Cache and Bbu ontop in every ceph node. > Configure n Times RAID 0 on the Controller and enable Write back Cache. > That must be a latency "Killer" like in all the prop. Storage arrays or Not ?? > > Best Regards !! > > > > What we saw specifically with Areca cards is that performance is excellent in > benchmarking and for bursty loads. However, once we started loading with more > constant workloads (we replicate databases and files to our Ceph cluster), > this looks to have saturated the relatively small Areca NVDIMM caches and we > went back to pure drive based performance. > > > > Yes, I think that is a valid point. Although low latency, you are still > having to write to the disks twice (journal+data), so once the cache’s on the > cards start filling up, you are going to hit problems. > > > > > > So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 > HDDs) in hopes that it would help reduce the noisy neighbor impact. That > worked, but now the overall latency is really high at times, not always. Red > Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with > too many IOPS, which get their latency sky high. Overall we are functioning > fine, but I sure would like storage vmotion and other large operations faster. > > > > > > Yeah this is the biggest pain point I think. Normal VM ops are fine, but if > you ever have to move a multi-TB VM, it’s just too slow. > > > > If you use iscsi with vaai and are migrating a thick provisioned vmdk, then > performance is actually quite good, as the block sizes used for the copy are > a lot bigger. 
> > > > However, my use case required thin provisioned VM’s + snapshots and I found > that using iscsi you have no control over the fragmentation of the vmdk’s and > so the read performance is then what suffers (certainly with 7.2k disks) > > > > Also with thin provisioned vmdk’s I think I was seeing PG contention with the > updating of the VMFS metadata, although I can’t be sure. > > > > > > I am thinking I will test a few different schedulers and readahead settings > to see if we can improve this by parallelizing reads. Also will test NFS, but > need to determine whether to do krbd/knfsd or something more interesting like > CephFS/Ganesha. > > > > As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot > less sensitive to making config adjustments without suddenly everything > dropping offline. The fact that you can specify the extent size on XFS helps > massively with using thin vmdks/snapshots to avoid fragmentation. Storage > v-motions are a bit faster than iscsi, but I think I am hitting PG contention > when esxi tries to write 32 copy threads to the same object. There is > probably some tuning that could be done here (RBD striping???) but this is > the best it’s been for a long time and I’m reluctant to fiddle any further. > > > > W
Re: [ceph-users] Ceph + VMware + Single Thread Performance
On Sun, Sep 4, 2016 at 4:48 PM, Nick Fisk wrote: > > > > > *From:* Alex Gorbachev [mailto:a...@iss-integration.com] > *Sent:* 04 September 2016 04:45 > *To:* Nick Fisk > *Cc:* Wilhelm Redbrake ; Horace Ng ; > ceph-users > *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance > > > > > > On Saturday, September 3, 2016, Alex Gorbachev > wrote: > > HI Nick, > > On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk wrote: > > *From:* Alex Gorbachev [mailto:a...@iss-integration.com] > *Sent:* 21 August 2016 15:27 > *To:* Wilhelm Redbrake > *Cc:* n...@fisk.me.uk; Horace Ng ; ceph-users < > ceph-users@lists.ceph.com> > *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance > > > > > > On Sunday, August 21, 2016, Wilhelm Redbrake wrote: > > Hi Nick, > i understand all of your technical improvements. > But: why do you Not use a simple for example Areca Raid Controller with 8 > gb Cache and Bbu ontop in every ceph node. > Configure n Times RAID 0 on the Controller and enable Write back Cache. > That must be a latency "Killer" like in all the prop. Storage arrays or > Not ?? > > Best Regards !! > > > > What we saw specifically with Areca cards is that performance is excellent > in benchmarking and for bursty loads. However, once we started loading with > more constant workloads (we replicate databases and files to our Ceph > cluster), this looks to have saturated the relatively small Areca NVDIMM > caches and we went back to pure drive based performance. > > > > Yes, I think that is a valid point. Although low latency, you are still > having to write to the disks twice (journal+data), so once the cache’s on > the cards start filling up, you are going to hit problems. > > > > > > So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per > 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That > worked, but now the overall latency is really high at times, not always. > Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS > drives with too many IOPS, which get their latency sky high. Overall we are > functioning fine, but I sure would like storage vmotion and other large > operations faster. > > > > > > Yeah this is the biggest pain point I think. Normal VM ops are fine, but > if you ever have to move a multi-TB VM, it’s just too slow. > > > > If you use iscsi with vaai and are migrating a thick provisioned vmdk, > then performance is actually quite good, as the block sizes used for the > copy are a lot bigger. > > > > However, my use case required thin provisioned VM’s + snapshots and I > found that using iscsi you have no control over the fragmentation of the > vmdk’s and so the read performance is then what suffers (certainly with > 7.2k disks) > > > > Also with thin provisioned vmdk’s I think I was seeing PG contention with > the updating of the VMFS metadata, although I can’t be sure. > > > > > > I am thinking I will test a few different schedulers and readahead > settings to see if we can improve this by parallelizing reads. Also will > test NFS, but need to determine whether to do krbd/knfsd or something more > interesting like CephFS/Ganesha. > > > > As you know I’m on NFS now. I’ve found it a lot easier to get going and a > lot less sensitive to making config adjustments without suddenly everything > dropping offline. The fact that you can specify the extent size on XFS > helps massively with using thin vmdks/snapshots to avoid fragmentation. 
> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG > contention when esxi tries to write 32 copy threads to the same object. > There is probably some tuning that could be done here (RBD striping???) but > this is the best it’s been for a long time and I’m reluctant to fiddle any > further. > > > > We have moved ahead and added NFS support to Storcium, and now able ti run > NFS servers with Pacemaker in HA mode (all agents are public at > https://github.com/akurz/resource-agents/tree/master/heartbeat > <http://xo4t.mj.am/lnk/AEMAFOTiMP4AAFhNkjYAADNJBWwAAACRXwBXzIiFBSEAPLcmRUCEpgI8l005EAAAlBI/1/SaDNCfweUSbAAalNO6TCqg/aHR0cHM6Ly9naXRodWIuY29tL2FrdXJ6L3Jlc291cmNlLWFnZW50cy90cmVlL21hc3Rlci9oZWFydGJlYXQ>). > I can confirm that VM performance is definitely better and benchmarks are > more smooth (in Windows we can see a lot of choppiness with iSCSI, NFS is > choppy on writes, but smooth on reads, likely due to the bursty nature of > OSD filesystems when dealing with that small IO size). > > > > Were you using extsz=16384 at creation time for the files
Re: [ceph-users] Ceph + VMware + Single Thread Performance
-- Alex Gorbachev Storcium On Sun, Sep 11, 2016 at 12:54 PM, Nick Fisk wrote: > > > > > *From:* Alex Gorbachev [mailto:a...@iss-integration.com] > *Sent:* 11 September 2016 16:14 > > *To:* Nick Fisk > *Cc:* Wilhelm Redbrake ; Horace Ng ; > ceph-users > *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance > > > > > > On Sun, Sep 4, 2016 at 4:48 PM, Nick Fisk wrote: > > > > > > *From:* Alex Gorbachev [mailto:a...@iss-integration.com] > *Sent:* 04 September 2016 04:45 > *To:* Nick Fisk > *Cc:* Wilhelm Redbrake ; Horace Ng ; > ceph-users > *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance > > > > > > > On Saturday, September 3, 2016, Alex Gorbachev > wrote: > > HI Nick, > > On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk wrote: > > *From:* Alex Gorbachev [mailto:a...@iss-integration.com] > *Sent:* 21 August 2016 15:27 > *To:* Wilhelm Redbrake > *Cc:* n...@fisk.me.uk; Horace Ng ; ceph-users < > ceph-users@lists.ceph.com> > *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance > > > > > > On Sunday, August 21, 2016, Wilhelm Redbrake wrote: > > Hi Nick, > i understand all of your technical improvements. > But: why do you Not use a simple for example Areca Raid Controller with 8 > gb Cache and Bbu ontop in every ceph node. > Configure n Times RAID 0 on the Controller and enable Write back Cache. > That must be a latency "Killer" like in all the prop. Storage arrays or > Not ?? > > Best Regards !! > > > > What we saw specifically with Areca cards is that performance is excellent > in benchmarking and for bursty loads. However, once we started loading with > more constant workloads (we replicate databases and files to our Ceph > cluster), this looks to have saturated the relatively small Areca NVDIMM > caches and we went back to pure drive based performance. > > > > Yes, I think that is a valid point. Although low latency, you are still > having to write to the disks twice (journal+data), so once the cache’s on > the cards start filling up, you are going to hit problems. > > > > > > So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per > 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That > worked, but now the overall latency is really high at times, not always. > Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS > drives with too many IOPS, which get their latency sky high. Overall we are > functioning fine, but I sure would like storage vmotion and other large > operations faster. > > > > > > Yeah this is the biggest pain point I think. Normal VM ops are fine, but > if you ever have to move a multi-TB VM, it’s just too slow. > > > > If you use iscsi with vaai and are migrating a thick provisioned vmdk, > then performance is actually quite good, as the block sizes used for the > copy are a lot bigger. > > > > However, my use case required thin provisioned VM’s + snapshots and I > found that using iscsi you have no control over the fragmentation of the > vmdk’s and so the read performance is then what suffers (certainly with > 7.2k disks) > > > > Also with thin provisioned vmdk’s I think I was seeing PG contention with > the updating of the VMFS metadata, although I can’t be sure. > > > > > > I am thinking I will test a few different schedulers and readahead > settings to see if we can improve this by parallelizing reads. Also will > test NFS, but need to determine whether to do krbd/knfsd or something more > interesting like CephFS/Ganesha. > > > > As you know I’m on NFS now. 
I’ve found it a lot easier to get going and a > lot less sensitive to making config adjustments without suddenly everything > dropping offline. The fact that you can specify the extent size on XFS > helps massively with using thin vmdks/snapshots to avoid fragmentation. > Storage v-motions are a bit faster than iscsi, but I think I am hitting PG > contention when esxi tries to write 32 copy threads to the same object. > There is probably some tuning that could be done here (RBD striping???) but > this is the best it’s been for a long time and I’m reluctant to fiddle any > further. > > > > We have moved ahead and added NFS support to Storcium, and now able ti run > NFS servers with Pacemaker in HA mode (all agents are public at > https://github.com/akurz/resource-agents/tree/master/heartbeat > <http://xo4t.mj.am/lnk/AEEAFVynFzgAAFhNkjYAADNJBWwAAACRXwBX1Yw-HyPgnby2QY24q0KYBbMaNgAAlBI/1/jhUfi_RqmIdhFLdYFMjkzg/aHR0cDovL3hvNHQubWouYW0vbG5rL0FFTUFGT1RpTVA0QUFBQUFBQUFBQUZoTmtqWU
Re: [ceph-users] Ceph + VMWare
On Wed, Oct 5, 2016 at 2:32 PM, Patrick McGarry wrote: > Hey guys, > > Starting to buckle down a bit in looking at how we can better set up > Ceph for VMWare integration, but I need a little info/help from you > folks. > > If you currently are using Ceph+VMWare, or are exploring the option, > I'd like some simple info from you: > > 1) Company > 2) Current deployment size > 3) Expected deployment growth > 4) Integration method (or desired method) ex: iscsi, native, etc > > Just casting the net so we know who is interested and might want to > help us shape and/or test things in the future if we can make it > better. Thanks. > Hi Patrick, We have Storcium certified with VMWare, and we use it ourselves: Ceph Hammer latest SCST redundant Pacemaker based delivery front ends - our agents are published on github EnhanceIO for read caching at delivery layer NFS v3, and iSCSI and FC delivery Our deployment size we use ourselves is 700 TB raw. Challenges are as others described, but HA and multi host access works fine courtesy of SCST. Write amplification is a challenge on spinning disks. Happy to share more. Alex > > -- > > Best Regards, > > Patrick McGarry > Director Ceph Community || Red Hat > http://ceph.com || http://community.redhat.com > @scuttlemonkey || @ceph > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph + VMWare
On Tuesday, October 18, 2016, Frédéric Nass wrote: > Hi Alex, > > Just to know, what kind of backstore are you using whithin Storcium ? > vdisk_fileio > or vdisk_blockio ? > > I see your agents can handle both : http://www.spinics.net/lists/ > ceph-users/msg27817.html > Hi Frédéric, We use all of them, and NFS as well, which has been performing quite well. Vdisk_fileio is a bit dangerous in write cache mode. Also, for some reason, object size of 16MB for RBD does better with VMWare. Storcium gives you a choice for each LUN. The challenge has been figuring out optimal workloads under highly varied use cases. I see better results with NVMe journals and write combining HBAs, e.g. Areca. Regards, Alex > Regards, > > Frédéric. > > Le 06/10/2016 à 16:01, Alex Gorbachev a écrit : > > On Wed, Oct 5, 2016 at 2:32 PM, Patrick McGarry > wrote: > > Hey guys, > > Starting to buckle down a bit in looking at how we can better set up > Ceph for VMWare integration, but I need a little info/help from you > folks. > > If you currently are using Ceph+VMWare, or are exploring the option, > I'd like some simple info from you: > > 1) Company > 2) Current deployment size > 3) Expected deployment growth > 4) Integration method (or desired method) ex: iscsi, native, etc > > Just casting the net so we know who is interested and might want to > help us shape and/or test things in the future if we can make it > better. Thanks. > > > Hi Patrick, > > We have Storcium certified with VMWare, and we use it ourselves: > > Ceph Hammer latest > > SCST redundant Pacemaker based delivery front ends - our agents are > published on github > > EnhanceIO for read caching at delivery layer > > NFS v3, and iSCSI and FC delivery > > Our deployment size we use ourselves is 700 TB raw. > > Challenges are as others described, but HA and multi host access works > fine courtesy of SCST. Write amplification is a challenge on spinning > disks. > > Happy to share more. > > Alex > > > -- > > Best Regards, > > Patrick McGarry > Director Ceph Community || Red Hathttp://ceph.com || > http://community.redhat.com > @scuttlemonkey || @ceph > ___ > ceph-users mailing listceph-us...@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
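For reference, the 16MB object size mentioned above is requested at image creation time, e.g. (pool/image names and size are examples; Hammer-era clients use --order 24, i.e. 2^24 bytes, instead of --object-size):

rbd create --size 2097152 --object-size 16M rbd/esx-lun01     (size in MB, i.e. 2 TB here)
rbd create --size 2097152 --order 24 rbd/esx-lun01            (older clients, e.g. Hammer)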
Re: [ceph-users] Monitoring Overhead
Hi Ashley, On Monday, October 24, 2016, Ashley Merrick wrote: > Hello, > > Thanks both for your responses, defiantly looking at collectd + graphite, > just wanted to see what overheads where like, far from in a situation that > would choke the cluster but wanted to check first. > > I run ceph -s with json output, parse that (with e.g. Perl, or you can use Python etc) and store in mysql database. This provides a few snapshots and simple at a glance analysis. Overhead is practically none. For OSDs things are trickier, but for simplicity's sake we run iostat for a few cycles and parse that output, then aggregate. Collectd and graphite look really nice. Regards, Alex > Thanks, > Ashley > > -Original Message- > From: Christian Balzer [mailto:ch...@gol.com ] > Sent: 24 October 2016 11:04 > To: ceph-users@lists.ceph.com > Cc: John Spray >; Ashley Merrick < > ash...@amerrick.co.uk > > Subject: Re: [ceph-users] Monitoring Overhead > > > Hello, > > On Mon, 24 Oct 2016 10:46:31 +0100 John Spray wrote: > > > On Mon, Oct 24, 2016 at 4:21 AM, Ashley Merrick > wrote: > > > Hello, > > > > > > > > > > > > This may come across as a simple question but just wanted to check. > > > > > > > > > > > > I am looking at importing live data from my cluster via ceph -s > > > e.t.c into a graphical graph interface so I can monitor performance > > > / iops / e.t.c overtime. > > > > > > > > > > > > I am looking to pull this data from one or more monitor nodes, when > > > the data is retrieved for the ceph -s output is this information > > > that the monitor already has locally or is there an overhead that is > > > applied to the whole cluster to retrieve this data every time the > command is executed? > > > > It's all from the local state on the mons, the OSDs aren't involved at > > all in responding to the status command. > > > That said, as mentioned before on this ML, the output of "ceph -s" is a > sample from a window and only approaching something reality if sampled and > divided of a long period. > > If you need something that involves "what happened on OSD x at time y", > collectd and graphite (or deviations of if) are your friends, but they do > cost you a CPU cycle or two. > OTOH, if your OSDs or MONs were to choke from that kind of monitoring, > you're walking on very thin ice already. > > Christian > > > Cheers, > > John > > > > > > > > > > > > > > Reason I ask is I want to make sure I am not applying unnecessary > > > overhead and load onto all OSD node’s to retrieve this data at a > > > near live view, I fully understand it will apply a small amount of > > > load / CPU on the local MON to process the command, I am more > interesting in overall cluster. > > > > > > > > > > > > Thanks, > > > > > > Ashley > > > > > > > > > ___ > > > ceph-users mailing list > > > ceph-users@lists.ceph.com > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten > Communications > http://www.gol.com/ > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
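A minimal sketch of that approach in Python (the JSON field names are assumptions — the layout of "ceph -s --format json" changes between releases, hence the defensive lookups — and the MySQL insert is omitted):

#!/usr/bin/env python
# Poll "ceph -s" as JSON and pull out a few headline numbers.
import json, subprocess

raw = subprocess.check_output(["ceph", "-s", "--format", "json"])
status = json.loads(raw.decode("utf-8"))

health = status.get("health", {})
pgmap = status.get("pgmap", {})

sample = {
    # older releases expose health["overall_status"], newer ones health["status"]
    "health": health.get("overall_status") or health.get("status"),
    "num_pgs": pgmap.get("num_pgs"),
    "read_bytes_sec": pgmap.get("read_bytes_sec", 0),
    "write_bytes_sec": pgmap.get("write_bytes_sec", 0),
}
print(sample)  # the workflow above would insert this row into MySQL instead

Run it from cron at whatever interval suits; the overhead is a single mon query per run.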
Re: [ceph-users] 10Gbit switch advice for small ceph cluster upgrade
On Thursday, October 27, 2016, Jelle de Jong wrote: > Hello everybody, > > I want to upgrade my small ceph cluster to 10Gbit networking and would > like some recommendation from your experience. > > What is your recommend budget 10Gbit switch suitable for Ceph? We use Mellanox SX1036 and SX1012, which can function in 10 and 56GbE modes. They use QSFP, Twinax or MPO, which terminates in LC fiber connections. While not dirt cheap or entry level, we like these as being considerably cheaper than even a decent SDN solution. We have been able to build MLAG and leaf-and-spine solutions pretty easily with these. > > I would like to use X550-T1 intel adapters in my nodes. > > Or is fibre recommended? > X520-DA2 > X520-SR1 > > Kind regards, > > Jelle de Jong > GNU/Linux Consultant > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com