Re: [ceph-users] Good way to monitor detailed latency/throughput
On Fri, 05 Sep 2014 16:23:13 +0200 Josef Johansson wrote:
> Hi,
>
> How do you guys monitor the cluster to find disks that behave bad, or
> VMs that impact the Ceph cluster?
>
> I'm looking for something where I could get a good bird-view of
> latency/throughput, that uses something easy like SNMP.
>
You mean there is another form of monitoring than waiting for the users/customers to yell at you because performance sucks? ^o^

The first part is relatively easy: run something like "iostat -y -x 300" and feed the output into SNMP via the extend functionality. Maybe somebody has done that already, but it would be trivial anyway.

The hard part here is what to do with that data. Just graphing it is great for post-mortem analysis, or if you have 24h staff staring blindly at monitors. Deciding what numbers warrant a warning or even a notification (in Nagios terms) is going to be much harder.

Take this iostat -x output (all activity since boot) for example:

Device:  rrqm/s   wrqm/s     r/s     w/s   rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda        0.00     0.30    0.02    2.61    0.43    405.49   308.65     0.03   10.44    0.50   10.52   0.83   0.22
sdb        0.00     0.30    0.01    2.55    0.27    379.16   296.20     0.03   11.31    0.73   11.35   0.80   0.21
sdc        0.00     0.29    0.02    2.44    0.38    376.57   307.23     0.03   11.82    0.56   11.89   0.84   0.21
sdd        0.00     0.29    0.01    2.42    0.24    369.05   304.43     0.03   11.51    0.63   11.55   0.84   0.20
sde        0.02   266.52    0.65    2.93   72.56    365.03   244.67     0.29   79.75    1.65   97.16   1.60   0.57
sdg        0.01     0.97    0.72    0.65   76.33    187.84   384.75     0.09   69.06    1.85  143.21   2.87   0.39
sdf        0.01     0.87    0.68    0.59   67.04    167.94   369.82     0.09   67.58    2.79  143.18   3.44   0.44
sdh        0.00     0.94    0.94    0.64   74.87    182.81   327.19     0.09   57.34    1.91  139.22   2.79   0.44
sdj        0.01     0.96    0.93    0.65   75.76    187.75   331.78     0.10   62.76    1.81  149.88   2.72   0.43
sdk        0.01     1.02    1.00    0.67   77.78    188.83   320.46     0.08   47.02    1.66  115.02   2.53   0.42
sdi        0.01     0.93    0.96    0.61   74.38    173.72   317.35     0.22  140.56    2.16  358.85   3.49   0.54
sdl        0.01     0.92    0.71    0.62   72.57    175.19   373.05     0.09   65.36    2.01  138.19   3.03   0.40

sda to sdd are SSDs, so for starters you can't compare them with spinning rust. And if you were to look for outliers, all of sde to sdl (the actual disks) are suspiciously slow. ^o^

If you look at sde it seems to be faster than the rest, but that is because the original drive was replaced and thus the new one has seen less action than the rest. The actual wonky drive is sdi, looking at await/w_await and svctm. This drive sometimes goes into a state (for 10-20 hours at a time) where it can only perform at half speed.

These are the same drives when running a rados bench against the cluster; sdi is currently not wonky and performing at full speed:

Device:  rrqm/s   wrqm/s     r/s     w/s   rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sde        0.00   173.00    0.00  236.60    0.00  91407.20   772.67    76.40  338.32    0.00  338.32   4.21  99.60
sdg        0.00   153.00    0.40  234.60    1.60  88052.40   749.40    83.61  359.95   23.00  360.52   4.24  99.68
sdf        0.00   147.30    0.50  206.00    2.00  68918.40   667.51    50.15  264.40   65.60  264.88   4.45  91.88
sdh        0.00   158.10    0.80  170.90    3.20  66077.20   769.72    31.31  153.45   12.50  154.11   5.40  92.76
sdj        0.00   158.00    0.60  207.00    2.40  77455.20   746.22    61.61  296.78   55.33  297.48   4.79  99.52
sdk        0.00   160.90    0.90  242.30    3.60  92251.20   758.67    57.11  234.84   40.44  235.57   4.06  98.68
sdi        0.00   166.70    1.00  190.90    4.00  69919.20   728.75    60.15  282.98   24.00  284.34   5.16  99.00
sdl        0.00   131.90    0.80  207.10    3.20  85014.00   817.87    92.10  412.02   53.00  413.41   4.79  99.52

Now things are more uniform (of course Ceph never is really uniform, and sdh was more busy and thus slower in the next sample).
If sdi were in its half speed mode, it would be at 100% (all the time, while the other drives were not and often even idle) and with a svctm of about 15 and w_await well over 800. You could simply say that with this baseline, anything that goes over 500 w_await is worthy of an alert, but it might only get there if your cluster is sufficiently busy. To really find a "slow" disk, you need to compare identical disks having the same workload. Personally I'm still not sure what formula to use, even though it is so blatantly obvious and v
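For anyone who wants to try the SNMP extend route mentioned above, a minimal sketch follows. The paths, the cron interval and the sd[e-l] device range are illustrative only; the 500 w_await threshold is the ballpark figure from above, and the awk field numbers match the sysstat layout shown in the iostat output.

  # /etc/cron.d/iostat-sample: write a fresh 60-second averaged sample every 5 minutes
  */5 * * * * root iostat -y -x 60 1 > /var/run/iostat.latest 2>&1

  # /etc/snmp/snmpd.conf: expose that sample via the NET-SNMP extend MIB
  extend diskstats /bin/cat /var/run/iostat.latest

  # crude outlier check (e.g. for a Nagios-style plugin): print any data disk
  # whose w_await ($12 here) exceeds 500 and exit non-zero if one was found
  iostat -y -x 60 1 | awk '/^sd[e-l]/ && $12 > 500 {print $1 " w_await=" $12; bad=1} END {exit bad+0}'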
Re: [ceph-users] SSD journal deployment experiences
On Fri, 5 Sep 2014 09:42:02 + Dan Van Der Ster wrote: > > > On 05 Sep 2014, at 11:04, Christian Balzer wrote: > > > > On Fri, 5 Sep 2014 07:46:12 + Dan Van Der Ster wrote: > >> > >>> On 05 Sep 2014, at 03:09, Christian Balzer wrote: > >>> > >>> On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote: > >>> > On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster > wrote: > [snip] > > 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful > > is the backfilling which results from an SSD failure? Have you > > considered tricks like increasing the down out interval so > > backfilling doesn’t happen in this case (leaving time for the SSD > > to be replaced)? > > > > Replacing a failed SSD won't help your backfill. I haven't actually > tested it, but I'm pretty sure that losing the journal effectively > corrupts your OSDs. I don't know what steps are required to > complete this operation, but it wouldn't surprise me if you need to > re-format the OSD. > > >>> This. > >>> All the threads I've read about this indicate that journal loss > >>> during operation means OSD loss. Total OSD loss, no recovery. > >>> From what I gathered the developers are aware of this and it might be > >>> addressed in the future. > >>> > >> > >> I suppose I need to try it then. I don’t understand why you can't just > >> use ceph-osd -i 10 --mkjournal to rebuild osd 10’s journal, for > >> example. > >> > > I think the logic is if you shut down an OSD cleanly beforehand you can > > just do that. > > However from what I gathered there is no logic to re-issue transactions > > that made it to the journal but not the filestore. > > So a journal SSD failing mid-operation with a busy OSD would certainly > > be in that state. > > > > I had thought that the journal write and the buffered filestore write > happen at the same time. Nope, definitely not. That's why we have tunables like the ones at: http://ceph.com/docs/master/rados/configuration/filestore-config-ref/#synchronization-intervals And people (me included) tend to crank that up (to eleven ^o^). The write-out to the filestore may start roughly at the same time as the journal gets things, but it can and will fall behind. > So all the previous journal writes that > succeeded are already on their way to the filestore. My (could be > incorrect) understanding is that the real purpose of the journal is to > be able to replay writes after a power outage (since the buffered > filestore writes would be lost in that case). If there is no power > outage, then filestore writes are still good regardless of a journal > failure. > From Cephs perspective a write is successful once it is on all replica size journals. I think (hope) that what you wrote up there to be true, but that doesn't change the fact that journal data not even on the way to the filestore yet is the crux here. > > > I'm sure (hope) somebody from the Ceph team will pipe up about this. > > Ditto! > Guess it will be next week... > > >>> Now 200GB DC 3700s can write close to 400MB/s so a 1:4 or even 1:5 > >>> ratio is sensible. However these will be the ones limiting your max > >>> sequential write speed if that is of importance to you. In nearly all > >>> use cases you run out of IOPS (on your HDDs) long before that becomes > >>> an issue, though. > >> > >> IOPS is definitely the main limit, but we also only have 1 single > >> 10Gig-E NIC on these servers, so 4 drives that can write (even only > >> 200MB/s) would be good enough. > >> > > Fair enough. 
^o^ > > > >> Also, we’ll put the SSDs in the first four ports of an SAS2008 HBA > >> which is shared with the other 20 spinning disks. Counting the double > >> writes, the HBA will run out of bandwidth before these SSDs, I expect. > >> > > Depends on what PCIe slot it is and so forth. A 2008 should give you > > 4GB/s, enough to keep the SSDs happy at least. ^o^ > > > > A 2008 has only 8 SAS/SATA ports, so are you using port expanders on > > your case backplane? > > In that case you might want to spread the SSDs out over channels, as in > > have 3 HDDs sharing one channel with one SSD. > > We use a Promise VTrak J830sS, and now I’ll got ask our hardware team if > there would be any benefit to store the SSDs row or column wise. > Ah, a storage pod. So you have that and a real OSD head server, something like a 1U machine or Supermicro Twin? Looking at the specs of it I would assume 3 drive per expander, so having one SSD mixed with 2 HDDs should definitely be beneficial. > With the current config, when I dd to all drives in parallel I can write > at 24*74MB/s = 1776MB/s. > That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0 lanes, so as far as that bus goes, it can do 4GB/s. And given your storage pod I assume it is connected with 2 mini-SAS cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA bandwidth. How fast can your "eco 5900rpm" drive
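Coming back to the --mkjournal question earlier in this message: the sequence usually quoted for recreating or moving a journal when the OSD can still be stopped cleanly is roughly the following, untested here, so treat it as a sketch rather than gospel. osd.10 is just the example ID from above, and it only covers the clean-shutdown case, not a journal SSD dying mid-write.

  # stop the OSD cleanly, then flush whatever is still in the old journal
  service ceph stop osd.10
  ceph-osd -i 10 --flush-journal

  # repoint the OSD at the new journal device (symlink or ceph.conf, not shown),
  # create a fresh journal and start the OSD again
  ceph-osd -i 10 --mkjournal
  service ceph start osd.10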
Re: [ceph-users] Huge issues with slow requests
Also putting this on the list. On 06 Sep 2014, at 13:36, Josef Johansson wrote: > Hi, > > Same issues again, but I think we found the drive that causes the problems. > > But this is causing problems as it’s trying to do a recover to that osd at > the moment. > > So we’re left with the status message > > 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841 > active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB used, > 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s; 41424/15131923 > degraded (0.274%); recovering 0 o/s, 2035KB/s > > > It’s improving, but way too slowly. If I restart the recovery (ceph osd set > no recovery /unset) it doesn’t change the osd what I can see. > > Any ideas? > > Cheers, > Josef > > On 05 Sep 2014, at 11:26, Luis Periquito wrote: > >> Only time I saw such behaviour was when I was deleting a big chunk of data >> from the cluster: all the client activity was reduced, the op/s were almost >> non-existent and there was unjustified delays all over the cluster. But all >> the disks were somewhat busy in atop/iotstat. >> >> >> On 5 September 2014 09:51, David wrote: >> Hi, >> >> Indeed strange. >> >> That output was when we had issues, seems that most operations were blocked >> / slow requests. >> >> A ”baseline” output is more like today: >> >> 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9273KB/s >> rd, 24650KB/s wr, 2755op/s >> 2014-09-05 10:44:30.125637 mon.0 [INF] pgmap v12582760: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9500KB/s >> rd, 20430KB/s wr, 2294op/s >> 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9216KB/s >> rd, 20062KB/s wr, 2488op/s >> 2014-09-05 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 12511KB/s >> rd, 15739KB/s wr, 2488op/s >> 2014-09-05 10:44:33.161210 mon.0 [INF] pgmap v12582763: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 18593KB/s >> rd, 14880KB/s wr, 2609op/s >> 2014-09-05 10:44:34.187294 mon.0 [INF] pgmap v12582764: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 17720KB/s >> rd, 22964KB/s wr, 3257op/s >> 2014-09-05 10:44:35.190785 mon.0 [INF] pgmap v12582765: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 19230KB/s >> rd, 18901KB/s wr, 3199op/s >> 2014-09-05 10:44:36.213535 mon.0 [INF] pgmap v12582766: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 17630KB/s >> rd, 18855KB/s wr, 3131op/s >> 2014-09-05 10:44:37.220052 mon.0 [INF] pgmap v12582767: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 12262KB/s >> rd, 18627KB/s wr, 2595op/s >> 2014-09-05 10:44:38.233357 mon.0 [INF] pgmap v12582768: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 17697KB/s >> rd, 17572KB/s wr, 2156op/s >> 2014-09-05 10:44:39.239409 mon.0 [INF] pgmap v12582769: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 20300KB/s >> rd, 19735KB/s wr, 2197op/s >> 2014-09-05 10:44:40.260423 mon.0 [INF] pgmap v12582770: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 14656KB/s >> rd, 15460KB/s wr, 2199op/s >> 2014-09-05 
10:44:41.269736 mon.0 [INF] pgmap v12582771: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 8969KB/s >> rd, 11918KB/s wr, 1951op/s >> 2014-09-05 10:44:42.276192 mon.0 [INF] pgmap v12582772: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 7272KB/s >> rd, 10644KB/s wr, 1832op/s >> 2014-09-05 10:44:43.291817 mon.0 [INF] pgmap v12582773: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9316KB/s >> rd, 16610KB/s wr, 2412op/s >> 2014-09-05 10:44:44.295469 mon.0 [INF] pgmap v12582774: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9257KB/s >> rd, 19953KB/s wr, 2633op/s >> 2014-09-05 10:44:45.315774 mon.0 [INF] pgmap v12582775: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9718KB/s >> rd, 14298KB/s wr, 2101op/s >> 2014-09-05 10:44:46.326783 mon.0 [INF] pgmap v12582776: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 20877KB/s >> rd, 12822KB/s wr, 2447op/s >> 2014-09-05 10:44:47.327537 mon.0 [INF] pgmap v12582777: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 18447KB/s >> rd, 12945KB/s wr, 2226op/s >> 2014-09-05 10:44:48.348725 mon.0 [INF] pgmap v12582778: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB u
Re: [ceph-users] Huge issues with slow requests
Hello, On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote: > Also putting this on the list. > > On 06 Sep 2014, at 13:36, Josef Johansson wrote: > > > Hi, > > > > Same issues again, but I think we found the drive that causes the > > problems. > > > > But this is causing problems as it’s trying to do a recover to that > > osd at the moment. > > > > So we’re left with the status message > > > > 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841 > > active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB > > used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s; > > 41424/15131923 degraded (0.274%); recovering 0 o/s, 2035KB/s > > > > > > It’s improving, but way too slowly. If I restart the recovery (ceph > > osd set no recovery /unset) it doesn’t change the osd what I can see. > > > > Any ideas? > > I don't know the state of your cluster, i.e. what caused the recovery to start (how many OSDs went down?). If you have a replication of 3 and only one OSD was involved, what is stopping you from taking that wonky drive/OSD out? If you don't know that or want to play it safe, how about setting the weight of that OSD to 0? While that will AFAICT still result in all primary PGs to be evacuated off it, no more writes will happen to it and reads might be faster. In either case, it shouldn't slow down the rest of your cluster anymore. Regards, Christian > > Cheers, > > Josef > > > > On 05 Sep 2014, at 11:26, Luis Periquito > > wrote: > > > >> Only time I saw such behaviour was when I was deleting a big chunk of > >> data from the cluster: all the client activity was reduced, the op/s > >> were almost non-existent and there was unjustified delays all over > >> the cluster. But all the disks were somewhat busy in atop/iotstat. > >> > >> > >> On 5 September 2014 09:51, David wrote: > >> Hi, > >> > >> Indeed strange. > >> > >> That output was when we had issues, seems that most operations were > >> blocked / slow requests. 
> >> > >> A ”baseline” output is more like today: > >> > >> 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs: > >> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB > >> avail; 9273KB/s rd, 24650KB/s wr, 2755op/s 2014-09-05 10:44:30.125637 > >> mon.0 [INF] pgmap v12582760: 6860 pgs: 6860 active+clean; 12253 GB > >> data, 36574 GB used, 142 TB / 178 TB avail; 9500KB/s rd, 20430KB/s > >> wr, 2294op/s 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761: > >> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / > >> 178 TB avail; 9216KB/s rd, 20062KB/s wr, 2488op/s 2014-09-05 > >> 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860 pgs: 6860 > >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; > >> 12511KB/s rd, 15739KB/s wr, 2488op/s 2014-09-05 10:44:33.161210 mon.0 > >> [INF] pgmap v12582763: 6860 pgs: 6860 active+clean; 12253 GB data, > >> 36574 GB used, 142 TB / 178 TB avail; 18593KB/s rd, 14880KB/s wr, > >> 2609op/s 2014-09-05 10:44:34.187294 mon.0 [INF] pgmap v12582764: 6860 > >> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB > >> avail; 17720KB/s rd, 22964KB/s wr, 3257op/s 2014-09-05 > >> 10:44:35.190785 mon.0 [INF] pgmap v12582765: 6860 pgs: 6860 > >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; > >> 19230KB/s rd, 18901KB/s wr, 3199op/s 2014-09-05 10:44:36.213535 mon.0 > >> [INF] pgmap v12582766: 6860 pgs: 6860 active+clean; 12253 GB data, > >> 36574 GB used, 142 TB / 178 TB avail; 17630KB/s rd, 18855KB/s wr, > >> 3131op/s 2014-09-05 10:44:37.220052 mon.0 [INF] pgmap v12582767: 6860 > >> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB > >> avail; 12262KB/s rd, 18627KB/s wr, 2595op/s 2014-09-05 > >> 10:44:38.233357 mon.0 [INF] pgmap v12582768: 6860 pgs: 6860 > >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; > >> 17697KB/s rd, 17572KB/s wr, 2156op/s 2014-09-05 10:44:39.239409 mon.0 > >> [INF] pgmap v12582769: 6860 pgs: 6860 active+clean; 12253 GB data, > >> 36574 GB used, 142 TB / 178 TB avail; 20300KB/s rd, 19735KB/s wr, > >> 2197op/s 2014-09-05 10:44:40.260423 mon.0 [INF] pgmap v12582770: 6860 > >> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB > >> avail; 14656KB/s rd, 15460KB/s wr, 2199op/s 2014-09-05 > >> 10:44:41.269736 mon.0 [INF] pgmap v12582771: 6860 pgs: 6860 > >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; > >> 8969KB/s rd, 11918KB/s wr, 1951op/s 2014-09-05 10:44:42.276192 mon.0 > >> [INF] pgmap v12582772: 6860 pgs: 6860 active+clean; 12253 GB data, > >> 36574 GB used, 142 TB / 178 TB avail; 7272KB/s rd, 10644KB/s wr, > >> 1832op/s 2014-09-05 10:44:43.291817 mon.0 [INF] pgmap v12582773: 6860 > >> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB > >> avail; 9316KB/s rd, 16610KB/s wr, 2412op/s 2014-09-05 10:44:44.295469 > >> mon.0 [INF] pgmap v12582774: 6860 pgs: 6860 active+clean; 12253 GB > >> data,
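To put the "weight it to 0" suggestion into commands, a minimal sketch; osd.42 is a placeholder ID, and note that either variant will still trigger data movement off that OSD.

  # override weight to 0: the OSD stays up/in but its PGs get remapped away
  ceph osd reweight 42 0

  # or zero its CRUSH weight instead (has to be set back by hand later)
  ceph osd crush reweight osd.42 0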
Re: [ceph-users] Huge issues with slow requests
Hi, On 06 Sep 2014, at 13:53, Christian Balzer wrote: > > Hello, > > On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote: > >> Also putting this on the list. >> >> On 06 Sep 2014, at 13:36, Josef Johansson wrote: >> >>> Hi, >>> >>> Same issues again, but I think we found the drive that causes the >>> problems. >>> >>> But this is causing problems as it’s trying to do a recover to that >>> osd at the moment. >>> >>> So we’re left with the status message >>> >>> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841 >>> active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB >>> used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s; >>> 41424/15131923 degraded (0.274%); recovering 0 o/s, 2035KB/s >>> >>> >>> It’s improving, but way too slowly. If I restart the recovery (ceph >>> osd set no recovery /unset) it doesn’t change the osd what I can see. >>> >>> Any ideas? >>> > I don't know the state of your cluster, i.e. what caused the recovery to > start (how many OSDs went down?). Performance degradation, databases are the worst impacted. It’s actually a OSD that we put in that’s causing it (removed it again though). So the cluster in itself is healthy. > If you have a replication of 3 and only one OSD was involved, what is > stopping you from taking that wonky drive/OSD out? > There’s data that goes missing if I do that, I guess I have to wait for the recovery process to complete before I can go any further, this is with rep 3. > If you don't know that or want to play it safe, how about setting the > weight of that OSD to 0? > While that will AFAICT still result in all primary PGs to be evacuated > off it, no more writes will happen to it and reads might be faster. > In either case, it shouldn't slow down the rest of your cluster anymore. > That’s actually one idea I haven’t thought off, I wan’t to play it safe right now and hope that it goes up again, I actually found one wonky way of getting the recovery process from not stalling to a grind, and that was restarting OSDs. One at the time. Regards, Josef > Regards, > > Christian >>> Cheers, >>> Josef >>> >>> On 05 Sep 2014, at 11:26, Luis Periquito >>> wrote: >>> Only time I saw such behaviour was when I was deleting a big chunk of data from the cluster: all the client activity was reduced, the op/s were almost non-existent and there was unjustified delays all over the cluster. But all the disks were somewhat busy in atop/iotstat. On 5 September 2014 09:51, David wrote: Hi, Indeed strange. That output was when we had issues, seems that most operations were blocked / slow requests. 
A ”baseline” output is more like today: 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9273KB/s rd, 24650KB/s wr, 2755op/s 2014-09-05 10:44:30.125637 mon.0 [INF] pgmap v12582760: 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9500KB/s rd, 20430KB/s wr, 2294op/s 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761: 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9216KB/s rd, 20062KB/s wr, 2488op/s 2014-09-05 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 12511KB/s rd, 15739KB/s wr, 2488op/s 2014-09-05 10:44:33.161210 mon.0 [INF] pgmap v12582763: 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 18593KB/s rd, 14880KB/s wr, 2609op/s 2014-09-05 10:44:34.187294 mon.0 [INF] pgmap v12582764: 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 17720KB/s rd, 22964KB/s wr, 3257op/s 2014-09-05 10:44:35.190785 mon.0 [INF] pgmap v12582765: 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 19230KB/s rd, 18901KB/s wr, 3199op/s 2014-09-05 10:44:36.213535 mon.0 [INF] pgmap v12582766: 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 17630KB/s rd, 18855KB/s wr, 3131op/s 2014-09-05 10:44:37.220052 mon.0 [INF] pgmap v12582767: 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 12262KB/s rd, 18627KB/s wr, 2595op/s 2014-09-05 10:44:38.233357 mon.0 [INF] pgmap v12582768: 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 17697KB/s rd, 17572KB/s wr, 2156op/s 2014-09-05 10:44:39.239409 mon.0 [INF] pgmap v12582769: 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 20300KB/s rd, 19735KB/s wr, 2197op/s 2014-09-05 10:44:40.260423 mon.0 [INF] pgmap v12582770: 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 146
Re: [ceph-users] Huge issues with slow requests
Actually, it only worked with restarting for a period of time to get the recovering process going. Can’t get passed the 21k object mark. I’m uncertain if the disk really is messing this up right now as well. So I’m not glad to start moving 300k objects around. Regards, Josef On 06 Sep 2014, at 14:33, Josef Johansson wrote: > Hi, > > On 06 Sep 2014, at 13:53, Christian Balzer wrote: > >> >> Hello, >> >> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote: >> >>> Also putting this on the list. >>> >>> On 06 Sep 2014, at 13:36, Josef Johansson wrote: >>> Hi, Same issues again, but I think we found the drive that causes the problems. But this is causing problems as it’s trying to do a recover to that osd at the moment. So we’re left with the status message 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841 active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s; 41424/15131923 degraded (0.274%); recovering 0 o/s, 2035KB/s It’s improving, but way too slowly. If I restart the recovery (ceph osd set no recovery /unset) it doesn’t change the osd what I can see. Any ideas? >> I don't know the state of your cluster, i.e. what caused the recovery to >> start (how many OSDs went down?). > Performance degradation, databases are the worst impacted. It’s actually a > OSD that we put in that’s causing it (removed it again though). So the > cluster in itself is healthy. > >> If you have a replication of 3 and only one OSD was involved, what is >> stopping you from taking that wonky drive/OSD out? >> > There’s data that goes missing if I do that, I guess I have to wait for the > recovery process to complete before I can go any further, this is with rep 3. >> If you don't know that or want to play it safe, how about setting the >> weight of that OSD to 0? >> While that will AFAICT still result in all primary PGs to be evacuated >> off it, no more writes will happen to it and reads might be faster. >> In either case, it shouldn't slow down the rest of your cluster anymore. >> > That’s actually one idea I haven’t thought off, I wan’t to play it safe right > now and hope that it goes up again, I actually found one wonky way of getting > the recovery process from not stalling to a grind, and that was restarting > OSDs. One at the time. > > Regards, > Josef >> Regards, >> >> Christian Cheers, Josef On 05 Sep 2014, at 11:26, Luis Periquito wrote: > Only time I saw such behaviour was when I was deleting a big chunk of > data from the cluster: all the client activity was reduced, the op/s > were almost non-existent and there was unjustified delays all over > the cluster. But all the disks were somewhat busy in atop/iotstat. > > > On 5 September 2014 09:51, David wrote: > Hi, > > Indeed strange. > > That output was when we had issues, seems that most operations were > blocked / slow requests. 
> > A ”baseline” output is more like today: > > 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs: > 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB > avail; 9273KB/s rd, 24650KB/s wr, 2755op/s 2014-09-05 10:44:30.125637 > mon.0 [INF] pgmap v12582760: 6860 pgs: 6860 active+clean; 12253 GB > data, 36574 GB used, 142 TB / 178 TB avail; 9500KB/s rd, 20430KB/s > wr, 2294op/s 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761: > 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / > 178 TB avail; 9216KB/s rd, 20062KB/s wr, 2488op/s 2014-09-05 > 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860 pgs: 6860 > active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; > 12511KB/s rd, 15739KB/s wr, 2488op/s 2014-09-05 10:44:33.161210 mon.0 > [INF] pgmap v12582763: 6860 pgs: 6860 active+clean; 12253 GB data, > 36574 GB used, 142 TB / 178 TB avail; 18593KB/s rd, 14880KB/s wr, > 2609op/s 2014-09-05 10:44:34.187294 mon.0 [INF] pgmap v12582764: 6860 > pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB > avail; 17720KB/s rd, 22964KB/s wr, 3257op/s 2014-09-05 > 10:44:35.190785 mon.0 [INF] pgmap v12582765: 6860 pgs: 6860 > active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; > 19230KB/s rd, 18901KB/s wr, 3199op/s 2014-09-05 10:44:36.213535 mon.0 > [INF] pgmap v12582766: 6860 pgs: 6860 active+clean; 12253 GB data, > 36574 GB used, 142 TB / 178 TB avail; 17630KB/s rd, 18855KB/s wr, > 3131op/s 2014-09-05 10:44:37.220052 mon.0 [INF] pgmap v12582767: 6860 > pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB > avail; 12262KB/s rd, 18627KB/s wr, 2595op/s 2014-09-05 > 10:44:38.233357 mon.0 [INF] pgmap v12582768: 6860 pgs: 68
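For reference, the actual flag names behind the "ceph osd set no recovery" shorthand quoted above, in case someone wants to pause things while investigating; nothing here is specific to this cluster.

  # pause recovery and backfill while poking at the suspect OSD
  ceph osd set norecover
  ceph osd set nobackfill

  # and let it resume afterwards
  ceph osd unset nobackfill
  ceph osd unset norecover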
Re: [ceph-users] SSD journal deployment experiences
Hi Christian, Let's keep debating until a dev corrects us ;) September 6 2014 1:27 PM, "Christian Balzer" wrote: > On Fri, 5 Sep 2014 09:42:02 + Dan Van Der Ster wrote: > >>> On 05 Sep 2014, at 11:04, Christian Balzer wrote: >>> >>> On Fri, 5 Sep 2014 07:46:12 + Dan Van Der Ster wrote: > On 05 Sep 2014, at 03:09, Christian Balzer wrote: > > On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote: > >> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster >> wrote: >> > > [snip] > >>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful >>> is the backfilling which results from an SSD failure? Have you >>> considered tricks like increasing the down out interval so >>> backfilling doesn’t happen in this case (leaving time for the SSD >>> to be replaced)? >>> >> >> Replacing a failed SSD won't help your backfill. I haven't actually >> tested it, but I'm pretty sure that losing the journal effectively >> corrupts your OSDs. I don't know what steps are required to >> complete this operation, but it wouldn't surprise me if you need to >> re-format the OSD. >> > This. > All the threads I've read about this indicate that journal loss > during operation means OSD loss. Total OSD loss, no recovery. > From what I gathered the developers are aware of this and it might be > addressed in the future. > I suppose I need to try it then. I don’t understand why you can't just use ceph-osd -i 10 --mkjournal to rebuild osd 10’s journal, for example. >>> I think the logic is if you shut down an OSD cleanly beforehand you can >>> just do that. >>> However from what I gathered there is no logic to re-issue transactions >>> that made it to the journal but not the filestore. >>> So a journal SSD failing mid-operation with a busy OSD would certainly >>> be in that state. >>> >> >> I had thought that the journal write and the buffered filestore write >> happen at the same time. > > Nope, definitely not. > > That's why we have tunables like the ones at: > http://ceph.com/docs/master/rados/configuration/filestore-config-ref/#synchronization-intervals > > And people (me included) tend to crank that up (to eleven ^o^). > > The write-out to the filestore may start roughly at the same time as the > journal gets things, but it can and will fall behind. > filestore max sync interval is the period between the fsync/fdatasync's of the outstanding filestore writes, which were sent earlier. By the time the sync interval arrives, the OS may have already flushed those writes (sysctl's like vm.dirty_ratio, dirty_expire_centisecs, ... apply here). And even if the osd crashes and never calls fsync, then the OS will flush those anyway. Of course, if a power outage prevents the fsync from ever happening, then the journal entry replay is used to re-write the op. The other thing about filestore max sync interval is that journal entries are only free'd after the osd has fsync'd the related filestore write. That's why the journal size depends on the sync interval. >> So all the previous journal writes that >> succeeded are already on their way to the filestore. My (could be >> incorrect) understanding is that the real purpose of the journal is to >> be able to replay writes after a power outage (since the buffered >> filestore writes would be lost in that case). If there is no power >> outage, then filestore writes are still good regardless of a journal >> failure. > > From Cephs perspective a write is successful once it is on all replica > size journals. 
This is the key point - which I'm not sure about and don't feel like reading the code on a Saturday ;) Is a write ack'd after a successful journal write, or after the journal _and_ the buffered filestore writes? Is that documented somewhere? > I think (hope) that what you wrote up there to be true, but that doesn't > change the fact that journal data not even on the way to the filestore yet > is the crux here. > >>> I'm sure (hope) somebody from the Ceph team will pipe up about this. >> >> Ditto! > > Guess it will be next week... > > Now 200GB DC 3700s can write close to 400MB/s so a 1:4 or even 1:5 > ratio is sensible. However these will be the ones limiting your max > sequential write speed if that is of importance to you. In nearly all > use cases you run out of IOPS (on your HDDs) long before that becomes > an issue, though. IOPS is definitely the main limit, but we also only have 1 single 10Gig-E NIC on these servers, so 4 drives that can write (even only 200MB/s) would be good enough. >>> Fair enough. ^o^ >>> Also, we’ll put the SSDs in the first four ports of an SAS2008 HBA which is shared with the other 20 spinning disks. Counting the double writes, the HBA will run out of bandwidth before these SSDs, I expect. >>> Depends on what PCIe
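For reference, the tunables behind the synchronization-intervals link above, as they would look in ceph.conf; the defaults are 0.01 and 5 seconds, and the 30 below is only an example of "cranking it up", not a recommendation from this thread. The caveat from the paragraph above applies: the journal has to be sized to cover the max interval at full write speed.

  [osd]
      # defaults: 0.01 / 5 seconds; a larger max lets the filestore batch more,
      # but journal entries are only freed after the corresponding fsync
      filestore min sync interval = 0.01
      filestore max sync interval = 30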
Re: [ceph-users] Huge issues with slow requests
FWI I did restart the OSDs until I saw a server that made impact. Until that server stopped doing impact, I didn’t get lower in the number objects being degraded. After a while it was done with recovering that OSD and happily started with others. I guess I will be seeing the same behaviour when it gets to replicating the same PGs that were causing troubles the first time. On 06 Sep 2014, at 15:04, Josef Johansson wrote: > Actually, it only worked with restarting for a period of time to get the > recovering process going. Can’t get passed the 21k object mark. > > I’m uncertain if the disk really is messing this up right now as well. So I’m > not glad to start moving 300k objects around. > > Regards, > Josef > > On 06 Sep 2014, at 14:33, Josef Johansson wrote: > >> Hi, >> >> On 06 Sep 2014, at 13:53, Christian Balzer wrote: >> >>> >>> Hello, >>> >>> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote: >>> Also putting this on the list. On 06 Sep 2014, at 13:36, Josef Johansson wrote: > Hi, > > Same issues again, but I think we found the drive that causes the > problems. > > But this is causing problems as it’s trying to do a recover to that > osd at the moment. > > So we’re left with the status message > > 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841 > active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB > used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s; > 41424/15131923 degraded (0.274%); recovering 0 o/s, 2035KB/s > > > It’s improving, but way too slowly. If I restart the recovery (ceph > osd set no recovery /unset) it doesn’t change the osd what I can see. > > Any ideas? > >>> I don't know the state of your cluster, i.e. what caused the recovery to >>> start (how many OSDs went down?). >> Performance degradation, databases are the worst impacted. It’s actually a >> OSD that we put in that’s causing it (removed it again though). So the >> cluster in itself is healthy. >> >>> If you have a replication of 3 and only one OSD was involved, what is >>> stopping you from taking that wonky drive/OSD out? >>> >> There’s data that goes missing if I do that, I guess I have to wait for the >> recovery process to complete before I can go any further, this is with rep 3. >>> If you don't know that or want to play it safe, how about setting the >>> weight of that OSD to 0? >>> While that will AFAICT still result in all primary PGs to be evacuated >>> off it, no more writes will happen to it and reads might be faster. >>> In either case, it shouldn't slow down the rest of your cluster anymore. >>> >> That’s actually one idea I haven’t thought off, I wan’t to play it safe >> right now and hope that it goes up again, I actually found one wonky way of >> getting the recovery process from not stalling to a grind, and that was >> restarting OSDs. One at the time. >> >> Regards, >> Josef >>> Regards, >>> >>> Christian > Cheers, > Josef > > On 05 Sep 2014, at 11:26, Luis Periquito > wrote: > >> Only time I saw such behaviour was when I was deleting a big chunk of >> data from the cluster: all the client activity was reduced, the op/s >> were almost non-existent and there was unjustified delays all over >> the cluster. But all the disks were somewhat busy in atop/iotstat. >> >> >> On 5 September 2014 09:51, David wrote: >> Hi, >> >> Indeed strange. >> >> That output was when we had issues, seems that most operations were >> blocked / slow requests. 
>> >> A ”baseline” output is more like today: >> >> 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs: >> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB >> avail; 9273KB/s rd, 24650KB/s wr, 2755op/s 2014-09-05 10:44:30.125637 >> mon.0 [INF] pgmap v12582760: 6860 pgs: 6860 active+clean; 12253 GB >> data, 36574 GB used, 142 TB / 178 TB avail; 9500KB/s rd, 20430KB/s >> wr, 2294op/s 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761: >> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / >> 178 TB avail; 9216KB/s rd, 20062KB/s wr, 2488op/s 2014-09-05 >> 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860 pgs: 6860 >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; >> 12511KB/s rd, 15739KB/s wr, 2488op/s 2014-09-05 10:44:33.161210 mon.0 >> [INF] pgmap v12582763: 6860 pgs: 6860 active+clean; 12253 GB data, >> 36574 GB used, 142 TB / 178 TB avail; 18593KB/s rd, 14880KB/s wr, >> 2609op/s 2014-09-05 10:44:34.187294 mon.0 [INF] pgmap v12582764: 6860 >> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB >> avail; 17720KB/s rd, 22964KB/s wr, 3257op/s 2014-09-05 >> 10:44:35.190785 mon.0 [INF] pgmap v12582765: 6860 pgs: 6860 >> active+cle
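To narrow down which OSDs the stuck backfills actually involve (rather than restarting daemons until one "makes impact"), something like the following helps; the pgid is a placeholder.

  # list the PGs that are backfilling / stuck and where they live
  ceph health detail | grep -i backfill
  ceph pg dump_stuck unclean

  # then query one of the listed PGs for its up/acting OSD sets
  ceph pg 3.1a query | less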
Re: [ceph-users] SSD journal deployment experiences
On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote: > Hi Christian, > > Let's keep debating until a dev corrects us ;) > For the time being, I give the recent: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html And not so recent: http://www.spinics.net/lists/ceph-users/msg04152.html http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021 And I'm not going to use BTRFS for mainly RBD backed VM images (fragmentation city), never mind the other stability issues that crop up here ever so often. > September 6 2014 1:27 PM, "Christian Balzer" wrote: > > On Fri, 5 Sep 2014 09:42:02 + Dan Van Der Ster wrote: > > > >>> On 05 Sep 2014, at 11:04, Christian Balzer wrote: > >>> > >>> On Fri, 5 Sep 2014 07:46:12 + Dan Van Der Ster wrote: > > > On 05 Sep 2014, at 03:09, Christian Balzer wrote: > > > > On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote: > > > >> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster > >> wrote: > >> > > > > [snip] > > > >>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how > >>> painful is the backfilling which results from an SSD failure? > >>> Have you considered tricks like increasing the down out interval > >>> so backfilling doesn’t happen in this case (leaving time for the > >>> SSD to be replaced)? > >>> > >> > >> Replacing a failed SSD won't help your backfill. I haven't > >> actually tested it, but I'm pretty sure that losing the journal > >> effectively corrupts your OSDs. I don't know what steps are > >> required to complete this operation, but it wouldn't surprise me > >> if you need to re-format the OSD. > >> > > This. > > All the threads I've read about this indicate that journal loss > > during operation means OSD loss. Total OSD loss, no recovery. > > From what I gathered the developers are aware of this and it might > > be addressed in the future. > > > > I suppose I need to try it then. I don’t understand why you can't > just use ceph-osd -i 10 --mkjournal to rebuild osd 10’s journal, for > example. > > >>> I think the logic is if you shut down an OSD cleanly beforehand you > >>> can just do that. > >>> However from what I gathered there is no logic to re-issue > >>> transactions that made it to the journal but not the filestore. > >>> So a journal SSD failing mid-operation with a busy OSD would > >>> certainly be in that state. > >>> > >> > >> I had thought that the journal write and the buffered filestore write > >> happen at the same time. > > > > Nope, definitely not. > > > > That's why we have tunables like the ones at: > > http://ceph.com/docs/master/rados/configuration/filestore-config-ref/#synchronization-intervals > > > > And people (me included) tend to crank that up (to eleven ^o^). > > > > The write-out to the filestore may start roughly at the same time as > > the journal gets things, but it can and will fall behind. > > > > filestore max sync interval is the period between the fsync/fdatasync's > of the outstanding filestore writes, which were sent earlier. By the > time the sync interval arrives, the OS may have already flushed those > writes (sysctl's like vm.dirty_ratio, dirty_expire_centisecs, ... apply > here). And even if the osd crashes and never calls fsync, then the OS > will flush those anyway. Of course, if a power outage prevents the fsync > from ever happening, then the journal entry replay is used to re-write > the op. The other thing about filestore max sync interval is that > journal entries are only free'd after the osd has fsync'd the related > filestore write. 
That's why the journal size depends on the sync > interval. > > > >> So all the previous journal writes that > >> succeeded are already on their way to the filestore. My (could be > >> incorrect) understanding is that the real purpose of the journal is to > >> be able to replay writes after a power outage (since the buffered > >> filestore writes would be lost in that case). If there is no power > >> outage, then filestore writes are still good regardless of a journal > >> failure. > > > > From Cephs perspective a write is successful once it is on all replica > > size journals. > > This is the key point - which I'm not sure about and don't feel like > reading the code on a Saturday ;) Is a write ack'd after a successful > journal write, or after the journal _and_ the buffered filestore writes? > Is that documented somewhere? > http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/ Search for "acknowledgement" if you don't want to read the full thing. ^o^ > > > I think (hope) that what you wrote up there to be true, but that > > doesn't change the fact that journal data not even on the way to the > > filestore yet is the crux here. > > > >>> I'm sure (hope) somebody from the Ceph team will pipe up about this. > >> > >> Ditto! > > > >
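Short of reading the code, one way to watch how far the filestore trails the journal on a live cluster is the built-in latency counters; osd.10 and the admin socket path below are just the usual defaults, adjust to taste.

  # per-OSD journal commit vs filestore apply latency, cluster-wide
  ceph osd perf

  # full counter dump from a single OSD's admin socket
  ceph --admin-daemon /var/run/ceph/ceph-osd.10.asok perf dump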
Re: [ceph-users] ceph osd unexpected error
Hi, Could you give some more detail infos such as operation before occur errors? And what's your ceph version? On Fri, Sep 5, 2014 at 3:16 PM, 廖建锋 wrote: > Dear CEPH , > Urgent question, I met a "FAILED assert(0 == "unexpected error")" > yesterday , Now i have not way to start this OSDS > I have attached my logs in the attachment, and some ceph configurations > as below > > > osd_pool_default_pgp_num = 300 > osd_pool_default_size = 2 > osd_pool_default_min_size = 1 > osd_pool_default_pg_num = 300 > mon_host = 10.1.0.213,10.1.0.214 > osd_crush_chooseleaf_type = 1 > mds_cache_size = 50 > osd objectstore = keyvaluestore-dev > > > > Detailed error information : > > > -13> 2014-09-05 15:07:35.279863 7f4d988b9700 2 waiting 51 > 50 ops || > 11642907 > 104857600 > -12> 2014-09-05 15:07:35.279899 7f4d978b7700 2 waiting 51 > 50 ops || > 11642899 > 104857600 > -11> 2014-09-05 15:07:35.279919 7f4d990ba700 2 waiting 51 > 50 ops || > 11642901 > 104857600 > -10> 2014-09-05 15:07:35.326803 7f4d9a8bd700 10 monclient: tick > -9> 2014-09-05 15:07:35.326837 7f4d9a8bd700 10 monclient: > _check_auth_rotating have uptodate secrets (they expire after 2014-09-05 > 15:07:05.326835) > -8> 2014-09-05 15:07:35.326871 7f4d9a8bd700 10 monclient: renew subs? > (now: 2014-09-05 15:07:35.326871; renew after: 2014-09-05 15:10:02.464341) > -- no > -7> 2014-09-05 15:07:35.343657 7f4d978b7700 2 waiting 51 > 50 ops || > 11044551 > 104857600 > -6> 2014-09-05 15:07:35.343654 7f4e1ee72700 1 -- 10.1.0.221:6801/4013 --> > osd.12 10.1.0.219:6810/32654 -- pg_info(1 pgs e1267:0.f1) v4 -- ?+0 > 0x18dcf000 > -5> 2014-09-05 15:07:35.343680 7f4d990ba700 2 waiting 51 > 50 ops || > 11044553 > 104857600 > -4> 2014-09-05 15:07:35.343686 7f4d988b9700 2 waiting 51 > 50 ops || > 11044579 > 104857600 > -3> 2014-09-05 15:07:35.344875 7f4e1fe74700 0 error (22) Invalid argument > not handled on operation 9 (336.0.3, or op 3, counting from 0) > -2> 2014-09-05 15:07:35.344902 7f4e1fe74700 0 unexpected error code > -1> 2014-09-05 15:07:35.344903 7f4e1fe74700 0 transaction dump: > { "ops": [ > { "op_num": 0, > "op_name": "remove", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0"}, > { "op_num": 1, > "op_name": "mkcoll", > "collection": "0.a9_TEMP"}, > { "op_num": 2, > "op_name": "remove", > "collection": "0.a9_TEMP", > "oid": "4b0fea9\/153b885.\/head\/\/0"}, > { "op_num": 3, > "op_name": "touch", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0"}, > { "op_num": 4, > "op_name": "omap_setheader", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0", > "header_length": "0"}, > { "op_num": 5, > "op_name": "write", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0", > "length": 1160, > "offset": 0, > "bufferlist length": 1160}, > { "op_num": 6, > "op_name": "omap_setkeys", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0", > "attr_lens": {}}, > { "op_num": 7, > "op_name": "setattrs", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0", > "attr_lens": { "_": 239, > "_parent": 250, > "snapset": 31}}, > { "op_num": 8, > "op_name": "omap_setkeys", > "collection": "meta", > "oid": "16ef7597\/infos\/head\/\/-1", > "attr_lens": { "0.a9_epoch": 4, > "0.a9_info": 684}}, > { "op_num": 9, > "op_name": "remove", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.\/head\/\/0"}, > { "op_num": 10, > "op_name": "remove", > "collection": "0.a9_TEMP", > "oid": "4c56f2a9\/1c04096.\/head\/\/0"}, > { "op_num": 11, > "op_name": "touch", > "collection": "0.a9_head", > 
"oid": "4c56f2a9\/1c04096.\/head\/\/0"}, > { "op_num": 12, > "op_name": "omap_setheader", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.\/head\/\/0", > "header_length": "0"}, > { "op_num": 13, > "op_name": "write", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.\/head\/\/0", > "length": 507284, > "offset": 0, > "bufferlist length": 507284}, > { "op_num": 14, > "op_name": "omap_setkeys", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.\/head\/\/0", > "attr_lens": {}}, > { "op_num": 15, > "op_name": "setattrs", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.\/head\/\/0", > "attr_lens": { "_": 239, > "snapset": 31}}, > { "op_num": 16, > "op_name": "omap_setkeys", > "collection": "meta", > "oid": "16ef7597\/infos\/head\/\/-1", > "attr_lens": { "0.a9_epoch": 4, > "0.a9_info": 684}}, > { "op_num": 17, > "op_name": "remove", > "collection": "0.a9_head", > "oid": "794064a9\/1c040e0.\/head\/\/0"}, > { "op_num": 18, > "op_name": "remove", > "collection": "0.a9_TEMP", > "oid": "794064a9\/1c040e0.\/head\/\/0"}, > { "op_num": 19, > "op_name": "touch", > "collection": "0.a9_head", > "oid": "794064a9\/1c040e0.\/head\/\/0"}, > { "op_num": 20, > "op_name": "omap_seth
Re: [ceph-users] ceph cluster inconsistency keyvaluestore
Sorry for the late message, I'm back from a short vacation. I would like to try it this weekends. Thanks for your patient :-) On Wed, Sep 3, 2014 at 9:16 PM, Kenneth Waegeman wrote: > I also can reproduce it on a new slightly different set up (also EC on KV > and Cache) by running ceph pg scrub on a KV pg: this pg will then get the > 'inconsistent' status > > > > - Message from Kenneth Waegeman - >Date: Mon, 01 Sep 2014 16:28:31 +0200 >From: Kenneth Waegeman > Subject: Re: ceph cluster inconsistency keyvaluestore > To: Haomai Wang > Cc: ceph-users@lists.ceph.com > > > >> Hi, >> >> >> The cluster got installed with quattor, which uses ceph-deploy for >> installation of daemons, writes the config file and installs the crushmap. >> I have 3 hosts, each 12 disks, having a large KV partition (3.6T) for the >> ECdata pool and a small cache partition (50G) for the cache >> >> I manually did this: >> >> ceph osd pool create cache 1024 1024 >> ceph osd pool set cache size 2 >> ceph osd pool set cache min_size 1 >> ceph osd erasure-code-profile set profile11 k=8 m=3 >> ruleset-failure-domain=osd >> ceph osd pool create ecdata 128 128 erasure profile11 >> ceph osd tier add ecdata cache >> ceph osd tier cache-mode cache writeback >> ceph osd tier set-overlay ecdata cache >> ceph osd pool set cache hit_set_type bloom >> ceph osd pool set cache hit_set_count 1 >> ceph osd pool set cache hit_set_period 3600 >> ceph osd pool set cache target_max_bytes $((280*1024*1024*1024)) >> >> (But the previous time I had the problem already without the cache part) >> >> >> >> Cluster live since 2014-08-29 15:34:16 >> >> Config file on host ceph001: >> >> [global] >> auth_client_required = cephx >> auth_cluster_required = cephx >> auth_service_required = cephx >> cluster_network = 10.143.8.0/24 >> filestore_xattr_use_omap = 1 >> fsid = 82766e04-585b-49a6-a0ac-c13d9ffd0a7d >> mon_cluster_log_to_syslog = 1 >> mon_host = ceph001.cubone.os, ceph002.cubone.os, ceph003.cubone.os >> mon_initial_members = ceph001, ceph002, ceph003 >> osd_crush_update_on_start = 0 >> osd_journal_size = 10240 >> osd_pool_default_min_size = 2 >> osd_pool_default_pg_num = 512 >> osd_pool_default_pgp_num = 512 >> osd_pool_default_size = 3 >> public_network = 10.141.8.0/24 >> >> [osd.11] >> osd_objectstore = keyvaluestore-dev >> >> [osd.13] >> osd_objectstore = keyvaluestore-dev >> >> [osd.15] >> osd_objectstore = keyvaluestore-dev >> >> [osd.17] >> osd_objectstore = keyvaluestore-dev >> >> [osd.19] >> osd_objectstore = keyvaluestore-dev >> >> [osd.21] >> osd_objectstore = keyvaluestore-dev >> >> [osd.23] >> osd_objectstore = keyvaluestore-dev >> >> [osd.25] >> osd_objectstore = keyvaluestore-dev >> >> [osd.3] >> osd_objectstore = keyvaluestore-dev >> >> [osd.5] >> osd_objectstore = keyvaluestore-dev >> >> [osd.7] >> osd_objectstore = keyvaluestore-dev >> >> [osd.9] >> osd_objectstore = keyvaluestore-dev >> >> >> OSDs: >> # idweight type name up/down reweight >> -12 140.6 root default-cache >> -9 46.87 host ceph001-cache >> 2 3.906 osd.2 up 1 >> 4 3.906 osd.4 up 1 >> 6 3.906 osd.6 up 1 >> 8 3.906 osd.8 up 1 >> 10 3.906 osd.10 up 1 >> 12 3.906 osd.12 up 1 >> 14 3.906 osd.14 up 1 >> 16 3.906 osd.16 up 1 >> 18 3.906 osd.18 up 1 >> 20 3.906 osd.20 up 1 >> 22 3.906 osd.22 up 1 >> 24 3.906 osd.24 up 1 >> -10 46.87 host ceph002-cache >> 28 3.906 osd.28 up 1 >> 30 3.906 osd.30 up 1 >> 32 3.906 osd.32 up 1 >> 34 3.906 osd.34 up 1 >> 36 3.906 osd.36 up 1 >> 38 3.906 osd.38 up 1 >> 40 3.906 osd.40 up 1 >> 42 3.906 osd.42 up 1 >> 44 3.906 osd.44 up 1 >> 46 
3.906 osd.46 up 1 >> 48 3.906 osd.48 up 1 >> 50 3.906 osd.50 up 1 >> -11 46.87 host ceph003-cache >> 54 3.906 osd.54 up 1 >> 56 3.906 osd.56 up 1 >> 58 3.906 osd.58 up 1 >> 60 3.906 osd.60 up 1 >> 62 3.906 osd.62 up 1 >> 64 3.906 osd.64 up 1 >> 66 3.906 osd.66 up 1 >> 68 3.906 osd.68 up 1 >> 70 3.906 osd.70 up 1 >> 72 3.906 osd.72 up 1 >> 74 3.906
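For anyone following along, the commands involved in reproducing and inspecting this are roughly the following; the pgid is a placeholder, and repairing on top of a suspected keyvaluestore bug is risky since repair simply trusts the authoritative copy.

  ceph pg scrub 2.3f
  ceph pg deep-scrub 2.3f
  ceph health detail | grep -i inconsistent

  # only once the cause is understood:
  ceph pg repair 2.3f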
Re: [ceph-users] SSD journal deployment experiences
September 6 2014 4:01 PM, "Christian Balzer" wrote: > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote: > >> Hi Christian, >> >> Let's keep debating until a dev corrects us ;) > > For the time being, I give the recent: > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html > > And not so recent: > http://www.spinics.net/lists/ceph-users/msg04152.html > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021 > > And I'm not going to use BTRFS for mainly RBD backed VM images > (fragmentation city), never mind the other stability issues that crop up > here ever so often. Thanks for the links... So until I learn otherwise, I better assume the OSD is lost when the journal fails. Even though I haven't understood exactly why :( I'm going to UTSL to understand the consistency better. An op state diagram would help, but I didn't find one yet. BTW, do you happen to know, _if_ we re-use an OSD after the journal has failed, are any object inconsistencies going to be found by a scrub/deep-scrub? >> >> We have 4 servers in a 3U rack, then each of those servers is connected >> to one of these enclosures with a single SAS cable. >> With the current config, when I dd to all drives in parallel I can write at 24*74MB/s = 1776MB/s. >>> >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0 >>> lanes, so as far as that bus goes, it can do 4GB/s. >>> And given your storage pod I assume it is connected with 2 mini-SAS >>> cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA >>> bandwidth. >> >> From above, we are only using 4 lanes -- so around 2GB/s is expected. > > Alright, that explains that then. Any reason for not using both ports? > Probably to minimize costs, and since the single 10Gig-E is a bottleneck anyway. The whole thing is suboptimal anyway, since this hardware was not purchased for Ceph to begin with. Hence retrofitting SSDs, etc... >>> Impressive, even given your huge cluster with 1128 OSDs. >>> However that's not really answering my question, how much data is on an >>> average OSD and thus gets backfilled in that hour? >> >> That's true -- our drives have around 300TB on them. So I guess it will >> take longer - 3x longer - when the drives are 1TB full. > > On your slides, when the crazy user filled the cluster with 250 million > objects and thus 1PB of data, I recall seeing a 7 hour backfill time? > Yeah that was fun :) It was 250 million (mostly) 4k objects, so not close to 1PB. The point was that to fill the cluster with RBD, we'd need 250 million (4MB) objects. So, object-count-wise this was a full cluster, but for the real volume it was more like 70TB IIRC (there were some other larger objects too). In that case, the backfilling was CPU-bound, or perhaps wbthrottle-bound, I don't remember... It was just that there were many tiny tiny objects to synchronize. > Anyway, I guess the lesson to take away from this is that size and > parallelism does indeed help, but even in a cluster like yours recovering > from a 2TB loss would likely be in the 10 hour range... Bigger clusters probably backfill faster simply because there are more OSDs involved in the backfilling. In our cluster we initially get 30-40 backfills in parallel after 1 OSD fails. That's even with max backfills = 1. The backfilling sorta follows an 80/20 rule -- 80% of the time is spent backfilling the last 20% of the PGs, just because some OSDs randomly get more new PGs than the others. > Again, see the "Best practice K/M-parameters EC pool" thread. 
^.^ Marked that one to read, again. Cheers, dan
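For reference, the parallel-dd style of raw throughput test described above (24 drives written at once) can be reproduced with a small shell loop. This is only a sketch: the device range and transfer size are placeholders, and writing to raw devices destroys whatever is on them, so it is strictly for drives that are not (yet) part of a cluster.

  # WARNING: destructive -- only for empty, not-yet-deployed drives
  # 24 drives (/dev/sdb .. /dev/sdy), 8 GiB of direct sequential writes each
  for dev in /dev/sd{b..y}; do
      dd if=/dev/zero of=$dev bs=4M count=2048 oflag=direct &
  done
  wait
  # add up the per-drive MB/s figures dd prints to get the aggregate

Summing the per-drive rates is what yields numbers like the 24*74MB/s quoted above.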
Re: [ceph-users] resizing the OSD
Hello, On Fri, 05 Sep 2014 15:31:01 -0700 JIten Shah wrote: > Hello Cephers, > > We created a ceph cluster with 100 OSD, 5 MON and 1 MSD and most of the > stuff seems to be working fine but we are seeing some degrading on the > osd's due to lack of space on the osd's. Please elaborate on that degradation. > Is there a way to resize the > OSD without bringing the cluster down? > Define both "resize" and "cluster down". As in, resizing how? Are your current OSDs on disks/LVMs that are not fully used and thus could be grown? What is the size of your current OSDs? The normal way of growing a cluster is to add more OSDs, preferably of the same size and same performance disks. This will not only simplify things immensely but also make them a lot more predictable. This of course depends on your use case and usage patterns, but often when running out of space you're also running out of other resources like CPU, memory or IOPS of the disks involved. So adding more OSDs instead of growing them is most likely the way forward. If you were to replace actual disks with larger ones, take them (the OSDs) out one at a time and re-add them. If you're using ceph-deploy, it will use the disk size as the basic weight; if you're doing things manually, make sure to specify that size/weight accordingly. Again, you do want to do this for all disks to keep things uniform. If your cluster (pools really) is set to a replica size of at least 2 (risky!) or 3 (as per the Firefly default), taking a single OSD out would of course never bring the cluster down. However taking an OSD out and/or adding a new one will cause data movement that might impact your cluster's performance. Regards, Christian -- Christian Balzer  Network/Systems Engineer  ch...@gol.com  Global OnLine Japan/Fusion Communications http://www.gol.com/
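For the replace-with-larger-disks case outlined above, the usual take-out-and-re-add cycle looks roughly like the following. This is a sketch only; the OSD id 12, the sysvinit-style service commands and the 2.0 (2 TB) weight are illustrative placeholders, not values from the original posts.

  ceph osd out 12                   # start draining PGs off the OSD
  # wait for recovery to finish (watch ceph -s), then stop the daemon:
  /etc/init.d/ceph stop osd.12
  ceph osd crush remove osd.12
  ceph auth del osd.12
  ceph osd rm 12
  # recreate the OSD on the larger disk, then set its CRUSH weight to
  # reflect the new size, e.g. 2.0 for a 2 TB drive:
  ceph osd crush reweight osd.12 2.0

Doing this one OSD at a time, and keeping the weights proportional to disk size, is what keeps the data distribution uniform as described above.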
Re: [ceph-users] Huge issues with slow requests
We manage to go through the restore, but the performance degradation is still there. Looking through the OSDs to pinpoint a source of the degradation and hoping the current load will be lowered. I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be tough if the degradation is still there afterwards? i.e. if I set back the weight would it move back all the PGs? Regards, Josef On 06 Sep 2014, at 15:52, Josef Johansson wrote: > FWI I did restart the OSDs until I saw a server that made impact. Until that > server stopped doing impact, I didn’t get lower in the number objects being > degraded. > After a while it was done with recovering that OSD and happily started with > others. > I guess I will be seeing the same behaviour when it gets to replicating the > same PGs that were causing troubles the first time. > > On 06 Sep 2014, at 15:04, Josef Johansson wrote: > >> Actually, it only worked with restarting for a period of time to get the >> recovering process going. Can’t get passed the 21k object mark. >> >> I’m uncertain if the disk really is messing this up right now as well. So >> I’m not glad to start moving 300k objects around. >> >> Regards, >> Josef >> >> On 06 Sep 2014, at 14:33, Josef Johansson wrote: >> >>> Hi, >>> >>> On 06 Sep 2014, at 13:53, Christian Balzer wrote: >>> Hello, On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote: > Also putting this on the list. > > On 06 Sep 2014, at 13:36, Josef Johansson wrote: > >> Hi, >> >> Same issues again, but I think we found the drive that causes the >> problems. >> >> But this is causing problems as it’s trying to do a recover to that >> osd at the moment. >> >> So we’re left with the status message >> >> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841 >> active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB >> used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s; >> 41424/15131923 degraded (0.274%); recovering 0 o/s, 2035KB/s >> >> >> It’s improving, but way too slowly. If I restart the recovery (ceph >> osd set no recovery /unset) it doesn’t change the osd what I can see. >> >> Any ideas? >> I don't know the state of your cluster, i.e. what caused the recovery to start (how many OSDs went down?). >>> Performance degradation, databases are the worst impacted. It’s actually a >>> OSD that we put in that’s causing it (removed it again though). So the >>> cluster in itself is healthy. >>> If you have a replication of 3 and only one OSD was involved, what is stopping you from taking that wonky drive/OSD out? >>> There’s data that goes missing if I do that, I guess I have to wait for the >>> recovery process to complete before I can go any further, this is with rep >>> 3. If you don't know that or want to play it safe, how about setting the weight of that OSD to 0? While that will AFAICT still result in all primary PGs to be evacuated off it, no more writes will happen to it and reads might be faster. In either case, it shouldn't slow down the rest of your cluster anymore. >>> That’s actually one idea I haven’t thought off, I wan’t to play it safe >>> right now and hope that it goes up again, I actually found one wonky way of >>> getting the recovery process from not stalling to a grind, and that was >>> restarting OSDs. One at the time. 
>>> >>> Regards, >>> Josef Regards, Christian >> Cheers, >> Josef >> >> On 05 Sep 2014, at 11:26, Luis Periquito >> wrote: >> >>> Only time I saw such behaviour was when I was deleting a big chunk of >>> data from the cluster: all the client activity was reduced, the op/s >>> were almost non-existent and there was unjustified delays all over >>> the cluster. But all the disks were somewhat busy in atop/iotstat. >>> >>> >>> On 5 September 2014 09:51, David wrote: >>> Hi, >>> >>> Indeed strange. >>> >>> That output was when we had issues, seems that most operations were >>> blocked / slow requests. >>> >>> A ”baseline” output is more like today: >>> >>> 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs: >>> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB >>> avail; 9273KB/s rd, 24650KB/s wr, 2755op/s 2014-09-05 10:44:30.125637 >>> mon.0 [INF] pgmap v12582760: 6860 pgs: 6860 active+clean; 12253 GB >>> data, 36574 GB used, 142 TB / 178 TB avail; 9500KB/s rd, 20430KB/s >>> wr, 2294op/s 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761: >>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / >>> 178 TB avail; 9216KB/s rd, 20062KB/s wr, 2488op/s 2014-09-05 >>> 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860 pgs: 6860 >>> active+clea
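The "ceph osd set no recovery /unset" mentioned above refers to the norecover cluster flag, with nobackfill as its companion for backfill. A minimal illustration of toggling them; note these flags pause recovery cluster-wide, they do not target a single OSD:

  ceph osd set norecover       # pause recovery
  ceph osd set nobackfill      # optionally pause backfill as well
  # ... later ...
  ceph osd unset nobackfill
  ceph osd unset norecover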
Re: [ceph-users] Huge issues with slow requests
Hello, On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote: > We manage to go through the restore, but the performance degradation is > still there. > Manifesting itself how? > Looking through the OSDs to pinpoint a source of the degradation and > hoping the current load will be lowered. > You're the one looking at your cluster, the iostat, atop, iotop and whatnot data. If one particular OSD/disk stands out, investigate it, as per the "Good way to monitor detailed latency/throughput" thread. If you have a spare and idle machine that is identical to your storage nodes, you could run a fio benchmark on a disk there and then compare the results to that of your suspect disk after setting your cluster to noout and stopping that particular OSD. > I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be > tough if the degradation is still there afterwards? i.e. if I set back > the weight would it move back all the PGs? > Of course. Until you can determine that a specific OSD/disk is the culprit, don't do that. If you have the evidence, go ahead. Regards, Christian > Regards, > Josef > > On 06 Sep 2014, at 15:52, Josef Johansson wrote: > > > FWI I did restart the OSDs until I saw a server that made impact. > > Until that server stopped doing impact, I didn’t get lower in the > > number objects being degraded. After a while it was done with > > recovering that OSD and happily started with others. I guess I will be > > seeing the same behaviour when it gets to replicating the same PGs > > that were causing troubles the first time. > > > > On 06 Sep 2014, at 15:04, Josef Johansson wrote: > > > >> Actually, it only worked with restarting for a period of time to get > >> the recovering process going. Can’t get passed the 21k object mark. > >> > >> I’m uncertain if the disk really is messing this up right now as > >> well. So I’m not glad to start moving 300k objects around. > >> > >> Regards, > >> Josef > >> > >> On 06 Sep 2014, at 14:33, Josef Johansson wrote: > >> > >>> Hi, > >>> > >>> On 06 Sep 2014, at 13:53, Christian Balzer wrote: > >>> > > Hello, > > On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote: > > > Also putting this on the list. > > > > On 06 Sep 2014, at 13:36, Josef Johansson > > wrote: > > > >> Hi, > >> > >> Same issues again, but I think we found the drive that causes the > >> problems. > >> > >> But this is causing problems as it’s trying to do a recover to > >> that osd at the moment. > >> > >> So we’re left with the status message > >> > >> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: > >> 6841 active+clean, 19 active+remapped+backfilling; 12299 GB data, > >> 36882 GB used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, > >> 74op/s; 41424/15131923 degraded (0.274%); recovering 0 o/s, > >> 2035KB/s > >> > >> > >> It’s improving, but way too slowly. If I restart the recovery > >> (ceph osd set no recovery /unset) it doesn’t change the osd what > >> I can see. > >> > >> Any ideas? > >> > I don't know the state of your cluster, i.e. what caused the > recovery to start (how many OSDs went down?). > >>> Performance degradation, databases are the worst impacted. It’s > >>> actually a OSD that we put in that’s causing it (removed it again > >>> though). So the cluster in itself is healthy. > >>> > If you have a replication of 3 and only one OSD was involved, what > is stopping you from taking that wonky drive/OSD out? 
> > >>> There’s data that goes missing if I do that, I guess I have to wait > >>> for the recovery process to complete before I can go any further, > >>> this is with rep 3. > If you don't know that or want to play it safe, how about setting > the weight of that OSD to 0? > While that will AFAICT still result in all primary PGs to be > evacuated off it, no more writes will happen to it and reads might > be faster. In either case, it shouldn't slow down the rest of your > cluster anymore. > > >>> That’s actually one idea I haven’t thought off, I wan’t to play it > >>> safe right now and hope that it goes up again, I actually found one > >>> wonky way of getting the recovery process from not stalling to a > >>> grind, and that was restarting OSDs. One at the time. > >>> > >>> Regards, > >>> Josef > Regards, > > Christian > >> Cheers, > >> Josef > >> > >> On 05 Sep 2014, at 11:26, Luis Periquito > >> wrote: > >> > >>> Only time I saw such behaviour was when I was deleting a big > >>> chunk of data from the cluster: all the client activity was > >>> reduced, the op/s were almost non-existent and there was > >>> unjustified delays all over the cluster. But all the disks were > >>> somewhat busy in atop/iotstat. > >>> > >>> > >>> On 5
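A minimal sketch of the fio comparison suggested above, assuming noout is set and the suspect OSD daemon has been stopped first. /dev/sdX is a placeholder; the job is read-only so it does not disturb the on-disk data, and a write test should only ever be pointed at a scratch disk or file.

  fio --name=suspect-disk --filename=/dev/sdX --readonly \
      --rw=randread --bs=4k --direct=1 --ioengine=libaio \
      --iodepth=32 --runtime=60 --time_based
  # run the identical job against a known-good, identical disk and
  # compare the reported IOPS and latency figures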
Re: [ceph-users] SSD journal deployment experiences
On Sat, 6 Sep 2014 14:50:20 + Dan van der Ster wrote: > September 6 2014 4:01 PM, "Christian Balzer" wrote: > > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote: > > > >> Hi Christian, > >> > >> Let's keep debating until a dev corrects us ;) > > > > For the time being, I give the recent: > > > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html > > > > And not so recent: > > http://www.spinics.net/lists/ceph-users/msg04152.html > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021 > > > > And I'm not going to use BTRFS for mainly RBD backed VM images > > (fragmentation city), never mind the other stability issues that crop > > up here ever so often. > > > Thanks for the links... So until I learn otherwise, I better assume the > OSD is lost when the journal fails. Even though I haven't understood > exactly why :( I'm going to UTSL to understand the consistency better. > An op state diagram would help, but I didn't find one yet. > Using the source as an option of last resort is always nice, having to actually do so for something like this feels a bit lacking in the documentation department (that or my google foo being weak). ^o^ > BTW, do you happen to know, _if_ we re-use an OSD after the journal has > failed, are any object inconsistencies going to be found by a > scrub/deep-scrub? > No idea. And really a scenario I hope to never encounter. ^^;; > >> > >> We have 4 servers in a 3U rack, then each of those servers is > >> connected to one of these enclosures with a single SAS cable. > >> > With the current config, when I dd to all drives in parallel I can > write at 24*74MB/s = 1776MB/s. > >>> > >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0 > >>> lanes, so as far as that bus goes, it can do 4GB/s. > >>> And given your storage pod I assume it is connected with 2 mini-SAS > >>> cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA > >>> bandwidth. > >> > >> From above, we are only using 4 lanes -- so around 2GB/s is expected. > > > > Alright, that explains that then. Any reason for not using both ports? > > > > Probably to minimize costs, and since the single 10Gig-E is a bottleneck > anyway. The whole thing is suboptimal anyway, since this hardware was > not purchased for Ceph to begin with. Hence retrofitting SSDs, etc... > The single 10Gb/s link is the bottleneck for sustained stuff, but when looking at spikes... Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port might also get some loving. ^o^ The cluster I'm currently building is based on storage nodes with 4 SSDs (100GB DC 3700s, so 800MB/s would be the absolute write speed limit) and 8 HDDs. Connected with 40Gb/s Infiniband. Dual port, dual switch for redundancy, not speed. ^^ > >>> Impressive, even given your huge cluster with 1128 OSDs. > >>> However that's not really answering my question, how much data is on > >>> an average OSD and thus gets backfilled in that hour? > >> > >> That's true -- our drives have around 300TB on them. So I guess it > >> will take longer - 3x longer - when the drives are 1TB full. > > > > On your slides, when the crazy user filled the cluster with 250 million > > objects and thus 1PB of data, I recall seeing a 7 hour backfill time? > > > > Yeah that was fun :) It was 250 million (mostly) 4k objects, so not > close to 1PB. The point was that to fill the cluster with RBD, we'd need > 250 million (4MB) objects. 
So, object-count-wise this was a full > cluster, but for the real volume it was more like 70TB IIRC (there were > some other larger objects too). > Ah, I see. ^^ > In that case, the backfilling was CPU-bound, or perhaps > wbthrottle-bound, I don't remember... It was just that there were many > tiny tiny objects to synchronize. > Indeed. This is something me and others have seen as well, as in backfilling being much slower than the underlying HW would permit and being CPU intensive. > > Anyway, I guess the lesson to take away from this is that size and > > parallelism does indeed help, but even in a cluster like yours > > recovering from a 2TB loss would likely be in the 10 hour range... > > Bigger clusters probably backfill faster simply because there are more > OSDs involved in the backfilling. In our cluster we initially get 30-40 > backfills in parallel after 1 OSD fails. That's even with max backfills > = 1. The backfilling sorta follows an 80/20 rule -- 80% of the time is > spent backfilling the last 20% of the PGs, just because some OSDs > randomly get more new PGs than the others. > You still being on dumpling probably doesn't help that uneven distribution bit. Definitely another data point to go into a realistic recovery/reliability model, though. Christian > > Again, see the "Best practice K/M-parameters EC pool" thread. ^.^ > > Marked that one to read, again. > > Cheers, dan > -- Christian BalzerNetwork/Systems Engineer ch...@gol.
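For the planned case, where a journal SSD is still readable and merely being replaced or moved, the usual sequence is to flush and recreate the journal. This is a sketch with a placeholder OSD id; it deliberately does not cover the failure case discussed above, where the unflushed journal writes are gone and the OSD's consistency is exactly what is in doubt.

  ceph osd set noout
  /etc/init.d/ceph stop osd.12
  ceph-osd -i 12 --flush-journal    # write out anything still sitting in the journal
  # point 'osd journal' (ceph.conf or the journal symlink) at the new device, then:
  ceph-osd -i 12 --mkjournal
  /etc/init.d/ceph start osd.12
  ceph osd unset noout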
Re: [ceph-users] Huge issues with slow requests
Hi, On 06 Sep 2014, at 17:27, Christian Balzer wrote: > > Hello, > > On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote: > >> We manage to go through the restore, but the performance degradation is >> still there. >> > Manifesting itself how? > Awful slow io on the VMs, and iowait, it’s about 2MB/s or so. But mostly a lot of iowait. >> Looking through the OSDs to pinpoint a source of the degradation and >> hoping the current load will be lowered. >> > > You're the one looking at your cluster, the iostat, atop, iotop and > whatnot data. > If one particular OSD/disk stands out, investigate it, as per the "Good > way to monitor detailed latency/throughput" thread. > Will read it through. > If you have a spare and idle machine that is identical to your storage > nodes, you could run a fio benchmark on a disk there and then compare the > results to that of your suspect disk after setting your cluster to noout > and stopping that particular OSD. No spare though, but I have a rough idea what it should be, what’s I’m going at right now. Right, so the cluster should be fine after I stop the OSD right? I though of stopping it a little bit to see if the IO was better afterwards from within the VMs. Not sure how good effect it makes though since it may be waiting for the IO to complete what not. > >> I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be >> tough if the degradation is still there afterwards? i.e. if I set back >> the weight would it move back all the PGs? >> > Of course. > > Until you can determine that a specific OSD/disk is the culprit, don't do > that. > If you have the evidence, go ahead. > Great, that’s what I though as well. > Regards, > > Christian > >> Regards, >> Josef >> >> On 06 Sep 2014, at 15:52, Josef Johansson wrote: >> >>> FWI I did restart the OSDs until I saw a server that made impact. >>> Until that server stopped doing impact, I didn’t get lower in the >>> number objects being degraded. After a while it was done with >>> recovering that OSD and happily started with others. I guess I will be >>> seeing the same behaviour when it gets to replicating the same PGs >>> that were causing troubles the first time. >>> >>> On 06 Sep 2014, at 15:04, Josef Johansson wrote: >>> Actually, it only worked with restarting for a period of time to get the recovering process going. Can’t get passed the 21k object mark. I’m uncertain if the disk really is messing this up right now as well. So I’m not glad to start moving 300k objects around. Regards, Josef On 06 Sep 2014, at 14:33, Josef Johansson wrote: > Hi, > > On 06 Sep 2014, at 13:53, Christian Balzer wrote: > >> >> Hello, >> >> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote: >> >>> Also putting this on the list. >>> >>> On 06 Sep 2014, at 13:36, Josef Johansson >>> wrote: >>> Hi, Same issues again, but I think we found the drive that causes the problems. But this is causing problems as it’s trying to do a recover to that osd at the moment. So we’re left with the status message 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841 active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s; 41424/15131923 degraded (0.274%); recovering 0 o/s, 2035KB/s It’s improving, but way too slowly. If I restart the recovery (ceph osd set no recovery /unset) it doesn’t change the osd what I can see. Any ideas? >> I don't know the state of your cluster, i.e. what caused the >> recovery to start (how many OSDs went down?). 
> Performance degradation, databases are the worst impacted. It’s > actually a OSD that we put in that’s causing it (removed it again > though). So the cluster in itself is healthy. > >> If you have a replication of 3 and only one OSD was involved, what >> is stopping you from taking that wonky drive/OSD out? >> > There’s data that goes missing if I do that, I guess I have to wait > for the recovery process to complete before I can go any further, > this is with rep 3. >> If you don't know that or want to play it safe, how about setting >> the weight of that OSD to 0? >> While that will AFAICT still result in all primary PGs to be >> evacuated off it, no more writes will happen to it and reads might >> be faster. In either case, it shouldn't slow down the rest of your >> cluster anymore. >> > That’s actually one idea I haven’t thought off, I wan’t to play it > safe right now and hope that it goes up again, I actually found one > wonky way of getting the recovery process from
Re: [ceph-users] Huge issues with slow requests
Hi, Just realised that it could also be a popularity bug with lots of small traffic. And seeing that it’s fast, it gets popular until it hits the curb. I’m seeing this in the stats I think.
Linux 3.13-0.bpo.1-amd64 (osd1)   09/06/2014   _x86_64_   (24 CPU)
09/06/2014 05:48:41 PM
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           2.21   0.00     1.00     2.86    0.00  93.93
Device:  rrqm/s  wrqm/s   r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdm        0.02    1.47  7.05  42.72   0.67   1.07     71.43      0.41   8.17     6.41     8.46   3.44  17.13
sdn        0.03    1.42  6.17  37.08   0.57   0.92     70.51      0.08   1.76     6.47     0.98   3.46  14.98
sdg        0.03    1.44  6.27  36.62   0.56   0.94     71.40      0.34   8.00     6.83     8.20   3.45  14.78
sde        0.03    1.23  6.47  39.07   0.59   0.98     70.29      0.43   9.47     6.57     9.95   3.37  15.33
sdf        0.02    1.26  6.47  33.77   0.61   0.87     75.30      0.22   5.39     6.00     5.27   3.52  14.17
sdl        0.03    1.44  6.44  40.54   0.59   1.08     72.68      0.21   4.49     6.56     4.16   3.40  15.95
sdk        0.03    1.41  5.62  35.92   0.52   0.90     70.10      0.15   3.58     6.17     3.17   3.45  14.32
sdj        0.03    1.26  6.30  34.23   0.57   0.83     70.84      0.31   7.65     6.56     7.85   3.48  14.10
Seeing that the drives are in pretty good shape but not doing a lot of reads, I would assume that I need to tweak the cache to swallow more IO. When I tweaked it before production I did not see any performance gains whatsoever, so the settings are pretty low. And it’s odd because we just saw these problems a little while ago. So probably we hit a limit where the disks are getting a lot of IO. I know that there are some threads about this that I will read again. Thanks for the hints in looking at bad drives. Regards, Josef
On 06 Sep 2014, at 17:41, Josef Johansson wrote: > Hi, > > On 06 Sep 2014, at 17:27, Christian Balzer wrote: > >> >> Hello, >> >> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote: >> >>> We manage to go through the restore, but the performance degradation is >>> still there. >>> >> Manifesting itself how? >> > Awful slow io on the VMs, and iowait, it’s about 2MB/s or so. > But mostly a lot of iowait. > >>> Looking through the OSDs to pinpoint a source of the degradation and >>> hoping the current load will be lowered. >>> >> >> You're the one looking at your cluster, the iostat, atop, iotop and >> whatnot data. >> If one particular OSD/disk stands out, investigate it, as per the "Good >> way to monitor detailed latency/throughput" thread. >> > Will read it through. >> If you have a spare and idle machine that is identical to your storage >> nodes, you could run a fio benchmark on a disk there and then compare the >> results to that of your suspect disk after setting your cluster to noout >> and stopping that particular OSD. > No spare though, but I have a rough idea what it should be, what’s I’m going > at right now. > Right, so the cluster should be fine after I stop the OSD right? I though of > stopping it a little bit to see if the IO was better afterwards from within > the VMs. Not sure how good effect it makes though since it may be waiting for > the IO to complete what not. >> >>> I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be >>> tough if the degradation is still there afterwards? i.e. if I set back >>> the weight would it move back all the PGs? >>> >> Of course. >> >> Until you can determine that a specific OSD/disk is the culprit, don't do >> that. >> If you have the evidence, go ahead. >> > Great, that’s what I though as well. >> Regards, >> >> Christian >> >>> Regards, >>> Josef >>> >>> On 06 Sep 2014, at 15:52, Josef Johansson wrote: >>> FWI I did restart the OSDs until I saw a server that made impact.
Until that server stopped doing impact, I didn’t get lower in the number objects being degraded. After a while it was done with recovering that OSD and happily started with others. I guess I will be seeing the same behaviour when it gets to replicating the same PGs that were causing troubles the first time. On 06 Sep 2014, at 15:04, Josef Johansson wrote: > Actually, it only worked with restarting for a period of time to get > the recovering process going. Can’t get passed the 21k object mark. > > I’m uncertain if the disk really is messing this up right now as > well. So I’m not glad to start moving 300k objects around. > > Regards, > Josef > > On 06 Sep 2014, at 14:33, Josef Johansson wrote: > >> Hi, >> >> On 06 Sep 2014, at 13:53, Christian Balzer wrote: >> >>> >>> Hello, >>> >>> On Sat
Re: [ceph-users] Huge issues with slow requests
Hello, On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote: > Hi, > > On 06 Sep 2014, at 17:27, Christian Balzer wrote: > > > > > Hello, > > > > On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote: > > > >> We manage to go through the restore, but the performance degradation > >> is still there. > >> > > Manifesting itself how? > > > Awful slow io on the VMs, and iowait, it’s about 2MB/s or so. > But mostly a lot of iowait. > I was thinking about the storage nodes. ^^ As in, does a particular node or disk seem to be redlined all the time? > >> Looking through the OSDs to pinpoint a source of the degradation and > >> hoping the current load will be lowered. > >> > > > > You're the one looking at your cluster, the iostat, atop, iotop and > > whatnot data. > > If one particular OSD/disk stands out, investigate it, as per the "Good > > way to monitor detailed latency/throughput" thread. > > > Will read it through. > > If you have a spare and idle machine that is identical to your storage > > nodes, you could run a fio benchmark on a disk there and then compare > > the results to that of your suspect disk after setting your cluster to > > noout and stopping that particular OSD. > No spare though, but I have a rough idea what it should be, what’s I’m > going at right now. Right, so the cluster should be fine after I stop > the OSD right? I though of stopping it a little bit to see if the IO was > better afterwards from within the VMs. Not sure how good effect it makes > though since it may be waiting for the IO to complete what not. > > If you set your cluster to noout, as in "ceph osd set noout" per http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ before shutting down a particular ODS, no data migration will happen. Of course you will want to shut it down as little as possible, so that recovery traffic when it comes back is minimized. Christian > >> I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be > >> tough if the degradation is still there afterwards? i.e. if I set back > >> the weight would it move back all the PGs? > >> > > Of course. > > > > Until you can determine that a specific OSD/disk is the culprit, don't > > do that. > > If you have the evidence, go ahead. > > > Great, that’s what I though as well. > > Regards, > > > > Christian > > > >> Regards, > >> Josef > >> > >> On 06 Sep 2014, at 15:52, Josef Johansson wrote: > >> > >>> FWI I did restart the OSDs until I saw a server that made impact. > >>> Until that server stopped doing impact, I didn’t get lower in the > >>> number objects being degraded. After a while it was done with > >>> recovering that OSD and happily started with others. I guess I will > >>> be seeing the same behaviour when it gets to replicating the same PGs > >>> that were causing troubles the first time. > >>> > >>> On 06 Sep 2014, at 15:04, Josef Johansson wrote: > >>> > Actually, it only worked with restarting for a period of time to > get the recovering process going. Can’t get passed the 21k object > mark. > > I’m uncertain if the disk really is messing this up right now as > well. So I’m not glad to start moving 300k objects around. > > Regards, > Josef > > On 06 Sep 2014, at 14:33, Josef Johansson wrote: > > > Hi, > > > > On 06 Sep 2014, at 13:53, Christian Balzer wrote: > > > >> > >> Hello, > >> > >> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote: > >> > >>> Also putting this on the list. 
> >>> > >>> On 06 Sep 2014, at 13:36, Josef Johansson > >>> wrote: > >>> > Hi, > > Same issues again, but I think we found the drive that causes > the problems. > > But this is causing problems as it’s trying to do a recover to > that osd at the moment. > > So we’re left with the status message > > 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 > pgs: 6841 active+clean, 19 active+remapped+backfilling; 12299 > GB data, 36882 GB used, 142 TB / 178 TB avail; 1921KB/s rd, > 192KB/s wr, 74op/s; 41424/15131923 degraded (0.274%); > recovering 0 o/s, 2035KB/s > > > It’s improving, but way too slowly. If I restart the recovery > (ceph osd set no recovery /unset) it doesn’t change the osd what > I can see. > > Any ideas? > > >> I don't know the state of your cluster, i.e. what caused the > >> recovery to start (how many OSDs went down?). > > Performance degradation, databases are the worst impacted. It’s > > actually a OSD that we put in that’s causing it (removed it again > > though). So the cluster in itself is healthy. > > > >> If you have a replication of 3 and only one OSD was involved, what > >> is stopping you from
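Condensed into commands, the noout procedure described above looks roughly like this; osd.N is a placeholder and the start/stop invocation depends on the init system in use.

  ceph osd set noout            # keep the stopped OSD from being marked out
  /etc/init.d/ceph stop osd.N
  # ... watch VM latency and iostat on the remaining nodes while it is down ...
  /etc/init.d/ceph start osd.N
  ceph osd unset noout

Keeping the downtime short means the OSD only has to catch up on the writes it missed, so the recovery traffic when it comes back stays small.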
Re: [ceph-users] Huge issues with slow requests
Hello, On Sat, 6 Sep 2014 17:52:59 +0200 Josef Johansson wrote: > Hi, > > Just realised that it could also be with a popularity bug as well and > lots a small traffic. And seeing that it’s fast it gets popular until it > hits the curb. > I don't think I ever heard the term "popularity bug" before, care to elaborate? > I’m seeing this in the stats I think. > > Linux 3.13-0.bpo.1-amd64 (osd1) 09/06/2014 > _x86_64_ (24 CPU) Any particular reason you're not running 3.14? > > 09/06/2014 05:48:41 PM > avg-cpu: %user %nice %system %iowait %steal %idle >2.210.001.002.860.00 93.93 > > Device: rrqm/s wrqm/s r/s w/srMB/swMB/s > avgrq-sz avgqu-sz await r_await w_await svctm %util > sdm 0.02 1.477.05 42.72 0.67 1.07 > 71.43 0.418.176.418.46 3.44 17.13 sdn > 0.03 1.426.17 37.08 0.57 0.9270.51 0.08 > 1.766.470.98 3.46 14.98 sdg 0.03 1.44 > 6.27 36.62 0.56 0.9471.40 0.348.006.83 > 8.20 3.45 14.78 sde 0.03 1.236.47 39.07 > 0.59 0.9870.29 0.439.476.579.95 3.37 15.33 > sdf 0.02 1.266.47 33.77 0.61 0.87 > 75.30 0.225.396.005.27 3.52 14.17 sdl > 0.03 1.446.44 40.54 0.59 1.0872.68 0.21 > 4.496.564.16 3.40 15.95 sdk 0.03 1.41 > 5.62 35.92 0.52 0.9070.10 0.153.586.17 > 3.17 3.45 14.32 sdj 0.03 1.266.30 34.23 > 0.57 0.8370.84 0.317.656.567.85 3.48 14.10 > > Seeing that the drives are in pretty good shape but not giving lotsa > read, I would assume that I need to tweak the cache to swallow more IO. > That looks indeed fine, as in, none of these disks looks suspicious to me. > When I tweaked it before production I did not see any performance gains > what so ever, so they are pretty low. And it’s odd because we just saw > these problems a little while ago. So probably that we hit a limit where > the disks are getting lot of IO. > > I know that there’s some threads about this that I will read again. > URL? Christian > Thanks for the hints in looking at bad drives. > > Regards, > Josef > > On 06 Sep 2014, at 17:41, Josef Johansson wrote: > > > Hi, > > > > On 06 Sep 2014, at 17:27, Christian Balzer wrote: > > > >> > >> Hello, > >> > >> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote: > >> > >>> We manage to go through the restore, but the performance degradation > >>> is still there. > >>> > >> Manifesting itself how? > >> > > Awful slow io on the VMs, and iowait, it’s about 2MB/s or so. > > But mostly a lot of iowait. > > > >>> Looking through the OSDs to pinpoint a source of the degradation and > >>> hoping the current load will be lowered. > >>> > >> > >> You're the one looking at your cluster, the iostat, atop, iotop and > >> whatnot data. > >> If one particular OSD/disk stands out, investigate it, as per the > >> "Good way to monitor detailed latency/throughput" thread. > >> > > Will read it through. > >> If you have a spare and idle machine that is identical to your storage > >> nodes, you could run a fio benchmark on a disk there and then compare > >> the results to that of your suspect disk after setting your cluster > >> to noout and stopping that particular OSD. > > No spare though, but I have a rough idea what it should be, what’s I’m > > going at right now. Right, so the cluster should be fine after I stop > > the OSD right? I though of stopping it a little bit to see if the IO > > was better afterwards from within the VMs. Not sure how good effect it > > makes though since it may be waiting for the IO to complete what not. > >> > >>> I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be > >>> tough if the degradation is still there afterwards? i.e. 
if I set > >>> back the weight would it move back all the PGs? > >>> > >> Of course. > >> > >> Until you can determine that a specific OSD/disk is the culprit, > >> don't do that. > >> If you have the evidence, go ahead. > >> > > Great, that’s what I though as well. > >> Regards, > >> > >> Christian > >> > >>> Regards, > >>> Josef > >>> > >>> On 06 Sep 2014, at 15:52, Josef Johansson wrote: > >>> > FWI I did restart the OSDs until I saw a server that made impact. > Until that server stopped doing impact, I didn’t get lower in the > number objects being degraded. After a while it was done with > recovering that OSD and happily started with others. I guess I will > be seeing the same behaviour when it gets to replicating the same > PGs that were causing troubles the first time. > > On 06 Sep 2014, at 15:04, Josef Johansson wrote: > > > Actually, it only worked with restarting for a period of time to > > get the recoveri
Re: [ceph-users] SSD journal deployment experiences
Backing up slightly, have you considered RAID 5 over your SSDs? Practically speaking, there's no performance downside to RAID 5 when your devices aren't IOPS-bound. On Sat Sep 06 2014 at 8:37:56 AM Christian Balzer wrote: > On Sat, 6 Sep 2014 14:50:20 + Dan van der Ster wrote: > > > September 6 2014 4:01 PM, "Christian Balzer" wrote: > > > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote: > > > > > >> Hi Christian, > > >> > > >> Let's keep debating until a dev corrects us ;) > > > > > > For the time being, I give the recent: > > > > > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html > > > > > > And not so recent: > > > http://www.spinics.net/lists/ceph-users/msg04152.html > > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021 > > > > > > And I'm not going to use BTRFS for mainly RBD backed VM images > > > (fragmentation city), never mind the other stability issues that crop > > > up here ever so often. > > > > > > Thanks for the links... So until I learn otherwise, I better assume the > > OSD is lost when the journal fails. Even though I haven't understood > > exactly why :( I'm going to UTSL to understand the consistency better. > > An op state diagram would help, but I didn't find one yet. > > > Using the source as an option of last resort is always nice, having to > actually do so for something like this feels a bit lacking in the > documentation department (that or my google foo being weak). ^o^ > > > BTW, do you happen to know, _if_ we re-use an OSD after the journal has > > failed, are any object inconsistencies going to be found by a > > scrub/deep-scrub? > > > No idea. > And really a scenario I hope to never encounter. ^^;; > > > >> > > >> We have 4 servers in a 3U rack, then each of those servers is > > >> connected to one of these enclosures with a single SAS cable. > > >> > > With the current config, when I dd to all drives in parallel I can > > write at 24*74MB/s = 1776MB/s. > > >>> > > >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0 > > >>> lanes, so as far as that bus goes, it can do 4GB/s. > > >>> And given your storage pod I assume it is connected with 2 mini-SAS > > >>> cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA > > >>> bandwidth. > > >> > > >> From above, we are only using 4 lanes -- so around 2GB/s is expected. > > > > > > Alright, that explains that then. Any reason for not using both ports? > > > > > > > Probably to minimize costs, and since the single 10Gig-E is a bottleneck > > anyway. The whole thing is suboptimal anyway, since this hardware was > > not purchased for Ceph to begin with. Hence retrofitting SSDs, etc... > > > The single 10Gb/s link is the bottleneck for sustained stuff, but when > looking at spikes... > Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port > might also get some loving. ^o^ > > The cluster I'm currently building is based on storage nodes with 4 SSDs > (100GB DC 3700s, so 800MB/s would be the absolute write speed limit) and 8 > HDDs. Connected with 40Gb/s Infiniband. Dual port, dual switch for > redundancy, not speed. ^^ > > > >>> Impressive, even given your huge cluster with 1128 OSDs. > > >>> However that's not really answering my question, how much data is on > > >>> an average OSD and thus gets backfilled in that hour? > > >> > > >> That's true -- our drives have around 300TB on them. So I guess it > > >> will take longer - 3x longer - when the drives are 1TB full. 
> > > > > > On your slides, when the crazy user filled the cluster with 250 million > > > objects and thus 1PB of data, I recall seeing a 7 hour backfill time? > > > > > > > Yeah that was fun :) It was 250 million (mostly) 4k objects, so not > > close to 1PB. The point was that to fill the cluster with RBD, we'd need > > 250 million (4MB) objects. So, object-count-wise this was a full > > cluster, but for the real volume it was more like 70TB IIRC (there were > > some other larger objects too). > > > Ah, I see. ^^ > > > In that case, the backfilling was CPU-bound, or perhaps > > wbthrottle-bound, I don't remember... It was just that there were many > > tiny tiny objects to synchronize. > > > Indeed. This is something me and others have seen as well, as in > backfilling being much slower than the underlying HW would permit and > being CPU intensive. > > > > Anyway, I guess the lesson to take away from this is that size and > > > parallelism does indeed help, but even in a cluster like yours > > > recovering from a 2TB loss would likely be in the 10 hour range... > > > > Bigger clusters probably backfill faster simply because there are more > > OSDs involved in the backfilling. In our cluster we initially get 30-40 > > backfills in parallel after 1 OSD fails. That's even with max backfills > > = 1. The backfilling sorta follows an 80/20 rule -- 80% of the time is > > spent backfilling the last 20% of the PGs, just because some OSDs > > randomly get more new PGs than
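The "max backfills = 1" mentioned in the quoted discussion corresponds to the osd_max_backfills option; the recovery knob shown alongside it and the values used are purely illustrative. The injectargs form applies the setting at runtime without restarting the OSDs.

  # ceph.conf
  [osd]
      osd max backfills = 1
      osd recovery max active = 1

  # or injected into all running OSDs:
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'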
Re: [ceph-users] Huge issues with slow requests
Hi, On 06 Sep 2014, at 18:05, Christian Balzer wrote: > > Hello, > > On Sat, 6 Sep 2014 17:52:59 +0200 Josef Johansson wrote: > >> Hi, >> >> Just realised that it could also be with a popularity bug as well and >> lots a small traffic. And seeing that it’s fast it gets popular until it >> hits the curb. >> > I don't think I ever heard the term "popularity bug" before, care to > elaborate? I did! :D When you start out fine with great numbers, people like it and suddenly it’s not so fast anymore, and when you hit the magic number it starts to be trouble. > >> I’m seeing this in the stats I think. >> >> Linux 3.13-0.bpo.1-amd64 (osd1) 09/06/2014 >> _x86_64_ (24 CPU) > Any particular reason you're not running 3.14? No, just that we don’t have that much time on our hands. >> >> 09/06/2014 05:48:41 PM >> avg-cpu: %user %nice %system %iowait %steal %idle >> 2.210.001.002.860.00 93.93 >> >> Device: rrqm/s wrqm/s r/s w/srMB/swMB/s >> avgrq-sz avgqu-sz await r_await w_await svctm %util >> sdm 0.02 1.477.05 42.72 0.67 1.07 >> 71.43 0.418.176.418.46 3.44 17.13 sdn >> 0.03 1.426.17 37.08 0.57 0.9270.51 0.08 >> 1.766.470.98 3.46 14.98 sdg 0.03 1.44 >> 6.27 36.62 0.56 0.9471.40 0.348.006.83 >> 8.20 3.45 14.78 sde 0.03 1.236.47 39.07 >> 0.59 0.9870.29 0.439.476.579.95 3.37 15.33 >> sdf 0.02 1.266.47 33.77 0.61 0.87 >> 75.30 0.225.396.005.27 3.52 14.17 sdl >> 0.03 1.446.44 40.54 0.59 1.0872.68 0.21 >> 4.496.564.16 3.40 15.95 sdk 0.03 1.41 >> 5.62 35.92 0.52 0.9070.10 0.153.586.17 >> 3.17 3.45 14.32 sdj 0.03 1.266.30 34.23 >> 0.57 0.8370.84 0.317.656.567.85 3.48 14.10 >> >> Seeing that the drives are in pretty good shape but not giving lotsa >> read, I would assume that I need to tweak the cache to swallow more IO. >> > That looks indeed fine, as in, none of these disks looks suspicious to me. > >> When I tweaked it before production I did not see any performance gains >> what so ever, so they are pretty low. And it’s odd because we just saw >> these problems a little while ago. So probably that we hit a limit where >> the disks are getting lot of IO. >> >> I know that there’s some threads about this that I will read again. >> > URL? > Uhm, I think you’re involved in most of them. I'll post what I do and from where. > Christian > >> Thanks for the hints in looking at bad drives. >> >> Regards, >> Josef >> >> On 06 Sep 2014, at 17:41, Josef Johansson wrote: >> >>> Hi, >>> >>> On 06 Sep 2014, at 17:27, Christian Balzer wrote: >>> Hello, On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote: > We manage to go through the restore, but the performance degradation > is still there. > Manifesting itself how? >>> Awful slow io on the VMs, and iowait, it’s about 2MB/s or so. >>> But mostly a lot of iowait. >>> > Looking through the OSDs to pinpoint a source of the degradation and > hoping the current load will be lowered. > You're the one looking at your cluster, the iostat, atop, iotop and whatnot data. If one particular OSD/disk stands out, investigate it, as per the "Good way to monitor detailed latency/throughput" thread. >>> Will read it through. If you have a spare and idle machine that is identical to your storage nodes, you could run a fio benchmark on a disk there and then compare the results to that of your suspect disk after setting your cluster to noout and stopping that particular OSD. >>> No spare though, but I have a rough idea what it should be, what’s I’m >>> going at right now. Right, so the cluster should be fine after I stop >>> the OSD right? 
I though of stopping it a little bit to see if the IO >>> was better afterwards from within the VMs. Not sure how good effect it >>> makes though since it may be waiting for the IO to complete what not. > I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be > tough if the degradation is still there afterwards? i.e. if I set > back the weight would it move back all the PGs? > Of course. Until you can determine that a specific OSD/disk is the culprit, don't do that. If you have the evidence, go ahead. >>> Great, that’s what I though as well. Regards, Christian > Regards, > Josef > > On 06 Sep 2014, at 15:52, Josef Johansson wrote: > >> FWI I did restart the OSDs until I saw a server that made impact. >> Until that server stopped doing impact, I didn’t get lower in the >
Re: [ceph-users] Huge issues with slow requests
Hi, On 06 Sep 2014, at 17:59, Christian Balzer wrote: > > Hello, > > On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote: > >> Hi, >> >> On 06 Sep 2014, at 17:27, Christian Balzer wrote: >> >>> >>> Hello, >>> >>> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote: >>> We manage to go through the restore, but the performance degradation is still there. >>> Manifesting itself how? >>> >> Awful slow io on the VMs, and iowait, it’s about 2MB/s or so. >> But mostly a lot of iowait. >> > I was thinking about the storage nodes. ^^ > As in, does a particular node or disk seem to be redlined all the time? They’re idle, with little io wait. > Looking through the OSDs to pinpoint a source of the degradation and hoping the current load will be lowered. >>> >>> You're the one looking at your cluster, the iostat, atop, iotop and >>> whatnot data. >>> If one particular OSD/disk stands out, investigate it, as per the "Good >>> way to monitor detailed latency/throughput" thread. >>> >> Will read it through. >>> If you have a spare and idle machine that is identical to your storage >>> nodes, you could run a fio benchmark on a disk there and then compare >>> the results to that of your suspect disk after setting your cluster to >>> noout and stopping that particular OSD. >> No spare though, but I have a rough idea what it should be, what’s I’m >> going at right now. Right, so the cluster should be fine after I stop >> the OSD right? I though of stopping it a little bit to see if the IO was >> better afterwards from within the VMs. Not sure how good effect it makes >> though since it may be waiting for the IO to complete what not. >>> > If you set your cluster to noout, as in "ceph osd set noout" per > http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ > before shutting down a particular ODS, no data migration will happen. > > Of course you will want to shut it down as little as possible, so that > recovery traffic when it comes back is minimized. > Good, yes will do this. Regards, Josef > Christian > I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be tough if the degradation is still there afterwards? i.e. if I set back the weight would it move back all the PGs? >>> Of course. >>> >>> Until you can determine that a specific OSD/disk is the culprit, don't >>> do that. >>> If you have the evidence, go ahead. >>> >> Great, that’s what I though as well. >>> Regards, >>> >>> Christian >>> Regards, Josef On 06 Sep 2014, at 15:52, Josef Johansson wrote: > FWI I did restart the OSDs until I saw a server that made impact. > Until that server stopped doing impact, I didn’t get lower in the > number objects being degraded. After a while it was done with > recovering that OSD and happily started with others. I guess I will > be seeing the same behaviour when it gets to replicating the same PGs > that were causing troubles the first time. > > On 06 Sep 2014, at 15:04, Josef Johansson wrote: > >> Actually, it only worked with restarting for a period of time to >> get the recovering process going. Can’t get passed the 21k object >> mark. >> >> I’m uncertain if the disk really is messing this up right now as >> well. So I’m not glad to start moving 300k objects around. >> >> Regards, >> Josef >> >> On 06 Sep 2014, at 14:33, Josef Johansson wrote: >> >>> Hi, >>> >>> On 06 Sep 2014, at 13:53, Christian Balzer wrote: >>> Hello, On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote: > Also putting this on the list. 
> > On 06 Sep 2014, at 13:36, Josef Johansson > wrote: > >> Hi, >> >> Same issues again, but I think we found the drive that causes >> the problems. >> >> But this is causing problems as it’s trying to do a recover to >> that osd at the moment. >> >> So we’re left with the status message >> >> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 >> pgs: 6841 active+clean, 19 active+remapped+backfilling; 12299 >> GB data, 36882 GB used, 142 TB / 178 TB avail; 1921KB/s rd, >> 192KB/s wr, 74op/s; 41424/15131923 degraded (0.274%); >> recovering 0 o/s, 2035KB/s >> >> >> It’s improving, but way too slowly. If I restart the recovery >> (ceph osd set no recovery /unset) it doesn’t change the osd what >> I can see. >> >> Any ideas? >> I don't know the state of your cluster, i.e. what caused the recovery to start (how many OSDs went down?). >>> Performance degradation, databases are the worst impacted. It’s >>> actually a OSD that we put in that’s causing it (remove
Re: [ceph-users] SSD journal deployment experiences
On Sat, 06 Sep 2014 16:06:56 + Scott Laird wrote: > Backing up slightly, have you considered RAID 5 over your SSDs? > Practically speaking, there's no performance downside to RAID 5 when > your devices aren't IOPS-bound. > Well... For starters with RAID5 you would loose 25% throughput in both Dan's and my case (4 SSDs) compared to JBOD SSD journals. In Dan's case that might not matter due to other bottlenecks, in my case it certainly would. And while you're quite correct when it comes to IOPS, doing RAID5 will either consume significant CPU resource in a software RAID case or require a decent HW RAID controller. Christian > On Sat Sep 06 2014 at 8:37:56 AM Christian Balzer wrote: > > > On Sat, 6 Sep 2014 14:50:20 + Dan van der Ster wrote: > > > > > September 6 2014 4:01 PM, "Christian Balzer" wrote: > > > > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote: > > > > > > > >> Hi Christian, > > > >> > > > >> Let's keep debating until a dev corrects us ;) > > > > > > > > For the time being, I give the recent: > > > > > > > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html > > > > > > > > And not so recent: > > > > http://www.spinics.net/lists/ceph-users/msg04152.html > > > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021 > > > > > > > > And I'm not going to use BTRFS for mainly RBD backed VM images > > > > (fragmentation city), never mind the other stability issues that > > > > crop up here ever so often. > > > > > > > > > Thanks for the links... So until I learn otherwise, I better assume > > > the OSD is lost when the journal fails. Even though I haven't > > > understood exactly why :( I'm going to UTSL to understand the > > > consistency better. An op state diagram would help, but I didn't > > > find one yet. > > > > > Using the source as an option of last resort is always nice, having to > > actually do so for something like this feels a bit lacking in the > > documentation department (that or my google foo being weak). ^o^ > > > > > BTW, do you happen to know, _if_ we re-use an OSD after the journal > > > has failed, are any object inconsistencies going to be found by a > > > scrub/deep-scrub? > > > > > No idea. > > And really a scenario I hope to never encounter. ^^;; > > > > > >> > > > >> We have 4 servers in a 3U rack, then each of those servers is > > > >> connected to one of these enclosures with a single SAS cable. > > > >> > > > With the current config, when I dd to all drives in parallel I > > > can write at 24*74MB/s = 1776MB/s. > > > >>> > > > >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe > > > >>> 2.0 lanes, so as far as that bus goes, it can do 4GB/s. > > > >>> And given your storage pod I assume it is connected with 2 > > > >>> mini-SAS cables, 4 lanes each at 6Gb/s, making for 4x6x2 = > > > >>> 48Gb/s SATA bandwidth. > > > >> > > > >> From above, we are only using 4 lanes -- so around 2GB/s is > > > >> expected. > > > > > > > > Alright, that explains that then. Any reason for not using both > > > > ports? > > > > > > > > > > Probably to minimize costs, and since the single 10Gig-E is a > > > bottleneck anyway. The whole thing is suboptimal anyway, since this > > > hardware was not purchased for Ceph to begin with. Hence > > > retrofitting SSDs, etc... > > > > > The single 10Gb/s link is the bottleneck for sustained stuff, but when > > looking at spikes... > > Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port > > might also get some loving. 
^o^ > > > > The cluster I'm currently building is based on storage nodes with 4 > > SSDs (100GB DC 3700s, so 800MB/s would be the absolute write speed > > limit) and 8 HDDs. Connected with 40Gb/s Infiniband. Dual port, dual > > switch for redundancy, not speed. ^^ > > > > > >>> Impressive, even given your huge cluster with 1128 OSDs. > > > >>> However that's not really answering my question, how much data > > > >>> is on an average OSD and thus gets backfilled in that hour? > > > >> > > > >> That's true -- our drives have around 300TB on them. So I guess it > > > >> will take longer - 3x longer - when the drives are 1TB full. > > > > > > > > On your slides, when the crazy user filled the cluster with 250 > > > > million objects and thus 1PB of data, I recall seeing a 7 hour > > > > backfill time? > > > > > > > > > > Yeah that was fun :) It was 250 million (mostly) 4k objects, so not > > > close to 1PB. The point was that to fill the cluster with RBD, we'd > > > need 250 million (4MB) objects. So, object-count-wise this was a full > > > cluster, but for the real volume it was more like 70TB IIRC (there > > > were some other larger objects too). > > > > > Ah, I see. ^^ > > > > > In that case, the backfilling was CPU-bound, or perhaps > > > wbthrottle-bound, I don't remember... It was just that there were > > > many tiny tiny objects to synchronize. > > > > > Indeed. This is something me and others have seen as well, as in > > backfilling b
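As a back-of-the-envelope check on the 25% figure: the 4x 100GB DC S3700 = 800MB/s statement above implies roughly 200MB/s of sequential write per SSD (an assumption for illustration), which gives:

  JBOD journals : 4 x 200MB/s       = 800MB/s usable journal bandwidth
  RAID5 (3+1)   : (4 - 1) x 200MB/s = 600MB/s usable, one drive's worth going to parity
  Difference    : (800 - 600) / 800 = 25% of throughput lost to RAID5

The IOPS side is a separate question, which is where the parity calculation cost (software RAID CPU time or a decent HW RAID controller) comes in.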
Re: [ceph-users] SSD journal deployment experiences
RAID5... Hadn't considered it due to the IOPS penalty (it would get 1/4th of the IOPS of separated journal devices, according to some online raid calc). Compared to RAID10, I guess we'd get 50% more capacity, but lower performance. After the anecdotes that the DCS3700 is very rarely failing, and without a stable bcache to build upon, I'm leaning toward the usual 5 journal partitions per SSD. But that will leave at least 100GB free per drive, so I might try running an OSD there. Cheers, Dan On Sep 6, 2014 6:07 PM, Scott Laird wrote: Backing up slightly, have you considered RAID 5 over your SSDs? Practically speaking, there's no performance downside to RAID 5 when your devices aren't IOPS-bound. On Sat Sep 06 2014 at 8:37:56 AM Christian Balzer mailto:ch...@gol.com>> wrote: On Sat, 6 Sep 2014 14:50:20 + Dan van der Ster wrote: > September 6 2014 4:01 PM, "Christian Balzer" > mailto:ch...@gol.com>> wrote: > > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote: > > > >> Hi Christian, > >> > >> Let's keep debating until a dev corrects us ;) > > > > For the time being, I give the recent: > > > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html > > > > And not so recent: > > http://www.spinics.net/lists/ceph-users/msg04152.html > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021 > > > > And I'm not going to use BTRFS for mainly RBD backed VM images > > (fragmentation city), never mind the other stability issues that crop > > up here ever so often. > > > Thanks for the links... So until I learn otherwise, I better assume the > OSD is lost when the journal fails. Even though I haven't understood > exactly why :( I'm going to UTSL to understand the consistency better. > An op state diagram would help, but I didn't find one yet. > Using the source as an option of last resort is always nice, having to actually do so for something like this feels a bit lacking in the documentation department (that or my google foo being weak). ^o^ > BTW, do you happen to know, _if_ we re-use an OSD after the journal has > failed, are any object inconsistencies going to be found by a > scrub/deep-scrub? > No idea. And really a scenario I hope to never encounter. ^^;; > >> > >> We have 4 servers in a 3U rack, then each of those servers is > >> connected to one of these enclosures with a single SAS cable. > >> > With the current config, when I dd to all drives in parallel I can > write at 24*74MB/s = 1776MB/s. > >>> > >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0 > >>> lanes, so as far as that bus goes, it can do 4GB/s. > >>> And given your storage pod I assume it is connected with 2 mini-SAS > >>> cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA > >>> bandwidth. > >> > >> From above, we are only using 4 lanes -- so around 2GB/s is expected. > > > > Alright, that explains that then. Any reason for not using both ports? > > > > Probably to minimize costs, and since the single 10Gig-E is a bottleneck > anyway. The whole thing is suboptimal anyway, since this hardware was > not purchased for Ceph to begin with. Hence retrofitting SSDs, etc... > The single 10Gb/s link is the bottleneck for sustained stuff, but when looking at spikes... Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port might also get some loving. ^o^ The cluster I'm currently building is based on storage nodes with 4 SSDs (100GB DC 3700s, so 800MB/s would be the absolute write speed limit) and 8 HDDs. Connected with 40Gb/s Infiniband. 
Dual port, dual switch for redundancy, not speed. ^^ > >>> Impressive, even given your huge cluster with 1128 OSDs. > >>> However that's not really answering my question, how much data is on > >>> an average OSD and thus gets backfilled in that hour? > >> > >> That's true -- our drives have around 300TB on them. So I guess it > >> will take longer - 3x longer - when the drives are 1TB full. > > > > On your slides, when the crazy user filled the cluster with 250 million > > objects and thus 1PB of data, I recall seeing a 7 hour backfill time? > > > > Yeah that was fun :) It was 250 million (mostly) 4k objects, so not > close to 1PB. The point was that to fill the cluster with RBD, we'd need > 250 million (4MB) objects. So, object-count-wise this was a full > cluster, but for the real volume it was more like 70TB IIRC (there were > some other larger objects too). > Ah, I see. ^^ > In that case, the backfilling was CPU-bound, or perhaps > wbthrottle-bound, I don't remember... It was just that there were many > tiny tiny objects to synchronize. > Indeed. This is something me and others have seen as well, as in backfilling being much slower than the underlying HW would permit and being CPU intensive. > > Anyway, I guess the lesson to take away from this is that size and > > parallelism does indeed help, but even in a cluster like yours > > recovering from a 2TB loss would likely be in the 10 hour range... > > Big
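For reference, carving a journal SSD into the five partitions mentioned above could look roughly like the following; the device name, the 10 GB partition size and the partition labels are assumptions, not taken from the thread:

  # five ~10 GB journal partitions on a hypothetical journal SSD, rest left free
  for i in 1 2 3 4 5; do
      sgdisk --new=${i}:0:+10G --change-name=${i}:journal-${i} /dev/sdb
  done
  # each OSD then points at its own partition, e.g. in ceph.conf:
  #   osd journal = /dev/disk/by-partlabel/journal-1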
Re: [ceph-users] ceph osd unexpected error
Have you set the open file descriptor limit in the OSD node ? Try setting it like 'ulimit -n 65536" -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Haomai Wang Sent: Saturday, September 06, 2014 7:44 AM To: 廖建锋 Cc: ceph-users; ceph-devel Subject: Re: [ceph-users] ceph osd unexpected error Hi, Could you give some more detail infos such as operation before occur errors? And what's your ceph version? On Fri, Sep 5, 2014 at 3:16 PM, 廖建锋 wrote: > Dear CEPH , > Urgent question, I met a "FAILED assert(0 == "unexpected error")" > yesterday , Now i have not way to start this OSDS I have attached my > logs in the attachment, and some ceph configurations as below > > > osd_pool_default_pgp_num = 300 > osd_pool_default_size = 2 > osd_pool_default_min_size = 1 > osd_pool_default_pg_num = 300 > mon_host = 10.1.0.213,10.1.0.214 > osd_crush_chooseleaf_type = 1 > mds_cache_size = 50 > osd objectstore = keyvaluestore-dev > > > > Detailed error information : > > >-13> 2014-09-05 15:07:35.279863 7f4d988b9700 2 waiting 51 > 50 ops > || > 11642907 > 104857600 > -12> 2014-09-05 15:07:35.279899 7f4d978b7700 2 waiting 51 > 50 ops || > 11642899 > 104857600 > -11> 2014-09-05 15:07:35.279919 7f4d990ba700 2 waiting 51 > 50 ops || > 11642901 > 104857600 > -10> 2014-09-05 15:07:35.326803 7f4d9a8bd700 10 monclient: tick > -9> 2014-09-05 15:07:35.326837 7f4d9a8bd700 10 monclient: > _check_auth_rotating have uptodate secrets (they expire after > 2014-09-05 > 15:07:05.326835) > -8> 2014-09-05 15:07:35.326871 7f4d9a8bd700 10 monclient: renew subs? (now: > 2014-09-05 15:07:35.326871; renew after: 2014-09-05 15:10:02.464341) > -- no > -7> 2014-09-05 15:07:35.343657 7f4d978b7700 2 waiting 51 > 50 ops || > 11044551 > 104857600 > -6> 2014-09-05 15:07:35.343654 7f4e1ee72700 1 -- 10.1.0.221:6801/4013 > -6> --> > osd.12 10.1.0.219:6810/32654 -- pg_info(1 pgs e1267:0.f1) v4 -- ?+0 > 0x18dcf000 > -5> 2014-09-05 15:07:35.343680 7f4d990ba700 2 waiting 51 > 50 ops || > 11044553 > 104857600 > -4> 2014-09-05 15:07:35.343686 7f4d988b9700 2 waiting 51 > 50 ops || > 11044579 > 104857600 > -3> 2014-09-05 15:07:35.344875 7f4e1fe74700 0 error (22) Invalid > -3> argument > not handled on operation 9 (336.0.3, or op 3, counting from 0) > -2> 2014-09-05 15:07:35.344902 7f4e1fe74700 0 unexpected error code > -1> 2014-09-05 15:07:35.344903 7f4e1fe74700 0 transaction dump: > { "ops": [ > { "op_num": 0, > "op_name": "remove", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0"}, > { "op_num": 1, > "op_name": "mkcoll", > "collection": "0.a9_TEMP"}, > { "op_num": 2, > "op_name": "remove", > "collection": "0.a9_TEMP", > "oid": "4b0fea9\/153b885.\/head\/\/0"}, > { "op_num": 3, > "op_name": "touch", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0"}, > { "op_num": 4, > "op_name": "omap_setheader", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0", > "header_length": "0"}, > { "op_num": 5, > "op_name": "write", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0", > "length": 1160, > "offset": 0, > "bufferlist length": 1160}, > { "op_num": 6, > "op_name": "omap_setkeys", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0", > "attr_lens": {}}, > { "op_num": 7, > "op_name": "setattrs", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0", > "attr_lens": { "_": 239, > "_parent": 250, > "snapset": 31}}, > { "op_num": 8, > "op_name": "omap_setkeys", > "collection": "meta", > "oid": 
"16ef7597\/infos\/head\/\/-1", > "attr_lens": { "0.a9_epoch": 4, > "0.a9_info": 684}}, > { "op_num": 9, > "op_name": "remove", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.\/head\/\/0"}, > { "op_num": 10, > "op_name": "remove", > "collection": "0.a9_TEMP", > "oid": "4c56f2a9\/1c04096.\/head\/\/0"}, > { "op_num": 11, > "op_name": "touch", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.\/head\/\/0"}, > { "op_num": 12, > "op_name": "omap_setheader", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.\/head\/\/0", > "header_length": "0"}, > { "op_num": 13, > "op_name": "write", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.\/head\/\/0", > "length": 507284, > "offset": 0, > "bufferlist length": 507284}, > { "op_num": 14, > "op_name": "omap_setkeys", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.\/head\/\/0", > "attr_lens": {}}, > { "op_num": 15, > "op_name": "setattrs", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.\/head\/\/0", > "attr_lens": { "_": 239, > "snapset": 31}}, > { "op_num": 16, > "op_name": "omap_setkeys", > "collection": "meta", > "oid": "16ef7597\/infos\/head\/\/-1", > "attr_lens": { "0.a9_epoch": 4, > "0.a9_info": 684}}, > { "op_num": 17, > "op_name": "remove", > "col
Re: [ceph-users] resizing the OSD
Thanks Christian. Replies inline. On Sep 6, 2014, at 8:04 AM, Christian Balzer wrote: > > Hello, > > On Fri, 05 Sep 2014 15:31:01 -0700 JIten Shah wrote: > >> Hello Cephers, >> >> We created a ceph cluster with 100 OSD, 5 MON and 1 MSD and most of the >> stuff seems to be working fine but we are seeing some degrading on the >> osd's due to lack of space on the osd's. > > Please elaborate on that degradation. The degradation happened on few OSD's because it got quickly filled up. They were not of the same size as the other OSD's. Now I want to remove these OSD's and readd them with correct size to match the others. > >> Is there a way to resize the >> OSD without bringing the cluster down? >> > > Define both "resize" and "cluster down". Basically I want to remove the OSD's with incorrect size and readd them with the size matching the other OSD's. > > As in, resizing how? > Are your current OSDs on disks/LVMs that are not fully used and thus could > be grown? > What is the size of your current OSDs? The size of current OSD's is 20GB and we do have more unused space on the disk that we can make the LVM bigger and increase the size of the OSD's. I agree that we need to have all the disks of same size and I am working towards that.Thanks. > > The normal way of growing a cluster is to add more OSDs. > Preferably of the same size and same performance disks. > This will not only simplify things immensely but also make them a lot more > predictable. > This of course depends on your use case and usage patterns, but often when > running out of space you're also running out of other resources like CPU, > memory or IOPS of the disks involved. So adding more instead of growing > them is most likely the way forward. > > If you were to replace actual disks with larger ones, take them (the OSDs) > out one at a time and re-add it. If you're using ceph-deploy, it will use > the disk size as basic weight, if you're doing things manually make sure > to specify that size/weight accordingly. > Again, you do want to do this for all disks to keep things uniform. > > If your cluster (pools really) are set to a replica size of at least 2 > (risky!) or 3 (as per Firefly default), taking a single OSD out would of > course never bring the cluster down. > However taking an OSD out and/or adding a new one will cause data movement > that might impact your cluster's performance. > We have a current replica size of 2 with 100 OSD's. How many can I loose without affecting the performance? I understand the impact of data movement. --Jiten > Regards, > > Christian > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Fusion Communications > http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
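A rough outline of the remove-and-re-add cycle described above, done one OSD at a time so only one replica is ever missing (OSD id 42 is illustrative):

  ceph osd out 42                # let data drain off the OSD
  ceph -w                        # wait for recovery/backfill to finish
  service ceph stop osd.42       # stop the daemon on its host
  ceph osd crush remove osd.42   # drop it from the CRUSH map
  ceph auth del osd.42           # remove its authentication key
  ceph osd rm 42                 # remove it from the cluster
  # then recreate it on the correctly sized device/LV and let it backfill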
Re: [ceph-users] Huge issues with slow requests
Hi, Unfortunatly the journal tuning did not do much. That’s odd, because I don’t see much utilisation on OSDs themselves. Now this leads to a network-issue between the OSDs right? On 06 Sep 2014, at 18:17, Josef Johansson wrote: > Hi, > > On 06 Sep 2014, at 17:59, Christian Balzer wrote: > >> >> Hello, >> >> On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote: >> >>> Hi, >>> >>> On 06 Sep 2014, at 17:27, Christian Balzer wrote: >>> Hello, On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote: > We manage to go through the restore, but the performance degradation > is still there. > Manifesting itself how? >>> Awful slow io on the VMs, and iowait, it’s about 2MB/s or so. >>> But mostly a lot of iowait. >>> >> I was thinking about the storage nodes. ^^ >> As in, does a particular node or disk seem to be redlined all the time? > They’re idle, with little io wait. It also shows it self as earlier, with slow requests now and then. Like this 2014-09-06 19:13:28.469533 osd.25 10.168.7.23:6827/11423 362 : [WRN] slow request 31.554785 seconds old, received at 2014-09-06 19:12:56.914688: osd_op(client.12483520.0:12211087 rbd_data.4b8e9b3d1b58ba.1222 [stat,write 3813376~4096] 3.3bfab9da e15861) v4 currently waiting for subops from [13,2] 2014-09-06 19:13:28.469536 osd.25 10.168.7.23:6827/11423 363 : [WRN] slow request 31.554736 seconds old, received at 2014-09-06 19:12:56.914737: osd_op(client.12483520.0:12211088 rbd_data.4b8e9b3d1b58ba.1222 [stat,write 3842048~8192] 3.3bfab9da e15861) v4 currently waiting for subops from [13,2] 2014-09-06 19:13:28.469539 osd.25 10.168.7.23:6827/11423 364 : [WRN] slow request 30.691760 seconds old, received at 2014-09-06 19:12:57.13: osd_op(client.12646408.0:36726433 rbd_data.81ab322eb141f2.ec38 [stat,write 749568~4096] 3.7ae1c1da e15861) v4 currently waiting for subops from [13,2] 2014-09-06 19:13:31.469946 osd.25 10.168.7.23:6827/11423 365 : [WRN] 23 slow requests, 2 included below; oldest blocked for > 42.196747 secs 2014-09-06 19:13:31.469951 osd.25 10.168.7.23:6827/11423 366 : [WRN] slow request 30.344653 seconds old, received at 2014-09-06 19:13:01.125248: osd_op(client.18869229.0:100325 rbd_data.41d2eb2eb141f2.2732 [stat,write 2174976~4096] 3.55d437e e15861) v4 currently waiting for subops from [13,6] 2014-09-06 19:13:31.469954 osd.25 10.168.7.23:6827/11423 367 : [WRN] slow request 30.344579 seconds old, received at 2014-09-06 19:13:01.125322: osd_op(client.18869229.0:100326 rbd_data.41d2eb2eb141f2.2732 [stat,write 2920448~4096] 3.55d437e e15861) v4 currently waiting for subops from [13,6] 2014-09-06 19:13:32.470156 osd.25 10.168.7.23:6827/11423 368 : [WRN] 24 slow requests, 1 included below; oldest blocked for > 43.196971 secs 2014-09-06 19:13:32.470163 osd.25 10.168.7.23:6827/11423 369 : [WRN] slow request 30.627252 seconds old, received at 2014-09-06 19:13:01.842873: osd_op(client.10785413.0:136148901 rbd_data.96803f2eb141f2.33d7 [stat,write 4063232~4096] 3.cf740399 e15861) v4 currently waiting for subops from [1,13] 2014-09-06 19:13:37.470895 osd.25 10.168.7.23:6827/11423 370 : [WRN] 27 slow requests, 3 included below; oldest blocked for > 48.197700 secs 2014-09-06 19:13:37.470902 osd.25 10.168.7.23:6827/11423 371 : [WRN] slow request 30.769509 seconds old, received at 2014-09-06 19:13:06.701345: osd_op(client.18777372.0:1605468 rbd_data.2f1e4e2eb141f2.3541 [stat,write 1118208~4096] 3.db1ca37e e15861) v4 currently waiting for subops from [13,6] 2014-09-06 19:13:37.470907 osd.25 10.168.7.23:6827/11423 372 : [WRN] slow request 30.769458 seconds old, 
received at 2014-09-06 19:13:06.701396: osd_op(client.18777372.0:1605469 rbd_data.2f1e4e2eb141f2.3541 [stat,write 1130496~4096] 3.db1ca37e e15861) v4 currently waiting for subops from [13,6] 2014-09-06 19:13:37.470910 osd.25 10.168.7.23:6827/11423 373 : [WRN] slow request 30.266843 seconds old, received at 2014-09-06 19:13:07.204011: osd_op(client.18795696.0:847270 rbd_data.30532e2eb141f2.36bd [stat,write 3772416~4096] 3.76f1df7e e15861) v4 currently waiting for subops from [13,6] 2014-09-06 19:13:38.471152 osd.25 10.168.7.23:6827/11423 374 : [WRN] 30 slow requests, 3 included below; oldest blocked for > 49.197952 secs 2014-09-06 19:13:38.471158 osd.25 10.168.7.23:6827/11423 375 : [WRN] slow request 30.706236 seconds old, received at 2014-09-06 19:13:07.764870: osd_op(client.12483523.0:36628673 rbd_data.4defd32eb141f2.00015200 [stat,write 2121728~4096] 3.cd82ed8a e15861) v4 currently waiting for subops from [0,13] 2014-09-06 19:13:38.471162 osd.25 10.168.7.23:6827/11423 376 : [WRN] slow request 30.695616 seconds old, received at 2014-09-06 19:13:07.775490: osd_op(client.10785416.0:72721328 rbd_data.96808f2eb141f2.2a37 [stat,write 1507328~4096] 3.323e11da e15861) v4 currently wait
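One quick way to test the network-between-OSDs theory is to look at the error and drop counters on the cluster-facing links of each storage node; the interface names here are assumptions:

  ip -s link show bond0                       # RX/TX errors and drops on the bond
  cat /proc/net/bonding/bond0                 # per-slave link state and failure counts
  ethtool -S eth2 | grep -Ei 'err|drop|crc'   # NIC-level counters on each slave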
Re: [ceph-users] Huge issues with slow requests
On 06 Sep 2014, at 19:37, Josef Johansson wrote: > Hi, > > Unfortunatly the journal tuning did not do much. That’s odd, because I don’t > see much utilisation on OSDs themselves. Now this leads to a network-issue > between the OSDs right? > To answer my own question. Restarted a bond and it all went up again, found the culprit — packet loss. Everything up and running afterwards. I’ll be taking that beer now, Regards, Josef > On 06 Sep 2014, at 18:17, Josef Johansson wrote: > >> Hi, >> >> On 06 Sep 2014, at 17:59, Christian Balzer wrote: >> >>> >>> Hello, >>> >>> On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote: >>> Hi, On 06 Sep 2014, at 17:27, Christian Balzer wrote: > > Hello, > > On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote: > >> We manage to go through the restore, but the performance degradation >> is still there. >> > Manifesting itself how? > Awful slow io on the VMs, and iowait, it’s about 2MB/s or so. But mostly a lot of iowait. >>> I was thinking about the storage nodes. ^^ >>> As in, does a particular node or disk seem to be redlined all the time? >> They’re idle, with little io wait. > It also shows it self as earlier, with slow requests now and then. > > Like this > 2014-09-06 19:13:28.469533 osd.25 10.168.7.23:6827/11423 362 : [WRN] slow > request 31.554785 seconds old, received at 2014-09-06 19:12:56.914688: > osd_op(client.12483520.0:12211087 rbd_data.4b8e9b3d1b58ba.1222 > [stat,write 3813376~4096] 3.3bfab9da e15861) v4 currently waiting for subops > from [13,2] > 2014-09-06 19:13:28.469536 osd.25 10.168.7.23:6827/11423 363 : [WRN] slow > request 31.554736 seconds old, received at 2014-09-06 19:12:56.914737: > osd_op(client.12483520.0:12211088 rbd_data.4b8e9b3d1b58ba.1222 > [stat,write 3842048~8192] 3.3bfab9da e15861) v4 currently waiting for subops > from [13,2] > 2014-09-06 19:13:28.469539 osd.25 10.168.7.23:6827/11423 364 : [WRN] slow > request 30.691760 seconds old, received at 2014-09-06 19:12:57.13: > osd_op(client.12646408.0:36726433 rbd_data.81ab322eb141f2.ec38 > [stat,write 749568~4096] 3.7ae1c1da e15861) v4 currently waiting for subops > from [13,2] > 2014-09-06 19:13:31.469946 osd.25 10.168.7.23:6827/11423 365 : [WRN] 23 slow > requests, 2 included below; oldest blocked for > 42.196747 secs > 2014-09-06 19:13:31.469951 osd.25 10.168.7.23:6827/11423 366 : [WRN] slow > request 30.344653 seconds old, received at 2014-09-06 19:13:01.125248: > osd_op(client.18869229.0:100325 rbd_data.41d2eb2eb141f2.2732 > [stat,write 2174976~4096] 3.55d437e e15861) v4 currently waiting for subops > from [13,6] > 2014-09-06 19:13:31.469954 osd.25 10.168.7.23:6827/11423 367 : [WRN] slow > request 30.344579 seconds old, received at 2014-09-06 19:13:01.125322: > osd_op(client.18869229.0:100326 rbd_data.41d2eb2eb141f2.2732 > [stat,write 2920448~4096] 3.55d437e e15861) v4 currently waiting for subops > from [13,6] > 2014-09-06 19:13:32.470156 osd.25 10.168.7.23:6827/11423 368 : [WRN] 24 slow > requests, 1 included below; oldest blocked for > 43.196971 secs > 2014-09-06 19:13:32.470163 osd.25 10.168.7.23:6827/11423 369 : [WRN] slow > request 30.627252 seconds old, received at 2014-09-06 19:13:01.842873: > osd_op(client.10785413.0:136148901 rbd_data.96803f2eb141f2.33d7 > [stat,write 4063232~4096] 3.cf740399 e15861) v4 currently waiting for subops > from [1,13] > 2014-09-06 19:13:37.470895 osd.25 10.168.7.23:6827/11423 370 : [WRN] 27 slow > requests, 3 included below; oldest blocked for > 48.197700 secs > 2014-09-06 19:13:37.470902 osd.25 10.168.7.23:6827/11423 371 : [WRN] 
slow > request 30.769509 seconds old, received at 2014-09-06 19:13:06.701345: > osd_op(client.18777372.0:1605468 rbd_data.2f1e4e2eb141f2.3541 > [stat,write 1118208~4096] 3.db1ca37e e15861) v4 currently waiting for subops > from [13,6] > 2014-09-06 19:13:37.470907 osd.25 10.168.7.23:6827/11423 372 : [WRN] slow > request 30.769458 seconds old, received at 2014-09-06 19:13:06.701396: > osd_op(client.18777372.0:1605469 rbd_data.2f1e4e2eb141f2.3541 > [stat,write 1130496~4096] 3.db1ca37e e15861) v4 currently waiting for subops > from [13,6] > 2014-09-06 19:13:37.470910 osd.25 10.168.7.23:6827/11423 373 : [WRN] slow > request 30.266843 seconds old, received at 2014-09-06 19:13:07.204011: > osd_op(client.18795696.0:847270 rbd_data.30532e2eb141f2.36bd > [stat,write 3772416~4096] 3.76f1df7e e15861) v4 currently waiting for subops > from [13,6] > 2014-09-06 19:13:38.471152 osd.25 10.168.7.23:6827/11423 374 : [WRN] 30 slow > requests, 3 included below; oldest blocked for > 49.197952 secs > 2014-09-06 19:13:38.471158 osd.25 10.168.7.23:6827/11423 375 : [WRN] slow > request 30.706236 seconds old, received at 2014-09-06 19:13:07.764870: > osd_op(client.12483523.0:36628673 rbd_data.4defd
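For the record, blocked requests like the ones quoted above can also be inspected on the suspect OSD through its admin socket, which shows where each op is stuck (the OSD id is taken from the log above):

  ceph daemon osd.25 dump_ops_in_flight    # currently blocked / in-flight ops and their state
  ceph daemon osd.25 dump_historic_ops     # recently completed slow ops and where they spent time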
Re: [ceph-users] SSD journal deployment experiences
IOPS are weird things with SSDs. In theory, you'd see 25% of the write IOPS when writing to a 4-way RAID5 device, since you write to all 4 devices in parallel. Except that's not actually true--unlike HDs where an IOP is an IOP, SSD IOPS limits are really just a function of request size. Because each operation would be ~1/3rd the size, you should see a net of about 3x the performance of one drive overall, or 75% of the sum of the drives. The CPU use will be higher, but it may or may not be a substantial hit for your use case. Journals are basically write-only, and 200G S3700s are supposed to be able to sustain around 360 MB/sec, so RAID 5 would give you somewhere around 1 GB/sec writing on paper. Depending on your access patterns, that may or may not be a win vs single SSDs; it should give you slightly lower latency for uncongested writes at the very least. It's probably worth benchmarking if you have the time. OTOH, S3700s seem to be pretty reliable, and if your cluster is big enough to handle the loss of 5 OSDs without a big hit, then the lack of complexity may be a bigger win all on its own. Scott On Sat Sep 06 2014 at 9:28:32 AM Dan Van Der Ster wrote: > RAID5... Hadn't considered it due to the IOPS penalty (it would get > 1/4th of the IOPS of separated journal devices, according to some online > raid calc). Compared to RAID10, I guess we'd get 50% more capacity, but > lower performance. > > After the anecdotes that the DCS3700 is very rarely failing, and without a > stable bcache to build upon, I'm leaning toward the usual 5 journal > partitions per SSD. But that will leave at least 100GB free per drive, so I > might try running an OSD there. > > Cheers, Dan > On Sep 6, 2014 6:07 PM, Scott Laird wrote: > Backing up slightly, have you considered RAID 5 over your SSDs? > Practically speaking, there's no performance downside to RAID 5 when your > devices aren't IOPS-bound. > > On Sat Sep 06 2014 at 8:37:56 AM Christian Balzer wrote: > >> On Sat, 6 Sep 2014 14:50:20 + Dan van der Ster wrote: >> >> > September 6 2014 4:01 PM, "Christian Balzer" wrote: >> > > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote: >> > > >> > >> Hi Christian, >> > >> >> > >> Let's keep debating until a dev corrects us ;) >> > > >> > > For the time being, I give the recent: >> > > >> > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html >> > > >> > > And not so recent: >> > > http://www.spinics.net/lists/ceph-users/msg04152.html >> > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021 >> > > >> > > And I'm not going to use BTRFS for mainly RBD backed VM images >> > > (fragmentation city), never mind the other stability issues that crop >> > > up here ever so often. >> > >> > >> > Thanks for the links... So until I learn otherwise, I better assume the >> > OSD is lost when the journal fails. Even though I haven't understood >> > exactly why :( I'm going to UTSL to understand the consistency better. >> > An op state diagram would help, but I didn't find one yet. >> > >> Using the source as an option of last resort is always nice, having to >> actually do so for something like this feels a bit lacking in the >> documentation department (that or my google foo being weak). ^o^ >> >> > BTW, do you happen to know, _if_ we re-use an OSD after the journal has >> > failed, are any object inconsistencies going to be found by a >> > scrub/deep-scrub? >> > >> No idea. >> And really a scenario I hope to never encounter. 
^^;; >> >> > >> >> > >> We have 4 servers in a 3U rack, then each of those servers is >> > >> connected to one of these enclosures with a single SAS cable. >> > >> >> > With the current config, when I dd to all drives in parallel I can >> > write at 24*74MB/s = 1776MB/s. >> > >>> >> > >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0 >> > >>> lanes, so as far as that bus goes, it can do 4GB/s. >> > >>> And given your storage pod I assume it is connected with 2 mini-SAS >> > >>> cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA >> > >>> bandwidth. >> > >> >> > >> From above, we are only using 4 lanes -- so around 2GB/s is expected. >> > > >> > > Alright, that explains that then. Any reason for not using both ports? >> > > >> > >> > Probably to minimize costs, and since the single 10Gig-E is a bottleneck >> > anyway. The whole thing is suboptimal anyway, since this hardware was >> > not purchased for Ceph to begin with. Hence retrofitting SSDs, etc... >> > >> The single 10Gb/s link is the bottleneck for sustained stuff, but when >> looking at spikes... >> Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port >> might also get some loving. ^o^ >> >> The cluster I'm currently building is based on storage nodes with 4 SSDs >> (100GB DC 3700s, so 800MB/s would be the absolute write speed limit) and 8 >> HDDs. Connected with 40Gb/s Infiniband. Dual port, dual switch for >> redundancy, not speed. ^^ >> >> > >>>
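If you do benchmark the RAID 5 journal idea, something along these lines would roughly approximate journal traffic; the fio parameters and the md device name are assumptions, and it will overwrite the target device:

  fio --name=journal-sim --filename=/dev/md0 --rw=write --bs=64k \
      --ioengine=libaio --iodepth=16 --direct=1 --sync=1 \
      --runtime=60 --time_based --group_reporting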
Re: [ceph-users] Huge issues with slow requests
On Sat, 6 Sep 2014 19:47:13 +0200 Josef Johansson wrote: > > On 06 Sep 2014, at 19:37, Josef Johansson wrote: > > > Hi, > > > > Unfortunatly the journal tuning did not do much. That’s odd, because I > > don’t see much utilisation on OSDs themselves. Now this leads to a > > network-issue between the OSDs right? > > > To answer my own question. Restarted a bond and it all went up again, > found the culprit — packet loss. Everything up and running afterwards. > If there were actual errors, that should have been visible in atop as well. For utilization it isn't that obvious, as it doesn't know what bandwidth a bond device has. Same is true for IPoIB interfaces. And FWIW, tap (kvm guest interfaces) are wrongly pegged in the kernel at 10Mb/s, so they get to be falsely redlined on compute nodes all the time. > I’ll be taking that beer now, Skol. Christian > Regards, > Josef > > On 06 Sep 2014, at 18:17, Josef Johansson wrote: > > > >> Hi, > >> > >> On 06 Sep 2014, at 17:59, Christian Balzer wrote: > >> > >>> > >>> Hello, > >>> > >>> On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote: > >>> > Hi, > > On 06 Sep 2014, at 17:27, Christian Balzer wrote: > > > > > Hello, > > > > On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote: > > > >> We manage to go through the restore, but the performance > >> degradation is still there. > >> > > Manifesting itself how? > > > Awful slow io on the VMs, and iowait, it’s about 2MB/s or so. > But mostly a lot of iowait. > > >>> I was thinking about the storage nodes. ^^ > >>> As in, does a particular node or disk seem to be redlined all the > >>> time? > >> They’re idle, with little io wait. > > It also shows it self as earlier, with slow requests now and then. > > > > Like this > > 2014-09-06 19:13:28.469533 osd.25 10.168.7.23:6827/11423 362 : [WRN] > > slow request 31.554785 seconds old, received at 2014-09-06 > > 19:12:56.914688: osd_op(client.12483520.0:12211087 > > rbd_data.4b8e9b3d1b58ba.1222 [stat,write 3813376~4096] > > 3.3bfab9da e15861) v4 currently waiting for subops from [13,2] > > 2014-09-06 19:13:28.469536 osd.25 10.168.7.23:6827/11423 363 : [WRN] > > slow request 31.554736 seconds old, received at 2014-09-06 > > 19:12:56.914737: osd_op(client.12483520.0:12211088 > > rbd_data.4b8e9b3d1b58ba.1222 [stat,write 3842048~8192] > > 3.3bfab9da e15861) v4 currently waiting for subops from [13,2] > > 2014-09-06 19:13:28.469539 osd.25 10.168.7.23:6827/11423 364 : [WRN] > > slow request 30.691760 seconds old, received at 2014-09-06 > > 19:12:57.13: osd_op(client.12646408.0:36726433 > > rbd_data.81ab322eb141f2.ec38 [stat,write 749568~4096] > > 3.7ae1c1da e15861) v4 currently waiting for subops from [13,2] > > 2014-09-06 19:13:31.469946 osd.25 10.168.7.23:6827/11423 365 : [WRN] > > 23 slow requests, 2 included below; oldest blocked for > 42.196747 > > secs 2014-09-06 19:13:31.469951 osd.25 10.168.7.23:6827/11423 366 : > > [WRN] slow request 30.344653 seconds old, received at 2014-09-06 > > 19:13:01.125248: osd_op(client.18869229.0:100325 > > rbd_data.41d2eb2eb141f2.2732 [stat,write 2174976~4096] > > 3.55d437e e15861) v4 currently waiting for subops from [13,6] > > 2014-09-06 19:13:31.469954 osd.25 10.168.7.23:6827/11423 367 : [WRN] > > slow request 30.344579 seconds old, received at 2014-09-06 > > 19:13:01.125322: osd_op(client.18869229.0:100326 > > rbd_data.41d2eb2eb141f2.2732 [stat,write 2920448~4096] > > 3.55d437e e15861) v4 currently waiting for subops from [13,6] > > 2014-09-06 19:13:32.470156 osd.25 10.168.7.23:6827/11423 368 : [WRN] > > 24 slow 
requests, 1 included below; oldest blocked for > 43.196971 > > secs 2014-09-06 19:13:32.470163 osd.25 10.168.7.23:6827/11423 369 : > > [WRN] slow request 30.627252 seconds old, received at 2014-09-06 > > 19:13:01.842873: osd_op(client.10785413.0:136148901 > > rbd_data.96803f2eb141f2.33d7 [stat,write 4063232~4096] > > 3.cf740399 e15861) v4 currently waiting for subops from [1,13] > > 2014-09-06 19:13:37.470895 osd.25 10.168.7.23:6827/11423 370 : [WRN] > > 27 slow requests, 3 included below; oldest blocked for > 48.197700 > > secs 2014-09-06 19:13:37.470902 osd.25 10.168.7.23:6827/11423 371 : > > [WRN] slow request 30.769509 seconds old, received at 2014-09-06 > > 19:13:06.701345: osd_op(client.18777372.0:1605468 > > rbd_data.2f1e4e2eb141f2.3541 [stat,write 1118208~4096] > > 3.db1ca37e e15861) v4 currently waiting for subops from [13,6] > > 2014-09-06 19:13:37.470907 osd.25 10.168.7.23:6827/11423 372 : [WRN] > > slow request 30.769458 seconds old, received at 2014-09-06 > > 19:13:06.701396: osd_op(client.18777372.0:1605469 > > rbd_data.2f1e4e2eb141f2.3541 [stat,write 1130496~4096] > > 3.db1ca37e e15861) v4 currently waiting for subops from [13,6] > > 2014-09-06 19:13:37.470910 osd.25 10.168.7.23:6827/11423 373 : [WRN] > > slow
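Besides atop, the raw per-interface counters are easy to watch directly and would have shown the packet loss on the bond; the interface name is illustrative:

  sar -n EDEV 1                     # per-interface error/drop rates (sysstat package)
  watch -d 'ip -s link show bond0'  # raw RX/TX error counters, highlighted as they change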
[ceph-users] Re: ceph osd unexpected error
I am using the latest version, 0.80.6. I am setting the limit now and will keep watching. From: Somnath Roy [somnath@sandisk.com] Sent: September 7, 2014 1:12 To: Haomai Wang; 廖建锋 Cc: ceph-users; ceph-devel Subject: RE: [ceph-users] ceph osd unexpected error Have you set the open file descriptor limit in the OSD node ? Try setting it like 'ulimit -n 65536" -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Haomai Wang Sent: Saturday, September 06, 2014 7:44 AM To: 廖建锋 Cc: ceph-users; ceph-devel Subject: Re: [ceph-users] ceph osd unexpected error Hi, Could you give some more detail infos such as operation before occur errors? And what's your ceph version? On Fri, Sep 5, 2014 at 3:16 PM, 廖建锋 wrote: > Dear CEPH , > Urgent question, I met a "FAILED assert(0 == "unexpected error")" > yesterday , Now i have not way to start this OSDS I have attached my > logs in the attachment, and some ceph configurations as below > > > osd_pool_default_pgp_num = 300 > osd_pool_default_size = 2 > osd_pool_default_min_size = 1 > osd_pool_default_pg_num = 300 > mon_host = 10.1.0.213,10.1.0.214 > osd_crush_chooseleaf_type = 1 > mds_cache_size = 50 > osd objectstore = keyvaluestore-dev > > > > Detailed error information : > > >-13> 2014-09-05 15:07:35.279863 7f4d988b9700 2 waiting 51 > 50 ops > || > 11642907 > 104857600 > -12> 2014-09-05 15:07:35.279899 7f4d978b7700 2 waiting 51 > 50 ops || > 11642899 > 104857600 > -11> 2014-09-05 15:07:35.279919 7f4d990ba700 2 waiting 51 > 50 ops || > 11642901 > 104857600 > -10> 2014-09-05 15:07:35.326803 7f4d9a8bd700 10 monclient: tick > -9> 2014-09-05 15:07:35.326837 7f4d9a8bd700 10 monclient: > _check_auth_rotating have uptodate secrets (they expire after > 2014-09-05 > 15:07:05.326835) > -8> 2014-09-05 15:07:35.326871 7f4d9a8bd700 10 monclient: renew subs?
(now: > 2014-09-05 15:07:35.326871; renew after: 2014-09-05 15:10:02.464341) > -- no > -7> 2014-09-05 15:07:35.343657 7f4d978b7700 2 waiting 51 > 50 ops || > 11044551 > 104857600 > -6> 2014-09-05 15:07:35.343654 7f4e1ee72700 1 -- 10.1.0.221:6801/4013 > -6> --> > osd.12 10.1.0.219:6810/32654 -- pg_info(1 pgs e1267:0.f1) v4 -- ?+0 > 0x18dcf000 > -5> 2014-09-05 15:07:35.343680 7f4d990ba700 2 waiting 51 > 50 ops || > 11044553 > 104857600 > -4> 2014-09-05 15:07:35.343686 7f4d988b9700 2 waiting 51 > 50 ops || > 11044579 > 104857600 > -3> 2014-09-05 15:07:35.344875 7f4e1fe74700 0 error (22) Invalid > -3> argument > not handled on operation 9 (336.0.3, or op 3, counting from 0) > -2> 2014-09-05 15:07:35.344902 7f4e1fe74700 0 unexpected error code > -1> 2014-09-05 15:07:35.344903 7f4e1fe74700 0 transaction dump: > { "ops": [ > { "op_num": 0, > "op_name": "remove", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0"}, > { "op_num": 1, > "op_name": "mkcoll", > "collection": "0.a9_TEMP"}, > { "op_num": 2, > "op_name": "remove", > "collection": "0.a9_TEMP", > "oid": "4b0fea9\/153b885.\/head\/\/0"}, > { "op_num": 3, > "op_name": "touch", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0"}, > { "op_num": 4, > "op_name": "omap_setheader", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0", > "header_length": "0"}, > { "op_num": 5, > "op_name": "write", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0", > "length": 1160, > "offset": 0, > "bufferlist length": 1160}, > { "op_num": 6, > "op_name": "omap_setkeys", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0", > "attr_lens": {}}, > { "op_num": 7, > "op_name": "setattrs", > "collection": "0.a9_head", > "oid": "4b0fea9\/153b885.\/head\/\/0", > "attr_lens": { "_": 239, > "_parent": 250, > "snapset": 31}}, > { "op_num": 8, > "op_name": "omap_setkeys", > "collection": "meta", > "oid": "16ef7597\/infos\/head\/\/-1", > "attr_lens": { "0.a9_epoch": 4, > "0.a9_info": 684}}, > { "op_num": 9, > "op_name": "remove", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.\/head\/\/0"}, > { "op_num": 10, > "op_name": "remove", > "collection": "0.a9_TEMP", > "oid": "4c56f2a9\/1c04096.\/head\/\/0"}, > { "op_num": 11, > "op_name": "touch", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.\/head\/\/0"}, > { "op_num": 12, > "op_name": "omap_setheader", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.\/head\/\/0", > "header_length": "0"}, > { "op_num": 13, > "op_name": "write", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.\/head\/\/0", > "length": 507284, > "offset": 0, > "bufferlist length": 507284}, > { "op_num": 14, > "op_name": "omap_setkeys", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.\/head\/\/0", > "attr_lens": {}}, > { "op_num": 15, > "op_name": "setattrs", > "collection": "0.a9_head", > "oid": "4c56f2a9\/1c04096.
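To confirm whether the descriptor limit is really being hit, the running OSD's effective limit and its current usage can be compared directly; the OSD id and pid lookup are illustrative:

  pid=$(pgrep -f 'ceph-osd -i 12' | head -1)
  grep 'open files' /proc/$pid/limits    # effective soft/hard limit
  ls /proc/$pid/fd | wc -l               # descriptors currently in use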
Re: [ceph-users] resizing the OSD
Hello, On Sat, 06 Sep 2014 10:28:19 -0700 JIten Shah wrote: > Thanks Christian. Replies inline. > On Sep 6, 2014, at 8:04 AM, Christian Balzer wrote: > > > > > Hello, > > > > On Fri, 05 Sep 2014 15:31:01 -0700 JIten Shah wrote: > > > >> Hello Cephers, > >> > >> We created a ceph cluster with 100 OSD, 5 MON and 1 MSD and most of > >> the stuff seems to be working fine but we are seeing some degrading > >> on the osd's due to lack of space on the osd's. > > > > Please elaborate on that degradation. > > The degradation happened on few OSD's because it got quickly filled up. > They were not of the same size as the other OSD's. Now I want to remove > these OSD's and readd them with correct size to match the others. Alright, that's good idea, uniformity helps. ^^ > > > >> Is there a way to resize the > >> OSD without bringing the cluster down? > >> > > > > Define both "resize" and "cluster down". > > Basically I want to remove the OSD's with incorrect size and readd them > with the size matching the other OSD's. > > > > As in, resizing how? > > Are your current OSDs on disks/LVMs that are not fully used and thus > > could be grown? > > What is the size of your current OSDs? > > The size of current OSD's is 20GB and we do have more unused space on > the disk that we can make the LVM bigger and increase the size of the > OSD's. I agree that we need to have all the disks of same size and I am > working towards that.Thanks. > > OK, so your OSDs are backed by LVM. A curious choice, any particular reason to do so? Either way, in theory you could grow things in place, obviously first the LVM and then the underlying filesystem. Both ext4 and xfs support online growing, so the OSD can keep running the whole time. If you're unfamiliar with these things, play with them on a test machine first. Now for the next step we will really need to know how you deployed ceph and the result of "ceph osd tree" (not all 100 OSDs are needed, a sample of a "small" and "big" OSD is sufficient). Depending on the results (it will probably have varying weights depending on the size and a reweight value of 1 for all) you will need to adjust the weight of the grown OSD in question accordingly with "ceph osd crush reweight". That step will incur data movement, so do it one OSD at a time. > > The normal way of growing a cluster is to add more OSDs. > > Preferably of the same size and same performance disks. > > This will not only simplify things immensely but also make them a lot > > more predictable. > > This of course depends on your use case and usage patterns, but often > > when running out of space you're also running out of other resources > > like CPU, memory or IOPS of the disks involved. So adding more instead > > of growing them is most likely the way forward. > > > > If you were to replace actual disks with larger ones, take them (the > > OSDs) out one at a time and re-add it. If you're using ceph-deploy, it > > will use the disk size as basic weight, if you're doing things > > manually make sure to specify that size/weight accordingly. > > Again, you do want to do this for all disks to keep things uniform. > > > > If your cluster (pools really) are set to a replica size of at least 2 > > (risky!) or 3 (as per Firefly default), taking a single OSD out would > > of course never bring the cluster down. > > However taking an OSD out and/or adding a new one will cause data > > movement that might impact your cluster's performance. > > > > We have a current replica size of 2 with 100 OSD's. 
How many can I lose > without affecting the performance? I understand the impact of data > movement. > Unless your LVMs are in turn living on a RAID, a replica of 2 with 100 OSDs is begging Murphy for a double disk failure. I'm also curious about how many actual physical disks those OSDs live on and how many physical hosts are in your cluster. So again, you can't lose more than one OSD at a time w/o losing data. The performance impact of losing a single OSD out of 100 should be small, especially given the size of your OSDs. However w/o knowing your actual cluster (hardware and otherwise) don't expect anybody here to make accurate predictions. Christian > --Jiten > > > > > > Regards, > > Christian > -- > Christian Balzer    Network/Systems Engineer > ch...@gol.com Global OnLine Japan/Fusion Communications > http://www.gol.com/ > > -- Christian Balzer    Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
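A minimal sketch of the grow-in-place path described above, assuming an xfs-backed OSD on a logical volume; the names, sizes and weights are hypothetical:

  lvextend -L +80G /dev/vg_ceph/osd42    # grow the LV backing the OSD
  xfs_growfs /var/lib/ceph/osd/ceph-42   # grow the filesystem online (resize2fs for ext4)
  ceph osd crush reweight osd.42 0.10    # adjust the weight to the new size, one OSD at a time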
Re: [ceph-users] Re: ceph osd unexpected error
Yes, if you still meet this error, please add "debug_keyvaluestore=20/20" to your config and catch the debug output On Sun, Sep 7, 2014 at 11:11 AM, 廖建锋 wrote: > I use latest version 0.80.6 > I am setting the limitation now, and watching? > > > > 发件人: Somnath Roy [somnath@sandisk.com] > 发送时间: 2014年9月7日 1:12 > 到: Haomai Wang; 廖建锋 > Cc: ceph-users; ceph-devel > 主题: RE: [ceph-users] ceph osd unexpected error > > Have you set the open file descriptor limit in the OSD node ? > Try setting it like 'ulimit -n 65536" > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Haomai Wang > Sent: Saturday, September 06, 2014 7:44 AM > To: 廖建锋 > Cc: ceph-users; ceph-devel > Subject: Re: [ceph-users] ceph osd unexpected error > > Hi, > > Could you give some more detail infos such as operation before occur errors? > > And what's your ceph version? > > On Fri, Sep 5, 2014 at 3:16 PM, 廖建锋 wrote: >> Dear CEPH , >> Urgent question, I met a "FAILED assert(0 == "unexpected error")" >> yesterday , Now i have not way to start this OSDS I have attached my >> logs in the attachment, and some ceph configurations as below >> >> >> osd_pool_default_pgp_num = 300 >> osd_pool_default_size = 2 >> osd_pool_default_min_size = 1 >> osd_pool_default_pg_num = 300 >> mon_host = 10.1.0.213,10.1.0.214 >> osd_crush_chooseleaf_type = 1 >> mds_cache_size = 50 >> osd objectstore = keyvaluestore-dev >> >> >> >> Detailed error information : >> >> >>-13> 2014-09-05 15:07:35.279863 7f4d988b9700 2 waiting 51 > 50 ops >> || >> 11642907 > 104857600 >> -12> 2014-09-05 15:07:35.279899 7f4d978b7700 2 waiting 51 > 50 ops || >> 11642899 > 104857600 >> -11> 2014-09-05 15:07:35.279919 7f4d990ba700 2 waiting 51 > 50 ops || >> 11642901 > 104857600 >> -10> 2014-09-05 15:07:35.326803 7f4d9a8bd700 10 monclient: tick >> -9> 2014-09-05 15:07:35.326837 7f4d9a8bd700 10 monclient: >> _check_auth_rotating have uptodate secrets (they expire after >> 2014-09-05 >> 15:07:05.326835) >> -8> 2014-09-05 15:07:35.326871 7f4d9a8bd700 10 monclient: renew subs? 
(now: >> 2014-09-05 15:07:35.326871; renew after: 2014-09-05 15:10:02.464341) >> -- no >> -7> 2014-09-05 15:07:35.343657 7f4d978b7700 2 waiting 51 > 50 ops || >> 11044551 > 104857600 >> -6> 2014-09-05 15:07:35.343654 7f4e1ee72700 1 -- 10.1.0.221:6801/4013 >> -6> --> >> osd.12 10.1.0.219:6810/32654 -- pg_info(1 pgs e1267:0.f1) v4 -- ?+0 >> 0x18dcf000 >> -5> 2014-09-05 15:07:35.343680 7f4d990ba700 2 waiting 51 > 50 ops || >> 11044553 > 104857600 >> -4> 2014-09-05 15:07:35.343686 7f4d988b9700 2 waiting 51 > 50 ops || >> 11044579 > 104857600 >> -3> 2014-09-05 15:07:35.344875 7f4e1fe74700 0 error (22) Invalid >> -3> argument >> not handled on operation 9 (336.0.3, or op 3, counting from 0) >> -2> 2014-09-05 15:07:35.344902 7f4e1fe74700 0 unexpected error code >> -1> 2014-09-05 15:07:35.344903 7f4e1fe74700 0 transaction dump: >> { "ops": [ >> { "op_num": 0, >> "op_name": "remove", >> "collection": "0.a9_head", >> "oid": "4b0fea9\/153b885.\/head\/\/0"}, >> { "op_num": 1, >> "op_name": "mkcoll", >> "collection": "0.a9_TEMP"}, >> { "op_num": 2, >> "op_name": "remove", >> "collection": "0.a9_TEMP", >> "oid": "4b0fea9\/153b885.\/head\/\/0"}, >> { "op_num": 3, >> "op_name": "touch", >> "collection": "0.a9_head", >> "oid": "4b0fea9\/153b885.\/head\/\/0"}, >> { "op_num": 4, >> "op_name": "omap_setheader", >> "collection": "0.a9_head", >> "oid": "4b0fea9\/153b885.\/head\/\/0", >> "header_length": "0"}, >> { "op_num": 5, >> "op_name": "write", >> "collection": "0.a9_head", >> "oid": "4b0fea9\/153b885.\/head\/\/0", >> "length": 1160, >> "offset": 0, >> "bufferlist length": 1160}, >> { "op_num": 6, >> "op_name": "omap_setkeys", >> "collection": "0.a9_head", >> "oid": "4b0fea9\/153b885.\/head\/\/0", >> "attr_lens": {}}, >> { "op_num": 7, >> "op_name": "setattrs", >> "collection": "0.a9_head", >> "oid": "4b0fea9\/153b885.\/head\/\/0", >> "attr_lens": { "_": 239, >> "_parent": 250, >> "snapset": 31}}, >> { "op_num": 8, >> "op_name": "omap_setkeys", >> "collection": "meta", >> "oid": "16ef7597\/infos\/head\/\/-1", >> "attr_lens": { "0.a9_epoch": 4, >> "0.a9_info": 684}}, >> { "op_num": 9, >> "op_name": "remove", >> "collection": "0.a9_head", >> "oid": "4c56f2a9\/1c04096.\/head\/\/0"}, >> { "op_num": 10, >> "op_name": "remove", >> "collection": "0.a9_TEMP", >> "oid": "4c56f2a9\/1c04096.\/head\/\/0"}, >> { "op_num": 11, >> "op_name": "touch", >> "collection": "0.a9_head", >> "oid": "4c56f2a9\/1c04096.\/head\/\/0"}, >> { "op_num": 12, >> "op_name": "omap_setheader", >> "collection": "0.a9_head", >> "oid": "4c56f2a9\/1c04096.\/head\/\/0", >> "header_length": "0"}, >> { "op_num": 13, >> "op_name": "write", >> "collection": "0.a9_head", >> "oid": "4c56f2a9\/1c04096.\/head\/\/
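The debug setting mentioned above can be injected into a running OSD without a restart, or persisted so it is already active the next time the assert hits (the OSD id is illustrative):

  ceph tell osd.12 injectargs '--debug_keyvaluestore 20/20'   # live, no restart needed
  # or in ceph.conf under [osd], then restart the daemon:
  #   debug keyvaluestore = 20/20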
[ceph-users] Re: Re: ceph osd unexpected error
It happened this morning and I could not wait, so I removed and re-added the OSD. Next time it happens I will raise the debug level first. Thanks very much. From: Haomai Wang [haomaiw...@gmail.com] Sent: September 7, 2014 12:08 To: 廖建锋 Cc: Somnath Roy; ceph-users; ceph-devel Subject: Re: Re: [ceph-users] ceph osd unexpected error Yes, if you still meet this error, please add "debug_keyvaluestore=20/20" to your config and catch the debug output On Sun, Sep 7, 2014 at 11:11 AM, 廖建锋 wrote: > I use latest version 0.80.6 > I am setting the limitation now, and watching? > > > > 发件人: Somnath Roy [somnath@sandisk.com] > 发送时间: 2014年9月7日 1:12 > 到: Haomai Wang; 廖建锋 > Cc: ceph-users; ceph-devel > 主题: RE: [ceph-users] ceph osd unexpected error > > Have you set the open file descriptor limit in the OSD node ? > Try setting it like 'ulimit -n 65536" > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Haomai Wang > Sent: Saturday, September 06, 2014 7:44 AM > To: 廖建锋 > Cc: ceph-users; ceph-devel > Subject: Re: [ceph-users] ceph osd unexpected error > > Hi, > > Could you give some more detail infos such as operation before occur errors? > > And what's your ceph version? > > On Fri, Sep 5, 2014 at 3:16 PM, 廖建锋 wrote: >> Dear CEPH , >> Urgent question, I met a "FAILED assert(0 == "unexpected error")" >> yesterday , Now i have not way to start this OSDS I have attached my >> logs in the attachment, and some ceph configurations as below >> >> >> osd_pool_default_pgp_num = 300 >> osd_pool_default_size = 2 >> osd_pool_default_min_size = 1 >> osd_pool_default_pg_num = 300 >> mon_host = 10.1.0.213,10.1.0.214 >> osd_crush_chooseleaf_type = 1 >> mds_cache_size = 50 >> osd objectstore = keyvaluestore-dev >> >> >> >> Detailed error information : >> >> >>-13> 2014-09-05 15:07:35.279863 7f4d988b9700 2 waiting 51 > 50 ops >> || >> 11642907 > 104857600 >> -12> 2014-09-05 15:07:35.279899 7f4d978b7700 2 waiting 51 > 50 ops || >> 11642899 > 104857600 >> -11> 2014-09-05 15:07:35.279919 7f4d990ba700 2 waiting 51 > 50 ops || >> 11642901 > 104857600 >> -10> 2014-09-05 15:07:35.326803 7f4d9a8bd700 10 monclient: tick >> -9> 2014-09-05 15:07:35.326837 7f4d9a8bd700 10 monclient: >> _check_auth_rotating have uptodate secrets (they expire after >> 2014-09-05 >> 15:07:05.326835) >> -8> 2014-09-05 15:07:35.326871 7f4d9a8bd700 10 monclient: renew subs?
(now: >> 2014-09-05 15:07:35.326871; renew after: 2014-09-05 15:10:02.464341) >> -- no >> -7> 2014-09-05 15:07:35.343657 7f4d978b7700 2 waiting 51 > 50 ops || >> 11044551 > 104857600 >> -6> 2014-09-05 15:07:35.343654 7f4e1ee72700 1 -- 10.1.0.221:6801/4013 >> -6> --> >> osd.12 10.1.0.219:6810/32654 -- pg_info(1 pgs e1267:0.f1) v4 -- ?+0 >> 0x18dcf000 >> -5> 2014-09-05 15:07:35.343680 7f4d990ba700 2 waiting 51 > 50 ops || >> 11044553 > 104857600 >> -4> 2014-09-05 15:07:35.343686 7f4d988b9700 2 waiting 51 > 50 ops || >> 11044579 > 104857600 >> -3> 2014-09-05 15:07:35.344875 7f4e1fe74700 0 error (22) Invalid >> -3> argument >> not handled on operation 9 (336.0.3, or op 3, counting from 0) >> -2> 2014-09-05 15:07:35.344902 7f4e1fe74700 0 unexpected error code >> -1> 2014-09-05 15:07:35.344903 7f4e1fe74700 0 transaction dump: >> { "ops": [ >> { "op_num": 0, >> "op_name": "remove", >> "collection": "0.a9_head", >> "oid": "4b0fea9\/153b885.\/head\/\/0"}, >> { "op_num": 1, >> "op_name": "mkcoll", >> "collection": "0.a9_TEMP"}, >> { "op_num": 2, >> "op_name": "remove", >> "collection": "0.a9_TEMP", >> "oid": "4b0fea9\/153b885.\/head\/\/0"}, >> { "op_num": 3, >> "op_name": "touch", >> "collection": "0.a9_head", >> "oid": "4b0fea9\/153b885.\/head\/\/0"}, >> { "op_num": 4, >> "op_name": "omap_setheader", >> "collection": "0.a9_head", >> "oid": "4b0fea9\/153b885.\/head\/\/0", >> "header_length": "0"}, >> { "op_num": 5, >> "op_name": "write", >> "collection": "0.a9_head", >> "oid": "4b0fea9\/153b885.\/head\/\/0", >> "length": 1160, >> "offset": 0, >> "bufferlist length": 1160}, >> { "op_num": 6, >> "op_name": "omap_setkeys", >> "collection": "0.a9_head", >> "oid": "4b0fea9\/153b885.\/head\/\/0", >> "attr_lens": {}}, >> { "op_num": 7, >> "op_name": "setattrs", >> "collection": "0.a9_head", >> "oid": "4b0fea9\/153b885.\/head\/\/0", >> "attr_lens": { "_": 239, >> "_parent": 250, >> "snapset": 31}}, >> { "op_num": 8, >> "op_name": "omap_setkeys", >> "collection": "meta", >> "oid": "16ef7597\/infos\/head\/\/-1", >> "attr_lens": { "0.a9_epoch": 4, >> "0.a9_info": 684}}, >> { "op_num": 9, >> "op_name": "remove", >> "collection": "0.a9_head", >> "oid": "4c56f2a9\/1c04096.\/head\/\/0"}, >> { "op_num": 10, >> "op_name": "remove", >> "collection": "0.a9_TEMP", >> "oid": "4c56f2a9\/1c04096.\/head\/\/0"}, >> { "op_num": 11, >> "op_name": "touch", >> "collection": "0.
Re: [ceph-users] Huge issues with slow requests
On 07 Sep 2014, at 04:47, Christian Balzer wrote: > On Sat, 6 Sep 2014 19:47:13 +0200 Josef Johansson wrote: > >> >> On 06 Sep 2014, at 19:37, Josef Johansson wrote: >> >>> Hi, >>> >>> Unfortunatly the journal tuning did not do much. That’s odd, because I >>> don’t see much utilisation on OSDs themselves. Now this leads to a >>> network-issue between the OSDs right? >>> >> To answer my own question. Restarted a bond and it all went up again, >> found the culprit — packet loss. Everything up and running afterwards. >> > If there were actual errors, that should have been visible in atop as well. > For utilization it isn't that obvious, as it doesn't know what bandwidth a > bond device has. Same is true for IPoIB interfaces. > And FWIW, tap (kvm guest interfaces) are wrongly pegged in the kernel at > 10Mb/s, so they get to be falsely redlined on compute nodes all the time. > This is the second time I’ve seen Ceph behaving badly due to networking issues. Maybe @Inktank has ideas of how to announce in the ceph log that there’s packet loss? Regards, Josef >> I’ll be taking that beer now, > > Skol. > > Christian > >> Regards, >> Josef >>> On 06 Sep 2014, at 18:17, Josef Johansson wrote: >>> Hi, On 06 Sep 2014, at 17:59, Christian Balzer wrote: > > Hello, > > On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote: > >> Hi, >> >> On 06 Sep 2014, at 17:27, Christian Balzer wrote: >> >>> >>> Hello, >>> >>> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote: >>> We manage to go through the restore, but the performance degradation is still there. >>> Manifesting itself how? >>> >> Awful slow io on the VMs, and iowait, it’s about 2MB/s or so. >> But mostly a lot of iowait. >> > I was thinking about the storage nodes. ^^ > As in, does a particular node or disk seem to be redlined all the > time? They’re idle, with little io wait. >>> It also shows it self as earlier, with slow requests now and then. 
>>> >>> Like this >>> 2014-09-06 19:13:28.469533 osd.25 10.168.7.23:6827/11423 362 : [WRN] >>> slow request 31.554785 seconds old, received at 2014-09-06 >>> 19:12:56.914688: osd_op(client.12483520.0:12211087 >>> rbd_data.4b8e9b3d1b58ba.1222 [stat,write 3813376~4096] >>> 3.3bfab9da e15861) v4 currently waiting for subops from [13,2] >>> 2014-09-06 19:13:28.469536 osd.25 10.168.7.23:6827/11423 363 : [WRN] >>> slow request 31.554736 seconds old, received at 2014-09-06 >>> 19:12:56.914737: osd_op(client.12483520.0:12211088 >>> rbd_data.4b8e9b3d1b58ba.1222 [stat,write 3842048~8192] >>> 3.3bfab9da e15861) v4 currently waiting for subops from [13,2] >>> 2014-09-06 19:13:28.469539 osd.25 10.168.7.23:6827/11423 364 : [WRN] >>> slow request 30.691760 seconds old, received at 2014-09-06 >>> 19:12:57.13: osd_op(client.12646408.0:36726433 >>> rbd_data.81ab322eb141f2.ec38 [stat,write 749568~4096] >>> 3.7ae1c1da e15861) v4 currently waiting for subops from [13,2] >>> 2014-09-06 19:13:31.469946 osd.25 10.168.7.23:6827/11423 365 : [WRN] >>> 23 slow requests, 2 included below; oldest blocked for > 42.196747 >>> secs 2014-09-06 19:13:31.469951 osd.25 10.168.7.23:6827/11423 366 : >>> [WRN] slow request 30.344653 seconds old, received at 2014-09-06 >>> 19:13:01.125248: osd_op(client.18869229.0:100325 >>> rbd_data.41d2eb2eb141f2.2732 [stat,write 2174976~4096] >>> 3.55d437e e15861) v4 currently waiting for subops from [13,6] >>> 2014-09-06 19:13:31.469954 osd.25 10.168.7.23:6827/11423 367 : [WRN] >>> slow request 30.344579 seconds old, received at 2014-09-06 >>> 19:13:01.125322: osd_op(client.18869229.0:100326 >>> rbd_data.41d2eb2eb141f2.2732 [stat,write 2920448~4096] >>> 3.55d437e e15861) v4 currently waiting for subops from [13,6] >>> 2014-09-06 19:13:32.470156 osd.25 10.168.7.23:6827/11423 368 : [WRN] >>> 24 slow requests, 1 included below; oldest blocked for > 43.196971 >>> secs 2014-09-06 19:13:32.470163 osd.25 10.168.7.23:6827/11423 369 : >>> [WRN] slow request 30.627252 seconds old, received at 2014-09-06 >>> 19:13:01.842873: osd_op(client.10785413.0:136148901 >>> rbd_data.96803f2eb141f2.33d7 [stat,write 4063232~4096] >>> 3.cf740399 e15861) v4 currently waiting for subops from [1,13] >>> 2014-09-06 19:13:37.470895 osd.25 10.168.7.23:6827/11423 370 : [WRN] >>> 27 slow requests, 3 included below; oldest blocked for > 48.197700 >>> secs 2014-09-06 19:13:37.470902 osd.25 10.168.7.23:6827/11423 371 : >>> [WRN] slow request 30.769509 seconds old, received at 2014-09-06 >>> 19:13:06.701345: osd_op(client.18777372.0:1605468 >>> rbd_data.2f1e4e2eb141f2.3541 [stat,write 1118208~4096] >>> 3.db1ca37e e15861) v4 currently waiting for subops from [13,6] >>> 2014-09-06 19:13:37.470907 osd.25 10.168.7.23:6827/11423 372 : [WRN] >>> slow request 30.769458 seconds old, received at 2014-09-06 >>>
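Until something like that exists in Ceph itself, a crude external probe between storage nodes can at least raise an alarm on packet loss; a sketch suitable for a cron job, with placeholder addresses:

  #!/bin/sh
  # warn if any peer storage node shows packet loss
  for host in 10.168.7.21 10.168.7.22 10.168.7.23; do
      loss=$(ping -q -c 20 "$host" | sed -n 's/.* \([0-9]*\)\(\.[0-9]*\)\?% packet loss.*/\1/p')
      if [ -n "$loss" ] && [ "$loss" -gt 0 ]; then
          echo "WARNING: ${loss}% packet loss to $host"
      fi
  done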