Re: [ceph-users] Good way to monitor detailed latency/throughput

2014-09-06 Thread Christian Balzer
On Fri, 05 Sep 2014 16:23:13 +0200 Josef Johansson wrote:

> Hi,
> 
> How do you guys monitor the cluster to find disks that behave bad, or
> VMs that impact the Ceph cluster?
> 
> I'm looking for something where I could get a good bird-view of
> latency/throughput, that uses something easy like SNMP.
> 
You mean there is another form of monitoring than waiting for the
users/customers to yell at you because performance sucks? ^o^

The first part is relatively easy: run something like "iostat -y -x 300"
and feed the output into SNMP via the extend functionality. Maybe somebody
has done that already, but it would be trivial anyway.
The hard part is what to do with that data; just graphing it is great
for post-mortem analysis, or if you have 24h staff staring blindly at
monitors.
Deciding what numbers warrant a warning or even a notification (in Nagios
terms) is going to be much harder.
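
For what it's worth, a minimal sketch of that extend approach (the script
path, device pattern and 300s interval are assumptions, not a recommendation):

#!/bin/sh
# iostat-collector.sh -- hypothetical background collector, started from an
# init script or cron @reboot; -y skips the since-boot summary, "300 1" takes
# one 5-minute sample per loop iteration.
while true; do
    iostat -y -x 300 1 | awk '/^sd/ {print $1, $10, $12, $14}' \
        > /var/run/iostat.latest.tmp && \
        mv /var/run/iostat.latest.tmp /var/run/iostat.latest   # device await w_await %util
done

# /etc/snmp/snmpd.conf -- expose the last sample via net-snmp's extend hook,
# readable under NET-SNMP-EXTEND-MIB:
# extend iostat /bin/cat /var/run/iostat.latest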

Take this iostat -x output (all activities since boot) for example:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.30    0.02    2.61     0.43   405.49   308.65     0.03   10.44    0.50   10.52   0.83   0.22
sdb               0.00     0.30    0.01    2.55     0.27   379.16   296.20     0.03   11.31    0.73   11.35   0.80   0.21
sdc               0.00     0.29    0.02    2.44     0.38   376.57   307.23     0.03   11.82    0.56   11.89   0.84   0.21
sdd               0.00     0.29    0.01    2.42     0.24   369.05   304.43     0.03   11.51    0.63   11.55   0.84   0.20
sde               0.02   266.52    0.65    2.93    72.56   365.03   244.67     0.29   79.75    1.65   97.16   1.60   0.57
sdg               0.01     0.97    0.72    0.65    76.33   187.84   384.75     0.09   69.06    1.85  143.21   2.87   0.39
sdf               0.01     0.87    0.68    0.59    67.04   167.94   369.82     0.09   67.58    2.79  143.18   3.44   0.44
sdh               0.00     0.94    0.94    0.64    74.87   182.81   327.19     0.09   57.34    1.91  139.22   2.79   0.44
sdj               0.01     0.96    0.93    0.65    75.76   187.75   331.78     0.10   62.76    1.81  149.88   2.72   0.43
sdk               0.01     1.02    1.00    0.67    77.78   188.83   320.46     0.08   47.02    1.66  115.02   2.53   0.42
sdi               0.01     0.93    0.96    0.61    74.38   173.72   317.35     0.22  140.56    2.16  358.85   3.49   0.54
sdl               0.01     0.92    0.71    0.62    72.57   175.19   373.05     0.09   65.36    2.01  138.19   3.03   0.40

sda to sdd are SSDs, so for starters you can't compare them with spinning
rust. And if you were to look for outliers, all of sde to sdl (the actual
disks) would seem suspiciously slow. ^o^
And if you look at sde it seems to be faster than the rest, but that is
because the original drive was replaced and thus the new one has seen
less action than the rest.
The actual wonky drive is sdi, looking at await/w_await and svctm. 
This drive sometimes goes into a state (for 10-20 hours at a time) where
it can only perform at half speed.

These are the same drives when running a rados bench against the cluster,
sdi is currently not wonky and performing at full speed:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sde               0.00   173.00    0.00  236.60     0.00 91407.20   772.67    76.40  338.32    0.00  338.32   4.21  99.60
sdg               0.00   153.00    0.40  234.60     1.60 88052.40   749.40    83.61  359.95   23.00  360.52   4.24  99.68
sdf               0.00   147.30    0.50  206.00     2.00 68918.40   667.51    50.15  264.40   65.60  264.88   4.45  91.88
sdh               0.00   158.10    0.80  170.90     3.20 66077.20   769.72    31.31  153.45   12.50  154.11   5.40  92.76
sdj               0.00   158.00    0.60  207.00     2.40 77455.20   746.22    61.61  296.78   55.33  297.48   4.79  99.52
sdk               0.00   160.90    0.90  242.30     3.60 92251.20   758.67    57.11  234.84   40.44  235.57   4.06  98.68
sdi               0.00   166.70    1.00  190.90     4.00 69919.20   728.75    60.15  282.98   24.00  284.34   5.16  99.00
sdl               0.00   131.90    0.80  207.10     3.20 85014.00   817.87    92.10  412.02   53.00  413.41   4.79  99.52

Now things are more uniform (of course Ceph never is really uniform, and
sdh was busier and thus slower in the next sample).
If sdi were in its half-speed mode, it would be at 100% utilization (all the
time, while the other drives were not and often even idle), with an svctm of
about 15 and a w_await well over 800.
You could simply say that with this baseline, anything that goes over 500
w_await is worthy of an alert, but it might only get there if your cluster is
sufficiently busy.
To really find a "slow" disk, you need to compare identical disks having
the same workload.
Personally I'm still not sure what formula to use, even though it is so
blatantly obvious and v
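
For what it's worth, one crude first stab at such a comparison (the device
pattern, the 60-second sample and the 2x factor are all assumptions to tune,
not a recommendation):

# Flag any spinner whose w_await is more than twice the average of its
# identically-loaded peers in one 60-second sample (sde..sdl assumed here).
iostat -y -x 60 1 | awk '
    /^sd[e-l]/ { w[$1] = $12; sum += $12; n++ }
    END {
        avg = sum / n
        for (d in w)
            if (w[d] > 2 * avg)
                printf "WARNING: %s w_await %.1f (peer average %.1f)\n", d, w[d], avg
    }'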

Re: [ceph-users] SSD journal deployment experiences

2014-09-06 Thread Christian Balzer
On Fri, 5 Sep 2014 09:42:02 + Dan Van Der Ster wrote:

> 
> > On 05 Sep 2014, at 11:04, Christian Balzer  wrote:
> > 
> > On Fri, 5 Sep 2014 07:46:12 + Dan Van Der Ster wrote:
> >> 
> >>> On 05 Sep 2014, at 03:09, Christian Balzer  wrote:
> >>> 
> >>> On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote:
> >>> 
>  On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster
>   wrote:
>  
[snip]
> > 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful
> > is the backfilling which results from an SSD failure? Have you
> > considered tricks like increasing the down out interval so
> > backfilling doesn’t happen in this case (leaving time for the SSD
> > to be replaced)?
> > 
>  
>  Replacing a failed SSD won't help your backfill.  I haven't actually
>  tested it, but I'm pretty sure that losing the journal effectively
>  corrupts your OSDs.  I don't know what steps are required to
>  complete this operation, but it wouldn't surprise me if you need to
>  re-format the OSD.
>  
> >>> This.
> >>> All the threads I've read about this indicate that journal loss
> >>> during operation means OSD loss. Total OSD loss, no recovery.
> >>> From what I gathered the developers are aware of this and it might be
> >>> addressed in the future.
> >>> 
> >> 
> >> I suppose I need to try it then. I don’t understand why you can't just
> >> use ceph-osd -i 10 --mkjournal to rebuild osd 10’s journal, for
> >> example.
> >> 
> > I think the logic is if you shut down an OSD cleanly beforehand you can
> > just do that.
> > However from what I gathered there is no logic to re-issue transactions
> > that made it to the journal but not the filestore.
> > So a journal SSD failing mid-operation with a busy OSD would certainly
> > be in that state.
> > 
> 
> I had thought that the journal write and the buffered filestore write
> happen at the same time. 

Nope, definitely not.

That's why we have tunables like the ones at:
http://ceph.com/docs/master/rados/configuration/filestore-config-ref/#synchronization-intervals

And people (me included) tend to crank that up (to eleven ^o^).

The write-out to the filestore may start roughly at the same time as the
journal gets things, but it can and will fall behind.
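
For reference, a sketch of cranking that particular knob at runtime (the
values are placeholders, not a recommendation; persist them in ceph.conf
under [osd] if they work out for you):

# Let the filestore lag further behind the journal before a sync is forced
# (default max is 5 s, default min is 0.01 s); size the journal accordingly.
ceph tell osd.\* injectargs '--filestore_max_sync_interval 30'
ceph tell osd.\* injectargs '--filestore_min_sync_interval 10'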

> So all the previous journal writes that
> succeeded are already on their way to the filestore. My (could be
> incorrect) understanding is that the real purpose of the journal is to
> be able to replay writes after a power outage (since the buffered
> filestore writes would be lost in that case). If there is no power
> outage, then filestore writes are still good regardless of a journal
> failure.
> 
From Ceph's perspective a write is successful once it is on the journals of
all (replica size) OSDs.
I think (hope) that what you wrote up there is true, but that doesn't
change the fact that journal data that is not even on the way to the
filestore yet is the crux here.

> 
> > I'm sure (hope) somebody from the Ceph team will pipe up about this.
> 
> Ditto!
> 
Guess it will be next week...

> 
> >>> Now 200GB DC 3700s can write close to 400MB/s so a 1:4 or even 1:5
> >>> ratio is sensible. However these will be the ones limiting your max
> >>> sequential write speed if that is of importance to you. In nearly all
> >>> use cases you run out of IOPS (on your HDDs) long before that becomes
> >>> an issue, though.
> >> 
> >> IOPS is definitely the main limit, but we also only have 1 single
> >> 10Gig-E NIC on these servers, so 4 drives that can write (even only
> >> 200MB/s) would be good enough.
> >> 
> > Fair enough. ^o^
> > 
> >> Also, we’ll put the SSDs in the first four ports of an SAS2008 HBA
> >> which is shared with the other 20 spinning disks. Counting the double
> >> writes, the HBA will run out of bandwidth before these SSDs, I expect.
> >> 
> > Depends on what PCIe slot it is and so forth. A 2008 should give you
> > 4GB/s, enough to keep the SSDs happy at least. ^o^
> > 
> > A 2008 has only 8 SAS/SATA ports, so are you using port expanders on
> > your case backplane? 
> > In that case you might want to spread the SSDs out over channels, as in
> > have 3 HDDs sharing one channel with one SSD.
> 
> We use a Promise VTrak J830sS, and now I’ll go ask our hardware team if
> there would be any benefit to storing the SSDs row- or column-wise.
>
Ah, a storage pod. So you have that and a real OSD head server, something
like a 1U machine or Supermicro Twin? 
Looking at the specs of it I would assume 3 drives per expander, so having
one SSD mixed with 2 HDDs should definitely be beneficial. 

> With the current config, when I dd to all drives in parallel I can write
> at 24*74MB/s = 1776MB/s.
> 
That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0 lanes,
so as far as that bus goes, it can do 4GB/s.
And given your storage pod I assume it is connected with 2 mini-SAS
cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA bandwidth. 
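
(For reference, the kind of parallel dd test quoted above can be approximated
like this; the paths are placeholders and it writes real data, so only point
it at OSD filesystems you're willing to scribble a test file onto:)

# One 1 GB O_DIRECT write per OSD filesystem, all in parallel, then report MB/s.
for d in /var/lib/ceph/osd/ceph-*; do
    dd if=/dev/zero of=$d/ddtest bs=1M count=1024 oflag=direct 2>&1 \
        | awk -v osd="$d" '/copied/ {print osd, $(NF-1), $NF}' &
done
wait
rm -f /var/lib/ceph/osd/ceph-*/ddtest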

How fast can your "eco 5900rpm" drive

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
Also putting this on the list.

On 06 Sep 2014, at 13:36, Josef Johansson  wrote:

> Hi,
> 
> Same issues again, but I think we found the drive that causes the problems.
> 
> But this is causing problems as it’s trying to do a recover to that osd at 
> the moment.
> 
> So we’re left with the status message 
> 
> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841 
> active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB used, 
> 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s; 41424/15131923 
> degraded (0.274%);  recovering 0 o/s, 2035KB/s
> 
> 
> It’s improving, but way too slowly. If I restart the recovery (ceph osd set 
> no recovery /unset) it doesn’t change the osd what I can see.
> 
> Any ideas?
> 
> Cheers,
> Josef
> 
> On 05 Sep 2014, at 11:26, Luis Periquito  wrote:
> 
>> Only time I saw such behaviour was when I was deleting a big chunk of data 
>> from the cluster: all the client activity was reduced, the op/s were almost 
>> non-existent and there was unjustified delays all over the cluster. But all 
>> the disks were somewhat busy in atop/iotstat.
>> 
>> 
>> On 5 September 2014 09:51, David  wrote:
>> Hi,
>> 
>> Indeed strange.
>> 
>> That output was when we had issues, seems that most operations were blocked 
>> / slow requests.
>> 
>> A ”baseline” output is more like today:
>> 
>> 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9273KB/s 
>> rd, 24650KB/s wr, 2755op/s
>> 2014-09-05 10:44:30.125637 mon.0 [INF] pgmap v12582760: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9500KB/s 
>> rd, 20430KB/s wr, 2294op/s
>> 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9216KB/s 
>> rd, 20062KB/s wr, 2488op/s
>> 2014-09-05 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 12511KB/s 
>> rd, 15739KB/s wr, 2488op/s
>> 2014-09-05 10:44:33.161210 mon.0 [INF] pgmap v12582763: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 18593KB/s 
>> rd, 14880KB/s wr, 2609op/s
>> 2014-09-05 10:44:34.187294 mon.0 [INF] pgmap v12582764: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 17720KB/s 
>> rd, 22964KB/s wr, 3257op/s
>> 2014-09-05 10:44:35.190785 mon.0 [INF] pgmap v12582765: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 19230KB/s 
>> rd, 18901KB/s wr, 3199op/s
>> 2014-09-05 10:44:36.213535 mon.0 [INF] pgmap v12582766: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 17630KB/s 
>> rd, 18855KB/s wr, 3131op/s
>> 2014-09-05 10:44:37.220052 mon.0 [INF] pgmap v12582767: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 12262KB/s 
>> rd, 18627KB/s wr, 2595op/s
>> 2014-09-05 10:44:38.233357 mon.0 [INF] pgmap v12582768: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 17697KB/s 
>> rd, 17572KB/s wr, 2156op/s
>> 2014-09-05 10:44:39.239409 mon.0 [INF] pgmap v12582769: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 20300KB/s 
>> rd, 19735KB/s wr, 2197op/s
>> 2014-09-05 10:44:40.260423 mon.0 [INF] pgmap v12582770: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 14656KB/s 
>> rd, 15460KB/s wr, 2199op/s
>> 2014-09-05 10:44:41.269736 mon.0 [INF] pgmap v12582771: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 8969KB/s 
>> rd, 11918KB/s wr, 1951op/s
>> 2014-09-05 10:44:42.276192 mon.0 [INF] pgmap v12582772: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 7272KB/s 
>> rd, 10644KB/s wr, 1832op/s
>> 2014-09-05 10:44:43.291817 mon.0 [INF] pgmap v12582773: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9316KB/s 
>> rd, 16610KB/s wr, 2412op/s
>> 2014-09-05 10:44:44.295469 mon.0 [INF] pgmap v12582774: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9257KB/s 
>> rd, 19953KB/s wr, 2633op/s
>> 2014-09-05 10:44:45.315774 mon.0 [INF] pgmap v12582775: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9718KB/s 
>> rd, 14298KB/s wr, 2101op/s
>> 2014-09-05 10:44:46.326783 mon.0 [INF] pgmap v12582776: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 20877KB/s 
>> rd, 12822KB/s wr, 2447op/s
>> 2014-09-05 10:44:47.327537 mon.0 [INF] pgmap v12582777: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 18447KB/s 
>> rd, 12945KB/s wr, 2226op/s
>> 2014-09-05 10:44:48.348725 mon.0 [INF] pgmap v12582778: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB u

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Christian Balzer

Hello,

On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:

> Also putting this on the list.
> 
> On 06 Sep 2014, at 13:36, Josef Johansson  wrote:
> 
> > Hi,
> > 
> > Same issues again, but I think we found the drive that causes the
> > problems.
> > 
> > But this is causing problems as it’s trying to do a recover to that
> > osd at the moment.
> > 
> > So we’re left with the status message 
> > 
> > 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841
> > active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB
> > used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s;
> > 41424/15131923 degraded (0.274%);  recovering 0 o/s, 2035KB/s
> > 
> > 
> > It’s improving, but way too slowly. If I restart the recovery (ceph
> > osd set no recovery /unset) it doesn’t change the osd what I can see.
> > 
> > Any ideas?
> > 
I don't know the state of your cluster, i.e. what caused the recovery to
start (how many OSDs went down?).
If you have a replication of 3 and only one OSD was involved, what is
stopping you from taking that wonky drive/OSD out?

If you don't know that or want to play it safe, how about setting the
weight of that OSD to 0? 
While that will AFAICT still result in all primary PGs being evacuated
off it, no more writes will happen to it and reads might be faster.
In either case, it shouldn't slow down the rest of your cluster anymore.
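
Something along these lines (osd.42 is a placeholder id, check the syntax
against your release):

# CRUSH weight to 0: data gets remapped away and nothing new lands on it.
ceph osd crush reweight osd.42 0
# or the temporary override, roughly equivalent to marking it out:
ceph osd reweight 42 0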

Regards,

Christian
> > Cheers,
> > Josef
> > 
> > On 05 Sep 2014, at 11:26, Luis Periquito 
> > wrote:
> > 
> >> Only time I saw such behaviour was when I was deleting a big chunk of
> >> data from the cluster: all the client activity was reduced, the op/s
> >> were almost non-existent and there was unjustified delays all over
> >> the cluster. But all the disks were somewhat busy in atop/iotstat.
> >> 
> >> 
> >> On 5 September 2014 09:51, David  wrote:
> >> Hi,
> >> 
> >> Indeed strange.
> >> 
> >> That output was when we had issues, seems that most operations were
> >> blocked / slow requests.
> >> 
> >> A ”baseline” output is more like today:
> >> 
> >> 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs:
> >> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
> >> avail; 9273KB/s rd, 24650KB/s wr, 2755op/s 2014-09-05 10:44:30.125637
> >> mon.0 [INF] pgmap v12582760: 6860 pgs: 6860 active+clean; 12253 GB
> >> data, 36574 GB used, 142 TB / 178 TB avail; 9500KB/s rd, 20430KB/s
> >> wr, 2294op/s 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761:
> >> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB /
> >> 178 TB avail; 9216KB/s rd, 20062KB/s wr, 2488op/s 2014-09-05
> >> 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860 pgs: 6860
> >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
> >> 12511KB/s rd, 15739KB/s wr, 2488op/s 2014-09-05 10:44:33.161210 mon.0
> >> [INF] pgmap v12582763: 6860 pgs: 6860 active+clean; 12253 GB data,
> >> 36574 GB used, 142 TB / 178 TB avail; 18593KB/s rd, 14880KB/s wr,
> >> 2609op/s 2014-09-05 10:44:34.187294 mon.0 [INF] pgmap v12582764: 6860
> >> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
> >> avail; 17720KB/s rd, 22964KB/s wr, 3257op/s 2014-09-05
> >> 10:44:35.190785 mon.0 [INF] pgmap v12582765: 6860 pgs: 6860
> >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
> >> 19230KB/s rd, 18901KB/s wr, 3199op/s 2014-09-05 10:44:36.213535 mon.0
> >> [INF] pgmap v12582766: 6860 pgs: 6860 active+clean; 12253 GB data,
> >> 36574 GB used, 142 TB / 178 TB avail; 17630KB/s rd, 18855KB/s wr,
> >> 3131op/s 2014-09-05 10:44:37.220052 mon.0 [INF] pgmap v12582767: 6860
> >> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
> >> avail; 12262KB/s rd, 18627KB/s wr, 2595op/s 2014-09-05
> >> 10:44:38.233357 mon.0 [INF] pgmap v12582768: 6860 pgs: 6860
> >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
> >> 17697KB/s rd, 17572KB/s wr, 2156op/s 2014-09-05 10:44:39.239409 mon.0
> >> [INF] pgmap v12582769: 6860 pgs: 6860 active+clean; 12253 GB data,
> >> 36574 GB used, 142 TB / 178 TB avail; 20300KB/s rd, 19735KB/s wr,
> >> 2197op/s 2014-09-05 10:44:40.260423 mon.0 [INF] pgmap v12582770: 6860
> >> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
> >> avail; 14656KB/s rd, 15460KB/s wr, 2199op/s 2014-09-05
> >> 10:44:41.269736 mon.0 [INF] pgmap v12582771: 6860 pgs: 6860
> >> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
> >> 8969KB/s rd, 11918KB/s wr, 1951op/s 2014-09-05 10:44:42.276192 mon.0
> >> [INF] pgmap v12582772: 6860 pgs: 6860 active+clean; 12253 GB data,
> >> 36574 GB used, 142 TB / 178 TB avail; 7272KB/s rd, 10644KB/s wr,
> >> 1832op/s 2014-09-05 10:44:43.291817 mon.0 [INF] pgmap v12582773: 6860
> >> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
> >> avail; 9316KB/s rd, 16610KB/s wr, 2412op/s 2014-09-05 10:44:44.295469
> >> mon.0 [INF] pgmap v12582774: 6860 pgs: 6860 active+clean; 12253 GB
> >> data, 

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
Hi,

On 06 Sep 2014, at 13:53, Christian Balzer  wrote:

> 
> Hello,
> 
> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
> 
>> Also putting this on the list.
>> 
>> On 06 Sep 2014, at 13:36, Josef Johansson  wrote:
>> 
>>> Hi,
>>> 
>>> Same issues again, but I think we found the drive that causes the
>>> problems.
>>> 
>>> But this is causing problems as it’s trying to do a recover to that
>>> osd at the moment.
>>> 
>>> So we’re left with the status message 
>>> 
>>> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841
>>> active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB
>>> used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s;
>>> 41424/15131923 degraded (0.274%);  recovering 0 o/s, 2035KB/s
>>> 
>>> 
>>> It’s improving, but way too slowly. If I restart the recovery (ceph
>>> osd set no recovery /unset) it doesn’t change the osd what I can see.
>>> 
>>> Any ideas?
>>> 
> I don't know the state of your cluster, i.e. what caused the recovery to
> start (how many OSDs went down?).
Performance degradation; databases are the worst impacted. It’s actually an OSD 
that we put in that’s causing it (we removed it again, though). So the cluster 
in itself is healthy.

> If you have a replication of 3 and only one OSD was involved, what is
> stopping you from taking that wonky drive/OSD out?
> 
There’s data that goes missing if I do that. I guess I have to wait for the 
recovery process to complete before I can go any further; this is with rep 3.
> If you don't know that or want to play it safe, how about setting the
> weight of that OSD to 0? 
> While that will AFAICT still result in all primary PGs to be evacuated
> off it, no more writes will happen to it and reads might be faster.
> In either case, it shouldn't slow down the rest of your cluster anymore.
> 
That’s actually one idea I haven’t thought of. I want to play it safe right 
now and hope that it goes up again. I actually found one wonky way of keeping 
the recovery process from grinding to a halt, and that was restarting 
OSDs, one at a time.
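
(For reference, the flags I'm toggling when I say "restart the recovery",
assuming a Firefly-era CLI, plus what I watch while restarting OSDs:)

ceph osd set norecover      # pause recovery
ceph osd unset norecover    # resume it
ceph osd set nobackfill     # same idea for backfill
ceph osd unset nobackfill
ceph -w                     # watch the degraded/recovering counters meanwhile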

Regards,
Josef
> Regards,
> 
> Christian
>>> Cheers,
>>> Josef
>>> 
>>> On 05 Sep 2014, at 11:26, Luis Periquito 
>>> wrote:
>>> 
 Only time I saw such behaviour was when I was deleting a big chunk of
 data from the cluster: all the client activity was reduced, the op/s
 were almost non-existent and there was unjustified delays all over
 the cluster. But all the disks were somewhat busy in atop/iotstat.
 
 
 On 5 September 2014 09:51, David  wrote:
 Hi,
 
 Indeed strange.
 
 That output was when we had issues, seems that most operations were
 blocked / slow requests.
 
 A ”baseline” output is more like today:
 
 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs:
 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
 avail; 9273KB/s rd, 24650KB/s wr, 2755op/s 2014-09-05 10:44:30.125637
 mon.0 [INF] pgmap v12582760: 6860 pgs: 6860 active+clean; 12253 GB
 data, 36574 GB used, 142 TB / 178 TB avail; 9500KB/s rd, 20430KB/s
 wr, 2294op/s 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761:
 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB /
 178 TB avail; 9216KB/s rd, 20062KB/s wr, 2488op/s 2014-09-05
 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860 pgs: 6860
 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
 12511KB/s rd, 15739KB/s wr, 2488op/s 2014-09-05 10:44:33.161210 mon.0
 [INF] pgmap v12582763: 6860 pgs: 6860 active+clean; 12253 GB data,
 36574 GB used, 142 TB / 178 TB avail; 18593KB/s rd, 14880KB/s wr,
 2609op/s 2014-09-05 10:44:34.187294 mon.0 [INF] pgmap v12582764: 6860
 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
 avail; 17720KB/s rd, 22964KB/s wr, 3257op/s 2014-09-05
 10:44:35.190785 mon.0 [INF] pgmap v12582765: 6860 pgs: 6860
 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
 19230KB/s rd, 18901KB/s wr, 3199op/s 2014-09-05 10:44:36.213535 mon.0
 [INF] pgmap v12582766: 6860 pgs: 6860 active+clean; 12253 GB data,
 36574 GB used, 142 TB / 178 TB avail; 17630KB/s rd, 18855KB/s wr,
 3131op/s 2014-09-05 10:44:37.220052 mon.0 [INF] pgmap v12582767: 6860
 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
 avail; 12262KB/s rd, 18627KB/s wr, 2595op/s 2014-09-05
 10:44:38.233357 mon.0 [INF] pgmap v12582768: 6860 pgs: 6860
 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
 17697KB/s rd, 17572KB/s wr, 2156op/s 2014-09-05 10:44:39.239409 mon.0
 [INF] pgmap v12582769: 6860 pgs: 6860 active+clean; 12253 GB data,
 36574 GB used, 142 TB / 178 TB avail; 20300KB/s rd, 19735KB/s wr,
 2197op/s 2014-09-05 10:44:40.260423 mon.0 [INF] pgmap v12582770: 6860
 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
 avail; 146

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
Actually, it only worked with restarting for a period of time to get the 
recovery process going. Can’t get past the 21k object mark.

I’m uncertain whether the disk really is messing this up right now as well, so 
I’m not keen to start moving 300k objects around.

Regards,
Josef

On 06 Sep 2014, at 14:33, Josef Johansson  wrote:

> Hi,
> 
> On 06 Sep 2014, at 13:53, Christian Balzer  wrote:
> 
>> 
>> Hello,
>> 
>> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
>> 
>>> Also putting this on the list.
>>> 
>>> On 06 Sep 2014, at 13:36, Josef Johansson  wrote:
>>> 
 Hi,
 
 Same issues again, but I think we found the drive that causes the
 problems.
 
 But this is causing problems as it’s trying to do a recover to that
 osd at the moment.
 
 So we’re left with the status message 
 
 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841
 active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB
 used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s;
 41424/15131923 degraded (0.274%);  recovering 0 o/s, 2035KB/s
 
 
 It’s improving, but way too slowly. If I restart the recovery (ceph
 osd set no recovery /unset) it doesn’t change the osd what I can see.
 
 Any ideas?
 
>> I don't know the state of your cluster, i.e. what caused the recovery to
>> start (how many OSDs went down?).
> Performance degradation, databases are the worst impacted. It’s actually a 
> OSD that we put in that’s causing it (removed it again though). So the 
> cluster in itself is healthy.
> 
>> If you have a replication of 3 and only one OSD was involved, what is
>> stopping you from taking that wonky drive/OSD out?
>> 
> There’s data that goes missing if I do that, I guess I have to wait for the 
> recovery process to complete before I can go any further, this is with rep 3.
>> If you don't know that or want to play it safe, how about setting the
>> weight of that OSD to 0? 
>> While that will AFAICT still result in all primary PGs to be evacuated
>> off it, no more writes will happen to it and reads might be faster.
>> In either case, it shouldn't slow down the rest of your cluster anymore.
>> 
> That’s actually one idea I haven’t thought off, I wan’t to play it safe right 
> now and hope that it goes up again, I actually found one wonky way of getting 
> the recovery process from not stalling to a grind, and that was restarting 
> OSDs. One at the time.
> 
> Regards,
> Josef
>> Regards,
>> 
>> Christian
 Cheers,
 Josef
 
 On 05 Sep 2014, at 11:26, Luis Periquito 
 wrote:
 
> Only time I saw such behaviour was when I was deleting a big chunk of
> data from the cluster: all the client activity was reduced, the op/s
> were almost non-existent and there was unjustified delays all over
> the cluster. But all the disks were somewhat busy in atop/iotstat.
> 
> 
> On 5 September 2014 09:51, David  wrote:
> Hi,
> 
> Indeed strange.
> 
> That output was when we had issues, seems that most operations were
> blocked / slow requests.
> 
> A ”baseline” output is more like today:
> 
> 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs:
> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
> avail; 9273KB/s rd, 24650KB/s wr, 2755op/s 2014-09-05 10:44:30.125637
> mon.0 [INF] pgmap v12582760: 6860 pgs: 6860 active+clean; 12253 GB
> data, 36574 GB used, 142 TB / 178 TB avail; 9500KB/s rd, 20430KB/s
> wr, 2294op/s 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761:
> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB /
> 178 TB avail; 9216KB/s rd, 20062KB/s wr, 2488op/s 2014-09-05
> 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860 pgs: 6860
> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
> 12511KB/s rd, 15739KB/s wr, 2488op/s 2014-09-05 10:44:33.161210 mon.0
> [INF] pgmap v12582763: 6860 pgs: 6860 active+clean; 12253 GB data,
> 36574 GB used, 142 TB / 178 TB avail; 18593KB/s rd, 14880KB/s wr,
> 2609op/s 2014-09-05 10:44:34.187294 mon.0 [INF] pgmap v12582764: 6860
> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
> avail; 17720KB/s rd, 22964KB/s wr, 3257op/s 2014-09-05
> 10:44:35.190785 mon.0 [INF] pgmap v12582765: 6860 pgs: 6860
> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
> 19230KB/s rd, 18901KB/s wr, 3199op/s 2014-09-05 10:44:36.213535 mon.0
> [INF] pgmap v12582766: 6860 pgs: 6860 active+clean; 12253 GB data,
> 36574 GB used, 142 TB / 178 TB avail; 17630KB/s rd, 18855KB/s wr,
> 3131op/s 2014-09-05 10:44:37.220052 mon.0 [INF] pgmap v12582767: 6860
> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
> avail; 12262KB/s rd, 18627KB/s wr, 2595op/s 2014-09-05
> 10:44:38.233357 mon.0 [INF] pgmap v12582768: 6860 pgs: 68

Re: [ceph-users] SSD journal deployment experiences

2014-09-06 Thread Dan van der Ster
Hi Christian,

Let's keep debating until a dev corrects us ;)

September 6 2014 1:27 PM, "Christian Balzer"  wrote: 
> On Fri, 5 Sep 2014 09:42:02 + Dan Van Der Ster wrote:
> 
>>> On 05 Sep 2014, at 11:04, Christian Balzer  wrote:
>>> 
>>> On Fri, 5 Sep 2014 07:46:12 + Dan Van Der Ster wrote:
 
> On 05 Sep 2014, at 03:09, Christian Balzer  wrote:
> 
> On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote:
> 
>> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster
>>  wrote:
>> 
> 
> [snip]
> 
>>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful
>>> is the backfilling which results from an SSD failure? Have you
>>> considered tricks like increasing the down out interval so
>>> backfilling doesn’t happen in this case (leaving time for the SSD
>>> to be replaced)?
>>> 
>> 
>> Replacing a failed SSD won't help your backfill. I haven't actually
>> tested it, but I'm pretty sure that losing the journal effectively
>> corrupts your OSDs. I don't know what steps are required to
>> complete this operation, but it wouldn't surprise me if you need to
>> re-format the OSD.
>> 
> This.
> All the threads I've read about this indicate that journal loss
> during operation means OSD loss. Total OSD loss, no recovery.
> From what I gathered the developers are aware of this and it might be
> addressed in the future.
> 
 
 I suppose I need to try it then. I don’t understand why you can't just
 use ceph-osd -i 10 --mkjournal to rebuild osd 10’s journal, for
 example.
 
>>> I think the logic is if you shut down an OSD cleanly beforehand you can
>>> just do that.
>>> However from what I gathered there is no logic to re-issue transactions
>>> that made it to the journal but not the filestore.
>>> So a journal SSD failing mid-operation with a busy OSD would certainly
>>> be in that state.
>>> 
>> 
>> I had thought that the journal write and the buffered filestore write
>> happen at the same time.
> 
> Nope, definitely not.
> 
> That's why we have tunables like the ones at:
> http://ceph.com/docs/master/rados/configuration/filestore-config-ref/#synchronization-intervals
> 
> And people (me included) tend to crank that up (to eleven ^o^).
> 
> The write-out to the filestore may start roughly at the same time as the
> journal gets things, but it can and will fall behind.
> 

filestore max sync interval is the period between the fsync/fdatasync's of the 
outstanding filestore writes, which were sent earlier. By the time the sync 
interval arrives, the OS may have already flushed those writes (sysctl's like 
vm.dirty_ratio, dirty_expire_centisecs, ... apply here). And even if the osd 
crashes and never calls fsync, then the OS will flush those anyway. Of course, 
if a power outage prevents the fsync from ever happening, then the journal 
entry replay is used to re-write the op. The other thing about filestore max 
sync interval is that journal entries are only free'd after the osd has fsync'd 
the related filestore write. That's why the journal size depends on the sync 
interval.
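
(As an aside, the sizing rule of thumb from the Ceph docs follows directly
from that: the journal must hold everything written between syncs, times two
for safety. A quick back-of-the-envelope with placeholder numbers:)

# journal size >= 2 * expected throughput * filestore max sync interval
# e.g. a journal device good for ~200 MB/s and a 10 s max sync interval:
echo $(( 2 * 200 * 10 ))MB      # => 4000MB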


>> So all the previous journal writes that
>> succeeded are already on their way to the filestore. My (could be
>> incorrect) understanding is that the real purpose of the journal is to
>> be able to replay writes after a power outage (since the buffered
>> filestore writes would be lost in that case). If there is no power
>> outage, then filestore writes are still good regardless of a journal
>> failure.
> 
> From Cephs perspective a write is successful once it is on all replica
> size journals.

This is the key point - which I'm not sure about and don't feel like reading 
the code on a Saturday ;) Is a write ack'd after a successful journal write, or 
after the journal _and_ the buffered filestore writes? Is that documented 
somewhere?


> I think (hope) that what you wrote up there to be true, but that doesn't
> change the fact that journal data not even on the way to the filestore yet
> is the crux here.
> 
>>> I'm sure (hope) somebody from the Ceph team will pipe up about this.
>> 
>> Ditto!
> 
> Guess it will be next week...
> 
> Now 200GB DC 3700s can write close to 400MB/s so a 1:4 or even 1:5
> ratio is sensible. However these will be the ones limiting your max
> sequential write speed if that is of importance to you. In nearly all
> use cases you run out of IOPS (on your HDDs) long before that becomes
> an issue, though.
 
 IOPS is definitely the main limit, but we also only have 1 single
 10Gig-E NIC on these servers, so 4 drives that can write (even only
 200MB/s) would be good enough.
 
>>> Fair enough. ^o^
>>> 
 Also, we’ll put the SSDs in the first four ports of an SAS2008 HBA
 which is shared with the other 20 spinning disks. Counting the double
 writes, the HBA will run out of bandwidth before these SSDs, I expect.
 
>>> Depends on what PCIe

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
FWIW I did restart the OSDs until I saw a server that made an impact. Until that 
server stopped having an impact, the number of degraded objects didn’t get any 
lower.
After a while it was done with recovering that OSD and happily started with 
others.
I guess I will be seeing the same behaviour when it gets to replicating the 
same PGs that were causing troubles the first time.

On 06 Sep 2014, at 15:04, Josef Johansson  wrote:

> Actually, it only worked with restarting  for a period of time to get the 
> recovering process going. Can’t get passed the 21k object mark.
> 
> I’m uncertain if the disk really is messing this up right now as well. So I’m 
> not glad to start moving 300k objects around.
> 
> Regards,
> Josef
> 
> On 06 Sep 2014, at 14:33, Josef Johansson  wrote:
> 
>> Hi,
>> 
>> On 06 Sep 2014, at 13:53, Christian Balzer  wrote:
>> 
>>> 
>>> Hello,
>>> 
>>> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
>>> 
 Also putting this on the list.
 
 On 06 Sep 2014, at 13:36, Josef Johansson  wrote:
 
> Hi,
> 
> Same issues again, but I think we found the drive that causes the
> problems.
> 
> But this is causing problems as it’s trying to do a recover to that
> osd at the moment.
> 
> So we’re left with the status message 
> 
> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841
> active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB
> used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s;
> 41424/15131923 degraded (0.274%);  recovering 0 o/s, 2035KB/s
> 
> 
> It’s improving, but way too slowly. If I restart the recovery (ceph
> osd set no recovery /unset) it doesn’t change the osd what I can see.
> 
> Any ideas?
> 
>>> I don't know the state of your cluster, i.e. what caused the recovery to
>>> start (how many OSDs went down?).
>> Performance degradation, databases are the worst impacted. It’s actually a 
>> OSD that we put in that’s causing it (removed it again though). So the 
>> cluster in itself is healthy.
>> 
>>> If you have a replication of 3 and only one OSD was involved, what is
>>> stopping you from taking that wonky drive/OSD out?
>>> 
>> There’s data that goes missing if I do that, I guess I have to wait for the 
>> recovery process to complete before I can go any further, this is with rep 3.
>>> If you don't know that or want to play it safe, how about setting the
>>> weight of that OSD to 0? 
>>> While that will AFAICT still result in all primary PGs to be evacuated
>>> off it, no more writes will happen to it and reads might be faster.
>>> In either case, it shouldn't slow down the rest of your cluster anymore.
>>> 
>> That’s actually one idea I haven’t thought off, I wan’t to play it safe 
>> right now and hope that it goes up again, I actually found one wonky way of 
>> getting the recovery process from not stalling to a grind, and that was 
>> restarting OSDs. One at the time.
>> 
>> Regards,
>> Josef
>>> Regards,
>>> 
>>> Christian
> Cheers,
> Josef
> 
> On 05 Sep 2014, at 11:26, Luis Periquito 
> wrote:
> 
>> Only time I saw such behaviour was when I was deleting a big chunk of
>> data from the cluster: all the client activity was reduced, the op/s
>> were almost non-existent and there was unjustified delays all over
>> the cluster. But all the disks were somewhat busy in atop/iotstat.
>> 
>> 
>> On 5 September 2014 09:51, David  wrote:
>> Hi,
>> 
>> Indeed strange.
>> 
>> That output was when we had issues, seems that most operations were
>> blocked / slow requests.
>> 
>> A ”baseline” output is more like today:
>> 
>> 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs:
>> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>> avail; 9273KB/s rd, 24650KB/s wr, 2755op/s 2014-09-05 10:44:30.125637
>> mon.0 [INF] pgmap v12582760: 6860 pgs: 6860 active+clean; 12253 GB
>> data, 36574 GB used, 142 TB / 178 TB avail; 9500KB/s rd, 20430KB/s
>> wr, 2294op/s 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761:
>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB /
>> 178 TB avail; 9216KB/s rd, 20062KB/s wr, 2488op/s 2014-09-05
>> 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860 pgs: 6860
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
>> 12511KB/s rd, 15739KB/s wr, 2488op/s 2014-09-05 10:44:33.161210 mon.0
>> [INF] pgmap v12582763: 6860 pgs: 6860 active+clean; 12253 GB data,
>> 36574 GB used, 142 TB / 178 TB avail; 18593KB/s rd, 14880KB/s wr,
>> 2609op/s 2014-09-05 10:44:34.187294 mon.0 [INF] pgmap v12582764: 6860
>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>> avail; 17720KB/s rd, 22964KB/s wr, 3257op/s 2014-09-05
>> 10:44:35.190785 mon.0 [INF] pgmap v12582765: 6860 pgs: 6860
>> active+cle

Re: [ceph-users] SSD journal deployment experiences

2014-09-06 Thread Christian Balzer
On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote:

> Hi Christian,
> 
> Let's keep debating until a dev corrects us ;)
> 
For the time being, I'll offer the recent:

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html

And not so recent:
http://www.spinics.net/lists/ceph-users/msg04152.html
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021

And I'm not going to use BTRFS for mainly RBD backed VM images
(fragmentation city), never mind the other stability issues that crop up
here ever so often.

> September 6 2014 1:27 PM, "Christian Balzer"  wrote: 
> > On Fri, 5 Sep 2014 09:42:02 + Dan Van Der Ster wrote:
> > 
> >>> On 05 Sep 2014, at 11:04, Christian Balzer  wrote:
> >>> 
> >>> On Fri, 5 Sep 2014 07:46:12 + Dan Van Der Ster wrote:
>  
> > On 05 Sep 2014, at 03:09, Christian Balzer  wrote:
> > 
> > On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote:
> > 
> >> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster
> >>  wrote:
> >> 
> > 
> > [snip]
> > 
> >>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how
> >>> painful is the backfilling which results from an SSD failure?
> >>> Have you considered tricks like increasing the down out interval
> >>> so backfilling doesn’t happen in this case (leaving time for the
> >>> SSD to be replaced)?
> >>> 
> >> 
> >> Replacing a failed SSD won't help your backfill. I haven't
> >> actually tested it, but I'm pretty sure that losing the journal
> >> effectively corrupts your OSDs. I don't know what steps are
> >> required to complete this operation, but it wouldn't surprise me
> >> if you need to re-format the OSD.
> >> 
> > This.
> > All the threads I've read about this indicate that journal loss
> > during operation means OSD loss. Total OSD loss, no recovery.
> > From what I gathered the developers are aware of this and it might
> > be addressed in the future.
> > 
>  
>  I suppose I need to try it then. I don’t understand why you can't
>  just use ceph-osd -i 10 --mkjournal to rebuild osd 10’s journal, for
>  example.
>  
> >>> I think the logic is if you shut down an OSD cleanly beforehand you
> >>> can just do that.
> >>> However from what I gathered there is no logic to re-issue
> >>> transactions that made it to the journal but not the filestore.
> >>> So a journal SSD failing mid-operation with a busy OSD would
> >>> certainly be in that state.
> >>> 
> >> 
> >> I had thought that the journal write and the buffered filestore write
> >> happen at the same time.
> > 
> > Nope, definitely not.
> > 
> > That's why we have tunables like the ones at:
> > http://ceph.com/docs/master/rados/configuration/filestore-config-ref/#synchronization-intervals
> > 
> > And people (me included) tend to crank that up (to eleven ^o^).
> > 
> > The write-out to the filestore may start roughly at the same time as
> > the journal gets things, but it can and will fall behind.
> > 
> 
> filestore max sync interval is the period between the fsync/fdatasync's
> of the outstanding filestore writes, which were sent earlier. By the
> time the sync interval arrives, the OS may have already flushed those
> writes (sysctl's like vm.dirty_ratio, dirty_expire_centisecs, ... apply
> here). And even if the osd crashes and never calls fsync, then the OS
> will flush those anyway. Of course, if a power outage prevents the fsync
> from ever happening, then the journal entry replay is used to re-write
> the op. The other thing about filestore max sync interval is that
> journal entries are only free'd after the osd has fsync'd the related
> filestore write. That's why the journal size depends on the sync
> interval.
> 
> 
> >> So all the previous journal writes that
> >> succeeded are already on their way to the filestore. My (could be
> >> incorrect) understanding is that the real purpose of the journal is to
> >> be able to replay writes after a power outage (since the buffered
> >> filestore writes would be lost in that case). If there is no power
> >> outage, then filestore writes are still good regardless of a journal
> >> failure.
> > 
> > From Cephs perspective a write is successful once it is on all replica
> > size journals.
> 
> This is the key point - which I'm not sure about and don't feel like
> reading the code on a Saturday ;) Is a write ack'd after a successful
> journal write, or after the journal _and_ the buffered filestore writes?
> Is that documented somewhere?
> 
http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/

Search for "acknowledgement" if you don't want to read the full thing. ^o^

> 
> > I think (hope) that what you wrote up there to be true, but that
> > doesn't change the fact that journal data not even on the way to the
> > filestore yet is the crux here.
> > 
> >>> I'm sure (hope) somebody from the Ceph team will pipe up about this.
> >> 
> >> Ditto!
> > 
> >

Re: [ceph-users] ceph osd unexpected error

2014-09-06 Thread Haomai Wang
Hi,

Could you give some more detailed info, such as what operations happened before the error occurred?

And what's your ceph version?


On Fri, Sep 5, 2014 at 3:16 PM, 廖建锋  wrote:

>   Dear Ceph,
> An urgent question: I hit a "FAILED assert(0 == "unexpected error")"
>  yesterday, and now I have no way to start this OSD.
> I have attached my logs in the attachment, and some ceph configuration
>  is below
>
>
>  osd_pool_default_pgp_num = 300
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 300
> mon_host = 10.1.0.213,10.1.0.214
> osd_crush_chooseleaf_type = 1
> mds_cache_size = 50
> osd objectstore = keyvaluestore-dev
>
>
>
>  Detailed error information :
>
>
> -13> 2014-09-05 15:07:35.279863 7f4d988b9700 2 waiting 51 > 50 ops ||
> 11642907 > 104857600
> -12> 2014-09-05 15:07:35.279899 7f4d978b7700 2 waiting 51 > 50 ops ||
> 11642899 > 104857600
> -11> 2014-09-05 15:07:35.279919 7f4d990ba700 2 waiting 51 > 50 ops ||
> 11642901 > 104857600
> -10> 2014-09-05 15:07:35.326803 7f4d9a8bd700 10 monclient: tick
> -9> 2014-09-05 15:07:35.326837 7f4d9a8bd700 10 monclient:
> _check_auth_rotating have uptodate secrets (they expire after 2014-09-05
> 15:07:05.326835)
> -8> 2014-09-05 15:07:35.326871 7f4d9a8bd700 10 monclient: renew subs?
> (now: 2014-09-05 15:07:35.326871; renew after: 2014-09-05 15:10:02.464341)
> -- no
> -7> 2014-09-05 15:07:35.343657 7f4d978b7700 2 waiting 51 > 50 ops ||
> 11044551 > 104857600
> -6> 2014-09-05 15:07:35.343654 7f4e1ee72700 1 -- 10.1.0.221:6801/4013 -->
> osd.12 10.1.0.219:6810/32654 -- pg_info(1 pgs e1267:0.f1) v4 -- ?+0
> 0x18dcf000
> -5> 2014-09-05 15:07:35.343680 7f4d990ba700 2 waiting 51 > 50 ops ||
> 11044553 > 104857600
> -4> 2014-09-05 15:07:35.343686 7f4d988b9700 2 waiting 51 > 50 ops ||
> 11044579 > 104857600
> -3> 2014-09-05 15:07:35.344875 7f4e1fe74700 0 error (22) Invalid argument
> not handled on operation 9 (336.0.3, or op 3, counting from 0)
> -2> 2014-09-05 15:07:35.344902 7f4e1fe74700 0 unexpected error code
> -1> 2014-09-05 15:07:35.344903 7f4e1fe74700 0 transaction dump:
> { "ops": [
> { "op_num": 0,
> "op_name": "remove",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0"},
> { "op_num": 1,
> "op_name": "mkcoll",
> "collection": "0.a9_TEMP"},
> { "op_num": 2,
> "op_name": "remove",
> "collection": "0.a9_TEMP",
> "oid": "4b0fea9\/153b885.\/head\/\/0"},
> { "op_num": 3,
> "op_name": "touch",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0"},
> { "op_num": 4,
> "op_name": "omap_setheader",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0",
> "header_length": "0"},
> { "op_num": 5,
> "op_name": "write",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0",
> "length": 1160,
> "offset": 0,
> "bufferlist length": 1160},
> { "op_num": 6,
> "op_name": "omap_setkeys",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0",
> "attr_lens": {}},
> { "op_num": 7,
> "op_name": "setattrs",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0",
> "attr_lens": { "_": 239,
> "_parent": 250,
> "snapset": 31}},
> { "op_num": 8,
> "op_name": "omap_setkeys",
> "collection": "meta",
> "oid": "16ef7597\/infos\/head\/\/-1",
> "attr_lens": { "0.a9_epoch": 4,
> "0.a9_info": 684}},
> { "op_num": 9,
> "op_name": "remove",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0"},
> { "op_num": 10,
> "op_name": "remove",
> "collection": "0.a9_TEMP",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0"},
> { "op_num": 11,
> "op_name": "touch",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0"},
> { "op_num": 12,
> "op_name": "omap_setheader",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0",
> "header_length": "0"},
> { "op_num": 13,
> "op_name": "write",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0",
> "length": 507284,
> "offset": 0,
> "bufferlist length": 507284},
> { "op_num": 14,
> "op_name": "omap_setkeys",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0",
> "attr_lens": {}},
> { "op_num": 15,
> "op_name": "setattrs",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0",
> "attr_lens": { "_": 239,
> "snapset": 31}},
> { "op_num": 16,
> "op_name": "omap_setkeys",
> "collection": "meta",
> "oid": "16ef7597\/infos\/head\/\/-1",
> "attr_lens": { "0.a9_epoch": 4,
> "0.a9_info": 684}},
> { "op_num": 17,
> "op_name": "remove",
> "collection": "0.a9_head",
> "oid": "794064a9\/1c040e0.\/head\/\/0"},
> { "op_num": 18,
> "op_name": "remove",
> "collection": "0.a9_TEMP",
> "oid": "794064a9\/1c040e0.\/head\/\/0"},
> { "op_num": 19,
> "op_name": "touch",
> "collection": "0.a9_head",
> "oid": "794064a9\/1c040e0.\/head\/\/0"},
> { "op_num": 20,
> "op_name": "omap_seth


Re: [ceph-users] ceph cluster inconsistency keyvaluestore

2014-09-06 Thread Haomai Wang
Sorry for the late message, I'm back from a short vacation. I would
like to try it this weekend. Thanks for your patience :-)

On Wed, Sep 3, 2014 at 9:16 PM, Kenneth Waegeman
 wrote:
> I also can reproduce it on a new slightly different set up (also EC on KV
> and Cache) by running ceph pg scrub on a KV pg: this pg will then get the
> 'inconsistent' status
>
>
>
> - Message from Kenneth Waegeman  -
>Date: Mon, 01 Sep 2014 16:28:31 +0200
>From: Kenneth Waegeman 
> Subject: Re: ceph cluster inconsistency keyvaluestore
>  To: Haomai Wang 
>  Cc: ceph-users@lists.ceph.com
>
>
>
>> Hi,
>>
>>
>> The cluster got installed with quattor, which uses ceph-deploy for
>> installation of daemons, writes the config file and installs the crushmap.
>> I have 3 hosts, each 12 disks, having a large KV partition (3.6T) for the
>> ECdata pool and a small cache partition (50G) for the cache
>>
>> I manually did this:
>>
>> ceph osd pool create cache 1024 1024
>> ceph osd pool set cache size 2
>> ceph osd pool set cache min_size 1
>> ceph osd erasure-code-profile set profile11 k=8 m=3
>> ruleset-failure-domain=osd
>> ceph osd pool create ecdata 128 128 erasure profile11
>> ceph osd tier add ecdata cache
>> ceph osd tier cache-mode cache writeback
>> ceph osd tier set-overlay ecdata cache
>> ceph osd pool set cache hit_set_type bloom
>> ceph osd pool set cache hit_set_count 1
>> ceph osd pool set cache hit_set_period 3600
>> ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))
>>
>> (But the previous time I had the problem already without the cache part)
>>
>>
>>
>> Cluster live since 2014-08-29 15:34:16
>>
>> Config file on host ceph001:
>>
>> [global]
>> auth_client_required = cephx
>> auth_cluster_required = cephx
>> auth_service_required = cephx
>> cluster_network = 10.143.8.0/24
>> filestore_xattr_use_omap = 1
>> fsid = 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>> mon_cluster_log_to_syslog = 1
>> mon_host = ceph001.cubone.os, ceph002.cubone.os, ceph003.cubone.os
>> mon_initial_members = ceph001, ceph002, ceph003
>> osd_crush_update_on_start = 0
>> osd_journal_size = 10240
>> osd_pool_default_min_size = 2
>> osd_pool_default_pg_num = 512
>> osd_pool_default_pgp_num = 512
>> osd_pool_default_size = 3
>> public_network = 10.141.8.0/24
>>
>> [osd.11]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.13]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.15]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.17]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.19]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.21]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.23]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.25]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.3]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.5]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.7]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.9]
>> osd_objectstore = keyvaluestore-dev
>>
>>
>> OSDs:
>> # id   weight   type name               up/down  reweight
>> -12 140.6   root default-cache
>> -9  46.87   host ceph001-cache
>> 2   3.906   osd.2   up  1
>> 4   3.906   osd.4   up  1
>> 6   3.906   osd.6   up  1
>> 8   3.906   osd.8   up  1
>> 10  3.906   osd.10  up  1
>> 12  3.906   osd.12  up  1
>> 14  3.906   osd.14  up  1
>> 16  3.906   osd.16  up  1
>> 18  3.906   osd.18  up  1
>> 20  3.906   osd.20  up  1
>> 22  3.906   osd.22  up  1
>> 24  3.906   osd.24  up  1
>> -10 46.87   host ceph002-cache
>> 28  3.906   osd.28  up  1
>> 30  3.906   osd.30  up  1
>> 32  3.906   osd.32  up  1
>> 34  3.906   osd.34  up  1
>> 36  3.906   osd.36  up  1
>> 38  3.906   osd.38  up  1
>> 40  3.906   osd.40  up  1
>> 42  3.906   osd.42  up  1
>> 44  3.906   osd.44  up  1
>> 46  3.906   osd.46  up  1
>> 48  3.906   osd.48  up  1
>> 50  3.906   osd.50  up  1
>> -11 46.87   host ceph003-cache
>> 54  3.906   osd.54  up  1
>> 56  3.906   osd.56  up  1
>> 58  3.906   osd.58  up  1
>> 60  3.906   osd.60  up  1
>> 62  3.906   osd.62  up  1
>> 64  3.906   osd.64  up  1
>> 66  3.906   osd.66  up  1
>> 68  3.906   osd.68  up  1
>> 70  3.906   osd.70  up  1
>> 72  3.906   osd.72  up  1
>> 74  3.906 

Re: [ceph-users] SSD journal deployment experiences

2014-09-06 Thread Dan van der Ster
September 6 2014 4:01 PM, "Christian Balzer"  wrote: 
> On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote:
> 
>> Hi Christian,
>> 
>> Let's keep debating until a dev corrects us ;)
> 
> For the time being, I give the recent:
> 
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html
> 
> And not so recent:
> http://www.spinics.net/lists/ceph-users/msg04152.html
> http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
> 
> And I'm not going to use BTRFS for mainly RBD backed VM images
> (fragmentation city), never mind the other stability issues that crop up
> here ever so often.


Thanks for the links... So until I learn otherwise, I'd better assume the OSD is
lost when the journal fails, even though I haven't understood exactly why :(
I'm going to UTSL (use the source) to understand the consistency handling better.
An op state diagram would help, but I haven't found one yet.

BTW, do you happen to know, _if_ we re-use an OSD after the journal has failed, 
are any object inconsistencies going to be found by a scrub/deep-scrub?
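
For what it's worth, the journal-replacement sequence usually quoted (assuming the
new journal device is already partitioned and the OSD id below is a placeholder)
looks roughly like this; whether it is actually safe after an *unclean* journal
loss is exactly the open question above:

   ceph osd set noout
   service ceph stop osd.42          # init syntax differs per distro
   ceph-osd -i 42 --mkjournal        # recreate the journal the OSD's config points at
   service ceph start osd.42
   ceph osd unset noout
   ceph osd deep-scrub 42            # then check whether any scrub errors surface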

>> 
>> We have 4 servers in a 3U rack, then each of those servers is connected
>> to one of these enclosures with a single SAS cable.
>> 
 With the current config, when I dd to all drives in parallel I can
 write at 24*74MB/s = 1776MB/s.
>>> 
>>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0
>>> lanes, so as far as that bus goes, it can do 4GB/s.
>>> And given your storage pod I assume it is connected with 2 mini-SAS
>>> cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA
>>> bandwidth.
>> 
>> From above, we are only using 4 lanes -- so around 2GB/s is expected.
> 
> Alright, that explains that then. Any reason for not using both ports?
> 

Probably to minimize costs, and since the single 10Gig-E is a bottleneck anyway.
The whole thing is suboptimal anyway, since this hardware was not purchased for 
Ceph to begin with.
Hence retrofitting SSDs, etc...

>>> Impressive, even given your huge cluster with 1128 OSDs.
>>> However that's not really answering my question, how much data is on an
>>> average OSD and thus gets backfilled in that hour?
>> 
>> That's true -- our drives have around 300TB on them. So I guess it will
>> take longer - 3x longer - when the drives are 1TB full.
> 
> On your slides, when the crazy user filled the cluster with 250 million
> objects and thus 1PB of data, I recall seeing a 7 hour backfill time?
> 

Yeah that was fun :) It was 250 million (mostly) 4k objects, so not close to 
1PB. The point was that to fill the cluster with RBD, we'd need 250 million 
(4MB) objects. So, object-count-wise this was a full cluster, but for the real 
volume it was more like 70TB IIRC (there were some other larger objects too).
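
Back-of-the-envelope, for anyone else reading along:

   250M objects x 4 MB (RBD-sized)   ~= 1 PB   <- a notionally "full" RBD cluster
   250M objects x 4 KB (this test)   ~= 1 TB   <- so the ~70 TB actually used presumably
                                                  came from the larger objects and replicas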

In that case, the backfilling was CPU-bound, or perhaps wbthrottle-bound, I 
don't remember... It was just that there were many tiny tiny objects to 
synchronize.

> Anyway, I guess the lesson to take away from this is that size and
> parallelism does indeed help, but even in a cluster like yours recovering
> from a 2TB loss would likely be in the 10 hour range...

Bigger clusters probably backfill faster simply because there are more OSDs 
involved in the backfilling. In our cluster we initially get 30-40 backfills in 
parallel after 1 OSD fails. That's even with max backfills = 1. The backfilling 
sorta follows an 80/20 rule -- 80% of the time is spent backfilling the last 
20% of the PGs, just because some OSDs randomly get more new PGs than the 
others.

> Again, see the "Best practice K/M-parameters EC pool" thread. ^.^

Marked that one to read, again.

Cheers, dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] resizing the OSD

2014-09-06 Thread Christian Balzer

Hello,

On Fri, 05 Sep 2014 15:31:01 -0700 JIten Shah wrote:

> Hello Cephers,
> 
> We created a ceph cluster with 100 OSDs, 5 MONs and 1 MDS, and most of the
> stuff seems to be working fine but we are seeing some degrading on the
> osd's due to lack of space on the osd's. 

Please elaborate on that degradation.

> Is there a way to resize the
> OSD without bringing the cluster down?
> 

Define both "resize" and "cluster down".

As in, resizing how? 
Are your current OSDs on disks/LVMs that are not fully used and thus could
be grown?
What is the size of your current OSDs?

The normal way of growing a cluster is to add more OSDs.
Preferably of the same size and same performance disks.
This will not only simplify things immensely but also make them a lot more
predictable.
This of course depends on your use case and usage patterns, but often when
running out of space you're also running out of other resources like CPU,
memory or IOPS of the disks involved. So adding more instead of growing
them is most likely the way forward.

If you were to replace actual disks with larger ones, take them (the OSDs)
out one at a time and re-add it. If you're using ceph-deploy, it will use
the disk size as basic weight, if you're doing things manually make sure
to specify that size/weight accordingly.
Again, you do want to do this for all disks to keep things uniform.
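
If you do the weights by hand, adjusting an OSD's CRUSH weight to match the new
disk size looks like this (osd.12 and the value are placeholders; the usual
convention is weight = size in TB):

   ceph osd crush reweight osd.12 3.64    # e.g. a 4 TB disk
   ceph osd tree                          # verify the weight and placement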

If your cluster (pools really) are set to a replica size of at least 2
(risky!) or 3 (as per Firefly default), taking a single OSD out would of
course never bring the cluster down.
However taking an OSD out and/or adding a new one will cause data movement
that might impact your cluster's performance.
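
If that data movement does bite, the knobs people usually turn down first (the
values here are just conservative examples, defaults vary by release) can be
changed on the fly:

   ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

and made permanent under [osd] in ceph.conf.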

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
We managed to get through the restore, but the performance degradation is still
there.

I’m looking through the OSDs to pinpoint the source of the degradation, hoping the
current load will come down.

I’m a bit wary of setting an OSD's weight to 0 -- wouldn’t that be painful if
the degradation is still there afterwards? I.e. if I set the weight back, would
it move all the PGs back?

Regards,
Josef

On 06 Sep 2014, at 15:52, Josef Johansson  wrote:

> FWI I did restart the OSDs until I saw a server that made impact. Until that 
> server stopped doing impact, I didn’t get lower in the number objects being 
> degraded.
> After a while it was done with recovering that OSD and happily started with 
> others.
> I guess I will be seeing the same behaviour when it gets to replicating the 
> same PGs that were causing troubles the first time.
> 
> On 06 Sep 2014, at 15:04, Josef Johansson  wrote:
> 
>> Actually, it only worked with restarting  for a period of time to get the 
>> recovering process going. Can’t get passed the 21k object mark.
>> 
>> I’m uncertain if the disk really is messing this up right now as well. So 
>> I’m not glad to start moving 300k objects around.
>> 
>> Regards,
>> Josef
>> 
>> On 06 Sep 2014, at 14:33, Josef Johansson  wrote:
>> 
>>> Hi,
>>> 
>>> On 06 Sep 2014, at 13:53, Christian Balzer  wrote:
>>> 
 
 Hello,
 
 On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
 
> Also putting this on the list.
> 
> On 06 Sep 2014, at 13:36, Josef Johansson  wrote:
> 
>> Hi,
>> 
>> Same issues again, but I think we found the drive that causes the
>> problems.
>> 
>> But this is causing problems as it’s trying to do a recover to that
>> osd at the moment.
>> 
>> So we’re left with the status message 
>> 
>> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841
>> active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB
>> used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s;
>> 41424/15131923 degraded (0.274%);  recovering 0 o/s, 2035KB/s
>> 
>> 
>> It’s improving, but way too slowly. If I restart the recovery (ceph
>> osd set no recovery /unset) it doesn’t change the osd what I can see.
>> 
>> Any ideas?
>> 
 I don't know the state of your cluster, i.e. what caused the recovery to
 start (how many OSDs went down?).
>>> Performance degradation, databases are the worst impacted. It’s actually a 
>>> OSD that we put in that’s causing it (removed it again though). So the 
>>> cluster in itself is healthy.
>>> 
 If you have a replication of 3 and only one OSD was involved, what is
 stopping you from taking that wonky drive/OSD out?
 
>>> There’s data that goes missing if I do that, I guess I have to wait for the 
>>> recovery process to complete before I can go any further, this is with rep 
>>> 3.
 If you don't know that or want to play it safe, how about setting the
 weight of that OSD to 0? 
 While that will AFAICT still result in all primary PGs to be evacuated
 off it, no more writes will happen to it and reads might be faster.
 In either case, it shouldn't slow down the rest of your cluster anymore.
 
>>> That’s actually one idea I haven’t thought off, I wan’t to play it safe 
>>> right now and hope that it goes up again, I actually found one wonky way of 
>>> getting the recovery process from not stalling to a grind, and that was 
>>> restarting OSDs. One at the time.
>>> 
>>> Regards,
>>> Josef
 Regards,
 
 Christian
>> Cheers,
>> Josef
>> 
>> On 05 Sep 2014, at 11:26, Luis Periquito 
>> wrote:
>> 
>>> Only time I saw such behaviour was when I was deleting a big chunk of
>>> data from the cluster: all the client activity was reduced, the op/s
>>> were almost non-existent and there was unjustified delays all over
>>> the cluster. But all the disks were somewhat busy in atop/iotstat.
>>> 
>>> 
>>> On 5 September 2014 09:51, David  wrote:
>>> Hi,
>>> 
>>> Indeed strange.
>>> 
>>> That output was when we had issues, seems that most operations were
>>> blocked / slow requests.
>>> 
>>> A ”baseline” output is more like today:
>>> 
>>> 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs:
>>> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>> avail; 9273KB/s rd, 24650KB/s wr, 2755op/s 2014-09-05 10:44:30.125637
>>> mon.0 [INF] pgmap v12582760: 6860 pgs: 6860 active+clean; 12253 GB
>>> data, 36574 GB used, 142 TB / 178 TB avail; 9500KB/s rd, 20430KB/s
>>> wr, 2294op/s 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761:
>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB /
>>> 178 TB avail; 9216KB/s rd, 20062KB/s wr, 2488op/s 2014-09-05
>>> 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860 pgs: 6860
>>> active+clea

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Christian Balzer

Hello,

On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:

> We manage to go through the restore, but the performance degradation is
> still there.
>
Manifesting itself how?
 
> Looking through the OSDs to pinpoint a source of the degradation and
> hoping the current load will be lowered.
> 

You're the one looking at your cluster, the iostat, atop, iotop and
whatnot data.
If one particular OSD/disk stands out, investigate it, as per the "Good
way to monitor detailed latency/throughput" thread. 

If you have a spare and idle machine that is identical to your storage
nodes, you could run a fio benchmark on a disk there and then compare the
results to that of your suspect disk after setting your cluster to noout
and stopping that particular OSD.
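
As a starting point, a non-destructive random read latency test along these lines
(the device name is a placeholder; a write test should only ever be pointed at a
scratch partition or file, never at a live OSD's data) lets you compare the suspect
disk against a healthy sibling:

   fio --name=latcheck --filename=/dev/sdX --ioengine=libaio --direct=1 \
       --rw=randread --bs=4k --iodepth=16 --runtime=60 --time_based --group_reporting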

> I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be
> tough if the degradation is still there afterwards? i.e. if I set back
> the weight would it move back all the PGs?
>
Of course.

Until you can determine that a specific OSD/disk is the culprit, don't do
that. 
If you have the evidence, go ahead.
 
Regards,

Christian

> Regards,
> Josef
> 
> On 06 Sep 2014, at 15:52, Josef Johansson  wrote:
> 
> > FWI I did restart the OSDs until I saw a server that made impact.
> > Until that server stopped doing impact, I didn’t get lower in the
> > number objects being degraded. After a while it was done with
> > recovering that OSD and happily started with others. I guess I will be
> > seeing the same behaviour when it gets to replicating the same PGs
> > that were causing troubles the first time.
> > 
> > On 06 Sep 2014, at 15:04, Josef Johansson  wrote:
> > 
> >> Actually, it only worked with restarting  for a period of time to get
> >> the recovering process going. Can’t get passed the 21k object mark.
> >> 
> >> I’m uncertain if the disk really is messing this up right now as
> >> well. So I’m not glad to start moving 300k objects around.
> >> 
> >> Regards,
> >> Josef
> >> 
> >> On 06 Sep 2014, at 14:33, Josef Johansson  wrote:
> >> 
> >>> Hi,
> >>> 
> >>> On 06 Sep 2014, at 13:53, Christian Balzer  wrote:
> >>> 
>  
>  Hello,
>  
>  On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
>  
> > Also putting this on the list.
> > 
> > On 06 Sep 2014, at 13:36, Josef Johansson 
> > wrote:
> > 
> >> Hi,
> >> 
> >> Same issues again, but I think we found the drive that causes the
> >> problems.
> >> 
> >> But this is causing problems as it’s trying to do a recover to
> >> that osd at the moment.
> >> 
> >> So we’re left with the status message 
> >> 
> >> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs:
> >> 6841 active+clean, 19 active+remapped+backfilling; 12299 GB data,
> >> 36882 GB used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr,
> >> 74op/s; 41424/15131923 degraded (0.274%);  recovering 0 o/s,
> >> 2035KB/s
> >> 
> >> 
> >> It’s improving, but way too slowly. If I restart the recovery
> >> (ceph osd set no recovery /unset) it doesn’t change the osd what
> >> I can see.
> >> 
> >> Any ideas?
> >> 
>  I don't know the state of your cluster, i.e. what caused the
>  recovery to start (how many OSDs went down?).
> >>> Performance degradation, databases are the worst impacted. It’s
> >>> actually a OSD that we put in that’s causing it (removed it again
> >>> though). So the cluster in itself is healthy.
> >>> 
>  If you have a replication of 3 and only one OSD was involved, what
>  is stopping you from taking that wonky drive/OSD out?
>  
> >>> There’s data that goes missing if I do that, I guess I have to wait
> >>> for the recovery process to complete before I can go any further,
> >>> this is with rep 3.
>  If you don't know that or want to play it safe, how about setting
>  the weight of that OSD to 0? 
>  While that will AFAICT still result in all primary PGs to be
>  evacuated off it, no more writes will happen to it and reads might
>  be faster. In either case, it shouldn't slow down the rest of your
>  cluster anymore.
>  
> >>> That’s actually one idea I haven’t thought off, I wan’t to play it
> >>> safe right now and hope that it goes up again, I actually found one
> >>> wonky way of getting the recovery process from not stalling to a
> >>> grind, and that was restarting OSDs. One at the time.
> >>> 
> >>> Regards,
> >>> Josef
>  Regards,
>  
>  Christian
> >> Cheers,
> >> Josef
> >> 
> >> On 05 Sep 2014, at 11:26, Luis Periquito
> >>  wrote:
> >> 
> >>> Only time I saw such behaviour was when I was deleting a big
> >>> chunk of data from the cluster: all the client activity was
> >>> reduced, the op/s were almost non-existent and there was
> >>> unjustified delays all over the cluster. But all the disks were
> >>> somewhat busy in atop/iotstat.
> >>> 
> >>> 
> >>> On 5 

Re: [ceph-users] SSD journal deployment experiences

2014-09-06 Thread Christian Balzer
On Sat, 6 Sep 2014 14:50:20 + Dan van der Ster wrote:

> September 6 2014 4:01 PM, "Christian Balzer"  wrote: 
> > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote:
> > 
> >> Hi Christian,
> >> 
> >> Let's keep debating until a dev corrects us ;)
> > 
> > For the time being, I give the recent:
> > 
> > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html
> > 
> > And not so recent:
> > http://www.spinics.net/lists/ceph-users/msg04152.html
> > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
> > 
> > And I'm not going to use BTRFS for mainly RBD backed VM images
> > (fragmentation city), never mind the other stability issues that crop
> > up here ever so often.
> 
> 
> Thanks for the links... So until I learn otherwise, I better assume the
> OSD is lost when the journal fails. Even though I haven't understood
> exactly why :( I'm going to UTSL to understand the consistency better.
> An op state diagram would help, but I didn't find one yet.
> 
Using the source as an option of last resort is always nice, having to
actually do so for something like this feels a bit lacking in the
documentation department (that or my google foo being weak). ^o^

> BTW, do you happen to know, _if_ we re-use an OSD after the journal has
> failed, are any object inconsistencies going to be found by a
> scrub/deep-scrub?
> 
No idea. 
And really a scenario I hope to never encounter. ^^;;

> >> 
> >> We have 4 servers in a 3U rack, then each of those servers is
> >> connected to one of these enclosures with a single SAS cable.
> >> 
>  With the current config, when I dd to all drives in parallel I can
>  write at 24*74MB/s = 1776MB/s.
> >>> 
> >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0
> >>> lanes, so as far as that bus goes, it can do 4GB/s.
> >>> And given your storage pod I assume it is connected with 2 mini-SAS
> >>> cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA
> >>> bandwidth.
> >> 
> >> From above, we are only using 4 lanes -- so around 2GB/s is expected.
> > 
> > Alright, that explains that then. Any reason for not using both ports?
> > 
> 
> Probably to minimize costs, and since the single 10Gig-E is a bottleneck
> anyway. The whole thing is suboptimal anyway, since this hardware was
> not purchased for Ceph to begin with. Hence retrofitting SSDs, etc...
>
The single 10Gb/s link is the bottleneck for sustained stuff, but when
looking at spikes...
Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port
might also get some loving. ^o^

The cluster I'm currently building is based on storage nodes with 4 SSDs
(100GB DC 3700s, so 800MB/s would be the absolute write speed limit) and 8
HDDs. Connected with 40Gb/s Infiniband. Dual port, dual switch for
redundancy, not speed. ^^ 
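
Rough numbers for that build, assuming ~200MB/s sequential writes per 100GB DC S3700
and typical 7.2k spinners:

   4 SSD journals x ~200 MB/s  ~= 800 MB/s of journal bandwidth per node
   8 HDDs         x ~100 MB/s  ~= 800 MB/s of filestore bandwidth

so journals and spinners are roughly balanced and the 40Gb/s link is never the limit.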
 
> >>> Impressive, even given your huge cluster with 1128 OSDs.
> >>> However that's not really answering my question, how much data is on
> >>> an average OSD and thus gets backfilled in that hour?
> >> 
> >> That's true -- our drives have around 300TB on them. So I guess it
> >> will take longer - 3x longer - when the drives are 1TB full.
> > 
> > On your slides, when the crazy user filled the cluster with 250 million
> > objects and thus 1PB of data, I recall seeing a 7 hour backfill time?
> > 
> 
> Yeah that was fun :) It was 250 million (mostly) 4k objects, so not
> close to 1PB. The point was that to fill the cluster with RBD, we'd need
> 250 million (4MB) objects. So, object-count-wise this was a full
> cluster, but for the real volume it was more like 70TB IIRC (there were
> some other larger objects too).
> 
Ah, I see. ^^

> In that case, the backfilling was CPU-bound, or perhaps
> wbthrottle-bound, I don't remember... It was just that there were many
> tiny tiny objects to synchronize.
> 
Indeed. This is something me and others have seen as well, as in
backfilling being much slower than the underlying HW would permit and
being CPU intensive.

> > Anyway, I guess the lesson to take away from this is that size and
> > parallelism does indeed help, but even in a cluster like yours
> > recovering from a 2TB loss would likely be in the 10 hour range...
> 
> Bigger clusters probably backfill faster simply because there are more
> OSDs involved in the backfilling. In our cluster we initially get 30-40
> backfills in parallel after 1 OSD fails. That's even with max backfills
> = 1. The backfilling sorta follows an 80/20 rule -- 80% of the time is
> spent backfilling the last 20% of the PGs, just because some OSDs
> randomly get more new PGs than the others.
> 
You still being on dumpling probably doesn't help that uneven distribution
bit.
Definitely another data point to go into a realistic recovery/reliability
model, though.

Christian

> > Again, see the "Best practice K/M-parameters EC pool" thread. ^.^
> 
> Marked that one to read, again.
> 
> Cheers, dan
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
Hi,

On 06 Sep 2014, at 17:27, Christian Balzer  wrote:

> 
> Hello,
> 
> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
> 
>> We manage to go through the restore, but the performance degradation is
>> still there.
>> 
> Manifesting itself how?
> 
Awfully slow IO on the VMs, it’s about 2MB/s or so.
But mostly a lot of iowait.

>> Looking through the OSDs to pinpoint a source of the degradation and
>> hoping the current load will be lowered.
>> 
> 
> You're the one looking at your cluster, the iostat, atop, iotop and
> whatnot data.
> If one particular OSD/disk stands out, investigate it, as per the "Good
> way to monitor detailed latency/throughput" thread. 
> 
Will read it through.
> If you have a spare and idle machine that is identical to your storage
> nodes, you could run a fio benchmark on a disk there and then compare the
> results to that of your suspect disk after setting your cluster to noout
> and stopping that particular OSD.
No spare though, but I have a rough idea of what it should be, which is what I’m
going on right now.
Right, so the cluster should be fine after I stop the OSD, right? I thought of
stopping it for a little while to see if the IO gets better from within the
VMs. Not sure how much effect that has though, since it may be waiting for the
IO to complete and whatnot.
> 
>> I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be
>> tough if the degradation is still there afterwards? i.e. if I set back
>> the weight would it move back all the PGs?
>> 
> Of course.
> 
> Until you can determine that a specific OSD/disk is the culprit, don't do
> that. 
> If you have the evidence, go ahead.
> 
Great, that’s what I thought as well.
> Regards,
> 
> Christian
> 
>> Regards,
>> Josef
>> 
>> On 06 Sep 2014, at 15:52, Josef Johansson  wrote:
>> 
>>> FWI I did restart the OSDs until I saw a server that made impact.
>>> Until that server stopped doing impact, I didn’t get lower in the
>>> number objects being degraded. After a while it was done with
>>> recovering that OSD and happily started with others. I guess I will be
>>> seeing the same behaviour when it gets to replicating the same PGs
>>> that were causing troubles the first time.
>>> 
>>> On 06 Sep 2014, at 15:04, Josef Johansson  wrote:
>>> 
 Actually, it only worked with restarting  for a period of time to get
 the recovering process going. Can’t get passed the 21k object mark.
 
 I’m uncertain if the disk really is messing this up right now as
 well. So I’m not glad to start moving 300k objects around.
 
 Regards,
 Josef
 
 On 06 Sep 2014, at 14:33, Josef Johansson  wrote:
 
> Hi,
> 
> On 06 Sep 2014, at 13:53, Christian Balzer  wrote:
> 
>> 
>> Hello,
>> 
>> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
>> 
>>> Also putting this on the list.
>>> 
>>> On 06 Sep 2014, at 13:36, Josef Johansson 
>>> wrote:
>>> 
 Hi,
 
 Same issues again, but I think we found the drive that causes the
 problems.
 
 But this is causing problems as it’s trying to do a recover to
 that osd at the moment.
 
 So we’re left with the status message 
 
 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs:
 6841 active+clean, 19 active+remapped+backfilling; 12299 GB data,
 36882 GB used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr,
 74op/s; 41424/15131923 degraded (0.274%);  recovering 0 o/s,
 2035KB/s
 
 
 It’s improving, but way too slowly. If I restart the recovery
 (ceph osd set no recovery /unset) it doesn’t change the osd what
 I can see.
 
 Any ideas?
 
>> I don't know the state of your cluster, i.e. what caused the
>> recovery to start (how many OSDs went down?).
> Performance degradation, databases are the worst impacted. It’s
> actually a OSD that we put in that’s causing it (removed it again
> though). So the cluster in itself is healthy.
> 
>> If you have a replication of 3 and only one OSD was involved, what
>> is stopping you from taking that wonky drive/OSD out?
>> 
> There’s data that goes missing if I do that, I guess I have to wait
> for the recovery process to complete before I can go any further,
> this is with rep 3.
>> If you don't know that or want to play it safe, how about setting
>> the weight of that OSD to 0? 
>> While that will AFAICT still result in all primary PGs to be
>> evacuated off it, no more writes will happen to it and reads might
>> be faster. In either case, it shouldn't slow down the rest of your
>> cluster anymore.
>> 
> That’s actually one idea I haven’t thought off, I wan’t to play it
> safe right now and hope that it goes up again, I actually found one
> wonky way of getting the recovery process from

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
Hi,

Just realised that it could also be a popularity bug combined with lots of
small traffic: seeing that it’s fast, it gets popular until it hits the curb.

I’m seeing this in the stats I think.

Linux 3.13-0.bpo.1-amd64 (osd1) 09/06/2014  _x86_64_(24 CPU)

09/06/2014 05:48:41 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.21    0.00    1.00    2.86    0.00   93.93

Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sdm        0.02    1.47   7.05   42.72   0.67   1.07    71.43     0.41   8.17    6.41    8.46   3.44  17.13
sdn        0.03    1.42   6.17   37.08   0.57   0.92    70.51     0.08   1.76    6.47    0.98   3.46  14.98
sdg        0.03    1.44   6.27   36.62   0.56   0.94    71.40     0.34   8.00    6.83    8.20   3.45  14.78
sde        0.03    1.23   6.47   39.07   0.59   0.98    70.29     0.43   9.47    6.57    9.95   3.37  15.33
sdf        0.02    1.26   6.47   33.77   0.61   0.87    75.30     0.22   5.39    6.00    5.27   3.52  14.17
sdl        0.03    1.44   6.44   40.54   0.59   1.08    72.68     0.21   4.49    6.56    4.16   3.40  15.95
sdk        0.03    1.41   5.62   35.92   0.52   0.90    70.10     0.15   3.58    6.17    3.17   3.45  14.32
sdj        0.03    1.26   6.30   34.23   0.57   0.83    70.84     0.31   7.65    6.56    7.85   3.48  14.10

Seeing that the drives are in pretty good shape but not doing many reads, I
would assume that I need to tweak the cache to swallow more IO.

When I tweaked it before production I did not see any performance gains whatsoever,
so the settings are still pretty low. And it’s odd, because we only started seeing
these problems a little while ago. So probably we have hit a limit where the disks
are getting a lot of IO.

I know that there’s some threads about this that I will read again.
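
In case it helps others searching the archives, these are the sort of [osd] settings
those threads revolve around; the values are purely illustrative, defaults and safe
ranges depend on the release and hardware, so test before touching production:

   [osd]
   filestore max sync interval = 10
   filestore queue max ops     = 500
   filestore queue max bytes   = 209715200
   journal max write entries   = 1000
   journal max write bytes     = 104857600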

Thanks for the hints in looking at bad drives.

Regards,
Josef

On 06 Sep 2014, at 17:41, Josef Johansson  wrote:

> Hi,
> 
> On 06 Sep 2014, at 17:27, Christian Balzer  wrote:
> 
>> 
>> Hello,
>> 
>> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
>> 
>>> We manage to go through the restore, but the performance degradation is
>>> still there.
>>> 
>> Manifesting itself how?
>> 
> Awful slow io on the VMs, and iowait, it’s about 2MB/s or so.
> But mostly a lot of iowait.
> 
>>> Looking through the OSDs to pinpoint a source of the degradation and
>>> hoping the current load will be lowered.
>>> 
>> 
>> You're the one looking at your cluster, the iostat, atop, iotop and
>> whatnot data.
>> If one particular OSD/disk stands out, investigate it, as per the "Good
>> way to monitor detailed latency/throughput" thread. 
>> 
> Will read it through.
>> If you have a spare and idle machine that is identical to your storage
>> nodes, you could run a fio benchmark on a disk there and then compare the
>> results to that of your suspect disk after setting your cluster to noout
>> and stopping that particular OSD.
> No spare though, but I have a rough idea what it should be, what’s I’m going 
> at right now.
> Right, so the cluster should be fine after I stop the OSD right? I though of 
> stopping it a little bit to see if the IO was better afterwards from within 
> the VMs. Not sure how good effect it makes though since it may be waiting for 
> the IO to complete what not.
>> 
>>> I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be
>>> tough if the degradation is still there afterwards? i.e. if I set back
>>> the weight would it move back all the PGs?
>>> 
>> Of course.
>> 
>> Until you can determine that a specific OSD/disk is the culprit, don't do
>> that. 
>> If you have the evidence, go ahead.
>> 
> Great, that’s what I though as well.
>> Regards,
>> 
>> Christian
>> 
>>> Regards,
>>> Josef
>>> 
>>> On 06 Sep 2014, at 15:52, Josef Johansson  wrote:
>>> 
 FWI I did restart the OSDs until I saw a server that made impact.
 Until that server stopped doing impact, I didn’t get lower in the
 number objects being degraded. After a while it was done with
 recovering that OSD and happily started with others. I guess I will be
 seeing the same behaviour when it gets to replicating the same PGs
 that were causing troubles the first time.
 
 On 06 Sep 2014, at 15:04, Josef Johansson  wrote:
 
> Actually, it only worked with restarting  for a period of time to get
> the recovering process going. Can’t get passed the 21k object mark.
> 
> I’m uncertain if the disk really is messing this up right now as
> well. So I’m not glad to start moving 300k objects around.
> 
> Regards,
> Josef
> 
> On 06 Sep 2014, at 14:33, Josef Johansson  wrote:
> 
>> Hi,
>> 
>> On 06 Sep 2014, at 13:53, Christian Balzer  wrote:
>> 
>>> 
>>> Hello,
>>> 
>>> On Sat

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Christian Balzer

Hello,

On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote:

> Hi,
> 
> On 06 Sep 2014, at 17:27, Christian Balzer  wrote:
> 
> > 
> > Hello,
> > 
> > On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
> > 
> >> We manage to go through the restore, but the performance degradation
> >> is still there.
> >> 
> > Manifesting itself how?
> > 
> Awful slow io on the VMs, and iowait, it’s about 2MB/s or so.
> But mostly a lot of iowait.
>
I was thinking about the storage nodes. ^^
As in, does a particular node or disk seem to be redlined all the time?
 
> >> Looking through the OSDs to pinpoint a source of the degradation and
> >> hoping the current load will be lowered.
> >> 
> > 
> > You're the one looking at your cluster, the iostat, atop, iotop and
> > whatnot data.
> > If one particular OSD/disk stands out, investigate it, as per the "Good
> > way to monitor detailed latency/throughput" thread. 
> > 
> Will read it through.
> > If you have a spare and idle machine that is identical to your storage
> > nodes, you could run a fio benchmark on a disk there and then compare
> > the results to that of your suspect disk after setting your cluster to
> > noout and stopping that particular OSD.
> No spare though, but I have a rough idea what it should be, what’s I’m
> going at right now. Right, so the cluster should be fine after I stop
> the OSD right? I though of stopping it a little bit to see if the IO was
> better afterwards from within the VMs. Not sure how good effect it makes
> though since it may be waiting for the IO to complete what not.
> > 
If you set your cluster to noout, as in "ceph osd set noout" per
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
before shutting down a particular ODS, no data migration will happen.

Of course you will want to shut it down as little as possible, so that
recovery traffic when it comes back is minimized. 
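
I.e. something along these lines, with osd.42 as a placeholder and init syntax
depending on the distro:

   ceph osd set noout
   service ceph stop osd.42     # watch VM iowait / client latency while it is down
   service ceph start osd.42
   ceph osd unset noout         # once the OSD is back up and has caught up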

Christian 

> >> I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be
> >> tough if the degradation is still there afterwards? i.e. if I set back
> >> the weight would it move back all the PGs?
> >> 
> > Of course.
> > 
> > Until you can determine that a specific OSD/disk is the culprit, don't
> > do that. 
> > If you have the evidence, go ahead.
> > 
> Great, that’s what I though as well.
> > Regards,
> > 
> > Christian
> > 
> >> Regards,
> >> Josef
> >> 
> >> On 06 Sep 2014, at 15:52, Josef Johansson  wrote:
> >> 
> >>> FWI I did restart the OSDs until I saw a server that made impact.
> >>> Until that server stopped doing impact, I didn’t get lower in the
> >>> number objects being degraded. After a while it was done with
> >>> recovering that OSD and happily started with others. I guess I will
> >>> be seeing the same behaviour when it gets to replicating the same PGs
> >>> that were causing troubles the first time.
> >>> 
> >>> On 06 Sep 2014, at 15:04, Josef Johansson  wrote:
> >>> 
>  Actually, it only worked with restarting  for a period of time to
>  get the recovering process going. Can’t get passed the 21k object
>  mark.
>  
>  I’m uncertain if the disk really is messing this up right now as
>  well. So I’m not glad to start moving 300k objects around.
>  
>  Regards,
>  Josef
>  
>  On 06 Sep 2014, at 14:33, Josef Johansson  wrote:
>  
> > Hi,
> > 
> > On 06 Sep 2014, at 13:53, Christian Balzer  wrote:
> > 
> >> 
> >> Hello,
> >> 
> >> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
> >> 
> >>> Also putting this on the list.
> >>> 
> >>> On 06 Sep 2014, at 13:36, Josef Johansson 
> >>> wrote:
> >>> 
>  Hi,
>  
>  Same issues again, but I think we found the drive that causes
>  the problems.
>  
>  But this is causing problems as it’s trying to do a recover to
>  that osd at the moment.
>  
>  So we’re left with the status message 
>  
>  2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860
>  pgs: 6841 active+clean, 19 active+remapped+backfilling; 12299
>  GB data, 36882 GB used, 142 TB / 178 TB avail; 1921KB/s rd,
>  192KB/s wr, 74op/s; 41424/15131923 degraded (0.274%);
>  recovering 0 o/s, 2035KB/s
>  
>  
>  It’s improving, but way too slowly. If I restart the recovery
>  (ceph osd set no recovery /unset) it doesn’t change the osd what
>  I can see.
>  
>  Any ideas?
>  
> >> I don't know the state of your cluster, i.e. what caused the
> >> recovery to start (how many OSDs went down?).
> > Performance degradation, databases are the worst impacted. It’s
> > actually a OSD that we put in that’s causing it (removed it again
> > though). So the cluster in itself is healthy.
> > 
> >> If you have a replication of 3 and only one OSD was involved, what
> >> is stopping you from

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Christian Balzer

Hello,

On Sat, 6 Sep 2014 17:52:59 +0200 Josef Johansson wrote:

> Hi,
> 
> Just realised that it could also be with a popularity bug as well and
> lots a small traffic. And seeing that it’s fast it gets popular until it
> hits the curb.
> 
I don't think I ever heard the term "popularity bug" before, care to
elaborate? 

> I’m seeing this in the stats I think.
> 
> Linux 3.13-0.bpo.1-amd64 (osd1)   09/06/2014
> _x86_64_  (24 CPU)
Any particular reason you're not running 3.14?

> 
> 09/06/2014 05:48:41 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            2.21    0.00    1.00    2.86    0.00   93.93
> 
> Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sdm        0.02    1.47   7.05   42.72   0.67   1.07    71.43     0.41   8.17    6.41    8.46   3.44  17.13
> sdn        0.03    1.42   6.17   37.08   0.57   0.92    70.51     0.08   1.76    6.47    0.98   3.46  14.98
> sdg        0.03    1.44   6.27   36.62   0.56   0.94    71.40     0.34   8.00    6.83    8.20   3.45  14.78
> sde        0.03    1.23   6.47   39.07   0.59   0.98    70.29     0.43   9.47    6.57    9.95   3.37  15.33
> sdf        0.02    1.26   6.47   33.77   0.61   0.87    75.30     0.22   5.39    6.00    5.27   3.52  14.17
> sdl        0.03    1.44   6.44   40.54   0.59   1.08    72.68     0.21   4.49    6.56    4.16   3.40  15.95
> sdk        0.03    1.41   5.62   35.92   0.52   0.90    70.10     0.15   3.58    6.17    3.17   3.45  14.32
> sdj        0.03    1.26   6.30   34.23   0.57   0.83    70.84     0.31   7.65    6.56    7.85   3.48  14.10
> 
> Seeing that the drives are in pretty good shape but not giving lotsa
> read, I would assume that I need to tweak the cache to swallow more IO.
>
That looks indeed fine, as in, none of these disks looks suspicious to me.
 
> When I tweaked it before production I did not see any performance gains
> what so ever, so they are pretty low. And it’s odd because we just saw
> these problems a little while ago. So probably that we hit a limit where
> the disks are getting lot of IO.
> 
> I know that there’s some threads about this that I will read again.
>
URL?
 
Christian

> Thanks for the hints in looking at bad drives.
> 
> Regards,
> Josef
> 
> On 06 Sep 2014, at 17:41, Josef Johansson  wrote:
> 
> > Hi,
> > 
> > On 06 Sep 2014, at 17:27, Christian Balzer  wrote:
> > 
> >> 
> >> Hello,
> >> 
> >> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
> >> 
> >>> We manage to go through the restore, but the performance degradation
> >>> is still there.
> >>> 
> >> Manifesting itself how?
> >> 
> > Awful slow io on the VMs, and iowait, it’s about 2MB/s or so.
> > But mostly a lot of iowait.
> > 
> >>> Looking through the OSDs to pinpoint a source of the degradation and
> >>> hoping the current load will be lowered.
> >>> 
> >> 
> >> You're the one looking at your cluster, the iostat, atop, iotop and
> >> whatnot data.
> >> If one particular OSD/disk stands out, investigate it, as per the
> >> "Good way to monitor detailed latency/throughput" thread. 
> >> 
> > Will read it through.
> >> If you have a spare and idle machine that is identical to your storage
> >> nodes, you could run a fio benchmark on a disk there and then compare
> >> the results to that of your suspect disk after setting your cluster
> >> to noout and stopping that particular OSD.
> > No spare though, but I have a rough idea what it should be, what’s I’m
> > going at right now. Right, so the cluster should be fine after I stop
> > the OSD right? I though of stopping it a little bit to see if the IO
> > was better afterwards from within the VMs. Not sure how good effect it
> > makes though since it may be waiting for the IO to complete what not.
> >> 
> >>> I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be
> >>> tough if the degradation is still there afterwards? i.e. if I set
> >>> back the weight would it move back all the PGs?
> >>> 
> >> Of course.
> >> 
> >> Until you can determine that a specific OSD/disk is the culprit,
> >> don't do that. 
> >> If you have the evidence, go ahead.
> >> 
> > Great, that’s what I though as well.
> >> Regards,
> >> 
> >> Christian
> >> 
> >>> Regards,
> >>> Josef
> >>> 
> >>> On 06 Sep 2014, at 15:52, Josef Johansson  wrote:
> >>> 
>  FWI I did restart the OSDs until I saw a server that made impact.
>  Until that server stopped doing impact, I didn’t get lower in the
>  number objects being degraded. After a while it was done with
>  recovering that OSD and happily started with others. I guess I will
>  be seeing the same behaviour when it gets to replicating the same
>  PGs that were causing troubles the first time.
>  
>  On 06 Sep 2014, at 15:04, Josef Johansson  wrote:
>  
> > Actually, it only worked with restarting  for a period of time to
> > get the recoveri

Re: [ceph-users] SSD journal deployment experiences

2014-09-06 Thread Scott Laird
Backing up slightly, have you considered RAID 5 over your SSDs?
 Practically speaking, there's no performance downside to RAID 5 when your
devices aren't IOPS-bound.

On Sat Sep 06 2014 at 8:37:56 AM Christian Balzer  wrote:

> On Sat, 6 Sep 2014 14:50:20 + Dan van der Ster wrote:
>
> > September 6 2014 4:01 PM, "Christian Balzer"  wrote:
> > > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote:
> > >
> > >> Hi Christian,
> > >>
> > >> Let's keep debating until a dev corrects us ;)
> > >
> > > For the time being, I give the recent:
> > >
> > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html
> > >
> > > And not so recent:
> > > http://www.spinics.net/lists/ceph-users/msg04152.html
> > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
> > >
> > > And I'm not going to use BTRFS for mainly RBD backed VM images
> > > (fragmentation city), never mind the other stability issues that crop
> > > up here ever so often.
> >
> >
> > Thanks for the links... So until I learn otherwise, I better assume the
> > OSD is lost when the journal fails. Even though I haven't understood
> > exactly why :( I'm going to UTSL to understand the consistency better.
> > An op state diagram would help, but I didn't find one yet.
> >
> Using the source as an option of last resort is always nice, having to
> actually do so for something like this feels a bit lacking in the
> documentation department (that or my google foo being weak). ^o^
>
> > BTW, do you happen to know, _if_ we re-use an OSD after the journal has
> > failed, are any object inconsistencies going to be found by a
> > scrub/deep-scrub?
> >
> No idea.
> And really a scenario I hope to never encounter. ^^;;
>
> > >>
> > >> We have 4 servers in a 3U rack, then each of those servers is
> > >> connected to one of these enclosures with a single SAS cable.
> > >>
> >  With the current config, when I dd to all drives in parallel I can
> >  write at 24*74MB/s = 1776MB/s.
> > >>>
> > >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0
> > >>> lanes, so as far as that bus goes, it can do 4GB/s.
> > >>> And given your storage pod I assume it is connected with 2 mini-SAS
> > >>> cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA
> > >>> bandwidth.
> > >>
> > >> From above, we are only using 4 lanes -- so around 2GB/s is expected.
> > >
> > > Alright, that explains that then. Any reason for not using both ports?
> > >
> >
> > Probably to minimize costs, and since the single 10Gig-E is a bottleneck
> > anyway. The whole thing is suboptimal anyway, since this hardware was
> > not purchased for Ceph to begin with. Hence retrofitting SSDs, etc...
> >
> The single 10Gb/s link is the bottleneck for sustained stuff, but when
> looking at spikes...
> Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port
> might also get some loving. ^o^
>
> The cluster I'm currently building is based on storage nodes with 4 SSDs
> (100GB DC 3700s, so 800MB/s would be the absolute write speed limit) and 8
> HDDs. Connected with 40Gb/s Infiniband. Dual port, dual switch for
> redundancy, not speed. ^^
>
> > >>> Impressive, even given your huge cluster with 1128 OSDs.
> > >>> However that's not really answering my question, how much data is on
> > >>> an average OSD and thus gets backfilled in that hour?
> > >>
> > >> That's true -- our drives have around 300TB on them. So I guess it
> > >> will take longer - 3x longer - when the drives are 1TB full.
> > >
> > > On your slides, when the crazy user filled the cluster with 250 million
> > > objects and thus 1PB of data, I recall seeing a 7 hour backfill time?
> > >
> >
> > Yeah that was fun :) It was 250 million (mostly) 4k objects, so not
> > close to 1PB. The point was that to fill the cluster with RBD, we'd need
> > 250 million (4MB) objects. So, object-count-wise this was a full
> > cluster, but for the real volume it was more like 70TB IIRC (there were
> > some other larger objects too).
> >
> Ah, I see. ^^
>
> > In that case, the backfilling was CPU-bound, or perhaps
> > wbthrottle-bound, I don't remember... It was just that there were many
> > tiny tiny objects to synchronize.
> >
> Indeed. This is something me and others have seen as well, as in
> backfilling being much slower than the underlying HW would permit and
> being CPU intensive.
>
> > > Anyway, I guess the lesson to take away from this is that size and
> > > parallelism does indeed help, but even in a cluster like yours
> > > recovering from a 2TB loss would likely be in the 10 hour range...
> >
> > Bigger clusters probably backfill faster simply because there are more
> > OSDs involved in the backfilling. In our cluster we initially get 30-40
> > backfills in parallel after 1 OSD fails. That's even with max backfills
> > = 1. The backfilling sorta follows an 80/20 rule -- 80% of the time is
> > spent backfilling the last 20% of the PGs, just because some OSDs
> > randomly get more new PGs than

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
Hi,

On 06 Sep 2014, at 18:05, Christian Balzer  wrote:

> 
> Hello,
> 
> On Sat, 6 Sep 2014 17:52:59 +0200 Josef Johansson wrote:
> 
>> Hi,
>> 
>> Just realised that it could also be with a popularity bug as well and
>> lots a small traffic. And seeing that it’s fast it gets popular until it
>> hits the curb.
>> 
> I don't think I ever heard the term "popularity bug" before, care to
> elaborate? 
I did! :D When you start out fine with great numbers, people like it and 
suddenly it’s not so fast anymore, and when you hit the magic number it starts 
to be trouble.
> 
>> I’m seeing this in the stats I think.
>> 
>> Linux 3.13-0.bpo.1-amd64 (osd1)  09/06/2014
>> _x86_64_ (24 CPU)
> Any particular reason you're not running 3.14?
No, just that we don’t have that much time on our hands.
>> 
>> 09/06/2014 05:48:41 PM
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>            2.21    0.00    1.00    2.86    0.00   93.93
>> 
>> Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>> sdm        0.02    1.47   7.05   42.72   0.67   1.07    71.43     0.41   8.17    6.41    8.46   3.44  17.13
>> sdn        0.03    1.42   6.17   37.08   0.57   0.92    70.51     0.08   1.76    6.47    0.98   3.46  14.98
>> sdg        0.03    1.44   6.27   36.62   0.56   0.94    71.40     0.34   8.00    6.83    8.20   3.45  14.78
>> sde        0.03    1.23   6.47   39.07   0.59   0.98    70.29     0.43   9.47    6.57    9.95   3.37  15.33
>> sdf        0.02    1.26   6.47   33.77   0.61   0.87    75.30     0.22   5.39    6.00    5.27   3.52  14.17
>> sdl        0.03    1.44   6.44   40.54   0.59   1.08    72.68     0.21   4.49    6.56    4.16   3.40  15.95
>> sdk        0.03    1.41   5.62   35.92   0.52   0.90    70.10     0.15   3.58    6.17    3.17   3.45  14.32
>> sdj        0.03    1.26   6.30   34.23   0.57   0.83    70.84     0.31   7.65    6.56    7.85   3.48  14.10
>> 
>> Seeing that the drives are in pretty good shape but not giving lotsa
>> read, I would assume that I need to tweak the cache to swallow more IO.
>> 
> That looks indeed fine, as in, none of these disks looks suspicious to me.
> 
>> When I tweaked it before production I did not see any performance gains
>> what so ever, so they are pretty low. And it’s odd because we just saw
>> these problems a little while ago. So probably that we hit a limit where
>> the disks are getting lot of IO.
>> 
>> I know that there’s some threads about this that I will read again.
>> 
> URL?
> 
Uhm, I think you’re involved in most of them. I'll post what I do and from 
where.
> Christian
> 
>> Thanks for the hints in looking at bad drives.
>> 
>> Regards,
>> Josef
>> 
>> On 06 Sep 2014, at 17:41, Josef Johansson  wrote:
>> 
>>> Hi,
>>> 
>>> On 06 Sep 2014, at 17:27, Christian Balzer  wrote:
>>> 
 
 Hello,
 
 On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
 
> We manage to go through the restore, but the performance degradation
> is still there.
> 
 Manifesting itself how?
 
>>> Awful slow io on the VMs, and iowait, it’s about 2MB/s or so.
>>> But mostly a lot of iowait.
>>> 
> Looking through the OSDs to pinpoint a source of the degradation and
> hoping the current load will be lowered.
> 
 
 You're the one looking at your cluster, the iostat, atop, iotop and
 whatnot data.
 If one particular OSD/disk stands out, investigate it, as per the
 "Good way to monitor detailed latency/throughput" thread. 
 
>>> Will read it through.
 If you have a spare and idle machine that is identical to your storage
 nodes, you could run a fio benchmark on a disk there and then compare
 the results to that of your suspect disk after setting your cluster
 to noout and stopping that particular OSD.
>>> No spare though, but I have a rough idea what it should be, what’s I’m
>>> going at right now. Right, so the cluster should be fine after I stop
>>> the OSD right? I though of stopping it a little bit to see if the IO
>>> was better afterwards from within the VMs. Not sure how good effect it
>>> makes though since it may be waiting for the IO to complete what not.
 
> I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be
> tough if the degradation is still there afterwards? i.e. if I set
> back the weight would it move back all the PGs?
> 
 Of course.
 
 Until you can determine that a specific OSD/disk is the culprit,
 don't do that. 
 If you have the evidence, go ahead.
 
>>> Great, that’s what I though as well.
 Regards,
 
 Christian
 
> Regards,
> Josef
> 
> On 06 Sep 2014, at 15:52, Josef Johansson  wrote:
> 
>> FWI I did restart the OSDs until I saw a server that made impact.
>> Until that server stopped doing impact, I didn’t get lower in the
>

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
Hi,

On 06 Sep 2014, at 17:59, Christian Balzer  wrote:

> 
> Hello,
> 
> On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote:
> 
>> Hi,
>> 
>> On 06 Sep 2014, at 17:27, Christian Balzer  wrote:
>> 
>>> 
>>> Hello,
>>> 
>>> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
>>> 
 We manage to go through the restore, but the performance degradation
 is still there.
 
>>> Manifesting itself how?
>>> 
>> Awful slow io on the VMs, and iowait, it’s about 2MB/s or so.
>> But mostly a lot of iowait.
>> 
> I was thinking about the storage nodes. ^^
> As in, does a particular node or disk seem to be redlined all the time?
They’re idle, with little io wait.
> 
 Looking through the OSDs to pinpoint a source of the degradation and
 hoping the current load will be lowered.
 
>>> 
>>> You're the one looking at your cluster, the iostat, atop, iotop and
>>> whatnot data.
>>> If one particular OSD/disk stands out, investigate it, as per the "Good
>>> way to monitor detailed latency/throughput" thread. 
>>> 
>> Will read it through.
>>> If you have a spare and idle machine that is identical to your storage
>>> nodes, you could run a fio benchmark on a disk there and then compare
>>> the results to that of your suspect disk after setting your cluster to
>>> noout and stopping that particular OSD.
>> No spare though, but I have a rough idea what it should be, what’s I’m
>> going at right now. Right, so the cluster should be fine after I stop
>> the OSD right? I though of stopping it a little bit to see if the IO was
>> better afterwards from within the VMs. Not sure how good effect it makes
>> though since it may be waiting for the IO to complete what not.
>>> 
> If you set your cluster to noout, as in "ceph osd set noout" per
> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
> before shutting down a particular ODS, no data migration will happen.
> 
> Of course you will want to shut it down as little as possible, so that
> recovery traffic when it comes back is minimized. 
> 
Good, yes will do this.
Regards,
Josef
> Christian 
> 
 I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be
 tough if the degradation is still there afterwards? i.e. if I set back
 the weight would it move back all the PGs?
 
>>> Of course.
>>> 
>>> Until you can determine that a specific OSD/disk is the culprit, don't
>>> do that. 
>>> If you have the evidence, go ahead.
>>> 
>> Great, that’s what I though as well.
>>> Regards,
>>> 
>>> Christian
>>> 
 Regards,
 Josef
 
 On 06 Sep 2014, at 15:52, Josef Johansson  wrote:
 
> FWI I did restart the OSDs until I saw a server that made impact.
> Until that server stopped doing impact, I didn’t get lower in the
> number objects being degraded. After a while it was done with
> recovering that OSD and happily started with others. I guess I will
> be seeing the same behaviour when it gets to replicating the same PGs
> that were causing troubles the first time.
> 
> On 06 Sep 2014, at 15:04, Josef Johansson  wrote:
> 
>> Actually, it only worked with restarting  for a period of time to
>> get the recovering process going. Can’t get passed the 21k object
>> mark.
>> 
>> I’m uncertain if the disk really is messing this up right now as
>> well. So I’m not glad to start moving 300k objects around.
>> 
>> Regards,
>> Josef
>> 
>> On 06 Sep 2014, at 14:33, Josef Johansson  wrote:
>> 
>>> Hi,
>>> 
>>> On 06 Sep 2014, at 13:53, Christian Balzer  wrote:
>>> 
 
 Hello,
 
 On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
 
> Also putting this on the list.
> 
> On 06 Sep 2014, at 13:36, Josef Johansson 
> wrote:
> 
>> Hi,
>> 
>> Same issues again, but I think we found the drive that causes
>> the problems.
>> 
>> But this is causing problems as it’s trying to do a recover to
>> that osd at the moment.
>> 
>> So we’re left with the status message 
>> 
>> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860
>> pgs: 6841 active+clean, 19 active+remapped+backfilling; 12299
>> GB data, 36882 GB used, 142 TB / 178 TB avail; 1921KB/s rd,
>> 192KB/s wr, 74op/s; 41424/15131923 degraded (0.274%);
>> recovering 0 o/s, 2035KB/s
>> 
>> 
>> It’s improving, but way too slowly. If I restart the recovery
>> (ceph osd set no recovery /unset) it doesn’t change the osd what
>> I can see.
>> 
>> Any ideas?
>> 
 I don't know the state of your cluster, i.e. what caused the
 recovery to start (how many OSDs went down?).
>>> Performance degradation, databases are the worst impacted. It’s
>>> actually a OSD that we put in that’s causing it (remove

Re: [ceph-users] SSD journal deployment experiences

2014-09-06 Thread Christian Balzer
On Sat, 06 Sep 2014 16:06:56 + Scott Laird wrote:

> Backing up slightly, have you considered RAID 5 over your SSDs?
>  Practically speaking, there's no performance downside to RAID 5 when
> your devices aren't IOPS-bound.
> 

Well...
For starters, with RAID5 you would lose 25% throughput in both Dan's and
my case (4 SSDs) compared to JBOD SSD journals.
In Dan's case that might not matter due to other bottlenecks, in my case
it certainly would.

And while you're quite correct when it comes to IOPS, doing RAID5 will
either consume significant CPU resource in a software RAID case or require
a decent HW RAID controller. 
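
Rough numbers for 4 SSDs, each writing at W MB/s:

   JBOD journals : 4 x W of aggregate write bandwidth
   RAID5 (3+1)   : 3 x W, i.e. the ~25% loss above, plus every small write turning
                   into a read-modify-write of data and parity (roughly 4 device I/Os),
                   which is where the CPU or controller cost comes from.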

Christian

> On Sat Sep 06 2014 at 8:37:56 AM Christian Balzer  wrote:
> 
> > On Sat, 6 Sep 2014 14:50:20 + Dan van der Ster wrote:
> >
> > > September 6 2014 4:01 PM, "Christian Balzer"  wrote:
> > > > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote:
> > > >
> > > >> Hi Christian,
> > > >>
> > > >> Let's keep debating until a dev corrects us ;)
> > > >
> > > > For the time being, I give the recent:
> > > >
> > > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html
> > > >
> > > > And not so recent:
> > > > http://www.spinics.net/lists/ceph-users/msg04152.html
> > > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
> > > >
> > > > And I'm not going to use BTRFS for mainly RBD backed VM images
> > > > (fragmentation city), never mind the other stability issues that
> > > > crop up here ever so often.
> > >
> > >
> > > Thanks for the links... So until I learn otherwise, I better assume
> > > the OSD is lost when the journal fails. Even though I haven't
> > > understood exactly why :( I'm going to UTSL to understand the
> > > consistency better. An op state diagram would help, but I didn't
> > > find one yet.
> > >
> > Using the source as an option of last resort is always nice, having to
> > actually do so for something like this feels a bit lacking in the
> > documentation department (that or my google foo being weak). ^o^
> >
> > > BTW, do you happen to know, _if_ we re-use an OSD after the journal
> > > has failed, are any object inconsistencies going to be found by a
> > > scrub/deep-scrub?
> > >
> > No idea.
> > And really a scenario I hope to never encounter. ^^;;
> >
> > > >>
> > > >> We have 4 servers in a 3U rack, then each of those servers is
> > > >> connected to one of these enclosures with a single SAS cable.
> > > >>
> > >  With the current config, when I dd to all drives in parallel I
> > >  can write at 24*74MB/s = 1776MB/s.
> > > >>>
> > > >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe
> > > >>> 2.0 lanes, so as far as that bus goes, it can do 4GB/s.
> > > >>> And given your storage pod I assume it is connected with 2
> > > >>> mini-SAS cables, 4 lanes each at 6Gb/s, making for 4x6x2 =
> > > >>> 48Gb/s SATA bandwidth.
> > > >>
> > > >> From above, we are only using 4 lanes -- so around 2GB/s is
> > > >> expected.
> > > >
> > > > Alright, that explains that then. Any reason for not using both
> > > > ports?
> > > >
> > >
> > > Probably to minimize costs, and since the single 10Gig-E is a
> > > bottleneck anyway. The whole thing is suboptimal anyway, since this
> > > hardware was not purchased for Ceph to begin with. Hence
> > > retrofitting SSDs, etc...
> > >
> > The single 10Gb/s link is the bottleneck for sustained stuff, but when
> > looking at spikes...
> > Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port
> > might also get some loving. ^o^
> >
> > The cluster I'm currently building is based on storage nodes with 4
> > SSDs (100GB DC 3700s, so 800MB/s would be the absolute write speed
> > limit) and 8 HDDs. Connected with 40Gb/s Infiniband. Dual port, dual
> > switch for redundancy, not speed. ^^
> >
> > > >>> Impressive, even given your huge cluster with 1128 OSDs.
> > > >>> However that's not really answering my question, how much data
> > > >>> is on an average OSD and thus gets backfilled in that hour?
> > > >>
> > > >> That's true -- our drives have around 300TB on them. So I guess it
> > > >> will take longer - 3x longer - when the drives are 1TB full.
> > > >
> > > > On your slides, when the crazy user filled the cluster with 250
> > > > million objects and thus 1PB of data, I recall seeing a 7 hour
> > > > backfill time?
> > > >
> > >
> > > Yeah that was fun :) It was 250 million (mostly) 4k objects, so not
> > > close to 1PB. The point was that to fill the cluster with RBD, we'd
> > > need 250 million (4MB) objects. So, object-count-wise this was a full
> > > cluster, but for the real volume it was more like 70TB IIRC (there
> > > were some other larger objects too).
> > >
> > Ah, I see. ^^
> >
> > > In that case, the backfilling was CPU-bound, or perhaps
> > > wbthrottle-bound, I don't remember... It was just that there were
> > > many tiny tiny objects to synchronize.
> > >
> > Indeed. This is something me and others have seen as well, as in
> > backfilling b

Re: [ceph-users] SSD journal deployment experiences

2014-09-06 Thread Dan Van Der Ster
RAID5... Hadn't considered it due to the IOPS penalty (it would get 1/4th of 
the IOPS of separated journal devices, according to some online raid calc). 
Compared to RAID10, I guess we'd get 50% more capacity, but lower performance.

After the anecdotes that the DCS3700 is very rarely failing, and without a 
stable bcache to build upon, I'm leaning toward the usual 5 journal partitions 
per SSD. But that will leave at least 100GB free per drive, so I might try 
running an OSD there.
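
For the record, carving up each SSD would look something like this (device name and journal size are just placeholders):

# hypothetical: 5 x 10GB journal partitions on one SSD, rest left free
for i in 1 2 3 4 5; do
    sgdisk --new=${i}:0:+10G --change-name=${i}:"ceph journal ${i}" /dev/sdX
done

and then each OSD gets its "osd journal" pointed at one of those partitions.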

Cheers, Dan

On Sep 6, 2014 6:07 PM, Scott Laird  wrote:
Backing up slightly, have you considered RAID 5 over your SSDs?  Practically 
speaking, there's no performance downside to RAID 5 when your devices aren't 
IOPS-bound.

On Sat Sep 06 2014 at 8:37:56 AM Christian Balzer  wrote:
On Sat, 6 Sep 2014 14:50:20 + Dan van der Ster wrote:

> September 6 2014 4:01 PM, "Christian Balzer"  wrote:
> > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote:
> >
> >> Hi Christian,
> >>
> >> Let's keep debating until a dev corrects us ;)
> >
> > For the time being, I give the recent:
> >
> > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html
> >
> > And not so recent:
> > http://www.spinics.net/lists/ceph-users/msg04152.html
> > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
> >
> > And I'm not going to use BTRFS for mainly RBD backed VM images
> > (fragmentation city), never mind the other stability issues that crop
> > up here ever so often.
>
>
> Thanks for the links... So until I learn otherwise, I better assume the
> OSD is lost when the journal fails. Even though I haven't understood
> exactly why :( I'm going to UTSL to understand the consistency better.
> An op state diagram would help, but I didn't find one yet.
>
Using the source as an option of last resort is always nice, having to
actually do so for something like this feels a bit lacking in the
documentation department (that or my google foo being weak). ^o^

> BTW, do you happen to know, _if_ we re-use an OSD after the journal has
> failed, are any object inconsistencies going to be found by a
> scrub/deep-scrub?
>
No idea.
And really a scenario I hope to never encounter. ^^;;

> >>
> >> We have 4 servers in a 3U rack, then each of those servers is
> >> connected to one of these enclosures with a single SAS cable.
> >>
>  With the current config, when I dd to all drives in parallel I can
>  write at 24*74MB/s = 1776MB/s.
> >>>
> >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0
> >>> lanes, so as far as that bus goes, it can do 4GB/s.
> >>> And given your storage pod I assume it is connected with 2 mini-SAS
> >>> cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA
> >>> bandwidth.
> >>
> >> From above, we are only using 4 lanes -- so around 2GB/s is expected.
> >
> > Alright, that explains that then. Any reason for not using both ports?
> >
>
> Probably to minimize costs, and since the single 10Gig-E is a bottleneck
> anyway. The whole thing is suboptimal anyway, since this hardware was
> not purchased for Ceph to begin with. Hence retrofitting SSDs, etc...
>
The single 10Gb/s link is the bottleneck for sustained stuff, but when
looking at spikes...
Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port
might also get some loving. ^o^

The cluster I'm currently building is based on storage nodes with 4 SSDs
(100GB DC 3700s, so 800MB/s would be the absolute write speed limit) and 8
HDDs. Connected with 40Gb/s Infiniband. Dual port, dual switch for
redundancy, not speed. ^^

> >>> Impressive, even given your huge cluster with 1128 OSDs.
> >>> However that's not really answering my question, how much data is on
> >>> an average OSD and thus gets backfilled in that hour?
> >>
> >> That's true -- our drives have around 300TB on them. So I guess it
> >> will take longer - 3x longer - when the drives are 1TB full.
> >
> > On your slides, when the crazy user filled the cluster with 250 million
> > objects and thus 1PB of data, I recall seeing a 7 hour backfill time?
> >
>
> Yeah that was fun :) It was 250 million (mostly) 4k objects, so not
> close to 1PB. The point was that to fill the cluster with RBD, we'd need
> 250 million (4MB) objects. So, object-count-wise this was a full
> cluster, but for the real volume it was more like 70TB IIRC (there were
> some other larger objects too).
>
Ah, I see. ^^

> In that case, the backfilling was CPU-bound, or perhaps
> wbthrottle-bound, I don't remember... It was just that there were many
> tiny tiny objects to synchronize.
>
Indeed. This is something me and others have seen as well, as in
backfilling being much slower than the underlying HW would permit and
being CPU intensive.

> > Anyway, I guess the lesson to take away from this is that size and
> > parallelism does indeed help, but even in a cluster like yours
> > recovering from a 2TB loss would likely be in the 10 hour range...
>
> Big

Re: [ceph-users] ceph osd unexpected error

2014-09-06 Thread Somnath Roy
Have you set the open file descriptor limit on the OSD node?
Try setting it like 'ulimit -n 65536'
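
Something along these lines (just a sketch, values are examples):

# what does the running daemon actually have?
cat /proc/$(pidof -s ceph-osd)/limits | grep -i 'open files'

# one-off, in the shell that (re)starts the OSD:
ulimit -n 65536

# or persistently, e.g. in /etc/security/limits.conf:
#   root  soft  nofile  65536
#   root  hard  nofile  65536

IIRC the init script also honours a "max open files" setting in ceph.conf.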

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Haomai Wang
Sent: Saturday, September 06, 2014 7:44 AM
To: 廖建锋
Cc: ceph-users; ceph-devel
Subject: Re: [ceph-users] ceph osd unexpected error

Hi,

Could you give some more detail infos such as operation before occur errors?

And what's your ceph version?

On Fri, Sep 5, 2014 at 3:16 PM, 廖建锋  wrote:
> Dear CEPH ,
> Urgent question, I met a "FAILED assert(0 == "unexpected error")"
> yesterday , Now i have not way to start this OSDS I have attached my
> logs in the attachment, and some  ceph configurations  as below
>
>
> osd_pool_default_pgp_num = 300
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 300
> mon_host = 10.1.0.213,10.1.0.214
> osd_crush_chooseleaf_type = 1
> mds_cache_size = 50
> osd objectstore = keyvaluestore-dev
>
>
>
> Detailed error information :
>
>
>-13> 2014-09-05 15:07:35.279863 7f4d988b9700 2 waiting 51 > 50 ops
> ||
> 11642907 > 104857600
> -12> 2014-09-05 15:07:35.279899 7f4d978b7700 2 waiting 51 > 50 ops ||
> 11642899 > 104857600
> -11> 2014-09-05 15:07:35.279919 7f4d990ba700 2 waiting 51 > 50 ops ||
> 11642901 > 104857600
> -10> 2014-09-05 15:07:35.326803 7f4d9a8bd700 10 monclient: tick
> -9> 2014-09-05 15:07:35.326837 7f4d9a8bd700 10 monclient:
> _check_auth_rotating have uptodate secrets (they expire after
> 2014-09-05
> 15:07:05.326835)
> -8> 2014-09-05 15:07:35.326871 7f4d9a8bd700 10 monclient: renew subs? (now:
> 2014-09-05 15:07:35.326871; renew after: 2014-09-05 15:10:02.464341)
> -- no
> -7> 2014-09-05 15:07:35.343657 7f4d978b7700 2 waiting 51 > 50 ops ||
> 11044551 > 104857600
> -6> 2014-09-05 15:07:35.343654 7f4e1ee72700 1 -- 10.1.0.221:6801/4013
> -6> -->
> osd.12 10.1.0.219:6810/32654 -- pg_info(1 pgs e1267:0.f1) v4 -- ?+0
> 0x18dcf000
> -5> 2014-09-05 15:07:35.343680 7f4d990ba700 2 waiting 51 > 50 ops ||
> 11044553 > 104857600
> -4> 2014-09-05 15:07:35.343686 7f4d988b9700 2 waiting 51 > 50 ops ||
> 11044579 > 104857600
> -3> 2014-09-05 15:07:35.344875 7f4e1fe74700 0 error (22) Invalid
> -3> argument
> not handled on operation 9 (336.0.3, or op 3, counting from 0)
> -2> 2014-09-05 15:07:35.344902 7f4e1fe74700 0 unexpected error code
> -1> 2014-09-05 15:07:35.344903 7f4e1fe74700 0 transaction dump:
> { "ops": [
> { "op_num": 0,
> "op_name": "remove",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0"},
> { "op_num": 1,
> "op_name": "mkcoll",
> "collection": "0.a9_TEMP"},
> { "op_num": 2,
> "op_name": "remove",
> "collection": "0.a9_TEMP",
> "oid": "4b0fea9\/153b885.\/head\/\/0"},
> { "op_num": 3,
> "op_name": "touch",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0"},
> { "op_num": 4,
> "op_name": "omap_setheader",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0",
> "header_length": "0"},
> { "op_num": 5,
> "op_name": "write",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0",
> "length": 1160,
> "offset": 0,
> "bufferlist length": 1160},
> { "op_num": 6,
> "op_name": "omap_setkeys",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0",
> "attr_lens": {}},
> { "op_num": 7,
> "op_name": "setattrs",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0",
> "attr_lens": { "_": 239,
> "_parent": 250,
> "snapset": 31}},
> { "op_num": 8,
> "op_name": "omap_setkeys",
> "collection": "meta",
> "oid": "16ef7597\/infos\/head\/\/-1",
> "attr_lens": { "0.a9_epoch": 4,
> "0.a9_info": 684}},
> { "op_num": 9,
> "op_name": "remove",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0"},
> { "op_num": 10,
> "op_name": "remove",
> "collection": "0.a9_TEMP",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0"},
> { "op_num": 11,
> "op_name": "touch",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0"},
> { "op_num": 12,
> "op_name": "omap_setheader",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0",
> "header_length": "0"},
> { "op_num": 13,
> "op_name": "write",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0",
> "length": 507284,
> "offset": 0,
> "bufferlist length": 507284},
> { "op_num": 14,
> "op_name": "omap_setkeys",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0",
> "attr_lens": {}},
> { "op_num": 15,
> "op_name": "setattrs",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0",
> "attr_lens": { "_": 239,
> "snapset": 31}},
> { "op_num": 16,
> "op_name": "omap_setkeys",
> "collection": "meta",
> "oid": "16ef7597\/infos\/head\/\/-1",
> "attr_lens": { "0.a9_epoch": 4,
> "0.a9_info": 684}},
> { "op_num": 17,
> "op_name": "remove",
> "col

Re: [ceph-users] resizing the OSD

2014-09-06 Thread JIten Shah
Thanks Christian.  Replies inline.
On Sep 6, 2014, at 8:04 AM, Christian Balzer  wrote:

> 
> Hello,
> 
> On Fri, 05 Sep 2014 15:31:01 -0700 JIten Shah wrote:
> 
>> Hello Cephers,
>> 
>> We created a ceph cluster with 100 OSD, 5 MON and 1 MSD and most of the
>> stuff seems to be working fine but we are seeing some degrading on the
>> osd's due to lack of space on the osd's. 
> 
> Please elaborate on that degradation.

The degradation happened on a few OSDs because they got filled up quickly. They 
were not the same size as the other OSDs. Now I want to remove these OSDs 
and re-add them with the correct size to match the others.
> 
>> Is there a way to resize the
>> OSD without bringing the cluster down?
>> 
> 
> Define both "resize" and "cluster down".

Basically I want to remove the OSDs with the incorrect size and re-add them with 
a size matching the other OSDs. 
> 
> As in, resizing how? 
> Are your current OSDs on disks/LVMs that are not fully used and thus could
> be grown?
> What is the size of your current OSDs?

The current OSDs are 20GB each, and we do have unused space on the disk, 
so we can make the LVM bigger and increase the size of the OSDs. I agree 
that we need to have all the disks the same size and I am working towards 
that. Thanks.
> 
> The normal way of growing a cluster is to add more OSDs.
> Preferably of the same size and same performance disks.
> This will not only simplify things immensely but also make them a lot more
> predictable.
> This of course depends on your use case and usage patterns, but often when
> running out of space you're also running out of other resources like CPU,
> memory or IOPS of the disks involved. So adding more instead of growing
> them is most likely the way forward.
> 
> If you were to replace actual disks with larger ones, take them (the OSDs)
> out one at a time and re-add it. If you're using ceph-deploy, it will use
> the disk size as basic weight, if you're doing things manually make sure
> to specify that size/weight accordingly.
> Again, you do want to do this for all disks to keep things uniform.
> 
> If your cluster (pools really) are set to a replica size of at least 2
> (risky!) or 3 (as per Firefly default), taking a single OSD out would of
> course never bring the cluster down.
> However taking an OSD out and/or adding a new one will cause data movement
> that might impact your cluster's performance.
> 

We have a current replica size of 2 with 100 OSDs. How many can I lose 
without affecting performance? I understand the impact of data movement.

--Jiten





> Regards,
> 
> Christian
> -- 
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Fusion Communications
> http://www.gol.com/



Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
Hi,

Unfortunately the journal tuning did not do much. That’s odd, because I don’t 
see much utilisation on the OSDs themselves. So this points to a network issue 
between the OSDs, right?

On 06 Sep 2014, at 18:17, Josef Johansson  wrote:

> Hi,
> 
> On 06 Sep 2014, at 17:59, Christian Balzer  wrote:
> 
>> 
>> Hello,
>> 
>> On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote:
>> 
>>> Hi,
>>> 
>>> On 06 Sep 2014, at 17:27, Christian Balzer  wrote:
>>> 
 
 Hello,
 
 On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
 
> We manage to go through the restore, but the performance degradation
> is still there.
> 
 Manifesting itself how?
 
>>> Awful slow io on the VMs, and iowait, it’s about 2MB/s or so.
>>> But mostly a lot of iowait.
>>> 
>> I was thinking about the storage nodes. ^^
>> As in, does a particular node or disk seem to be redlined all the time?
> They’re idle, with little io wait.
It also shows itself, as earlier, with slow requests now and then.

Like this:
2014-09-06 19:13:28.469533 osd.25 10.168.7.23:6827/11423 362 : [WRN] slow 
request 31.554785 seconds old, received at 2014-09-06 19:12:56.914688: 
osd_op(client.12483520.0:12211087 rbd_data.4b8e9b3d1b58ba.1222 
[stat,write 3813376~4096] 3.3bfab9da e15861) v4 currently waiting for subops 
from [13,2]
2014-09-06 19:13:28.469536 osd.25 10.168.7.23:6827/11423 363 : [WRN] slow 
request 31.554736 seconds old, received at 2014-09-06 19:12:56.914737: 
osd_op(client.12483520.0:12211088 rbd_data.4b8e9b3d1b58ba.1222 
[stat,write 3842048~8192] 3.3bfab9da e15861) v4 currently waiting for subops 
from [13,2]
2014-09-06 19:13:28.469539 osd.25 10.168.7.23:6827/11423 364 : [WRN] slow 
request 30.691760 seconds old, received at 2014-09-06 19:12:57.13: 
osd_op(client.12646408.0:36726433 rbd_data.81ab322eb141f2.ec38 
[stat,write 749568~4096] 3.7ae1c1da e15861) v4 currently waiting for subops 
from [13,2]
2014-09-06 19:13:31.469946 osd.25 10.168.7.23:6827/11423 365 : [WRN] 23 slow 
requests, 2 included below; oldest blocked for > 42.196747 secs
2014-09-06 19:13:31.469951 osd.25 10.168.7.23:6827/11423 366 : [WRN] slow 
request 30.344653 seconds old, received at 2014-09-06 19:13:01.125248: 
osd_op(client.18869229.0:100325 rbd_data.41d2eb2eb141f2.2732 
[stat,write 2174976~4096] 3.55d437e e15861) v4 currently waiting for subops 
from [13,6]
2014-09-06 19:13:31.469954 osd.25 10.168.7.23:6827/11423 367 : [WRN] slow 
request 30.344579 seconds old, received at 2014-09-06 19:13:01.125322: 
osd_op(client.18869229.0:100326 rbd_data.41d2eb2eb141f2.2732 
[stat,write 2920448~4096] 3.55d437e e15861) v4 currently waiting for subops 
from [13,6]
2014-09-06 19:13:32.470156 osd.25 10.168.7.23:6827/11423 368 : [WRN] 24 slow 
requests, 1 included below; oldest blocked for > 43.196971 secs
2014-09-06 19:13:32.470163 osd.25 10.168.7.23:6827/11423 369 : [WRN] slow 
request 30.627252 seconds old, received at 2014-09-06 19:13:01.842873: 
osd_op(client.10785413.0:136148901 rbd_data.96803f2eb141f2.33d7 
[stat,write 4063232~4096] 3.cf740399 e15861) v4 currently waiting for subops 
from [1,13]
2014-09-06 19:13:37.470895 osd.25 10.168.7.23:6827/11423 370 : [WRN] 27 slow 
requests, 3 included below; oldest blocked for > 48.197700 secs
2014-09-06 19:13:37.470902 osd.25 10.168.7.23:6827/11423 371 : [WRN] slow 
request 30.769509 seconds old, received at 2014-09-06 19:13:06.701345: 
osd_op(client.18777372.0:1605468 rbd_data.2f1e4e2eb141f2.3541 
[stat,write 1118208~4096] 3.db1ca37e e15861) v4 currently waiting for subops 
from [13,6]
2014-09-06 19:13:37.470907 osd.25 10.168.7.23:6827/11423 372 : [WRN] slow 
request 30.769458 seconds old, received at 2014-09-06 19:13:06.701396: 
osd_op(client.18777372.0:1605469 rbd_data.2f1e4e2eb141f2.3541 
[stat,write 1130496~4096] 3.db1ca37e e15861) v4 currently waiting for subops 
from [13,6]
2014-09-06 19:13:37.470910 osd.25 10.168.7.23:6827/11423 373 : [WRN] slow 
request 30.266843 seconds old, received at 2014-09-06 19:13:07.204011: 
osd_op(client.18795696.0:847270 rbd_data.30532e2eb141f2.36bd 
[stat,write 3772416~4096] 3.76f1df7e e15861) v4 currently waiting for subops 
from [13,6]
2014-09-06 19:13:38.471152 osd.25 10.168.7.23:6827/11423 374 : [WRN] 30 slow 
requests, 3 included below; oldest blocked for > 49.197952 secs
2014-09-06 19:13:38.471158 osd.25 10.168.7.23:6827/11423 375 : [WRN] slow 
request 30.706236 seconds old, received at 2014-09-06 19:13:07.764870: 
osd_op(client.12483523.0:36628673 rbd_data.4defd32eb141f2.00015200 
[stat,write 2121728~4096] 3.cd82ed8a e15861) v4 currently waiting for subops 
from [0,13]
2014-09-06 19:13:38.471162 osd.25 10.168.7.23:6827/11423 376 : [WRN] slow 
request 30.695616 seconds old, received at 2014-09-06 19:13:07.775490: 
osd_op(client.10785416.0:72721328 rbd_data.96808f2eb141f2.2a37 
[stat,write 1507328~4096] 3.323e11da e15861) v4 currently wait

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson

On 06 Sep 2014, at 19:37, Josef Johansson  wrote:

> Hi,
> 
> Unfortunatly the journal tuning did not do much. That’s odd, because I don’t 
> see much utilisation on OSDs themselves. Now this leads to a network-issue 
> between the OSDs right?
> 
To answer my own question. Restarted a bond and it all went up again, found the 
culprit — packet loss. Everything up and running afterwards.

I’ll be taking that beer now,
Regards,
Josef
> On 06 Sep 2014, at 18:17, Josef Johansson  wrote:
> 
>> Hi,
>> 
>> On 06 Sep 2014, at 17:59, Christian Balzer  wrote:
>> 
>>> 
>>> Hello,
>>> 
>>> On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote:
>>> 
 Hi,
 
 On 06 Sep 2014, at 17:27, Christian Balzer  wrote:
 
> 
> Hello,
> 
> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
> 
>> We manage to go through the restore, but the performance degradation
>> is still there.
>> 
> Manifesting itself how?
> 
 Awful slow io on the VMs, and iowait, it’s about 2MB/s or so.
 But mostly a lot of iowait.
 
>>> I was thinking about the storage nodes. ^^
>>> As in, does a particular node or disk seem to be redlined all the time?
>> They’re idle, with little io wait.
> It also shows it self as earlier, with slow requests now and then.
> 
> Like this 
> 2014-09-06 19:13:28.469533 osd.25 10.168.7.23:6827/11423 362 : [WRN] slow 
> request 31.554785 seconds old, received at 2014-09-06 19:12:56.914688: 
> osd_op(client.12483520.0:12211087 rbd_data.4b8e9b3d1b58ba.1222 
> [stat,write 3813376~4096] 3.3bfab9da e15861) v4 currently waiting for subops 
> from [13,2]
> 2014-09-06 19:13:28.469536 osd.25 10.168.7.23:6827/11423 363 : [WRN] slow 
> request 31.554736 seconds old, received at 2014-09-06 19:12:56.914737: 
> osd_op(client.12483520.0:12211088 rbd_data.4b8e9b3d1b58ba.1222 
> [stat,write 3842048~8192] 3.3bfab9da e15861) v4 currently waiting for subops 
> from [13,2]
> 2014-09-06 19:13:28.469539 osd.25 10.168.7.23:6827/11423 364 : [WRN] slow 
> request 30.691760 seconds old, received at 2014-09-06 19:12:57.13: 
> osd_op(client.12646408.0:36726433 rbd_data.81ab322eb141f2.ec38 
> [stat,write 749568~4096] 3.7ae1c1da e15861) v4 currently waiting for subops 
> from [13,2]
> 2014-09-06 19:13:31.469946 osd.25 10.168.7.23:6827/11423 365 : [WRN] 23 slow 
> requests, 2 included below; oldest blocked for > 42.196747 secs
> 2014-09-06 19:13:31.469951 osd.25 10.168.7.23:6827/11423 366 : [WRN] slow 
> request 30.344653 seconds old, received at 2014-09-06 19:13:01.125248: 
> osd_op(client.18869229.0:100325 rbd_data.41d2eb2eb141f2.2732 
> [stat,write 2174976~4096] 3.55d437e e15861) v4 currently waiting for subops 
> from [13,6]
> 2014-09-06 19:13:31.469954 osd.25 10.168.7.23:6827/11423 367 : [WRN] slow 
> request 30.344579 seconds old, received at 2014-09-06 19:13:01.125322: 
> osd_op(client.18869229.0:100326 rbd_data.41d2eb2eb141f2.2732 
> [stat,write 2920448~4096] 3.55d437e e15861) v4 currently waiting for subops 
> from [13,6]
> 2014-09-06 19:13:32.470156 osd.25 10.168.7.23:6827/11423 368 : [WRN] 24 slow 
> requests, 1 included below; oldest blocked for > 43.196971 secs
> 2014-09-06 19:13:32.470163 osd.25 10.168.7.23:6827/11423 369 : [WRN] slow 
> request 30.627252 seconds old, received at 2014-09-06 19:13:01.842873: 
> osd_op(client.10785413.0:136148901 rbd_data.96803f2eb141f2.33d7 
> [stat,write 4063232~4096] 3.cf740399 e15861) v4 currently waiting for subops 
> from [1,13]
> 2014-09-06 19:13:37.470895 osd.25 10.168.7.23:6827/11423 370 : [WRN] 27 slow 
> requests, 3 included below; oldest blocked for > 48.197700 secs
> 2014-09-06 19:13:37.470902 osd.25 10.168.7.23:6827/11423 371 : [WRN] slow 
> request 30.769509 seconds old, received at 2014-09-06 19:13:06.701345: 
> osd_op(client.18777372.0:1605468 rbd_data.2f1e4e2eb141f2.3541 
> [stat,write 1118208~4096] 3.db1ca37e e15861) v4 currently waiting for subops 
> from [13,6]
> 2014-09-06 19:13:37.470907 osd.25 10.168.7.23:6827/11423 372 : [WRN] slow 
> request 30.769458 seconds old, received at 2014-09-06 19:13:06.701396: 
> osd_op(client.18777372.0:1605469 rbd_data.2f1e4e2eb141f2.3541 
> [stat,write 1130496~4096] 3.db1ca37e e15861) v4 currently waiting for subops 
> from [13,6]
> 2014-09-06 19:13:37.470910 osd.25 10.168.7.23:6827/11423 373 : [WRN] slow 
> request 30.266843 seconds old, received at 2014-09-06 19:13:07.204011: 
> osd_op(client.18795696.0:847270 rbd_data.30532e2eb141f2.36bd 
> [stat,write 3772416~4096] 3.76f1df7e e15861) v4 currently waiting for subops 
> from [13,6]
> 2014-09-06 19:13:38.471152 osd.25 10.168.7.23:6827/11423 374 : [WRN] 30 slow 
> requests, 3 included below; oldest blocked for > 49.197952 secs
> 2014-09-06 19:13:38.471158 osd.25 10.168.7.23:6827/11423 375 : [WRN] slow 
> request 30.706236 seconds old, received at 2014-09-06 19:13:07.764870: 
> osd_op(client.12483523.0:36628673 rbd_data.4defd

Re: [ceph-users] SSD journal deployment experiences

2014-09-06 Thread Scott Laird
IOPS are weird things with SSDs.  In theory, you'd see 25% of the write
IOPS when writing to a 4-way RAID5 device, since you write to all 4 devices
in parallel.  Except that's not actually true--unlike HDs where an IOP is
an IOP, SSD IOPS limits are really just a function of request size.
 Because each operation would be ~1/3rd the size, you should see a net of
about 3x the performance of one drive overall, or 75% of the sum of the
drives.  The CPU use will be higher, but it may or may not be a substantial
hit for your use case.  Journals are basically write-only, and 200G S3700s
are supposed to be able to sustain around 360 MB/sec, so RAID 5 would give
you somewhere around 1 GB/sec writing on paper.  Depending on your access
patterns, that may or may not be a win vs single SSDs; it should give you
slightly lower latency for uncongested writes at the very least.  It's
probably worth benchmarking if you have the time.
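
Rough numbers, back of the envelope only:

  1 x 200G S3700: ~360 MB/s sustained writes
  4 as separate journal devices: ~4 x 360 = ~1440 MB/s aggregate
  4 in RAID 5 (3 data + 1 parity per stripe): ~3 x 360 = ~1080 MB/s

So on paper you give up roughly a quarter of the aggregate write bandwidth in
exchange for not losing 5 OSDs when a single journal SSD dies.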

OTOH, S3700s seem to be pretty reliable, and if your cluster is big enough
to handle the loss of 5 OSDs without a big hit, then the lack of complexity
may be a bigger win all on its own.


Scott

On Sat Sep 06 2014 at 9:28:32 AM Dan Van Der Ster 
wrote:

>  RAID5... Hadn't considered it due to the IOPS penalty (it would get
> 1/4th of the IOPS of separated journal devices, according to some online
> raid calc). Compared to RAID10, I guess we'd get 50% more capacity, but
> lower performance.
>
> After the anecdotes that the DCS3700 is very rarely failing, and without a
> stable bcache to build upon, I'm leaning toward the usual 5 journal
> partitions per SSD. But that will leave at least 100GB free per drive, so I
> might try running an OSD there.
>
> Cheers, Dan
> On Sep 6, 2014 6:07 PM, Scott Laird  wrote:
>  Backing up slightly, have you considered RAID 5 over your SSDs?
>  Practically speaking, there's no performance downside to RAID 5 when your
> devices aren't IOPS-bound.
>
> On Sat Sep 06 2014 at 8:37:56 AM Christian Balzer  wrote:
>
>> On Sat, 6 Sep 2014 14:50:20 + Dan van der Ster wrote:
>>
>> > September 6 2014 4:01 PM, "Christian Balzer"  wrote:
>> > > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote:
>> > >
>> > >> Hi Christian,
>> > >>
>> > >> Let's keep debating until a dev corrects us ;)
>> > >
>> > > For the time being, I give the recent:
>> > >
>> > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html
>> > >
>> > > And not so recent:
>> > > http://www.spinics.net/lists/ceph-users/msg04152.html
>> > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
>> > >
>> > > And I'm not going to use BTRFS for mainly RBD backed VM images
>> > > (fragmentation city), never mind the other stability issues that crop
>> > > up here ever so often.
>> >
>> >
>> > Thanks for the links... So until I learn otherwise, I better assume the
>> > OSD is lost when the journal fails. Even though I haven't understood
>> > exactly why :( I'm going to UTSL to understand the consistency better.
>> > An op state diagram would help, but I didn't find one yet.
>> >
>> Using the source as an option of last resort is always nice, having to
>> actually do so for something like this feels a bit lacking in the
>> documentation department (that or my google foo being weak). ^o^
>>
>> > BTW, do you happen to know, _if_ we re-use an OSD after the journal has
>> > failed, are any object inconsistencies going to be found by a
>> > scrub/deep-scrub?
>> >
>> No idea.
>> And really a scenario I hope to never encounter. ^^;;
>>
>> > >>
>> > >> We have 4 servers in a 3U rack, then each of those servers is
>> > >> connected to one of these enclosures with a single SAS cable.
>> > >>
>> >  With the current config, when I dd to all drives in parallel I can
>> >  write at 24*74MB/s = 1776MB/s.
>> > >>>
>> > >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0
>> > >>> lanes, so as far as that bus goes, it can do 4GB/s.
>> > >>> And given your storage pod I assume it is connected with 2 mini-SAS
>> > >>> cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA
>> > >>> bandwidth.
>> > >>
>> > >> From above, we are only using 4 lanes -- so around 2GB/s is expected.
>> > >
>> > > Alright, that explains that then. Any reason for not using both ports?
>> > >
>> >
>> > Probably to minimize costs, and since the single 10Gig-E is a bottleneck
>> > anyway. The whole thing is suboptimal anyway, since this hardware was
>> > not purchased for Ceph to begin with. Hence retrofitting SSDs, etc...
>> >
>> The single 10Gb/s link is the bottleneck for sustained stuff, but when
>> looking at spikes...
>> Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port
>> might also get some loving. ^o^
>>
>> The cluster I'm currently building is based on storage nodes with 4 SSDs
>> (100GB DC 3700s, so 800MB/s would be the absolute write speed limit) and 8
>> HDDs. Connected with 40Gb/s Infiniband. Dual port, dual switch for
>> redundancy, not speed. ^^
>>
>> > >>> 

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Christian Balzer
On Sat, 6 Sep 2014 19:47:13 +0200 Josef Johansson wrote:

> 
> On 06 Sep 2014, at 19:37, Josef Johansson  wrote:
> 
> > Hi,
> > 
> > Unfortunatly the journal tuning did not do much. That’s odd, because I
> > don’t see much utilisation on OSDs themselves. Now this leads to a
> > network-issue between the OSDs right?
> > 
> To answer my own question. Restarted a bond and it all went up again,
> found the culprit — packet loss. Everything up and running afterwards.
> 
If there were actual errors, that should have been visible in atop as well.
For utilization it isn't that obvious, as it doesn't know what bandwidth a
bond device has. Same is true for IPoIB interfaces.
And FWIW, tap (kvm guest interfaces) are wrongly pegged in the kernel at
10Mb/s, so they get to be falsely redlined on compute nodes all the time.
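
A quick way to check for that sort of thing, independent of atop (interface
names are just examples):

ip -s link show bond0
cat /proc/net/bonding/bond0
ethtool -S eth0 | grep -iE 'err|drop'

The counters only ever go up, so take two samples a few seconds apart to see
if they are still climbing.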

> I’ll be taking that beer now,

Skol.

Christian

> Regards,
> Josef
> > On 06 Sep 2014, at 18:17, Josef Johansson  wrote:
> > 
> >> Hi,
> >> 
> >> On 06 Sep 2014, at 17:59, Christian Balzer  wrote:
> >> 
> >>> 
> >>> Hello,
> >>> 
> >>> On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote:
> >>> 
>  Hi,
>  
>  On 06 Sep 2014, at 17:27, Christian Balzer  wrote:
>  
> > 
> > Hello,
> > 
> > On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
> > 
> >> We manage to go through the restore, but the performance
> >> degradation is still there.
> >> 
> > Manifesting itself how?
> > 
>  Awful slow io on the VMs, and iowait, it’s about 2MB/s or so.
>  But mostly a lot of iowait.
>  
> >>> I was thinking about the storage nodes. ^^
> >>> As in, does a particular node or disk seem to be redlined all the
> >>> time?
> >> They’re idle, with little io wait.
> > It also shows it self as earlier, with slow requests now and then.
> > 
> > Like this 
> > 2014-09-06 19:13:28.469533 osd.25 10.168.7.23:6827/11423 362 : [WRN]
> > slow request 31.554785 seconds old, received at 2014-09-06
> > 19:12:56.914688: osd_op(client.12483520.0:12211087
> > rbd_data.4b8e9b3d1b58ba.1222 [stat,write 3813376~4096]
> > 3.3bfab9da e15861) v4 currently waiting for subops from [13,2]
> > 2014-09-06 19:13:28.469536 osd.25 10.168.7.23:6827/11423 363 : [WRN]
> > slow request 31.554736 seconds old, received at 2014-09-06
> > 19:12:56.914737: osd_op(client.12483520.0:12211088
> > rbd_data.4b8e9b3d1b58ba.1222 [stat,write 3842048~8192]
> > 3.3bfab9da e15861) v4 currently waiting for subops from [13,2]
> > 2014-09-06 19:13:28.469539 osd.25 10.168.7.23:6827/11423 364 : [WRN]
> > slow request 30.691760 seconds old, received at 2014-09-06
> > 19:12:57.13: osd_op(client.12646408.0:36726433
> > rbd_data.81ab322eb141f2.ec38 [stat,write 749568~4096]
> > 3.7ae1c1da e15861) v4 currently waiting for subops from [13,2]
> > 2014-09-06 19:13:31.469946 osd.25 10.168.7.23:6827/11423 365 : [WRN]
> > 23 slow requests, 2 included below; oldest blocked for > 42.196747
> > secs 2014-09-06 19:13:31.469951 osd.25 10.168.7.23:6827/11423 366 :
> > [WRN] slow request 30.344653 seconds old, received at 2014-09-06
> > 19:13:01.125248: osd_op(client.18869229.0:100325
> > rbd_data.41d2eb2eb141f2.2732 [stat,write 2174976~4096]
> > 3.55d437e e15861) v4 currently waiting for subops from [13,6]
> > 2014-09-06 19:13:31.469954 osd.25 10.168.7.23:6827/11423 367 : [WRN]
> > slow request 30.344579 seconds old, received at 2014-09-06
> > 19:13:01.125322: osd_op(client.18869229.0:100326
> > rbd_data.41d2eb2eb141f2.2732 [stat,write 2920448~4096]
> > 3.55d437e e15861) v4 currently waiting for subops from [13,6]
> > 2014-09-06 19:13:32.470156 osd.25 10.168.7.23:6827/11423 368 : [WRN]
> > 24 slow requests, 1 included below; oldest blocked for > 43.196971
> > secs 2014-09-06 19:13:32.470163 osd.25 10.168.7.23:6827/11423 369 :
> > [WRN] slow request 30.627252 seconds old, received at 2014-09-06
> > 19:13:01.842873: osd_op(client.10785413.0:136148901
> > rbd_data.96803f2eb141f2.33d7 [stat,write 4063232~4096]
> > 3.cf740399 e15861) v4 currently waiting for subops from [1,13]
> > 2014-09-06 19:13:37.470895 osd.25 10.168.7.23:6827/11423 370 : [WRN]
> > 27 slow requests, 3 included below; oldest blocked for > 48.197700
> > secs 2014-09-06 19:13:37.470902 osd.25 10.168.7.23:6827/11423 371 :
> > [WRN] slow request 30.769509 seconds old, received at 2014-09-06
> > 19:13:06.701345: osd_op(client.18777372.0:1605468
> > rbd_data.2f1e4e2eb141f2.3541 [stat,write 1118208~4096]
> > 3.db1ca37e e15861) v4 currently waiting for subops from [13,6]
> > 2014-09-06 19:13:37.470907 osd.25 10.168.7.23:6827/11423 372 : [WRN]
> > slow request 30.769458 seconds old, received at 2014-09-06
> > 19:13:06.701396: osd_op(client.18777372.0:1605469
> > rbd_data.2f1e4e2eb141f2.3541 [stat,write 1130496~4096]
> > 3.db1ca37e e15861) v4 currently waiting for subops from [13,6]
> > 2014-09-06 19:13:37.470910 osd.25 10.168.7.23:6827/11423 373 : [WRN]
> > slow 

[ceph-users] Re: ceph osd unexpected error

2014-09-06 Thread 廖建锋
I use the latest version, 0.80.6.
I am setting the limit now and will keep watching.



From: Somnath Roy [somnath@sandisk.com]
Sent: September 7, 2014 1:12
To: Haomai Wang; 廖建锋
Cc: ceph-users; ceph-devel
Subject: RE: [ceph-users] ceph osd unexpected error

Have you set the open file descriptor limit in the OSD node ?
Try setting it like 'ulimit -n 65536"

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Haomai Wang
Sent: Saturday, September 06, 2014 7:44 AM
To: 廖建锋
Cc: ceph-users; ceph-devel
Subject: Re: [ceph-users] ceph osd unexpected error

Hi,

Could you give some more detail infos such as operation before occur errors?

And what's your ceph version?

On Fri, Sep 5, 2014 at 3:16 PM, 廖建锋  wrote:
> Dear CEPH ,
> Urgent question, I met a "FAILED assert(0 == "unexpected error")"
> yesterday , Now i have not way to start this OSDS I have attached my
> logs in the attachment, and some  ceph configurations  as below
>
>
> osd_pool_default_pgp_num = 300
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 300
> mon_host = 10.1.0.213,10.1.0.214
> osd_crush_chooseleaf_type = 1
> mds_cache_size = 50
> osd objectstore = keyvaluestore-dev
>
>
>
> Detailed error information :
>
>
>-13> 2014-09-05 15:07:35.279863 7f4d988b9700 2 waiting 51 > 50 ops
> ||
> 11642907 > 104857600
> -12> 2014-09-05 15:07:35.279899 7f4d978b7700 2 waiting 51 > 50 ops ||
> 11642899 > 104857600
> -11> 2014-09-05 15:07:35.279919 7f4d990ba700 2 waiting 51 > 50 ops ||
> 11642901 > 104857600
> -10> 2014-09-05 15:07:35.326803 7f4d9a8bd700 10 monclient: tick
> -9> 2014-09-05 15:07:35.326837 7f4d9a8bd700 10 monclient:
> _check_auth_rotating have uptodate secrets (they expire after
> 2014-09-05
> 15:07:05.326835)
> -8> 2014-09-05 15:07:35.326871 7f4d9a8bd700 10 monclient: renew subs? (now:
> 2014-09-05 15:07:35.326871; renew after: 2014-09-05 15:10:02.464341)
> -- no
> -7> 2014-09-05 15:07:35.343657 7f4d978b7700 2 waiting 51 > 50 ops ||
> 11044551 > 104857600
> -6> 2014-09-05 15:07:35.343654 7f4e1ee72700 1 -- 10.1.0.221:6801/4013
> -6> -->
> osd.12 10.1.0.219:6810/32654 -- pg_info(1 pgs e1267:0.f1) v4 -- ?+0
> 0x18dcf000
> -5> 2014-09-05 15:07:35.343680 7f4d990ba700 2 waiting 51 > 50 ops ||
> 11044553 > 104857600
> -4> 2014-09-05 15:07:35.343686 7f4d988b9700 2 waiting 51 > 50 ops ||
> 11044579 > 104857600
> -3> 2014-09-05 15:07:35.344875 7f4e1fe74700 0 error (22) Invalid
> -3> argument
> not handled on operation 9 (336.0.3, or op 3, counting from 0)
> -2> 2014-09-05 15:07:35.344902 7f4e1fe74700 0 unexpected error code
> -1> 2014-09-05 15:07:35.344903 7f4e1fe74700 0 transaction dump:
> { "ops": [
> { "op_num": 0,
> "op_name": "remove",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0"},
> { "op_num": 1,
> "op_name": "mkcoll",
> "collection": "0.a9_TEMP"},
> { "op_num": 2,
> "op_name": "remove",
> "collection": "0.a9_TEMP",
> "oid": "4b0fea9\/153b885.\/head\/\/0"},
> { "op_num": 3,
> "op_name": "touch",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0"},
> { "op_num": 4,
> "op_name": "omap_setheader",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0",
> "header_length": "0"},
> { "op_num": 5,
> "op_name": "write",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0",
> "length": 1160,
> "offset": 0,
> "bufferlist length": 1160},
> { "op_num": 6,
> "op_name": "omap_setkeys",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0",
> "attr_lens": {}},
> { "op_num": 7,
> "op_name": "setattrs",
> "collection": "0.a9_head",
> "oid": "4b0fea9\/153b885.\/head\/\/0",
> "attr_lens": { "_": 239,
> "_parent": 250,
> "snapset": 31}},
> { "op_num": 8,
> "op_name": "omap_setkeys",
> "collection": "meta",
> "oid": "16ef7597\/infos\/head\/\/-1",
> "attr_lens": { "0.a9_epoch": 4,
> "0.a9_info": 684}},
> { "op_num": 9,
> "op_name": "remove",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0"},
> { "op_num": 10,
> "op_name": "remove",
> "collection": "0.a9_TEMP",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0"},
> { "op_num": 11,
> "op_name": "touch",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0"},
> { "op_num": 12,
> "op_name": "omap_setheader",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0",
> "header_length": "0"},
> { "op_num": 13,
> "op_name": "write",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0",
> "length": 507284,
> "offset": 0,
> "bufferlist length": 507284},
> { "op_num": 14,
> "op_name": "omap_setkeys",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.\/head\/\/0",
> "attr_lens": {}},
> { "op_num": 15,
> "op_name": "setattrs",
> "collection": "0.a9_head",
> "oid": "4c56f2a9\/1c04096.

Re: [ceph-users] resizing the OSD

2014-09-06 Thread Christian Balzer

Hello,

On Sat, 06 Sep 2014 10:28:19 -0700 JIten Shah wrote:

> Thanks Christian.  Replies inline.
> On Sep 6, 2014, at 8:04 AM, Christian Balzer  wrote:
> 
> > 
> > Hello,
> > 
> > On Fri, 05 Sep 2014 15:31:01 -0700 JIten Shah wrote:
> > 
> >> Hello Cephers,
> >> 
> >> We created a ceph cluster with 100 OSD, 5 MON and 1 MSD and most of
> >> the stuff seems to be working fine but we are seeing some degrading
> >> on the osd's due to lack of space on the osd's. 
> > 
> > Please elaborate on that degradation.
> 
> The degradation happened on few OSD's because it got quickly filled up.
> They were not of the same size as the other OSD's. Now I want to remove
> these OSD's and readd them with correct size to match the others.

Alright, that's a good idea, uniformity helps. ^^

> > 
> >> Is there a way to resize the
> >> OSD without bringing the cluster down?
> >> 
> > 
> > Define both "resize" and "cluster down".
> 
> Basically I want to remove the OSD's with incorrect size and readd them
> with the size matching the other OSD's. 
> > 
> > As in, resizing how? 
> > Are your current OSDs on disks/LVMs that are not fully used and thus
> > could be grown?
> > What is the size of your current OSDs?
> 
> The size of current OSD's is 20GB and we do have more unused space on
> the disk that we can make the LVM bigger and increase the size of the
> OSD's. I agree that we need to have all the disks of same size and I am
> working towards that.Thanks.
> > 
OK, so your OSDs are backed by LVM. 
A curious choice, any particular reason to do so?

Either way, in theory you could grow things in place, obviously first the
LVM and then the underlying filesystem. Both ext4 and xfs support online
growing, so the OSD can keep running the whole time.
If you're unfamiliar with these things, play with them on a test machine
first. 
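
A minimal sketch of the in-place route, assuming an XFS-backed OSD on an LV
(names and sizes are placeholders):

# grow the LV by whatever you have to spare
lvextend -L +20G /dev/vg0/osd-12

# grow the filesystem online (xfs_growfs for xfs, resize2fs for ext4)
xfs_growfs /var/lib/ceph/osd/ceph-12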

Now for the next step we will really need to know how you deployed ceph
and the result of "ceph osd tree" (not all 100 OSDs are needed, a sample of
a "small" and "big" OSD is sufficient).

Depending on the results (it will probably have varying weights depending
on the size and a reweight value of 1 for all) you will need to adjust the
weight of the grown OSD in question accordingly with "ceph osd crush
reweight". 
That step will incur data movement, so do it one OSD at a time.
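
Roughly like this, one OSD at a time (the weight is just an example, by
convention it is the disk size in TB, so a 40GB OSD would be ~0.04):

ceph osd tree                         # note the current weight of the grown OSD
ceph osd crush reweight osd.12 0.04
ceph -w                               # watch the resulting data movement settle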

> > The normal way of growing a cluster is to add more OSDs.
> > Preferably of the same size and same performance disks.
> > This will not only simplify things immensely but also make them a lot
> > more predictable.
> > This of course depends on your use case and usage patterns, but often
> > when running out of space you're also running out of other resources
> > like CPU, memory or IOPS of the disks involved. So adding more instead
> > of growing them is most likely the way forward.
> > 
> > If you were to replace actual disks with larger ones, take them (the
> > OSDs) out one at a time and re-add it. If you're using ceph-deploy, it
> > will use the disk size as basic weight, if you're doing things
> > manually make sure to specify that size/weight accordingly.
> > Again, you do want to do this for all disks to keep things uniform.
> > 
> > If your cluster (pools really) are set to a replica size of at least 2
> > (risky!) or 3 (as per Firefly default), taking a single OSD out would
> > of course never bring the cluster down.
> > However taking an OSD out and/or adding a new one will cause data
> > movement that might impact your cluster's performance.
> > 
> 
> We have a current replica size of 2 with 100 OSD's. How many can I loose
> without affecting the performance? I understand the impact of data
> movement.
> 
Unless your LVMs are in turn living on a RAID, a replica of 2 with 100
OSDs is begging Murphy for a double disk failure. I'm also curious how
many actual physical disks those OSDs live on and how many physical hosts are
in your cluster.
So again, you can't lose more than one OSD at a time w/o losing data.

The performance impact of losing a single OSD out of 100 should be small,
especially given the size of your OSDs. However w/o knowing your actual
cluster (hardware and otherwise) don't expect anybody here to make
accurate predictions. 

Christian

> --Jiten
> 
> 
> 
> 
> 
> > Regards,
> > 
> > Christian
> > -- 
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] Re: ceph osd unexpected error

2014-09-06 Thread Haomai Wang
Yes, if you still meet this error, please add
"debug_keyvaluestore=20/20" to your config and catch the debug output

On Sun, Sep 7, 2014 at 11:11 AM, 廖建锋  wrote:
> I use latest version 0.80.6
> I am setting  the limitation now, and watching?
>
>
> 
> From: Somnath Roy [somnath@sandisk.com]
> Sent: September 7, 2014 1:12
> To: Haomai Wang; 廖建锋
> Cc: ceph-users; ceph-devel
> Subject: RE: [ceph-users] ceph osd unexpected error
>
> Have you set the open file descriptor limit in the OSD node ?
> Try setting it like 'ulimit -n 65536"
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Haomai Wang
> Sent: Saturday, September 06, 2014 7:44 AM
> To: 廖建锋
> Cc: ceph-users; ceph-devel
> Subject: Re: [ceph-users] ceph osd unexpected error
>
> Hi,
>
> Could you give some more detail infos such as operation before occur errors?
>
> And what's your ceph version?
>
> On Fri, Sep 5, 2014 at 3:16 PM, 廖建锋  wrote:
>> Dear CEPH ,
>> Urgent question, I met a "FAILED assert(0 == "unexpected error")"
>> yesterday , Now i have not way to start this OSDS I have attached my
>> logs in the attachment, and some  ceph configurations  as below
>>
>>
>> osd_pool_default_pgp_num = 300
>> osd_pool_default_size = 2
>> osd_pool_default_min_size = 1
>> osd_pool_default_pg_num = 300
>> mon_host = 10.1.0.213,10.1.0.214
>> osd_crush_chooseleaf_type = 1
>> mds_cache_size = 50
>> osd objectstore = keyvaluestore-dev
>>
>>
>>
>> Detailed error information :
>>
>>
>>-13> 2014-09-05 15:07:35.279863 7f4d988b9700 2 waiting 51 > 50 ops
>> ||
>> 11642907 > 104857600
>> -12> 2014-09-05 15:07:35.279899 7f4d978b7700 2 waiting 51 > 50 ops ||
>> 11642899 > 104857600
>> -11> 2014-09-05 15:07:35.279919 7f4d990ba700 2 waiting 51 > 50 ops ||
>> 11642901 > 104857600
>> -10> 2014-09-05 15:07:35.326803 7f4d9a8bd700 10 monclient: tick
>> -9> 2014-09-05 15:07:35.326837 7f4d9a8bd700 10 monclient:
>> _check_auth_rotating have uptodate secrets (they expire after
>> 2014-09-05
>> 15:07:05.326835)
>> -8> 2014-09-05 15:07:35.326871 7f4d9a8bd700 10 monclient: renew subs? (now:
>> 2014-09-05 15:07:35.326871; renew after: 2014-09-05 15:10:02.464341)
>> -- no
>> -7> 2014-09-05 15:07:35.343657 7f4d978b7700 2 waiting 51 > 50 ops ||
>> 11044551 > 104857600
>> -6> 2014-09-05 15:07:35.343654 7f4e1ee72700 1 -- 10.1.0.221:6801/4013
>> -6> -->
>> osd.12 10.1.0.219:6810/32654 -- pg_info(1 pgs e1267:0.f1) v4 -- ?+0
>> 0x18dcf000
>> -5> 2014-09-05 15:07:35.343680 7f4d990ba700 2 waiting 51 > 50 ops ||
>> 11044553 > 104857600
>> -4> 2014-09-05 15:07:35.343686 7f4d988b9700 2 waiting 51 > 50 ops ||
>> 11044579 > 104857600
>> -3> 2014-09-05 15:07:35.344875 7f4e1fe74700 0 error (22) Invalid
>> -3> argument
>> not handled on operation 9 (336.0.3, or op 3, counting from 0)
>> -2> 2014-09-05 15:07:35.344902 7f4e1fe74700 0 unexpected error code
>> -1> 2014-09-05 15:07:35.344903 7f4e1fe74700 0 transaction dump:
>> { "ops": [
>> { "op_num": 0,
>> "op_name": "remove",
>> "collection": "0.a9_head",
>> "oid": "4b0fea9\/153b885.\/head\/\/0"},
>> { "op_num": 1,
>> "op_name": "mkcoll",
>> "collection": "0.a9_TEMP"},
>> { "op_num": 2,
>> "op_name": "remove",
>> "collection": "0.a9_TEMP",
>> "oid": "4b0fea9\/153b885.\/head\/\/0"},
>> { "op_num": 3,
>> "op_name": "touch",
>> "collection": "0.a9_head",
>> "oid": "4b0fea9\/153b885.\/head\/\/0"},
>> { "op_num": 4,
>> "op_name": "omap_setheader",
>> "collection": "0.a9_head",
>> "oid": "4b0fea9\/153b885.\/head\/\/0",
>> "header_length": "0"},
>> { "op_num": 5,
>> "op_name": "write",
>> "collection": "0.a9_head",
>> "oid": "4b0fea9\/153b885.\/head\/\/0",
>> "length": 1160,
>> "offset": 0,
>> "bufferlist length": 1160},
>> { "op_num": 6,
>> "op_name": "omap_setkeys",
>> "collection": "0.a9_head",
>> "oid": "4b0fea9\/153b885.\/head\/\/0",
>> "attr_lens": {}},
>> { "op_num": 7,
>> "op_name": "setattrs",
>> "collection": "0.a9_head",
>> "oid": "4b0fea9\/153b885.\/head\/\/0",
>> "attr_lens": { "_": 239,
>> "_parent": 250,
>> "snapset": 31}},
>> { "op_num": 8,
>> "op_name": "omap_setkeys",
>> "collection": "meta",
>> "oid": "16ef7597\/infos\/head\/\/-1",
>> "attr_lens": { "0.a9_epoch": 4,
>> "0.a9_info": 684}},
>> { "op_num": 9,
>> "op_name": "remove",
>> "collection": "0.a9_head",
>> "oid": "4c56f2a9\/1c04096.\/head\/\/0"},
>> { "op_num": 10,
>> "op_name": "remove",
>> "collection": "0.a9_TEMP",
>> "oid": "4c56f2a9\/1c04096.\/head\/\/0"},
>> { "op_num": 11,
>> "op_name": "touch",
>> "collection": "0.a9_head",
>> "oid": "4c56f2a9\/1c04096.\/head\/\/0"},
>> { "op_num": 12,
>> "op_name": "omap_setheader",
>> "collection": "0.a9_head",
>> "oid": "4c56f2a9\/1c04096.\/head\/\/0",
>> "header_length": "0"},
>> { "op_num": 13,
>> "op_name": "write",
>> "collection": "0.a9_head",
>> "oid": "4c56f2a9\/1c04096.\/head\/\/

[ceph-users] Re: Re: ceph osd unexpected error

2014-09-06 Thread 廖建锋
It happened this morning and I could not wait, so I removed and re-added the OSD.
Next time I will turn the debug level up when it happens again.
Thanks very much.

From: Haomai Wang [haomaiw...@gmail.com]
Sent: September 7, 2014 12:08
To: 廖建锋
Cc: Somnath Roy; ceph-users; ceph-devel
Subject: Re: Re: [ceph-users] ceph osd unexpected error

Yes, if you still meet this error, please add
"debug_keyvaluestore=20/20" to your config and catch the debug output

On Sun, Sep 7, 2014 at 11:11 AM, 廖建锋  wrote:
> I use latest version 0.80.6
> I am setting  the limitation now, and watching?
>
>
> 
> From: Somnath Roy [somnath@sandisk.com]
> Sent: September 7, 2014 1:12
> To: Haomai Wang; 廖建锋
> Cc: ceph-users; ceph-devel
> Subject: RE: [ceph-users] ceph osd unexpected error
>
> Have you set the open file descriptor limit in the OSD node ?
> Try setting it like 'ulimit -n 65536"
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Haomai Wang
> Sent: Saturday, September 06, 2014 7:44 AM
> To: 廖建锋
> Cc: ceph-users; ceph-devel
> Subject: Re: [ceph-users] ceph osd unexpected error
>
> Hi,
>
> Could you give some more detail infos such as operation before occur errors?
>
> And what's your ceph version?
>
> On Fri, Sep 5, 2014 at 3:16 PM, 廖建锋  wrote:
>> Dear CEPH ,
>> Urgent question, I met a "FAILED assert(0 == "unexpected error")"
>> yesterday , Now i have not way to start this OSDS I have attached my
>> logs in the attachment, and some  ceph configurations  as below
>>
>>
>> osd_pool_default_pgp_num = 300
>> osd_pool_default_size = 2
>> osd_pool_default_min_size = 1
>> osd_pool_default_pg_num = 300
>> mon_host = 10.1.0.213,10.1.0.214
>> osd_crush_chooseleaf_type = 1
>> mds_cache_size = 50
>> osd objectstore = keyvaluestore-dev
>>
>>
>>
>> Detailed error information :
>>
>>
>>-13> 2014-09-05 15:07:35.279863 7f4d988b9700 2 waiting 51 > 50 ops
>> ||
>> 11642907 > 104857600
>> -12> 2014-09-05 15:07:35.279899 7f4d978b7700 2 waiting 51 > 50 ops ||
>> 11642899 > 104857600
>> -11> 2014-09-05 15:07:35.279919 7f4d990ba700 2 waiting 51 > 50 ops ||
>> 11642901 > 104857600
>> -10> 2014-09-05 15:07:35.326803 7f4d9a8bd700 10 monclient: tick
>> -9> 2014-09-05 15:07:35.326837 7f4d9a8bd700 10 monclient:
>> _check_auth_rotating have uptodate secrets (they expire after
>> 2014-09-05
>> 15:07:05.326835)
>> -8> 2014-09-05 15:07:35.326871 7f4d9a8bd700 10 monclient: renew subs? (now:
>> 2014-09-05 15:07:35.326871; renew after: 2014-09-05 15:10:02.464341)
>> -- no
>> -7> 2014-09-05 15:07:35.343657 7f4d978b7700 2 waiting 51 > 50 ops ||
>> 11044551 > 104857600
>> -6> 2014-09-05 15:07:35.343654 7f4e1ee72700 1 -- 10.1.0.221:6801/4013
>> -6> -->
>> osd.12 10.1.0.219:6810/32654 -- pg_info(1 pgs e1267:0.f1) v4 -- ?+0
>> 0x18dcf000
>> -5> 2014-09-05 15:07:35.343680 7f4d990ba700 2 waiting 51 > 50 ops ||
>> 11044553 > 104857600
>> -4> 2014-09-05 15:07:35.343686 7f4d988b9700 2 waiting 51 > 50 ops ||
>> 11044579 > 104857600
>> -3> 2014-09-05 15:07:35.344875 7f4e1fe74700 0 error (22) Invalid
>> -3> argument
>> not handled on operation 9 (336.0.3, or op 3, counting from 0)
>> -2> 2014-09-05 15:07:35.344902 7f4e1fe74700 0 unexpected error code
>> -1> 2014-09-05 15:07:35.344903 7f4e1fe74700 0 transaction dump:
>> { "ops": [
>> { "op_num": 0,
>> "op_name": "remove",
>> "collection": "0.a9_head",
>> "oid": "4b0fea9\/153b885.\/head\/\/0"},
>> { "op_num": 1,
>> "op_name": "mkcoll",
>> "collection": "0.a9_TEMP"},
>> { "op_num": 2,
>> "op_name": "remove",
>> "collection": "0.a9_TEMP",
>> "oid": "4b0fea9\/153b885.\/head\/\/0"},
>> { "op_num": 3,
>> "op_name": "touch",
>> "collection": "0.a9_head",
>> "oid": "4b0fea9\/153b885.\/head\/\/0"},
>> { "op_num": 4,
>> "op_name": "omap_setheader",
>> "collection": "0.a9_head",
>> "oid": "4b0fea9\/153b885.\/head\/\/0",
>> "header_length": "0"},
>> { "op_num": 5,
>> "op_name": "write",
>> "collection": "0.a9_head",
>> "oid": "4b0fea9\/153b885.\/head\/\/0",
>> "length": 1160,
>> "offset": 0,
>> "bufferlist length": 1160},
>> { "op_num": 6,
>> "op_name": "omap_setkeys",
>> "collection": "0.a9_head",
>> "oid": "4b0fea9\/153b885.\/head\/\/0",
>> "attr_lens": {}},
>> { "op_num": 7,
>> "op_name": "setattrs",
>> "collection": "0.a9_head",
>> "oid": "4b0fea9\/153b885.\/head\/\/0",
>> "attr_lens": { "_": 239,
>> "_parent": 250,
>> "snapset": 31}},
>> { "op_num": 8,
>> "op_name": "omap_setkeys",
>> "collection": "meta",
>> "oid": "16ef7597\/infos\/head\/\/-1",
>> "attr_lens": { "0.a9_epoch": 4,
>> "0.a9_info": 684}},
>> { "op_num": 9,
>> "op_name": "remove",
>> "collection": "0.a9_head",
>> "oid": "4c56f2a9\/1c04096.\/head\/\/0"},
>> { "op_num": 10,
>> "op_name": "remove",
>> "collection": "0.a9_TEMP",
>> "oid": "4c56f2a9\/1c04096.\/head\/\/0"},
>> { "op_num": 11,
>> "op_name": "touch",
>> "collection": "0.

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson

On 07 Sep 2014, at 04:47, Christian Balzer  wrote:

> On Sat, 6 Sep 2014 19:47:13 +0200 Josef Johansson wrote:
> 
>> 
>> On 06 Sep 2014, at 19:37, Josef Johansson  wrote:
>> 
>>> Hi,
>>> 
>>> Unfortunatly the journal tuning did not do much. That’s odd, because I
>>> don’t see much utilisation on OSDs themselves. Now this leads to a
>>> network-issue between the OSDs right?
>>> 
>> To answer my own question. Restarted a bond and it all went up again,
>> found the culprit — packet loss. Everything up and running afterwards.
>> 
> If there were actual errors, that should have been visible in atop as well.
> For utilization it isn't that obvious, as it doesn't know what bandwidth a
> bond device has. Same is true for IPoIB interfaces.
> And FWIW, tap (kvm guest interfaces) are wrongly pegged in the kernel at
> 10Mb/s, so they get to be falsely redlined on compute nodes all the time.
> 
This is the second time I’ve seen Ceph behaving badly due to networking issues. 
Maybe @Inktank has ideas on how to surface packet loss in the ceph log?
Regards,
Josef
>> I’ll be taking that beer now,
> 
> Skol.
> 
> Christian
> 
>> Regards,
>> Josef
>>> On 06 Sep 2014, at 18:17, Josef Johansson  wrote:
>>> 
 Hi,
 
 On 06 Sep 2014, at 17:59, Christian Balzer  wrote:
 
> 
> Hello,
> 
> On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote:
> 
>> Hi,
>> 
>> On 06 Sep 2014, at 17:27, Christian Balzer  wrote:
>> 
>>> 
>>> Hello,
>>> 
>>> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
>>> 
 We manage to go through the restore, but the performance
 degradation is still there.
 
>>> Manifesting itself how?
>>> 
>> Awful slow io on the VMs, and iowait, it’s about 2MB/s or so.
>> But mostly a lot of iowait.
>> 
> I was thinking about the storage nodes. ^^
> As in, does a particular node or disk seem to be redlined all the
> time?
 They’re idle, with little io wait.
>>> It also shows it self as earlier, with slow requests now and then.
>>> 
>>> Like this 
>>> 2014-09-06 19:13:28.469533 osd.25 10.168.7.23:6827/11423 362 : [WRN]
>>> slow request 31.554785 seconds old, received at 2014-09-06
>>> 19:12:56.914688: osd_op(client.12483520.0:12211087
>>> rbd_data.4b8e9b3d1b58ba.1222 [stat,write 3813376~4096]
>>> 3.3bfab9da e15861) v4 currently waiting for subops from [13,2]
>>> 2014-09-06 19:13:28.469536 osd.25 10.168.7.23:6827/11423 363 : [WRN]
>>> slow request 31.554736 seconds old, received at 2014-09-06
>>> 19:12:56.914737: osd_op(client.12483520.0:12211088
>>> rbd_data.4b8e9b3d1b58ba.1222 [stat,write 3842048~8192]
>>> 3.3bfab9da e15861) v4 currently waiting for subops from [13,2]
>>> 2014-09-06 19:13:28.469539 osd.25 10.168.7.23:6827/11423 364 : [WRN]
>>> slow request 30.691760 seconds old, received at 2014-09-06
>>> 19:12:57.13: osd_op(client.12646408.0:36726433
>>> rbd_data.81ab322eb141f2.ec38 [stat,write 749568~4096]
>>> 3.7ae1c1da e15861) v4 currently waiting for subops from [13,2]
>>> 2014-09-06 19:13:31.469946 osd.25 10.168.7.23:6827/11423 365 : [WRN]
>>> 23 slow requests, 2 included below; oldest blocked for > 42.196747
>>> secs 2014-09-06 19:13:31.469951 osd.25 10.168.7.23:6827/11423 366 :
>>> [WRN] slow request 30.344653 seconds old, received at 2014-09-06
>>> 19:13:01.125248: osd_op(client.18869229.0:100325
>>> rbd_data.41d2eb2eb141f2.2732 [stat,write 2174976~4096]
>>> 3.55d437e e15861) v4 currently waiting for subops from [13,6]
>>> 2014-09-06 19:13:31.469954 osd.25 10.168.7.23:6827/11423 367 : [WRN]
>>> slow request 30.344579 seconds old, received at 2014-09-06
>>> 19:13:01.125322: osd_op(client.18869229.0:100326
>>> rbd_data.41d2eb2eb141f2.2732 [stat,write 2920448~4096]
>>> 3.55d437e e15861) v4 currently waiting for subops from [13,6]
>>> 2014-09-06 19:13:32.470156 osd.25 10.168.7.23:6827/11423 368 : [WRN]
>>> 24 slow requests, 1 included below; oldest blocked for > 43.196971
>>> secs 2014-09-06 19:13:32.470163 osd.25 10.168.7.23:6827/11423 369 :
>>> [WRN] slow request 30.627252 seconds old, received at 2014-09-06
>>> 19:13:01.842873: osd_op(client.10785413.0:136148901
>>> rbd_data.96803f2eb141f2.33d7 [stat,write 4063232~4096]
>>> 3.cf740399 e15861) v4 currently waiting for subops from [1,13]
>>> 2014-09-06 19:13:37.470895 osd.25 10.168.7.23:6827/11423 370 : [WRN]
>>> 27 slow requests, 3 included below; oldest blocked for > 48.197700
>>> secs 2014-09-06 19:13:37.470902 osd.25 10.168.7.23:6827/11423 371 :
>>> [WRN] slow request 30.769509 seconds old, received at 2014-09-06
>>> 19:13:06.701345: osd_op(client.18777372.0:1605468
>>> rbd_data.2f1e4e2eb141f2.3541 [stat,write 1118208~4096]
>>> 3.db1ca37e e15861) v4 currently waiting for subops from [13,6]
>>> 2014-09-06 19:13:37.470907 osd.25 10.168.7.23:6827/11423 372 : [WRN]
>>> slow request 30.769458 seconds old, received at 2014-09-06
>>>