Re: [ceph-users] TCP failed connection attempts

2014-03-27 Thread Dan Van Der Ster
On 26 Mar 2014 at 21:33:06, Sergey Malinin 
(h...@newmail.com) wrote:
This is typical (output from netstat -s):

50329019 active connections openings
15218590 passive connection openings
44167087 failed connection attempts

Taking into account that presumably you don't have anything besides the OSD daemon
running on the machine, I would say that this is an extraordinarily large
number, indicating that something is definitely going wrong.

This was my thinking as well. I am seeing this on a test cluster with very few 
real clients, so most of the connect attempts should be replication from other 
OSDs.

The suggested sysctl changes didn’t stop the failed conn attempts from 
increasing. I’m going to keep looking around…

Cheers, Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ec pools and radosgw

2014-03-27 Thread Loic Dachary
Hi Michael,

Could you please show the exact commands you've used to modify the k & m values 
?

Cheers

On 27/03/2014 00:48, Michael Nelson wrote:
> I am playing around with erasure coded pools on 0.78-348 (firefly) and am 
> attempting to enable EC on the .rgw.buckets pool for radosgw
> (fresh install).
> 
> If I use a plain EC profile (no settings changed), uploads of various sizes 
> work fine and EC seems to be working based on how much space is
> being used in the cluster. If I start playing with k or m values, multipart 
> uploads start failing (on the first chunk). I haven't seen issues with rados 
> put or rados bench on EC pools. I saw the same behavior on the official v0.78 
> release.
> 
> I turned up verbose logging on OSDs and RGW and I don't see obvious errors. 
> Here is a snippet from the RGW log from the context/thread that failed:
> 
> 7f8224dfa700  1 -- 198.18.32.12:0/1015918 --> 198.18.32.13:6815/28535 -- 
> osd_op(client.4362.0:206 .dir.default.4327.1 [call rgw.bucket_complete_op] 
> 10.ffda47da ack+ondisk+write e85) v4 -- ?+0 0x7f81a8094d30 con 0x7f82400023c0
> 7f8224dfa700 20 -- 198.18.32.12:0/1015918 submit_message 
> osd_op(client.4362.0:206 .dir.default.4327.1 [call rgw.bucket_complete_op] 
> 10.ffda47da ack+ondisk+write e85) v4 remote, 198.18.32.13:6815/28535, have 
> pipe.
> 7f8224dfa700  0 WARNING: set_req_state_err err_no=95 resorting to 500
> 7f8224dfa700  2 req 7:0.072198:s3:PUT /xyzxyzxyz:put_obj:http status=500
> 7f8224dfa700  1 == req done req=0x7f823000f880 http_status=500 ==
> 
> Thanks,
> -mike
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] TCP failed connection attempts

2014-03-27 Thread Sergey Malinin

On 27.03.14, 10:52, Dan Van Der Ster wrote:
On 26 Mar 2014 at 21:33:06, Sergey Malinin (h...@newmail.com) wrote:


This is typical (output from netstat -s):

50329019 active connections openings
15218590 passive connection openings
44167087 failed connection attempts

Taking into account that presumably you don't have anything besides the
OSD daemon running on the machine, I would say that this is an
extraordinarily large number, indicating that something is definitely
going wrong.


This was my thinking as well. I am seeing this on a test cluster with 
very few real clients, so most of the connect attempts should be 
replication from other OSDs.


This figure represents connections initiated locally, i.e. replication 
*to* other osds.


The suggested sysctl changes didn’t stop the failed conn attempts from 
increasing. I’m going to keep looking around…


sysctl has nothing to do with that since those are just counters. You 
can debug failed connections by logging connection resets:

iptables -I INPUT -p tcp -m tcp --tcp-flags RST RST -j LOG
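
For what it's worth, a minimal way to watch both sides of this (assuming the
kernel log lands in /var/log/kern.log on your distro; adjust the path if not):

    # watch the raw counter grow
    watch -n 5 'netstat -s | grep -i "failed connection attempts"'

    # after adding the LOG rule above, inspect the logged RST packets
    tail -f /var/log/kern.log | grep 'PROTO=TCP'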


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] TCP failed connection attempts

2014-03-27 Thread Dan Van Der Ster
On 27 Mar 2014 at 10:44:35, Sergey Malinin 
(h...@newmail.com) wrote:
sysctl has nothing to do with that since those are just counters. You can debug 
failed connections by logging connection resets:
iptables -I INPUT -p tcp -m tcp --tcp-flags RST RST -j LOG

Thanks for that… you helped me identify a host that was looping through many 
failed connections to our OSD servers. (It was a half-disabled ceph dashboard 
tool… my fault, probably).

But even after removing that host, I still see the failed connections counter 
increasing while there are no packets logged from above.

So I’m still looking …

Cheers, Dan


-- Dan van der Ster || Data & Storage Services || CERN IT Department --
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU

2014-03-27 Thread Craig Lewis




The osd.8 log shows it doing some deep scrubbing here. Perhaps that is
what caused your earlier issues with CPU usage?
When I first noticed the CPU usage, I checked iotop and iostat. Both 
said there was no disk activity, on any OSD.





At 14:17:25, I ran radosgw-admin --name=client.radosgw.ceph1c regions
list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.
regions list hung, and I killed it. At 14:18:15, I stopped ceph-osd id=8.
At 14:18:45, I ran radosgw-admin --name=client.radosgw.ceph1c regions
list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.  It
returned successfully.
At 14:19:10, I stopped ceph-osd id=4.


Since you've got the noout flag set, when osd.8 goes down any objects
for which osd.8 is the primary will not be readable. Since ceph reads
from primaries, and the noout flag prevents another osd from being
selected, which would happen if osd.8 were marked out, these objects
(which apparently happen to include some needed for regions list or
regionmap get) are inaccessible.

Josh



Taking osd.8 down (regardless of the noout flag) was the only way to get 
things to respond.  I have not set nodown, just noout.




When I got in this morning, I had 4 more flapping OSDs: osd.4, osd.12, 
osd.13, and osd.6.  All 4 daemons were using 100% CPU, and no disk I/O.


osd.1 and osd.14 are the only ones currently using disk I/O.


There are 3 PGs being deepscrubbed:
root@ceph1c:/var/log/radosgw-agent# ceph pg dump | grep deep
dumped all in format plain
pg_stat  objects  mip  degr  unf  bytes       log   disklog  state                        state_stamp                 v             reported      up      acting  last_scrub    scrub_stamp                 last_deep_scrub  deep_scrub_stamp
11.774   8682     0    0     0    7614655060  3001  3001     active+clean+scrubbing+deep  2014-03-27 10:20:30.598032  8381'5180514  8521:6520833  [13,4]  [13,4]  7894'5176984  2014-03-20 04:41:48.762996  7894'5176984     2014-03-20 04:41:48.762996
11.698   8587     0    0     0    7723737171  3001  3001     active+clean+scrubbing+deep  2014-03-27 10:16:31.292487  8383'483312   8521:618864   [14,1]  [14,1]  7894'479783   2014-03-20 03:53:18.024015  7894'479783      2014-03-20 03:53:18.024015
11.d8    8743     0    0     0    7570365909  3409  3409     active+clean+scrubbing+deep  2014-03-27 10:15:39.558121  8396'1753407  8521:2417672  [12,6]  [12,6]  7894'1459230  2014-03-20 02:40:22.123236  7894'1459230     2014-03-20 02:40:22.123236



These PGs are on the 6 OSDs mentioned.  osd.1 and osd.14 are not using 
100% CPU and are using disk IO.  osd.12, osd.6, osd.4, and osd.13 are 
using 100% CPU, and 0 kB/s of disk IO.  Here's iostat on ceph0c, which 
contains osd.1 (/dev/sdd), osd.4 (/dev/sde), and osd.6 (/dev/sdg):

root@ceph0c:/var/log/ceph# iostat -p sdd,sde,sdh 1
Linux 3.5.0-46-generic (ceph0c)  03/27/2014  _x86_64_  (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          32.64    0.00    5.52    4.42    0.00   57.42

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd             113.00       900.00         0.00        900          0
sdd1            113.00       900.00         0.00        900          0
sde               0.00         0.00         0.00          0          0
sde1              0.00         0.00         0.00          0          0
sdh               0.00         0.00         0.00          0          0
sdh1              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29.90    0.00    4.41    2.82    0.00   62.87

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd             181.00      1332.00         0.00       1332          0
sdd1            181.00      1332.00         0.00       1332          0
sde              22.00         8.00       328.00          8        328
sde1             18.00         8.00       328.00          8        328
sdh              18.00         4.00       228.00          4        228
sdh1             15.00         4.00       228.00          4        228

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          30.21    0.00    4.26    1.71    0.00   63.82

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd             180.00      1044.00       200.00       1044        200
sdd1            177.00      1044.00       200.00       1044        200
sde               0.00         0.00         0.00          0          0
sde1              0.00         0.00         0.00          0          0
sdh               0.00         0.00         0.00          0          0
sdh1              0.00         0.00         0.00          0          0


So it's not zero disk activity, but it's pretty close.  The disks continue 
to have 0 kB_read and 0 kB_wrtn for the next 60 seconds.  It's much lower 
than I would expect for OSDs executing a deepscrub.



I restarted the 4 flapping OSDs.  They recovered, then started flapping 
within 5 minutes.  I shut all of the ceph daemons down, and rebooted all 
nodes at the same time.  The OSDs return to 100% CPU usage very soon 
after boot.






I was go

Re: [ceph-users] if partition name changes, will ceph get corrupted?

2014-03-27 Thread Chris Kitzmiller
>> We use /dev/disk/by-path for this reason, but we confirmed that is stable
>> for our HBAs. Maybe /dev/disk/by-something is consistent with your
>> controller.
> 
> The upstart/udev scripts will handle mounting and osd id detection, at
> least on Ubuntu.

I'll caution that while the OSD will be correctly ID'd, you might have trouble 
if you've got your journals on different disks. Running Ubuntu 13.10 with ceph 
0.72.2 I had my journal disk swap places with another drive on reboot and all 
associated OSDs failed to start. In the future I'll have ceph-deploy use 
/dev/disk/by-partuuid for the journals instead of /dev/sd*#.
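
For anyone checking their own setup, a quick sketch (assumes the default
/var/lib/ceph/osd layout; the UUID and osd id below are placeholders):

    # see where each OSD's journal symlink currently points
    ls -l /var/lib/ceph/osd/ceph-*/journal

    # with the OSD stopped, repoint a journal at a stable by-partuuid name
    ln -sf /dev/disk/by-partuuid/<journal-partition-uuid> /var/lib/ceph/osd/ceph-3/journal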
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ec pools and radosgw

2014-03-27 Thread Michael Nelson

On Thu, 27 Mar 2014, Loic Dachary wrote:


Hi Michael,

Could you please show the exact commands you've used to modify the k & m values 
?


ceph osd crush rule create-erasure ecruleset
ceph osd erasure-code-profile set myprofile ruleset-failure-domain=osd k=3 m=3
ceph osd pool create .rgw.buckets 400 400 erasure myprofile ecruleset

I have 4 machines and 3-5 OSDs per machine (15 in total).

-mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ec pools and radosgw

2014-03-27 Thread Yehuda Sadeh
On Wed, Mar 26, 2014 at 4:48 PM, Michael Nelson  wrote:
> I am playing around with erasure coded pools on 0.78-348 (firefly) and am
> attempting to enable EC on the .rgw.buckets pool for radosgw
> (fresh install).
>
> If I use a plain EC profile (no settings changed), uploads of various sizes
> work fine and EC seems to be working based on how much space is
> being used in the cluster. If I start playing with k or m values, multipart
> uploads start failing (on the first chunk). I haven't seen issues with rados
> put or rados bench on EC pools. I saw the same behavior on the official
> v0.78 release.
>
> I turned up verbose logging on OSDs and RGW and I don't see obvious errors.
> Here is a snippet from the RGW log from the context/thread that failed:
>
> 7f8224dfa700  1 -- 198.18.32.12:0/1015918 --> 198.18.32.13:6815/28535 --
> osd_op(client.4362.0:206 .dir.default.4327.1 [call rgw.bucket_complete_op]
> 10.ffda47da ack+ondisk+write e85) v4 -- ?+0 0x7f81a8094d30 con
> 0x7f82400023c0
> 7f8224dfa700 20 -- 198.18.32.12:0/1015918 submit_message
> osd_op(client.4362.0:206 .dir.default.4327.1 [call rgw.bucket_complete_op]
> 10.ffda47da ack+ondisk+write e85) v4 remote, 198.18.32.13:6815/28535, have
> pipe.
> 7f8224dfa700  0 WARNING: set_req_state_err err_no=95 resorting to 500
> 7f8224dfa700  2 req 7:0.072198:s3:PUT /xyzxyzxyz:put_obj:http status=500
> 7f8224dfa700  1 == req done req=0x7f823000f880 http_status=500 ==
>

There's an issue with EC and multipart upload, and a corresponding
ceph tracker issue was created (#7676). A fix for that was merged a
couple of days ago but did not make the cut to 0.78. The fix itself
requires setting up another replicated pool on the zone for holding
the relevant information that cannot be stored on an EC pool.

Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ec pools and radosgw

2014-03-27 Thread Yehuda Sadeh
On Thu, Mar 27, 2014 at 1:17 PM, Michael Nelson  wrote:
>
>
> On Thu, 27 Mar 2014, Yehuda Sadeh wrote:
>
>> On Wed, Mar 26, 2014 at 4:48 PM, Michael Nelson 
>> wrote:
>>>
>>> I am playing around with erasure coded pools on 0.78-348 (firefly) and am
>>> attempting to enable EC on the .rgw.buckets pool for radosgw
>>> (fresh install).
>>>
>>> If I use a plain EC profile (no settings changed), uploads of various
>>> sizes
>>> work fine and EC seems to be working based on how much space is
>>> being used in the cluster. If I start playing with k or m values,
>>> multipart
>>> uploads start failing (on the first chunk). I haven't seen issues with
>>> rados
>>> put or rados bench on EC pools. I saw the same behavior on the official
>>> v0.78 release.
>>>
>>> I turned up verbose logging on OSDs and RGW and I don't see obvious
>>> errors.
>>> Here is a snippet from the RGW log from the context/thread that failed:
>>>
>>> 7f8224dfa700  1 -- 198.18.32.12:0/1015918 --> 198.18.32.13:6815/28535 --
>>> osd_op(client.4362.0:206 .dir.default.4327.1 [call
>>> rgw.bucket_complete_op]
>>> 10.ffda47da ack+ondisk+write e85) v4 -- ?+0 0x7f81a8094d30 con
>>> 0x7f82400023c0
>>> 7f8224dfa700 20 -- 198.18.32.12:0/1015918 submit_message
>>> osd_op(client.4362.0:206 .dir.default.4327.1 [call
>>> rgw.bucket_complete_op]
>>> 10.ffda47da ack+ondisk+write e85) v4 remote, 198.18.32.13:6815/28535,
>>> have
>>> pipe.
>>> 7f8224dfa700  0 WARNING: set_req_state_err err_no=95 resorting to 500
>>> 7f8224dfa700  2 req 7:0.072198:s3:PUT /xyzxyzxyz:put_obj:http status=500
>>> 7f8224dfa700  1 == req done req=0x7f823000f880 http_status=500 ==
>>>
>>
>> There's an issue with EC and multipart upload, and a corresponding
>> ceph tracker issue was created (#7676). A fix for that was merged a
>> couple of days ago but did not make the cut to 0.78. The fix itself
>> requires setting up another replicated pool on the zone for holding
>> the relevant information that cannot be stored on an EC pool.
>
>
> OK, makes sense. If I am doing something like this:
>
> ceph osd crush rule create-erasure ecruleset --debug-ms=20
>
> ceph osd erasure-code-profile set myprofile ruleset-failure-domain=osd k=3
> m=3
> ceph osd pool create .rgw.buckets 400 400 erasure myprofile ecruleset
>
> Will the replicated pool be created automatically like the other pools are?
>

No. At this point you'll have to create it manually, e.g.,

ceph osd pool create .rgw.buckets.extra 400 400

And then set it in your zone configuration.
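
A rough sketch of that zone edit (the placement-target field name below is an
assumption; check the JSON that your version actually emits):

    radosgw-admin zone get > zone.json
    # add the new pool to the default placement target, e.g.
    #   "data_extra_pool": ".rgw.buckets.extra"
    radosgw-admin zone set --infile zone.json
    # then restart radosgw so it picks up the updated zone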

Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] degraded objects after adding OSD?

2014-03-27 Thread Chad Seys
Hi all,
  Beginning with a cluster with only "active+clean" PGs, adding an OSD causes 
objects to become "degraded". 
  Does this mean that ceph deletes replicas before copying them to the new 
OSD?
  Or does degraded also mean that there are no replicas yet on the target OSD, 
even though there are already the desired number of replicas in the cluster?

Thanks!
Chad.
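
A minimal way to watch what actually happens while the new OSD backfills
(standard CLI, nothing version-specific assumed):

    ceph -s                        # overall degraded object count and percentage
    ceph -w                        # follow recovery/backfill progress live
    ceph pg dump | grep degraded   # which PGs are currently reporting a degraded state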
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph meetup Amsterdam: April 24th 2014

2014-03-27 Thread Wido den Hollander

Hi all,

I think it's time to organize an informal Ceph meetup in Amsterdam :-)

I have some office space available in Amsterdam at a datacenter (with 
Ceph clusters running there!) and I think it would be fun to organize a 
Ceph meetup.


No formal schedule or something, just some Ceph users (or potential 
users) doing a meetup for a few hours.


Talk about Ceph, your ideas or just ask a lot of questions!

Want to share your Ceph experience? Make a small presentation and we can 
come up with a schedule on the spot.


I'm buying pizza and drinks, so I need to know how many people will join.

Date: 24-04-2014
Location: Gyroscoopweg 134, 1042AZ Amsterdam, The Netherlands
Time: 18:00 ~ 21:00

Should you need a pickup from the train station (3 km), let me know. I'll 
make sure somebody is there to pick you up and bring you back again.


If you are interested, let me know! I'll make sure there is pizza around 
18:30, so I need to know how many people will come.


--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSDs vanishing from Ceph cluster?

2014-03-27 Thread Dan Koren
Just ran into this problem: a week ago I set up a Ceph cluster on 4
systems, with one admin node and 3 mon+osd nodes, then ran a few
casual IO tests. I returned to work after a few days out of town at
a conference, and now my Ceph cluster appears to have no OSDs!

root@rts24:/var/log/ceph# ceph status
cluster 284dbfe0-e612-4732-9d26-2c5909f0fbd1
 health HEALTH_ERR 119 pgs degraded; 192 pgs stale; 192 pgs stuck
stale; 119 pgs stuck unclean; recovery 2/4 objects degraded (50.000%); no
osds
 monmap e1: 3 mons at {rts21=
172.29.0.21:6789/0,rts22=172.29.0.22:6789/0,rts23=172.29.0.23:6789/0},
election epoch 32, quorum 0,1,2 rts21,rts22,rts23
 osdmap e33: 0 osds: 0 up, 0 in
  pgmap v2774: 192 pgs, 3 pools, 135 bytes data, 2 objects
0 kB used, 0 kB / 0 kB avail
2/4 objects degraded (50.000%)
  73 stale+active+clean
 119 stale+active+degraded


I would appreciate if anyone could explain how can something like
this happen, or where to look for any evidence that might help me
understand what happened. The log files in /var/log/ceph/ show no
activity except for the monitors' Paxos chatter.
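
A few places one might look (a sketch; log paths assume a default install):

    ceph osd tree                              # does the CRUSH map still list the OSDs?
    ceph osd dump                              # osdmap detail: max_osd and any remaining osd entries
    ceph auth list | grep ^osd                 # are the OSD cephx keys still registered?
    grep -i osd /var/log/ceph/ceph-mon.*.log   # did the monitors record the OSDs being removed?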

Thx,


Dan Koren, Director of Software
DATERA | 650.210.7910 | @dateranews
d...@datera.io
--

This email and any attachments thereto may contain private,
confidential, and privileged material for the sole use of the
intended recipient. Any review, copying, or distribution of
this email (or any attachments thereto) by others is strictly
prohibited. If you are not the intended recipient, please
contact the sender immediately and permanently delete the
original and any copies of this email and any attachments
thereto.
--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash when client goes to sleep

2014-03-27 Thread hjcho616
Looks like client is waking up ok now.  Thanks.

Will those fixes be included in next release? Firefly?

Regards,
Hong



 From: hjcho616 
To: Gregory Farnum  
Cc: "ceph-users@lists.ceph.com"  
Sent: Tuesday, March 25, 2014 11:56 AM
Subject: Re: [ceph-users] MDS crash when client goes to sleep
 


I am merely putting the client to sleep and waking it up.  When it is up, I 
run ls on the mounted directory.  As far as I am concerned, at a very high 
level I am doing the same thing.  All machines are running the Debian-provided 
3.13 kernel.

When that infinite loop of decrypt error happened, I waited about 10 minutes 
before I restarted MDS.

Last time the MDS crashed, restarting the MDS didn't get it out of that degraded 
mode for hours until I restarted the OSDs.  That's why I started restarting OSDs 
shortly after the MDS restarts.  Next time I'll try to wait a bit more.

Regards,
Hong



 From: Gregory Farnum 
To: hjcho616  
Cc: Mohd Bazli Ab Karim ; "Yan, Zheng" 
; Sage Weil ; "ceph-users@lists.ceph.com" 
 
Sent: Tuesday, March 25, 2014 11:05 AM
Subject: Re: [ceph-users] MDS crash when client goes to sleep
 

On Mon, Mar 24, 2014 at 6:26 PM, hjcho616  wrote:
> I tried the patch twice.  First time, it worked.  There was no issue.
> Connected back to MDS and was happily running.  All three MDS demons were
> running ok.
>
> Second time though... all three demons were alive.  Health was reported OK.
> However client does not connect to MDS.  MDS demon gets following messages
> over and over and over again.  192.168.1.30 is one of the OSD.
> 2014-03-24 20:20:51.722367 7f400c735700  0 cephx: verify_reply couldn't
> decrypt with error: error decoding block for decryption
> 2014-03-24 20:20:51.722392 7f400c735700  0 -- 192.168.1.20:6803/21678 >>
> 192.168.1.30:6806/3796 pipe(0x2be3b80 sd=20 :56656 s=1 pgs=0 cs=0 l=1
> c=0x2bd6840).failed verifying authorize reply

This sounds different than the scenario you initially described, with
a client going to sleep. Exactly what are you doing?

>
> When I restart the MDS (not OSDs) when I do ceph health detail I did see a
> mds degraded message with a replay.  I restarted OSDs again and OSDs and it
> was ok.  Is there something I can do to prevent this?

That sounds normal -- the MDS has to replay its journal when it
restarts. It shouldn't take too long, but restarting OSDs definitely
won't help since the MDS is trying to read data off of them.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd + qemu osd performance

2014-03-27 Thread Cédric Lemarchand


> On 26 Mar 2014 at 00:30, Andrei Mikhailovsky  wrote:
> 
> The osd fragmentation level of zfs is at 8% at the moment, not sure if this 
> should impact the performance by this much. I will defrag it over night and 
> check tomorrow to see if it makes the difference.

Sorry if this is a little bit out of the scope of the thread, but I am very 
interested to know how you got this value (8% fragmentation) and how you can 
defragment a CoW fs?

Cheers,

Cedric 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ec pools and radosgw

2014-03-27 Thread Michael Nelson



On Thu, 27 Mar 2014, Yehuda Sadeh wrote:


On Wed, Mar 26, 2014 at 4:48 PM, Michael Nelson  wrote:

I am playing around with erasure coded pools on 0.78-348 (firefly) and am
attempting to enable EC on the .rgw.buckets pool for radosgw
(fresh install).

If I use a plain EC profile (no settings changed), uploads of various sizes
work fine and EC seems to be working based on how much space is
being used in the cluster. If I start playing with k or m values, multipart
uploads start failing (on the first chunk). I haven't seen issues with rados
put or rados bench on EC pools. I saw the same behavior on the official
v0.78 release.

I turned up verbose logging on OSDs and RGW and I don't see obvious errors.
Here is a snippet from the RGW log from the context/thread that failed:

7f8224dfa700  1 -- 198.18.32.12:0/1015918 --> 198.18.32.13:6815/28535 --
osd_op(client.4362.0:206 .dir.default.4327.1 [call rgw.bucket_complete_op]
10.ffda47da ack+ondisk+write e85) v4 -- ?+0 0x7f81a8094d30 con
0x7f82400023c0
7f8224dfa700 20 -- 198.18.32.12:0/1015918 submit_message
osd_op(client.4362.0:206 .dir.default.4327.1 [call rgw.bucket_complete_op]
10.ffda47da ack+ondisk+write e85) v4 remote, 198.18.32.13:6815/28535, have
pipe.
7f8224dfa700  0 WARNING: set_req_state_err err_no=95 resorting to 500
7f8224dfa700  2 req 7:0.072198:s3:PUT /xyzxyzxyz:put_obj:http status=500
7f8224dfa700  1 == req done req=0x7f823000f880 http_status=500 ==



There's an issue with EC and multipart upload, and a corresponding
ceph tracker issue was created (#7676). A fix for that was merged a
couple of days ago but did not make the cut to 0.78. The fix itself
requires setting up another replicated pool on the zone for holding
the relevant information that cannot be stored on an EC pool.


OK, makes sense. If I am doing something like this:

ceph osd crush rule create-erasure ecruleset --debug-ms=20
ceph osd erasure-code-profile set myprofile ruleset-failure-domain=osd k=3 m=3
ceph osd pool create .rgw.buckets 400 400 erasure myprofile ecruleset

Will the replicated pool be created automatically like the other pools 
are?


Thanks,
-mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] help, add mon failed lead to cluster failure

2014-03-27 Thread Joao Eduardo Luis

On 26/03/14 10:40, duan.xuf...@zte.com.cn wrote:


Hi,
 I just added a new mon to a healthy cluster by following the website
manual "http://ceph.com/docs/master/rados/operations/add-or-rm-mons/"
("Adding Monitors") step by step,

but when I executed step 6:
ceph mon add <mon-id> <ip>[:<port>]

the command didn't return; then I executed "ceph -s" on a healthy mon node,
and this command didn't return either.

So I tried to restart the mon to recover the whole cluster, but it never
seems to recover.

Can anyone please tell me how to deal with it?


First you should make sure you are able to reach the first monitor from 
the new monitor's server.  A simple 'telnet IP PORT' should be enough, 
with IP being the IP of the server where the monitor lives, PORT being 
the monitor's port (6789 by default).


If that fails, you should probably check firewall rules 
dropping/rejecting connections between the servers, make sure the 
servers are on the same subnets, etc.


You may want to check both monitor's monmaps and mon status via the 
admin socket: 'ceph daemon mon.FOO mon_status', with FOO being the 
monitor's id.  This command must be run on the monitor's server.


If all else fails, please set 'debug mon = 10' and 'debug ms = 1' on 
both monitors, restart them and send us the logs.


My guess is that there's something preventing the new monitor from 
reaching the first monitor, and the first monitor is unable to form a 
quorum just by itself (as it needs a second monitor to join in order to 
establish a majority).
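
In other words, something along these lines (a sketch; replace MON1_IP, NEWID
and OLDID with the first monitor's address and your monitor ids):

    telnet MON1_IP 6789                  # from the new mon's host: is the first mon reachable?
    ceph daemon mon.NEWID mon_status     # run on the new monitor's host
    ceph daemon mon.OLDID mon_status     # run on the existing monitor's host

    # if that doesn't explain it, add to the [mon] section of ceph.conf on
    # both monitors and restart them:
    #   debug mon = 10
    #   debug ms = 1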


You may also want to check the monitor troubleshooting section on the 
docs page:


http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/

  -Joao


--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw-admin usage show Does not seem to work properly with start and end dates

2014-03-27 Thread Michael Baysek

I've noticed what seems to be a strange artifact in the radosgw-admin tool when 
I query for usage data in one hour intervals.

For this exercise, I have had a script uploading and downloading files to the 
object store constantly, waiting 5 minutes in between runs.  The user in this 
case is 'mike' and the bucket is 'abc'.  All transfers are completing 
successfully.

Software information / versions:
radosgw 0.72.2-1precise
also running ceph 0.72.2-1precise

Suspect Behavior:

For example, take the following loop, which iterates through each hour of the
day and requests usage:

for x in `seq -w 0 23`
do
  /bin/echo -e "\n-\nHOUR $x\n"
  radosgw-admin usage show --start-date="2014-03-19 $x:00:00" --end-date="2014-03-19 $x:59:59"
done

And the output:  http://pastebin.com/njEuVKx9

I'm looping for each hour of the day and hoping to get usage data totals for
each hour.  But what's happening is that I only get data back on every 6th hour.
The queries for the other 5 hours in between are returning empty.

Desired Behavior:

I expect to receive the data for the time period I specify when running 
this command.

Any thoughts as to why this might be happening?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD as backend for iSCSI SAN Targets

2014-03-27 Thread Karol Kozubal
Hi Jianing,

Sorry for the late reply, I missed your contribution to the thread.

Thank you for your response. I am still waiting for some of my hardware
and will begin testing the new setup with firefly once it is available as
a long term support release. I am looking forward to testing the new setup.

I am curious about more details on your proxy node configuration for the
tgt daemons. Does your setup tolerate node failure on the iSCSI end of
things, and if so, how is it configured?

Thanks,

Karol





On 2014-03-19, 6:58 AM, "Jianing Yang"  wrote:

>Hi, Karol
>
>Here is something that I can share. We are running Ceph as an Exchange
>Backend via iSCSI. We currently host about 2000 mailboxes which is about
>7 TB data overall. Our configuration is
>
>- Proxy Node (with tgt daemon) x 2
>- Ceph Monitor x 3 (virtual machines)
>- Ceph OSD x 50 (SATA 7200rpm 2T), Replica = 2, Journal on OSD (I know it
>is
>bad, but ...)
>
>We tested RBD using fio and got a randwrite around 1500 iops. On the
>living system, I saw the highest op/s around 3.1k.
>
>I've benchmarked "tgt with librdb" vs "tgt with kernel rbd" using my
>virtual machines. It seems that "tgt with librdb" doesn't perform
>well. It has only 1/5 iops of kernel rbd.
>
>We are new to Ceph and still finding ways to improve the performance. I
>am really looking forward to your benchmark.
>
>On Sun 16 Mar 2014 12:40:53 AM CST, Karol Kozubal wrote:
>
> > Hi Wido,
>
> > I will have some new hardware for running tests in the next two weeks
>or
> > so and will report my findings once I get a chance to run some tests. I
> > will disable writeback on the target side as I will be attempting to
> > configure an ssd caching pool of 24 ssd's with writeback for the main
>pool
> > with 360 disks with a 5 osd spinners to 1 ssd journal ratio. I will be
> > running everything through 10Gig SFP+ Ethernet interfaces with a
>dedicated
> > cluster network interface, dedicated public ceph interface and a
>separate
> > iscsi network also with 10 gig interfaces for the target machines.
>
> > I am ideally looking for a 20,000 to 60,000 IOPS from this system if I
>can
> > get the caching pool configuration right. The application has a 30ms
>max
> > latency requirement for the storage.
>
> > In my current tests I have only spinners with SAS 10K disks, 4.2ms
>write
> > latency on the disks with separate journaling on SAS 15K disks with a
> > 3.3ms write latency. With 20 OSDs and 4 Journals I am only concerned
>with
> > the overall operation apply latency that I have been seeing (1-6ms
>idle is
> > normal, but up to 60-170ms for a moderate workload using rbd
>bench-write)
> > however I am on a network where I am bound to 1500 mtu and I will get
>to
> > test jumbo frames with the next setup in addition to the SSDs. I
>suspect
> > the overall performance will be good in the new test setup and I am
> > curious to see what my tests will yield.
>
> > Thanks for the response!
>
> > Karol
>
>
>
> > On 2014-03-15, 12:18 PM, "Wido den Hollander"  wrote:
>
> > >On 03/15/2014 04:11 PM, Karol Kozubal wrote:
> > >> Hi Everyone,
> > >>
> > >> I am just wondering if any of you are running a ceph cluster with an
> > >> iSCSI target front end? I know this isn't available out of the box,
> > >> unfortunately in one particular use case we are looking at providing
> > >> iSCSI access and it's a necessity. I am liking the idea of having
>rbd
> > >> devices serving block level storage to the iSCSI Target servers
>while
> > >> providing a unified backed for native rbd access by openstack and
> > >> various application servers. On multiple levels this would reduce
>the
> > >> complexity of our SAN environment and move us away from expensive
> > >> proprietary solutions that don't scale out.
> > >>
> > >> If any of you have deployed any HA iSCSI Targets backed by rbd I
>would
> > >> really appreciate your feedback and any thoughts.
> > >>
> > >
> > >I haven't used it in production, but a couple of things which come to
> > >mind:
> > >
> > >- Use TGT so you can run it all in userspace backed by librbd
> > >- Do not use writeback caching on the targets
> > >
> > >You could use multipathing if you don't use writeback caching. Use
> > >writeback would also cause data loss/corruption in case of multiple
> > >targets.
> > >
> > >It will probably just work with TGT, but I don't know anything about
>the
> > >performance.
> > >
> > >> Karol
> > >>
> > >>
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >>
> > >
> > >
> > >--
> > >Wido den Hollander
> > >42on B.V.
> > >
> > >Phone: +31 (0)20 700 9902
> > >Skype: contact42on
> > >___
> > >ceph-users mailing list
> > >ceph-users@lists.ceph.com
> > >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> > ___
> > ceph-users mailing list
> 
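
Regarding the "tgt with librbd" vs "tgt with kernel rbd" comparison quoted
above, a minimal sketch of the two export styles; the target name, pool and
image are placeholders, the rbd bs-type needs a tgt build that includes the
rbd backing store, and the exact backing-store syntax may vary with your tgt
version:

    # tgt with librbd: export an RBD image through tgt's rbd backing store
    tgtadm --lld iscsi --mode target --op new --tid 1 --targetname iqn.2014-03.com.example:rbd
    tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --bstype rbd --backing-store mypool/myimage
    tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL

    # tgt with kernel rbd: map the image with the kernel client, then export the block device
    rbd map mypool/myimage            # creates /dev/rbd0 (or /dev/rbd/mypool/myimage)
    tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --backing-store /dev/rbd0
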

Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU

2014-03-27 Thread Craig Lewis
In the interest of removing variables, I removed all snapshots on all 
pools, then restarted all ceph daemons at the same time.  This brought 
up osd.8 as well.


The cluster started recovering.  Now osd.4 and osd.13 are doing this.


Any suggestions for how I can see what the hung OSDs are doing? The logs 
don't look interesting.  Is there a higher log level I can use?
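
One way to raise logging on a live daemon without restarting it (the levels
here are just common debugging values, not a recommendation):

    ceph tell osd.4 injectargs '--debug-osd 20 --debug-ms 1'
    # lower it again once you've captured enough:
    ceph tell osd.4 injectargs '--debug-osd 0 --debug-ms 0'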



I'm trying to use strace on osd.4:
strace -tt -f -ff -o ./ceph-osd.4.strace -x /usr/bin/ceph-osd 
--cluster=ceph -i 4 -f


So far, strace is running, and the process isn't hung.  After I ran 
this, the cluster finally finished backfilling the last of the PGs (all 
on osd.4).


Since the cluster is healthy again, I killed the strace, and started the 
daemon normally (start ceph-osd id=4).  Things seem fine now.  I'm going 
to let it scrub and deepscrub overnight.  I'll restart radosgw-agent 
tomorrow.












Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com

Central Desktop. Work together in ways you never thought possible.
Connect with us: Website | Twitter | Facebook | LinkedIn | Blog



On 3/27/14 10:44 , Craig Lewis wrote:




The osd.8 log shows it doing some deep scrubbing here. Perhaps that is
what caused your earlier issues with CPU usage?
When I first noticed the CPU usage, I checked iotop and iostat. Both 
said there was no disk activity, on any OSD.





At 14:17:25, I ran radosgw-admin --name=client.radosgw.ceph1c regions
list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.
regions list hung, and I killed it. At 14:18:15, I stopped ceph-osd id=8.
At 14:18:45, I ran radosgw-admin --name=client.radosgw.ceph1c regions
list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.  It
returned successfully.
At 14:19:10, I stopped ceph-osd id=4.


Since you've got the noout flag set, when osd.8 goes down any objects
for which osd.8 is the primary will not be readable. Since ceph reads
from primaries, and the noout flag prevents another osd from being
selected, which would happen if osd.8 were marked out, these objects
(which apparently happen to include some needed for regions list or
regionmap get) are inaccessible.

Josh



Taking osd.8 down (regardless of the noout flag) was the only way to get 
things to respond.  I have not set nodown, just noout.




When I got in this morning, I had 4 more flapping OSDs: osd.4, osd.12, 
osd.13, and osd.6.  All 4 daemons were using 100% CPU, and no disk I/O.


osd.1 and osd.14 are the only ones currently using disk I/O.


There are 3 PGs being deepscrubbed:
root@ceph1c:/var/log/radosgw-agent# ceph pg dump | grep deep
dumped all in format plain
pg_stat  objects  mip  degr  unf  bytes       log   disklog  state                        state_stamp                 v             reported      up      acting  last_scrub    scrub_stamp                 last_deep_scrub  deep_scrub_stamp
11.774   8682     0    0     0    7614655060  3001  3001     active+clean+scrubbing+deep  2014-03-27 10:20:30.598032  8381'5180514  8521:6520833  [13,4]  [13,4]  7894'5176984  2014-03-20 04:41:48.762996  7894'5176984     2014-03-20 04:41:48.762996
11.698   8587     0    0     0    7723737171  3001  3001     active+clean+scrubbing+deep  2014-03-27 10:16:31.292487  8383'483312   8521:618864   [14,1]  [14,1]  7894'479783   2014-03-20 03:53:18.024015  7894'479783      2014-03-20 03:53:18.024015
11.d8    8743     0    0     0    7570365909  3409  3409     active+clean+scrubbing+deep  2014-03-27 10:15:39.558121  8396'1753407  8521:2417672  [12,6]  [12,6]  7894'1459230  2014-03-20 02:40:22.123236  7894'1459230     2014-03-20 02:40:22.123236



These PGs are on the 6 OSDs mentioned.  osd.1 and osd.14 are not using 
100% CPU and are using disk IO.  osd.12, osd.6, osd.4, and osd.13 are 
using 100% CPU, and 0 kB/s of disk IO.  Here's iostat on ceph0c, which 
contains osd.1 (/dev/sdd), osd.4 (/dev/sde), and osd.6 (/dev/sdg):

root@ceph0c:/var/log/ceph# iostat -p sdd,sde,sdh 1
Linux 3.5.0-46-generic (ceph0c)  03/27/2014  _x86_64_  (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          32.64    0.00    5.52    4.42    0.00   57.42

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd             113.00       900.00         0.00        900          0
sdd1            113.00       900.00         0.00        900          0
sde               0.00         0.00         0.00          0          0
sde1              0.00         0.00         0.00          0          0
sdh               0.00         0.00         0.00          0          0
sdh1              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29.90    0.00    4.41    2.82    0.00   62.87

Device:            tps    kB_read/s    kB_wrtn/s    kB_read