[ceph-users] Did I permanently break it?

2013-07-29 Thread Jeff Moskow
I've had a 4 node ceph cluster working well for months. This weekend I added a 5th node to the cluster and after many hours of rebalancing I have the following warning: HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean. But my big problem is that the cluster is
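
A hedged sketch of the commands that pin down which pg is stuck and which OSDs serve it (standard ceph CLI; nothing here is quoted from the thread):

    ceph health detail            # names the incomplete / stuck pg ids
    ceph pg dump_stuck inactive   # lists stuck pgs with their state and acting OSDs
    ceph pg dump_stuck unclean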

[ceph-users] "rbd ls -l" hangs

2013-07-30 Thread Jeff Moskow
This is the same issue as yesterday, but I'm still searching for a solution. We have a lot of data on the cluster that we need and can't get to reasonably (it took over 12 hours to export a 2 GB image). The only thing that status reports as wrong is: health HEALTH_WARN 1 pgs incomplete;
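
One way to see why a single pg is incomplete is to query it directly; the pg id 2.1f below is a placeholder, not the actual pg from this thread:

    ceph health detail     # e.g. "pg 2.1f is incomplete, acting [11,3]"
    ceph pg 2.1f query     # peering history and which OSDs the pg is still waiting on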

Re: [ceph-users] "rbd ls -l" hangs

2013-07-30 Thread Jeff Moskow
Thanks! I tried restarting osd.11 (the primary osd for the incomplete pg) and that helped a LOT. We went from 0/1 op/s to 10-800+ op/s! We still have "HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean", but at least we can use our cluster :-) ceph pg dump_stuck inactive
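
The restart described above, sketched for a sysvinit-style install (the exact service command depends on the distribution and how ceph was deployed):

    ceph pg dump_stuck inactive        # the primary is the first OSD in the acting set
    /etc/init.d/ceph restart osd.11    # run on the node that hosts osd.11
    ceph -w                            # watch op/s and peering while the pg recovers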

Re: [ceph-users] "rbd ls -l" hangs

2013-07-30 Thread Jeff Moskow
OK - so while things are definitely better, we still are not where we were and "rbd ls -l" still hangs. Any suggestions?

Re: [ceph-users] "rbd ls -l" hangs

2013-08-01 Thread Jeff Moskow
Greg, Thanks for the hints. I looked through the logs and found OSDs with RETRYs. I marked those "out" (marked in orange) and let ceph rebalance. Then I ran the bench command. I now have many more errors than before :-(. health HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 151
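
A sketch of the sequence described, with illustrative OSD ids (ceph tell osd.N bench writes test data through that OSD and reports throughput in the cluster log):

    ceph osd out 7          # push data off a suspect OSD
    ceph -w                 # wait for rebalancing to finish
    ceph tell osd.7 bench   # result shows up in ceph -w / ceph.log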

[ceph-users] re-initializing a ceph cluster

2013-08-05 Thread Jeff Moskow
After more than a week of trying to restore our cluster I've given up. I'd like to reset the data, metadata and rbd pools to their initial clean states (wiping out all data). Is there an easy way to do this? I tried deleting and adding pools, but still have: health HEALTH_WARN 32 pgs
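
A hedged sketch of wiping a pool by deleting and recreating it; the pg count is a placeholder, newer releases require the pool name twice plus a --yes-i-really-really-mean-it confirmation, and the default data/metadata pools are tied to the MDS, so only rbd is shown:

    ceph osd pool delete rbd rbd --yes-i-really-really-mean-it
    ceph osd pool create rbd 128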

[ceph-users] pgs stuck unclean -- how to fix? (fwd)

2013-08-09 Thread Jeff Moskow
Hi, I have a 5 node ceph cluster that is running well (no problems using any of the rbd images and that's really all we use). I have replication set to 3 on all three pools (data, metadata and rbd). "ceph -s" reports: health HEALTH_WARN 3 pgs degraded;
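
For reference, the replication level lives in each pool's "size" attribute (the pool name below is one of the defaults mentioned in the mail):

    ceph osd pool get rbd size     # current replica count
    ceph osd pool set rbd size 3   # raise it; ceph backfills the extra copies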

Re: [ceph-users] pgs stuck unclean -- how to fix? (fwd)

2013-08-09 Thread Jeff Moskow
Thanks for the suggestion. I had tried stopping each OSD for 30 seconds, then restarting it, waiting 2 minutes and then doing the next one (all OSDs eventually restarted). I tried this twice.
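
A rolling-restart loop along those lines, assuming a sysvinit deployment and that the OSD ids on the local node are known (both are assumptions, not details from the thread):

    for id in 0 1 2; do                   # OSD ids on this node
        /etc/init.d/ceph stop osd.$id
        sleep 30
        /etc/init.d/ceph start osd.$id
        sleep 120                         # let peering settle before the next OSD
    done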

[ceph-users] ceph rbd io tracking (rbdtop?)

2013-08-12 Thread Jeff Moskow
Hi, The activity on our ceph cluster has gone up a lot. We are using exclusively RBD storage right now. Is there a tool/technique that could be used to find out which rbd images are receiving the most activity (something like "rbdtop")? Thanks, Jeff
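
Short of a dedicated tool, one workaround (the osd id and socket path below are assumptions) is to sample in-flight ops on a busy OSD through its admin socket; the object names in the output carry each image's block-name prefix, which rbd info can map back to an image:

    ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok dump_ops_in_flight
    rbd info myimage    # block_name_prefix ties the object names to this image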

Re: [ceph-users] pgs stuck unclean -- how to fix? (fwd)

2013-08-12 Thread Jeff Moskow
> On Fri, Aug 9, 2013 at 4:28 AM, Jeff Moskow wrote: > > Thanks for the suggestion. I had tried stopping each OSD for 30 seconds, > > then restarting it, waiting 2 minutes and then doing the next one (all OSDs eventually restarted)

Re: [ceph-users] pgs stuck unclean -- how to fix? (fwd)

2013-08-12 Thread Jeff Moskow
02:41:11PM -0700, Samuel Just wrote: > Are you using any kernel clients? Will osds 3,14,16 be coming back? > -Sam > > On Mon, Aug 12, 2013 at 2:26 PM, Jeff Moskow wrote: > > Sam, > > I've attached both files. > > Thanks!

Re: [ceph-users] pgs stuck unclean -- how to fix? (fwd)

2013-08-13 Thread Jeff Moskow
Sam, Thanks that did it :-) health HEALTH_OK monmap e17: 5 mons at {a=172.16.170.1:6789/0,b=172.16.170.2:6789/0,c=172.16.170.3:6789/0,d=172.16.170.4:6789/0,e=172.16.170.5:6789/0}, election epoch 9794, quorum 0,1,2,3,4 a,b,c,d,e osdmap e23445: 14 osds: 13 up, 13 in pgmap v1355

Re: [ceph-users] "rbd ls -l" hangs

2013-08-15 Thread Jeff Moskow
...rebalancing, everything is working fine :-) (ceph auth del osd.x ; ceph osd crush rm osd.x ; ceph osd rm osd.x). Jeff On Wed, Aug 14, 2013 at 01:54:16PM -0700, Gregory Farnum wrote: > On Thu, Aug 1, 2013 at 9:57 AM, Jeff Moskow wrote: > > Greg, > > Thanks for the hints.
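
Expanded into the usual full removal sequence for one of the ids mentioned (the leading out/stop steps are the standard procedure, not quoted from the mail):

    ceph osd out 3
    /etc/init.d/ceph stop osd.3   # on the node that hosts it
    ceph osd crush rm osd.3
    ceph auth del osd.3
    ceph osd rm 3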

[ceph-users] performance questions

2013-08-17 Thread Jeff Moskow
Hi, When we rebuilt our ceph cluster, we opted to make our rbd storage replication level 3 rather than the previously configured replication level 2. Things are MUCH slower (5 nodes, 13 OSDs) than before even though most of our I/O is read. Is this to be expected? What are th
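
One way to put numbers on the slowdown is rados bench against the pool (pool name and duration are placeholders; the write pass needs --no-cleanup so the read pass has objects to read):

    rados bench -p rbd 60 write --no-cleanup
    rados bench -p rbd 60 seq
    rados -p rbd cleanup    # remove the benchmark objects afterwards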

Re: [ceph-users] performance questions

2013-08-20 Thread Jeff Moskow
00, Sage Weil wrote: > On Sat, 17 Aug 2013, Jeff Moskow wrote: > > Hi, > > When we rebuilt our ceph cluster, we opted to make our rbd storage > > replication level 3 rather than the previously configured replication > > level 2. > > T

Re: [ceph-users] performance questions

2013-08-20 Thread Jeff Moskow
Hi, More information. If I look in /var/log/ceph/ceph.log, I see 7893 slow requests in the last 3 hours, of which 7890 are from osd.4. Should I assume a bad drive? SMART says the drive is healthy. Bad osd? Thanks, Jeff
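
A sketch of pulling the per-OSD count out of the cluster log, plus a deeper drive check (default log path; /dev/sdb is only an example device):

    grep 'slow request' /var/log/ceph/ceph.log | grep -c 'osd\.4 '
    smartctl -a /dev/sdb    # check reallocated / pending sector counts, not just the overall PASSED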

Re: [ceph-users] performance questions

2013-08-20 Thread Jeff Moskow
Martin, Thanks for the confirmation about 3-replica performance. dmesg | fgrep /dev/sdb # returns no matches Jeff