Hi All,

I set up a new cluster today with 20 OSDs spanning 4 machines (journals not
stored on separate disks), and a single MON running on a separate server
(I understand that a single MON is not ideal for production environments).

The cluster had the default pools along with the ones created by radosgw.
There was next to no user data on the cluster, with the exception of a few
test files uploaded via the swift client.

I ran the following on one node to increase replica size from 2 to 3:

for x in $(rados lspools); do ceph osd pool set $x size 3; done
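
As a sanity check, I believe each pool's replica count can be confirmed
afterwards with something like the following, where <pool> is a placeholder
for the pool name:

ceph osd pool get <pool> size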

After doing this, I noticed that 5 OSDs were down. Repeatedly restarting
them with the following brings them back online momentarily, but they soon
go down / out again:

start ceph-osd id=X
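
For what it's worth, which OSDs are currently marked down can be seen with
something like:

ceph osd tree | grep down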

Looking at the logs on the affected nodes, I'm seeing errors like this in
the respective OSD logs:

osd/ReplicatedPG.cc: 5405: FAILED assert(ssc)

 ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
 1: (ReplicatedPG::prep_push_to_replica(ObjectContext*, hobject_t const&, int, int, PushOp*)+0x8ea) [0x5fd50a]
 2: (ReplicatedPG::prep_object_replica_pushes(hobject_t const&, eversion_t, int, std::map<int, std::vector<PushOp, std::allocator<PushOp> >, std::less<int>, std::allocator<std::pair<int const, std::vector<PushOp, std::allocator<PushOp> > > > >*)+0x722) [0x5fe552]
 3: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0x657) [0x5ff487]
 4: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*, ThreadPool::TPHandle&)+0x736) [0x61d9c6]
 5: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x1b8) [0x6863e8]
 6: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0x11) [0x6c5541]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x8b8df6]
 8: (ThreadPool::WorkThread::entry()+0x10) [0x8bac00]
 9: (()+0x7e9a) [0x7f610c09fe9a]
 10: (clone()+0x6d) [0x7f610a91dccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
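
If more verbose logs would help, I assume I can add something along these
lines to ceph.conf on an affected node and restart the OSD to capture more
detail before the next assert:

[osd]
  debug osd = 20
  debug ms = 1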

Have I done something foolish, or am I hitting a legitimate issue here?

On a side note, my cluster is now in the following state:

2013-09-17 20:47:13.651250 mon.0 [INF] pgmap v1536: 248 pgs: 243 active+clean, 2 active+recovery_wait, 3 active+recovering; 5497 bytes data, 866 MB used, 999 GB / 1000 GB avail; 21/255 degraded (8.235%); 7/85 unfound (8.235%)

According to ceph health detail, the unfound objects are in the .users.uid
and .rgw radosgw pools; I suppose I could remove those pools and have radosgw
recreate them?  If this is not recoverable, is it advisable to just wipe
the cluster and start again?
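
Before doing anything drastic, my understanding is that the unfound objects
can at least be listed per PG with something like the following, where
<pgid> is a placeholder for an affected placement group:

ceph pg <pgid> list_missing

and that 'ceph pg <pgid> mark_unfound_lost revert' might be an option if
losing those objects is acceptable, but I'd appreciate guidance on the
right approach here.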

Thanks in advance for the help.

Regards,
Matt