Hello,

I'm currently running 0.61 with about 44 OSDs and 4 monitors (one as a spare) across about 6 hosts. I've been running into an issue where, when one Ceph host goes down, the entire system becomes unusable.

Today we recovered from an SSD crash that affected an OSD's journal, and it was a lot of work to get things back up: I couldn't get the monitors to come up and establish quorum. I was going to rebuild it manually, but the Ceph documentation for manually (dirty) removing a monitor with the monmap tool is outdated; I couldn't find the /mon-$id/monmap directory.

Anyway, I eventually recovered and was able to run with 4 monitors. Then I updated the crushmap, and that crashed the monitor I was sending the crushmap update to. When I try to start it manually, it now gives me:

[976]: (33) Numerical argument out of domain

I've seen this assert failure before, I'm just not sure what's causing it. Below is the log from the crash:

https://docs.google.com/a/nopatentpending.com/file/d/0BwQnRodV8ActNTVFUVpLVjdMSGc/edit

I'm not even really sure my configs are right; I'm still pretty new at this. Below are the configs and the last map.

ceph.conf
https://docs.google.com/file/d/0BwQnRodV8Acta3ZfSnBrOU40MW8/edit?usp=sharing

crush.map.txt
https://docs.google.com/file/d/0BwQnRodV8Actbl9hY054Mm9UTXM/edit?usp=sharing

If you need additional dumps from the monitor, I can get them.

Thanks,
mr.npp
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com