[ceph-users] MDS Crash on recovery (0.60)
All of my MDS daemons have begun crashing when I start them up and they try to begin recovery. Log attached (obelisk-mds.obelisk-hotcpc9882.log).

Mike

--
Mike Bryant | Systems Administrator | Ocado Technology
mike.bry...@ocado.com | 01707 382148 | www.ocado.com
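[If the attached log turns out not to have enough detail, one common way to capture more on the next crash (a sketch, not something requested in this thread) is to raise MDS debug logging in ceph.conf on the MDS hosts and restart the daemons:

    [mds]
        debug mds = 20
        debug ms = 1

The resulting MDS log, typically under /var/log/ceph/, will then be much more verbose around the point of the crash.]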
Re: [ceph-users] Failed assert when starting new OSDs in 0.60
Hi Sam,

I was prepared to write in and say that the problem had gone away. I tried restarting several OSDs last night in the hopes of capturing the problem on an OSD that hadn't failed yet, but didn't have any luck. So I did indeed re-create the cluster from scratch (using mkcephfs), and what do you know -- everything worked. I got everything in a nice stable state, then decided to do a full cluster restart, just to be sure. Sure enough, one OSD failed to come up, and has the same stack trace. So I believe I have the log you want -- just from the OSD that failed, right?

Question -- any feeling for what parts of the log you need? It's 688MB uncompressed (two hours!), so I'd like to be able to trim some off for you before making it available. Do you only need/want the part from after the OSD was restarted? Or perhaps the corruption happens on OSD shutdown and you need some from before that? If you are fine with that large of a file, I can just make that available too. Let me know.

- Travis

On Mon, Apr 29, 2013 at 6:26 PM, Travis Rhoden wrote:
> Hi Sam,
>
> No problem, I'll leave that debugging turned up high, and do a mkcephfs from scratch and see what happens. Not sure if it will happen again or not. =)
>
> Thanks again.
>
> - Travis
>
> On Mon, Apr 29, 2013 at 5:51 PM, Samuel Just wrote:
>> Hmm, I need logging from when the corruption happened. If this is reproducible, can you enable that logging on a clean osd (or better, a clean cluster) until the assert occurs?
>> -Sam
>>
>> On Mon, Apr 29, 2013 at 2:45 PM, Travis Rhoden wrote:
>>> Also, I can note that it does not take a full cluster restart to trigger this. If I just restart an OSD that was up/in previously, the same error can happen (though not every time). So restarting OSDs for me is a bit like Russian roulette. =) Even though restarting an OSD may not always result in the error, it seems that once it happens that OSD is gone for good. No amount of restarting has brought any of the dead ones back.
>>>
>>> I'd really like to get to the bottom of it. Let me know if I can do anything to help.
>>>
>>> I may also have to try completely wiping/rebuilding to see if I can make this thing usable.
>>>
>>> On Mon, Apr 29, 2013 at 2:38 PM, Travis Rhoden wrote:
>>>> Hi Sam,
>>>>
>>>> Thanks for being willing to take a look.
>>>>
>>>> I applied the debug settings on one host that has 3 out of 3 OSDs with this problem, then tried to start them up. Here are the resulting logs:
>>>>
>>>> https://dl.dropboxusercontent.com/u/23122069/cephlogs.tgz
>>>>
>>>> - Travis
>>>>
>>>> On Mon, Apr 29, 2013 at 1:04 PM, Samuel Just wrote:
>>>>> You appear to be missing pg metadata for some reason. If you can reproduce it with
>>>>>     debug osd = 20
>>>>>     debug filestore = 20
>>>>>     debug ms = 1
>>>>> on all of the OSDs, I should be able to track it down.
>>>>>
>>>>> I created a bug: #4855.
>>>>>
>>>>> Thanks!
>>>>> -Sam
>>>>>
>>>>> On Mon, Apr 29, 2013 at 9:52 AM, Travis Rhoden wrote:
>>>>>> Thanks Greg.
>>>>>>
>>>>>> I quit playing with it because every time I restarted the cluster (service ceph -a restart), I lost more OSDs. The first time it was 1, the second time 10, the third time 13... All 13 down OSDs show the same stack trace.
>>>>>>
>>>>>> - Travis
>>>>>>
>>>>>> On Mon, Apr 29, 2013 at 11:56 AM, Gregory Farnum wrote:
>>>>>>> This sounds vaguely familiar to me, and I see http://tracker.ceph.com/issues/4052, which is marked as "Can't reproduce" — I think maybe this is fixed in "next" and "master", but I'm not sure. For more than that I'd have to defer to Sage or Sam.
>>>>>>> -Greg
>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>
>>>>>>> On Sat, Apr 27, 2013 at 6:43 PM, Travis Rhoden wrote:
>>>>>>>> Hey folks,
>>>>>>>>
>>>>>>>> I'm helping put together a new test/experimental cluster, and hit this today when bringing the cluster up for the first time (using mkcephfs).
>>>>>>>>
>>>>>>>> After doing the normal "service ceph -a start", I noticed one OSD was down, and a lot of PGs were stuck creating. I tried restarting the down OSD, but it wouldn't come up. It always had this error:
>>>>>>>>
>>>>>>>>     -1> 2013-04-27 18:11:56.179804 b6fcd000  2 osd.1 0 boot
>>>>>>>>      0> 2013-04-27 18:11:56.402161 b6fcd000 -1 osd/PG.cc: In function 'static epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&, ceph::bufferlist*)' thread b6fcd000 time 2013-04-27 18:11:56.399089
>>>>>>>>     osd/PG.cc: 2556: FAILED assert(values.size() == 1)
>>>>>>>>
>>>>>>>>     ceph version 0.60-401-g17a3859
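[For convenience, the logging levels Sam asks for above go in the [osd] section of ceph.conf on the affected hosts; a sketch, and the daemons need to be restarted (or the cluster re-created, as Travis did) for the settings to take effect:

    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1

With "debug osd = 20" and "debug ms = 1" enabled the log grows quickly, which is why a couple of hours of runtime produced 688MB here; compressing the log with gzip before uploading usually shrinks it substantially.]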
Re: [ceph-users] Failed assert when starting new OSDs in 0.60
Interestingly, the down OSD does not get marked out after 5 minutes. Probably that is already fixed by http://tracker.ceph.com/issues/4822.

On Tue, Apr 30, 2013 at 11:42 AM, Travis Rhoden wrote:
> Hi Sam,
>
> I was prepared to write in and say that the problem had gone away. [...] Sure enough, one OSD failed to come up, and has the same stack trace. So I believe I have the log you want -- just from the OSD that failed, right?
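[For reference on the 5-minute figure above: the delay before a down OSD is marked out is governed by a monitor setting, shown here as a sketch with its usual default; the tracker issue suggests the behaviour seen in this thread was a bug rather than a configuration problem.

    [mon]
        mon osd down out interval = 300    # seconds a down OSD waits before being marked out
]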
Re: [ceph-users] MDS Crash on recovery (0.60)
On Tue, Apr 30, 2013 at 03:10:00PM +0100, Mike Bryant wrote:
> All of my MDS daemons have begun crashing when I start them up, and
> they try to begin recovery.

Hi,

It seems to be the same bug as #4644: http://tracker.ceph.com/issues/4644

--
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com
Re: [ceph-users] Failed assert when starting new OSDs in 0.60
What version of leveldb is installed? And which Ubuntu version?
-Sam

On Tue, Apr 30, 2013 at 8:50 AM, Travis Rhoden wrote:
> Interestingly, the down OSD does not get marked out after 5 minutes.
> Probably that is already fixed by http://tracker.ceph.com/issues/4822.
Re: [ceph-users] Failed assert when starting new OSDs in 0.60
On the OSD node:

    root@cepha0:~# lsb_release -a
    No LSB modules are available.
    Distributor ID: Ubuntu
    Description:    Ubuntu 12.10
    Release:        12.10
    Codename:       quantal

    root@cepha0:~# dpkg -l "*leveldb*"
    Desired=Unknown/Install/Remove/Purge/Hold
    | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
    |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
    ||/ Name               Version                  Architecture  Description
    ii  libleveldb1:armhf  0+20120530.gitdd0d562-2  armhf         fast key-value storage library

    root@cepha0:~# uname -a
    Linux cepha0 3.5.0-27-highbank #46-Ubuntu SMP Mon Mar 25 23:19:40 UTC 2013 armv7l armv7l armv7l GNU/Linux

On the MON node:

    # lsb_release -a
    No LSB modules are available.
    Distributor ID: Ubuntu
    Description:    Ubuntu 12.10
    Release:        12.10
    Codename:       quantal

    # uname -a
    Linux 3.5.0-27-generic #46-Ubuntu SMP Mon Mar 25 19:58:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

    # dpkg -l "*leveldb*"
    Desired=Unknown/Install/Remove/Purge/Hold
    | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
    |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
    ||/ Name                  Version                  Architecture  Description
    un  leveldb-doc                                                  (no description available)
    ii  libleveldb-dev:amd64  0+20120530.gitdd0d562-2  amd64         fast key-value storage library (development files)
    ii  libleveldb1:amd64     0+20120530.gitdd0d562-2  amd64         fast key-value storage library

On Tue, Apr 30, 2013 at 12:11 PM, Samuel Just wrote:
> What version of leveldb is installed? And which Ubuntu version?
> -Sam
Re: [ceph-users] MDS Crash on recovery (0.60)
Ah, looks like it was. I've got a gitbuilder build of the mds running and it seems to be working.
Thanks!
Mike

On 30 April 2013 16:56, Kevin Decherf wrote:
> It seems to be the same bug as #4644:
> http://tracker.ceph.com/issues/4644

--
Mike Bryant | Systems Administrator | Ocado Technology
mike.bry...@ocado.com | 01707 382148 | www.ocado.com
[ceph-users] Initial Phase of Deploying Openstack
Hopefully I'm performing the correct task to ask a quick, facepalm-inducing question:

After attending the OpenStack Summit in Portland, my company is planning to implement our own private cloud, and we are beginning to develop ideas regarding its architecture. Skipping extraneous information: I'm curious to know if it's possible to deploy Ceph (we like the idea of combining block and object-based storage instead of using 'the other guys' separately) on a SINGLE storage server, and if so -- how would you recommend it be done? DAS? NAS? Obviously this is not ideal for failover or redundancy, but for our initial configuration we will likely be going this route.

Thank you in advance for your time -- we greatly appreciate your efforts!

Regards,

Christopher Coulson
---
Systems Administrator
CPI Group, Inc.
3719 Corporex Park Drive, Suite #50
Tampa, FL 33619
Phone: 813.254.6112 (ext. 635)
Fax: 813.514.0637
Email: chr...@thecpigroup.com
---
Re: [ceph-users] Initial Phase of Deploying Openstack
On Tue, Apr 30, 2013 at 12:35 PM, Chris Coulson wrote:
> I'm curious to know if it's possible to deploy Ceph (we like the idea of
> combining block and object-based storage instead of using 'the other guys'
> separately) on a SINGLE storage server, and if so -- how would you recommend
> it be done? DAS? NAS?

I'm not quite sure what you're asking about here. It's perfectly possible to deploy Ceph on a single node; just run an OSD daemon per drive, and a monitor daemon on a drive. You'd have to connect to it through the interfaces you're interested in.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
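[To make "an OSD daemon per drive plus a monitor" concrete, here is a minimal single-node ceph.conf sketch in the old mkcephfs style used elsewhere in this digest; the hostname, monitor address, and device paths are placeholders, not values from the thread:

    [global]
        # global settings (auth, networks, log paths) go here

    [mon.a]
        host = stor1
        mon addr = 192.168.0.10:6789

    [osd.0]
        host = stor1
        devs = /dev/sdb

    [osd.1]
        host = stor1
        devs = /dev/sdc

On a single host you may also need to adjust the default CRUSH rule so replicas are allowed to land on different OSDs of the same host; otherwise placement groups can sit degraded because the default rule spreads copies across hosts.]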
Re: [ceph-users] Initial Phase of Deploying Openstack
On Apr 30, 2013, at 1:35 PM, Chris Coulson wrote:
> I'm curious to know if it's possible to deploy Ceph (we like the idea of
> combining block and object-based storage instead of using 'the other guys'
> separately) on a SINGLE storage server, and if so -- how would you recommend
> it be done?

You can certainly use a single Ceph cluster for both block and object storage.

It's possible but not recommended to run a Ceph cluster on a single server. You could maybe go that route for a proof-of-concept. If you are serious about the project, plan to start with three servers.

For block storage you'll want to use RBD. Qemu supports this nicely using librbd, so you don't need the kernel RBD support.

For object storage you can either use RADOS directly or use the RADOS gateway. The latter is compatible with both the S3 and Swift APIs, which should make integrating with OpenStack straightforward.

See also http://www.sebastien-han.fr/blog/2012/06/10/introducing-ceph-to-openstack/ and other posts on Sebastien's blog.

JN
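[As a rough illustration of the librbd path John describes (a sketch; the image name and size are made up, and the exact qemu drive syntax varies between qemu versions):

    # create a 10 GB RBD image in the default 'rbd' pool (size is in MB)
    rbd create myimage --size 10240

    # attach it to a guest directly via librbd, no kernel rbd module needed
    qemu-system-x86_64 -m 1024 \
        -drive format=raw,file=rbd:rbd/myimage
]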
Re: [ceph-users] Initial Phase of Deploying Openstack
[ Please keep discussions on the list — thanks! :) ]

On Tue, Apr 30, 2013 at 1:09 PM, Chris Coulson wrote:
> First, thank you for your reply. I guess I wasn't specific enough -- my
> question really should have been: when planning to use Ceph, should there be
> any specific hardware considerations for a consolidated deployment on a
> single node? We're interested in taking advantage of both the object-based
> and block storage offered by Ceph, and I'm not sure if we could simply pick
> up a basic DAS/NAS storage server, install Ceph, and be good to go, or if
> we're missing something.

Ah, yeah. Just pick a server that can hold a bunch of drives. We've seen that some of them have issues with, e.g., oversubscribed SAS/SATA expanders, so you'll want to check their basic disk capability, and you want enough compute power to handle each drive (we recommend 1GHz of CPU and 1GB of RAM per OSD/disk), but within those constraints you can go wild.
Like John said, you probably want more than one server for a production deployment (just in terms of reliability), but in terms of the software you can configure everything to work that way.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
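[To put Greg's rule of thumb into rough numbers (an illustrative example, not a figure from the thread): a single chassis holding 12 data drives run as 12 OSDs would want on the order of 12GHz of aggregate CPU (for example, a 3GHz quad-core) and at least 12GB of RAM for the OSD daemons, plus some headroom for the monitor and the operating system.]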
Re: [ceph-users] Initial Phase of Deploying Openstack
You guys are great. Thank you very much for the assistance!

Regards,

Christopher Coulson
Systems Administrator, CPI Group, Inc.

On Tue, Apr 30, 2013 at 4:15 PM, Gregory Farnum wrote:
> Ah, yeah. Just pick a server that can hold a bunch of drives. [...]
> Like John said, you probably want more than one server for a production
> deployment (just in terms of reliability), but in terms of the software
> you can configure everything to work that way.
Re: [ceph-users] Failed assert when starting new OSDs in 0.60
I'm getting the same issue with one of my OSDs.

    Calculating dependencies... done!
    [ebuild   R   ~] app-arch/snappy-1.1.0  USE="-static-libs"  0 kB
    [ebuild   R   ~] dev-libs/leveldb-1.9.0-r5  USE="snappy -static-libs"  0 kB
    [ebuild   R   ~] sys-cluster/ceph-0.60-r1  USE="-debug -fuse -gtk -libatomic -radosgw -static-libs -tcmalloc"  0 kB

Below is my log:
https://docs.google.com/file/d/0BwQnRodV8Actd2NQT25FSnA2cjg/edit?usp=sharing

Thanks,
mr.npp

On Tue, Apr 30, 2013 at 9:17 AM, Travis Rhoden wrote:
> On the OSD node:
>
>     root@cepha0:~# dpkg -l "*leveldb*"
>     ii  libleveldb1:armhf  0+20120530.gitdd0d562-2  armhf  fast key-value storage library
> [...]