[ceph-users] MDS Crash on recovery (0.60)

2013-04-30 Thread Mike Bryant
All of my MDS daemons have begun crashing when I start them up, and
they try to begin recovery.

Log attached
Mike

--
Mike Bryant | Systems Administrator | Ocado Technology
mike.bry...@ocado.com | 01707 382148 | www.ocado.com



obelisk-mds.obelisk-hotcpc9882.log
Description: Binary data


Re: [ceph-users] Failed assert when starting new OSDs in 0.60

2013-04-30 Thread Travis Rhoden
Hi Sam,

I was prepared to write in and say that the problem had gone away.  I tried
restarting several OSDs last night in the hopes of capturing the problem on
an OSD that hadn't failed yet, but didn't have any luck.  So I did indeed
re-create the cluster from scratch (using mkcephfs), and what do you know
-- everything worked.  I got everything in a nice stable state, then
decided to do a full cluster restart, just to be sure.  Sure enough, one
OSD failed to come up, and has the same stack trace.  So I believe I have
the log you want -- just from the OSD that failed, right?

Question -- any feeling for what parts of the log you need?  It's 688MB
uncompressed (two hours!), so I'd like to be able to trim some off for you
before making it available.  Do you only need/want the part from after the
OSD was restarted?  Or perhaps the corruption happens on OSD shutdown and
you need some before that?  If you are fine with that large of a file, I
can just make that available too.  Let me know.

 - Travis
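
A minimal sketch of one way to trim such a log down to the post-restart window,
assuming the standard log path and a purely illustrative restart timestamp:

    # keep only lines from the (hypothetical) restart time onward; ceph log
    # lines begin with "YYYY-MM-DD HH:MM:SS.usec", so a string comparison on
    # the first two fields is enough
    awk '($1 " " $2) >= "2013-04-27 18:10:00"' /var/log/ceph/ceph-osd.1.log \
        > ceph-osd.1.after-restart.log
    gzip ceph-osd.1.after-restart.log

Continuation lines without a leading timestamp (stack traces, for instance) may
need a little manual cleanup around the cut point.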


On Mon, Apr 29, 2013 at 6:26 PM, Travis Rhoden  wrote:

> Hi Sam,
>
> No problem, I'll leave that debugging turned up high, and do a mkcephfs
> from scratch and see what happens.  Not sure if it will happen again or
> not.  =)
>
> Thanks again.
>
>  - Travis
>
>
> On Mon, Apr 29, 2013 at 5:51 PM, Samuel Just  wrote:
>
>> Hmm, I need logging from when the corruption happened.  If this is
>> reproducible, can you enable that logging on a clean osd (or better, a
>> clean cluster) until the assert occurs?
>> -Sam
>>
>> On Mon, Apr 29, 2013 at 2:45 PM, Travis Rhoden  wrote:
>> > Also, I can note that it does not take a full cluster restart to trigger
>> > this.  If I just restart an OSD that was up/in previously, the same
>> error
>> > can happen (though not every time).  So restarting OSDs for me is a bit
>> > like Russian roulette.  =)  Even though restarting an OSD may not always
>> > result in the error, it seems that once it happens that OSD is gone for
>> > good.  No amount of restart has brought any of the dead ones back.
>> >
>> > I'd really like to get to the bottom of it.  Let me know if I can do
>> > anything to help.
>> >
>> > I may also have to try completely wiping/rebuilding to see if I can make
>> > this thing usable.
>> >
>> >
>> > On Mon, Apr 29, 2013 at 2:38 PM, Travis Rhoden 
>> wrote:
>> >>
>> >> Hi Sam,
>> >>
>> >> Thanks for being willing to take a look.
>> >>
>> >> I applied the debug settings on one host that has 3 out of 3 OSDs with this
>> >> problem.  Then tried to start them up.  Here are the resulting logs:
>> >>
>> >> https://dl.dropboxusercontent.com/u/23122069/cephlogs.tgz
>> >>
>> >>  - Travis
>> >>
>> >>
>> >> On Mon, Apr 29, 2013 at 1:04 PM, Samuel Just 
>> wrote:
>> >>>
>> >>> You appear to be missing pg metadata for some reason.  If you can
>> >>> reproduce it with
>> >>> debug osd = 20
>> >>> debug filestore = 20
>> >>> debug ms = 1
>> >>> on all of the OSDs, I should be able to track it down.
>> >>>
>> >>> I created a bug: #4855.
>> >>>
>> >>> Thanks!
>> >>> -Sam
>> >>>
>> >>> On Mon, Apr 29, 2013 at 9:52 AM, Travis Rhoden 
>> wrote:
>> >>> > Thanks Greg.
>> >>> >
>> >>> > I quit playing with it because every time I restarted the cluster
>> >>> > (service
>> >>> > ceph -a restart), I lost more OSDs..  First time it was 1, 2nd 10,
>> 3rd
>> >>> > time
>> >>> > 13...  All 13 down OSDs all show the same stacktrace.
>> >>> >
>> >>> >  - Travis
>> >>> >
>> >>> >
>> >>> > On Mon, Apr 29, 2013 at 11:56 AM, Gregory Farnum 
>> >>> > wrote:
>> >>> >>
>> >>> >> This sounds vaguely familiar to me, and I see
>> >>> >> http://tracker.ceph.com/issues/4052, which is marked as "Can't
>> >>> >> reproduce" — I think maybe this is fixed in "next" and "master",
>> but
>> >>> >> I'm not sure. For more than that I'd have to defer to Sage or Sam.
>> >>> >> -Greg
>> >>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>> >>> >>
>> >>> >>
>> >>> >> On Sat, Apr 27, 2013 at 6:43 PM, Travis Rhoden 
>> >>> >> wrote:
>> >>> >> > Hey folks,
>> >>> >> >
>> >>> >> > I'm helping put together a new test/experimental cluster, and hit
>> >>> >> > this
>> >>> >> > today
>> >>> >> > when bringing the cluster up for the first time (using mkcephfs).
>> >>> >> >
>> >>> >> > After doing the normal "service ceph -a start", I noticed one OSD
>> >>> >> > was
>> >>> >> > down,
>> >>> >> > and a lot of PGs were stuck creating.  I tried restarting the
>> down
>> >>> >> > OSD,
>> >>> >> > but
>> >>> >> > it wouldn't come up.  It always had this error:
>> >>> >> >
>> >>> >> > -1> 2013-04-27 18:11:56.179804 b6fcd000  2 osd.1 0 boot
>> >>> >> >  0> 2013-04-27 18:11:56.402161 b6fcd000 -1 osd/PG.cc: In function
>> >>> >> > 'static epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
>> >>> >> > ceph::bufferlist*)' thread b6fcd000 time 2013-04-27 18:11:56.399089
>> >>> >> > osd/PG.cc: 2556: FAILED assert(values.size() == 1)
>> >>> >> >
>> >>> >> >  ceph version 0.60-401-g17a3859
>> >>> >> > 
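
For reference, a minimal sketch of the reproduction setup Sam asks for above,
with the three debug options dropped into ceph.conf on the OSD hosts (the
section placement and paths shown here are the stock defaults, stated as
assumptions):

    # /etc/ceph/ceph.conf -- add under the existing [osd] section
    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1

    # restart so the logging is active before the failure reproduces
    service ceph -a restart
    # per-OSD logs then land in /var/log/ceph/ceph-osd.<id>.log by default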

Re: [ceph-users] Failed assert when starting new OSDs in 0.60

2013-04-30 Thread Travis Rhoden
Interestingly, the down OSD does not get marked out after 5 minutes.
Probably that is already fixed by http://tracker.ceph.com/issues/4822.
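
(The "marked out" timer here is the monitor's mon osd down out interval, which
defaults to 300 seconds. A quick way to check what a running monitor is actually
using, assuming the stock admin-socket path and a monitor named mon.a:)

    ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config show \
        | grep mon_osd_down_out_interval

    # or pin it explicitly in ceph.conf
    [mon]
        mon osd down out interval = 300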


On Tue, Apr 30, 2013 at 11:42 AM, Travis Rhoden  wrote:

> Hi Sam,
>
> I was prepared to write in and say that the problem had gone away.  I
> tried restarting several OSDs last night in the hopes of capturing the
> problem on an OSD that hadn't failed yet, but didn't have any luck.  So I
> did indeed re-create the cluster from scratch (using mkcephfs), and what do
> you know -- everything worked.  I got everything in a nice stable state,
> then decided to do a full cluster restart, just to be sure.  Sure enough,
> one OSD failed to come up, and has the same stack trace.  So I believe I
> have the log you want -- just from the OSD that failed, right?
>
> Question -- any feeling for what parts of the log you need?  It's 688MB
> uncompressed (two hours!), so I'd like to be able to trim some off for you
> before making it available.  Do you only need/want the part from after the
> OSD was restarted?  Or perhaps the corruption happens on OSD shutdown and
> you need some before that?  If you are fine with that large of a file, I
> can just make that available too.  Let me know.
>
>  - Travis
>
>
> On Mon, Apr 29, 2013 at 6:26 PM, Travis Rhoden  wrote:
>
>> Hi Sam,
>>
>> No problem, I'll leave that debugging turned up high, and do a mkcephfs
>> from scratch and see what happens.  Not sure if it will happen again or
>> not.  =)
>>
>> Thanks again.
>>
>>  - Travis
>>
>>
>> On Mon, Apr 29, 2013 at 5:51 PM, Samuel Just wrote:
>>
>>> Hmm, I need logging from when the corruption happened.  If this is
>>> reproducible, can you enable that logging on a clean osd (or better, a
>>> clean cluster) until the assert occurs?
>>> -Sam
>>>
>>> On Mon, Apr 29, 2013 at 2:45 PM, Travis Rhoden 
>>> wrote:
>>> > Also, I can note that it does not take a full cluster restart to
>>> trigger
>>> > this.  If I just restart an OSD that was up/in previously, the same
>>> error
>>> > can happen (though not every time).  So restarting OSDs for me is a
>>> bit
>>> > like Russian roulette.  =)  Even though restarting an OSD may not always
>>> > result in the error, it seems that once it happens that OSD is gone for
>>> > good.  No amount of restart has brought any of the dead ones back.
>>> >
>>> > I'd really like to get to the bottom of it.  Let me know if I can do
>>> > anything to help.
>>> >
>>> > I may also have to try completely wiping/rebuilding to see if I can
>>> make
>>> > this thing usable.
>>> >
>>> >
>>> > On Mon, Apr 29, 2013 at 2:38 PM, Travis Rhoden 
>>> wrote:
>>> >>
>>> >> Hi Sam,
>>> >>
>>> >> Thanks for being willing to take a look.
>>> >>
>>> >> I applied the debug settings on one host that has 3 out of 3 OSDs with
>>> this
>>> >> problem.  Then tried to start them up.  Here are the resulting logs:
>>> >>
>>> >> https://dl.dropboxusercontent.com/u/23122069/cephlogs.tgz
>>> >>
>>> >>  - Travis
>>> >>
>>> >>
>>> >> On Mon, Apr 29, 2013 at 1:04 PM, Samuel Just 
>>> wrote:
>>> >>>
>>> >>> You appear to be missing pg metadata for some reason.  If you can
>>> >>> reproduce it with
>>> >>> debug osd = 20
>>> >>> debug filestore = 20
>>> >>> debug ms = 1
>>> >>> on all of the OSDs, I should be able to track it down.
>>> >>>
>>> >>> I created a bug: #4855.
>>> >>>
>>> >>> Thanks!
>>> >>> -Sam
>>> >>>
>>> >>> On Mon, Apr 29, 2013 at 9:52 AM, Travis Rhoden 
>>> wrote:
>>> >>> > Thanks Greg.
>>> >>> >
>>> >>> > I quit playing with it because every time I restarted the cluster
>>> >>> > (service
>>> >>> > ceph -a restart), I lost more OSDs..  First time it was 1, 2nd 10,
>>> 3rd
>>> >>> > time
>>> >>> > 13...  All 13 down OSDs all show the same stacktrace.
>>> >>> >
>>> >>> >  - Travis
>>> >>> >
>>> >>> >
>>> >>> > On Mon, Apr 29, 2013 at 11:56 AM, Gregory Farnum >> >
>>> >>> > wrote:
>>> >>> >>
>>> >>> >> This sounds vaguely familiar to me, and I see
>>> >>> >> http://tracker.ceph.com/issues/4052, which is marked as "Can't
>>> >>> >> reproduce" — I think maybe this is fixed in "next" and "master",
>>> but
>>> >>> >> I'm not sure. For more than that I'd have to defer to Sage or Sam.
>>> >>> >> -Greg
>>> >>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>> >>> >>
>>> >>> >>
>>> >>> >> On Sat, Apr 27, 2013 at 6:43 PM, Travis Rhoden >> >
>>> >>> >> wrote:
>>> >>> >> > Hey folks,
>>> >>> >> >
>>> >>> >> > I'm helping put together a new test/experimental cluster, and
>>> hit
>>> >>> >> > this
>>> >>> >> > today
>>> >>> >> > when bringing the cluster up for the first time (using
>>> mkcephfs).
>>> >>> >> >
>>> >>> >> > After doing the normal "service ceph -a start", I noticed one
>>> OSD
>>> >>> >> > was
>>> >>> >> > down,
>>> >>> >> > and a lot of PGs were stuck creating.  I tried restarting the
>>> down
>>> >>> >> > OSD,
>>> >>> >> > but
>>> >>> >> > it wouldn't come up.  It always had this error:
>>> >>> >> >
>>> >>> >> > -1> 2013-04-27 18:11:56.179804 b6fcd000  2 osd.1 0 boot
>>> >

Re: [ceph-users] MDS Crash on recovery (0.60)

2013-04-30 Thread Kevin Decherf
On Tue, Apr 30, 2013 at 03:10:00PM +0100, Mike Bryant wrote:
> All of my MDS daemons have begun crashing when I start them up, and
> they try to begin recovery.

Hi,

It seems to be the same bug as #4644
http://tracker.ceph.com/issues/4644

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


Re: [ceph-users] Failed assert when starting new OSDs in 0.60

2013-04-30 Thread Samuel Just
What version of leveldb is installed?  Ubuntu/version?
-Sam

On Tue, Apr 30, 2013 at 8:50 AM, Travis Rhoden  wrote:
> Interestingly, the down OSD does not get marked out after 5 minutes.
> Probably that is already fixed by http://tracker.ceph.com/issues/4822.
>
>
> On Tue, Apr 30, 2013 at 11:42 AM, Travis Rhoden  wrote:
>>
>> Hi Sam,
>>
>> I was prepared to write in and say that the problem had gone away.  I
>> tried restarting several OSDs last night in the hopes of capturing the
>> problem on an OSD that hadn't failed yet, but didn't have any luck.  So I
>> did indeed re-create the cluster from scratch (using mkcephfs), and what do
>> you know -- everything worked.  I got everything in a nice stable state,
>> then decided to do a full cluster restart, just to be sure.  Sure enough,
>> one OSD failed to come up, and has the same stack trace.  So I believe I
>> have the log you want -- just from the OSD that failed, right?
>>
>> Question -- any feeling for what parts of the log you need?  It's 688MB
>> uncompressed (two hours!), so I'd like to be able to trim some off for you
>> before making it available.  Do you only need/want the part from after the
>> OSD was restarted?  Or perhaps the corruption happens on OSD shutdown and
>> you need some before that?  If you are fine with that large of a file, I can
>> just make that available too.  Let me know.
>>
>>  - Travis
>>
>>
>> On Mon, Apr 29, 2013 at 6:26 PM, Travis Rhoden  wrote:
>>>
>>> Hi Sam,
>>>
>>> No problem, I'll leave that debugging turned up high, and do a mkcephfs
>>> from scratch and see what happens.  Not sure if it will happen again or not.
>>> =)
>>>
>>> Thanks again.
>>>
>>>  - Travis
>>>
>>>
>>> On Mon, Apr 29, 2013 at 5:51 PM, Samuel Just 
>>> wrote:

 Hmm, I need logging from when the corruption happened.  If this is
 reproducible, can you enable that logging on a clean osd (or better, a
 clean cluster) until the assert occurs?
 -Sam

 On Mon, Apr 29, 2013 at 2:45 PM, Travis Rhoden 
 wrote:
 > Also, I can note that it does not take a full cluster restart to
 > trigger
 > this.  If I just restart an OSD that was up/in previously, the same
 > error
 > can happen (though not every time).  So restarting OSDs for me is a
 > bit
 > like Russian roulette.  =)  Even though restarting an OSD may not
 > also
 > result in the error, it seems that once it happens that OSD is gone
 > for
 > good.  No amount of restart has brought any of the dead ones back.
 >
 > I'd really like to get to the bottom of it.  Let me know if I can do
 > anything to help.
 >
 > I may also have to try completely wiping/rebuilding to see if I can
 > make
 > this thing usable.
 >
 >
 > On Mon, Apr 29, 2013 at 2:38 PM, Travis Rhoden 
 > wrote:
 >>
 >> Hi Sam,
 >>
 >> Thanks for being willing to take a look.
 >>
 >> I applied the debug settings on one host that has 3 out of 3 OSDs with
 >> this
 >> problem.  Then tried to start them up.  Here are the resulting logs:
 >>
 >> https://dl.dropboxusercontent.com/u/23122069/cephlogs.tgz
 >>
 >>  - Travis
 >>
 >>
 >> On Mon, Apr 29, 2013 at 1:04 PM, Samuel Just 
 >> wrote:
 >>>
 >>> You appear to be missing pg metadata for some reason.  If you can
 >>> reproduce it with
 >>> debug osd = 20
 >>> debug filestore = 20
 >>> debug ms = 1
 >>> on all of the OSDs, I should be able to track it down.
 >>>
 >>> I created a bug: #4855.
 >>>
 >>> Thanks!
 >>> -Sam
 >>>
 >>> On Mon, Apr 29, 2013 at 9:52 AM, Travis Rhoden 
 >>> wrote:
 >>> > Thanks Greg.
 >>> >
 >>> > I quit playing with it because every time I restarted the cluster
 >>> > (service
 >>> > ceph -a restart), I lost more OSDs..  First time it was 1, 2nd 10,
 >>> > 3rd
 >>> > time
 >>> > 13...  All 13 down OSDs all show the same stacktrace.
 >>> >
 >>> >  - Travis
 >>> >
 >>> >
 >>> > On Mon, Apr 29, 2013 at 11:56 AM, Gregory Farnum
 >>> > 
 >>> > wrote:
 >>> >>
 >>> >> This sounds vaguely familiar to me, and I see
 >>> >> http://tracker.ceph.com/issues/4052, which is marked as "Can't
 >>> >> reproduce" — I think maybe this is fixed in "next" and "master",
 >>> >> but
 >>> >> I'm not sure. For more than that I'd have to defer to Sage or
 >>> >> Sam.
 >>> >> -Greg
 >>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
 >>> >>
 >>> >>
 >>> >> On Sat, Apr 27, 2013 at 6:43 PM, Travis Rhoden
 >>> >> 
 >>> >> wrote:
 >>> >> > Hey folks,
 >>> >> >
 >>> >> > I'm helping put together a new test/experimental cluster, and
 >>> >> > hit
 >>> >> > this
 >>> >> > today
 >>> >> > when bringing the cluster up for the first time (using
 >>> >> > mkcephfs).
 >>> >> >
 >>> >> > After doing the norm

Re: [ceph-users] Failed assert when starting new OSDs in 0.60

2013-04-30 Thread Travis Rhoden
On the OSD node:

root@cepha0:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.10
Release:        12.10
Codename:       quantal
root@cepha0:~# dpkg -l "*leveldb*"
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name               Version                  Architecture  Description
+++-==================-========================-=============-==============================
ii  libleveldb1:armhf  0+20120530.gitdd0d562-2  armhf         fast key-value storage library
root@cepha0:~# uname -a
Linux cepha0 3.5.0-27-highbank #46-Ubuntu SMP Mon Mar 25 23:19:40 UTC 2013 armv7l armv7l armv7l GNU/Linux


On the MON node:
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.10
Release:        12.10
Codename:       quantal
# uname -a
Linux  3.5.0-27-generic #46-Ubuntu SMP Mon Mar 25 19:58:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
# dpkg -l "*leveldb*"
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                 Version                  Architecture  Description
+++-====================-========================-=============-===================================================
un  leveldb-doc                                                 (no description available)
ii  libleveldb-dev:amd64 0+20120530.gitdd0d562-2  amd64         fast key-value storage library (development files)
ii  libleveldb1:amd64    0+20120530.gitdd0d562-2  amd64         fast key-value storage library


On Tue, Apr 30, 2013 at 12:11 PM, Samuel Just  wrote:

> What version of leveldb is installed?  Ubuntu/version?
> -Sam
>
> On Tue, Apr 30, 2013 at 8:50 AM, Travis Rhoden  wrote:
> > Interestingly, the down OSD does not get marked out after 5 minutes.
> > Probably that is already fixed by http://tracker.ceph.com/issues/4822.
> >
> >
> > On Tue, Apr 30, 2013 at 11:42 AM, Travis Rhoden 
> wrote:
> >>
> >> Hi Sam,
> >>
> >> I was prepared to write in and say that the problem had gone away.  I
> >> tried restarting several OSDs last night in the hopes of capturing the
> >> problem on an OSD that hadn't failed yet, but didn't have any luck.  So I
>  So I
> >> did indeed re-create the cluster from scratch (using mkcephfs), and
> what do
> >> you know -- everything worked.  I got everything in a nice stable state,
> >> then decided to do a full cluster restart, just to be sure.  Sure
> enough,
> >> one OSD failed to come up, and has the same stack trace.  So I believe I
> >> have the log you want -- just from the OSD that failed, right?
> >>
> >> Question -- any feeling for what parts of the log you need?  It's 688MB
> >> uncompressed (two hours!), so I'd like to be able to trim some off for
> you
> >> before making it available.  Do you only need/want the part from after
> the
> >> OSD was restarted?  Or perhaps the corruption happens on OSD shutdown
> and
> >> you need some before that?  If you are fine with that large of a file,
> I can
> >> just make that available too.  Let me know.
> >>
> >>  - Travis
> >>
> >>
> >> On Mon, Apr 29, 2013 at 6:26 PM, Travis Rhoden 
> wrote:
> >>>
> >>> Hi Sam,
> >>>
> >>> No problem, I'll leave that debugging turned up high, and do a mkcephfs
> >>> from scratch and see what happens.  Not sure if it will happen again
> or not.
> >>> =)
> >>>
> >>> Thanks again.
> >>>
> >>>  - Travis
> >>>
> >>>
> >>> On Mon, Apr 29, 2013 at 5:51 PM, Samuel Just 
> >>> wrote:
> 
>  Hmm, I need logging from when the corruption happened.  If this is
>  reproducible, can you enable that logging on a clean osd (or better, a
>  clean cluster) until the assert occurs?
>  -Sam
> 
>  On Mon, Apr 29, 2013 at 2:45 PM, Travis Rhoden 
>  wrote:
>  > Also, I can note that it does not take a full cluster restart to
>  > trigger
>  > this.  If I just restart an OSD that was up/in previously, the same
>  > error
 >  > can happen (though not every time).  So restarting OSDs for me is a
 >  > bit
 >  > like Russian roulette.  =)  Even though restarting an OSD may not
>  > also
>  > result in the error, it seems that once it happens that OSD is gone
>  > for
>  > good.  No amount of restart has brought any of the dead ones back.
>  >
>  > I'd really like to get to the bottom of it.  Let me know if I can do
>  > anything to help.
>  >
>  > I may also have to try completely wiping/rebuilding to see if I can
>  > make
>  > this thing usable.
>  >
>  >
> >>>

Re: [ceph-users] MDS Crash on recovery (0.60)

2013-04-30 Thread Mike Bryant
Ah, looks like it was.
I've got a gitbuilder build of the mds running and it seems to be working.

Thanks!
Mike

On 30 April 2013 16:56, Kevin Decherf  wrote:
> On Tue, Apr 30, 2013 at 03:10:00PM +0100, Mike Bryant wrote:
>> All of my MDS daemons have begun crashing when I start them up, and
>> they try to begin recovery.
>
> Hi,
>
> It seems to be the same bug as #4644
> http://tracker.ceph.com/issues/4644
>
> --
> Kevin Decherf - @Kdecherf
> GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
> http://kdecherf.com



--
Mike Bryant | Systems Administrator | Ocado Technology
mike.bry...@ocado.com | 01707 382148 | www.ocado.com



[ceph-users] Initial Phase of Deploying Openstack

2013-04-30 Thread Chris Coulson
Hopefully this is the right place to ask a quick, facepalm-inducing
question:


After attending the Openstack Summit in Portland, my company is planning to
implement our own private cloud and are beginning to develop ideas
regarding its architecture. Skipping extraneous information: I'm curious to
know if it's possible to deploy Ceph (we like the idea of combining block
and object-based storage instead of using 'the other guys' separately) on a
SINGLE storage server, and if so-- how would you recommend it be done? DAS?
NAS? Obviously this is not ideal for failover or redundancy, but for our
initial configuration, we will likely be going this route.

Thank you in advance for your time-- we greatly appreciate your efforts!


Regards,

Christopher Coulson
---
Systems Administrator
CPI Group, Inc.
3719 Corporex Park Drive, Suite #50
Tampa, FL 33619
Phone: 813.254.6112 (ext. 635)
Fax:   813.514.0637
Email: chr...@thecpigroup.com
---


Re: [ceph-users] Initial Phase of Deploying Openstack

2013-04-30 Thread Gregory Farnum
On Tue, Apr 30, 2013 at 12:35 PM, Chris Coulson  wrote:
> Hopefully I'm performing the correct task to ask a quick, facepalm-inducing
> question:
>
>
> After attending the Openstack Summit in Portland, my company is planning to
> implement our own private cloud and are beginning to develop ideas regarding
> its architecture. Skipping extraneous information: I'm curious to know if
> it's possible to deploy Ceph (we like the idea of combining block and
> object-based storage instead of using 'the other guys' separately) on a
> SINGLE storage server, and if so-- how would you recommend it be done? DAS?
> NAS? Obviously this is not ideal for failover or redundancy, but for our
> initial configuration, we will likely be going this route.

I'm not quite sure what you're asking about here. It's perfectly
possible to deploy Ceph on a single node; just run an OSD daemon per
drive, and a monitor daemon on a drive. You'd have to connect to it
through the interfaces you're interested in.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
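
A minimal sketch of what that single-node layout could look like with the
mkcephfs-style deployment of this era; hostnames, addresses, device paths, and
the OSD count are all placeholders:

    [global]
        auth supported = cephx

    [mon.a]
        host = storage1
        mon addr = 10.0.0.10:6789

    [osd.0]
        host = storage1
        devs = /dev/sdb

    [osd.1]
        host = storage1
        devs = /dev/sdc

    # build and start the one-node cluster
    mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.keyring
    service ceph -a start

RBD and the RADOS gateway then sit on top of this same cluster, so no separate
DAS/NAS layer is needed; the trade-off is simply that a single box is a single
failure domain.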


Re: [ceph-users] Initial Phase of Deploying Openstack

2013-04-30 Thread John Nielsen
On Apr 30, 2013, at 1:35 PM, Chris Coulson  wrote:

> After attending the Openstack Summit in Portland, my company is planning to 
> implement our own private cloud and are beginning to develop ideas regarding 
> its architecture. Skipping extraneous information: I'm curious to know if 
> it's possible to deploy Ceph (we like the idea of combining block and 
> object-based storage instead of using 'the other guys' separately) on a 
> SINGLE storage server, and if so-- how would you recommend it be done? DAS? 
> NAS? Obviously this is not ideal for failover or redundancy, but for our 
> initial configuration, we will likely be going this route.

You can certainly use a single Ceph cluster for both block and object storage. 
It's possible but not recommended to run a Ceph cluster on a single server. You 
could maybe go that route for a proof-of-concept. If you are serious about the 
project, plan to start with three servers.

For block storage you'll want to use RBD. Qemu supports this nicely using 
librbd so you don't need the kernel RBD support.

For object storage you can either use RADOS directly or use the RADOS gateway. 
The latter is compatible with both S3 and Swift APIs, which should make 
integrating with Openstack straightforward.

See also 
http://www.sebastien-han.fr/blog/2012/06/10/introducing-ceph-to-openstack/ and 
other posts on Sebastien's blog.

JN
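
To make the librbd point concrete, a sketch of the usual flow; the pool, image
name, and sizes are placeholders, and qemu must have been built with rbd
support:

    # create a 10 GB image in the default 'rbd' pool (--size is in MB)
    rbd create vm-disk0 --size 10240

    # attach it to a guest directly through librbd (add the usual VM options)
    qemu-system-x86_64 -m 2048 -drive format=raw,file=rbd:rbd/vm-disk0

For the object side, radosgw runs as a separate daemon in front of the same
cluster and exposes the S3- and Swift-compatible endpoints mentioned above.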



Re: [ceph-users] Initial Phase of Deploying Openstack

2013-04-30 Thread Gregory Farnum
[ Please keep discussions on the list — thanks! :) ]

On Tue, Apr 30, 2013 at 1:09 PM, Chris Coulson  wrote:
> First, thank you for your reply. I guess I wasn't specific enough-- my
> question really should have been: when planning to use Ceph, should there be
> any specific hardware considerations for a consolidated deployment on a
> single node? We're interested in taking advantage of both the object-based
> and block storage offered by Ceph, and I'm not sure if we could simply pick
> up a basic DAS/NAS storage server, install Ceph, and be good to go, or if
> we're missing something.

Ah, yeah. Just pick a server that can hold a bunch of drives. We've
seen that some of them have issues with eg oversubscribed SAS/SATA
expanders so you'll want to check their basic disk capability, and you
want enough compute power to handle each drive (we recommend 1GHz of
CPU and 1GB of RAM per OSD/disk), but within those constraints you can
go wild.
Like John said, you probably want more than one server for a
production deployment (just in terms of reliability), but in terms of
the software you can configure everything to work that way.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
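
As a rough worked example of that rule of thumb: a 12-bay chassis running 12
OSDs wants on the order of 12 GHz of aggregate CPU (say, four 3 GHz cores) and
12 GB of RAM for the OSDs alone, plus headroom for the OS and anything else
running on the box.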

>
>
>
> Regards,
>
> Christopher Coulson
> ---
> Systems Administrator
> CPI Group, Inc.
> 3719 Corporex Park Drive, Suite #50
> Tampa, FL 33619
> Phone: 813.254.6112 (ext. 635)
> Fax:   813.514.0637
> Email: chr...@thecpigroup.com
> ---
>
>
> On Tue, Apr 30, 2013 at 4:03 PM, Gregory Farnum  wrote:
>>
>> On Tue, Apr 30, 2013 at 12:35 PM, Chris Coulson 
>> wrote:
>> > Hopefully I'm performing the correct task to ask a quick,
>> > facepalm-inducing
>> > question:
>> >
>> >
>> > After attending the Openstack Summit in Portland, my company is planning
>> > to
>> > implement our own private cloud and are beginning to develop ideas
>> > regarding
>> > its architecture. Skipping extraneous information: I'm curious to know
>> > if
>> > it's possible to deploy Ceph (we like the idea of combining block and
>> > object-based storage instead of using 'the other guys' separately) on a
>> > SINGLE storage server, and if so-- how would you recommend it be done?
>> > DAS?
>> > NAS? Obviously this is not ideal for failover or redundancy, but for our
>> > initial configuration, we will likely be going this route.
>>
>> I'm not quite sure what you're asking about here. It's perfectly
>> possible to deploy Ceph on a single node; just run an OSD daemon per
>> drive, and a monitor daemon on a drive. You'd have to connect to it
>> through the interfaces you're interested in.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>


Re: [ceph-users] Initial Phase of Deploying Openstack

2013-04-30 Thread Chris Coulson
You guys are great. Thank you very much for the assistance!



Regards,

Christopher Coulson
---
Systems Administrator
CPI Group, Inc.
3719 Corporex Park Drive, Suite #50
Tampa, FL 33619
Phone: 813.254.6112 (ext. 635)
Fax:   813.514.0637
Email: chr...@thecpigroup.com
---


On Tue, Apr 30, 2013 at 4:15 PM, Gregory Farnum  wrote:

> [ Please keep discussions on the list — thanks! :) ]
>
> On Tue, Apr 30, 2013 at 1:09 PM, Chris Coulson 
> wrote:
> > First, thank you for your reply. I guess I wasn't specific enough-- my
> > question really should have been: when planning to use Ceph, should
> there be
> > any specific hardware considerations for a consolidated deployment on a
> > single node? We're interested in taking advantage of both the
> object-based
> > and block storage offered by Ceph, and I'm not sure if we could simply
> pick
> > up a basic DAS/NAS storage server, install Ceph, and be good to go, or if
> > we're missing something.
>
> Ah, yeah. Just pick a server that can hold a bunch of drives. We've
> seen that some of them have issues with eg oversubscribed SAS/SATA
> expanders so you'll want to check their basic disk capability, and you
> want enough compute power to handle each drive (we recommend 1GHz of
> CPU and 1GB of RAM per OSD/disk), but within those constraints you can
> go wild.
> Like John said, you probably want more than one server for a
> production deployment (just in terms of reliability), but in terms of
> the software you can configure everything to work that way.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
> >
> >
> >
> > Regards,
> >
> > Christopher Coulson
> > ---
> > Systems Administrator
> > CPI Group, Inc.
> > 3719 Corporex Park Drive, Suite #50
> > Tampa, FL 33619
> > Phone: 813.254.6112 (ext. 635)
> > Fax:   813.514.0637
> > Email: chr...@thecpigroup.com
> > ---
> >
> >
> > On Tue, Apr 30, 2013 at 4:03 PM, Gregory Farnum 
> wrote:
> >>
> >> On Tue, Apr 30, 2013 at 12:35 PM, Chris Coulson  >
> >> wrote:
> >> > Hopefully I'm performing the correct task to ask a quick,
> >> > facepalm-inducing
> >> > question:
> >> >
> >> >
> >> > After attending the Openstack Summit in Portland, my company is
> planning
> >> > to
> >> > implement our own private cloud and are beginning to develop ideas
> >> > regarding
> >> > its architecture. Skipping extraneous information: I'm curious to know
> >> > if
> >> > it's possible to deploy Ceph (we like the idea of combining block and
> >> > object-based storage instead of using 'the other guys' separately) on
> a
> >> > SINGLE storage server, and if so-- how would you recommend it be done?
> >> > DAS?
> >> > NAS? Obviously this is not ideal for failover or redundancy, but for
> our
> >> > initial configuration, we will likely be going this route.
> >>
> >> I'm not quite sure what you're asking about here. It's perfectly
> >> possible to deploy Ceph on a single node; just run an OSD daemon per
> >> drive, and a monitor daemon on a drive. You'd have to connect to it
> >> through the interfaces you're interested in.
> >> -Greg
> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >
> >
>


Re: [ceph-users] Failed assert when starting new OSDs in 0.60

2013-04-30 Thread Mr. NPP
I'm getting the same issue with one of my OSDs.

Calculating dependencies... done!
[ebuild   R   ~] app-arch/snappy-1.1.0  USE="-static-libs" 0 kB
[ebuild   R   ~] dev-libs/leveldb-1.9.0-r5  USE="snappy -static-libs" 0 kB
[ebuild   R   ~] sys-cluster/ceph-0.60-r1  USE="-debug -fuse -gtk
-libatomic -radosgw -static-libs -tcmalloc" 0 kB

below is my log
https://docs.google.com/file/d/0BwQnRodV8Actd2NQT25FSnA2cjg/edit?usp=sharing

thanks
mr.npp


On Tue, Apr 30, 2013 at 9:17 AM, Travis Rhoden  wrote:

> On the OSD node:
>
> root@cepha0:~# lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:    Ubuntu 12.10
> Release:        12.10
> Codename:       quantal
> root@cepha0:~# dpkg -l "*leveldb*"
> Desired=Unknown/Install/Remove/Purge/Hold
> | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
> |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
> ||/ Name               Version                  Architecture  Description
> +++-==================-========================-=============-==============================
> ii  libleveldb1:armhf  0+20120530.gitdd0d562-2  armhf         fast key-value storage library
> root@cepha0:~# uname -a
> Linux cepha0 3.5.0-27-highbank #46-Ubuntu SMP Mon Mar 25 23:19:40 UTC 2013 armv7l armv7l armv7l GNU/Linux
>
>
> On the MON node:
> # lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:    Ubuntu 12.10
> Release:        12.10
> Codename:       quantal
> # uname -a
> Linux  3.5.0-27-generic #46-Ubuntu SMP Mon Mar 25 19:58:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> # dpkg -l "*leveldb*"
> Desired=Unknown/Install/Remove/Purge/Hold
> | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
> |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
> ||/ Name                 Version                  Architecture  Description
> +++-====================-========================-=============-===================================================
> un  leveldb-doc                                                 (no description available)
> ii  libleveldb-dev:amd64 0+20120530.gitdd0d562-2  amd64         fast key-value storage library (development files)
> ii  libleveldb1:amd64    0+20120530.gitdd0d562-2  amd64         fast key-value storage library
>
>
> On Tue, Apr 30, 2013 at 12:11 PM, Samuel Just wrote:
>
>> What version of leveldb is installed?  Ubuntu/version?
>> -Sam
>>
>> On Tue, Apr 30, 2013 at 8:50 AM, Travis Rhoden  wrote:
>> > Interestingly, the down OSD does not get marked out after 5 minutes.
>> > Probably that is already fixed by http://tracker.ceph.com/issues/4822.
>> >
>> >
>> > On Tue, Apr 30, 2013 at 11:42 AM, Travis Rhoden 
>> wrote:
>> >>
>> >> Hi Sam,
>> >>
>> >> I was prepared to write in and say that the problem had gone away.  I
>> >> tried restarting several OSDs last night in the hopes of capturing the
>> >> problem on an OSD that hadn't failed yet, but didn't have any luck.
>>  So I
>> >> did indeed re-create the cluster from scratch (using mkcephfs), and
>> what do
>> >> you know -- everything worked.  I got everything in a nice stable
>> state,
>> >> then decided to do a full cluster restart, just to be sure.  Sure
>> enough,
>> >> one OSD failed to come up, and has the same stack trace.  So I believe
>> I
>> >> have the log you want -- just from the OSD that failed, right?
>> >>
>> >> Question -- any feeling for what parts of the log you need?  It's 688MB
>> >> uncompressed (two hours!), so I'd like to be able to trim some off for
>> you
>> >> before making it available.  Do you only need/want the part from after
>> the
>> >> OSD was restarted?  Or perhaps the corruption happens on OSD shutdown
>> and
>> >> you need some before that?  If you are fine with that large of a file,
>> I can
>> >> just make that available too.  Let me know.
>> >>
>> >>  - Travis
>> >>
>> >>
>> >> On Mon, Apr 29, 2013 at 6:26 PM, Travis Rhoden 
>> wrote:
>> >>>
>> >>> Hi Sam,
>> >>>
>> >>> No problem, I'll leave that debugging turned up high, and do a
>> mkcephfs
>> >>> from scratch and see what happens.  Not sure if it will happen again
>> or not.
>> >>> =)
>> >>>
>> >>> Thanks again.
>> >>>
>> >>>  - Travis
>> >>>
>> >>>
>> >>> On Mon, Apr 29, 2013 at 5:51 PM, Samuel Just 
>> >>> wrote:
>> 
>>  Hmm, I need logging from when the corruption happened.  If this is
>>  reproducible, can you enable that logging on a clean osd (or better,
>> a
>>  clean cluster) until the assert occurs?
>>  -Sam
>> 
>>  On Mon, Apr 29, 2013 at 2:45 PM, Travis Rhoden 
>>  wrote:
>>  > Also, I can note that it does not take a full cluster restart to
>>  > trigger
>>  > this.  If I