Re: [ceph-users] radosgw secret_key

2015-09-01 Thread Saverio Proto
Look at this:
https://github.com/ncw/rclone/issues/47

Because this is a JSON dump, it encodes the / as \/.

It was a source of confusion for me as well.
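For what it's worth, any real JSON parser undoes that escaping automatically ("\/" is a legal JSON escape for "/"), so the value only looks wrong when read by eye. A quick illustration in Python, with a made-up key:

    import json

    # "\/" is a valid JSON escape for "/", so a JSON parser strips the backslashes.
    dump = '{"user": "test", "secret_key": "AbC\\/dEf\\/123"}'  # made-up key value
    creds = json.loads(dump)
    print(creds["secret_key"])  # prints: AbC/dEf/123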

Best regards

Saverio




2015-08-24 16:58 GMT+02:00 Luis Periquito :
> When I create a new user using radosgw-admin most of the time the secret key
> gets escaped with a backslash, making it not work. Something like
> "secret_key": "xx\/\/".
>
> Why would the "/" need to be escaped? Why is it printing the "\/" instead of
> "/" that does work?
>
> Usually I just remove the backslash and it works fine. I've seen this on
> several different clusters.
>
> Is it just me?
>
> This may require opening a bug in the tracking tool, but just asking here
> first.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] any recommendation of using EnhanceIO?

2015-09-01 Thread Nick Fisk




> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Wang, Zhiqiang
> Sent: 01 September 2015 02:48
> To: Nick Fisk ; 'Samuel Just' 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Nick Fisk
> > Sent: Wednesday, August 19, 2015 5:25 AM
> > To: 'Samuel Just'
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >
> > Hi Sam,
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of Samuel Just
> > > Sent: 18 August 2015 21:38
> > > To: Nick Fisk 
> > > Cc: ceph-users@lists.ceph.com
> > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > >
> > > 1.  We've kicked this around a bit.  What kind of failure semantics
> > > would
> > you
> > > be comfortable with here (that is, what would be reasonable behavior
> > > if
> > the
> > > client side cache fails)?
> >
> > I would either expect to provide the cache with a redundant block
> > device (ie
> > RAID1 SSD's) or the cache to allow itself to be configured to mirror
> > across two SSD's. Of course single SSD's can be used if the user accepts
the
> risk.
> > If the cache did the mirroring then you could do fancy stuff like
> > mirror the writes, but leave the read cache blocks as single copies to
> > increase the cache capacity.
> >
> > In either case although an outage is undesirable, its only data loss
> > which would be unacceptable, which would hopefully be avoided by the
> > mirroring. As part of this, it would need to be a way to make sure a
> > "dirty" RBD can't be accessed unless the corresponding cache is also
> attached.
> >
> > I guess as it caching the RBD and not the pool or entire cluster, the
> > cache only needs to match the failure requirements of the application
its
> caching.
> > If I need to cache a RBD that is on  a single server, there is no
> > requirement to make the cache redundant across
> racks/PDU's/servers...etc.
> >
> > I hope I've answered your question?
> >
> >
> > > 2. We've got a branch which should merge soon (tomorrow probably)
> > > which actually does allow writes to be proxied, so that should
> > > alleviate some of these pain points somewhat.  I'm not sure it is
> > > clever enough to allow through writefulls for an ec base tier though
> > > (but it would be a good
> > idea!) -
> >
> > Excellent news, I shall look forward to testing in the future. I did
> > mention the proxy write for write fulls to someone who was working on
> > the proxy write code, but I'm not sure if it ever got followed up.
> 
> I think someone here is me. In the current code, for an ec base tier,
writefull
> can be proxied to the base.

Excellent news. Is it intelligent enough to determine when, say, a normal write IO
from an RBD is equal to the underlying object size, and then effectively turn that
normal write into a write-full?

> 
> >
> > > Sam
> > >
> > > On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk  wrote:
> > > >
> > > >
> > > >
> > > >
> > > >> -Original Message-
> > > >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > >> Behalf Of Mark Nelson
> > > >> Sent: 18 August 2015 18:51
> > > >> To: Nick Fisk ; 'Jan Schermer' 
> > > >> Cc: ceph-users@lists.ceph.com
> > > >> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > >>
> > > >>
> > > >>
> > > >> On 08/18/2015 11:52 AM, Nick Fisk wrote:
> > > >> > 
> > > >> 
> > > >>  Here's kind of how I see the field right now:
> > > >> 
> > > >>  1) Cache at the client level.  Likely fastest but obvious
> > > >>  issues like
> > > > above.
> > > >>  RAID1 might be an option at increased cost.  Lack of
> > > >>  barriers in some implementations scary.
> > > >> >>>
> > > >> >>> Agreed.
> > > >> >>>
> > > >> 
> > > >>  2) Cache below the OSD.  Not much recent data on this.  Not
> > > >>  likely as fast as client side cache, but likely cheaper
> > > >>  (fewer OSD nodes than client
> > > >> >> nodes?).
> > > >>  Lack of barriers in some implementations scary.
> > > >> >>>
> > > >> >>> This also has the benefit of caching the leveldb on the OSD,
> > > >> >>> so get a big
> > > >> >> performance gain from there too for small sequential writes. I
> > > >> >> looked at using Flashcache for this too but decided it was
> > > >> >> adding to much complexity and risk.
> > > >> >>>
> > > >> >>> I thought I read somewhere that RocksDB allows you to move
> > > >> >>> its WAL to
> > > >> >> SSD, is there anything in the pipeline for something like
> > > >> >> moving the filestore to use RocksDB?
> > > >> >>
> > > >> >> I believe you can already do this, though I haven't tested it.
> > > >> >> You can certainly move the monitors to rocksdb (tested) and
> > > >> >> newstore uses
> > 

Re: [ceph-users] Testing CephFS

2015-09-01 Thread Simon Hallam
Hi Greg, Zheng,

Is this fixed in a later version of the kernel client? Or would it be wise for 
us to start using the fuse client?

Cheers,

Simon

> -Original Message-
> From: Gregory Farnum [mailto:gfar...@redhat.com]
> Sent: 31 August 2015 13:02
> To: Yan, Zheng
> Cc: Simon Hallam; Zheng Yan; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Testing CephFS
> 
> On Mon, Aug 31, 2015 at 12:16 PM, Yan, Zheng  wrote:
> > On Mon, Aug 24, 2015 at 6:38 PM, Gregory Farnum
>  wrote:
> >> On Mon, Aug 24, 2015 at 11:35 AM, Simon  Hallam 
> wrote:
> >>> Hi Greg,
> >>>
> >>> The MDS' detect that the other one went down and started the replay.
> >>>
> >>> I did some further testing with 20 client machines. Of the 20 client
> machines, 5 hung with the following error:
> >>>
> >>> [Aug24 10:53] ceph: mds0 caps stale
> >>> [Aug24 10:54] ceph: mds0 caps stale
> >>> [Aug24 10:58] ceph: mds0 hung
> >>> [Aug24 11:03] ceph: mds0 came back
> >>> [  +8.803334] libceph: mon2 10.15.0.3:6789 socket closed (con state
> OPEN)
> >>> [  +0.18] libceph: mon2 10.15.0.3:6789 session lost, hunting for new
> mon
> >>> [Aug24 11:04] ceph: mds0 reconnect start
> >>> [  +0.084938] libceph: mon2 10.15.0.3:6789 session established
> >>> [  +0.008475] ceph: mds0 reconnect denied
> >>
> >> Oh, this might be a kernel bug, failing to ask for mdsmap updates when
> >> the connection goes away. Zheng, does that sound familiar?
> >> -Greg
> >>
> >
> > I reproduced this locally (use SIGSTOP to stop the monitor) . I think
> > the root cause is that kernel client does not implement
> > CEPH_FEATURE_MSGR_KEEPALIVE2. So the kernel client couldn't reliably
> > detect the event that network cable got unplugged. It kept waiting for
> > new events from the disconnected connection.
> 
> Yeah, the userspace client maintains an ongoing MDSMap subscription
> from the monitors in order to hear about this. It puts more load on
> the monitors but right now that's the solution we're going with: the
> monitor times out the MDS, publishes a series of new maps (pushed to
> the clients) in order to activate a standby, and the clients see that
> they need to connect to the new MDS instance.
> -Greg


Please visit our new website at www.pml.ac.uk and follow us on Twitter  
@PlymouthMarine

Winner of the Environment & Conservation category, the Charity Awards 2014.

Plymouth Marine Laboratory (PML) is a company limited by guarantee registered 
in England & Wales, company number 4178503. Registered Charity No. 1091222. 
Registered Office: Prospect Place, The Hoe, Plymouth  PL1 3DH, UK. 

This message is private and confidential. If you have received this message in 
error, please notify the sender and remove it from your system. You are 
reminded that e-mail communications are not secure and may contain viruses; PML 
accepts no liability for any loss or damage which may be caused by viruses.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Troubleshooting rgw bucket list

2015-09-01 Thread Sam Wouters
Hi, I started the bucket --check --fix on Friday evening and it's
still running. 'ceph -s' shows the cluster health as OK; I don't know if
there is anything else I could check. Is there a way of finding out
whether it's actually doing something?

We only have this issue on the one bucket with versioning enabled; I
can't get rid of the feeling it has something to do with that. The
"underscore bug" is also still present on that bucket
(http://tracker.ceph.com/issues/12819); not sure if that's related in any
way.
Are there any alternatives, for example copying all the objects into a
new bucket without versioning? The simple way would be to list the objects
and copy them to a new bucket, but bucket listing is not working, so...

-Sam


On 31-08-15 10:47, Gregory Farnum wrote:
> This generally shouldn't be a problem at your bucket sizes. Have you
> checked that the cluster is actually in a healthy state? The sleeping
> locks are normal but should be getting woken up; if they aren't it
> means the object access isn't working for some reason. A down PG or
> something would be the simplest explanation.
> -Greg
>
> On Fri, Aug 28, 2015 at 6:52 PM, Sam Wouters  wrote:
>> Ok, maybe I'm to impatient. It would be great if there were some verbose
>> or progress logging of the radosgw-admin tool.
>> I will start a check and let it run over the weekend.
>>
>> tnx,
>> Sam
>>
>> On 28-08-15 18:16, Sam Wouters wrote:
>>> Hi,
>>>
>>> this bucket only has 13389 objects, so the index size shouldn't be a
>>> problem. Also, on the same cluster we have an other bucket with 1200543
>>> objects (but no versioning configured), which has no issues.
>>>
>>> when we run a radosgw-admin bucket --check (--fix), nothing seems to be
>>> happening. Putting an strace on the process shows a lot of lines like these:
>>> [pid 99372] futex(0x2d730d4, FUTEX_WAIT_PRIVATE, 156619, NULL
>>> 
>>> [pid 99385] futex(0x2da9410, FUTEX_WAIT_PRIVATE, 2, NULL 
>>> [pid 99371] futex(0x2da9410, FUTEX_WAKE_PRIVATE, 1 
>>> [pid 99385] <... futex resumed> )   = -1 EAGAIN (Resource
>>> temporarily unavailable)
>>> [pid 99371] <... futex resumed> )   = 0
>>>
>>> but no errors in the ceph logs or health warnings.
>>>
>>> r,
>>> Sam
>>>
>>> On 28-08-15 17:49, Ben Hines wrote:
 How many objects in the bucket?

 RGW has problems with index size once number of objects gets into the
 90+ level. The buckets need to be recreated with 'sharded bucket
 indexes' on:

 rgw override bucket index max shards = 23

 You could also try repairing the index with:

  radosgw-admin bucket check --fix --bucket=

 -Ben

 On Fri, Aug 28, 2015 at 8:38 AM, Sam Wouters  wrote:
> Hi,
>
> we have a rgw bucket (with versioning) where PUT and GET operations for
> specific objects succeed,  but retrieving an object list fails.
> Using python-boto, after a timeout just gives us an 500 internal error;
> radosgw-admin just hangs.
> Also a radosgw-admin bucket check just seems to hang...
>
> ceph version is 0.94.3 but this also was happening with 0.94.2, we
> quietly hoped upgrading would fix but it didn't...
>
> r,
> Sam
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] any recommendation of using EnhanceIO?

2015-09-01 Thread Wang, Zhiqiang
> -Original Message-
> From: Nick Fisk [mailto:n...@fisk.me.uk]
> Sent: Tuesday, September 1, 2015 3:55 PM
> To: Wang, Zhiqiang; 'Nick Fisk'; 'Samuel Just'
> Cc: ceph-users@lists.ceph.com
> Subject: RE: [ceph-users] any recommendation of using EnhanceIO?
> 
> 
> 
> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Wang, Zhiqiang
> > Sent: 01 September 2015 02:48
> > To: Nick Fisk ; 'Samuel Just' 
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of Nick Fisk
> > > Sent: Wednesday, August 19, 2015 5:25 AM
> > > To: 'Samuel Just'
> > > Cc: ceph-users@lists.ceph.com
> > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > >
> > > Hi Sam,
> > >
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > Behalf Of Samuel Just
> > > > Sent: 18 August 2015 21:38
> > > > To: Nick Fisk 
> > > > Cc: ceph-users@lists.ceph.com
> > > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > >
> > > > 1.  We've kicked this around a bit.  What kind of failure
> > > > semantics would
> > > you
> > > > be comfortable with here (that is, what would be reasonable
> > > > behavior if
> > > the
> > > > client side cache fails)?
> > >
> > > I would either expect to provide the cache with a redundant block
> > > device (ie
> > > RAID1 SSD's) or the cache to allow itself to be configured to mirror
> > > across two SSD's. Of course single SSD's can be used if the user
> > > accepts
> the
> > risk.
> > > If the cache did the mirroring then you could do fancy stuff like
> > > mirror the writes, but leave the read cache blocks as single copies
> > > to increase the cache capacity.
> > >
> > > In either case although an outage is undesirable, its only data loss
> > > which would be unacceptable, which would hopefully be avoided by the
> > > mirroring. As part of this, it would need to be a way to make sure a
> > > "dirty" RBD can't be accessed unless the corresponding cache is also
> > attached.
> > >
> > > I guess as it caching the RBD and not the pool or entire cluster,
> > > the cache only needs to match the failure requirements of the
> > > application
> its
> > caching.
> > > If I need to cache a RBD that is on  a single server, there is no
> > > requirement to make the cache redundant across
> > racks/PDU's/servers...etc.
> > >
> > > I hope I've answered your question?
> > >
> > >
> > > > 2. We've got a branch which should merge soon (tomorrow probably)
> > > > which actually does allow writes to be proxied, so that should
> > > > alleviate some of these pain points somewhat.  I'm not sure it is
> > > > clever enough to allow through writefulls for an ec base tier
> > > > though (but it would be a good
> > > idea!) -
> > >
> > > Excellent news, I shall look forward to testing in the future. I did
> > > mention the proxy write for write fulls to someone who was working
> > > on the proxy write code, but I'm not sure if it ever got followed up.
> >
> > I think someone here is me. In the current code, for an ec base tier,
> writefull
> > can be proxied to the base.
> 
> Excellent news. Is this intelligent enough to determine when say a normal 
> write
> IO from a RBD is equal to the underlying object size and then turn this normal
> write effectively into a write full?

I checked the code; it seems we don't do this right now... Would this help much?
I think we can do it if the answer is yes.

> 
> >
> > >
> > > > Sam
> > > >
> > > > On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk  wrote:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >> -Original Message-
> > > > >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > >> Behalf Of Mark Nelson
> > > > >> Sent: 18 August 2015 18:51
> > > > >> To: Nick Fisk ; 'Jan Schermer'
> > > > >> 
> > > > >> Cc: ceph-users@lists.ceph.com
> > > > >> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > > >>
> > > > >>
> > > > >>
> > > > >> On 08/18/2015 11:52 AM, Nick Fisk wrote:
> > > > >> > 
> > > > >> 
> > > > >>  Here's kind of how I see the field right now:
> > > > >> 
> > > > >>  1) Cache at the client level.  Likely fastest but obvious
> > > > >>  issues like
> > > > > above.
> > > > >>  RAID1 might be an option at increased cost.  Lack of
> > > > >>  barriers in some implementations scary.
> > > > >> >>>
> > > > >> >>> Agreed.
> > > > >> >>>
> > > > >> 
> > > > >>  2) Cache below the OSD.  Not much recent data on this.
> > > > >>  Not likely as fast as client side cache, but likely
> > > > >>  cheaper (fewer OSD nodes than client
> > > > >> >> nodes?).
> > > > >>  Lack of barriers in some implementations scary.
> > > > >> >>>
> > > > >> >>> This also has the benefit of cac

Re: [ceph-users] any recommendation of using EnhanceIO?

2015-09-01 Thread Nick Fisk




> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Wang, Zhiqiang
> Sent: 01 September 2015 09:18
> To: Nick Fisk ; 'Samuel Just' 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> 
> > -Original Message-
> > From: Nick Fisk [mailto:n...@fisk.me.uk]
> > Sent: Tuesday, September 1, 2015 3:55 PM
> > To: Wang, Zhiqiang; 'Nick Fisk'; 'Samuel Just'
> > Cc: ceph-users@lists.ceph.com
> > Subject: RE: [ceph-users] any recommendation of using EnhanceIO?
> >
> >
> >
> >
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of Wang, Zhiqiang
> > > Sent: 01 September 2015 02:48
> > > To: Nick Fisk ; 'Samuel Just' 
> > > Cc: ceph-users@lists.ceph.com
> > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > >
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > Behalf Of Nick Fisk
> > > > Sent: Wednesday, August 19, 2015 5:25 AM
> > > > To: 'Samuel Just'
> > > > Cc: ceph-users@lists.ceph.com
> > > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > >
> > > > Hi Sam,
> > > >
> > > > > -Original Message-
> > > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > > Behalf Of Samuel Just
> > > > > Sent: 18 August 2015 21:38
> > > > > To: Nick Fisk 
> > > > > Cc: ceph-users@lists.ceph.com
> > > > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > > >
> > > > > 1.  We've kicked this around a bit.  What kind of failure
> > > > > semantics would
> > > > you
> > > > > be comfortable with here (that is, what would be reasonable
> > > > > behavior if
> > > > the
> > > > > client side cache fails)?
> > > >
> > > > I would either expect to provide the cache with a redundant block
> > > > device (ie
> > > > RAID1 SSD's) or the cache to allow itself to be configured to
> > > > mirror across two SSD's. Of course single SSD's can be used if the
> > > > user accepts
> > the
> > > risk.
> > > > If the cache did the mirroring then you could do fancy stuff like
> > > > mirror the writes, but leave the read cache blocks as single
> > > > copies to increase the cache capacity.
> > > >
> > > > In either case although an outage is undesirable, its only data
> > > > loss which would be unacceptable, which would hopefully be avoided
> > > > by the mirroring. As part of this, it would need to be a way to
> > > > make sure a "dirty" RBD can't be accessed unless the corresponding
> > > > cache is also
> > > attached.
> > > >
> > > > I guess as it caching the RBD and not the pool or entire cluster,
> > > > the cache only needs to match the failure requirements of the
> > > > application
> > its
> > > caching.
> > > > If I need to cache a RBD that is on  a single server, there is no
> > > > requirement to make the cache redundant across
> > > racks/PDU's/servers...etc.
> > > >
> > > > I hope I've answered your question?
> > > >
> > > >
> > > > > 2. We've got a branch which should merge soon (tomorrow
> > > > > probably) which actually does allow writes to be proxied, so
> > > > > that should alleviate some of these pain points somewhat.  I'm
> > > > > not sure it is clever enough to allow through writefulls for an
> > > > > ec base tier though (but it would be a good
> > > > idea!) -
> > > >
> > > > Excellent news, I shall look forward to testing in the future. I
> > > > did mention the proxy write for write fulls to someone who was
> > > > working on the proxy write code, but I'm not sure if it ever got
followed
> up.
> > >
> > > I think someone here is me. In the current code, for an ec base
> > > tier,
> > writefull
> > > can be proxied to the base.
> >
> > Excellent news. Is this intelligent enough to determine when say a
> > normal write IO from a RBD is equal to the underlying object size and
> > then turn this normal write effectively into a write full?
> 
> Checked the code, seems we don't do this right now... Would this be much
> helpful? I think we can do this if the answer is yes.

Hopefully yes. Erasure coding is very well suited to storing backups capacity-wise,
and a lot of backup software can be configured to write in static-size blocks,
which could be set to the object size. With the current tiering code you end up
with a lot of IO amplification and poor performance; if the above feature were
possible, it should perform a lot better.

Does that make sense?

If you are also caching the RBD through some sort of block cache, as
mentioned in this thread, then small sequential writes could also be
assembled in the cache and then flushed straight through to the erasure tier as
proxied full writes. This is probably less appealing than the backup case, but it
gives the same advantage as RAID5/6 with a battery-backed cache, which also sees
massive performance gains when it is able to write a full stripe.
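To make the idea concrete, here is a rough sketch of the check being discussed -- not
actual OSD code, just an illustration of when a plain write could be treated as a
write-full, assuming the default 4 MB RBD object size:

    OBJECT_SIZE = 4 * 1024 * 1024  # default RBD object size (order 22); per-image setting

    def could_be_writefull(offset, length, object_size=OBJECT_SIZE):
        # A write that starts on an object boundary and covers the whole object
        # overwrites it completely, so it could be dispatched as a write-full
        # and proxied straight to the EC base tier without a promotion.
        return offset % object_size == 0 and length == object_size

    print(could_be_writefull(0, OBJECT_SIZE))            # True  -> candidate for proxy write-full
    print(could_be_writefull(OBJECT_SIZE, OBJECT_SIZE))  # True
    print(could_be_writefull(0, 64 * 1024))              # False -> normal write path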

> 
> >
> > >

Re: [ceph-users] any recommendation of using EnhanceIO?

2015-09-01 Thread Wang, Zhiqiang
> -Original Message-
> From: Nick Fisk [mailto:n...@fisk.me.uk]
> Sent: Tuesday, September 1, 2015 4:37 PM
> To: Wang, Zhiqiang; 'Samuel Just'
> Cc: ceph-users@lists.ceph.com
> Subject: RE: [ceph-users] any recommendation of using EnhanceIO?
> 
> 
> 
> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Wang, Zhiqiang
> > Sent: 01 September 2015 09:18
> > To: Nick Fisk ; 'Samuel Just' 
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >
> > > -Original Message-
> > > From: Nick Fisk [mailto:n...@fisk.me.uk]
> > > Sent: Tuesday, September 1, 2015 3:55 PM
> > > To: Wang, Zhiqiang; 'Nick Fisk'; 'Samuel Just'
> > > Cc: ceph-users@lists.ceph.com
> > > Subject: RE: [ceph-users] any recommendation of using EnhanceIO?
> > >
> > >
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > Behalf Of Wang, Zhiqiang
> > > > Sent: 01 September 2015 02:48
> > > > To: Nick Fisk ; 'Samuel Just' 
> > > > Cc: ceph-users@lists.ceph.com
> > > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > >
> > > > > -Original Message-
> > > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > > Behalf Of Nick Fisk
> > > > > Sent: Wednesday, August 19, 2015 5:25 AM
> > > > > To: 'Samuel Just'
> > > > > Cc: ceph-users@lists.ceph.com
> > > > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > > >
> > > > > Hi Sam,
> > > > >
> > > > > > -Original Message-
> > > > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > > > Behalf Of Samuel Just
> > > > > > Sent: 18 August 2015 21:38
> > > > > > To: Nick Fisk 
> > > > > > Cc: ceph-users@lists.ceph.com
> > > > > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > > > >
> > > > > > 1.  We've kicked this around a bit.  What kind of failure
> > > > > > semantics would
> > > > > you
> > > > > > be comfortable with here (that is, what would be reasonable
> > > > > > behavior if
> > > > > the
> > > > > > client side cache fails)?
> > > > >
> > > > > I would either expect to provide the cache with a redundant
> > > > > block device (ie
> > > > > RAID1 SSD's) or the cache to allow itself to be configured to
> > > > > mirror across two SSD's. Of course single SSD's can be used if
> > > > > the user accepts
> > > the
> > > > risk.
> > > > > If the cache did the mirroring then you could do fancy stuff
> > > > > like mirror the writes, but leave the read cache blocks as
> > > > > single copies to increase the cache capacity.
> > > > >
> > > > > In either case although an outage is undesirable, its only data
> > > > > loss which would be unacceptable, which would hopefully be
> > > > > avoided by the mirroring. As part of this, it would need to be a
> > > > > way to make sure a "dirty" RBD can't be accessed unless the
> > > > > corresponding cache is also
> > > > attached.
> > > > >
> > > > > I guess as it caching the RBD and not the pool or entire
> > > > > cluster, the cache only needs to match the failure requirements
> > > > > of the application
> > > its
> > > > caching.
> > > > > If I need to cache a RBD that is on  a single server, there is
> > > > > no requirement to make the cache redundant across
> > > > racks/PDU's/servers...etc.
> > > > >
> > > > > I hope I've answered your question?
> > > > >
> > > > >
> > > > > > 2. We've got a branch which should merge soon (tomorrow
> > > > > > probably) which actually does allow writes to be proxied, so
> > > > > > that should alleviate some of these pain points somewhat.  I'm
> > > > > > not sure it is clever enough to allow through writefulls for
> > > > > > an ec base tier though (but it would be a good
> > > > > idea!) -
> > > > >
> > > > > Excellent news, I shall look forward to testing in the future. I
> > > > > did mention the proxy write for write fulls to someone who was
> > > > > working on the proxy write code, but I'm not sure if it ever got
> followed
> > up.
> > > >
> > > > I think someone here is me. In the current code, for an ec base
> > > > tier,
> > > writefull
> > > > can be proxied to the base.
> > >
> > > Excellent news. Is this intelligent enough to determine when say a
> > > normal write IO from a RBD is equal to the underlying object size
> > > and then turn this normal write effectively into a write full?
> >
> > Checked the code, seems we don't do this right now... Would this be
> > much helpful? I think we can do this if the answer is yes.
> 
> Hopefully yes. Erasure code is very suited to storing backups capacity wise 
> and
> in a lot of backup software you can configure it to write in static size 
> blocks,
> which could be set to the object size. With the current tiering code you end 
> up
> with a lot of IO amplification and poor performance, if the above feature was
> possible, it should perform a lot

Re: [ceph-users] Append data via librados C API in erasure coded pool

2015-09-01 Thread Loic Dachary
Hi,

Like Shylesh said: you need to obey alignment constraints. See 
rados_ioctx_pool_requires_alignment in 
http://ceph.com/docs/hammer/rados/api/librados/
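For illustration, a minimal sketch of an aligned append using the Python rados
bindings. It assumes the required alignment equals the pool's stripe_width shown in
the quoted message below (4096), that your python-rados exposes Ioctx.append(), and
the object name is made up; in C you would query the value with
rados_ioctx_pool_required_alignment() instead of hard-coding it:

    import rados

    ALIGNMENT = 4096  # assumption: the EC pool's stripe_width from the pool dump below

    def aligned_append(ioctx, oid, data, alignment=ALIGNMENT):
        # EC pools only accept appends whose length is a multiple of the required
        # alignment; unaligned appends come back as -95 (Operation not supported).
        # Padding with zeros is just for illustration -- a real application would
        # buffer data until it has an aligned amount to write.
        pad = (-len(data)) % alignment
        ioctx.append(oid, data + b"\0" * pad)

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("Edata_pool")
    aligned_append(ioctx, "myobject", b"some payload")
    ioctx.close()
    cluster.shutdown()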

Cheers

On 01/09/2015 08:49, shylesh kumar wrote:
> I think this could be misaligned writes.
> Is it multiple of 4k ?? Its just a wild guess.
> 
> thanks,
> Shylesh
> 
> On Tue, Sep 1, 2015 at 9:17 AM, Hercules  > wrote:
> 
> Hello,
> 
> I use librados C API rados_append() to append object data in erasure 
> coded pool, it always return -95 (Operation not supported). 
> Buf if i use the same code to append object data in replicated pool, it 
> works fine.
> Does erasure coded pool not support append write?
> 
> Below is my erasure coded pool setting. 
> pool 2 'Edata_pool' erasure size 6 min_size 4 crush_ruleset 8 object_hash 
> rjenkins pg_num 512 pgp_num 512 last_change 139 flags hashpspool stripe_width 
> 4096
> 
> Any advice will appreciate.
> Hercules
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 
> -- 
> Thanks & Regards
> Shylesh Kumar M
>  
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] any recommendation of using EnhanceIO?

2015-09-01 Thread Nick Fisk




> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Wang, Zhiqiang
> Sent: 01 September 2015 09:48
> To: Nick Fisk ; 'Samuel Just' 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> 
> > -Original Message-
> > From: Nick Fisk [mailto:n...@fisk.me.uk]
> > Sent: Tuesday, September 1, 2015 4:37 PM
> > To: Wang, Zhiqiang; 'Samuel Just'
> > Cc: ceph-users@lists.ceph.com
> > Subject: RE: [ceph-users] any recommendation of using EnhanceIO?
> >
> >
> >
> >
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of Wang, Zhiqiang
> > > Sent: 01 September 2015 09:18
> > > To: Nick Fisk ; 'Samuel Just' 
> > > Cc: ceph-users@lists.ceph.com
> > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > >
> > > > -Original Message-
> > > > From: Nick Fisk [mailto:n...@fisk.me.uk]
> > > > Sent: Tuesday, September 1, 2015 3:55 PM
> > > > To: Wang, Zhiqiang; 'Nick Fisk'; 'Samuel Just'
> > > > Cc: ceph-users@lists.ceph.com
> > > > Subject: RE: [ceph-users] any recommendation of using EnhanceIO?
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > > Behalf Of Wang, Zhiqiang
> > > > > Sent: 01 September 2015 02:48
> > > > > To: Nick Fisk ; 'Samuel Just'
> > > > > 
> > > > > Cc: ceph-users@lists.ceph.com
> > > > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > > >
> > > > > > -Original Message-
> > > > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > > > Behalf Of Nick Fisk
> > > > > > Sent: Wednesday, August 19, 2015 5:25 AM
> > > > > > To: 'Samuel Just'
> > > > > > Cc: ceph-users@lists.ceph.com
> > > > > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > > > >
> > > > > > Hi Sam,
> > > > > >
> > > > > > > -Original Message-
> > > > > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
> > > > > > > On Behalf Of Samuel Just
> > > > > > > Sent: 18 August 2015 21:38
> > > > > > > To: Nick Fisk 
> > > > > > > Cc: ceph-users@lists.ceph.com
> > > > > > > Subject: Re: [ceph-users] any recommendation of using
> EnhanceIO?
> > > > > > >
> > > > > > > 1.  We've kicked this around a bit.  What kind of failure
> > > > > > > semantics would
> > > > > > you
> > > > > > > be comfortable with here (that is, what would be reasonable
> > > > > > > behavior if
> > > > > > the
> > > > > > > client side cache fails)?
> > > > > >
> > > > > > I would either expect to provide the cache with a redundant
> > > > > > block device (ie
> > > > > > RAID1 SSD's) or the cache to allow itself to be configured to
> > > > > > mirror across two SSD's. Of course single SSD's can be used if
> > > > > > the user accepts
> > > > the
> > > > > risk.
> > > > > > If the cache did the mirroring then you could do fancy stuff
> > > > > > like mirror the writes, but leave the read cache blocks as
> > > > > > single copies to increase the cache capacity.
> > > > > >
> > > > > > In either case although an outage is undesirable, its only
> > > > > > data loss which would be unacceptable, which would hopefully
> > > > > > be avoided by the mirroring. As part of this, it would need to
> > > > > > be a way to make sure a "dirty" RBD can't be accessed unless
> > > > > > the corresponding cache is also
> > > > > attached.
> > > > > >
> > > > > > I guess as it caching the RBD and not the pool or entire
> > > > > > cluster, the cache only needs to match the failure
> > > > > > requirements of the application
> > > > its
> > > > > caching.
> > > > > > If I need to cache a RBD that is on  a single server, there is
> > > > > > no requirement to make the cache redundant across
> > > > > racks/PDU's/servers...etc.
> > > > > >
> > > > > > I hope I've answered your question?
> > > > > >
> > > > > >
> > > > > > > 2. We've got a branch which should merge soon (tomorrow
> > > > > > > probably) which actually does allow writes to be proxied, so
> > > > > > > that should alleviate some of these pain points somewhat.
> > > > > > > I'm not sure it is clever enough to allow through writefulls
> > > > > > > for an ec base tier though (but it would be a good
> > > > > > idea!) -
> > > > > >
> > > > > > Excellent news, I shall look forward to testing in the future.
> > > > > > I did mention the proxy write for write fulls to someone who
> > > > > > was working on the proxy write code, but I'm not sure if it
> > > > > > ever got
> > followed
> > > up.
> > > > >
> > > > > I think someone here is me. In the current code, for an ec base
> > > > > tier,
> > > > writefull
> > > > > can be proxied to the base.
> > > >
> > > > Excellent news. Is this intelligent enough to determine when say a
> > > > normal write IO from a RBD is equal to the underlying object size
> > > > and then turn this normal write effe

Re: [ceph-users] Testing CephFS

2015-09-01 Thread Yan, Zheng

> On Sep 1, 2015, at 16:13, Simon Hallam  wrote:
> 
> Hi Greg, Zheng,
> 
> Is this fixed in a later version of the kernel client? Or would it be wise 
> for us to start using the fuse client?
> 
> Cheers,

I just wrote a fix:
https://github.com/ceph/ceph-client/commit/33b68dde7f27927a7cb1a7691e3c5b6f847ffd14
Yes, you should try ceph-fuse if this bug causes problems for you.

Regards
Yan, Zheng

> 
> Simon
> 
>> -Original Message-
>> From: Gregory Farnum [mailto:gfar...@redhat.com]
>> Sent: 31 August 2015 13:02
>> To: Yan, Zheng
>> Cc: Simon Hallam; Zheng Yan; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Testing CephFS
>> 
>> On Mon, Aug 31, 2015 at 12:16 PM, Yan, Zheng  wrote:
>>> On Mon, Aug 24, 2015 at 6:38 PM, Gregory Farnum
>>  wrote:
 On Mon, Aug 24, 2015 at 11:35 AM, Simon  Hallam 
>> wrote:
> Hi Greg,
> 
> The MDS' detect that the other one went down and started the replay.
> 
> I did some further testing with 20 client machines. Of the 20 client
>> machines, 5 hung with the following error:
> 
> [Aug24 10:53] ceph: mds0 caps stale
> [Aug24 10:54] ceph: mds0 caps stale
> [Aug24 10:58] ceph: mds0 hung
> [Aug24 11:03] ceph: mds0 came back
> [  +8.803334] libceph: mon2 10.15.0.3:6789 socket closed (con state
>> OPEN)
> [  +0.18] libceph: mon2 10.15.0.3:6789 session lost, hunting for new
>> mon
> [Aug24 11:04] ceph: mds0 reconnect start
> [  +0.084938] libceph: mon2 10.15.0.3:6789 session established
> [  +0.008475] ceph: mds0 reconnect denied
 
 Oh, this might be a kernel bug, failing to ask for mdsmap updates when
 the connection goes away. Zheng, does that sound familiar?
 -Greg
 
>>> 
>>> I reproduced this locally (use SIGSTOP to stop the monitor) . I think
>>> the root cause is that kernel client does not implement
>>> CEPH_FEATURE_MSGR_KEEPALIVE2. So the kernel client couldn't reliably
>>> detect the event that network cable got unplugged. It kept waiting for
>>> new events from the disconnected connection.
>> 
>> Yeah, the userspace client maintains an ongoing MDSMap subscription
>> from the monitors in order to hear about this. It puts more load on
>> the monitors but right now that's the solution we're going with: the
>> monitor times out the MDS, publishes a series of new maps (pushed to
>> the clients) in order to activate a standby, and the clients see that
>> they need to connect to the new MDS instance.
>> -Greg
> 
> 
> Please visit our new website at www.pml.ac.uk and follow us on Twitter  
> @PlymouthMarine
> 
> Winner of the Environment & Conservation category, the Charity Awards 2014.
> 
> Plymouth Marine Laboratory (PML) is a company limited by guarantee registered 
> in England & Wales, company number 4178503. Registered Charity No. 1091222. 
> Registered Office: Prospect Place, The Hoe, Plymouth  PL1 3DH, UK. 
> 
> This message is private and confidential. If you have received this message 
> in error, please notify the sender and remove it from your system. You are 
> reminded that e-mail communications are not secure and may contain viruses; 
> PML accepts no liability for any loss or damage which may be caused by 
> viruses.
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD won't go up after node reboot

2015-09-01 Thread Евгений Д .
The data lives in another container attached to the OSD container as a Docker volume.
According to `deis ps -a`, this volume was created two weeks ago, though
all files in `current` are very recent. I suspect that something removed the
files in the data volume after the reboot. As the reboot was caused by a CoreOS
update, it might be the newer version of Docker (1.6 -> 1.7) that introduced
the problem. Or maybe it was the container initialization process that somehow
removed and recreated the files.
I don't have this data volume anymore, so I can only guess.

2015-08-31 18:28 GMT+03:00 Jan Schermer :

> Is it possible that something else was mounted there?
> Or is it possible nothing was mounted there?
> That would explain such behaviour...
>
> Jan
>
> On 31 Aug 2015, at 17:07, Евгений Д.  wrote:
>
> No, it really was in the cluster. Before reboot cluster had HEALTH_OK.
> Though now I've checked `current` directory and it doesn't contain any
> data:
>
> root@staging-coreos-1:/var/lib/ceph/osd/ceph-0# ls current
> commit_op_seq  meta  nosnap  omap
>
> while other OSDs do. It really looks like something was broken on reboot,
> probably during container start, so it's not really related to Ceph. I'll
> go with OSD recreation.
>
> Thank you.
>
> 2015-08-31 11:50 GMT+03:00 Gregory Farnum :
>
>> On Sat, Aug 29, 2015 at 3:32 PM, Евгений Д.  wrote:
>> > I'm running 3-node cluster with Ceph (it's Deis cluster, so Ceph
>> daemons are
>> > containerized). There are 3 OSDs and 3 mons. After rebooting all nodes
>> one
>> > by one all monitors are up, but only two OSDs of three are up. 'Down'
>> OSD is
>> > really running but is never marked up/in.
>> > All three mons are reachable from inside the OSD container.
>> > I've run `log dump` for this OSD and found this line:
>> >
>> > Aug 29 06:19:39 staging-coreos-1 sh[7393]: -99> 2015-08-29
>> 06:18:51.855432
>> > 7f5902009700  3 osd.0 0 handle_osd_map epochs [1,90], i have 0, src has
>> > [1,90]
>> >
>> > Is it the reason why OSD cannot connect to the cluster? If yes, why
>> could it
>> > happen? I haven't removed any data from /var/lib/ceph/osd.
>> > Is it possible to bring this OSD back to cluster without completely
>> > recreating it?
>> >
>> > Ceph version is:
>> >
>> > root@staging-coreos-1:/# ceph -v
>> > ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
>>
>> It's pretty unlikely. I presume (since the OSD has no maps) that it's
>> never actually been up and in the cluster? Or else its data store has
>> been pretty badly corrupted since it doesn't have any of the requisite
>> metadata. In which case you'll probably be best off recreating it
>> (with 3 OSDs I assume all your PGs are still active).
>> -Greg
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph distributed osd

2015-09-01 Thread gjprabu
Hi Robert,



We are going to use Ceph with OCFS2 in production. My question: the RBD is
mounted on 12 clients using OCFS2 clustering, and the network for both servers and
clients will be 1 Gbit. Is the throughput performance OK for this setup?



Regards

Prabu




  On Thu, 20 Aug 2015 02:15:53 +0530 gjprabu  
wrote 




Hi Robert,



 Thanks for your reply. We understand the scenarios.





Regards

Prabu








  On Thu, 20 Aug 2015 00:15:41 +0530 rob...@leblancnet.us wrote 




 

By default, all pools will use all OSDs. Each RBD, for instance, is 

broken up into 4 MB objects and those objects are somewhat uniformly 

distributed between the OSDs. When you add another OSD, the CRUSH map 

is recalculated and the OSDs shuffle the objects to their new 

locations somewhat uniformly distributing them across all available 

OSDs. 

 

I say uniformly distributed because it is based on the hashing 

algorithm of the name and size is not taken into account. So you may 

have more larger objects on some OSDs than others. The number of PGs 

affect the ability to more uniformly distribute the data (more hash 

buckets for data to land in). 
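A toy illustration of that hashing (plain md5 modulo OSD count -- not real
rjenkins/CRUSH placement, and the image prefix is made up) for a hypothetical 10 GB
image split into 4 MB objects:

    import hashlib
    from collections import Counter

    image_size = 10 * 1024 ** 3   # hypothetical 10 GB RBD image
    object_size = 4 * 1024 ** 2   # 4 MB objects
    num_osds = 3

    # Made-up object names; real placement hashes names into PGs, then CRUSH maps PGs to OSDs.
    names = ["rbd_data.1234.%016x" % i for i in range(image_size // object_size)]
    placement = Counter(int(hashlib.md5(n.encode()).hexdigest(), 16) % num_osds for n in names)
    print(placement)  # ~2560 objects end up split roughly evenly across the 3 buckets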

 

You can create CRUSH rules that limit selection of OSDs to a subset 

and then configure a pool to use those rules. This is a pretty 

advanced configuration option. 

 

I hope that helps with your question. 


 

Robert LeBlanc 

PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 

 

 

On Tue, Aug 18, 2015 at 8:26 AM, gjprabu  wrote: 

> Hi Luis, 

> 

> What i mean , we have three OSD with Harddisk size each 1TB and two 

> pool (poolA and poolB) with replica 2. Here writing behavior is the 

> confusion for us. Our assumptions is below. 

> 

> PoolA -- may write with OSD1 and OSD2 (is this correct) 

> 

> PoolB -- may write with OSD3 and OSD1 (is this correct) 

> 

> suppose the hard disk size got full , then how many OSD's need to be added 

> and How will be the writing behavior to new OSD's 

> 

> After added few osd's 

> 

> PoolA -- may write with OSD4 and OSD5 (is this correct) 

> PoolB -- may write with OSD5 and OSD6 (is this correct) 

> 

> 

> Regards 

> Prabu 

> 

>  On Mon, 17 Aug 2015 19:41:53 +0530 Luis Periquito 
 

> wrote  

> 

> I don't understand your question? You created a 1G RBD/disk and it's full. 

> You are able to grow it though - but that's a Linux management issue, not 

> ceph. 

> 

> As everything is thin-provisioned you can create a RBD with an arbitrary 

> size - I've create one with 1PB when the cluster only had 600G/Raw 

> available. 

> 

> On Mon, Aug 17, 2015 at 1:18 PM, gjprabu  
wrote: 

> 

> Hi All, 

> 

> Anybody can help on this issue. 

> 

> Regards 

> Prabu 

> 

>  On Mon, 17 Aug 2015 12:08:28 +0530 gjprabu 
 wrote 

>  

> 

> Hi All, 

> 

> Also please find osd information. 

> 

> ceph osd dump | grep 'replicated size' 

> pool 2 'repo' replicated size 2 min_size 2 crush_ruleset 0 object_hash 

> rjenkins pg_num 126 pgp_num 126 last_change 21573 flags hashpspool 

> stripe_width 0 

> 

> Regards 

> Prabu 

> 

> 

> 

> 

>  On Mon, 17 Aug 2015 11:58:55 +0530 gjprabu 
 wrote 

>  

> 

> 

> 

> Hi All, 

> 

> We need to test three OSD and one image with replica 2(size 1GB). While 

> testing data is not writing above 1GB. Is there any option to write on 
third 

> OSD. 

> 

> ceph osd pool get repo pg_num 

> pg_num: 126 

> 

> # rbd showmapped 

> id pool image snap device 

> 0 rbd integdownloads - /dev/rbd0 -- Already one 

> 2 repo integrepotest - /dev/rbd2 -- newly created 

> 

> 

> [root@hm2 repository]# df -Th 

> Filesystem Type Size Used Avail Use% Mounted on 

> /dev/sda5 ext4 289G 18G 257G 7% / 

> devtmpfs devtmpfs 252G 0 252G 0% /dev 

> tmpfs tmpfs 252G 0 252G 0% /dev/shm 

> tmpfs tmpfs 252G 538M 252G 1% /run 

> tmpfs tmpfs 252G 0 252G 0% /sys/f

Re: [ceph-users] Firefly to Hammer Upgrade -- HEALTH_WARN; too many PGs per OSD (480 > max 300)

2015-09-01 Thread 10 minus
Hi Greg,

Thanks for the update. I think the Ceph documentation should be reworded.

--snip--

http://ceph.com/docs/master/rados/operations/placement-groups/#choosing-the-number-of-placement-groups

* Less than 5 OSDs set pg_num to 128
* Between 5 and 10 OSDs set pg_num to 512
* Between 10 and 50 OSDs set pg_num to 4096
* If you have more than 50 OSDs, you need to understand the tradeoffs
and how to calculate the pg_num value by yourself

--snip--
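For what it's worth, the number in the warning is just the total count of PG replicas
divided by the number of OSDs, which you can check against the 'ceph -s' output quoted
below (a rough calculation, assuming all four pools use the default size of 3):

    # From the status below: 1920 PGs, 12 OSDs, osd_pool_default_size = 3.
    total_pgs = 1920
    replica_size = 3
    num_osds = 12

    pg_replicas_per_osd = total_pgs * replica_size // num_osds
    print(pg_replicas_per_osd)  # 480 -> "too many PGs per OSD (480 > max 300)"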




On Mon, Aug 31, 2015 at 10:31 AM, Gregory Farnum  wrote:

> On Mon, Aug 31, 2015 at 8:30 AM, 10 minus  wrote:
> > Hi ,
> >
> > I 'm in the process of upgrading my ceph cluster from Firefly to Hammer.
> >
> > The ceph cluster has 12 OSD spread across 4 nodes.
> >
> > Mons have been upgraded to hammer, since I have created pools  with value
> > 512 and 256 , so am bit confused with the warning message.
> >
> > --snip--
> >
> > ceph -s
> > cluster a7160e16-0aaf-4e78-9e7c-7fbec08642f0
> >  health HEALTH_WARN
> > too many PGs per OSD (480 > max 300)
> >  monmap e1: 3 mons at
> > {mon01=
> 172.16.10.5:6789/0,mon02=172.16.10.6:6789/0,mon03=172.16.10.7:6789/0}
> > election epoch 116, quorum 0,1,2 mon01,mon02,mon03
> >  osdmap e6814: 12 osds: 12 up, 12 in
> >   pgmap v2961763: 1920 pgs, 4 pools, 230 GB data, 29600 objects
> > 692 GB used, 21652 GB / 22345 GB avail
> > 1920 active+clean
> >
> >
> >
> > --snip--
> >
> >
> > ## Conf and ceph output
> >
> > --snip--
> >
> > [global]
> > fsid = a7160e16-0aaf-4e78-9e7c-7fbec08642f0
> > public_network = 172.16.10.0/24
> > cluster_network = 172.16.10.0/24
> > mon_initial_members = mon01, mon02, mon03
> > mon_host = 172.16.10.5,172.16.10.6,172.16.10.7
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > filestore_xattr_use_omap = true
> > mon_clock_drift_allowed = .15
> > mon_clock_drift_warn_backoff = 30
> > mon_osd_down_out_interval = 300
> > mon_osd_report_timeout = 300
> > mon_osd_full_ratio = .85
> > mon_osd_nearfull_ratio = .75
> > osd_backfill_full_ratio = .75
> > osd_pool_default_size = 3
> > osd_pool_default_min_size = 2
> > osd_pool_default_pg_num = 512
> > osd_pool_default_pgp_num = 512
> > --snip--
> >
> > ceph df
> >
> >
> > POOLS:
> > NAMEID   USED %USED   MAX AVAIL OBJECTS
> > images3 216G0.97 7179G
>  27793
> > vms4  14181M0.06 7179G
> 1804
> > volumes  50 0 7179G
> > 1
> > backups  60 0 7179G
> > 0
> >
> >
> > ceph osd pool get poolname pg_num
> > images: 256
> > backup: 512
> > vms: 512
> > volumes: 512
> >
> > --snip--
> >
> > Since it is a warning .. can I upgrade the OSDs without destroying the
> data.
> > or
> > Should I roll back.
>
> It's not a problem, just a diagnostic warning that appears to be
> misbehaving. If you can create a bug at tracker.ceph.com listing what
> Ceph versions are involved and exactly what's happened it can get
> investigated, but you should feel free to keep upgrading. :)
> -Greg
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How should I deal with placement group numbers when reducing number of OSDs

2015-09-01 Thread Jan Schermer
Hi,
we're in the process of changing 480G drives for 1200G drives, which should cut 
the number of OSDs I have roughly to 1/3.

My largest "volumes" pool for OpenStack volumes has 16384 PGs at the moment and 
I have 36K PGs in total. That equals to ~180 PGs/OSD and would become ~500 PG/s 
OSD.
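A rough back-of-the-envelope check of that projection (assuming the PG count stays
fixed while the OSD count drops to roughly a third; if I recall correctly the warning
threshold itself is mon_pg_warn_max_per_osd, default 300):

    current_pgs_per_osd = 180     # ~180 PGs/OSD today, from the figures above
    remaining_fraction = 1.0 / 3  # 480 GB drives swapped for 1200 GB drives

    projected = current_pgs_per_osd / remaining_fraction
    print(projected)  # ~540, in line with the ~500 PGs/OSD estimate above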

I know I can't actually decrease the number of PGs in a pool, and I'm wondering 
if it's worth working around to decrease the numbers? It is possible I'll be 
expanding the storage in the future, but probably not 3-fold. 

I think it's not worth bothering with and I'll just have to disable the "too 
many PGs per OSD" warning if I upgrade.

I already put some new drives in and the OSDs seem to work fine (though I had 
to restart them after backfilling - they were spinning CPU for no apparent 
reason).

Your thoughts?

Thanks
Jan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Performance Questions with rbd images access by qemu-kvm

2015-09-01 Thread Kenneth Van Alstyne
Thanks for the awesome advice folks.  Until I can go larger scale (50+ SATA 
disks), I’m thinking my best option here is to just swap out these 1TB SATA 
disks with 1TB SSDs.  Am I oversimplifying the short term solution?

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
Service-Disabled Veteran-Owned Business
1775 Wiehle Avenue Suite 101 | Reston, VA 20190
c: 228-547-8045 f: 571-266-3106
www.knightpoint.com 
DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
GSA Schedule 70 SDVOSB: GS-35F-0646S
GSA MOBIS Schedule: GS-10F-0404Y
ISO 2 / ISO 27001

Notice: This e-mail message, including any attachments, is for the sole use of 
the intended recipient(s) and may contain confidential and privileged 
information. Any unauthorized review, copy, use, disclosure, or distribution is 
STRICTLY prohibited. If you are not the intended recipient, please contact the 
sender by reply e-mail and destroy all copies of the original message.

> On Aug 31, 2015, at 7:29 PM, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> On Mon, 31 Aug 2015 12:28:15 -0500 Kenneth Van Alstyne wrote:
> 
> In addition to the spot on comments by Warren and Quentin, verify this by
> watching your nodes with atop, iostat, etc. 
> The culprit (HDDs) should be plainly visible.
> 
> More inline:
> 
>> Christian, et al:
>> 
>> Sorry for the lack of information.  I wasn’t sure what of our hardware
>> specifications or Ceph configuration was useful information at this
>> point.  Thanks for the feedback — any feedback, is appreciated at this
>> point, as I’ve been beating my head against a wall trying to figure out
>> what’s going on.  (If anything.  Maybe the spindle count is indeed our
>> upper limit or our SSDs really suck? :-) )
>> 
> Your SSDs aren't the problem.
> 
>> To directly address your questions, see answers below:
>>  - CBT is the Ceph Benchmarking Tool.  Since my question was more
>> generic rather than with CBT itself, it was probably more useful to post
>> in the ceph-users list rather than cbt.
>>  - 8 Cores are from 2x quad core Intel(R) Xeon(R) CPU E5-2609 0 @
>> 2.40GHz
> Not your problem either.
> 
>>  - The SSDs are indeed Intel S3500s.  I agree — not ideal, but
>> supposedly capable of up to 75,000 random 4KB reads/writes.  Throughput
>> and longevity is quite low for an SSD, rated at about 400MB/s reads and
>> 100MB/s writes, though.  When we added these as journals in front of the
>> SATA spindles, both VM performance and rados benchmark numbers were
>> relatively unchanged.
>> 
> The only thing relevant in regards to journal SSDs is the sequential write
> speed (SYNC), they don't seek and normally don't get read either.
> This is why a 200GB DC S3700 is a better journal SSD than the 200GB S3710
> which is faster in any other aspect but sequential writes. ^o^
> 
> Latency should have gone down with the SSD journals in place, but that's
> their main function/benefit. 
> 
>>  - Regarding throughput vs iops, indeed — the throughput that I’m
>> seeing is nearly worst case scenario, with all I/O being 4KB block
>> size.  With RBD cache enabled and the writeback option set in the VM
>> configuration, I was hoping more coalescing would occur, increasing the
>> I/O block size.
>> 
> That can only help with non-SYNC writes, so your MySQL VMs and certain
> file system ops will have to bypass that and that hurts.
> 
>> As an aside, the orchestration layer on top of KVM is OpenNebula if
>> that’s of any interest.
>> 
> It is actually, as I've been eying OpenNebula (alas no Debian Jessie
> packages). However not relevant to your problem indeed.
> 
>> VM information:
>>  - Number = 15
>>  - Worload = Mixed (I know, I know — that’s as vague of an answer
>> as they come)  A handful of VMs are running some MySQL databases and
>> some web applications in Apache Tomcat.  One is running a syslog
>> server.  Everything else is mostly static web page serving for a low
>> number of users.
>> 
> As others have mentioned, would you expect this load to work well with
> just 2 HDDs and via NFS to introduce network latency?
> 
>> I can duplicate the blocked request issue pretty consistently, just by
>> running something simple like a “yum -y update” in one VM.  While that
>> is running, ceph -w and ceph -s show the following: root@dashboard:~#
>> ceph -s cluster f79d8c2a-3c14-49be-942d-83fc5f193a25 health HEALTH_WARN
>>1 requests are blocked > 32 sec
>> monmap e3: 3 mons at
>> {storage-1=10.0.0.1:6789/0,storage-2=10.0.0.2:6789/0,storage-3=10.0.0.3:6789/0}
>> election epoch 136, quorum 0,1,2 storage-1,storage-2,storage-3 osdmap
>> e75590: 6 osds: 6 up, 6 in pgmap v3495103: 224 pgs, 1 pools, 826 GB
>> data, 225 kobjects 2700 GB used, 2870 GB / 5571 GB avail
>> 224 active+clean
>>  client io 3292 B/s rd, 2623 kB/s wr, 81 op/s
>> 
> [snip]
>> 466 kB/s rd, 1863 kB/s wr, 148 op/s
>> 
> This is a good sample, unless your reads can be satisfied from page cache
> on your storage nodes or

[ceph-users] Moving/Sharding RGW Bucket Index

2015-09-01 Thread Daniel Maraio

Hello,

  I have two large buckets in my RGW and I think the performance is 
being impacted by the bucket index. One bucket contains 9 million 
objects and the other one has 22 million. I'd like to shard the bucket 
index and also change the ruleset of the .rgw.buckets.index pool to put 
it on our SSD root. I could not find any documentation on this issue. It 
looks like the bucket indexes can be rebuilt using the radosgw-admin 
bucket check command but I'm not sure how to proceed. We can stop writes 
or take the cluster down completely if necessary. My initial thought was 
to backup the existing index pool and create a new one. I'm not sure if 
I can change the index_pool of an existing bucket. If that is possible I 
assume I can change that to my new pool and execute a radosgw-admin 
bucket check command to rebuild/shard the index.


  Does anyone have experience getting sharding running with an existing 
bucket, or even moving the index pool to a different ruleset? When I change 
the CRUSH ruleset for the .rgw.buckets.index pool to my SSD root we run into 
issues: buckets cannot be created or listed and writes cease to work, though 
reads seem to work fine. Thanks for your time!


- Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] how to improve ceph cluster capacity usage

2015-09-01 Thread huang jun
Hi all,

Recently I did some experiments on OSD data distribution.
We set up a cluster with 72 OSDs, all 2TB SATA disks,
running Ceph v0.94.3 on Linux kernel 3.18,
with "ceph osd crush tunables optimal" set.
There are 3 pools:
pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 302 stripe_width 0
pool 1 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 4096 pgp_num 4096 last_change 832
crash_replay_interval 45 stripe_width 0
pool 2 'metadata' replicated size 3 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 512 pgp_num 512 last_change 302
stripe_width 0

the osd pg num of each osd:
pool  :   0     1     2   | SUM
-------------------------------
osd.0     13   105    18  | 136
osd.1     17   110    26  | 153
osd.2     15   114    20  | 149
osd.3     11   101    17  | 129
osd.4      8   106    17  | 131
osd.5     12   102    19  | 133
osd.6     19   114    29  | 162
osd.7     16   115    21  | 152
osd.8     15   117    25  | 157
osd.9     13   117    23  | 153
osd.10    13   133    16  | 162
osd.11    14   105    21  | 140
osd.12    11    94    16  | 121
osd.13    12   110    21  | 143
osd.14    20   119    26  | 165
osd.15    12   125    19  | 156
osd.16    15   126    22  | 163
osd.17    13   109    19  | 141
osd.18     8   119    19  | 146
osd.19    14   114    19  | 147
osd.20    17   113    29  | 159
osd.21    17   111    27  | 155
osd.22    13   121    20  | 154
osd.23    14    95    23  | 132
osd.24    17   110    26  | 153
osd.25    13   133    15  | 161
osd.26    17   124    24  | 165
osd.27    16   119    20  | 155
osd.28    19   134    30  | 183
osd.29    13   121    20  | 154
osd.30    11    97    20  | 128
osd.31    12   109    18  | 139
osd.32    10   112    15  | 137
osd.33    18   114    28  | 160
osd.34    19   112    29  | 160
osd.35    16   121    32  | 169
osd.36    13   111    18  | 142
osd.37    15   107    22  | 144
osd.38    21   129    24  | 174
osd.39     9   121    17  | 147
osd.40    11   102    18  | 131
osd.41    14   101    19  | 134
osd.42    16   119    25  | 160
osd.43    12   118    13  | 143
osd.44    17   114    25  | 156
osd.45    11   114    15  | 140
osd.46    12   107    16  | 135
osd.47    15   111    23  | 149
osd.48    14   115    20  | 149
osd.49     9    94    13  | 116
osd.50    14   117    18  | 149
osd.51    13   112    19  | 144
osd.52    11   126    22  | 159
osd.53    12   122    18  | 152
osd.54    13   121    20  | 154
osd.55    17   114    25  | 156
osd.56    11   118    18  | 147
osd.57    22   137    25  | 184
osd.58    15   105    22  | 142
osd.59    13   120    18  | 151
osd.60    12   110    19  | 141
osd.61    21   114    28  | 163
osd.62    12    97    18  | 127
osd.63    19   109    31  | 159
osd.64    10   132    21  | 163
osd.65    19   137    21  | 177
osd.66    22   107    32  | 161
osd.67    12   107    20  | 139
osd.68    14   100    22  | 136
osd.69    16   110    24  | 150
osd.70     9   101    14  | 124
osd.71    15   112    24  | 151
-------------------------------
SUM   : 1024  8192  1536  |

We can see that, for poolid=1 (the data pool),
osd.57 and osd.65 both have 137 PGs while osd.12 and osd.49 only have 94 PGs,
which may cause data distribution imbalance and reduce the space
utilization of the cluster.

Use "crushtool -i crush.raw --test --show-mappings --rule 0 --num-rep
2 --min-x 1 --max-x %s"
we tested different pool pg_num:

Total PG num   PG num stats
------------   ------------
4096           avg: 113.78  (avg stands for the average PG num per OSD)
               total: 8192  (total stands for the total PG num, including replica PGs)
               max: 139 +0.221680  (max stands for the max PG num on an OSD; +0.221680 is the fraction above the average PG num)
               min: 113 -0.226562  (min stands for the min PG num on an OSD; -0.226562 is the fraction below the average PG num)

8192           avg: 227.56
               total: 16384
               max: 267 +0.173340
               min: 226 -0.129883

16384          avg: 455.11
               total: 32768
               max: 502 +0.103027
               min: 455 -0.127686

32768          avg: 910.22
               total: 65536
               max: 966 +0.061279
               min: 910 -0.076050

With bigger pg_num, the gap between the maximum and the minimum decreased.
But it's unreasonable to set such a large pg_num, which would increase
the OSD and MON load.
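
For completeness, the crush.raw used above can be dumped from the live
cluster and re-tested for any candidate pg_num roughly like this (the
file names and the max-x value, i.e. the simulated pg_num, are just
examples):

  ceph osd getcrushmap -o crush.raw
  crushtool -i crush.raw --test --show-mappings --rule 0 --num-rep 2 \
      --min-x 1 --max-x 4096 > mappings.txt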

Is there any way to get a more balanced PG distribution across the cluster?
We tried "ceph osd reweight-by-pg 110 data" many times, but that did
not resolve the problem.

Another question: even if we can ensure the PGs are distributed
evenly, can we ensure the data distribution is balanced in the same
way?

Btw, we will write data to this cluster until one or more OSDs get
full; we have set the full ratio to 0.98, and we expect to be able to
use 0.9 of the total capacity.

Any tips are welcome.

-- 
thanks
huangjun
__

Re: [ceph-users] How to disable object-map and exclusive features ?

2015-09-01 Thread Christoph Adomeit
Hi Jason,

I have a coredump with the size of 1200M compressed .

Where shall i put the dump  ?

I think the crashes are often caused when I do a snapshot backup of the
vm-images.
Then something happens with locking which causes the vm to crash.

Thanks
  Christoph

On Mon, Aug 31, 2015 at 09:10:49AM -0400, Jason Dillaman wrote:
> Unfortunately, the tool to dynamically enable/disable image features (rbd 
> feature disable  ) was added during the Infernalis 
> development cycle.  Therefore, in the short-term you would need to recreate 
> the images via export/import or clone/flatten.  
> 
> There are several object map / exclusive lock bug fixes that are scheduled to 
> be included in the 0.94.4 release that might address your issue.  However, 
> without more information, we won't know if the issue you are seeing has been 
> fixed or not in a later release.
> 
> If it is at all possible, it would be most helpful if you could provide logs 
> or a backtrace when the VMs lock up.  Since it happens so infrequently, you 
> may not be willing to increase the RBD debug log level to 20 on one of these 
> VMs. Therefore, another possibility is for you to run "gcore " 
> to generate a core dump or attach GDB to the hung VM and run "thread apply 
> all bt".  With the gcore or backtrace method, we would need a listing of all 
> the package versions installed on the machine to recreate a similar debug 
> environment.
> 
> Thanks,
> 
> Jason 
> 
> 
> - Original Message -
> > From: "Christoph Adomeit" 
> > To: ceph-users@lists.ceph.com
> > Sent: Monday, August 31, 2015 7:49:00 AM
> > Subject: [ceph-users] How to disable object-map and exclusive features ?
> > 
> > Hi there,
> > 
> > I have a ceph-cluster (0.94-2) with >100 rbd kvm images.
> > 
> > Most vms are running rock-solid but 7 vms are hanging about once a week.
> > 
> > I found out the hanging machines have
> > features: layering, exclusive, object map while all other vms do not have
> > exclusive and object map set.
> > 
> > Now I want to disable these features. Is ist possible to disable these
> > features while the vms are running ? Or at least while they are shut down ?
> > Or will I have to recreate all these images ?
> > 
> > Thanks
> >   Christoph
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 

-- 
Christoph Adomeit
GATWORKS GmbH
Reststrauch 191
41199 Moenchengladbach
Sitz: Moenchengladbach
Amtsgericht Moenchengladbach, HRB 6303
Geschaeftsfuehrer:
Christoph Adomeit, Hans Wilhelm Terstappen

christoph.adom...@gatworks.de Internetloesungen vom Feinsten
Fon. +49 2166 9149-32  Fax. +49 2166 9149-10
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Troubleshooting rgw bucket list

2015-09-01 Thread Yehuda Sadeh-Weinraub
Can you bump up debug (debug rgw = 20, debug ms = 1), and see if the
operations (bucket listing and bucket check) go into some kind of
infinite loop?
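
For the radosgw-admin side you can pass the debug options straight on
the command line, something like this (bucket name is a placeholder):

  radosgw-admin --debug-rgw=20 --debug-ms=1 bucket check --bucket=mybucket

For the running gateway itself, setting the same values under its
client section in ceph.conf and restarting it should do (the exact
section name depends on your setup).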

Yehuda

On Tue, Sep 1, 2015 at 1:16 AM, Sam Wouters  wrote:
> Hi, I've started the bucket --check --fix on friday evening and it's
> still running. 'ceph -s' shows the cluster health as OK, I don't know if
> there is anything else I could check? Is there a way of finding out if
> its actually doing something?
>
> We only have this issue on the one bucket with versioning enabled, I
> can't get rid of the feeling it has something todo with that. The
> "underscore bug" is also still present on that bucket
> (http://tracker.ceph.com/issues/12819). Not sure if thats related in any
> way.
> Are there any alternatives, as for example copy all the objects into a
> new bucket without versioning? Simple way would be to list the objects
> and copy them to a new bucket, but bucket listing is not working so...
>
> -Sam
>
>
> On 31-08-15 10:47, Gregory Farnum wrote:
>> This generally shouldn't be a problem at your bucket sizes. Have you
>> checked that the cluster is actually in a healthy state? The sleeping
>> locks are normal but should be getting woken up; if they aren't it
>> means the object access isn't working for some reason. A down PG or
>> something would be the simplest explanation.
>> -Greg
>>
>> On Fri, Aug 28, 2015 at 6:52 PM, Sam Wouters  wrote:
>>> Ok, maybe I'm too impatient. It would be great if there were some verbose
>>> or progress logging of the radosgw-admin tool.
>>> I will start a check and let it run over the weekend.
>>>
>>> tnx,
>>> Sam
>>>
>>> On 28-08-15 18:16, Sam Wouters wrote:
 Hi,

 this bucket only has 13389 objects, so the index size shouldn't be a
 problem. Also, on the same cluster we have an other bucket with 1200543
 objects (but no versioning configured), which has no issues.

 when we run a radosgw-admin bucket --check (--fix), nothing seems to be
 happening. Putting an strace on the process shows a lot of lines like 
 these:
 [pid 99372] futex(0x2d730d4, FUTEX_WAIT_PRIVATE, 156619, NULL
 
 [pid 99385] futex(0x2da9410, FUTEX_WAIT_PRIVATE, 2, NULL 
 [pid 99371] futex(0x2da9410, FUTEX_WAKE_PRIVATE, 1 
 [pid 99385] <... futex resumed> )   = -1 EAGAIN (Resource
 temporarily unavailable)
 [pid 99371] <... futex resumed> )   = 0

 but no errors in the ceph logs or health warnings.

 r,
 Sam

 On 28-08-15 17:49, Ben Hines wrote:
> How many objects in the bucket?
>
> RGW has problems with index size once number of objects gets into the
> 90+ level. The buckets need to be recreated with 'sharded bucket
> indexes' on:
>
> rgw override bucket index max shards = 23
>
> You could also try repairing the index with:
>
>  radosgw-admin bucket check --fix --bucket=
>
> -Ben
>
> On Fri, Aug 28, 2015 at 8:38 AM, Sam Wouters  wrote:
>> Hi,
>>
>> we have a rgw bucket (with versioning) where PUT and GET operations for
>> specific objects succeed,  but retrieving an object list fails.
>> Using python-boto, after a timeout just gives us an 500 internal error;
>> radosgw-admin just hangs.
>> Also a radosgw-admin bucket check just seems to hang...
>>
>> ceph version is 0.94.3 but this also was happening with 0.94.2, we
>> quietly hoped upgrading would fix but it didn't...
>>
>> r,
>> Sam
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Appending to an open file - O_APPEND flag

2015-09-01 Thread Janusz Borkowski
Hi!

open( ... O_APPEND) works fine in a single system. If many processes write to 
the same file, their output will never overwrite each other.

On NFS overwriting is possible, as appending is only emulated - each write is 
preceded by a seek to the current file size and race condition may occur.

How it is in cephfs?

I have a file F opened with  O_APPEND|O_WRONLY by some process. In a console I 
type

$ echo "asd" >> F

Effectively, this is opening of file F by another process with O_APPEND flag .

The string "asd" is written to the beginning of file F, overwriting the 
starting bytes in the file. Is it a bug or a feature? If a feature, how it is 
described?

It is ceph Hammer and kernel 3.10.0-229.11.1.el7.x86_64

Thanks!

J.
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Appending to an open file - O_APPEND flag

2015-09-01 Thread PILLAI Madhubalan
Hi guy's,

   I am totally new to ceph-deploy. I have successfully installed a Ceph
cluster from an admin node and was able to activate it with one monitor
and two OSDs. After creating the Ceph cluster I checked the Ceph health
status and the output was OK.
  With that success I moved on to the next stage, the Rados Gateway.
About 90% of the work went without any error, but when I reached the
step of running python s3test.py I hit an error.
  It is a kind request to share your views on how to solve the error.

thanks n advance,
Maddy

On Tue, Sep 1, 2015 at 5:40 PM, Janusz Borkowski <
janusz.borkow...@infobright.com> wrote:

> Hi!
>
> open( ... O_APPEND) works fine in a single system. If many processes write
> to the same file, their output will never overwrite each other.
>
> On NFS overwriting is possible, as appending is only emulated - each write
> is preceded by a seek to the current file size and race condition may occur.
>
> How it is in cephfs?
>
> I have a file F opened with  O_APPEND|O_WRONLY by some process. In a
> console I type
>
> $ echo "asd" >> F
>
> Effectively, this is opening of file F by another process with O_APPEND
> flag .
>
> The string "asd" is written to the beginning of file F, overwriting the
> starting bytes in the file. Is it a bug or a feature? If a feature, how it
> is described?
>
> It is ceph Hammer and kernel 3.10.0-229.11.1.el7.x86_64
>
> Thanks!
>
> J.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How should I deal with placement group numbers when reducing number of OSDs

2015-09-01 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We are in a situation where we need to decrease PG for a pool as well.
One thought is to live migrate with block copy to a new pool with the
right number of PGs and then once they are all moved delete the old
pool. We don't have a lot of data in that pool yet, that may not be
feasible for you.
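
If downtime is acceptable, the non-live alternative is just an
RBD-level copy into the new pool, something like the following (pool
and image names are placeholders, and the VM has to be stopped for the
copy to be consistent):

  rbd cp volumes/myimage volumes-new/myimage

rbd export/import would also work if you want to go via a file.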

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, Sep 1, 2015 at 6:19 AM, Jan Schermer  wrote:
Hi,
we're in the process of changing 480G drives for 1200G drives, which
should cut the number of OSDs I have roughly to 1/3.

My largest "volumes" pool for OpenStack volumes has 16384 PGs at the
moment and I have 36K PGs in total. That equals to ~180 PGs/OSD and
would become ~500 PG/s OSD.

I know I can't actually decrease the number of PGs in a pool, and I'm
wondering if it's worth working around to decrease the numbers? It is
possible I'll be expanding the storage in the future, but probably not
3-fold.

I think it's not worth bothering with and I'll just have to disable
the "too many PGs per OSD" warning if I upgrade.

I already put some new drives in and the OSDs seem to work fine
(though I had to restart them after backfilling - they were spinning
CPU for no apparent reason).

Your thoughts?

Thanks
Jan
___
ceph-users mailing list
ceph-us...@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV5c1oCRDmVDuy+mK58QAAkXkP/1Wi4vBQ9BmZ6y11Eg+2
MxFl4ajDBYosJZz1jbnRvIKWWPlVbFHxbE0cFby6RtumT6DzpRNny+12TMcE
aakwUuVR5RADh+oXzr4MU4xlPj6DWMAzSx8Bi5Mid6KVlJtK6Egsq9hCHD50
EwXg1PcEoagJL5QOHFcT/u89TlE26Enp2cl4tjwp3ltMWj1hay+J63gpTglS
Tfmhi8hx22Q3RCWhVCFS+gWzWXjYPVfh3bONaSmK9BhqGjy98QJa6II+a6kL
gAWG7XTJl1zAKko44cj7JSqHLmzyuBfoa/PuZMOjkEfDAOW6jdTU4VUAj3bd
OK6E8sw8EMhbhlVOle6HvG1dO6bJhIt9uRxSVf+hZfFp87DoIHRAZ1J3b0PR
zB6s8b+XfSph3gnU2ZsCc3wHuqM3MFXUcI7Vn7tvdV7HWXWBTGtPhokI5COk
vgpLO1gvTTRzkNxmsLqwCTBFhFqK2zPw6xHpL1D5BcUYr/zS02+48ARZoUh7
pRteDdsnHOPSc5m1DcldvQtQelSMgfIyULVSXlZAukIWH9rsNt7Zishj3lvR
W7z8/Ixr22TJ15mkVAAVwtlI813X59tPhmZrFmffP/GaF9vQpKUysEVZFhm1
rrTfBt6ZBa5nhYCatojpv91HM7WNeY0XJSrl+LnwGjP9avt/B2r1SoRG61Y0
d3BM
=J7n1
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD test results with Plextor M6 Pro, HyperX Fury, Kingston V300, ADATA SP90

2015-09-01 Thread Jelle de Jong
Hi Jan,

I am building two new clusters for testing. I have been reading your
messages on the mailing list for a while now and want to thank you for
your support.

I can redo all the numbers, but is your question to run all the tests
again with [hdparm -W 1 /dev/sdc]? Please tell me what else you would
like to see tested, and with which commands?

My experience was that enabling disk cache causes about a 45%
performance drop, iops=25690 vs iops=46185

I am going to test DT01ACA300 vs WD1003FBYZ disks with SV300S37A ssd's
in my other two three node ceph clusters.

What is your advice on making hdparm and possible scheduler (noop)
changes persistent (cmd in rc.local or special udev rules, examples?)
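
What I have in mind for the rc.local route is simply something like the
lines below, repeated per device (the device name is an example; a udev
rule keyed on queue/rotational would be the tidier alternative):

  echo noop > /sys/block/sdc/queue/scheduler
  hdparm -W 1 /dev/sdc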

Kind regards,

Jelle de Jong


On 23/06/15 12:41, Jan Schermer wrote:
> Those are interesting numbers - can you rerun the test with write cache 
> enabled this time? I wonder how much your drop will be…
> 
> thanks
> 
> Jan
> 
>> On 18 Jun 2015, at 17:48, Jelle de Jong  wrote:
>>
>> Hello everybody,
>>
>> I thought I would share the benchmarks from these four ssd's I tested
>> (see attachment)
>>
>> I do still have some question:
>>
>> #1 *Data Set Management TRIM supported (limit 1 block)
>>vs
>>   *Data Set Management TRIM supported (limit 8 blocks)
>> and how this effects Ceph and also how can I test if TRIM is actually
>> working and not corruption data.
>>
>> #2 are there other things I should test to compare ssd's for Ceph Journals
>>
>> #3 are the power loss security mechanisms on SSD relevant in Ceph when
>> configured in a way that a full node can fully die and that a power loss
>> of all nodes at the same time should not be possible (or has an extreme
>> low probability)
>>
>> #4 how to benchmarks the OSD (disk+ssd-journal) combination so I can
>> compare them.
>>
>> I got some other benchmarks question, but I will make an separate mail
>> for them.
>>
>> Kind regards,
>>
>> Jelle de Jong
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Appending to an open file - O_APPEND flag

2015-09-01 Thread Gregory Farnum
On Sep 1, 2015 4:41 PM, "Janusz Borkowski" 
wrote:
>
> Hi!
>
> open( ... O_APPEND) works fine in a single system. If many processes
write to the same file, their output will never overwrite each other.
>
> On NFS overwriting is possible, as appending is only emulated - each
write is preceded by a seek to the current file size and race condition may
occur.
>
> How it is in cephfs?

CephFS generally ought to handle appends correctly. If it's not we will
want to fix that.

>
> I have a file F opened with  O_APPEND|O_WRONLY by some process. In a
console I type
>
> $ echo "asd" >> F
>
> Effectively, this is opening of file F by another process with O_APPEND
flag .
>
> The string "asd" is written to the beginning of file F, overwriting the
starting bytes in the file. Is it a bug or a feature? If a feature, how it
is described?

Are you doing this in the same box that's got the the file open, or a
different one? Are you using the ceph-fuse or kernel clients on the systems?

I'm not sure how the shell actually handles >> so I'd like to see this
reproduced with strace or an example program to be sure it's really not
handling append properly.
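
Something along these lines would show whether the client seeks before
the write (the path is just wherever cephfs is mounted):

  strace -f -e trace=open,openat,write,lseek sh -c 'echo asd >> /mnt/cephfs/F'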
-Greg

>
> It is ceph Hammer and kernel 3.10.0-229.11.1.el7.x86_64
>
> Thanks!
>
> J.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Performance Questions with rbd images access by qemu-kvm

2015-09-01 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Just swapping out spindles for SSD will not give you orders of
magnitude performance gains as it does in regular cases. This is
because Ceph has a lot of overhead for each I/O which limits the
performance of the SSDs. In my testing, two Intel S3500 SSDs with an 8
core Atom (Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz) and size=1 and fio
with 8 jobs and QD=8 sync,direct 4K read/writes produced 2,600 IOPs.
Don't get me wrong, it will help, but don't expect spectacular
results.
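
A roughly equivalent fio invocation would look like this (reconstructed
from the parameters above; the device path, ioengine and runtime are
illustrative, and this shows only the write side):

  fio --name=4k-sync --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
      --sync=1 --rw=randwrite --bs=4k --numjobs=8 --iodepth=8 \
      --runtime=60 --time_based --group_reporting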

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, Sep 1, 2015 at 8:01 AM, Kenneth Van Alstyne  wrote:
Thanks for the awesome advice folks.  Until I can go larger scale (50+
SATA disks), I’m thinking my best option here is to just swap out
these 1TB SATA disks with 1TB SSDs.  Am I oversimplifying the short
term solution?

Thanks,

- --
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
Service-Disabled Veteran-Owned Business
1775 Wiehle Avenue Suite 101 | Reston, VA 20190
c: 228-547-8045 f: 571-266-3106
www.knightpoint.com
DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
GSA Schedule 70 SDVOSB: GS-35F-0646S
GSA MOBIS Schedule: GS-10F-0404Y
ISO 2 / ISO 27001

Notice: This e-mail message, including any attachments, is for the
sole use of the intended recipient(s) and may contain confidential and
privileged information. Any unauthorized review, copy, use,
disclosure, or distribution is STRICTLY prohibited. If you are not the
intended recipient, please contact the sender by reply e-mail and
destroy all copies of the original message.

On Aug 31, 2015, at 7:29 PM, Christian Balzer  wrote:


Hello,

On Mon, 31 Aug 2015 12:28:15 -0500 Kenneth Van Alstyne wrote:

In addition to the spot on comments by Warren and Quentin, verify this by
watching your nodes with atop, iostat, etc.
The culprit (HDDs) should be plainly visible.

More inline:

Christian, et al:

Sorry for the lack of information.  I wasn’t sure what of our hardware
specifications or Ceph configuration was useful information at this
point.  Thanks for the feedback — any feedback, is appreciated at this
point, as I’ve been beating my head against a wall trying to figure out
what’s going on.  (If anything.  Maybe the spindle count is indeed our
upper limit or our SSDs really suck? :-) )

Your SSDs aren't the problem.

To directly address your questions, see answers below:
- CBT is the Ceph Benchmarking Tool.  Since my question was more
generic rather than with CBT itself, it was probably more useful to post
in the ceph-users list rather than cbt.
- 8 Cores are from 2x quad core Intel(R) Xeon(R) CPU E5-2609 0 @
2.40GHz
Not your problem either.

- The SSDs are indeed Intel S3500s.  I agree — not ideal, but
supposedly capable of up to 75,000 random 4KB reads/writes.  Throughput
and longevity is quite low for an SSD, rated at about 400MB/s reads and
100MB/s writes, though.  When we added these as journals in front of the
SATA spindles, both VM performance and rados benchmark numbers were
relatively unchanged.

The only thing relevant in regards to journal SSDs is the sequential write
speed (SYNC), they don't seek and normally don't get read either.
This is why a 200GB DC S3700 is a better journal SSD than the 200GB S3710
which is faster in any other aspect but sequential writes. ^o^

Latency should have gone down with the SSD journals in place, but that's
their main function/benefit.

- Regarding throughput vs iops, indeed — the throughput that I’m
seeing is nearly worst case scenario, with all I/O being 4KB block
size.  With RBD cache enabled and the writeback option set in the VM
configuration, I was hoping more coalescing would occur, increasing the
I/O block size.

That can only help with non-SYNC writes, so your MySQL VMs and certain
file system ops will have to bypass that and that hurts.

As an aside, the orchestration layer on top of KVM is OpenNebula if
that’s of any interest.

It is actually, as I've been eying OpenNebula (alas no Debian Jessie
packages). However not relevant to your problem indeed.

VM information:
- Number = 15
- Worload = Mixed (I know, I know — that’s as vague of an answer
as they come)  A handful of VMs are running some MySQL databases and
some web applications in Apache Tomcat.  One is running a syslog
server.  Everything else is mostly static web page serving for a low
number of users.

As others have mentioned, would you expect this load to work well with
just 2 HDDs and via NFS to introduce network latency?

I can duplicate the blocked request issue pretty consistently, just by
running something simple like a “yum -y update” in one VM.  While that
is running, ceph -w and ceph -s show the following: root@dashboard:~#
ceph -s cluster f79d8c2a-3c14-49be-942d-83fc5f193a25 health HEALTH_WARN
   1 requests are blocked > 32 sec
monmap e3: 3 mons at
{storage-1=10.0.0.1:6789/0,storage-

Re: [ceph-users] How should I deal with placement group numbers when reducing number of OSDs

2015-09-01 Thread Jan Schermer
Unfortunately we are not in control of the VMs using this pool, so something 
like "sync -> stop VM -> incremental sync -> start VM on new pool" would be 
extremely complicated. I _think_ it's possible to misuse a cache tier to do 
this (add a cache tier, remove the underlying tier, add a new pool and remove 
cache tier), but that's a hack at best.

So before we go even considering this - will there be any significant gains 
from this? When we increased the PG numbers it had a very positive effect on 
the cluster, but with only 1/3 of the drives I am worried there will be too 
much contention on the OSDs. I've already seen a higher CPU usage and while 
some latency metrics went down thanks to the new Intel drives, other metrics 
went up of course, so I'm not sure how it will perform in the real life...

Jan

> On 01 Sep 2015, at 18:08, Robert LeBlanc  wrote:
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> We are in a situation where we need to decrease PG for a pool as well. One 
> thought is to live migrate with block copy to a new pool with the right 
> number of PGs and then once they are all moved delete the old pool. We don't 
> have a lot of data in that pool yet, that may not be feasible for you.
> 
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> On Tue, Sep 1, 2015 at 6:19 AM, Jan Schermer  wrote:
> Hi,
> we're in the process of changing 480G drives for 1200G drives, which should 
> cut the number of OSDs I have roughly to 1/3.
> 
> My largest "volumes" pool for OpenStack volumes has 16384 PGs at the moment 
> and I have 36K PGs in total. That equals to ~180 PGs/OSD and would become 
> ~500 PG/s OSD.
> 
> I know I can't actually decrease the number of PGs in a pool, and I'm 
> wondering if it's worth working around to decrease the numbers? It is 
> possible I'll be expanding the storage in the future, but probably not 3-fold.
> 
> I think it's not worth bothering with and I'll just have to disable the "too 
> many PGs per OSD" warning if I upgrade.
> 
> I already put some new drives in and the OSDs seem to work fine (though I had 
> to restart them after backfilling - they were spinning CPU for no apparent 
> reason).
> 
> Your thoughts?
> 
> Thanks
> Jan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.0.2
> Comment: https://www.mailvelope.com 
> 
> wsFcBAEBCAAQBQJV5c1oCRDmVDuy+mK58QAAkXkP/1Wi4vBQ9BmZ6y11Eg+2
> MxFl4ajDBYosJZz1jbnRvIKWWPlVbFHxbE0cFby6RtumT6DzpRNny+12TMcE
> aakwUuVR5RADh+oXzr4MU4xlPj6DWMAzSx8Bi5Mid6KVlJtK6Egsq9hCHD50
> EwXg1PcEoagJL5QOHFcT/u89TlE26Enp2cl4tjwp3ltMWj1hay+J63gpTglS
> Tfmhi8hx22Q3RCWhVCFS+gWzWXjYPVfh3bONaSmK9BhqGjy98QJa6II+a6kL
> gAWG7XTJl1zAKko44cj7JSqHLmzyuBfoa/PuZMOjkEfDAOW6jdTU4VUAj3bd
> OK6E8sw8EMhbhlVOle6HvG1dO6bJhIt9uRxSVf+hZfFp87DoIHRAZ1J3b0PR
> zB6s8b+XfSph3gnU2ZsCc3wHuqM3MFXUcI7Vn7tvdV7HWXWBTGtPhokI5COk
> vgpLO1gvTTRzkNxmsLqwCTBFhFqK2zPw6xHpL1D5BcUYr/zS02+48ARZoUh7
> pRteDdsnHOPSc5m1DcldvQtQelSMgfIyULVSXlZAukIWH9rsNt7Zishj3lvR
> W7z8/Ixr22TJ15mkVAAVwtlI813X59tPhmZrFmffP/GaF9vQpKUysEVZFhm1
> rrTfBt6ZBa5nhYCatojpv91HM7WNeY0XJSrl+LnwGjP9avt/B2r1SoRG61Y0
> d3BM
> =J7n1
> -END PGP SIGNATURE-

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Accelio & Ceph

2015-09-01 Thread German Anders
Hi cephers,

 I would like to know the status for production-ready of Accelio & Ceph,
does anyone had a home-made procedure implemented with Ubuntu?

recommendations, comments?

Thanks in advance,

Best regards,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Moving/Sharding RGW Bucket Index

2015-09-01 Thread Wang, Warren
I added sharding to our busiest RGW sites, but it will not shard existing
bucket indexes; it only applies to new buckets. Even with that change, I'm still
considering moving the index pool to SSD. The main factor being the rate of 
writes. We are looking at a project that will have extremely high writes/sec 
through the RGWs. 

The other thing worth noting is that at that scale, you also need to change 
filestore merge threshold and filestore split multiple to something 
considerably larger. Props to Michael Kidd @ RH for that tip. There's a 
mathematical formula on the filestore config reference.
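
For reference, the knobs in question sit in ceph.conf roughly like this
(the gateway section name and the values are illustrative, not what we
run; work the split/merge numbers out from the formula in the filestore
config reference):

  [client.radosgw.gateway]
      rgw override bucket index max shards = 23

  [osd]
      filestore merge threshold = 40
      filestore split multiple = 8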

Warren

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Daniel 
Maraio
Sent: Tuesday, September 01, 2015 10:40 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Moving/Sharding RGW Bucket Index

Hello,

   I have two large buckets in my RGW and I think the performance is being 
impacted by the bucket index. One bucket contains 9 million objects and the 
other one has 22 million. I'd like to shard the bucket index and also change 
the ruleset of the .rgw.buckets.index pool to put it on our SSD root. I could 
not find any documentation on this issue. It looks like the bucket indexes can 
be rebuilt using the radosgw-admin bucket check command but I'm not sure how to 
proceed. We can stop writes or take the cluster down completely if necessary. 
My initial thought was to backup the existing index pool and create a new one. 
I'm not sure if I can change the index_pool of an existing bucket. If that is 
possible I assume I can change that to my new pool and execute a radosgw-admin 
bucket check command to rebuild/shard the index.

   Does anyone have experience in getting sharding running with an existing 
bucket, or even moving the index pool to a different ruleset? 
When I change the crush ruleset for the .rgw.buckets.index pool to my SSD root 
we run into issues, buckets cannot be created or listed, writes cease to work, 
reads seem to work fine though. Thanks for your time!

- Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How should I deal with placement group numbers when reducing number of OSDs

2015-09-01 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I'm not convinced that a backing pool can be removed from a caching
tier. I just haven't been able to get around to trying it.

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, Sep 1, 2015 at 10:29 AM, Jan Schermer  wrote:
Unfortunately we are not in control of the VMs using this pool, so
something like "sync -> stop VM -> incremental sync -> start VM on new
pool" would be extremely complicated. I _think_ it's possible to
misuse a cache tier to do this (add a cache tier, remove the
underlying tier, add a new pool and remove cache tier), but that's a
hack at best.

So before we go even considering this - will there be any significant
gains from this? When we increased the PG numbers it had a very
positive effect on the cluster, but with only 1/3 of the drives I am
worried there will be too much contention on the OSDs. I've already
seen a higher CPU usage and while some latency metrics went down
thanks to the new Intel drives, other metrics went up of course, so
I'm not sure how it will perform in the real life...

Jan

On 01 Sep 2015, at 18:08, Robert LeBlanc  wrote:

- -BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We are in a situation where we need to decrease PG for a pool as well.
One thought is to live migrate with block copy to a new pool with the
right number of PGs and then once they are all moved delete the old
pool. We don't have a lot of data in that pool yet, that may not be
feasible for you.

- - 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, Sep 1, 2015 at 6:19 AM, Jan Schermer  wrote:
Hi,
we're in the process of changing 480G drives for 1200G drives, which
should cut the number of OSDs I have roughly to 1/3.

My largest "volumes" pool for OpenStack volumes has 16384 PGs at the
moment and I have 36K PGs in total. That equals to ~180 PGs/OSD and
would become ~500 PG/s OSD.

I know I can't actually decrease the number of PGs in a pool, and I'm
wondering if it's worth working around to decrease the numbers? It is
possible I'll be expanding the storage in the future, but probably not
3-fold.

I think it's not worth bothering with and I'll just have to disable
the "too many PGs per OSD" warning if I upgrade.

I already put some new drives in and the OSDs seem to work fine
(though I had to restart them after backfilling - they were spinning
CPU for no apparent reason).

Your thoughts?

Thanks
Jan
___
ceph-users mailing list
ceph-us...@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

- -BEGIN PGP SIGNATURE-
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV5c1oCRDmVDuy+mK58QAAkXkP/1Wi4vBQ9BmZ6y11Eg+2
MxFl4ajDBYosJZz1jbnRvIKWWPlVbFHxbE0cFby6RtumT6DzpRNny+12TMcE
aakwUuVR5RADh+oXzr4MU4xlPj6DWMAzSx8Bi5Mid6KVlJtK6Egsq9hCHD50
EwXg1PcEoagJL5QOHFcT/u89TlE26Enp2cl4tjwp3ltMWj1hay+J63gpTglS
Tfmhi8hx22Q3RCWhVCFS+gWzWXjYPVfh3bONaSmK9BhqGjy98QJa6II+a6kL
gAWG7XTJl1zAKko44cj7JSqHLmzyuBfoa/PuZMOjkEfDAOW6jdTU4VUAj3bd
OK6E8sw8EMhbhlVOle6HvG1dO6bJhIt9uRxSVf+hZfFp87DoIHRAZ1J3b0PR
zB6s8b+XfSph3gnU2ZsCc3wHuqM3MFXUcI7Vn7tvdV7HWXWBTGtPhokI5COk
vgpLO1gvTTRzkNxmsLqwCTBFhFqK2zPw6xHpL1D5BcUYr/zS02+48ARZoUh7
pRteDdsnHOPSc5m1DcldvQtQelSMgfIyULVSXlZAukIWH9rsNt7Zishj3lvR
W7z8/Ixr22TJ15mkVAAVwtlI813X59tPhmZrFmffP/GaF9vQpKUysEVZFhm1
rrTfBt6ZBa5nhYCatojpv91HM7WNeY0XJSrl+LnwGjP9avt/B2r1SoRG61Y0
d3BM
=J7n1
- -END PGP SIGNATURE-


-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV5dQVCRDmVDuy+mK58QAAAWkQAJHuwHD0T2tLkC2UvU2A
x1kNgxjyFgZykBAO8oZPQqgAva3AVwC70b9Wi+OlYSFEAKxu0M0sjHtfQP5d
uMFLfk2T+PeloWCSKUToIbqTR892vrivO12pII7SvBNcmH5OEF8wlyzfVw1l
BVm1sA9tLqCQ6GHA6u4n1iXAn/ZCUzwB08XRRXNHgFp5oNTBxve720zEqO11
5CLv010WkcGtNnbZUqYpOKWXXpd3KnVvh4dNatGiUheHzTv7u9R6Iu5mZGTt
+vJFNkDq0Yy3h/uyneJu1tPBNHvYM3o1vy7VL7lQ4G45mV2oqrdTn/Pp0mb3
y8R5F9hx+40rtXl/gehi4fY0crYPmg+vG2/GPpxKxeJoWFDcbinbACPXN9oR
vm/4mi83R/zoisQt6wxbHwaFJAZDQldeb+Wej7IJ/JzEL+pW395ezE3AaLe0
mFStyIyZXC6ceqjeEXzl5X3eFU+snzKPWMF4xznxfe3/Qz9NxKegBCjr5WoV
//BA4+XisLQpCsFhAC7B87bs7ExoC/eD67K97E7QFH2GnkYL554RdTQI3bBZ
8y2X2Udi+EbzCx99yEN6aU/H1tgkAZ2q1WmgoxQwxPMVBQ7A0ZfXonxWvIBv
P6XXx8SQoFX/j23sitnrin5LtsnNnDZO1JOiRw0FcGbeNLnT0vVe+sCjcBZI
Hg4P
=8q50
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Accelio and Ceph are still in heavy development and not ready for production.

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, Sep 1, 2015 at 10:31 AM, German Anders  wrote:
Hi cephers,

 I would like to know the status for production-ready of Accelio &
Ceph, does anyone had a home-made procedure implemented with Ubuntu?

recommendations, comments?

Thanks in advance,

Best regards,

German

___
ceph-users mailing list
ceph-us...@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV5dWKCRDmVDuy+mK58QAAZWcQAKIRYhnlSzIQJ9PGaC1J
FGYxZ9IOmXX89IbpZuM8Ns8Q1Y52SrYkez8jwtB/A1OWXH0uw2GT45shDfzX
xFaqRVVHnjI7MiO+aijGkDZLrdE5fvGfTAOa1m2ovlx7BWRG6k0aSeqdMr92
OB/n2ona94ILvHW/Uq/o5YnoFsThUdTTRWckWeRMKIz9eA7v+bneukjXyLf/
VwFAk0V9LevzNZY83nARYThDfL20SYT05dAhJ6bbzYFowdymZcNWTEDkUY02
m76bhEQO4k3MypL+kv0YyFi3cDkMBa4CaCm3UwRWC5KG6MlQnFl+f3UQuOwV
YhYkagw2qUP4rx+/5LIAU+WEzegZ+3mDgk0qIB6pa7TK5Gk4hvHZG884YpXA
Fa6Lj9x7gQjszLI1esW1zuNhlTBUJfxygfdJQPV2w/9cjjFlXG8QgmZcgyJF
XjtH/T1BK8t7x6IgerXBPEjPlU6tYI75HSSryarFH9ntKIIr6Yrcaaa8heLD
/7S/S05yQ2TcfnkVPGapDzJ2Ko5h5gwO/29EIlOsYiHCwDYXDonRFFUrRa2Z
SzSq9iiCywglYtqqzaDpqeU5soPIaijHn7ELSEq51Lc6D19pRdEMdmFnxcmt
8QAYEihGnckbcSLdwm1nOP0Nme5ixyGLxcEfxUYv6hTxhJt4RuAj83f2cFxh
TiL2
=oSrX
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Performance Questions with rbd images access by qemu-kvm

2015-09-01 Thread Kenneth Van Alstyne
Got it — I’ll keep that in mind. That may just be what I need to “get by” for 
now.  Ultimately, we’re looking to buy at least three nodes of servers that can 
hold 40+ OSDs backed by 2TB+ SATA disks,

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
Service-Disabled Veteran-Owned Business
1775 Wiehle Avenue Suite 101 | Reston, VA 20190
c: 228-547-8045 f: 571-266-3106
www.knightpoint.com 
DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
GSA Schedule 70 SDVOSB: GS-35F-0646S
GSA MOBIS Schedule: GS-10F-0404Y
ISO 2 / ISO 27001

Notice: This e-mail message, including any attachments, is for the sole use of 
the intended recipient(s) and may contain confidential and privileged 
information. Any unauthorized review, copy, use, disclosure, or distribution is 
STRICTLY prohibited. If you are not the intended recipient, please contact the 
sender by reply e-mail and destroy all copies of the original message.

> On Sep 1, 2015, at 11:26 AM, Robert LeBlanc  wrote:
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> Just swapping out spindles for SSD will not give you orders of magnitude 
> performance gains as it does in regular cases. This is because Ceph has a lot 
> of overhead for each I/O which limits the performance of the SSDs. In my 
> testing, two Intel S3500 SSDs with an 8 core Atom (Intel(R) Atom(TM) CPU  
> C2750  @ 2.40GHz) and size=1 and fio with 8 jobs and QD=8 sync,direct 4K 
> read/writes produced 2,600 IOPs. Don't get me wrong, it will help, but don't 
> expect spectacular results.
> 
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> On Tue, Sep 1, 2015 at 8:01 AM, Kenneth Van Alstyne  wrote:
> Thanks for the awesome advice folks.  Until I can go larger scale (50+ SATA 
> disks), I’m thinking my best option here is to just swap out these 1TB SATA 
> disks with 1TB SSDs.  Am I oversimplifying the short term solution?
> 
> Thanks,
> 
> - --
> Kenneth Van Alstyne
> Systems Architect
> Knight Point Systems, LLC
> Service-Disabled Veteran-Owned Business
> 1775 Wiehle Avenue Suite 101 | Reston, VA 20190
> c: 228-547-8045 f: 571-266-3106
> www.knightpoint.com  
> DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
> GSA Schedule 70 SDVOSB: GS-35F-0646S
> GSA MOBIS Schedule: GS-10F-0404Y
> ISO 2 / ISO 27001
> 
> Notice: This e-mail message, including any attachments, is for the sole use 
> of the intended recipient(s) and may contain confidential and privileged 
> information. Any unauthorized review, copy, use, disclosure, or distribution 
> is STRICTLY prohibited. If you are not the intended recipient, please contact 
> the sender by reply e-mail and destroy all copies of the original message.
> 
> On Aug 31, 2015, at 7:29 PM, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> On Mon, 31 Aug 2015 12:28:15 -0500 Kenneth Van Alstyne wrote:
> 
> In addition to the spot on comments by Warren and Quentin, verify this by
> watching your nodes with atop, iostat, etc. 
> The culprit (HDDs) should be plainly visible.
> 
> More inline:
> 
> Christian, et al:
> 
> Sorry for the lack of information.  I wasn’t sure what of our hardware
> specifications or Ceph configuration was useful information at this
> point.  Thanks for the feedback — any feedback, is appreciated at this
> point, as I’ve been beating my head against a wall trying to figure out
> what’s going on.  (If anything.  Maybe the spindle count is indeed our
> upper limit or our SSDs really suck? :-) )
> 
> Your SSDs aren't the problem.
> 
> To directly address your questions, see answers below:
>   - CBT is the Ceph Benchmarking Tool.  Since my question was more
> generic rather than with CBT itself, it was probably more useful to post
> in the ceph-users list rather than cbt.
>   - 8 Cores are from 2x quad core Intel(R) Xeon(R) CPU E5-2609 0 @
> 2.40GHz
> Not your problem either.
> 
>   - The SSDs are indeed Intel S3500s.  I agree — not ideal, but
> supposedly capable of up to 75,000 random 4KB reads/writes.  Throughput
> and longevity is quite low for an SSD, rated at about 400MB/s reads and
> 100MB/s writes, though.  When we added these as journals in front of the
> SATA spindles, both VM performance and rados benchmark numbers were
> relatively unchanged.
> 
> The only thing relevant in regards to journal SSDs is the sequential write
> speed (SYNC), they don't seek and normally don't get read either.
> This is why a 200GB DC S3700 is a better journal SSD than the 200GB S3710
> which is faster in any other aspect but sequential writes. ^o^
> 
> Latency should have gone down with the SSD journals in place, but that's
> their main function/benefit. 
> 
>   - Regarding throughput vs iops, indeed — the throughput that I’m
> seeing is nearly worst case scenario, with all I/O being 4KB block
> size.  With RBD cache enabled and the writeback option set in the VM
> configuration, I was hoping more coalescing would occur, in

Re: [ceph-users] ceph distributed osd

2015-09-01 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

You will be the one best equipped to answer the performance question.
You will have to figure out what minimal performance your application
will need. Then you have to match the disks to that (disk random IOPs
* # disks) / replicas will get you in the ball park. If you are not
using SSDs for journals, I'd half the number you get from the
equation. You will have to determine if your max performance for each
client will fit in a 1 Gb link.
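
As a rough worked example (all numbers here are assumptions, not
measurements): three SATA OSDs at ~80 random IOPS each with replica
size 2 gives (3 * 80) / 2 = 120 IOPS for the whole cluster, and roughly
half of that, ~60 IOPS, without SSD journals, shared by all 12 clients.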

There is not enough information for me to give you a good answer. My
gut is that if you have 12 clients in a cluster, you are doing that
for performance, in which case I would say it is not enough.

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, Sep 1, 2015 at 5:19 AM, gjprabu  wrote:
Hi Robert,

We are going to use ceph with ocfs2 in production. My doubt is:
the rbd is mounted on 12 clients using ocfs2 clustering, and the network
for both servers and clients will be 1 Gig. Is the throughput performance
OK for this setup?

Regards
Prabu

-  On Thu, 20 Aug 2015 02:15:53 +0530 gjprabu  wrote 

Hi Robert,

Thanks for your reply. We understand the scenarios.


Regards
Prabu



-  On Thu, 20 Aug 2015 00:15:41 +0530 rob...@leblancnet.us wrote 

- -BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

By default, all pools will use all OSDs. Each RBD, for instance, is
broken up into 4 MB objects and those objects are somewhat uniformly
distributed between the OSDs. When you add another OSD, the CRUSH map
is recalculated and the OSDs shuffle the objects to their new
locations somewhat uniformly distributing them across all available
OSDs.

I say uniformly distributed because it is based on the hashing
algorithm of the name and size is not taken into account. So you may
have more larger objects on some OSDs than others. The number of PGs
affect the ability to more uniformly distribute the data (more hash
buckets for data to land in).

You can create CRUSH rules that limit selection of OSDs to a subset
and then configure a pool to use those rules. This is a pretty
advanced configuration option.

I hope that helps with your question.
- -BEGIN PGP SIGNATURE-
Version: Mailvelope v1.0.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV1M7SCRDmVDuy+mK58QAASbYQAMG0oPEu56Uz0/9cb4LY
E7QTeX2hUGRX5c65Zurr9p+/Sc4WCvDEZm/aPPcB9UtO0O5dvWXULWjXRgr0
Z13/28OozLxWQihRc80OhY2MskNfgPA0zYwaANgUR0xJV4YFQ1ORa13rj0L8
SL4z/IDK9tK/NDLxnjq/iMPXCTTcg3ufiB+0Njl3zLRbGEOAix6H5hzi0239
qHb7UniTtailICcSI0byQE2vKPWQbJ7GueECbcAn/MkqU0uZqzyh5HotiBFq
9ut/ui3ec0Sg/3puD6TOhipQlP998sMnAa5hFi+hoNbVbljGZ9dGZ+inVlJy
kSQTbNDs0Xo2QijGH11LrQ4yL47Trr2WkIriHONtvbncgZg3qK7uR39k6kZ9
dfGUdtstkn8sh5gt98jFNvjWL8UTH9puAJv5C9TzPuq+cq3kr3dwhy4WxrN+
MNISYwJOvncY/2kl03FLL/Z0HxDx1mjjJMQdzM+q9+D0m/EYfUpe/DxMqqMI
4t8hD5UPBhkv1sgLYSWyJ5vxLnNOZP7roe2Jp0KwwlSADM9DJb4MEx/1nNcb
6emts8KUhhtb1jsH8gu9Z0tzHcaqNE8N1z9JiveaNCjs6wTp8xbtmDB7p9k4
uZzzoIXTJWrIN/Qqukza+/+8D+WAJ618uwXCCpWi/k83RKt7iy2iv5w4EDTx
25cQ
=a+24
- -END PGP SIGNATURE-
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1


On Tue, Aug 18, 2015 at 8:26 AM, gjprabu  wrote:
> Hi Luis,
>
> What i mean , we have three OSD with Harddisk size each 1TB and two
> pool (poolA and poolB) with replica 2. Here writing behavior is the
> confusion for us. Our assumptions is below.
>
> PoolA -- may write with OSD1 and OSD2 (is this correct)
>
> PoolB -- may write with OSD3 and OSD1 (is this correct)
>
> suppose the hard disk size got full , then how many OSD's need to be added
> and How will be the writing behavior to new OSD's
>
> After added few osd's
>
> PoolA -- may write with OSD4 and OSD5 (is this correct)
> PoolB -- may write with OSD5 and OSD6 (is this correct)
>
>
> Regards
> Prabu
>
>  On Mon, 17 Aug 2015 19:41:53 +0530 Luis Periquito
> wrote 
>
> I don't understand your question? You created a 1G RBD/disk and it's full.
> You are able to grow it though - but that's a Linux management issue, not
> ceph.
>
> As everything is thin-provisioned you can create a RBD with an arbitrary
> size - I've create one with 1PB when the cluster only had 600G/Raw
> available.
>
> On Mon, Aug 17, 2015 at 1:18 PM, gjprabu  wrote:
>
> Hi All,
>
> Anybody can help on this issue.
>
> Regards
> Prabu
>
>  On Mon, 17 Aug 2015 12:08:28 +0530 gjprabu  wrote
> 
>
> Hi All,
>
> Also please find osd information.
>
> ceph osd dump | grep 'replicated size'
> pool 2 'repo' replicated size 2 min_size 2 crush_ruleset 0 object_hash
> rjenkins pg_num 126 pgp_num 126 last_change 21573 flags hashpspool
> stripe_width 0
>
> Regards
> Prabu
>
>
>
>
>  On Mon, 17 Aug 2015 11:58:55 +0530 gjprabu  wrote
> 
>
>
>
> Hi All,
>
> We need to test three OSD and one image with replica 2(size 1GB). While
> testing data is not writing above 1GB. Is there any option to write on third
> OSD.
>
> ceph osd pool get repo pg_num
> pg_num: 126
>

Re: [ceph-users] Ceph Performance Questions with rbd images access by qemu-kvm

2015-09-01 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I would caution against large OSD nodes. You can really get into a
pinch with CPU and RAM during recovery periods. I know a few people
have it working well, but it requires a lot of tuning to get it right.
Personally, 20 disks in a box are too much for my comfort. If you want
to go with large boxes, I would be sure to do a lot of research and
ask people here on the list about what needs to be done to get optimum
performance.

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, Sep 1, 2015 at 10:50 AM, Kenneth Van Alstyne  wrote:
Got it — I’ll keep that in mind. That may just be what I need to “get
by” for now.  Ultimately, we’re looking to buy at least three nodes of
servers that can hold 40+ OSDs backed by 2TB+ SATA disks,

Thanks,

- --
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
Service-Disabled Veteran-Owned Business
1775 Wiehle Avenue Suite 101 | Reston, VA 20190
c: 228-547-8045 f: 571-266-3106
www.knightpoint.com
DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
GSA Schedule 70 SDVOSB: GS-35F-0646S
GSA MOBIS Schedule: GS-10F-0404Y
ISO 2 / ISO 27001

Notice: This e-mail message, including any attachments, is for the
sole use of the intended recipient(s) and may contain confidential and
privileged information. Any unauthorized review, copy, use,
disclosure, or distribution is STRICTLY prohibited. If you are not the
intended recipient, please contact the sender by reply e-mail and
destroy all copies of the original message.

On Sep 1, 2015, at 11:26 AM, Robert LeBlanc  wrote:

- -BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Just swapping out spindles for SSD will not give you orders of
magnitude performance gains as it does in regular cases. This is
because Ceph has a lot of overhead for each I/O which limits the
performance of the SSDs. In my testing, two Intel S3500 SSDs with an 8
core Atom (Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz) and size=1 and fio
with 8 jobs and QD=8 sync,direct 4K read/writes produced 2,600 IOPs.
Don't get me wrong, it will help, but don't expect spectacular
results.

- - 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, Sep 1, 2015 at 8:01 AM, Kenneth Van Alstyne  wrote:
Thanks for the awesome advice folks.  Until I can go larger scale (50+
SATA disks), I’m thinking my best option here is to just swap out
these 1TB SATA disks with 1TB SSDs.  Am I oversimplifying the short
term solution?

Thanks,

- - --
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
Service-Disabled Veteran-Owned Business
1775 Wiehle Avenue Suite 101 | Reston, VA 20190
c: 228-547-8045 f: 571-266-3106
www.knightpoint.com
DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
GSA Schedule 70 SDVOSB: GS-35F-0646S
GSA MOBIS Schedule: GS-10F-0404Y
ISO 2 / ISO 27001

Notice: This e-mail message, including any attachments, is for the
sole use of the intended recipient(s) and may contain confidential and
privileged information. Any unauthorized review, copy, use,
disclosure, or distribution is STRICTLY prohibited. If you are not the
intended recipient, please contact the sender by reply e-mail and
destroy all copies of the original message.

On Aug 31, 2015, at 7:29 PM, Christian Balzer  wrote:


Hello,

On Mon, 31 Aug 2015 12:28:15 -0500 Kenneth Van Alstyne wrote:

In addition to the spot on comments by Warren and Quentin, verify this by
watching your nodes with atop, iostat, etc.
The culprit (HDDs) should be plainly visible.

More inline:

Christian, et al:

Sorry for the lack of information.  I wasn’t sure what of our hardware
specifications or Ceph configuration was useful information at this
point.  Thanks for the feedback — any feedback, is appreciated at this
point, as I’ve been beating my head against a wall trying to figure out
what’s going on.  (If anything.  Maybe the spindle count is indeed our
upper limit or our SSDs really suck? :-) )

Your SSDs aren't the problem.

To directly address your questions, see answers below:
- CBT is the Ceph Benchmarking Tool.  Since my question was more
generic rather than with CBT itself, it was probably more useful to post
in the ceph-users list rather than cbt.
- 8 Cores are from 2x quad core Intel(R) Xeon(R) CPU E5-2609 0 @
2.40GHz
Not your problem either.

- The SSDs are indeed Intel S3500s.  I agree — not ideal, but
supposedly capable of up to 75,000 random 4KB reads/writes.  Throughput
and longevity is quite low for an SSD, rated at about 400MB/s reads and
100MB/s writes, though.  When we added these as journals in front of the
SATA spindles, both VM performance and rados benchmark numbers were
relatively unchanged.

The only thing relevant in regards to journal SSDs is the sequential write
speed (SYNC), they don't seek and normally don't get read either.
This is why a 200GB DC S3700 is a better journal SSD than the 200GB S3710
which is faster in 

Re: [ceph-users] Ceph Performance Questions with rbd images access by qemu-kvm

2015-09-01 Thread Wang, Warren
Be selective with the SSDs you choose. I personally have tried Micron M500DC, 
Intel S3500, and some PCIE cards that would all suffice. There are MANY that do 
not work well at all. A shockingly large list, in fact.

Intel 3500/3700 are the gold standards.

Warren

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Kenneth Van Alstyne
Sent: Tuesday, September 01, 2015 12:50 PM
To: Robert LeBlanc 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph Performance Questions with rbd images access by 
qemu-kvm

Got it — I’ll keep that in mind. That may just be what I need to “get by” for 
now.  Ultimately, we’re looking to buy at least three nodes of servers that can 
hold 40+ OSDs backed by 2TB+ SATA disks,

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
Service-Disabled Veteran-Owned Business
1775 Wiehle Avenue Suite 101 | Reston, VA 20190
c: 228-547-8045 f: 571-266-3106
www.knightpoint.com
DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
GSA Schedule 70 SDVOSB: GS-35F-0646S
GSA MOBIS Schedule: GS-10F-0404Y
ISO 2 / ISO 27001

Notice: This e-mail message, including any attachments, is for the sole use of 
the intended recipient(s) and may contain confidential and privileged 
information. Any unauthorized review, copy, use, disclosure, or distribution is 
STRICTLY prohibited. If you are not the intended recipient, please contact the 
sender by reply e-mail and destroy all copies of the original message.

On Sep 1, 2015, at 11:26 AM, Robert LeBlanc <rob...@leblancnet.us> wrote:


-BEGIN PGP SIGNED MESSAGE-

Hash: SHA256



Just swapping out spindles for SSD will not give you orders of magnitude 
performance gains as it does in regular cases. This is because Ceph has a lot 
of overhead for each I/O which limits the performance of the SSDs. In my 
testing, two Intel S3500 SSDs with an 8 core Atom (Intel(R) Atom(TM) CPU  C2750 
 @ 2.40GHz) and size=1 and fio with 8 jobs and QD=8 sync,direct 4K read/writes 
produced 2,600 IOPs. Don't get me wrong, it will help, but don't expect 
spectacular results.



- 

Robert LeBlanc

PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1



On Tue, Sep 1, 2015 at 8:01 AM, Kenneth Van Alstyne  wrote:

Thanks for the awesome advice folks.  Until I can go larger scale (50+ SATA 
disks), I’m thinking my best option here is to just swap out these 1TB SATA 
disks with 1TB SSDs.  Am I oversimplifying the short term solution?



Thanks,



- --

Kenneth Van Alstyne

Systems Architect

Knight Point Systems, LLC

Service-Disabled Veteran-Owned Business

1775 Wiehle Avenue Suite 101 | Reston, VA 20190

c: 228-547-8045 f: 571-266-3106

www.knightpoint.com

DHS EAGLE II Prime Contractor: FC1 SDVOSB Track

GSA Schedule 70 SDVOSB: GS-35F-0646S

GSA MOBIS Schedule: GS-10F-0404Y

ISO 2 / ISO 27001



Notice: This e-mail message, including any attachments, is for the sole use of 
the intended recipient(s) and may contain confidential and privileged 
information. Any unauthorized review, copy, use, disclosure, or distribution is 
STRICTLY prohibited. If you are not the intended recipient, please contact the 
sender by reply e-mail and destroy all copies of the original message.



On Aug 31, 2015, at 7:29 PM, Christian Balzer  wrote:





Hello,



On Mon, 31 Aug 2015 12:28:15 -0500 Kenneth Van Alstyne wrote:



In addition to the spot on comments by Warren and Quentin, verify this by

watching your nodes with atop, iostat, etc.

The culprit (HDDs) should be plainly visible.



More inline:



Christian, et al:



Sorry for the lack of information.  I wasn’t sure what of our hardware

specifications or Ceph configuration was useful information at this

point.  Thanks for the feedback — any feedback, is appreciated at this

point, as I’ve been beating my head against a wall trying to figure out

what’s going on.  (If anything.  Maybe the spindle count is indeed our

upper limit or our SSDs really suck? :-) )



Your SSDs aren't the problem.



To directly address your questions, see answers below:

  - CBT is the Ceph Benchmarking Tool.  Since my question was more

generic rather than with CBT itself, it was probably more useful to post

in the ceph-users list rather than cbt.

  - 8 Cores are from 2x quad core Intel(R) Xeon(R) CPU E5-2609 0 @

2.40GHz

Not your problem either.



  - The SSDs are indeed Intel S3500s.  I agree — not ideal, but

supposedly capable of up to 75,000 random 4KB reads/writes.  Throughput

and longevity is quite low for an SSD, rated at about 400MB/s reads and

100MB/s writes, though.  When we added these as journals in front of the

SATA spindles, both VM performance and rados benchmark numbers were

relatively unchanged.



The only thing relevant in regards to journal SSDs is the sequential write

speed (SYNC), they don't seek and normally don't get read either.

This is why a 200GB DC S3700 is a better

Re: [ceph-users] Troubleshooting rgw bucket list

2015-09-01 Thread Sam Wouters
It looks like it, this is what shows in the logs after bumping the debug
and requesting a bucket list.
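
A quick count of how often the same call repeats, for what it's worth
(the log path is just an example, it depends on where your gateway logs):

  grep -c 'call rgw.bucket_list' /var/log/ceph/radosgw.log

The excerpt itself: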

2015-09-01 17:14:53.008620 7fccb17ca700 10 cls_bucket_list
aws-cmis-prod(@{i=.be-east.rgw.buckets.index}.be-east.rgw.buckets[be-east.5436.1])
start
abc_econtract/data/6shflrwbwwcm6dsemrpjit2li3v913iad1EZQ3.S6Prb-NXLvfQRlaWC5nBYp5[]
num_entries 1
2015-09-01 17:14:53.008629 7fccb17ca700 20 reading from
.be-east.rgw:.bucket.meta.aws-cmis-prod:be-east.5436.1
2015-09-01 17:14:53.008636 7fccb17ca700 20 get_obj_state:
rctx=0x7fccb17c84d0
obj=.be-east.rgw:.bucket.meta.aws-cmis-prod:be-east.5436.1
state=0x7fcde01a4060 s->prefetch_data=0
2015-09-01 17:14:53.008640 7fccb17ca700 10 cache get:
name=.be-east.rgw+.bucket.meta.aws-cmis-prod:be-east.5436.1 : hit
2015-09-01 17:14:53.008645 7fccb17ca700 20 get_obj_state: s->obj_tag was
set empty
2015-09-01 17:14:53.008647 7fccb17ca700 10 cache get:
name=.be-east.rgw+.bucket.meta.aws-cmis-prod:be-east.5436.1 : hit
2015-09-01 17:14:53.008675 7fccb17ca700  1 -- 10.11.4.105:0/1109243 -->
10.11.4.105:6801/39085 -- osd_op(client.55506.0:435874
.dir.be-east.5436.1 [call rgw.bucket_list] 26.7d78fc84
ack+read+known_if_redirected e255) v5 -- ?+0 0x7fcde01a0540 con 0x3a2d870
2015-09-01 17:14:53.009136 7fccb17ca700 10 cls_bucket_list
aws-cmis-prod(@{i=.be-east.rgw.buckets.index}.be-east.rgw.buckets[be-east.5436.1])
start
abc_econtract/data/6shflrwbwwcm6dsemrpjit2li3v913iad1EZQ3.S6Prb-NXLvfQRlaWC5nBYp5[]
num_entries 1
2015-09-01 17:14:53.009146 7fccb17ca700 20 reading from
.be-east.rgw:.bucket.meta.aws-cmis-prod:be-east.5436.1
2015-09-01 17:14:53.009153 7fccb17ca700 20 get_obj_state:
rctx=0x7fccb17c84d0
obj=.be-east.rgw:.bucket.meta.aws-cmis-prod:be-east.5436.1
state=0x7fcde01a4060 s->prefetch_data=0
2015-09-01 17:14:53.009158 7fccb17ca700 10 cache get:
name=.be-east.rgw+.bucket.meta.aws-cmis-prod:be-east.5436.1 : hit
2015-09-01 17:14:53.009163 7fccb17ca700 20 get_obj_state: s->obj_tag was
set empty
2015-09-01 17:14:53.009165 7fccb17ca700 10 cache get:
name=.be-east.rgw+.bucket.meta.aws-cmis-prod:be-east.5436.1 : hit
2015-09-01 17:14:53.009189 7fccb17ca700  1 -- 10.11.4.105:0/1109243 -->
10.11.4.105:6801/39085 -- osd_op(client.55506.0:435876
.dir.be-east.5436.1 [call rgw.bucket_list] 26.7d78fc84
ack+read+known_if_redirected e255) v5 -- ?+0 0x7fcde01a0540 con 0x3a2d870
2015-09-01 17:14:53.009629 7fccb17ca700 10 cls_bucket_list
aws-cmis-prod(@{i=.be-east.rgw.buckets.index}.be-east.rgw.buckets[be-east.5436.1])
start
abc_econtract/data/6shflrwbwwcm6dsemrpjit2li3v913iad1EZQ3.S6Prb-NXLvfQRlaWC5nBYp5[]
num_entries 1
2015-09-01 17:14:53.009638 7fccb17ca700 20 reading from
.be-east.rgw:.bucket.meta.aws-cmis-prod:be-east.5436.1
2015-09-01 17:14:53.009645 7fccb17ca700 20 get_obj_state:
rctx=0x7fccb17c84d0
obj=.be-east.rgw:.bucket.meta.aws-cmis-prod:be-east.5436.1
state=0x7fcde01a4060 s->prefetch_data=0
2015-09-01 17:14:53.009651 7fccb17ca700 10 cache get:
name=.be-east.rgw+.bucket.meta.aws-cmis-prod:be-east.5436.1 : hit
2015-09-01 17:14:53.009655 7fccb17ca700 20 get_obj_state: s->obj_tag was
set empty
2015-09-01 17:14:53.009657 7fccb17ca700 10 cache get:
name=.be-east.rgw+.bucket.meta.aws-cmis-prod:be-east.5436.1 : hit
2015-09-01 17:14:53.009681 7fccb17ca700  1 -- 10.11.4.105:0/1109243 -->
10.11.4.105:6801/39085 -- osd_op(client.55506.0:435878
.dir.be-east.5436.1 [call rgw.bucket_list] 26.7d78fc84
ack+read+known_if_redirected e255) v5 -- ?+0 0x7fcde01a0540 con 0x3a2d870
2015-09-01 17:14:53.010139 7fccb17ca700 10 cls_bucket_list
aws-cmis-prod(@{i=.be-east.rgw.buckets.index}.be-east.rgw.buckets[be-east.5436.1])
start
abc_econtract/data/6shflrwbwwcm6dsemrpjit2li3v913iad1EZQ3.S6Prb-NXLvfQRlaWC5nBYp5[]
num_entries 1
2015-09-01 17:14:53.010149 7fccb17ca700 20 reading from
.be-east.rgw:.bucket.meta.aws-cmis-prod:be-east.5436.1
2015-09-01 17:14:53.010156 7fccb17ca700 20 get_obj_state:
rctx=0x7fccb17c84d0
obj=.be-east.rgw:.bucket.meta.aws-cmis-prod:be-east.5436.1
state=0x7fcde01a4060 s->prefetch_data=0
2015-09-01 17:14:53.010161 7fccb17ca700 10 cache get:
name=.be-east.rgw+.bucket.meta.aws-cmis-prod:be-east.5436.1 : hit
2015-09-01 17:14:53.010166 7fccb17ca700 20 get_obj_state: s->obj_tag was
set empty
2015-09-01 17:14:53.010168 7fccb17ca700 10 cache get:
name=.be-east.rgw+.bucket.meta.aws-cmis-prod:be-east.5436.1 : hit
2015-09-01 17:14:53.010192 7fccb17ca700  1 -- 10.11.4.105:0/1109243 -->
10.11.4.105:6801/39085 -- osd_op(client.55506.0:435880
.dir.be-east.5436.1 [call rgw.bucket_list] 26.7d78fc84
ack+read+known_if_redirected e255) v5 -- ?+0 0x7fcde01a0540 con 0x3a2d870

On 01-09-15 17:11, Yehuda Sadeh-Weinraub wrote:
> Can you bump up debug (debug rgw = 20, debug ms = 1), and see if the
> operations (bucket listing and bucket check) go into some kind of
> infinite loop?
>
> Yehuda
>
> On Tue, Sep 1, 2015 at 1:16 AM, Sam Wouters  wrote:
>> Hi, I've started the bucket --check --fix on friday evening and it's
>> still running. 'ceph -s' shows the cluster health as OK, I don't know if
>> there is anything el

Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread German Anders
Thanks a lot for the quick response, Robert. Any idea when it's going to be
ready for production? Any alternative solution with similar performance?

Best regards,


*German *

2015-09-01 13:42 GMT-03:00 Robert LeBlanc :

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> Accelio and Ceph are still in heavy development and not ready for production.
>
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
> On Tue, Sep 1, 2015 at 10:31 AM, German Anders  wrote:
> Hi cephers,
>
>  I would like to know the status for production-ready of Accelio & Ceph, does 
> anyone had a home-made procedure implemented with Ubuntu?
>
> recommendations, comments?
>
> Thanks in advance,
>
> Best regards,
>
> German
>


Re: [ceph-users] Troubleshooting rgw bucket list

2015-09-01 Thread Sam Wouters
Not sure where I can find the logs for the bucket check; I can't really
filter them out in the radosgw log.

-Sam

On 01-09-15 19:25, Sam Wouters wrote:
> It looks like it, this is what shows in the logs after bumping the debug
> and requesting a bucket list.
>
> 2015-09-01 17:14:53.008620 7fccb17ca700 10 cls_bucket_list
> aws-cmis-prod(@{i=.be-east.rgw.buckets.index}.be-east.rgw.buckets[be-east.5436.1])
> start
> abc_econtract/data/6shflrwbwwcm6dsemrpjit2li3v913iad1EZQ3.S6Prb-NXLvfQRlaWC5nBYp5[]
> num_entries 1
> 2015-09-01 17:14:53.008629 7fccb17ca700 20 reading from
> .be-east.rgw:.bucket.meta.aws-cmis-prod:be-east.5436.1
> 2015-09-01 17:14:53.008636 7fccb17ca700 20 get_obj_state:
> rctx=0x7fccb17c84d0
> obj=.be-east.rgw:.bucket.meta.aws-cmis-prod:be-east.5436.1
> state=0x7fcde01a4060 s->prefetch_data=0
> 2015-09-01 17:14:53.008640 7fccb17ca700 10 cache get:
> name=.be-east.rgw+.bucket.meta.aws-cmis-prod:be-east.5436.1 : hit
> 2015-09-01 17:14:53.008645 7fccb17ca700 20 get_obj_state: s->obj_tag was
> set empty
> 2015-09-01 17:14:53.008647 7fccb17ca700 10 cache get:
> name=.be-east.rgw+.bucket.meta.aws-cmis-prod:be-east.5436.1 : hit
> 2015-09-01 17:14:53.008675 7fccb17ca700  1 -- 10.11.4.105:0/1109243 -->
> 10.11.4.105:6801/39085 -- osd_op(client.55506.0:435874
> ...
> .dir.be-east.5436.1 [call rgw.bucket_list] 26.7d78fc84
> ack+read+known_if_redirected e255) v5 -- ?+0 0x7fcde01a0540 con 0x3a2d870
>
> On 01-09-15 17:11, Yehuda Sadeh-Weinraub wrote:
>> Can you bump up debug (debug rgw = 20, debug ms = 1), and see if the
>> operations (bucket listing and bucket check) go into some kind of
>> infinite loop?
>>
>> Yehuda
>>
>> On Tue, Sep 1, 2015 at 1:16 AM, Sam Wouters  wrote:
>>> Hi, I've started the bucket --check --fix on friday evening and it's
>>> still running. 'ceph -s' shows the cluster health as OK, I don't know if
>>> there is anything else I could check? Is there a way of finding out if
>>> its actually doing something?
>>>
>>> We only have this issue on the one bucket with versioning enabled, I
>>> can't get rid of the feeling it has something todo with that. The
>>> "underscore bug" is also still present on that bucket
>>> (http://tracker.ceph.com/issues/12819). Not sure if thats related in any
>>> way.
>>> Are there any alternatives, as for example copy all the objects into a
>>> new bucket without versioning? Simple way would be to list the objects
>>> and copy them to a new bucket, but bucket listing is not working so...
>>>
>>> -Sam
>>>
>>>
>>> On 31-08-15 10:47, Gregory Farnum wrote:
 This generally shouldn't be a problem at your bucket sizes. Have you
 checked that the cluster is actually in a healthy state? The sleeping
 locks are normal but should be getting woken up; if they aren't it
 means the object access isn't working for some reason. A down PG or
 something would be the simplest explanation.
 -Greg

 On Fri, Aug 28, 2015 at 6:52 PM, Sam Wouters  wrote:
> Ok, maybe I'm to impatient. It would be great if there were some verbose
> or progress logging of the radosgw-admin tool.
> I will start a check and let it run over the weekend.
>
> tnx,
> Sam
>
> On 28-08-15 18:16, Sam Wouters wrote:
>> Hi,
>>
>> this bucket only has 13389 objects, so the index size shouldn't be a
>> problem. Also, on the same cluster we have an other bucket with 1200543
>> objects (but no versioning configured), which has no issues.
>>
>> when we run a radosgw-admin bucket --check (--fix), nothing seems to be
>> happening. Putting an strace on the process shows a lot of lines like 
>> these:
>> [pid 99372] futex(0x2d730d4, FUTEX_WAIT_PRIVATE, 156619, NULL
>> 
>> [pid 99385] futex(0x2da9410, FUTEX_WAIT_PRIVATE, 2, NULL 
>> [pid 99371] futex(0x2da9410, FUTEX_WAKE_PRIVATE, 1 
>> [pid 99385] <... futex resumed> )   = -1 EAGAIN (Resource
>> temporarily unavailable)
>> [pid 99371] <... futex resumed> )   = 0
>>
>> but no errors in the ceph logs or health warnings.
>>
>> r,
>> Sam
>>
>> On 28-08-15 17:49, Ben Hines wrote:
>>> How many objects in the bucket?
>>>
>>> RGW has problems with index size once number of objects gets into the
>>> 90+ level. The buckets need to be recreated with 'sharded bucket
>>> indexes' on:
>>>
>>> rgw override bucket index max shards = 23
>>>
>>> You could also try repairing the index with:
>>>
>>>  radosgw-admin bucket check --fix --bucket=
>>>
>>> -Ben
>>>
>>> On Fri, Aug 28, 2015 at 8:38 AM, Sam Wouters  wrote:
 Hi,

 we have a rgw bucket (with versioning) where PUT and GET operations for
 specific objects succeed,  but retrieving an object list fails.
 Using python-boto, after a timeout just gives us an 500 internal error;
 radosgw-admin just hangs.
 Also a radosgw-admin bucket check just seems to hang...
>

Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread Somnath Roy
Hi German,
We are working on making it production-ready ASAP. As you know, RDMA is very
resource-constrained, but at the same time it will outperform TCP. There will
be a definite trade-off between cost and performance.
We are short on data about how big an RDMA deployment could be, so it would be
really helpful if you could give us some idea of how you are planning to deploy
it (i.e. how many nodes/OSDs, SSDs or HDDs, EC or replication, etc.).

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of German 
Anders
Sent: Tuesday, September 01, 2015 10:39 AM
To: Robert LeBlanc
Cc: ceph-users
Subject: Re: [ceph-users] Accelio & Ceph

Thanks a lot for the quick response Robert, any idea when it's going to be 
ready for production? any alternative solution for similar-performance?
Best regards,

German

2015-09-01 13:42 GMT-03:00 Robert LeBlanc 
mailto:rob...@leblancnet.us>>:

-BEGIN PGP SIGNED MESSAGE-

Hash: SHA256



Accelio and Ceph are still in heavy development and not ready for production.



- 

Robert LeBlanc

PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1



On Tue, Sep 1, 2015 at 10:31 AM, German Anders  wrote:

Hi cephers,



 I would like to know the status for production-ready of Accelio & Ceph, does 
anyone had a home-made procedure implemented with Ubuntu?



recommendations, comments?



Thanks in advance,



Best regards,



German





Re: [ceph-users] Troubleshooting rgw bucket list

2015-09-01 Thread Yehuda Sadeh-Weinraub
I assume you filtered the log by thread? I don't see the response
messages. For the bucket check you can run radosgw-admin with
--log-to-stderr.
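
(Something along these lines should do it; the bucket name is a placeholder and
the client id is just an example, use whatever -n you run the gateway with:)

$ radosgw-admin bucket check --fix --bucket=<bucket> \
      -n client.radosgw.be-east-1 --log-to-stderr --debug-rgw=20 --debug-ms=1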

Can you also set 'debug objclass = 20' on the osds? You can do it by:

$ ceph tell osd.\* injectargs --debug-objclass 20

Also, it'd be interesting to get the following:

$ radosgw-admin bi list --bucket=
--object=abc_econtract/data/6shflrwbwwcm6dsemrpjit2li3v913iad1EZQ3.S6Prb-NXLvfQRlaWC5nBYp5


Thanks,
Yehuda

On Tue, Sep 1, 2015 at 10:44 AM, Sam Wouters  wrote:
> not sure where I can find the logs for the bucket check, I can't really
> filter them out in the radosgw log.
>
> -Sam
>
> On 01-09-15 19:25, Sam Wouters wrote:
>> It looks like it, this is what shows in the logs after bumping the debug
>> and requesting a bucket list.
>>
>> 2015-09-01 17:14:53.008620 7fccb17ca700 10 cls_bucket_list
>> aws-cmis-prod(@{i=.be-east.rgw.buckets.index}.be-east.rgw.buckets[be-east.5436.1])
>> start
>> abc_econtract/data/6shflrwbwwcm6dsemrpjit2li3v913iad1EZQ3.S6Prb-NXLvfQRlaWC5nBYp5[]
>> num_entries 1
>> 2015-09-01 17:14:53.008629 7fccb17ca700 20 reading from
>> .be-east.rgw:.bucket.meta.aws-cmis-prod:be-east.5436.1
>> 2015-09-01 17:14:53.008636 7fccb17ca700 20 get_obj_state:
>> rctx=0x7fccb17c84d0
>> obj=.be-east.rgw:.bucket.meta.aws-cmis-prod:be-east.5436.1
>> state=0x7fcde01a4060 s->prefetch_data=0
>> 2015-09-01 17:14:53.008640 7fccb17ca700 10 cache get:
>> name=.be-east.rgw+.bucket.meta.aws-cmis-prod:be-east.5436.1 : hit
>> 2015-09-01 17:14:53.008645 7fccb17ca700 20 get_obj_state: s->obj_tag was
>> set empty
>> 2015-09-01 17:14:53.008647 7fccb17ca700 10 cache get:
>> name=.be-east.rgw+.bucket.meta.aws-cmis-prod:be-east.5436.1 : hit
>> 2015-09-01 17:14:53.008675 7fccb17ca700  1 -- 10.11.4.105:0/1109243 -->
>> 10.11.4.105:6801/39085 -- osd_op(client.55506.0:435874
>> ...
>> .dir.be-east.5436.1 [call rgw.bucket_list] 26.7d78fc84
>> ack+read+known_if_redirected e255) v5 -- ?+0 0x7fcde01a0540 con 0x3a2d870
>>
>> On 01-09-15 17:11, Yehuda Sadeh-Weinraub wrote:
>>> Can you bump up debug (debug rgw = 20, debug ms = 1), and see if the
>>> operations (bucket listing and bucket check) go into some kind of
>>> infinite loop?
>>>
>>> Yehuda
>>>
>>> On Tue, Sep 1, 2015 at 1:16 AM, Sam Wouters  wrote:
 Hi, I've started the bucket --check --fix on friday evening and it's
 still running. 'ceph -s' shows the cluster health as OK, I don't know if
 there is anything else I could check? Is there a way of finding out if
 its actually doing something?

 We only have this issue on the one bucket with versioning enabled, I
 can't get rid of the feeling it has something todo with that. The
 "underscore bug" is also still present on that bucket
 (http://tracker.ceph.com/issues/12819). Not sure if thats related in any
 way.
 Are there any alternatives, as for example copy all the objects into a
 new bucket without versioning? Simple way would be to list the objects
 and copy them to a new bucket, but bucket listing is not working so...

 -Sam


 On 31-08-15 10:47, Gregory Farnum wrote:
> This generally shouldn't be a problem at your bucket sizes. Have you
> checked that the cluster is actually in a healthy state? The sleeping
> locks are normal but should be getting woken up; if they aren't it
> means the object access isn't working for some reason. A down PG or
> something would be the simplest explanation.
> -Greg
>
> On Fri, Aug 28, 2015 at 6:52 PM, Sam Wouters  wrote:
>> Ok, maybe I'm to impatient. It would be great if there were some verbose
>> or progress logging of the radosgw-admin tool.
>> I will start a check and let it run over the weekend.
>>
>> tnx,
>> Sam
>>
>> On 28-08-15 18:16, Sam Wouters wrote:
>>> Hi,
>>>
>>> this bucket only has 13389 objects, so the index size shouldn't be a
>>> problem. Also, on the same cluster we have an other bucket with 1200543
>>> objects (but no versioning configured), which has no issues.
>>>
>>> when we run a radosgw-admin bucket --check (--fix), nothing seems to be
>>> happening. Putting an strace on the process shows a lot of lines like 
>>> these:
>>> [pid 99372] futex(0x2d730d4, FUTEX_WAIT_PRIVATE, 156619, NULL
>>> 
>>> [pid 99385] futex(0x2da9410, FUTEX_WAIT_PRIVATE, 2, NULL >> ...>
>>> [pid 99371] futex(0x2da9410, FUTEX_WAKE_PRIVATE, 1 
>>> [pid 99385] <... futex resumed> )   = -1 EAGAIN (Resource
>>> temporarily unavailable)
>>> [pid 99371] <... futex resumed> )   = 0
>>>
>>> but no errors in the ceph logs or health warnings.
>>>
>>> r,
>>> Sam
>>>
>>> On 28-08-15 17:49, Ben Hines wrote:
 How many objects in the bucket?

 RGW has problems with index size once number of objects gets into the
 90+ level. The buckets need to be recreated with 'sharded bucket
 indexes' o

Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread German Anders
Hi Roy,

   I understand, we are looking for using accelio with an starting small
cluster of 3 mon and 8 osd servers:

3x MON servers
   2x Intel Xeon E5-2630v3 @2.40Ghz (32C with HT)
   24x 16GB DIMM DDR3 1333Mhz (384GB)
   2x 120GB Intel SSD DC S3500 (RAID-1 for OS)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP

4x OSD servers
   2x Intel Xeon E5-2609v2 @2.50Ghz (8C)
   8x 16GB DIMM DDR3 1333Mhz (128GB)
   2x 120GB Intel SSD DC SC3500 (RAID-1 for OS)
   3x 120GB Intel SSD DC SC3500 (Journals)
   4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
   5x 3TB SAS (OSD-SAS-POOL)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP

4x OSD servers
   2x Intel Xeon E5-2650v2 @2.60Ghz (32C with HT)
   8x 16GB DIMM DDR3 1866Mhz (128GB)
   2x 200GB Intel SSD DC S3700 (RAID-1 for OS)
   3x 200GB Intel SSD DC S3700 (Journals)
   4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
   5x 3TB SAS (OSD-SAS-POOL)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP

and we are thinking of using the *infernalis v9.0.0* or *hammer* release.
Comments? Recommendations?


*German*

2015-09-01 14:46 GMT-03:00 Somnath Roy :

> Hi German,
>
> We are working on to make it production ready ASAP. As you know RDMA is
> very resource constrained and at the same time will outperform TCP as well.
> There will be some definite tradeoff between cost Vs Performance.
>
> We are lacking on ideas on how big the RDMA deployment could be and it
> will be really helpful if you can give some idea on how you are planning to
> deploy that (i.e how many nodes/OSDs/SSD or HDDs/ EC or Replication etc.
> etc.).
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *German Anders
> *Sent:* Tuesday, September 01, 2015 10:39 AM
> *To:* Robert LeBlanc
> *Cc:* ceph-users
> *Subject:* Re: [ceph-users] Accelio & Ceph
>
>
>
> Thanks a lot for the quick response Robert, any idea when it's going to be
> ready for production? any alternative solution for similar-performance?
>
> Best regards,
>
>
> *German *
>
>
>
> 2015-09-01 13:42 GMT-03:00 Robert LeBlanc :
>
> -BEGIN PGP SIGNED MESSAGE-
>
> Hash: SHA256
>
>
>
> Accelio and Ceph are still in heavy development and not ready for production.
>
>
>
> - 
>
> Robert LeBlanc
>
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
>
> On Tue, Sep 1, 2015 at 10:31 AM, German Anders  wrote:
>
> Hi cephers,
>
>
>
>  I would like to know the status for production-ready of Accelio & Ceph, does 
> anyone had a home-made procedure implemented with Ubuntu?
>
>
>
> recommendations, comments?
>
>
>
> Thanks in advance,
>
>
>
> Best regards,
>
>
>
> German
>
>
>


Re: [ceph-users] Troubleshooting rgw bucket list

2015-09-01 Thread Sam Wouters
Hi,

see inline

On 01-09-15 20:14, Yehuda Sadeh-Weinraub wrote:
> I assume you filtered the log by thread? I don't see the response
> messages. For the bucket check you can run radosgw-admin with
> --log-to-stderr.
nothing is logged to the console when I do that
>
> Can you also set 'debug objclass = 20' on the osds? You can do it by:
>
> $ ceph tell osd.\* injectargs --debug-objclass 20
this continuously prints "20  cls/rgw/cls_rgw.cc:460: entry
abc_econtract/data/6smuz2ysavvxbygng34tgusyse[] is not valid" on osd.0
>
> Also, it'd be interesting to get the following:
>
> $ radosgw-admin bi list --bucket=
> --object=abc_econtract/data/6shflrwbwwcm6dsemrpjit2li3v913iad1EZQ3.S6Prb-NXLvfQRlaWC5nBYp5
this gives me an empty array:
[
]
but we did a trim of the bilog a while ago cause a lot entries regarding
objects that were already removed from the bucket kept on syncing with
the sync agent, causing a lot of delete_markers at the replication site.

The object in the error above from the osd log, gives the following:
# radosgw-admin --log-to-stderr -n client.radosgw.be-east-1 bi list
--bucket=aws-cmis-prod
--object=abc_econtract/data/6smuz2ysavvxbygng34tgusyse
[
{
"type": "plain",
"idx": "abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse",
"entry": {
"name": "abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse",
"instance": "",
"ver": {
"pool": -1,
"epoch": 0
},
"locator": "",
"exists": "false",
"meta": {
"category": 0,
"size": 0,
"mtime": "0.00",
"etag": "",
"owner": "",
"owner_display_name": "",
"content_type": "",
"accounted_size": 0
},
"tag": "",
"flags": 8,
"pending_map": [],
"versioned_epoch": 0
}
},
{
"type": "plain",
"idx":
"abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse\uv913\uiRQZUR76UdeymR-PGaw6sbCHMCOcaovu",
"entry": {
"name": "abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse",
"instance": "RQZUR76UdeymR-PGaw6sbCHMCOcaovu",
"ver": {
"pool": 23,
"epoch": 9680
},
"locator": "",
"exists": "true",
"meta": {
"category": 1,
"size": 103410,
"mtime": "2015-08-07 17:57:32.00Z",
"etag": "6c67f5e6cb4aa63f4fa26a3b94d19d3a",
"owner": "aws-cmis-prod",
"owner_display_name": "AWS-CMIS prod user",
"content_type": "application\/pdf",
"accounted_size": 103410
},
"tag": "be-east.34319.4520377",
"flags": 3,
"pending_map": [],
"versioned_epoch": 2
}
},
{
"type": "instance",
"idx":
"�1000_abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse\uiRQZUR76UdeymR-PGaw6sbCHMCOcaovu",
"entry": {
"name": "abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse",
"instance": "RQZUR76UdeymR-PGaw6sbCHMCOcaovu",
"ver": {
"pool": 23,
"epoch": 9680
},
"locator": "",
"exists": "true",
"meta": {
"category": 1,
"size": 103410,
"mtime": "2015-08-07 17:57:32.00Z",
"etag": "6c67f5e6cb4aa63f4fa26a3b94d19d3a",
"owner": "aws-cmis-prod",
"owner_display_name": "AWS-CMIS prod user",
"content_type": "application\/pdf",
"accounted_size": 103410
},
"tag": "be-east.34319.4520377",
"flags": 3,
"pending_map": [],
"versioned_epoch": 2
}
},
{
"type": "olh",
"idx": "�1001_abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse",
"entry": {
"key": {
"name": "abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse",
"instance": "RQZUR76UdeymR-PGaw6sbCHMCOcaovu"
},
"delete_marker": "false",
"epoch": 2,
"pending_log": [],
"tag": "3ejreihlq1045d212goxvdlry31nbdde",
"exists": "true",
"pending_removal": "false"
}
}

]
>
>
> Thanks,
> Yehuda
much appreciating the care...
Sam
>
> On Tue, Sep 1, 2015 at 10:44 AM, Sam Wouters  wrote:
>> not sure where I can find the logs for the bucket check, I can't really
>> filter them out in the radosgw log.
>>
>> -Sam
>>
>> On 01-09-15 19:25, Sam Wouters wrote:
>>> It looks like it, this is what shows in the logs after bumping the debug
>>> and requesting a bucket list.
>>>
>>> 2015-09-01 17:14:53.008620 7fccb17ca700 10 cls_bucket_list
>>> aws-cmis-prod(@{i=.be-east

Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread Somnath Roy
Thanks !
I think you should try installing from Ceph mainline; some bug fixes went in
after Hammer (not sure if they were backported).
I would suggest trying 1 drive -> 1 OSD first, since at present we have seen some
stability issues (mainly due to resource constraints) with more OSDs in a box.
Another point is that the installation itself is not straightforward. You will
probably need to build all the components yourself; not sure if it is added as a
git submodule or not. Vu, could you please confirm?

Since we are working to make this solution work at scale, could you please give
us some idea of the scale you are looking at for future deployment?

Regards
Somnath

From: German Anders [mailto:gand...@despegar.com]
Sent: Tuesday, September 01, 2015 11:19 AM
To: Somnath Roy
Cc: Robert LeBlanc; ceph-users
Subject: Re: [ceph-users] Accelio & Ceph

Hi Roy,
   I understand, we are looking for using accelio with an starting small 
cluster of 3 mon and 8 osd servers:
3x MON servers
   2x Intel Xeon E5-2630v3 @2.40Ghz (32C with HT)
   24x 16GB DIMM DDR3 1333Mhz (384GB)
   2x 120GB Intel SSD DC S3500 (RAID-1 for OS)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
4x OSD servers
   2x Intel Xeon E5-2609v2 @2.50Ghz (8C)
   8x 16GB DIMM DDR3 1333Mhz (128GB)
   2x 120GB Intel SSD DC SC3500 (RAID-1 for OS)
   3x 120GB Intel SSD DC SC3500 (Journals)
   4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
   5x 3TB SAS (OSD-SAS-POOL)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
4x OSD servers
   2x Intel Xeon E5-2650v2 @2.60Ghz (32C with HT)
   8x 16GB DIMM DDR3 1866Mhz (128GB)
   2x 200GB Intel SSD DC S3700 (RAID-1 for OS)
   3x 200GB Intel SSD DC S3700 (Journals)
   4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
   5x 3TB SAS (OSD-SAS-POOL)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
and thinking of using infernalis v.9.0.0 or hammer release? comments? 
recommendations?

German

2015-09-01 14:46 GMT-03:00 Somnath Roy 
mailto:somnath@sandisk.com>>:
Hi German,
We are working on to make it production ready ASAP. As you know RDMA is very 
resource constrained and at the same time will outperform TCP as well. There 
will be some definite tradeoff between cost Vs Performance.
We are lacking on ideas on how big the RDMA deployment could be and it will be 
really helpful if you can give some idea on how you are planning to deploy that 
(i.e how many nodes/OSDs/SSD or HDDs/ EC or Replication etc. etc.).

Thanks & Regards
Somnath

From: ceph-users 
[mailto:ceph-users-boun...@lists.ceph.com]
 On Behalf Of German Anders
Sent: Tuesday, September 01, 2015 10:39 AM
To: Robert LeBlanc
Cc: ceph-users
Subject: Re: [ceph-users] Accelio & Ceph

Thanks a lot for the quick response Robert, any idea when it's going to be 
ready for production? any alternative solution for similar-performance?
Best regards,

German

2015-09-01 13:42 GMT-03:00 Robert LeBlanc 
mailto:rob...@leblancnet.us>>:

-BEGIN PGP SIGNED MESSAGE-

Hash: SHA256



Accelio and Ceph are still in heavy development and not ready for production.



- 

Robert LeBlanc

PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1



On Tue, Sep 1, 2015 at 10:31 AM, German Anders  wrote:

Hi cephers,



 I would like to know the status for production-ready of Accelio & Ceph, does 
anyone had a home-made procedure implemented with Ubuntu?



recommendations, comments?



Thanks in advance,



Best regards,



German




Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread German Anders
Thanks Roy, we're planning to grow this cluster if we can get the performance
that we need. The idea is to run non-relational databases here, so it would be
very I/O-intensive. In terms of growth, we are talking about 40-50 OSD servers
with no more than 6 OSD daemons per server. If you have any hints or docs out
there on how to compile Ceph with Accelio, that would be awesome.


*German*

2015-09-01 15:31 GMT-03:00 Somnath Roy :

> Thanks !
>
> I think you should try installing from the ceph mainstream..There are some
> bug fixes went on after Hammer (not sure if it is backported)..
>
> I would say try with 1 drive -> 1 OSD first since presently we have seen
> some stability issues (mainly due to resource constraint) with more OSDs in
> a box.
>
> The another point is, installation itself is not straight forward. You
> need to build all the components probably, not sure if it is added as git
> submodule or not, Vu , could you please confirm ?
>
>
>
> Since we are working to make this solution work at scale, could you please
> give us some idea what is the scale you are looking at for future
> deployment ?
>
>
>
> Regards
>
> Somnath
>
>
>
> *From:* German Anders [mailto:gand...@despegar.com]
> *Sent:* Tuesday, September 01, 2015 11:19 AM
> *To:* Somnath Roy
> *Cc:* Robert LeBlanc; ceph-users
>
> *Subject:* Re: [ceph-users] Accelio & Ceph
>
>
>
> Hi Roy,
>
>I understand, we are looking for using accelio with an starting small
> cluster of 3 mon and 8 osd servers:
>
> 3x MON servers
>
>2x Intel Xeon E5-2630v3 @2.40Ghz (32C with HT)
>
>24x 16GB DIMM DDR3 1333Mhz (384GB)
>
>2x 120GB Intel SSD DC S3500 (RAID-1 for OS)
>
>1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
>
> 4x OSD servers
>
>2x Intel Xeon E5-2609v2 @2.50Ghz (8C)
>
>8x 16GB DIMM DDR3 1333Mhz (128GB)
>
>2x 120GB Intel SSD DC SC3500 (RAID-1 for OS)
>
>3x 120GB Intel SSD DC SC3500 (Journals)
>
>4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
>
>5x 3TB SAS (OSD-SAS-POOL)
>
>1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
>
> 4x OSD servers
>
>2x Intel Xeon E5-2650v2 @2.60Ghz (32C with HT)
>
>8x 16GB DIMM DDR3 1866Mhz (128GB)
>
>2x 200GB Intel SSD DC S3700 (RAID-1 for OS)
>
>3x 200GB Intel SSD DC S3700 (Journals)
>
>4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
>
>5x 3TB SAS (OSD-SAS-POOL)
>
>1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
>
> and thinking of using *infernalis v.9.0.0* or *hammer* release? comments?
> recommendations?
>
>
> *German*
>
>
>
> 2015-09-01 14:46 GMT-03:00 Somnath Roy :
>
> Hi German,
>
> We are working on to make it production ready ASAP. As you know RDMA is
> very resource constrained and at the same time will outperform TCP as well.
> There will be some definite tradeoff between cost Vs Performance.
>
> We are lacking on ideas on how big the RDMA deployment could be and it
> will be really helpful if you can give some idea on how you are planning to
> deploy that (i.e how many nodes/OSDs/SSD or HDDs/ EC or Replication etc.
> etc.).
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *German Anders
> *Sent:* Tuesday, September 01, 2015 10:39 AM
> *To:* Robert LeBlanc
> *Cc:* ceph-users
> *Subject:* Re: [ceph-users] Accelio & Ceph
>
>
>
> Thanks a lot for the quick response Robert, any idea when it's going to be
> ready for production? any alternative solution for similar-performance?
>
> Best regards,
>
>
> *German *
>
>
>
> 2015-09-01 13:42 GMT-03:00 Robert LeBlanc :
>
> -BEGIN PGP SIGNED MESSAGE-
>
> Hash: SHA256
>
>
>
> Accelio and Ceph are still in heavy development and not ready for production.
>
>
>
> - 
>
> Robert LeBlanc
>
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
>
> On Tue, Sep 1, 2015 at 10:31 AM, German Anders  wrote:
>
> Hi cephers,
>
>
>
>  I would like to know the status for production-ready of Accelio & Ceph, does 
> anyone had a home-made procedure implemented with Ubuntu?
>
>
>
> recommendations, comments?
>
>
>
> Thanks in advance,
>
>
>
> Best regards,
>
>
>
> German
>
>
>

Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread Somnath Roy
Thanks !
6 OSD daemons per server should be good.

Vu,
Could you please send out the doc you are maintaining ?

Regards
Somnath

From: German Anders [mailto:gand...@despegar.com]
Sent: Tuesday, September 01, 2015 11:36 AM
To: Somnath Roy
Cc: Robert LeBlanc; ceph-users
Subject: Re: [ceph-users] Accelio & Ceph

Thanks Roy, we're planning to grow on this cluster if can get the performance 
that we need, the idea is to run non-relational databases here, so it would be 
high-io intensive. We are talking in grow terms of about 40-50 OSD servers with 
no more than 6 OSD daemons per server. If you got some hints or docs out there 
on how to compile ceph with accelio it would be awesome.

German

2015-09-01 15:31 GMT-03:00 Somnath Roy 
mailto:somnath@sandisk.com>>:
Thanks !
I think you should try installing from the ceph mainstream..There are some bug 
fixes went on after Hammer (not sure if it is backported)..
I would say try with 1 drive -> 1 OSD first since presently we have seen some 
stability issues (mainly due to resource constraint) with more OSDs in a box.
The another point is, installation itself is not straight forward. You need to 
build all the components probably, not sure if it is added as git submodule or 
not, Vu , could you please confirm ?

Since we are working to make this solution work at scale, could you please give 
us some idea what is the scale you are looking at for future deployment ?

Regards
Somnath

From: German Anders [mailto:gand...@despegar.com]
Sent: Tuesday, September 01, 2015 11:19 AM
To: Somnath Roy
Cc: Robert LeBlanc; ceph-users

Subject: Re: [ceph-users] Accelio & Ceph

Hi Roy,
   I understand, we are looking for using accelio with an starting small 
cluster of 3 mon and 8 osd servers:
3x MON servers
   2x Intel Xeon E5-2630v3 @2.40Ghz (32C with HT)
   24x 16GB DIMM DDR3 1333Mhz (384GB)
   2x 120GB Intel SSD DC S3500 (RAID-1 for OS)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
4x OSD servers
   2x Intel Xeon E5-2609v2 @2.50Ghz (8C)
   8x 16GB DIMM DDR3 1333Mhz (128GB)
   2x 120GB Intel SSD DC SC3500 (RAID-1 for OS)
   3x 120GB Intel SSD DC SC3500 (Journals)
   4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
   5x 3TB SAS (OSD-SAS-POOL)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
4x OSD servers
   2x Intel Xeon E5-2650v2 @2.60Ghz (32C with HT)
   8x 16GB DIMM DDR3 1866Mhz (128GB)
   2x 200GB Intel SSD DC S3700 (RAID-1 for OS)
   3x 200GB Intel SSD DC S3700 (Journals)
   4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
   5x 3TB SAS (OSD-SAS-POOL)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
and thinking of using infernalis v.9.0.0 or hammer release? comments? 
recommendations?

German

2015-09-01 14:46 GMT-03:00 Somnath Roy 
mailto:somnath@sandisk.com>>:
Hi German,
We are working on to make it production ready ASAP. As you know RDMA is very 
resource constrained and at the same time will outperform TCP as well. There 
will be some definite tradeoff between cost Vs Performance.
We are lacking on ideas on how big the RDMA deployment could be and it will be 
really helpful if you can give some idea on how you are planning to deploy that 
(i.e how many nodes/OSDs/SSD or HDDs/ EC or Replication etc. etc.).

Thanks & Regards
Somnath

From: ceph-users 
[mailto:ceph-users-boun...@lists.ceph.com]
 On Behalf Of German Anders
Sent: Tuesday, September 01, 2015 10:39 AM
To: Robert LeBlanc
Cc: ceph-users
Subject: Re: [ceph-users] Accelio & Ceph

Thanks a lot for the quick response Robert, any idea when it's going to be 
ready for production? any alternative solution for similar-performance?
Best regards,

German

2015-09-01 13:42 GMT-03:00 Robert LeBlanc 
mailto:rob...@leblancnet.us>>:

-BEGIN PGP SIGNED MESSAGE-

Hash: SHA256



Accelio and Ceph are still in heavy development and not ready for production.



- 

Robert LeBlanc

PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1



On Tue, Sep 1, 2015 at 10:31 AM, German Anders  wrote:

Hi cephers,



 I would like to know the status for production-ready of Accelio & Ceph, does 
anyone had a home-made procedure implemented with Ubuntu?



recommendations, comments?



Thanks in advance,



Best regards,



German




Re: [ceph-users] Troubleshooting rgw bucket list

2015-09-01 Thread Sam Wouters
Sorry, forgot to mention:

- yes, filtered by thread
- the "is not valid" line occurred when performing the bucket --check
- when doing a bucket listing, I also get an "is not valid", but on a
different object:
7fe4f1d5b700 20  cls/rgw/cls_rgw.cc:460: entry
abc_econtract/data/6scbrrlo4vttk72melewizj6n3[] is not valid

bilog entry for this object similar to the one below

r, Sam

On 01-09-15 20:30, Sam Wouters wrote:
> Hi,
>
> see inline
>
> On 01-09-15 20:14, Yehuda Sadeh-Weinraub wrote:
>> I assume you filtered the log by thread? I don't see the response
>> messages. For the bucket check you can run radosgw-admin with
>> --log-to-stderr.
> nothing is logged to the console when I do that
>> Can you also set 'debug objclass = 20' on the osds? You can do it by:
>>
>> $ ceph tell osd.\* injectargs --debug-objclass 20
> this continuously prints "20  cls/rgw/cls_rgw.cc:460: entry
> abc_econtract/data/6smuz2ysavvxbygng34tgusyse[] is not valid" on osd.0
>> Also, it'd be interesting to get the following:
>>
>> $ radosgw-admin bi list --bucket=
>> --object=abc_econtract/data/6shflrwbwwcm6dsemrpjit2li3v913iad1EZQ3.S6Prb-NXLvfQRlaWC5nBYp5
> this gives me an empty array:
> [
> ]
> but we did a trim of the bilog a while ago cause a lot entries regarding
> objects that were already removed from the bucket kept on syncing with
> the sync agent, causing a lot of delete_markers at the replication site.
>
> The object in the error above from the osd log, gives the following:
> # radosgw-admin --log-to-stderr -n client.radosgw.be-east-1 bi list
> --bucket=aws-cmis-prod
> --object=abc_econtract/data/6smuz2ysavvxbygng34tgusyse
> [
> {
> "type": "plain",
> "idx": "abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse",
> "entry": {
> "name": "abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse",
> "instance": "",
> "ver": {
> "pool": -1,
> "epoch": 0
> },
> "locator": "",
> "exists": "false",
> "meta": {
> "category": 0,
> "size": 0,
> "mtime": "0.00",
> "etag": "",
> "owner": "",
> "owner_display_name": "",
> "content_type": "",
> "accounted_size": 0
> },
> "tag": "",
> "flags": 8,
> "pending_map": [],
> "versioned_epoch": 0
> }
> },
> {
> "type": "plain",
> "idx":
> "abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse\uv913\uiRQZUR76UdeymR-PGaw6sbCHMCOcaovu",
> "entry": {
> "name": "abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse",
> "instance": "RQZUR76UdeymR-PGaw6sbCHMCOcaovu",
> "ver": {
> "pool": 23,
> "epoch": 9680
> },
> "locator": "",
> "exists": "true",
> "meta": {
> "category": 1,
> "size": 103410,
> "mtime": "2015-08-07 17:57:32.00Z",
> "etag": "6c67f5e6cb4aa63f4fa26a3b94d19d3a",
> "owner": "aws-cmis-prod",
> "owner_display_name": "AWS-CMIS prod user",
> "content_type": "application\/pdf",
> "accounted_size": 103410
> },
> "tag": "be-east.34319.4520377",
> "flags": 3,
> "pending_map": [],
> "versioned_epoch": 2
> }
> },
> {
> "type": "instance",
> "idx":
> "�1000_abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse\uiRQZUR76UdeymR-PGaw6sbCHMCOcaovu",
> "entry": {
> "name": "abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse",
> "instance": "RQZUR76UdeymR-PGaw6sbCHMCOcaovu",
> "ver": {
> "pool": 23,
> "epoch": 9680
> },
> "locator": "",
> "exists": "true",
> "meta": {
> "category": 1,
> "size": 103410,
> "mtime": "2015-08-07 17:57:32.00Z",
> "etag": "6c67f5e6cb4aa63f4fa26a3b94d19d3a",
> "owner": "aws-cmis-prod",
> "owner_display_name": "AWS-CMIS prod user",
> "content_type": "application\/pdf",
> "accounted_size": 103410
> },
> "tag": "be-east.34319.4520377",
> "flags": 3,
> "pending_map": [],
> "versioned_epoch": 2
> }
> },
> {
> "type": "olh",
> "idx": "�1001_abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse",
> "entry": {
> "key": {
> "name": "abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse",
> "instance": "RQZUR76UdeymR-PGaw6sbCHMCOcaovu"
> },
> "delete_marker": "false",
> "epoch": 2,
> 

Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread German Anders
Thanks a lot guys, I'll configure the cluster and send you some feedback
once we test it

Best regards,

*German*

2015-09-01 15:38 GMT-03:00 Somnath Roy :

> Thanks !
>
> 6 OSD daemons per server should be good.
>
>
>
> Vu,
>
> Could you please send out the doc you are maintaining ?
>
>
>
> Regards
>
> Somnath
>
>
>
> *From:* German Anders [mailto:gand...@despegar.com]
> *Sent:* Tuesday, September 01, 2015 11:36 AM
>
> *To:* Somnath Roy
> *Cc:* Robert LeBlanc; ceph-users
> *Subject:* Re: [ceph-users] Accelio & Ceph
>
>
>
> Thanks Roy, we're planning to grow on this cluster if can get the
> performance that we need, the idea is to run non-relational databases here,
> so it would be high-io intensive. We are talking in grow terms of about
> 40-50 OSD servers with no more than 6 OSD daemons per server. If you got
> some hints or docs out there on how to compile ceph with accelio it would
> be awesome.
>
>
> *German*
>
>
>
> 2015-09-01 15:31 GMT-03:00 Somnath Roy :
>
> Thanks !
>
> I think you should try installing from the ceph mainstream..There are some
> bug fixes went on after Hammer (not sure if it is backported)..
>
> I would say try with 1 drive -> 1 OSD first since presently we have seen
> some stability issues (mainly due to resource constraint) with more OSDs in
> a box.
>
> The another point is, installation itself is not straight forward. You
> need to build all the components probably, not sure if it is added as git
> submodule or not, Vu , could you please confirm ?
>
>
>
> Since we are working to make this solution work at scale, could you please
> give us some idea what is the scale you are looking at for future
> deployment ?
>
>
>
> Regards
>
> Somnath
>
>
>
> *From:* German Anders [mailto:gand...@despegar.com]
> *Sent:* Tuesday, September 01, 2015 11:19 AM
> *To:* Somnath Roy
> *Cc:* Robert LeBlanc; ceph-users
>
>
> *Subject:* Re: [ceph-users] Accelio & Ceph
>
>
>
> Hi Roy,
>
>I understand, we are looking for using accelio with an starting small
> cluster of 3 mon and 8 osd servers:
>
> 3x MON servers
>
>2x Intel Xeon E5-2630v3 @2.40Ghz (32C with HT)
>
>24x 16GB DIMM DDR3 1333Mhz (384GB)
>
>2x 120GB Intel SSD DC S3500 (RAID-1 for OS)
>
>1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
>
> 4x OSD servers
>
>2x Intel Xeon E5-2609v2 @2.50Ghz (8C)
>
>8x 16GB DIMM DDR3 1333Mhz (128GB)
>
>2x 120GB Intel SSD DC SC3500 (RAID-1 for OS)
>
>3x 120GB Intel SSD DC SC3500 (Journals)
>
>4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
>
>5x 3TB SAS (OSD-SAS-POOL)
>
>1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
>
> 4x OSD servers
>
>2x Intel Xeon E5-2650v2 @2.60Ghz (32C with HT)
>
>8x 16GB DIMM DDR3 1866Mhz (128GB)
>
>2x 200GB Intel SSD DC S3700 (RAID-1 for OS)
>
>3x 200GB Intel SSD DC S3700 (Journals)
>
>4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
>
>5x 3TB SAS (OSD-SAS-POOL)
>
>1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
>
> and thinking of using *infernalis v.9.0.0* or *hammer* release? comments?
> recommendations?
>
>
> *German*
>
>
>
> 2015-09-01 14:46 GMT-03:00 Somnath Roy :
>
> Hi German,
>
> We are working on to make it production ready ASAP. As you know RDMA is
> very resource constrained and at the same time will outperform TCP as well.
> There will be some definite tradeoff between cost Vs Performance.
>
> We are lacking on ideas on how big the RDMA deployment could be and it
> will be really helpful if you can give some idea on how you are planning to
> deploy that (i.e how many nodes/OSDs/SSD or HDDs/ EC or Replication etc.
> etc.).
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *German Anders
> *Sent:* Tuesday, September 01, 2015 10:39 AM
> *To:* Robert LeBlanc
> *Cc:* ceph-users
> *Subject:* Re: [ceph-users] Accelio & Ceph
>
>
>
> Thanks a lot for the quick response Robert, any idea when it's going to be
> ready for production? any alternative solution for similar-performance?
>
> Best regards,
>
>
> *German *
>
>
>
> 2015-09-01 13:42 GMT-03:00 Robert LeBlanc :
>
> -BEGIN PGP SIGNED MESSAGE-
>
> Hash: SHA256
>
>
>
> Accelio and Ceph are still in heavy development and not ready for production.
>
>
>
> - 
>
> Robert LeBlanc
>
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
>
> On Tue, Sep 1, 2015 at 10:31 AM, German Anders  wrote:
>
> Hi cephers,
>
>
>
>  I would like to know the status for production-ready of Accelio & Ceph, does 
> anyone had a home-made procedure implemented with Ubuntu?
>
>
>
> recommendations, comments?
>
>
>
> Thanks in advance,
>
>
>
> Best regards,
>
>
>
> German
>
>
>

[ceph-users] cephfs read-only setting doesn't work?

2015-09-01 Thread Erming Pei

Hi,

  I tried to set up read-only permission for a client, but the mount always
looks writable.


  I did the following:

==Server end==

[client.cephfs_data_ro]
key = AQxx==
caps mon = "allow r"
caps osd = "allow r pool=cephfs_data, allow r pool=cephfs_metadata"


==Client end==
mount -v -t ceph hostname.domainname:6789:/ /cephfs -o 
name=cephfs_data_ro,secret=AQxx==


But I can still touch, delete, and overwrite.

I read that touch/delete could be metadata-only operations, but why can I
still overwrite?


Is there any way I could test/check the data pool (instead of the metadata)
to see whether writes have any effect on it?
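
(One way I could think of checking, as a sketch: write through the mount, force
the client cache to drop, then look at the data pool from a node with admin
access. The mount point and pool name are the ones above.)

# on the client
dd if=/dev/urandom of=/cephfs/ro-test bs=1M count=4
sync; echo 3 > /proc/sys/vm/drop_caches
md5sum /cephfs/ro-test                # re-read after the cache drop

# on an admin node: did the data pool usage/object count actually change?
ceph df
rados -p cephfs_data ls | wc -l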



Erming




--
-
 Erming Pei, Ph.D
 Senior System Analyst; Grid/Cloud Specialist

 Research Computing Group
 Information Services & Technology
 University of Alberta, Canada

 Tel: +1 7804929914Fax: +1 7804921729
-



Re: [ceph-users] Ceph SSD CPU Frequency Benchmarks

2015-09-01 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Nick,

I've been trying to replicate your results without success. Can you
help me understand what I'm doing that is not the same as your test?

My setup is two boxes, one is a client and the other is a server. The
server has Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz, 32 GB RAM and 2
Intel S3500 240 GB SSD drives. The boxes have Infiniband FDR cards
connected to a QDR switch using IPoIB. I set up OSDs on the 2 SSDs and
set pool size=1. I mapped a 200GB RBD using the kernel module and ran fio
on the RBD. I adjusted the number of cores, clock speed and C-states
of the server and here are my results:

Adjusted core number and set the processor to a set frequency using
the userspace governor.

8 jobs, 8 depth                         Cores
                      1    2     3     4     5     6     7     8
Frequency  2.4 GHz  387  762  1121  1432  1657  1900  2092  2260
           2.0 GHz  386  758  1126  1428  1657  1890  2090  2232
           1.6 GHz  382  756  1127  1428  1656  1894  2083  2201
           1.2 GHz  385  756  1125  1431  1656  1885  2093  2244

I then adjusted the processor to not go into a deeper sleep state than
C1 and also tested setting the highest CPU frequency with the ondemand
governor.

1 job, 1 depth

Cores = 1
                    <=C1, freq range  C0-C6, freq range  C0-C6, static freq  <=C1, static freq
Frequency  2.4 GHz  381               381                379                 381
           2.0 GHz  382               380                381                 381
           1.6 GHz  380               381                379                 382
           1.2 GHz  383               378                379                 383

Cores = 8
                    <=C1, freq range  C0-C6, freq range  C0-C6, static freq  <=C1, static freq
Frequency  2.4 GHz  629               580                584                 629
           2.0 GHz  630               579                584                 634
           1.6 GHz  630               579                584                 634
           1.2 GHz  632               581                582                 634

Here I see a correlation between core count and C-states, but not frequency.

Frequency was controlled with:
cpupower frequency-set -d 1.2GHz -u 1.2GHz -g userspace
and
cpupower frequency-set -d 1.2GHz -u 2.0GHz -g ondemand

Core count adjusted by:
for i in {1..7}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
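
(and, presumably, brought back online after testing with:)
for i in {1..7}; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done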

C-states controlled by:
# python
Python 2.7.5 (default, Jun 24 2015, 00:41:19)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> fd = open('/dev/cpu_dma_latency','wb')
>>> fd.write('1')
>>> fd.flush()
>>> fd.close()  # don't run this until the tests are completed (the handle has to stay open)
>>>
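
(To double-check that the frequency and C-state settings actually took effect,
something like the following should work:)
cpupower frequency-info      # reports governor and current frequency
cpupower monitor             # shows C-state residency while fio is running
grep MHz /proc/cpuinfo       # quick per-core clock sanity check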

I'd like to replicate your results. I'd also like it if you could verify
some of my results around C-states and core counts in your setup.

Thanks,




Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Sat, Jun 13, 2015 at 8:58 AM, Nick Fisk  wrote:

> Hi All,
>
> I know there has been lots of discussions around needing fast CPU's to get
> the most out of SSD's. However I have never really ever seen an solid
> numbers to make a comparison about how much difference a faster CPU makes
> and if Ceph scales linearly with clockspeed. So I did a little experiment
> today.
>
> I setup a 1 OSD Ceph instance on a Desktop PC. The Desktop has a i5
> Sandbybridge CPU with the CPU turbo overclocked to 4.3ghz. By using the
> userspace governor in Linux, I was able to set static clock speeds to see
> the possible performance effects on Ceph. My pc only has an old X25M-G2
> SSD,
> so I had to limit the IO testing to 4kb QD=1, as otherwise the SSD ran out
> of puff when I got to the higher clock speeds.
>
> CPU Mhz 4Kb Write IOMin Latency (us)Avg Latency (us)CPU
> usr CPU sys
> 1600797 886 1250
> 10.14   2.35
> 2000815 746 1222
> 8.451.82
> 24001161630 857
> 9.5 1.6
> 2800  

Re: [ceph-users] Ceph SSD CPU Frequency Benchmarks

2015-09-01 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Robert LeBlanc
> Sent: 01 September 2015 21:48
> To: Nick Fisk 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph SSD CPU Frequency Benchmarks
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> Nick,
> 
> I've been trying to replicate your results without success. Can you help me
> understand what I'm doing that is not the same as your test?
> 
> My setup is two boxes, one is a client and the other is a server. The server
> has Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz, 32 GB RAM and 2 Intel S3500
> 240 GB SSD drives. The boxes have Infiniband FDR cards connected to a QDR
> switch using IPoIB. I set up OSDs on the 2 SSDs and set pool size=1. I mapped
> a 200GB RBD using the kernel module ran fio on the RBD. I adjusted the
> number of cores, clock speed and C-states of the server and here are my
> results:
> 
> Adjusted core number and set the processor to a set frequency using the
> userspace governor.
> 
> 8 jobs 8 depth
>                    Cores:    1    2     3     4     5     6     7     8
> Frequency (GHz)    2.4      387  762  1121  1432  1657  1900  2092  2260
>                    2.0      386  758  1126  1428  1657  1890  2090  2232
>                    1.6      382  756  1127  1428  1656  1894  2083  2201
>                    1.2      385  756  1125  1431  1656  1885  2093  2244
> 

I tested at QD=1 as this tends to highlight the difference in clock speed, 
whereas a higher queue depth will probably scale with both frequency and cores. 
I'm not sure this is your problem, but to make sure your environment is doing 
what you want, I would suggest starting with QD=1 and 1 job.
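
Something along these lines is what I have in mind (the device path and runtime
are just examples; point it at wherever your RBD is mapped):

fio --name=qd1test --filename=/dev/rbd0 --ioengine=libaio --direct=1 --sync=1 \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --time_based --runtime=60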

But thank you for sharing these results regardless of your current frequency 
scaling issues. Information like this is really useful for people trying to 
decide on hardware purchases. Those Atom boards look like they could support 
12x normal HDDs quite happily, assuming 80 IOPs x 12.

I wonder if we can get enough data from various people to generate an IOPs vs. 
CPU frequency comparison for various CPU architectures? 


> I then adjusted the processor so it would not go into a deeper sleep state than C1 and
> also tested setting the highest CPU frequency with the ondemand governor.
> 
> 1 job 1 depth
> 
> Cores = 1
> Frequency (GHz)   <=C1, freq range   C0-C6, freq range   C0-C6, static freq   <=C1, static freq
>       2.4                381                381                  379                 381
>       2.0                382                380                  381                 381
>       1.6                380                381                  379                 382
>       1.2                383                378                  379                 383
> 
> Cores = 8
> Frequency (GHz)   <=C1, freq range   C0-C6, freq range   C0-C6, static freq   <=C1, static freq
>       2.4                629                580                  584                 629
>       2.0                630                579                  584                 634
>       1.6                630                579                  584                 634
>       1.2                632                581                  582                 634
> 
> Here I see a correlation between # of cores and C-states, but not frequency.
> 
> Frequency was controlled with:
> cpupower frequency-set -d 1.2GHz -u 1.2GHz -g userspace
> and
> cpupower frequency-set -d 1.2GHz -u 2.0GHz -g ondemand
> 
> Core count adjusted by:
> for i in {1..7}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
> 
> C-states controlled by:
> # python
> Python 2.7.5 (default, Jun 24 2015, 00:41:19)
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> fd = open('/dev/cpu_dma_latency','wb')
> >>> fd.write('1')
> >>> fd.flush()
> >>> fd.close()  # don't run this until the tests are completed (the handle has to stay open)
> >>>
> 
> I'd like to replicate your results. I'd also like it if you could verify some of
> mine around C-states and cores in your set-up.

I can't remember exactly, but I think I had to do something to get the 
userspace governor to behave as I expected it to. I tend to recall setting the 
frequency low and yet still seeing it bursting up to max. I will have a look 
through my notes tomorrow and see if I can recall anything. One thing I do 
remember though is that the Intel powertop utility was very useful in 
confirming what the actual CPU frequency was. It might be worth installing and 
running this and seeing what the CPU cores are doing.
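
Something like the following is usually enough to catch a governor that isn't
doing what you asked (exact tool availability depends on your distro, so treat
these as suggestions):

powertop                                       # the frequency stats view shows time spent at each clock
grep MHz /proc/cpuinfo                         # instantaneous per-core clock
cpupower frequency-info | grep "current CPU"   # what the governor reports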


> 
> Thanks,
> 

Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread Vu Pham
Hi German,

You can try this small wiki to setup ceph/accelio

https://community.mellanox.com/docs/DOC-2141
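
The short version, from memory (treat the exact configure flag and option name
as assumptions and follow the wiki for the authoritative steps), is to build
Ceph with XioMessenger support and then switch the messenger type on the
daemons and clients:

./autogen.sh && ./configure --enable-xio && make
# then in ceph.conf:
#   [global]
#   ms_type = xio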

thanks,
-vu


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of German 
Anders
Sent: Tuesday, September 01, 2015 12:00 PM
To: Somnath Roy
Cc: ceph-users
Subject: Re: [ceph-users] Accelio & Ceph

Thanks a lot guys, I'll configure the cluster and send you some feedback once 
we test it
Best regards,

German

2015-09-01 15:38 GMT-03:00 Somnath Roy <somnath@sandisk.com>:
Thanks !
6 OSD daemons per server should be good.

Vu,
Could you please send out the doc you are maintaining ?

Regards
Somnath

From: German Anders [mailto:gand...@despegar.com]
Sent: Tuesday, September 01, 2015 11:36 AM

To: Somnath Roy
Cc: Robert LeBlanc; ceph-users
Subject: Re: [ceph-users] Accelio & Ceph

Thanks Roy, we're planning to grow this cluster if we can get the performance 
that we need. The idea is to run non-relational databases here, so it would be 
highly IO-intensive. We are talking about growing to roughly 40-50 OSD servers with 
no more than 6 OSD daemons per server. If you have some hints or docs out there 
on how to compile Ceph with Accelio, that would be awesome.

German

2015-09-01 15:31 GMT-03:00 Somnath Roy <somnath@sandisk.com>:
Thanks !
I think you should try installing from the Ceph mainline; there are some bug 
fixes that went in after Hammer (not sure if they have been backported).
I would say try with 1 drive -> 1 OSD first, since presently we have seen some 
stability issues (mainly due to resource constraints) with more OSDs in a box.
Another point is that the installation itself is not straightforward. You probably 
need to build all the components; I am not sure whether it is added as a git 
submodule or not. Vu, could you please confirm?

Since we are working to make this solution work at scale, could you please give 
us some idea what is the scale you are looking at for future deployment ?

Regards
Somnath

From: German Anders [mailto:gand...@despegar.com]
Sent: Tuesday, September 01, 2015 11:19 AM
To: Somnath Roy
Cc: Robert LeBlanc; ceph-users

Subject: Re: [ceph-users] Accelio & Ceph

Hi Roy,
   I understand. We are looking at using Accelio with a small starting 
cluster of 3 mon and 8 OSD servers:
3x MON servers
   2x Intel Xeon E5-2630v3 @2.40Ghz (32C with HT)
   24x 16GB DIMM DDR3 1333Mhz (384GB)
   2x 120GB Intel SSD DC S3500 (RAID-1 for OS)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
4x OSD servers
   2x Intel Xeon E5-2609v2 @2.50Ghz (8C)
   8x 16GB DIMM DDR3 1333Mhz (128GB)
   2x 120GB Intel SSD DC SC3500 (RAID-1 for OS)
   3x 120GB Intel SSD DC SC3500 (Journals)
   4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
   5x 3TB SAS (OSD-SAS-POOL)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
4x OSD servers
   2x Intel Xeon E5-2650v2 @2.60Ghz (32C with HT)
   8x 16GB DIMM DDR3 1866Mhz (128GB)
   2x 200GB Intel SSD DC S3700 (RAID-1 for OS)
   3x 200GB Intel SSD DC S3700 (Journals)
   4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
   5x 3TB SAS (OSD-SAS-POOL)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
and we are thinking of using the Infernalis v9.0.0 or Hammer release. Comments? 
Recommendations?

German

2015-09-01 14:46 GMT-03:00 Somnath Roy <somnath@sandisk.com>:
Hi German,
We are working to make it production ready ASAP. As you know, RDMA is very 
resource constrained, but at the same time it will outperform TCP. There 
will be some definite tradeoff between cost vs. performance.
We are lacking ideas on how big RDMA deployments could be, and it would be 
really helpful if you can give us some idea of how you are planning to deploy 
(i.e. how many nodes/OSDs/SSDs or HDDs, EC or replication, etc.).

Thanks & Regards
Somnath

From: ceph-users 
[mailto:ceph-users-boun...@lists.ceph.com]
 On Behalf Of German Anders
Sent: Tuesday, September 01, 2015 10:39 AM
To: Robert LeBlanc
Cc: ceph-users
Subject: Re: [ceph-users] Accelio & Ceph

Thanks a lot for the quick response Robert, any idea when it's going to be 
ready for production? any alternative solution for similar-performance?
Best regards,

German

2015-09-01 13:42 GMT-03:00 Robert LeBlanc <rob...@leblancnet.us>:

-BEGIN PGP SIGNED MESSAGE-

Hash: SHA256



Accelio and Ceph are still in heavy development and not ready for production.



- 

Robert LeBlanc

PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1



On Tue, Sep 1, 2015 at 10:31 AM, German Anders  wrote:

Hi cephers,



 I would like to know the production-readiness status of Accelio & Ceph. Does 
anyone have a home-made procedure implemented on Ubuntu?



recommendations, comments?



Thanks in advance,



Best regards,



German



___

ceph-users mailing list

ceph-users@lists.ceph.com

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






Re: [ceph-users] Moving/Sharding RGW Bucket Index

2015-09-01 Thread Ben Hines
We also run RGW buckets with many millions of objects and had to shard
our existing buckets. We did have to delete the old ones first,
unfortunately.

I haven't tried moving the index pool to an SSD ruleset - would also
be interested in folks' experiences with this.

Thanks for the information on split multiple + merge threshold. I
assume that increasing those is relatively safe to do on a running
cluster? According to this Red Hat issue, it may impact
scrub/recovery performance:
https://bugzilla.redhat.com/show_bug.cgi?id=1219974
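
For anyone else following along, these are plain ceph.conf settings on the OSD
nodes (the values below are only an example, not a recommendation):

[osd]
filestore merge threshold = 40
filestore split multiple = 8

Per the filestore config reference, a subdirectory splits at roughly
filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects, so the
example above works out to about 8 * 40 * 16 = 5120 objects per subdirectory.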

-Ben

On Tue, Sep 1, 2015 at 9:31 AM, Wang, Warren
 wrote:
> I added sharding to our busiest RGW sites, but it will not shard existing 
> bucket indexes, only applies to new buckets. Even with that change, I'm still 
> considering moving the index pool to SSD. The main factor being the rate of 
> writes. We are looking at a project that will have extremely high writes/sec 
> through the RGWs.
>
> The other thing worth noting is that at that scale, you also need to change 
> filestore merge threshold and filestore split multiple to something 
> considerably larger. Props to Michael Kidd @ RH for that tip. There's a 
> mathematical formula on the filestore config reference.
>
> Warren
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Daniel Maraio
> Sent: Tuesday, September 01, 2015 10:40 AM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Moving/Sharding RGW Bucket Index
>
> Hello,
>
>I have two large buckets in my RGW and I think the performance is being 
> impacted by the bucket index. One bucket contains 9 million objects and the 
> other one has 22 million. I'd like to shard the bucket index and also change 
> the ruleset of the .rgw.buckets.index pool to put it on our SSD root. I could 
> not find any documentation on this issue. It looks like the bucket indexes 
> can be rebuilt using the radosgw-admin bucket check command but I'm not sure 
> how to proceed. We can stop writes or take the cluster down completely if 
> necessary. My initial thought was to backup the existing index pool and 
> create a new one. I'm not sure if I can change the index_pool of an existing 
> bucket. If that is possible I assume I can change that to my new pool and 
> execute a radosgw-admin bucket check command to rebuild/shard the index.
>
>Does anyone have experience in getting sharding running with an existing 
> bucket, or even moving the index pool to a different ruleset?
> When I change the crush ruleset for the .rgw.buckets.index pool to my SSD 
> root we run into issues, buckets cannot be created or listed, writes cease to 
> work, reads seem to work fine though. Thanks for your time!
>
> - Daniel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] librados application consultant needed

2015-09-01 Thread John Onusko
We have an application built on top of librados that has barely acceptable 
performance and is in need of optimization. Since the code is functionally 
correct, we have a hard time freeing up the resources to fully investigate 
where the bottlenecks occur and fix them. We would like to hire a consultant 
who could look at the application design and how it was implemented using 
librados. The consultant should have a good understanding of how Ceph internals 
work and how the various librados API calls translate into IOPS. The consultant 
could also be hired to implement the recommended fixes.

Would anyone on this mailing list be interested in handling a job like this, or 
know of someone who would be?

We are located in Redwood City, CA. It may be possible to work remotely. We 
would like to start working on the analysis and application optimizations ASAP.

Thanks.


John Onusko
Director of Research and Architecture
Actiance, Inc.

530.903.0309 (mobile)


Follow us:
www.facebook.com/actiance
www.linkedin.com/company/actiance-inc
www.twitter.com/actiance

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados: Undefined symbol error

2015-09-01 Thread Brad Hubbard
- Original Message -
> From: "Aakanksha Pudipeddi-SSI" 
> To: "Brad Hubbard" 
> Sent: Wednesday, 2 September, 2015 6:25:49 AM
> Subject: RE: [ceph-users] Rados: Undefined symbol error
> 
> Hello Brad,
> 
> I wanted to clarify the "make install" part of building a cluster. I finished
> building the source (have not done "make install" yet) and now when I type
> in "rados", I get this:
> 
> $rados
> 2015-09-01 13:12:25.061939 7f5370f35840 -1 did not load config file, using
> default settings.
> rados: you must give an action. Try --help
> 
> When I built ceph from source a couple of months ago(giant), I found that
> sudo make install does not deploy ceph binaries onto the system and hence,
> went through the process of building packages via dpkg and then deploying
> the cluster with ceph-deploy. I am not sure as to what make install does
> here. Could you elaborate on that?
> 
> I actually tried "make install" yesterday and when I typed "rados", I got
> something like this:
> 
> /usr/local/bin/rados: librados.so.2: cannot open shared object file
> 
> But I had to clone the source again because of some other issues and I am
> currently at the stage I mentioned in the beginning. Now I am not sure if I
> should "make install" or go through the process of building ceph packages
> from source and deploying the cluster with ceph-deploy. Any pointers on this
> would be very helpful! Thanks a lot again for your continued help :)

Note that the idea here was not to go into production with this but merely as a
test; that's why I suggested standing up a new VM to do it.

So let's try some things in the build directory then.

After the build the rados binary should end up in ./src/.libs/rados if deb
systems are the same as Fedora in that regard. If not you will need to find the
rados binary that gets built when you run "make". Once you have that run the
following on it.

$ strings ./src/.libs/rados|grep "^ceph version" -A5
$ eu-unstrip -n -e  ./src/.libs/rados
$ nm --dynamic ./src/.libs/rados|grep Mutex
$ ./src/.libs/rados -v

The last command may not work unless you have the correct libraries in place on
the target system but please include all output.

Then you can do your normal packaging and install and run the same commands
substituting "$(which rados)" for ./src/.libs/rados.

It is very important that you include all output and, if any of the tools are
missing, you may need to install the equivalent of the elfutils package (for
eu-unstrip, although I guess in a pinch you could just use "strip" from the
binutils package; I just prefer the elfutils versions).
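
Two other quick checks that might save a round trip (just suggestions): demangle
the missing symbol so we can see exactly which Mutex constructor is expected,
and confirm which librados.so.2 the installed binary actually resolves to at
runtime:

$ echo _ZN5MutexC1ERKSsbbbP11CephContext | c++filt
$ ldd $(which rados) | grep librados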

> 
> Aakanksha
> 
> 
> 
> -Original Message-
> From: Brad Hubbard [mailto:bhubb...@redhat.com]
> Sent: Monday, August 31, 2015 3:47 PM
> To: Aakanksha Pudipeddi-SSI
> Cc: ceph-users
> Subject: Re: [ceph-users] Rados: Undefined symbol error
> 
> 
> 
> - Original Message -
> > From: "Brad Hubbard" 
> > To: "Aakanksha Pudipeddi-SSI" 
> > Cc: "ceph-users" 
> > Sent: Tuesday, 1 September, 2015 8:36:33 AM
> > Subject: Re: [ceph-users] Rados: Undefined symbol error
> > 
> > - Original Message -
> > > From: "Aakanksha Pudipeddi-SSI" 
> > > To: "Brad Hubbard" 
> > > Cc: "ceph-users" 
> > > Sent: Tuesday, 1 September, 2015 7:58:33 AM
> > > Subject: RE: [ceph-users] Rados: Undefined symbol error
> > > 
> > > Brad,
> > > 
> > > Yes, you are right. Sorry about that! This is what I get when I try
> > > with the back ticks:
> > > $ `which rados` -v
> > > /usr/bin/rados: symbol lookup error: /usr/bin/rados: undefined symbol:
> > > _ZN5MutexC1ERKSsbbbP11CephContext
> > > $ strings `which rados`|grep "^ceph version"
> > > $
> > > $ strings $(which rados)|grep "^ceph version" -A5
> > > $
> > > 
> > > The latest command returns no results too.
> > 
> > Here's what you should get.
> > 
> > # strings $(which rados)|grep "^ceph version" -A5
> > ceph version e4bfad3a3c51054df7e537a724c8d0bf9be972ff
> 
> Except you should see be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0 since that is
> v9.0.2. Your rados binary just isn't behaving like anything I've seen
> before.
> 
> How about you stand up a fresh VM and run "./autogen.sh && ./configure &&
> make install" on v9.0.2 and see if you get similar output to what I'm
> getting then try working back from there?
> 
> > ConfLine(key = '
> > ', val='
> > ', newsection='
> >  = "
> > 
> > > 
> > > Thanks,
> > > Aakanksha
> > > 
> > > -Original Message-
> > > From: Brad Hubbard [mailto:bhubb...@redhat.com]
> > > Sent: Monday, August 31, 2015 2:49 PM
> > > To: Aakanksha Pudipeddi-SSI
> > > Cc: ceph-users
> > > Subject: Re: [ceph-users] Rados: Undefined symbol error
> > > 
> > > - Original Message -
> > > > From: "Aakanksha Pudipeddi-SSI" 
> > > > To: "Brad Hubbard" 
> > > > Cc: ceph-us...@ceph.com
> > > > Sent: Tuesday, 1 September, 2015 7:27:04 AM
> > > > Subject: RE: [ceph-users] Rados: Undefined symbol error
> > > > 
> > > > Hello Brad,
> > > > 
> > > > When I type "which rados" it returns /usr/bin/ra

[ceph-users] libvirt rbd issue

2015-09-01 Thread Rafael Lopez
Hi ceph-users,

Hoping to get some help with a tricky problem. I have a rhel7.1 VM guest
(host machine also rhel7.1) with root disk presented from ceph 0.94.2-0
(rbd) using libvirt.

The VM also has a second rbd for storage presented from the same ceph
cluster, also using libvirt.

The VM boots fine, no apparent issues with the OS root rbd. I am able to
mount the storage disk in the VM, and create a file system. I can even
transfer small files to it. But when I try to transfer a moderate-size
file, e.g. greater than 1GB, it seems to slow to a grinding halt,
eventually locks up the whole system, and generates the kernel messages
below.

I have googled some *similar* issues, but haven't come across any
solid advice or fix. So far I have tried modifying the libvirt disk cache
settings, using the latest mainline kernel (4.2+), and different file
systems (ext4, xfs, zfs); all produce similar results. I suspect it may be
network related, as when I was using the mainline kernel I was transferring
some files to the storage disk and this message came up, and the transfer
seemed to stop at the same time:

Sep  1 15:31:22 nas1-rds NetworkManager[724]:  [1441085482.078646]
[platform/nm-linux-platform.c:2133] sysctl_set(): sysctl: failed to set
'/proc/sys/net/ipv6/conf/eth0/mtu' to '9000': (22) Invalid argument

I think maybe the key piece of troubleshooting information is that it seems to
be OK for files under 1GB.

Any ideas would be appreciated.

Cheers,
Raf


Sep  1 16:04:15 nas1-rds kernel: INFO: task kworker/u8:1:60 blocked for
more than 120 seconds.
Sep  1 16:04:15 nas1-rds kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  1 16:04:15 nas1-rds kernel: kworker/u8:1D 88023fd93680 0
 60  2 0x
Sep  1 16:04:15 nas1-rds kernel: Workqueue: writeback bdi_writeback_workfn
(flush-252:80)
Sep  1 16:04:15 nas1-rds kernel: 880230c136b0 0046
8802313c4440 880230c13fd8
Sep  1 16:04:15 nas1-rds kernel: 880230c13fd8 880230c13fd8
8802313c4440 88023fd93f48
Sep  1 16:04:15 nas1-rds kernel: 880230c137b0 880230fbcb08
e8d80ec0 88022e827590
Sep  1 16:04:15 nas1-rds kernel: Call Trace:
Sep  1 16:04:15 nas1-rds kernel: [] io_schedule+0x9d/0x130
Sep  1 16:04:15 nas1-rds kernel: [] bt_get+0x10f/0x1a0
Sep  1 16:04:15 nas1-rds kernel: [] ?
wake_up_bit+0x30/0x30
Sep  1 16:04:15 nas1-rds kernel: []
blk_mq_get_tag+0xbf/0xf0
Sep  1 16:04:15 nas1-rds kernel: []
__blk_mq_alloc_request+0x1b/0x1f0
Sep  1 16:04:15 nas1-rds kernel: []
blk_mq_map_request+0x181/0x1e0
Sep  1 16:04:15 nas1-rds kernel: []
blk_sq_make_request+0x9a/0x380
Sep  1 16:04:15 nas1-rds kernel: [] ?
generic_make_request_checks+0x24f/0x380
Sep  1 16:04:15 nas1-rds kernel: []
generic_make_request+0xe2/0x130
Sep  1 16:04:15 nas1-rds kernel: [] submit_bio+0x71/0x150
Sep  1 16:04:15 nas1-rds kernel: []
ext4_io_submit+0x25/0x50 [ext4]
Sep  1 16:04:15 nas1-rds kernel: []
ext4_bio_write_page+0x159/0x2e0 [ext4]
Sep  1 16:04:15 nas1-rds kernel: []
mpage_submit_page+0x5d/0x80 [ext4]
Sep  1 16:04:15 nas1-rds kernel: []
mpage_map_and_submit_buffers+0x172/0x2a0 [ext4]
Sep  1 16:04:15 nas1-rds kernel: []
ext4_writepages+0x733/0xd60 [ext4]
Sep  1 16:04:15 nas1-rds kernel: []
do_writepages+0x1e/0x40
Sep  1 16:04:15 nas1-rds kernel: []
__writeback_single_inode+0x40/0x220
Sep  1 16:04:15 nas1-rds kernel: []
writeback_sb_inodes+0x25e/0x420
Sep  1 16:04:15 nas1-rds kernel: []
__writeback_inodes_wb+0x9f/0xd0
Sep  1 16:04:15 nas1-rds kernel: []
wb_writeback+0x263/0x2f0
Sep  1 16:04:15 nas1-rds kernel: []
bdi_writeback_workfn+0x1cc/0x460
Sep  1 16:04:15 nas1-rds kernel: []
process_one_work+0x17b/0x470
Sep  1 16:04:15 nas1-rds kernel: []
worker_thread+0x11b/0x400
Sep  1 16:04:15 nas1-rds kernel: [] ?
rescuer_thread+0x400/0x400
Sep  1 16:04:15 nas1-rds kernel: [] kthread+0xcf/0xe0
Sep  1 16:04:15 nas1-rds kernel: [] ?
kthread_create_on_node+0x140/0x140
Sep  1 16:04:15 nas1-rds kernel: []
ret_from_fork+0x7c/0xb0
Sep  1 16:04:15 nas1-rds kernel: [] ?
kthread_create_on_node+0x140/0x140
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread German Anders
Hi Vu,
   Thanks a lot for the link

Best regards,

*German*

2015-09-01 19:02 GMT-03:00 Vu Pham :

> Hi German,
>
>
>
> You can try this small wiki to setup ceph/accelio
>
>
>
> https://community.mellanox.com/docs/DOC-2141
>
>
>
> thanks,
>
> -vu
>
>
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *German Anders
> *Sent:* Tuesday, September 01, 2015 12:00 PM
> *To:* Somnath Roy
>
> *Cc:* ceph-users
> *Subject:* Re: [ceph-users] Accelio & Ceph
>
>
>
> Thanks a lot guys, I'll configure the cluster and send you some feedback
> once we test it
>
> Best regards,
>
>
> *German*
>
>
>
> 2015-09-01 15:38 GMT-03:00 Somnath Roy :
>
> Thanks !
>
> 6 OSD daemons per server should be good.
>
>
>
> Vu,
>
> Could you please send out the doc you are maintaining ?
>
>
>
> Regards
>
> Somnath
>
>
>
> *From:* German Anders [mailto:gand...@despegar.com]
> *Sent:* Tuesday, September 01, 2015 11:36 AM
>
>
> *To:* Somnath Roy
> *Cc:* Robert LeBlanc; ceph-users
> *Subject:* Re: [ceph-users] Accelio & Ceph
>
>
>
> Thanks Roy, we're planning to grow on this cluster if can get the
> performance that we need, the idea is to run non-relational databases here,
> so it would be high-io intensive. We are talking in grow terms of about
> 40-50 OSD servers with no more than 6 OSD daemons per server. If you got
> some hints or docs out there on how to compile ceph with accelio it would
> be awesome.
>
>
> *German*
>
>
>
> 2015-09-01 15:31 GMT-03:00 Somnath Roy :
>
> Thanks !
>
> I think you should try installing from the ceph mainstream..There are some
> bug fixes went on after Hammer (not sure if it is backported)..
>
> I would say try with 1 drive -> 1 OSD first since presently we have seen
> some stability issues (mainly due to resource constraint) with more OSDs in
> a box.
>
> The another point is, installation itself is not straight forward. You
> need to build all the components probably, not sure if it is added as git
> submodule or not, Vu , could you please confirm ?
>
>
>
> Since we are working to make this solution work at scale, could you please
> give us some idea what is the scale you are looking at for future
> deployment ?
>
>
>
> Regards
>
> Somnath
>
>
>
> *From:* German Anders [mailto:gand...@despegar.com]
> *Sent:* Tuesday, September 01, 2015 11:19 AM
> *To:* Somnath Roy
> *Cc:* Robert LeBlanc; ceph-users
>
>
> *Subject:* Re: [ceph-users] Accelio & Ceph
>
>
>
> Hi Roy,
>
>I understand, we are looking for using accelio with an starting small
> cluster of 3 mon and 8 osd servers:
>
> 3x MON servers
>
>2x Intel Xeon E5-2630v3 @2.40Ghz (32C with HT)
>
>24x 16GB DIMM DDR3 1333Mhz (384GB)
>
>2x 120GB Intel SSD DC S3500 (RAID-1 for OS)
>
>1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
>
> 4x OSD servers
>
>2x Intel Xeon E5-2609v2 @2.50Ghz (8C)
>
>8x 16GB DIMM DDR3 1333Mhz (128GB)
>
>2x 120GB Intel SSD DC SC3500 (RAID-1 for OS)
>
>3x 120GB Intel SSD DC SC3500 (Journals)
>
>4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
>
>5x 3TB SAS (OSD-SAS-POOL)
>
>1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
>
> 4x OSD servers
>
>2x Intel Xeon E5-2650v2 @2.60Ghz (32C with HT)
>
>8x 16GB DIMM DDR3 1866Mhz (128GB)
>
>2x 200GB Intel SSD DC S3700 (RAID-1 for OS)
>
>3x 200GB Intel SSD DC S3700 (Journals)
>
>4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
>
>5x 3TB SAS (OSD-SAS-POOL)
>
>1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
>
> and thinking of using *infernalis v.9.0.0* or *hammer* release? comments?
> recommendations?
>
>
> *German*
>
>
>
> 2015-09-01 14:46 GMT-03:00 Somnath Roy :
>
> Hi German,
>
> We are working on to make it production ready ASAP. As you know RDMA is
> very resource constrained and at the same time will outperform TCP as well.
> There will be some definite tradeoff between cost Vs Performance.
>
> We are lacking on ideas on how big the RDMA deployment could be and it
> will be really helpful if you can give some idea on how you are planning to
> deploy that (i.e how many nodes/OSDs/SSD or HDDs/ EC or Replication etc.
> etc.).
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *German Anders
> *Sent:* Tuesday, September 01, 2015 10:39 AM
> *To:* Robert LeBlanc
> *Cc:* ceph-users
> *Subject:* Re: [ceph-users] Accelio & Ceph
>
>
>
> Thanks a lot for the quick response Robert, any idea when it's going to be
> ready for production? any alternative solution for similar-performance?
>
> Best regards,
>
>
> *German *
>
>
>
> 2015-09-01 13:42 GMT-03:00 Robert LeBlanc :
>
> -BEGIN PGP SIGNED MESSAGE-
>
> Hash: SHA256
>
>
>
> Accelio and Ceph are still in heavy development and not ready for production.
>
>
>
> - 
>
> Robert LeBlanc
>
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
>
> On Tue, Sep 1, 2015 at 10:31 AM, German Anders  wrote:
>
> Hi cephers,
>
>
>
>  I would like to know the status for production-ready of Accelio & Cep

[ceph-users] How to add a slave zone to rgw

2015-09-01 Thread 周炳华
Hi, ceph users:

I have a ceph cluster for rgw service in production, which was set up
according to the simple configuration tutorial, with only one default
region and one default zone. Even worse, I enabled neither the meta
logging nor the data logging in the master zone.
Now I want to add a slave zone to the rgw for disaster recovery. How can I
do this while affecting the production service as little as possible? The
size of the data in the master zone is 10TB.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Performance Questions with rbd images access by qemu-kvm

2015-09-01 Thread Christian Balzer

Hello,

On Tue, 1 Sep 2015 11:50:07 -0500 Kenneth Van Alstyne wrote:

> Got it — I’ll keep that in mind. That may just be what I need to “get
> by” for now.  Ultimately, we’re looking to buy at least three nodes of
> servers that can hold 40+ OSDs backed by 2TB+ SATA disks,
> 
As mentioned, pick decent SSDs, if only so that they can perhaps be used later
on in a cache pool.
I'd go for DC S3610s in your case. 

Also keep in mind that with durable, reliable SSDs you can go for a
replication of 2 instead of 3, thus both improving your storage space AND
performance.
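
If you do go down that route it is just a per-pool setting, e.g. (pool name is
only an example):

ceph osd pool set rbd size 2
ceph osd pool get rbd min_size   # check that min_size still makes sense afterwards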

Research the archives; there are plenty of discussions and recommendations
for storage node configurations.
You will be infinitely happier with more, smaller nodes in the 8-12
OSD range compared to something with more than 24 OSDs.

Also keep in mind that more nodes are beneficial when it comes to node
maintenance or failure. 
Can your cluster maintain sufficient performance if 1 node is down (in
your example 40 out of 120 OSDs)? I'm certain the answer will be no. 

Something like this, with either 2x DC S37xx (200 or 400GB) in the back or
a DC P3700 internally for journals, a fast CPU, and 64GB RAM (or more), is a
good starting point:
http://www.supermicro.com.tw/products/system/2U/5028/SSG-5028R-E1CR12L.cfm

Christian

> Thanks,
> 
> --
> Kenneth Van Alstyne
> Systems Architect
> Knight Point Systems, LLC
> Service-Disabled Veteran-Owned Business
> 1775 Wiehle Avenue Suite 101 | Reston, VA 20190
> c: 228-547-8045 f: 571-266-3106
> www.knightpoint.com 
> DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
> GSA Schedule 70 SDVOSB: GS-35F-0646S
> GSA MOBIS Schedule: GS-10F-0404Y
> ISO 2 / ISO 27001
> 
> Notice: This e-mail message, including any attachments, is for the sole
> use of the intended recipient(s) and may contain confidential and
> privileged information. Any unauthorized review, copy, use, disclosure,
> or distribution is STRICTLY prohibited. If you are not the intended
> recipient, please contact the sender by reply e-mail and destroy all
> copies of the original message.
> 
> > On Sep 1, 2015, at 11:26 AM, Robert LeBlanc 
> > wrote:
> > 
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA256
> > 
> > Just swapping out spindles for SSD will not give you orders of
> > magnitude performance gains as it does in regular cases. This is
> > because Ceph has a lot of overhead for each I/O which limits the
> > performance of the SSDs. In my testing, two Intel S3500 SSDs with an 8
> > core Atom (Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz) and size=1 and fio
> > with 8 jobs and QD=8 sync,direct 4K read/writes produced 2,600 IOPs.
> > Don't get me wrong, it will help, but don't expect spectacular results.
> > 
> > - 
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> > 
> > On Tue, Sep 1, 2015 at 8:01 AM, Kenneth Van Alstyne  wrote:
> > Thanks for the awesome advice folks.  Until I can go larger scale (50+
> > SATA disks), I’m thinking my best option here is to just swap out
> > these 1TB SATA disks with 1TB SSDs.  Am I oversimplifying the short
> > term solution?
> > 
> > Thanks,
> > 
> > - --
> > Kenneth Van Alstyne
> > Systems Architect
> > Knight Point Systems, LLC
> > Service-Disabled Veteran-Owned Business
> > 1775 Wiehle Avenue Suite 101 | Reston, VA 20190
> > c: 228-547-8045 f: 571-266-3106
> > www.knightpoint.com  
> > DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
> > GSA Schedule 70 SDVOSB: GS-35F-0646S
> > GSA MOBIS Schedule: GS-10F-0404Y
> > ISO 2 / ISO 27001
> > 
> > Notice: This e-mail message, including any attachments, is for the
> > sole use of the intended recipient(s) and may contain confidential and
> > privileged information. Any unauthorized review, copy, use,
> > disclosure, or distribution is STRICTLY prohibited. If you are not the
> > intended recipient, please contact the sender by reply e-mail and
> > destroy all copies of the original message.
> > 
> > On Aug 31, 2015, at 7:29 PM, Christian Balzer  wrote:
> > 
> > 
> > Hello,
> > 
> > On Mon, 31 Aug 2015 12:28:15 -0500 Kenneth Van Alstyne wrote:
> > 
> > In addition to the spot on comments by Warren and Quentin, verify this
> > by watching your nodes with atop, iostat, etc. 
> > The culprit (HDDs) should be plainly visible.
> > 
> > More inline:
> > 
> > Christian, et al:
> > 
> > Sorry for the lack of information.  I wasn’t sure what of our hardware
> > specifications or Ceph configuration was useful information at this
> > point.  Thanks for the feedback — any feedback, is appreciated at this
> > point, as I’ve been beating my head against a wall trying to figure out
> > what’s going on.  (If anything.  Maybe the spindle count is indeed our
> > upper limit or our SSDs really suck? :-) )
> > 
> > Your SSDs aren't the problem.
> > 
> > To directly address your questions, see answers below:
> > - CBT is the Ceph Benchmarking Tool.  Since my question was
> > more generic rather