[ceph-users] rgw bucket index manual copy

2016-09-20 Thread Василий Ангапов
Hello,

Is there any way to copy rgw bucket index to another Ceph node to
lower the downtime of RGW? At the moment I have a huge bucket with 200
million objects, and its backfilling blocks RGW completely for an
hour and a half, even with a 10G network.

Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rgw bucket index manual copy

2016-09-20 Thread Wido den Hollander

> On 20 September 2016 at 10:55, Василий Ангапов wrote:
> 
> 
> Hello,
> 
> Is there any way to copy rgw bucket index to another Ceph node to
> lower the downtime of RGW? For now I have  a huge bucket with 200
> million files and its backfilling is blocking RGW completely for an
> hour and a half even with 10G network.
> 

No, not really. What you really want is the bucket sharding feature.

So what you can do is enable the sharding, create a NEW bucket and copy over 
the objects.

Afterwards you can remove the old bucket.
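
Roughly, that workflow looks like this (a hedged sketch, not verified against
your setup; the shard count, bucket names and the use of s3cmd are only
examples, and the override applies only to buckets created after the change):

# on the RGW host(s): add "rgw override bucket index max shards = 32" to the
# [client.rgw.<instance>] section of ceph.conf and restart the RGW
s3cmd mb s3://mybucket-sharded                  # create the new (sharded) bucket
s3cmd sync s3://mybucket s3://mybucket-sharded  # copy the objects over
radosgw-admin bucket stats --bucket=mybucket-sharded   # sanity-check the new bucket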

Wido

> Thanks!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph reweight-by-utilization and increasing

2016-09-20 Thread Stefan Priebe - Profihost AG
Hi,

while using Ceph Hammer I saw in the docs that ceph reweight-by-utilization
has a --no-increasing flag. I do not use it, but I have never seen an
increased weight value, even though some of my OSDs are really empty.

Example:
821G  549G  273G  67% /var/lib/ceph/osd/ceph-110

vs.

821G  767G   54G  94% /var/lib/ceph/osd/ceph-13

I would expect ceph reweight-by-utilization to increase the weight of
osd.110, but instead it still only lowers other OSDs.

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increase PG number

2016-09-20 Thread Vincent Godin
Hi,

In fact, when you increase your PG number, the new PGs have to peer
first, and during that time a lot of PGs will be unreachable. The best way to
increase the number of PGs of a cluster (you'll need to adjust the number of
PGPs too) is:


   - Don't forget to apply Goncalo's advice to keep your cluster responsive
   for client operations. Otherwise all the I/O and CPU will be used for
   recovery and your cluster will be unreachable. Make sure these parameters
   are in place before increasing the PG count (see the example just below)
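
   A hedged example of what that throttling usually looks like (the values
   are illustrative only, not Goncalo's exact recommendation):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
# injectargs only changes the running daemons; persist the same values in the
# [osd] section of ceph.conf so they survive restarts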


   - stop and wait for scrub and deep-scrub operations

ceph osd set noscrub
ceph osd set nodeep-scrub

   - set your cluster in maintenance mode with:

ceph osd set norecover
ceph osd set nobackfill
ceph osd set nodown
ceph osd set noout

wait until your cluster no longer has any scrub or deep-scrub operations running

   - increase the pg_num with a small increment like 256


   - wait for the cluster to create and peer the new pgs (about 30 seconds)


   - increase the pgp_num by the same increment


   - wait for the cluster to create and peer (about 30 seconds)

Repeat the last four steps until you reach the pg_num and pgp_num you
want (example commands below).

At this point, your cluster is still functional.
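
For reference, one iteration of that loop looks roughly like this (a hedged
sketch; the pool name is a placeholder and the numbers are examples only):

ceph osd pool get <pool> pg_num          # check the current value
ceph osd pool set <pool> pg_num 2304     # previous value + 256
# wait ~30 seconds for the new PGs to be created and peered, then:
ceph osd pool set <pool> pgp_num 2304
# repeat until pg_num and pgp_num reach the target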

   - Now you have to unset the maintenance mode

ceph osd unset noout
ceph osd unset nodown
ceph osd unset nobackfill
ceph osd unset norecover

It will take some time to rebalance all the PGs, but at the end you will have
a cluster with all PGs active+clean. During the whole operation your cluster
will remain functional if you have respected Goncalo's parameters.


   - When all the pgs are active+clean, you can re-enable the scrub and
   deep-scrub operations

ceph osd unset noscrub
ceph osd unset nodeep-scrub
Vincent
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increase PG number

2016-09-20 Thread Matteo Dacrema
Thanks a lot guys.

I’ll try to do as you told me.

Best Regards
Matteo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph reweight-by-utilization and increasing

2016-09-20 Thread Dan van der Ster
Hi Stefan,

What's the current reweight value for osd.110? It cannot be increased above 1.

Cheers, Dan



On Tue, Sep 20, 2016 at 12:13 PM, Stefan Priebe - Profihost AG
 wrote:
> Hi,
>
> while using ceph hammer i saw in the doc of ceph reweight-by-utilization
> that there is a --no-increasing flag. I do not use it but never saw an
> increased weight value even some of my osds are really empty.
>
> Example:
> 821G  549G  273G  67% /var/lib/ceph/osd/ceph-110
>
> vs.
>
> 821G  767G   54G  94% /var/lib/ceph/osd/ceph-13
>
> I would expect that ceph reweight-by-utilization increases osd.110
> weight value but instead it still lowers other osds.
>
> Greets,
> Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph reweight-by-utilization and increasing

2016-09-20 Thread Christian Balzer

Hello,

This and the non-permanence of reweight is why I use CRUSH reweight (a
more distinct naming would be VERY helpful, too) and do it manually, which
tends to beat all the automated approaches so far.

Christian

 On Tue, 20 Sep 2016 13:49:50 +0200 Dan van der Ster wrote:

> Hi Stefan,
> 
> What's the current reweight value for osd.110? It cannot be increased above 1.
> 
> Cheers, Dan
> 
> 
> 
> On Tue, Sep 20, 2016 at 12:13 PM, Stefan Priebe - Profihost AG
>  wrote:
> > Hi,
> >
> > while using ceph hammer i saw in the doc of ceph reweight-by-utilization
> > that there is a --no-increasing flag. I do not use it but never saw an
> > increased weight value even some of my osds are really empty.
> >
> > Example:
> > 821G  549G  273G  67% /var/lib/ceph/osd/ceph-110
> >
> > vs.
> >
> > 821G  767G   54G  94% /var/lib/ceph/osd/ceph-13
> >
> > I would expect that ceph reweight-by-utilization increases osd.110
> > weight value but instead it still lowers other osds.
> >
> > Greets,
> > Stefan
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] swiftclient call radosgw, it always response 401 Unauthorized

2016-09-20 Thread Radoslaw Zarzynski
Hi Brian,

Responded inline.

On Tue, Sep 20, 2016 at 5:45 AM, Brian Chang-Chien
 wrote:
>
>
> 2016-09-20 10:14:38.761635 7f2049ffb700 20 
> HTTP_X_AUTH_TOKEN=b243614d27244d00b12b2f366b58d709
> 2016-09-20 10:14:38.761636 7f2049ffb700 20 QUERY_STRING=
> ...
> 2016-09-20 10:14:38.761720 7f2049ffb700  2 req 3:0.78:swift:HEAD 
> /swift/v1:stat_account:authorizing
> 2016-09-20 10:14:38.761725 7f2049ffb700 10 failed to authorize request
> 2016-09-20 10:14:38.761726 7f2049ffb700 20 handler->ERRORHANDLER: err_no=-1 
> new_err_no=-1


Those logs show there was no jump to the Keystone code
at all. This is because the "token_id=..." debug message [1]
is absent. The sole reason I see for such behavior is that
the RadosGW instance internally sees rgw_keystone_url
as empty [2][3].

Are you absolutely sure that the instance that got debug_rgw
to its configuration file has rgw_keystone_url properly set?
I mean whether the setting is in the same section, is written
in pure ASCII (without some crazy UTF characters) and so
on? I saw you posted the config earlier but we really need
to double check.
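
One way to double-check what the running RGW actually sees is the admin
socket (a hedged sketch; the socket path / instance name below is an
assumption, adjust it to your deployment):

ceph daemon /var/run/ceph/ceph-client.rgw.gateway.asok config show | grep rgw_keystone
# rgw_keystone_url (and the other rgw_keystone_* options) should show the
# values you expect; an empty rgw_keystone_url matches the behaviour above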

Could you also provide the output from the following curl command
and the corresponding RadosGW log? A 401 is fully expected,
as we'll intentionally send an invalid token.

curl -i "http://<rgw-host>:<rgw-port>/swift/v1" -X HEAD -H
"X-Auth-Token: random_string"

>
>
> I also have some problems
>
> Q1: when using Keystone, does radosgw need a user and subuser created?
> In this case I created an admin user and an admin:admin subuser, but I think
> that isn't needed. Am I right?


Yup, this is unnecessary when using the Keystone auth.

>
>
> Q2:
> I also noticed a phenomenon. When Keystone and RadosGW were working
> together before, running "rados --pool default.rgw.users.uid ls" would
> list something that looked like a token uid.
>
> But when Swift responds with 401, I can't find that token uid.
> Do you know how Keystone adds the token user to default.rgw.users.uid?
> Finally, I hope the messages below can help me solve this.
> Anyway, thanks a lot for your support.


You don't need to add anything. RadosGW will create
RGWUserInfo if necessary on the first, successfully
authenticated request [4]. The RADOS object will be
named after the tenant ID in Keystone.

Best regards,
Radoslaw Zarzynski

[1] https://github.com/ceph/ceph/blob/v10.2.2/src/rgw/rgw_swift.cc#L472
[2] https://github.com/ceph/ceph/blob/v10.2.2/src/rgw/rgw_swift.cc#L766-L769
[3] https://github.com/ceph/ceph/blob/v10.2.2/src/rgw/rgw_swift.h#L59-L61
[4] https://github.com/ceph/ceph/blob/v10.2.2/src/rgw/rgw_swift.cc#L413
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph reweight-by-utilization and increasing

2016-09-20 Thread Stefan Priebe - Profihost AG
On 20.09.2016 at 13:49, Dan van der Ster wrote:
> Hi Stefan,
> 
> What's the current reweight value for osd.110? It cannot be increased above 1.

Ah OK, it's 1 already. But that doesn't make sense, because it means all the
other OSDs (e.g. 109 OSDs) have to be touched and given lower values before
osd.110 gets more data...

Stefan

> 
> Cheers, Dan
> 
> 
> 
> On Tue, Sep 20, 2016 at 12:13 PM, Stefan Priebe - Profihost AG
>  wrote:
>> Hi,
>>
>> while using ceph hammer i saw in the doc of ceph reweight-by-utilization
>> that there is a --no-increasing flag. I do not use it but never saw an
>> increased weight value even some of my osds are really empty.
>>
>> Example:
>> 821G  549G  273G  67% /var/lib/ceph/osd/ceph-110
>>
>> vs.
>>
>> 821G  767G   54G  94% /var/lib/ceph/osd/ceph-13
>>
>> I would expect that ceph reweight-by-utilization increases osd.110
>> weight value but instead it still lowers other osds.
>>
>> Greets,
>> Stefan
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph reweight-by-utilization and increasing

2016-09-20 Thread Stefan Priebe - Profihost AG
Hi Christian,

On 20.09.2016 at 13:54, Christian Balzer wrote:
> This and the non-permanence of reweight is why I use CRUSH reweight (a
> more distinct naming would be VERY helpful, too) and do it manually, which
> tends to beat all the automated approaches so far.

So you really do it by hand and set the CRUSH weight yourself?

Greets,
Stefan

>  On Tue, 20 Sep 2016 13:49:50 +0200 Dan van der Ster wrote:
> 
>> Hi Stefan,
>>
>> What's the current reweight value for osd.110? It cannot be increased above 
>> 1.
>>
>> Cheers, Dan
>>
>>
>>
>> On Tue, Sep 20, 2016 at 12:13 PM, Stefan Priebe - Profihost AG
>>  wrote:
>>> Hi,
>>>
>>> while using ceph hammer i saw in the doc of ceph reweight-by-utilization
>>> that there is a --no-increasing flag. I do not use it but never saw an
>>> increased weight value even some of my osds are really empty.
>>>
>>> Example:
>>> 821G  549G  273G  67% /var/lib/ceph/osd/ceph-110
>>>
>>> vs.
>>>
>>> 821G  767G   54G  94% /var/lib/ceph/osd/ceph-13
>>>
>>> I would expect that ceph reweight-by-utilization increases osd.110
>>> weight value but instead it still lowers other osds.
>>>
>>> Greets,
>>> Stefan
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph reweight-by-utilization and increasing

2016-09-20 Thread Christian Balzer

Hello,

On Tue, 20 Sep 2016 14:40:25 +0200 Stefan Priebe - Profihost AG wrote:

> Hi Christian,
> 
> On 20.09.2016 at 13:54, Christian Balzer wrote:
> > This and the non-permanence of reweight is why I use CRUSH reweight (a
> > more distinct naming would be VERY helpful, too) and do it manually, which
> > tends to beat all the automated approaches so far.
> 
> so you do it really by hand and use ceph osd crush set weight?
>
Indeed.

Mind, my clusters aren't that big.
Also (as I described here before), by moving the worst offenders up and
down respectively while trying to keep the per-host weight as close to the
original value as possible, one winds up with only about half of the OSDs
needing tweaking.
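
A minimal sketch of that manual workflow (hedged: the OSD ids are the ones
from this thread, but the weights assume an original CRUSH weight of roughly
0.80 for these ~800 GB disks):

ceph osd df                           # spot over- and under-full OSDs
ceph osd crush reweight osd.13 0.75   # nudge the overfull OSD down a bit
ceph osd crush reweight osd.110 0.85  # and the underfull one up, keeping the host total roughly constant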

Also, as mentioned before, both approaches are ultimately band-aids for a
problem that needs something far more integrated and smarter, short of
re-visiting the CRUSH algorithm.

Because with plain reweights you will lose the adjustment when the OSD
gets set out for some reason.
While this is not the case with CRUSH reweights, losing an OSD (and the
re-balancing that ensues) may still cause some OSDs to get many more PGs
than they would have otherwise (with the original weights).

In short, CRUSH reweight can and will give you a nicely balanced cluster
during normal operations, but if you're running close to full (unable to
sustain an OSD or node loss and the resulting re-shuffling), it may not
save you.

Christian

> Greets,
> Stefan
> 
> >  On Tue, 20 Sep 2016 13:49:50 +0200 Dan van der Ster wrote:
> > 
> >> Hi Stefan,
> >>
> >> What's the current reweight value for osd.110? It cannot be increased 
> >> above 1.
> >>
> >> Cheers, Dan
> >>
> >>
> >>
> >> On Tue, Sep 20, 2016 at 12:13 PM, Stefan Priebe - Profihost AG
> >>  wrote:
> >>> Hi,
> >>>
> >>> while using ceph hammer i saw in the doc of ceph reweight-by-utilization
> >>> that there is a --no-increasing flag. I do not use it but never saw an
> >>> increased weight value even some of my osds are really empty.
> >>>
> >>> Example:
> >>> 821G  549G  273G  67% /var/lib/ceph/osd/ceph-110
> >>>
> >>> vs.
> >>>
> >>> 821G  767G   54G  94% /var/lib/ceph/osd/ceph-13
> >>>
> >>> I would expect that ceph reweight-by-utilization increases osd.110
> >>> weight value but instead it still lowers other osds.
> >>>
> >>> Greets,
> >>> Stefan
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > 
> > 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Auto recovering after loosing all copies of a PG(s)

2016-09-20 Thread Iain Buclaw
On 1 September 2016 at 23:04, Wido den Hollander  wrote:
>
>> Op 1 september 2016 om 17:37 schreef Iain Buclaw :
>>
>>
>> On 16 August 2016 at 17:13, Wido den Hollander  wrote:
>> >
>> >> Op 16 augustus 2016 om 15:59 schreef Iain Buclaw :
>> >>
>> >>
>> >> The desired behaviour for me would be for the client to get an instant
>> >> "not found" response from stat() operations.  For write() to recreate
>> >> unfound objects.  And for missing placement groups to be recreated on
>> >> an OSD that isn't overloaded.  Halting the entire cluster when 96% of
>> >> it can still be accessed is just not workable, I'm afraid.
>> >>
>> >
>> > Well, you can't make Ceph do that, but you can make librados do such a 
>> > thing.
>> >
>> > I'm using the OSD and MON timeout settings in libvirt for example: 
>> > http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=9665fbca3a18fbfc7e4caec3ee8e991e13513275;hb=HEAD#l157
>> >
>> > You can set these options:
>> > - client_mount_timeout
>> > - rados_mon_op_timeout
>> > - rados_osd_op_timeout
>> >
>> > Where I think only the last two should be sufficient in your case.
>> >
>> > You wel get ETIMEDOUT back as error when a operation times out.
>> >
>> > Wido
>> >
>>
>> This seems to be fine.
>>
>> Now what to do when a DR situation happens.
>>
>>
>>   pgmap v592589: 4096 pgs, 1 pools, 1889 GB data, 244 Mobjects
>> 2485 GB used, 10691 GB / 13263 GB avail
>> 3902 active+clean
>>  128 creating
>>   66 incomplete
>>
>>
>> These PGs just never seem to finish creating.
>>
>
> I have seen that happen as well, you sometimes need to restart the OSDs to 
> let the create finish.
>
> Wido
>

Just had another DR situation over the weekend, and I can confirm that
setting client-side timeouts did effectively nothing to help the
situation.  According to the ceph performance stats, the total
throughput of client operations went from 5000 per second to
just 20.  All clients are set with rados osd op timeout = 0.5, and are
using AIO.

Why must everything come to a halt internally when 1 of 30 OSDs in the
cluster is down? I only managed to get it up to 70 ops after forcibly
completing the PGs (stale+active+clean).  Then I got back up to normal
operations (-ish) after issuing force_create_pg, then stopping and starting
the OSD the PG got moved to.

This is something that I'm trying to understand about ceph/librados.
If one disk is down, the whole system collapses to a trickling low
rate that is not really any better than being completely down. It's as
if it cannot cope with losing a disk that holds the only copy of a
PG.

As I've said before, the clients don't really care if data goes
missing or gets lost in the first place.  So long as accessible data
continues to be accessed without disruption, then everything will be
happy.

Is there a better way to make the cluster stay happy in this scenario?
As I've said before, the behaviour I'm really looking for is to simply
recreate lost PGs and move on with its life, with zero impact on
performance.

Lost data will always be recreated two days later by the clients that
check the validity of what's stored.

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Stat speed for objects in ceph

2016-09-20 Thread Iain Buclaw
Hi,

As a general observation, the speed of calling stat() on any object in
ceph is relatively slow.  I'm probably getting a rate of about 10K per
second using AIO, and even then it is really *really* bursty, to the
point where there could be 5 seconds of activity going in one
direction, then the callback thread wakes up and processes all queued
completions in a single blast.

At our current rate, with more than 1 billion objects in a pool, it looks
like checking the existence of every object would take around 19-24 hours
to complete.

Granted that our starting point before beginning some migrations to
Ceph was around 1 hour to check the existence of every object, this is
something of a concern.  Are there any ways via librados to improve
the throughput of processing objects?

Adding more instances or sharding the work doesn't seem to increase the
overall throughput at all. And caching won't help either: there is no
determinism in what's accessed, and given the size of the pool the OS
filesystem cache is useless anyway.

Thanks,
-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stat speed for objects in ceph

2016-09-20 Thread Gregory Farnum
In librados getting a stat is basically equivalent to reading a small
object; there's not an index or anything so FileStore needs to descend its
folder hierarchy. If looking at metadata for all the objects in the system
efficiently is important you'll want to layer an index in somewhere.
-Greg
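
One crude way to layer such an index in with plain RADOS omap (a hedged
sketch; the pool, index object and key names are made up, and in practice
you would maintain it from the write path via librados rather than the CLI):

# at write time, record the object name as an omap key on a small index object
rados -p mypool setomapval index.shard-00 my-object-name ""
# an existence check is then a single omap lookup instead of a per-object stat()
rados -p mypool getomapval index.shard-00 my-object-name
# and enumeration is a key listing
rados -p mypool listomapkeys index.shard-00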

On Tuesday, September 20, 2016, Iain Buclaw  wrote:

> Hi,
>
> As a general observation, the speed of calling stat() on any object in
> ceph is relatively slow.  I'm probably getting a rate of about 10K per
> second using AIO, and even then it is really *really* bursty, to the
> point where there could be 5 seconds of activity going in one
> direction, then the callback thread wakes up and processes all queued
> completions in a single blast.
>
> At our current rate with more than 1 billion objects in a pool, it's
> looking like if I was to check the existence of every object, that it
> would take around 19-24 hours to complete.
>
> Granted that our starting point before beginning some migrations to
> Ceph was around 1 hour to check the existence of every object, this is
> something of a concern.  Are there any ways via librados to improve
> the throughput of processing objects?
>
> Adding more instances or sharding work doesn't seem to increase the
> overall throughput at all.  And cache won't help either, there is no
> determinism in what's accessed, and given the size of the pool OS
> filesystem cache is useless anyway.
>
> Thanks,
> --
> Iain Buclaw
>
> *(p < e ? p++ : p) = (c & 0x0f) + '0';
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Jewel Docs | error on mount.ceph page

2016-09-20 Thread David
Sorry I don't know the correct way to report this.

Potential error on this page:

on http://docs.ceph.com/docs/jewel/man/8/mount.ceph/

Currently:

rsize
int (bytes), max readahead, multiple of 1024, Default: 524288 (512*1024)

Should it be something like the following?

rsize
int (bytes), max read size. Default: none

rasize
int (bytes), max readahead, multiple of 1024, Default: 8388608 (8192*1024)
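
For context, these are mount options, e.g. (a hedged example; the monitor
address, secret file and readahead value are placeholders only):

mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=8388608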
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cache tier not flushing 10.2.2

2016-09-20 Thread Jim Kilborn
Simple issue I can't pin down with the cache tier. Thanks for taking the time…

I set up a new cluster with an SSD cache tier. My cache tier is on 1 TB SSDs,
with 2 replicas. It just fills up my cache until the ceph filesystem stops
allowing access.
I even set the target_max_bytes to 1048576 (1GB) and it still doesn't flush.

Here are the settings:

Setup the pools

ceph osd pool create cephfs-cache 512 512 replicated ssd_ruleset
ceph osd pool create cephfs-metadata 512 512 replicated ssd_ruleset
ceph osd pool create cephfs-data 512 512 erasure default spinning_ruleset
ceph osd pool set cephfs-cache min_size 1
ceph osd pool set cephfs-cache size 2
ceph osd pool set cephfs-metadata min_size 1
ceph osd pool set cephfs-metadata size 2



Add tiers

ceph osd tier add cephfs-data cephfs-cache
ceph osd tier cache-mode cephfs-cache writeback
ceph osd tier set-overlay cephfs-data cephfs-cache
ceph osd pool set cephfs-cache hit_set_type bloom
ceph osd pool set cephfs-cache hit_set_count 1
ceph osd pool set cephfs-cache hit_set_period 3600
ceph osd pool set cephfs-cache target_max_bytes 1048576 # 1 TB
ceph osd pool set cephfs-cache cache_target_dirty_ratio 0.4 # percentage of 
target_max_bytes before flushes dirty objects
ceph osd pool set cephfs-cache cache_target_dirty_high_ratio 0.6 # percentage 
of target_max_bytes before flushes dirty objects more aggressively
ceph osd pool set cephfs-cache cache_target_full_ratio 0.80 # percentage of 
cache full before evicts objects


Am I missing something stupid? Must be. I can cause it to flush with
rados -p cephfs-cache cache-try-flush-evict-all

Should my metadata not be on the same pool as the cache pool?

I can't figure out why it doesn't start flushing when I copy over 2 GB of
data. It just goes to
'cephfs-cache' at/near target max

Regards,
Jim

Sent from Mail for Windows 10

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stat speed for objects in ceph

2016-09-20 Thread Wido den Hollander

> On 20 September 2016 at 19:27, Gregory Farnum wrote:
> 
> 
> In librados getting a stat is basically equivalent to reading a small
> object; there's not an index or anything so FileStore needs to descend its
> folder hierarchy. If looking at metadata for all the objects in the system
> efficiently is important you'll want to layer an index in somewhere.
> -Greg
> 

Should we expect an improvement here with BlueStore vs FileStore? That would 
basically be a RocksDB lookup on the OSD, right?

Wido

> On Tuesday, September 20, 2016, Iain Buclaw  wrote:
> 
> > Hi,
> >
> > As a general observation, the speed of calling stat() on any object in
> > ceph is relatively slow.  I'm probably getting a rate of about 10K per
> > second using AIO, and even then it is really *really* bursty, to the
> > point where there could be 5 seconds of activity going in one
> > direction, then the callback thread wakes up and processes all queued
> > completions in a single blast.
> >
> > At our current rate with more than 1 billion objects in a pool, it's
> > looking like if I was to check the existence of every object, that it
> > would take around 19-24 hours to complete.
> >
> > Granted that our starting point before beginning some migrations to
> > Ceph was around 1 hour to check the existence of every object, this is
> > something of a concern.  Are there any ways via librados to improve
> > the throughput of processing objects?
> >
> > Adding more instances or sharding work doesn't seem to increase the
> > overall throughput at all.  And cache won't help either, there is no
> > determinism in what's accessed, and given the size of the pool OS
> > filesystem cache is useless anyway.
> >
> > Thanks,
> > --
> > Iain Buclaw
> >
> > *(p < e ? p++ : p) = (c & 0x0f) + '0';
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stat speed for objects in ceph

2016-09-20 Thread Gregory Farnum
On Tue, Sep 20, 2016 at 11:26 AM, Wido den Hollander  wrote:
>
>> Op 20 september 2016 om 19:27 schreef Gregory Farnum :
>>
>>
>> In librados getting a stat is basically equivalent to reading a small
>> object; there's not an index or anything so FileStore needs to descend its
>> folder hierarchy. If looking at metadata for all the objects in the system
>> efficiently is important you'll want to layer an index in somewhere.
>> -Greg
>>
>
> Should we expect a improvement here with BlueStore vs FileStore? That would 
> basically be a RocksDB lookup on the OSD, right?

I think it will be at least a little better, since as you say it's a
RocksDB lookup? But I haven't paid BlueStore a ton of attention and
even that will cost a lot more than something like a mostly-in-memory
database.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stat speed for objects in ceph

2016-09-20 Thread Haomai Wang
On Wed, Sep 21, 2016 at 2:26 AM, Wido den Hollander  wrote:
>
>> Op 20 september 2016 om 19:27 schreef Gregory Farnum :
>>
>>
>> In librados getting a stat is basically equivalent to reading a small
>> object; there's not an index or anything so FileStore needs to descend its
>> folder hierarchy. If looking at metadata for all the objects in the system
>> efficiently is important you'll want to layer an index in somewhere.
>> -Greg
>>
>
> Should we expect a improvement here with BlueStore vs FileStore? That would 
> basically be a RocksDB lookup on the OSD, right?

Yes, BlueStore will be much better since onodes (similar to inodes) are
indexed in RocksDB. Although that is fast enough, constructing the object
metadata still has some cost; if you only want to check object existence,
we may need a more lightweight interface.

>
> Wido
>
>> On Tuesday, September 20, 2016, Iain Buclaw  wrote:
>>
>> > Hi,
>> >
>> > As a general observation, the speed of calling stat() on any object in
>> > ceph is relatively slow.  I'm probably getting a rate of about 10K per
>> > second using AIO, and even then it is really *really* bursty, to the
>> > point where there could be 5 seconds of activity going in one
>> > direction, then the callback thread wakes up and processes all queued
>> > completions in a single blast.
>> >
>> > At our current rate with more than 1 billion objects in a pool, it's
>> > looking like if I was to check the existence of every object, that it
>> > would take around 19-24 hours to complete.
>> >
>> > Granted that our starting point before beginning some migrations to
>> > Ceph was around 1 hour to check the existence of every object, this is
>> > something of a concern.  Are there any ways via librados to improve
>> > the throughput of processing objects?
>> >
>> > Adding more instances or sharding work doesn't seem to increase the
>> > overall throughput at all.  And cache won't help either, there is no
>> > determinism in what's accessed, and given the size of the pool OS
>> > filesystem cache is useless anyway.
>> >
>> > Thanks,
>> > --
>> > Iain Buclaw
>> >
>> > *(p < e ? p++ : p) = (c & 0x0f) + '0';
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com 
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stat speed for objects in ceph

2016-09-20 Thread Wido den Hollander

> On 20 September 2016 at 20:30, Haomai Wang wrote:
> 
> 
> On Wed, Sep 21, 2016 at 2:26 AM, Wido den Hollander  wrote:
> >
> >> Op 20 september 2016 om 19:27 schreef Gregory Farnum :
> >>
> >>
> >> In librados getting a stat is basically equivalent to reading a small
> >> object; there's not an index or anything so FileStore needs to descend its
> >> folder hierarchy. If looking at metadata for all the objects in the system
> >> efficiently is important you'll want to layer an index in somewhere.
> >> -Greg
> >>
> >
> > Should we expect a improvement here with BlueStore vs FileStore? That would 
> > basically be a RocksDB lookup on the OSD, right?
> 
> Yes, bluestore will be much better since it has indexed on Onode(like
> inode) in rocksdb. Although it's fast enough, it also cost some on
> construct object, if you only want to check object existence, we may
> need a more lightweight interface
> 

It's rados_stat() which would be called; that is the way to check if an object 
exists. If I remember the BlueStore architecture correctly it would be a lookup 
in RocksDB with all the information in there.

Wido

> >
> > Wido
> >
> >> On Tuesday, September 20, 2016, Iain Buclaw  wrote:
> >>
> >> > Hi,
> >> >
> >> > As a general observation, the speed of calling stat() on any object in
> >> > ceph is relatively slow.  I'm probably getting a rate of about 10K per
> >> > second using AIO, and even then it is really *really* bursty, to the
> >> > point where there could be 5 seconds of activity going in one
> >> > direction, then the callback thread wakes up and processes all queued
> >> > completions in a single blast.
> >> >
> >> > At our current rate with more than 1 billion objects in a pool, it's
> >> > looking like if I was to check the existence of every object, that it
> >> > would take around 19-24 hours to complete.
> >> >
> >> > Granted that our starting point before beginning some migrations to
> >> > Ceph was around 1 hour to check the existence of every object, this is
> >> > something of a concern.  Are there any ways via librados to improve
> >> > the throughput of processing objects?
> >> >
> >> > Adding more instances or sharding work doesn't seem to increase the
> >> > overall throughput at all.  And cache won't help either, there is no
> >> > determinism in what's accessed, and given the size of the pool OS
> >> > filesystem cache is useless anyway.
> >> >
> >> > Thanks,
> >> > --
> >> > Iain Buclaw
> >> >
> >> > *(p < e ? p++ : p) = (c & 0x0f) + '0';
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com 
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Best Practices for Managing Multiple Pools

2016-09-20 Thread Heath Albritton
I'm wondering if anyone has some tips for managing different types of
pools, each of which fall on a different type of OSD.

Right now, I have a small cluster running with two kinds of OSD nodes,
ones with spinning disks (and SSD journals) and another with all SATA
SSD.  I'm currently running cache tiering and looking to move away
from that.

My end goal is to have a general purpose block storage pool on the
spinning disks along with object storage.  Then I'd like to do a
separate pool of low-latency block storage against the SSD nodes.
Finally, I'd like to add a third node type that has a high number of
spinning disks, no SSD journals and runs object storage on an EC pool.
This final pool would be for backup purposes.

I can envision running all these in the same cluster with a crushmap
that allocates the pools to the correct OSDs.  However, I'm concerned
about the radius of failure running all these different use cases on a
single cluster.

I have for example, had an instance where a single full OSD caused the
entire cluster to stop accepting writes, which affected all the pools
in the cluster, regardless of whether those pools had PGs on the
affected OSD.

It's simple enough to run separate clusters for these, but then I'd be
faced with that complexity as well, including some number of mons for
each.  I'm wondering if I'm overstating the risks and benefits of
having a single crushmap.  i.e. instead of cache tiering, I can do a
primary SSD secondary and tertiary on spinning disk.

Any thoughts and experiences on this topic would be welcome.


-H
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Same pg scrubbed over and over (Jewel)

2016-09-20 Thread Martin Bureau
Hello,


I noticed that the same pg gets scrubbed repeatedly on our new Jewel cluster:


Here's an excerpt from log:


2016-09-20 20:36:31.236123 osd.12 10.1.82.82:6820/14316 150514 : cluster [INF] 
25.3f scrub ok
2016-09-20 20:36:32.232918 osd.12 10.1.82.82:6820/14316 150515 : cluster [INF] 
25.3f scrub starts
2016-09-20 20:36:32.236876 osd.12 10.1.82.82:6820/14316 150516 : cluster [INF] 
25.3f scrub ok
2016-09-20 20:36:33.233268 osd.12 10.1.82.82:6820/14316 150517 : cluster [INF] 
25.3f deep-scrub starts
2016-09-20 20:36:33.242258 osd.12 10.1.82.82:6820/14316 150518 : cluster [INF] 
25.3f deep-scrub ok
2016-09-20 20:36:36.233604 osd.12 10.1.82.82:6820/14316 150519 : cluster [INF] 
25.3f scrub starts
2016-09-20 20:36:36.237221 osd.12 10.1.82.82:6820/14316 150520 : cluster [INF] 
25.3f scrub ok
2016-09-20 20:36:41.234490 osd.12 10.1.82.82:6820/14316 150521 : cluster [INF] 
25.3f deep-scrub starts
2016-09-20 20:36:41.243720 osd.12 10.1.82.82:6820/14316 150522 : cluster [INF] 
25.3f deep-scrub ok
2016-09-20 20:36:45.235128 osd.12 10.1.82.82:6820/14316 150523 : cluster [INF] 
25.3f deep-scrub starts
2016-09-20 20:36:45.352589 osd.12 10.1.82.82:6820/14316 150524 : cluster [INF] 
25.3f deep-scrub ok
2016-09-20 20:36:47.235310 osd.12 10.1.82.82:6820/14316 150525 : cluster [INF] 
25.3f scrub starts
2016-09-20 20:36:47.239348 osd.12 10.1.82.82:6820/14316 150526 : cluster [INF] 
25.3f scrub ok
2016-09-20 20:36:49.235538 osd.12 10.1.82.82:6820/14316 150527 : cluster [INF] 
25.3f deep-scrub starts
2016-09-20 20:36:49.243121 osd.12 10.1.82.82:6820/14316 150528 : cluster [INF] 
25.3f deep-scrub ok
2016-09-20 20:36:51.235956 osd.12 10.1.82.82:6820/14316 150529 : cluster [INF] 
25.3f deep-scrub starts
2016-09-20 20:36:51.244201 osd.12 10.1.82.82:6820/14316 150530 : cluster [INF] 
25.3f deep-scrub ok
2016-09-20 20:36:52.236076 osd.12 10.1.82.82:6820/14316 150531 : cluster [INF] 
25.3f scrub starts
2016-09-20 20:36:52.239376 osd.12 10.1.82.82:6820/14316 150532 : cluster [INF] 
25.3f scrub ok
2016-09-20 20:36:56.236740 osd.12 10.1.82.82:6820/14316 150533 : cluster [INF] 
25.3f scrub starts


How can I troubleshoot / resolve this ?


Regards,

Martin

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cache tier not flushing 10.2.2

2016-09-20 Thread Jim Kilborn
Please disregard this. I had an error in my target_max_bytes, which was causing 
the issue. I now have it flushing and evicting the cache as expected.
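
For anyone hitting the same thing: 1048576 bytes is only 1 MiB, not 1 TB.
A hedged example of what an actual 1 TiB cap would look like (same command
as in the original mail, just with the corrected value):

ceph osd pool set cephfs-cache target_max_bytes 1099511627776   # 1 TiB = 1024^4 bytes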







Sent from Mail for Windows 10



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Auto recovering after loosing all copies of a PG(s)

2016-09-20 Thread Gregory Farnum
On Tue, Sep 20, 2016 at 6:19 AM, Iain Buclaw  wrote:
> On 1 September 2016 at 23:04, Wido den Hollander  wrote:
>>
>>> Op 1 september 2016 om 17:37 schreef Iain Buclaw :
>>>
>>>
>>> On 16 August 2016 at 17:13, Wido den Hollander  wrote:
>>> >
>>> >> Op 16 augustus 2016 om 15:59 schreef Iain Buclaw :
>>> >>
>>> >>
>>> >> The desired behaviour for me would be for the client to get an instant
>>> >> "not found" response from stat() operations.  For write() to recreate
>>> >> unfound objects.  And for missing placement groups to be recreated on
>>> >> an OSD that isn't overloaded.  Halting the entire cluster when 96% of
>>> >> it can still be accessed is just not workable, I'm afraid.
>>> >>
>>> >
>>> > Well, you can't make Ceph do that, but you can make librados do such a 
>>> > thing.
>>> >
>>> > I'm using the OSD and MON timeout settings in libvirt for example: 
>>> > http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=9665fbca3a18fbfc7e4caec3ee8e991e13513275;hb=HEAD#l157
>>> >
>>> > You can set these options:
>>> > - client_mount_timeout
>>> > - rados_mon_op_timeout
>>> > - rados_osd_op_timeout
>>> >
>>> > Where I think only the last two should be sufficient in your case.
>>> >
>>> > You wel get ETIMEDOUT back as error when a operation times out.
>>> >
>>> > Wido
>>> >
>>>
>>> This seems to be fine.
>>>
>>> Now what to do when a DR situation happens.
>>>
>>>
>>>   pgmap v592589: 4096 pgs, 1 pools, 1889 GB data, 244 Mobjects
>>> 2485 GB used, 10691 GB / 13263 GB avail
>>> 3902 active+clean
>>>  128 creating
>>>   66 incomplete
>>>
>>>
>>> These PGs just never seem to finish creating.
>>>
>>
>> I have seen that happen as well, you sometimes need to restart the OSDs to 
>> let the create finish.
>>
>> Wido
>>
>
> Just had another DR situation happen again over the weekend, and I can
> confirm that setting client side timeouts did effectively nothing to
> help the situation.  According to the ceph performance stats, the
> total throughput of client operations went from 5000 per second to
> just 20.  All clients are set with rados osd op timeout = 0.5, and are
> using AIO.
>
> Why must everything come to a halt internally when 1/30 OSDs of the
> cluster is down? I managed only to get it up to 70 ops after forcibly
> completing the PGs (stale+active+clean).  Then I got back up to normal
> operations (-ish) after issuing force_create_pg, then stop and start
> the OSD where the PG got moved to.
>
> This is something that I'm trying to understand about ceph/librados.
> If one disk is down, the whole system is collapses to a trickling low
> rate that is not really any better than being completely down. It's as
> if it cannot cope with loosing a disk that holds the only copy of a
> PG.

Yes; the whole system is designed to prevent this. I understand your
use case but unfortunately Ceph would require a fair bit of surgery to
really be happy as a disposable object store. You might be able to
hack it together by having the OSD checks for down PGs return an error
code instead of putting requests on a waitlist, and by having clients
which see that error send off monitor commands, but it would
definitely be a hack.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best Practices for Managing Multiple Pools

2016-09-20 Thread Wido den Hollander

> On 20 September 2016 at 21:23, Heath Albritton wrote:
> 
> 
> I'm wondering if anyone has some tips for managing different types of
> pools, each of which fall on a different type of OSD.
> 
> Right now, I have a small cluster running with two kinds of OSD nodes,
> ones with spinning disks (and SSD journals) and another with all SATA
> SSD.  I'm currently running cache tiering and looking to move away
> from that.
> 
> My end goal is to have a general purpose block storage pool on the
> spinning disks along with object storage.  Then I'd like to do a
> separate pool of low-latency block storage against the SSD nodes.
> Finally, I'd like to add a third node type that has a high number of
> spinning disks, no SSD journals and runs object storage on an EC pool.
> This final pool would be for backup purposes.
> 
> I can envision running all these in the same cluster with a crushmap
> that allocates the pools to the correct OSDs.  However, I'm concerned
> about the radius of failure running all these different use cases on a
> single cluster.
> 
> I have for example, had an instance where a single full OSD caused the
> entire cluster to stop accepting writes, which affected all the pools
> in the cluster, regardless of whether those pools had PGs on the
> affected OSD.
> 
> It's simple enough to run separate clusters for these, but then I'd be
> faced with that complexity as well, including some number of mons for
> each.  I'm wondering if I'm overstating the risks and benefits of
> having a single crushmap.  i.e. instead of cache tiering, I can do a
> primary SSD secondary and tertiary on spinning disk.
> 
> Any thoughts and experiences on this topic would be welcome.
> 

Well, there is no single "right" answer here. Having it all in one cluster 
makes it easy to move pools between OSDs and to replace hardware. 
It's also only one cluster which you have to manage.

But as you say, the failure domain becomes larger.

Having multiple clusters simply means more work and you are not able to migrate 
as smoothly as you can when it's all one cluster.

What you want can easily be done inside a single cluster. You just have to do 
proper monitoring. Imho, having an OSD go full means the monitoring failed 
somewhere.

You can always set the nearfull ratio lower to get the cluster to go to WARN 
earlier.
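
For example (hedged; this is the pre-Luminous syntax and the ratio is only
an illustration):

ceph pg set_nearfull_ratio 0.80   # warn at 80% instead of the default 85%
# and/or set mon_osd_nearfull_ratio in ceph.conf so the value persists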

Wido

> 
> -H
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel Docs | error on mount.ceph page

2016-09-20 Thread Ilya Dryomov
On Tue, Sep 20, 2016 at 7:48 PM, David  wrote:
> Sorry I don't know the correct way to report this.
>
> Potential error on this page:
>
> on http://docs.ceph.com/docs/jewel/man/8/mount.ceph/
>
> Currently:
>
> rsize
> int (bytes), max readahead, multiple of 1024, Default: 524288 (512*1024)
>
> Should it be something like the following?
>
> rsize
> int (bytes), max read size. Default: none
>
> rasize
> int (bytes), max readahead, multiple of 1024, Default: 8388608 (8192*1024)

This was fixed in master last week and can probably be added into the
next jewel point release.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stat speed for objects in ceph

2016-09-20 Thread Haomai Wang
On Wed, Sep 21, 2016 at 2:41 AM, Wido den Hollander  wrote:
>
>> Op 20 september 2016 om 20:30 schreef Haomai Wang :
>>
>>
>> On Wed, Sep 21, 2016 at 2:26 AM, Wido den Hollander  wrote:
>> >
>> >> Op 20 september 2016 om 19:27 schreef Gregory Farnum :
>> >>
>> >>
>> >> In librados getting a stat is basically equivalent to reading a small
>> >> object; there's not an index or anything so FileStore needs to descend its
>> >> folder hierarchy. If looking at metadata for all the objects in the system
>> >> efficiently is important you'll want to layer an index in somewhere.
>> >> -Greg
>> >>
>> >
>> > Should we expect a improvement here with BlueStore vs FileStore? That 
>> > would basically be a RocksDB lookup on the OSD, right?
>>
>> Yes, bluestore will be much better since it has indexed on Onode(like
>> inode) in rocksdb. Although it's fast enough, it also cost some on
>> construct object, if you only want to check object existence, we may
>> need a more lightweight interface
>>
>
> It's rados_stat() which would be called, that is the way to check if a object 
> exists. If I remember the BlueStore architecture correctly it would be a 
> lookup in RocksDB with all the information in there.

Exactly, but compared to a database query this lookup is still heavy.
Each onode construction needs to fetch lots of keys and do inline decoding.
Of course, it's one of the cheaper operations among the rados interfaces.

>
> Wido
>
>> >
>> > Wido
>> >
>> >> On Tuesday, September 20, 2016, Iain Buclaw  wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > As a general observation, the speed of calling stat() on any object in
>> >> > ceph is relatively slow.  I'm probably getting a rate of about 10K per
>> >> > second using AIO, and even then it is really *really* bursty, to the
>> >> > point where there could be 5 seconds of activity going in one
>> >> > direction, then the callback thread wakes up and processes all queued
>> >> > completions in a single blast.
>> >> >
>> >> > At our current rate with more than 1 billion objects in a pool, it's
>> >> > looking like if I was to check the existence of every object, that it
>> >> > would take around 19-24 hours to complete.
>> >> >
>> >> > Granted that our starting point before beginning some migrations to
>> >> > Ceph was around 1 hour to check the existence of every object, this is
>> >> > something of a concern.  Are there any ways via librados to improve
>> >> > the throughput of processing objects?
>> >> >
>> >> > Adding more instances or sharding work doesn't seem to increase the
>> >> > overall throughput at all.  And cache won't help either, there is no
>> >> > determinism in what's accessed, and given the size of the pool OS
>> >> > filesystem cache is useless anyway.
>> >> >
>> >> > Thanks,
>> >> > --
>> >> > Iain Buclaw
>> >> >
>> >> > *(p < e ? p++ : p) = (c & 0x0f) + '0';
>> >> > ___
>> >> > ceph-users mailing list
>> >> > ceph-users@lists.ceph.com 
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how run multiple node in single machine in previous version of ceph

2016-09-20 Thread Brad Hubbard
Just use git to check out and build that branch (older branches use
autotools) and then follow the instructions for that release.

http://docs.ceph.com/docs/infernalis/dev/quick_guide/
http://docs.ceph.com/docs/hammer/dev/quick_guide/
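
Roughly, it comes down to something like this (a hedged sketch; the branch
name and vstart flags are examples, see the guides above for the exact steps
of each release):

git clone --recursive https://github.com/ceph/ceph.git && cd ceph
git checkout infernalis && git submodule update --init --recursive
./install-deps.sh                        # pull in the build dependencies
./autogen.sh && ./configure && make -j"$(nproc)"
cd src
MON=3 OSD=3 MDS=1 ./vstart.sh -d -n -x   # 3 mons, 3 osds, 1 mds on one machine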

On Tue, Sep 20, 2016 at 12:19 AM, agung Laksono  wrote:
> With the latest version of Ceph, it's easy to bring up a cluster with
> multiple nodes on a single machine by using vstart.sh.
>
> What if I want to do the same for an old version of Ceph, like
> Infernalis or an older one?
>
> Any answer is very appreciated. thanks
>
>
> --
> Cheers,
>
> Agung Laksono
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com