Re: [ceph-users] OSD disk concern

2017-04-19 Thread gjprabu
Hi Shuresh,



   Thanks for your reply. Is it OK to have the OS on a normal SATA hard
drive, with the OSD data volume and journal on the same SSD? We are mainly asking
for performance reasons.



Regards

Prabu GJ  



 On Wed, 19 Apr 2017 11:54:04 +0530 Shuresh  
wrote 




Hi Prabhu,



You can run both the OS and an OSD on a single SSD.



Regards

Shureshbabu




On 19 April 2017 at 11:02, gjprabu  wrote:








Hi Team,



 The Ceph OSD disk allocation guidance suggests that "We recommend using
a dedicated drive for the operating system and software, and one drive for each
Ceph OSD Daemon you run on the host". We have only SSDs; is it
advisable to run the OS and an OSD on the same disk?



Regards

Prabu GJ











___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD disk concern

2017-04-19 Thread Peter Maloney
On 04/19/17 07:42, gjprabu wrote:
> Hi Shuresh,
>
>Thanks for your reply,   Is it ok to have OS on normal SATA
> hard drive, volume and journal on same SSD.  Mainly we are asking this
> suggestion for performance purpose.
>
For performance, it's always best to make it as parallel as possible.

The OS is probably pretty static... mostly logs are written, so it might
not impact performance much. How much is too much depends on your
needs; it's up to you to decide whether to combine them. The best practice is
not to.

The mons don't need much IO, but at really huge scale it is said they should
be kept totally separate because they become a bottleneck. At small scale a
mon still doesn't need much IO, but what it uses will slow down an OSD it
shares a disk with. So on a small cluster a mon can easily share the OS disk,
especially if it's an SSD.

In general, the OSD should always be completely separate. It won't crash if
you combine it, until it ends up so slow that it times out. Whether that is
acceptable depends on your needs.

The same goes for journals, except the slowdown is more significant if you
colocate the journal with the OSD. The exception is an all-SSD setup: there it
is more cost effective to have more SSD OSDs with colocated journals than to
add separate journal devices, because you won't use all the space on the
journal device, it costs more relative to the benefit, and it uses up
valuable disk bays.
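
As a concrete (hypothetical) example of that colocated layout on an all-SSD
box, using jewel-era tooling; device names are placeholders:

# OS (and possibly a mon) live on /dev/sda; each remaining SSD becomes one
# OSD with its journal colocated on the same device (ceph-disk partitions it)
ceph-disk prepare /dev/sdb
ceph-disk prepare /dev/sdc
# alternative, if you did want a separate journal device:
ceph-disk prepare /dev/sdd /dev/nvme0n1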

> Regards
> Prabu GJ  
>
>  On Wed, 19 Apr 2017 11:54:04 +0530 *Shuresh
> * wrote 
>
> Hi Prabhu,
>
> You can use both the OS and OSD in single SSD.
>
> Regards
> Shureshbabu
>
> On 19 April 2017 at 11:02, gjprabu  > wrote:
>
>
> Hi Team,
>
>  Ceph OSD disk allocation procedure suggested that
> "*We recommend using a dedicated drive for the operating
> system and software, and one drive for each Ceph OSD Daemon
> you run on the host*"  We have only SSD hard disk and is it
> advisable to run OS and OSD on the same disk.
>
> Regards
> Prabu GJ
>
>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Does cephfs guarantee client cache consistency for file data?

2017-04-19 Thread 许雪寒
Hi, everyone.

I’m new to cephfs. I wonder whether cephfs guarantees client cache consistency
for file content. For example, if client A reads some data from file X and
client B then modifies X’s content in the range that A read, will A be notified
of the modification?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does cephfs guarantee client cache consistency for file data?

2017-04-19 Thread David Disseldorp
Hi,

On Wed, 19 Apr 2017 08:19:50 +, 许雪寒 wrote:

> I’m new to cephfs. I wonder whether cephfs guarantee client cache consistency 
> for file content. For example, if client A read some data of file X, then 
> client B modified the X’s content in the range that A read, will A be 
> notified of the modification?

Yes, clients are granted fine-grained cache "capabilities" by the MDS.
These capabilities can be revoked to trigger a flush of cached content,
prior to satisfying a request from a separate client.

Jeff Layton did a nice write up on this:
https://jtlayton.wordpress.com/2016/09/01/cephfs-and-the-nfsv4-change-attribute/

Cheers, David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd kernel client fencing

2017-04-19 Thread Chaofan Yu
Hi list,

  I wonder if someone can help with rbd kernel client fencing (aimed at
avoiding simultaneous rbd map on different hosts).

I know the exclusive-lock rbd image feature was added later to avoid manual rbd
lock CLIs, but I want to understand the earlier blacklist-based solution.

The official workflow I’ve got is listed below (without exclusive rbd feature) :

 - identify old rbd lock holder (rbd lock list )
 - blacklist old owner (ceph osd blacklist add )
 - break old rbd lock (rbd lock remove   )
 - lock rbd image on new host (rbd lock add  )
 - map rbd image on new host
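
For concreteness, a sketch of that sequence with made-up names (pool "rbd",
image "vol1", old client address 10.0.0.12; the lock id and locker come from
the "rbd lock list" output):

rbd lock list rbd/vol1                            # note locker (e.g. client.4567) and lock id
ceph osd blacklist add 10.0.0.12:0/0 3600         # blacklist the old owner's whole address
rbd lock remove rbd/vol1 <lock-id> client.4567    # break the stale lock
rbd lock add rbd/vol1 <lock-id>                   # take the lock from the new host
rbd map rbd/vol1                                  # map the image on the new host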

The blacklisted entry is identified by entity_addr_t (ip, port, nonce).

However, as far as I know, the ceph kernel client will reconnect its socket if
the connection fails. So I suspect it won't work in this scenario:

1. old client network down for a while
2. perform below steps on new host to achieve failover
 - identify old rbd lock holder (rbd lock list )
 - blacklist old owner (ceph osd blacklist add )
 - break old rbd lock (rbd lock remove   )
 - lock rbd image on new host (rbd lock add  )
 - map rbd image on new host
3. the old client's network comes back and it reconnects to the OSDs with a
newly created socket client, i.e. a new (ip, port, nonce) tuple

As a result both the new and the old client can write to the same rbd image,
which might potentially cause data corruption.

So does this mean that if the kernel client does not support the exclusive-lock
image feature, fencing is not possible?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: Does cephfs guarantee client cache consistency for file data?

2017-04-19 Thread 许雪寒
Thanks, everyone:-)

I'm still not very clear. Do these cache "capabilities" only apply to metadata 
operations or both metadata and data?

-----Original Message-----
From: David Disseldorp [mailto:dd...@suse.de]
Sent: 19 April 2017 16:46
To: 许雪寒
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Does cephfs guarantee client cache consistency for file
data?

Hi,

On Wed, 19 Apr 2017 08:19:50 +, 许雪寒 wrote:

> I’m new to cephfs. I wonder whether cephfs guarantee client cache consistency 
> for file content. For example, if client A read some data of file X, then 
> client B modified the X’s content in the range that A read, will A be 
> notified of the modification?

Yes, clients are granted fine-grained cache "capabilities" by the MDS.
These capabilities can be revoked to trigger a flush of cached content, prior 
to satisfying a request from a separate client.

Jeff Layton did a nice write up on this:
https://jtlayton.wordpress.com/2016/09/01/cephfs-and-the-nfsv4-change-attribute/

Cheers, David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: Why is there no data backup mechanism in the rados layer?

2017-04-19 Thread 许雪寒
Thanks:-)

If there were a mechanism that provided replication at the granularity of
individual objects, that is, rados replicating objects on behalf of higher-layer
features while leaving other objects unreplicated whenever those features decide
that only some objects need to be copied to another cluster, wouldn't that be a
better approach?

From: Christian Balzer [mailto:ch...@gol.com]
Sent: 3 January 2017 19:47
To: ceph-users@lists.ceph.com
Cc: 许雪寒
Subject: Re: [ceph-users] Why is there no data backup mechanism in the rados layer?


Hello,

On Tue, 3 Jan 2017 11:16:27 + 许雪寒 wrote:

> Hi, everyone.
> 
> I’m researching the online backup mechanism of ceph, like rbd mirroring and 
> multi-site. And I’m a little confused. Why is there no data backup mechanism 
> in the rados layer? Wouldn’t this save the bother to implement a backup 
> system for every higher level feature of ceph, like rbd and rgw?
> 
> Thank you:)
>
Firstly, probably the same reasons RGW and RBD mirroring are set up the
way they are, that is asynchronously, to avoid huge latency penalties.
That makes it tricky to do at the RADOS level in a generic AND selectable
fashion. I.e. you may NOT want to mirror your whole cluster over that puny
1Gb/s link.

Secondly, what you're talking about is HA, redundancy and replication, NOT
backup.
A backup system should if anyhow possible not involve Ceph at all, to
avoid having a single bug bringing down your li


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Re: Does cephfs guarantee client cache consistency for file data?

2017-04-19 Thread John Spray
On Wed, Apr 19, 2017 at 11:03 AM, 许雪寒  wrote:
> Thanks, everyone:-)
>
> I'm still not very clear. Do these cache "capabilities" only apply to 
> metadata operations or both metadata and data?

Both metadata and data are consistent between clients.  If a client
has the capability to buffer data for a file, it will be told to flush
it before another client reads the file.  Similarly, if a client has
the capability to cache (for read) data for a file, it will be told to
drop that before another client writes the file.

Usually, if the MDS sees two clients using the same file concurrently,
it will instruct both of them to switch to unbuffered IO.
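
A concrete (hypothetical) illustration with two kernel-client mounts of the
same file:

# client A
cat /mnt/cephfs-a/shared.txt                 # A may now hold read-cache caps for the file
# client B
echo "new data" > /mnt/cephfs-b/shared.txt   # MDS revokes A's caps before this write is applied
# client A
cat /mnt/cephfs-a/shared.txt                 # guaranteed to return "new data", not stale cache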

John

>
> -----Original Message-----
> From: David Disseldorp [mailto:dd...@suse.de]
> Sent: 19 April 2017 16:46
> To: 许雪寒
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Does cephfs guarantee client cache consistency for file
> data?
>
> Hi,
>
> On Wed, 19 Apr 2017 08:19:50 +, 许雪寒 wrote:
>
>> I’m new to cephfs. I wonder whether cephfs guarantee client cache 
>> consistency for file content. For example, if client A read some data of 
>> file X, then client B modified the X’s content in the range that A read, 
>> will A be notified of the modification?
>
> Yes, clients are granted fine-grained cache "capabilities" by the MDS.
> These capabilities can be revoked to trigger a flush of cached content, prior 
> to satisfying a request from a separate client.
>
> Jeff Layton did a nice write up on this:
> https://jtlayton.wordpress.com/2016/09/01/cephfs-and-the-nfsv4-change-attribute/
>
> Cheers, David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Primary Affinity

2017-04-19 Thread Richard Hesketh
On 18/04/17 22:28, Anthony D'Atri wrote:
> I get digests, so please forgive me if this has been covered already.
> 
>> Assuming production level, we would keep a pretty close 1:2 SSD:HDD ratio,
> 
> 1:4-5 is common but depends on your needs and the devices in question, ie. 
> assuming LFF drives and that you aren’t using crummy journals.
> 
>> First of all, is this even a valid architecture decision? 
> 
> Inktank described it to me back in 2014/2015 so I don’t think it’s ultra 
> outré.  It does sound like a lot of work to maintain, especially when 
> components get replaced or added.
> 
>> it should boost performance levels considerably compared to spinning disks,
> 
> Performance in which sense?  I would expect it to boost read performance but 
> not so much writes.
> 
> I haven’t used cache tiering so can’t comment on the relative merits.  Your 
> local workload may be a factor.
> 
> — aad

As it happens I've got a ceph cluster with a 1:2 SSD to HDD ratio and I did 
some fio testing a while ago with an SSD-primary pool to see how it performed, 
investigating as an alternative to a cache layer. Generally the results were as 
aad predicts - read performance for the pool was considerably better, almost as 
good as a pure SSD pool. Write performance was better but not so significantly 
improved, only going up to maybe 50% faster depending on the exact workload.

In the end I went with splitting the HDDs and SSDs into separate pools, and 
just using the SSD pool for VMs/datablocks which needed to be snappier. For 
most of my users it didn't matter that the backing pool was kind of slow, and 
only a few were wanting to do I/O intensive workloads where the speed was 
required, so putting so much of the data on the SSDs would have been something 
of a waste.

-- 
Richard Hesketh



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph extension - how to equilibrate ?

2017-04-19 Thread pascal.pu...@pci-conseil.net

[...]
I hope those aren't SMR disks... make sure they're not or it will be 
very slow, to the point where *osds will time out and die*.
Hopefully : DELL 8TB 7.2K RPM NLSAS 12Gbps 512e 3.5in Hot-plug Hard 
Drive,sync

:)
This is not for performance, just for cold data.


ceph osd crush move osd.X host=nodeY

If your journals aren't being moved too, then flush the journals after 
the osds are stopped:


sync
ceph-osd --id $n --setuser ceph --setgroup ceph --flush-journal

(if that crashes, start the osd, then stop again, and retry)

and before starting them, make new journals.

ceph-osd --id $n --setuser ceph --setgroup ceph --mkjournal


Perfect! There was a way. Thanks.

Regards,









___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding a new rack to crush map without pain?

2017-04-19 Thread Matthew Vernon
Hi,

> How many OSD's are we talking about? We're about 500 now, and even
> adding another 2000-3000 is a 5 minute cut/paste job of editing the
> CRUSH map. If you really are adding racks and racks of OSD's every week,
> you should have found the crush location hook a long time ago. 

We have 540 at the moment, and have another 540 due in May, and then
about 1500 due at some point in the summer.

Regards,

Matthew


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph extension - how to equilibrate ?

2017-04-19 Thread Maxime Guyot
Hi Pascal,

I ran into the same situation some time ago, a small cluster where I added a
node with HDDs double the size of the existing ones, and wrote about it here:
http://ceph.com/planet/the-schrodinger-ceph-cluster/

When adding OSDs to a cluster rebalancing/data movement is unavoidable in most 
cases. Since you will be going from a 144TB cluster to a 240 TB cluster, you 
can estimate that +66% of your data will be rebalanced/moved.

Peter already covered how to move the HDDs from one server to another (incl. 
journal). I just want to point out that you can do the “ceph osd crush set" 
before you do the physical move of the drives. This lets you rebalance on your 
own terms (schedule, rollback etc…).

The easy way:

-  Create the new OSDs (8TB) with weight 0

-  Move each OSD to its desired location and weight: “ceph osd crush
set osd.X  root= host=”

-  Monitor and wait for the rebalance to finish (a few days or weeks
depending on performance)

-  Set noout && physically move the drives && unset noout

In production, you want to consider the op priority and the granularity of the 
increase (increasing weights progressively etc…).
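
For concreteness, a sketch of those steps with a made-up OSD id, weight and
host name; adjust to your own CRUSH map:

ceph osd crush set osd.36 7.3 root=default host=node4   # target weight and location for a new 8TB OSD
watch ceph -s                                           # wait for the rebalance to finish
ceph osd set noout                                      # then physically move the drives
# ... move the drives and bring the OSDs back up on their new hosts ...
ceph osd unset noout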

Cheers,
Maxime

From: ceph-users  on behalf of Peter Maloney 

Date: Tuesday 18 April 2017 20:26
To: "pascal.pu...@pci-conseil.net" , 
"ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] Ceph extension - how to equilibrate ?

On 04/18/17 16:31, 
pascal.pu...@pci-conseil.net wrote:

Hello,

Just looking for advice: I will soon extend my Jewel ceph cluster with a fourth
node.

Currently, we have 3 nodes of 12 OSDs each, with 4TB HDDs (36 x 4TB HDDs).

I will add a new node with 12 x 8TB HDDs (adding 12 new OSDs => 48 OSDs).
I hope those aren't SMR disks... make sure they're not or it will be very slow, 
to the point where osds will time out and die.


So, how do I rebalance this simply?

Can I just unplug 3 x 4TB HDDs from each node and add them to the fourth node,
and plug 3 x 8TB drives from the fourth node into each existing node?
I think you only have to stop them (hopefully not so many at once that objects
go missing, and optionally set noout first), unmount them, move the disks,
mount them and start them on the new node. Then update their CRUSH location:

ceph osd crush move osd.X host=nodeY

If your journals aren't being moved too, then flush the journals after the osds 
are stopped:

sync
ceph-osd --id $n --setuser ceph --setgroup ceph --flush-journal

(if that crashes, start the osd, then stop again, and retry)

and before starting them, make new journals.

ceph-osd --id $n --setuser ceph --setgroup ceph --mkjournal



At the end I want 3 x 8TB HDDs and 9 x 4TB HDDs per node.

What is the easiest way to do that?

I don't want to move all the data: it will take a long time per OSD...
I don't know how much data this will move if any... but if it moves data, you 
probably don't have a choice.


Is there a way to just swap OSDs between nodes?

Thanks for your help.
Pascal,







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding a new rack to crush map without pain?

2017-04-19 Thread Maxime Guyot
Hi Matthew,

I would expect the osd_crush_location parameter to take effect at OSD
activation. Maybe ceph-ansible would have info there?
A workaround might be to "set noin", restart all the OSDs once ceph.conf
includes the crush location, and enjoy the automatic CRUSH map update (if you
have osd crush update on start = true).
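
If you do go the hook route, a minimal sketch of what such a script could look
like (the hostname-to-rack mapping here is invented and site-specific); point
"osd crush location hook" in ceph.conf at it:

#!/bin/sh
# Called by the daemon as: <hook> --cluster <name> --id <id> --type osd
# Must print the CRUSH location to stdout.
host=$(hostname -s)
case "$host" in
    sto-4-*) rack=rack4 ;;
    *)       rack=rack1 ;;
esac
echo "root=default rack=$rack host=$host"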

Cheers,
Maxime

On 12/04/17 18:46, "ceph-users on behalf of Matthew Vernon" 
 wrote:

Hi,

Our current (jewel) CRUSH map has rack / host / osd (and the default
replication rule does step chooseleaf firstn 0 type rack). We're shortly
going to be adding some new hosts in new racks, and I'm wondering what
the least-painful way of getting the new osds associated with the
correct (new) rack will be.

We deploy with ceph-ansible, which can add bits of the form
[osd.104]
osd crush location = root=default rack=1 host=sto-1-1

to ceph.conf, but I think this doesn't help for new osds, since
ceph-disk will activate them before ceph.conf is fully assembled (and
trying to arrange it otherwise would be serious hassle).

Would making a custom crush location hook be the way to go? then it'd
say rack=4 host=sto-4-x and new osds would end up allocated to rack 4?
And would I need to have done ceph osd crush add-bucket rack4 rack
first, presumably?

I am planning on adding osds to the cluster one box at a time, rather
than going with the add-everything-at-crush-weight-0 route; if nothing
else it seems easier to automate. And I'd rather avoid having to edit
the crush map directly...

Any pointers welcomed :)

Regards,

Matthew


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Sharing SSD journals and SSD drive choice

2017-04-19 Thread Adam Carheden
Does anyone know if XFS uses a single thread to write to its journal?

I'm evaluating SSDs to buy as journal devices. I plan to have multiple
OSDs share a single SSD for journal. I'm benchmarking several brands as
described here:

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

It appears that sequential write speed using multiple threads varies
widely between brands. Here's what I have so far:
 SanDisk SDSSDA240G, dd:6.8 MB/s
 SanDisk SDSSDA240G, fio  1 jobs:   6.7 MB/s
 SanDisk SDSSDA240G, fio  2 jobs:   7.4 MB/s
 SanDisk SDSSDA240G, fio  4 jobs:   7.5 MB/s
 SanDisk SDSSDA240G, fio  8 jobs:   7.5 MB/s
 SanDisk SDSSDA240G, fio 16 jobs:   7.5 MB/s
 SanDisk SDSSDA240G, fio 32 jobs:   7.6 MB/s
 SanDisk SDSSDA240G, fio 64 jobs:   7.6 MB/s
HFS250G32TND-N1A2A 3P10, dd:1.8 MB/s
HFS250G32TND-N1A2A 3P10, fio  1 jobs:   4.8 MB/s
HFS250G32TND-N1A2A 3P10, fio  2 jobs:   5.2 MB/s
HFS250G32TND-N1A2A 3P10, fio  4 jobs:   9.5 MB/s
HFS250G32TND-N1A2A 3P10, fio  8 jobs:  23.4 MB/s
HFS250G32TND-N1A2A 3P10, fio 16 jobs:   7.2 MB/s
HFS250G32TND-N1A2A 3P10, fio 32 jobs:  49.8 MB/s
HFS250G32TND-N1A2A 3P10, fio 64 jobs:  70.5 MB/s
INTEL SSDSC2BB150G7, dd:   90.1 MB/s
INTEL SSDSC2BB150G7, fio  1 jobs:  91.0 MB/s
INTEL SSDSC2BB150G7, fio  2 jobs: 108.3 MB/s
INTEL SSDSC2BB150G7, fio  4 jobs: 134.2 MB/s
INTEL SSDSC2BB150G7, fio  8 jobs: 118.2 MB/s
INTEL SSDSC2BB150G7, fio 16 jobs:  39.9 MB/s
INTEL SSDSC2BB150G7, fio 32 jobs:  25.4 MB/s
INTEL SSDSC2BB150G7, fio 64 jobs:  15.8 MB/s

The SanDisk is slow, but speed is the same at any number of threads. The
Intel peaks at 4-6 threads and then declines rapidly into sub-par
performance (at least for a pricey "enterprise" drive). The SK Hynix is
slow at low numbers of threads but gets huge performance gains with more
threads. (This is all with one trial, but I have a script running
multiple trials across all drives today.)
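
For reference, the numbers above come from invocations roughly like these (a
sketch of the methodology in that blog post; /dev/sdX is a placeholder and the
test destroys data on the device):

dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=8 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test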

So if XFS has a single thread that does journaling, it looks like my
best option would be 1 intel SSD shared by 4-6 OSDs. If XFS already
throws multiple threads at the journal, then having OSDs share an Intel
drive will likely kill my SSD performance, but having as many OSDs as I
can cram in a chassis share the SK Hynix drive would get me great
performance for a fraction of the cost.

Anyone have any related advice or experience to share regarding journal
SSD selection?
-- 
Adam Carheden

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Primary Affinity

2017-04-19 Thread Maxime Guyot
Hi,

>> Assuming production level, we would keep a pretty close 1:2 SSD:HDD ratio,
>1:4-5 is common but depends on your needs and the devices in question, ie. 
>assuming LFF drives and that you aren’t using crummy journals.

You might be speaking about different ratios here. I think that Anthony is
speaking about the journal:OSD ratio and Reed about the capacity ratio between
the HDD and SSD tiers/roots.

I have been experimenting with hybrid setups (1 copy on SSD + 2 copies on HDD), 
like Richard says you’ll get much better random read performance with primary 
OSD on SSD but write performance won’t be amazing since you still have 2 HDD 
copies to write before ACK. 

I know the doc suggests using primary affinity, but since it's an OSD-level
setting it does not play well with other storage tiers, so I searched for other
options. From what I have tested, a rule that selects the first/primary OSD
from the ssd-root and then the rest of the copies from the hdd-root works,
though I am not sure it is *guaranteed* that the first OSD selected will be
primary.

“rule hybrid {
  ruleset 2
  type replicated
  min_size 1
  max_size 10
  step take ssd-root
  step chooseleaf firstn 1 type host
  step emit
  step take hdd-root
  step chooseleaf firstn -1 type host
  step emit
}”
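
If you want to try this, a hedged sketch of wiring the rule in and pointing a
pool at it (pool name is a placeholder; the pool option was still called
crush_ruleset in pre-Luminous releases):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt   # paste the hybrid rule in here
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
ceph osd pool set mypool crush_ruleset 2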

Cheers,
Maxime



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Primary Affinity

2017-04-19 Thread Reed Dier
Hi Maxime,

This is a very interesting concept. Instead of the primary affinity being used 
to choose SSD for primary copy, you set crush rule to first choose an osd in 
the ‘ssd-root’, then the ‘hdd-root’ for the second set.

And with 'step chooseleaf firstn {num}’
> If {num} > 0 && < pool-num-replicas, choose that many buckets. 
So 1 chooses that bucket
> If {num} < 0, it means pool-num-replicas - {num}
And -1 means it will fill remaining replicas on this bucket.

This is a very interesting concept, one I had not considered.
Really appreciate this feedback.

Thanks,

Reed

> On Apr 19, 2017, at 12:15 PM, Maxime Guyot  wrote:
> 
> Hi,
> 
>>> Assuming production level, we would keep a pretty close 1:2 SSD:HDD ratio,
>> 1:4-5 is common but depends on your needs and the devices in question, ie. 
>> assuming LFF drives and that you aren’t using crummy journals.
> 
> You might be speaking about different ratios here. I think that Anthony is 
> speaking about journal/OSD and Reed speaking about capacity ratio between and 
> HDD and SSD tier/root. 
> 
> I have been experimenting with hybrid setups (1 copy on SSD + 2 copies on 
> HDD), like Richard says you’ll get much better random read performance with 
> primary OSD on SSD but write performance won’t be amazing since you still 
> have 2 HDD copies to write before ACK. 
> 
> I know the doc suggests using primary affinity but since it’s a OSD level 
> setting it does not play well with other storage tiers so I searched for 
> other options. From what I have tested, a rule that selects the first/primary 
> OSD from the ssd-root then the rest of the copies from the hdd-root works. 
> Though I am not sure it is *guaranteed* that the first OSD selected will be 
> primary.
> 
> “rule hybrid {
>  ruleset 2
>  type replicated
>  min_size 1
>  max_size 10
>  step take ssd-root
>  step chooseleaf firstn 1 type host
>  step emit
>  step take hdd-root
>  step chooseleaf firstn -1 type host
>  step emit
> }”
> 
> Cheers,
> Maxime
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] bluestore object overhead

2017-04-19 Thread Pavel Shub
Hey All,

I'm running a test of bluestore in a small VM and seeing 2x overhead
for each object in cephfs. Here's the output of df detail
https://gist.github.com/pavel-citymaps/868a7c4b1c43cea9ab86cdf2e79198ee

This is on a VM with all daemons & 20gb disk, all pools are of size 1.
Is this the expected amount of overhead per object? Is there anyway to
tweak bluestore settings?

Thanks,
Pavel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore object overhead

2017-04-19 Thread Gregory Farnum
On Wed, Apr 19, 2017 at 1:26 PM, Pavel Shub  wrote:
> Hey All,
>
> I'm running a test of bluestore in a small VM and seeing 2x overhead
> for each object in cephfs. Here's the output of df detail
> https://gist.github.com/pavel-citymaps/868a7c4b1c43cea9ab86cdf2e79198ee
>
> This is on a VM with all daemons & 20gb disk, all pools are of size 1.
> Is this the expected amount of overhead per object? Is there anyway to
> tweak bluestore settings?

You're going to need to be clearer about what you mean by 2x overhead.
Bluestore itself has a minimum size beneath which it will journal
objects and then copy them into place, which might be considered 2x
overhead. If you're talking about total number of cluster-wide disk
ops, there's also a CephFS log which journals metadata updates that
get flushed out to backing objects later, which might be considered 2x
overhead. But I don't know what you mean just based on a ceph df. :)
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librbd::ImageCtx: error reading immutable metadata: (2) No such file or directory

2017-04-19 Thread Gregory Farnum
On Tue, Apr 18, 2017 at 4:27 AM, Frode Nordahl  wrote:
> Hello all,
>
> A while ago I came across a Ceph cluster with a RBD volume missing the
> header object describing the characteristics of the volume, making it
> impossible to attach or perform any operations on said volume.
>
> As a courtesy to anyone else encountering the same situation I would like to
> share how I solved this:
> http://fnordahl.com/2017/04/17/ceph-rbd-volume-header-recovery/

This is a really clear writeup. Thanks for sharing!
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-19 Thread Gregory Farnum
On Tue, Apr 18, 2017 at 11:34 AM, Peter Maloney
 wrote:
> On 04/18/17 11:44, Jogi Hofmüller wrote:
>
> Hi,
>
> Am Dienstag, den 18.04.2017, 13:02 +0200 schrieb mj:
>
> On 04/18/2017 11:24 AM, Jogi Hofmüller wrote:
>
> This might have been true for hammer and older versions of ceph.
> From
> what I can tell now, every snapshot taken reduces performance of
> the
> entire cluster :(
>
> Really? Can others confirm this? Is this a 'wellknown fact'?
> (unknown only to us, perhaps...)
>
> I have to add some more/new details now. We started removing snapshots
> for VMs today. We did this VM for VM and waited some time in between
> while monitoring the cluster.
>
> After having removed all snapshots for the third VM the cluster went
> back to a 'normal' state again: no more slow requests. i/o waits for
> VMs are down to acceptable numbers again (<10% peeks, <5% average).
>
> So, either there is one VM/image that irritates the entire cluster or
> we reached some kind of threshold or it's something completely
> different.
>
> As for the well known fact: Peter Maloney pointed that out in this
> thread (mail from last Thursday).
>
> The well known fact part was CoW which I guess is for all versions.
>
> The 'slower with every snapshot even after CoW totally flattens it' issue I
> just find easy to test, and I didn't test it on hammer or earlier, and
> others confirmed it, but didn't keep track of the versions. Just make an rbd
> image, map it (probably... but my tests were with qemu librbd), do fio
> randwrite tests with sync and direct on the device (no need for a fs, or
> anything), and then make a few snaps and watch it go way slower.

I'm not sure this is a correct diagnosis or assessment.

In general, snapshots incur costs in two places:
1) the first write to an object after it is logically snapshotted,
2) when removing snapshots.

There should be no long-term performance degradation, especially in
XFS — it creates new copies of objects for each snapshot they change
in. (btrfs and bluestore use block-based CoW, so they can suffer from
fragmentation if things go too badly.)
However, the costs of snapshot trimming (especially in Jewel) have
been much discussed recently. (I'll have some announcements about
improvements there soon!) So if you've got live trims happening, yes,
there's an incremental load on the cluster.
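
As an aside, the knob usually reached for to pace trimming is
osd_snap_trim_sleep; an illustrative example only, since its behaviour has
varied between releases:

ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1'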

Similarly, creating a snapshot requires copying each snapshotted
object into a new location, and then applying the write. Generally,
that should amortize into nothingness, but it sounds like in this case
you were basically doing a single IO per object for every snapshot you
created — which, yes, would be impressively slow overall.

The reports I've seen of slow snapshots have been one of the two above
issues. Sometimes it's compounded by people not having enough
incremental IOPS available to support their client workload while
doing snapshots, but that doesn't mean snapshots are inherently
expensive or inefficient[1], just that they do have a non-zero cost
which your cluster needs to be able to provide.
-Greg

[1]: Although, yes, snap trimming is more expensive than in many
similar systems. There are reasons for that which I discussed at Vault
and will present on again at the upcoming OpenStack Boston Ceph day.
:)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore object overhead

2017-04-19 Thread Pavel Shub
On Wed, Apr 19, 2017 at 4:33 PM, Gregory Farnum  wrote:
> On Wed, Apr 19, 2017 at 1:26 PM, Pavel Shub  wrote:
>> Hey All,
>>
>> I'm running a test of bluestore in a small VM and seeing 2x overhead
>> for each object in cephfs. Here's the output of df detail
>> https://gist.github.com/pavel-citymaps/868a7c4b1c43cea9ab86cdf2e79198ee
>>
>> This is on a VM with all daemons & 20gb disk, all pools are of size 1.
>> Is this the expected amount of overhead per object? Is there anyway to
>> tweak bluestore settings?
>
> You're going to need to be clearer about what you mean by 2x overhead.
> Bluestore itself has a minimum size beneath which it will journal
> objects and then copy them into place, which might be considered 2x
> overhead. If you're talking about total number of cluster-wide disk
> ops, there's also a CephFS log which journals metadata updates that
> get flushed out to backing objects later, which might be considered 2x
> overhead. But I don't know what you mean just based on a ceph df. :)
> -Greg

Sorry, I meant the disk space taken up by the files. I have a dataset
with lots of small files; my sample set is 2.5 GB in total size and takes 5 GB
on a filesystem with a 4 KB block size. When I put the files inside a ceph
bluestore OSD they take up 6 GB. Does bluestore have an internal block
size? Is there a way to adjust it? For comparison, I created a
filestore OSD with a 2 KB block size and the data took up only 4.5 GB.

- Pavel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore object overhead

2017-04-19 Thread Gregory Farnum
On Wed, Apr 19, 2017 at 1:49 PM, Pavel Shub  wrote:
> On Wed, Apr 19, 2017 at 4:33 PM, Gregory Farnum  wrote:
>> On Wed, Apr 19, 2017 at 1:26 PM, Pavel Shub  wrote:
>>> Hey All,
>>>
>>> I'm running a test of bluestore in a small VM and seeing 2x overhead
>>> for each object in cephfs. Here's the output of df detail
>>> https://gist.github.com/pavel-citymaps/868a7c4b1c43cea9ab86cdf2e79198ee
>>>
>>> This is on a VM with all daemons & 20gb disk, all pools are of size 1.
>>> Is this the expected amount of overhead per object? Is there anyway to
>>> tweak bluestore settings?
>>
>> You're going to need to be clearer about what you mean by 2x overhead.
>> Bluestore itself has a minimum size beneath which it will journal
>> objects and then copy them into place, which might be considered 2x
>> overhead. If you're talking about total number of cluster-wide disk
>> ops, there's also a CephFS log which journals metadata updates that
>> get flushed out to backing objects later, which might be considered 2x
>> overhead. But I don't know what you mean just based on a ceph df. :)
>> -Greg
>
> Sorry, I meant the disk space taken up by the files. I have a dataset
> with lots of small files, my sample set 2.5gb in total size and 5gb on
> a filesystem with a 4kb block size. When put the files inside ceph
> bluestore they take up 6gb. Does bluestore have an internal block
> size? Is there a way to adjust it? For comparison I created a
> filestore OSD with 2kb block size and the data took up only 4.5gb.

I can't speak with authority on bluestore, but at those total sizes I
think you're just seeing the effects of the internal journaling.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd kernel client fencing

2017-04-19 Thread Kjetil Jørgensen
Hi,

As long as you blacklist the old owner by ip, you should be fine. Do
note that rbd lock remove implicitly also blacklists unless you also
pass rbd lock remove the --rbd_blacklist_on_break_lock=false option.
(that is I think "ceph osd blacklist add a.b.c.d interval" translates
into blacklisting a.b.c.d:0/0 - which should block every client with
source ip a.b.c.d).

Regardless, I believe the client taking out the lock (rbd cli) and the
kernel client mapping the rbd will be different (port, nonce), so
specifically if it is possible to blacklist a specific client by (ip,
port, nonce) it wouldn't do you much good where you have different
clients dealing with the locking and doing the actual IO/mapping (rbd
cli and kernel).

We do a variation of what you are suggesting, although additionally we
check for watches, if watched we give up and complain rather than
blacklist. If previous lock were held by my ip we just silently
reclaim. The hosts themselves run a process watching for blacklist
entries, and if they see themselves blacklisted they commit
suicide and reboot. On boot, the machine removes its blacklist entry, reclaims any
locks it used to hold before starting the things that might map rbd
images. There's some warts in there, but for the most part it works
well.

If you are going the fencing route, I would strongly advise you to also
ensure your process can't end up causing cascading
blacklists; in addition to being highly disruptive, they cause osd(?)
map churn. (We accidentally did this - and ended up almost running our
monitors out of disk).

Cheers,
KJ

On Wed, Apr 19, 2017 at 2:35 AM, Chaofan Yu  wrote:
> Hi list,
>
>   I wonder someone can help with rbd kernel client fencing (aimed to avoid
> simultaneously rbd map on different hosts).
>
> I know the exclusive rbd image feature is added later to avoid manual rbd
> lock CLIs. But want to know previous blacklist solution.
>
> The official workflow I’ve got is listed below (without exclusive rbd
> feature) :
>
>  - identify old rbd lock holder (rbd lock list )
>  - blacklist old owner (ceph osd blacklist add )
>  - break old rbd lock (rbd lock remove   )
>  - lock rbd image on new host (rbd lock add  )
>  - map rbd image on new host
>
>
> The blacklisted entry identified by entity_addr_t (ip, port, nonce).
>
> However as far as I know, ceph kernel client will do socket reconnection if
> connection failed. So I wonder in this scenario it won’t work:
>
> 1. old client network down for a while
> 2. perform below steps on new host to achieve failover
> - identify old rbd lock holder (rbd lock list )
>
>  - blacklist old owner (ceph osd blacklist add )
>  - break old rbd lock (rbd lock remove   )
>  - lock rbd image on new host (rbd lock add  )
>  - map rbd image on new host
>
> 3. old client network come back and reconnect to osds with new created
> socket client, i.e. new (ip, port,nonce) turple
>
> as a result both new and old client can write to same rbd image, which might
> potentially cause the data corruption.
>
> So does this mean if kernel client does not support exclusive-lock image
> feature, fencing is not possible ?
>
>



-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Primary Affinity

2017-04-19 Thread Anthony D'Atri
Re ratio, I think you’re right.

Write performance depends for sure on what the journal devices are.  If the 
journals are colo’d on spinners, then for sure the affinity game isn’t going to 
help writes massively.

My understanding of write latency is that min_size journals have to be written 
before the op returns, so if journals aren’t on SSD’s that’s going to be a big 
bottleneck.




> Hi,
> 
>>> Assuming production level, we would keep a pretty close 1:2 SSD:HDD ratio,
>> 1:4-5 is common but depends on your needs and the devices in question, ie. 
>> assuming LFF drives and that you aren’t using crummy journals.
> 
> You might be speaking about different ratios here. I think that Anthony is 
> speaking about journal/OSD and Reed speaking about capacity ratio between and 
> HDD and SSD tier/root. 
> 
> I have been experimenting with hybrid setups (1 copy on SSD + 2 copies on 
> HDD), like Richard says you’ll get much better random read performance with 
> primary OSD on SSD but write performance won’t be amazing since you still 
> have 2 HDD copies to write before ACK. 
> 
> I know the doc suggests using primary affinity but since it’s a OSD level 
> setting it does not play well with other storage tiers so I searched for 
> other options. From what I have tested, a rule that selects the first/primary 
> OSD from the ssd-root then the rest of the copies from the hdd-root works. 
> Though I am not sure it is *guaranteed* that the first OSD selected will be 
> primary.
> 
> “rule hybrid {
>  ruleset 2
>  type replicated
>  min_size 1
>  max_size 10
>  step take ssd-root
>  step chooseleaf firstn 1 type host
>  step emit
>  step take hdd-root
>  step chooseleaf firstn -1 type host
>  step emit
> }”
> 
> Cheers,
> Maxime
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Creating journal on needed partition

2017-04-19 Thread Ben Hines
This is my experience.  For creating new OSDs, i just created Rundeck jobs
that run ceph-deploy. It's relatively rare that new OSDs are created, so it
is fine.

Originally I was automating them with configuration management tools but it
tended to encounter edge cases and problems that ceph-deploy already
handles nicely.

-Ben

On Tue, Apr 18, 2017 at 6:22 AM, Vincent Godin 
wrote:

> Hi,
>
> If you're using ceph-deploy, just run the command :
>
> ceph-deploy osd prepare --overwrite-conf {your_host}:/dev/sdaa:/dev/sdaf2
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Primary Affinity

2017-04-19 Thread Reed Dier
In this case the spinners have their journals on an NVMe drive, 3 OSD : 1 NVMe 
Journal.

Will be trying tomorrow to get some benchmarks and compare some hdd/ssd/hybrid 
workloads to see performance differences across the three backing layers.

Most client traffic is read oriented to begin with, so keeping reads quick is 
likely the biggest goal here.

Appreciate everyone’s input and advice.

Reed

> On Apr 19, 2017, at 5:59 PM, Anthony D'Atri  wrote:
> 
> Re ratio, I think you’re right.
> 
> Write performance depends for sure on what the journal devices are.  If the 
> journals are colo’d on spinners, then for sure the affinity game isn’t going 
> to help writes massively.
> 
> My understanding of write latency is that min_size journals have to be 
> written before the op returns, so if journals aren’t on SSD’s that’s going to 
> be a big bottleneck.
> 
> 
> 
> 
>> Hi,
>> 
>>>> Assuming production level, we would keep a pretty close 1:2 SSD:HDD ratio,
>>> 1:4-5 is common but depends on your needs and the devices in question, ie. 
>>> assuming LFF drives and that you aren’t using crummy journals.
>> 
>> You might be speaking about different ratios here. I think that Anthony is 
>> speaking about journal/OSD and Reed speaking about capacity ratio between 
>> and HDD and SSD tier/root. 
>> 
>> I have been experimenting with hybrid setups (1 copy on SSD + 2 copies on 
>> HDD), like Richard says you’ll get much better random read performance with 
>> primary OSD on SSD but write performance won’t be amazing since you still 
>> have 2 HDD copies to write before ACK. 
>> 
>> I know the doc suggests using primary affinity but since it’s a OSD level 
>> setting it does not play well with other storage tiers so I searched for 
>> other options. From what I have tested, a rule that selects the 
>> first/primary OSD from the ssd-root then the rest of the copies from the 
>> hdd-root works. Though I am not sure it is *guaranteed* that the first OSD 
>> selected will be primary.
>> 
>> “rule hybrid {
>> ruleset 2
>> type replicated
>> min_size 1
>> max_size 10
>> step take ssd-root
>> step chooseleaf firstn 1 type host
>> step emit
>> step take hdd-root
>> step chooseleaf firstn -1 type host
>> step emit
>> }”
>> 
>> Cheers,
>> Maxime
>> 
>> 
>> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)

2017-04-19 Thread Aaron Ten Clay
I'm new to doing this all via systemd and systemd-coredump, but I appear to
have gotten cores from two OSD processes. When xzipped they are < 2 MiB
each, but I threw them on my webserver to avoid polluting the mailing list.
This seems oddly small, so if I've botched the process somehow let me know
:)

https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.6742.1492634493.xz
https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.7202.1492634508.xz

And for reference:
root@osd001:/var/lib/systemd/coredump# ceph -v
ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)


I am also investigating sysdig as recommended.
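
For anyone else doing this under systemd, a sketch of a drop-in roughly
equivalent to Sage's ulimit suggestion (unit name and limit values are only
examples):

# /etc/systemd/system/ceph-osd@.service.d/limits.conf
[Service]
LimitCORE=infinity
MemoryLimit=40G

# then: systemctl daemon-reload && systemctl restart ceph-osd@150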

Thanks!
-Aaron


On Mon, Apr 17, 2017 at 8:15 AM, Sage Weil  wrote:

> On Sat, 15 Apr 2017, Aaron Ten Clay wrote:
> > Hi all,
> >
> > Our cluster is experiencing a very odd issue and I'm hoping for some
> > guidance on troubleshooting steps and/or suggestions to mitigate the
> issue.
> > tl;dr: Individual ceph-osd processes try to allocate > 90GiB of RAM and
> are
> > eventually nuked by oom_killer.
>
> My guess is that there there is a bug in a decoding path and it's
> trying to allocate some huge amount of memory.  Can you try setting a
> memory ulimit to something like 40gb and then enabling core dumps so you
> can get a core?  Something like
>
> ulimit -c unlimited
> ulimit -m 2000
>
> or whatever the corresponding systemd unit file options are...
>
> Once we have a core file it will hopefully be clear who is
> doing the bad allocation...
>
> sage
>
>
>
> >
> > I'll try to explain the situation in detail:
> >
> > We have 24-4TB bluestore HDD OSDs, and 4-600GB SSD OSDs. The SSD OSDs
> are in
> > a different CRUSH "root", used as a cache tier for the main storage
> pools,
> > which are erasure coded and used for cephfs. The OSDs are spread across
> two
> > identical machines with 128GiB of RAM each, and there are three monitor
> > nodes on different hardware.
> >
> > Several times we've encountered crippling bugs with previous Ceph
> releases
> > when we were on RC or betas, or using non-recommended configurations, so
> in
> > January we abandoned all previous Ceph usage, deployed LTS Ubuntu 16.04,
> and
> > went with stable Kraken 11.2.0 with the configuration mentioned above.
> > Everything was fine until the end of March, when one day we find all but
> a
> > couple of OSDs are "down" inexplicably. Investigation reveals oom_killer
> > came along and nuked almost all the ceph-osd processes.
> >
> > We've gone through a bunch of iterations of restarting the OSDs, trying
> to
> > bring them up one at a time gradually, all at once, various configuration
> > settings to reduce cache size as suggested in this ticket:
> > http://tracker.ceph.com/issues/18924...
> >
> > I don't know if that ticket really pertains to our situation or not, I
> have
> > no experience with memory allocation debugging. I'd be willing to try if
> > someone can point me to a guide or walk me through the process.
> >
> > I've even tried, just to see if the situation was  transitory, adding
> over
> > 300GiB of swap to both OSD machines. The OSD procs managed to allocate,
> in a
> > matter of 5-10 minutes, more than 300GiB of RAM pressure and became
> > oom_killer victims once again.
> >
> > No software or hardware changes took place around the time this problem
> > started, and no significant data changes occurred either. We added about
> > 40GiB of ~1GiB files a week or so before the problem started and that's
> the
> > last time data was written.
> >
> > I can only assume we've found another crippling bug of some kind, this
> level
> > of memory usage is entirely unprecedented. What can we do?
> >
> > Thanks in advance for any suggestions.
> > -Aaron
> >
> >
>



-- 
Aaron Ten Clay
https://aarontc.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] chooseleaf updates

2017-04-19 Thread Donny Davis
In reading the docs, I am curious if I can change the chooseleaf parameter
as my cluster expands. I currently only have one node and used this
parameter in ceph.conf

osd crush chooseleaf type = 0

Can this be changed after I add the new nodes? The other two nodes are currently
on gluster, but are moving to ceph this weekend.
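
For reference, my understanding is that this setting only affects the default
CRUSH rule generated at cluster creation, so changing it later means editing
that rule in the decompiled CRUSH map, roughly (rule text is illustrative):

# current rule, failure domain = osd (from "osd crush chooseleaf type = 0"):
        step chooseleaf firstn 0 type osd
# after the other nodes are added, change it to:
        step chooseleaf firstn 0 type host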
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore object overhead

2017-04-19 Thread Jason Dillaman
Does the bluestore min alloc size apply for 4k block-size files [1]?

[1] https://github.com/ceph/ceph/blob/master/src/common/config_opts.h#L1063
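
For the original question about tweaking it, the knobs I mean look like this in
ceph.conf (values are only illustrative, and as far as I know they only take
effect when an OSD is created, not for existing OSDs):

[osd]
bluestore_min_alloc_size_hdd = 4096
bluestore_min_alloc_size_ssd = 4096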

On Wed, Apr 19, 2017 at 4:51 PM, Gregory Farnum  wrote:
> On Wed, Apr 19, 2017 at 1:49 PM, Pavel Shub  wrote:
>> On Wed, Apr 19, 2017 at 4:33 PM, Gregory Farnum  wrote:
>>> On Wed, Apr 19, 2017 at 1:26 PM, Pavel Shub  wrote:
>>>> Hey All,
>>>>
>>>> I'm running a test of bluestore in a small VM and seeing 2x overhead
>>>> for each object in cephfs. Here's the output of df detail
>>>> https://gist.github.com/pavel-citymaps/868a7c4b1c43cea9ab86cdf2e79198ee
>>>>
>>>> This is on a VM with all daemons & 20gb disk, all pools are of size 1.
>>>> Is this the expected amount of overhead per object? Is there anyway to
>>>> tweak bluestore settings?
>>>
>>> You're going to need to be clearer about what you mean by 2x overhead.
>>> Bluestore itself has a minimum size beneath which it will journal
>>> objects and then copy them into place, which might be considered 2x
>>> overhead. If you're talking about total number of cluster-wide disk
>>> ops, there's also a CephFS log which journals metadata updates that
>>> get flushed out to backing objects later, which might be considered 2x
>>> overhead. But I don't know what you mean just based on a ceph df. :)
>>> -Greg
>>
>> Sorry, I meant the disk space taken up by the files. I have a dataset
>> with lots of small files, my sample set 2.5gb in total size and 5gb on
>> a filesystem with a 4kb block size. When put the files inside ceph
>> bluestore they take up 6gb. Does bluestore have an internal block
>> size? Is there a way to adjust it? For comparison I created a
>> filestore OSD with 2kb block size and the data took up only 4.5gb.
>
> I can't speak with authority on bluestore, but at those total sizes I
> think you're just seeing the effects of the internal journaling.



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd kernel client fencing

2017-04-19 Thread Chaofan Yu
Thank you so much.

The blacklist entries are stored in the osd map, which is supposed to stay tiny
and clean.
So we are doing similar cleanups after reboot.

I’m quite interested in how the host commits suicide and reboots.

Can you successfully umount the filesystem and unmap the rbd block device
after it is blacklisted?

I wonder whether the IO will hang and the umount process will get stuck in D
state, so that the host cannot be shut down because it is waiting for the
umount to finish.

==

And now that the CentOS 7.3 kernel supports the exclusive-lock feature,

could anyone describe the new failover workflow?

Thanks.


> On 20 Apr 2017, at 6:31 AM, Kjetil Jørgensen  wrote:
> 
> Hi,
> 
> As long as you blacklist the old owner by ip, you should be fine. Do
> note that rbd lock remove implicitly also blacklists unless you also
> pass rbd lock remove the --rbd_blacklist_on_break_lock=false option.
> (that is I think "ceph osd blacklist add a.b.c.d interval" translates
> into blacklisting a.b.c.d:0/0 - which should block every client with
> source ip a.b.c.d).
> 
> Regardless, I believe the client taking out the lock (rbd cli) and the
> kernel client mapping the rbd will be different (port, nonce), so
> specifically if it is possible to blacklist a specific client by (ip,
> port, nonce) it wouldn't do you much good where you have different
> clients dealing with the locking and doing the actual IO/mapping (rbd
> cli and kernel).
> 
> We do a variation of what you are suggesting, although additionally we
> check for watches, if watched we give up and complain rather than
> blacklist. If previous lock were held by my ip we just silently
> reclaim. The hosts themselves run a process watching for
> blacklistentries, and if they see themselves blacklisted they commit
> suicide and re-boot. On boot, machine removes blacklist, reclaims any
> locks it used to hold before starting the things that might map rbd
> images. There's some warts in there, but for the most part it works
> well.
> 
> If you are going the fencing route - I would strongly advise you also
> ensure your process don't end up with the possibility of cascading
> blacklists, in addition to being highly disruptive, it causes osd(?)
> map churn. (We accidentally did this - and ended up almost running our
> monitors out of disk).
> 
> Cheers,
> KJ
> 
> On Wed, Apr 19, 2017 at 2:35 AM, Chaofan Yu  wrote:
>> Hi list,
>> 
>>  I wonder someone can help with rbd kernel client fencing (aimed to avoid
>> simultaneously rbd map on different hosts).
>> 
>> I know the exclusive rbd image feature is added later to avoid manual rbd
>> lock CLIs. But want to know previous blacklist solution.
>> 
>> The official workflow I’ve got is listed below (without exclusive rbd
>> feature) :
>> 
>> - identify old rbd lock holder (rbd lock list )
>> - blacklist old owner (ceph osd blacklist add )
>> - break old rbd lock (rbd lock remove   )
>> - lock rbd image on new host (rbd lock add  )
>> - map rbd image on new host
>> 
>> 
>> The blacklisted entry identified by entity_addr_t (ip, port, nonce).
>> 
>> However as far as I know, ceph kernel client will do socket reconnection if
>> connection failed. So I wonder in this scenario it won’t work:
>> 
>> 1. old client network down for a while
>> 2. perform below steps on new host to achieve failover
>> - identify old rbd lock holder (rbd lock list )
>> 
>> - blacklist old owner (ceph osd blacklist add )
>> - break old rbd lock (rbd lock remove   )
>> - lock rbd image on new host (rbd lock add  )
>> - map rbd image on new host
>> 
>> 3. old client network come back and reconnect to osds with new created
>> socket client, i.e. new (ip, port,nonce) turple
>> 
>> as a result both new and old client can write to same rbd image, which might
>> potentially cause the data corruption.
>> 
>> So does this mean if kernel client does not support exclusive-lock image
>> feature, fencing is not possible ?
>> 
>> 
> 
> 
> 
> -- 
> Kjetil Joergensen 
> SRE, Medallia Inc
> Phone: +1 (650) 739-6580

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Deleted a pool - when will a PG be removed from the OSD?

2017-04-19 Thread Daniel Marks
Hi all,

I am wondering when the PGs for a deleted pool get removed from their OSDs.
http://docs.ceph.com/docs/master/dev/osd_internals/pg_removal/ says that it
happens asynchronously, but what is the trigger?

I deleted the pool with id 15 two days ago, but I am still seeing the PG 
directories on the OSD:

/var/lib/ceph/osd/ceph-118/current # ls -1 | grep "^15"
15.8f_head
15.8f_TEMP
15.99_head
15.99_TEMP
15.f4_head
15.f4_TEMP


Best regards,
Daniel Marks


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com