[ceph-users] Inherent insecurity of OSD daemons when using only a "public network"

2017-01-13 Thread Christian Balzer
Hello, Something I came across a while ago, but the recent discussion here jolted my memory. If you have a cluster configured with just a "public network" and that network being in RFC 1918 space like 10.0.0.0/8, you'd think you'd be "safe", wouldn't you? Alas you're not: --- root@ceph-01:~# netsta
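For context, a minimal sketch of how to check what the OSD daemons actually bind to, and the workaround discussed in this thread (the 10.0.0.0/8 subnet is just a placeholder; adjust to your own networks):

  # list the TCP sockets the OSD processes are listening on; with only a
  # "public network" defined, some messengers may show up bound to 0.0.0.0
  ss -tlnp | grep ceph-osd

  # workaround sketch: define both networks in the ceph.conf [global] section,
  # even if they are the same subnet, so all messengers bind inside it
  #   public network  = 10.0.0.0/8
  #   cluster network = 10.0.0.0/8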

Re: [ceph-users] Ceph Network question

2017-01-13 Thread Christian Balzer
Hello, On Thu, 12 Jan 2017 10:03:33 -0500 Sivaram Kannan wrote: > Hi, > > Thanks for the reply. The public network I am talking about is an > isolated network with no access to the internet, but a lot of compute > traffic though. If it is more about security, I would try setting up > both in the same

Re: [ceph-users] Calamari or Alternative

2017-01-13 Thread Marko Stojanovic
There is another nice tool for ceph monitoring: https://github.com/inkscope/inkscope It's a little hard to set up, but besides just monitoring you can also manage some items with it. Regards, Marko On 1/13/17 07:30, Tu Holmes wrote: I'll give ceph-dash a look. Thanks! On Thu, Jan 12, 2017 at 9:19

Re: [ceph-users] Why would "osd marked itself down" will not recognised?

2017-01-13 Thread ulembke
Hi Greg, On 2017-01-12 19:54, Gregory Farnum wrote: ... That's not what anybody intended to have happen. It's possible the simultaneous loss of a monitor and the OSDs is triggering a case that's not behaving correctly. Can you create a ticket at tracker.ceph.com with your logs and what steps

[ceph-users] Questions about rbd image features

2017-01-13 Thread Vincent Godin
We are using a production cluster which started in Firefly, then moved to Giant, Hammer and finally Jewel. So our images have different features corresponding to the value of "rbd_default_features" of the version when they were created. We actually have three sets of features activated: image with
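For reference, a rough sketch of how to inspect and adjust features on existing images (the pool/image names are placeholders; note that some features, e.g. deep-flatten, can only be enabled at image creation time):

  rbd info rbd/myimage          # the "features:" line shows what the image was created with
  # features that older clients cannot handle can usually be disabled, e.g.:
  rbd feature disable rbd/myimage deep-flatten fast-diff object-map exclusive-lock
  # newly created images follow the client-side rbd_default_features setting in ceph.conf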

Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-01-13 Thread Dan van der Ster
Hammer or Jewel? I've forgotten which thread pool handles the snap trim nowadays -- is it the op thread yet? If so, perhaps all the op threads are stuck sleeping? Just a wild guess. (Maybe increasing the number of op threads would help?). -- Dan On Thu, Jan 12, 2017 at 3:11 PM, Nick Fisk wrote: > Hi, >
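For anyone following along, a hedged sketch of checking and tuning the knobs in question at runtime (osd.0 and the values are placeholders; run the daemon commands on the host where that OSD lives, and note that injectargs changes do not survive a restart):

  ceph daemon osd.0 config get osd_snap_trim_sleep
  ceph daemon osd.0 config get osd_op_threads
  # temporary, cluster-wide change for testing; persist it in ceph.conf if it helps
  ceph tell 'osd.*' injectargs '--osd_snap_trim_sleep 0.1'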

Re: [ceph-users] Inherent insecurity of OSD daemons when using only a "public network"

2017-01-13 Thread Willem Jan Withagen
On 13-1-2017 09:07, Christian Balzer wrote: > > Hello, > > Something I came across a while ago, but the recent discussion here > jolted my memory. > > If you have a cluster configured with just a "public network" and that > network being in RFC 1918 space like 10.0.0.0/8, you'd think you'd be "safe"

[ceph-users] Use of Spectrum Protect journal based backups for XFS filesystems in mapped RBDs?

2017-01-13 Thread Jens Dueholm Christensen
Hi, I know this isn't the obvious place to ask this, but nonetheless: has anyone had any experience with running IBM Spectrum Protect (or Tivoli Storage Manager, as it was previously known) BA client backups of filesystems created inside RBDs, using TSM's journal-based backup features[1][2]? Th

Re: [ceph-users] Questions about rbd image features

2017-01-13 Thread Jason Dillaman
On Fri, Jan 13, 2017 at 5:11 AM, Vincent Godin wrote: > We are using a production cluster which started in Firefly, then moved to > Giant, Hammer and finally Jewel. So our images have different features > corresponding to the value of "rbd_default_features" of the version when > they were created.

Re: [ceph-users] Calamari or Alternative

2017-01-13 Thread Alexandre DERUMIER
Another tool: http://openattic.org/ - Original Message - From: "Marko Stojanovic" To: "Tu Holmes" , "John Petrini" Cc: "ceph-users" Sent: Friday, January 13, 2017 09:30:16 Subject: Re: [ceph-users] Calamari or Alternative There is another nice tool for ceph monitoring: [ https://githu

Re: [ceph-users] Calamari or Alternative

2017-01-13 Thread Tu Holmes
I remember seeing one of the openATTIC project people on the list mentioning that. My initial question is, "Can you configure openATTIC just to monitor an existing cluster without having to build a new one?" //Tu On Fri, Jan 13, 2017 at 6:10 AM Alexandre DERUMIER wrote: > Another tool : > > htt

[ceph-users] All SSD cluster performance

2017-01-13 Thread Mohammed Naser
Hi everyone, We have a deployment with 90 OSDs at the moment, all SSD, that in my opinion isn't quite hitting the performance it should; a `rados bench` run gives numbers along these lines: Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up t
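For anyone wanting to compare numbers, a sketch of the kind of run described (pool name and duration are placeholders; --no-cleanup keeps the objects so a read test can follow):

  rados bench -p testpool 60 write -t 16 --no-cleanup   # 16 concurrent 4 MB writes
  rados bench -p testpool 60 seq -t 16                  # sequential reads of the benchmark objects
  rados -p testpool cleanup                             # remove the benchmark objects afterwards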

Re: [ceph-users] All SSD cluster performance

2017-01-13 Thread Wido den Hollander
> On 13 January 2017 at 18:18, Mohammed Naser wrote: > > > Hi everyone, > > We have a deployment with 90 OSDs at the moment which is all SSD that’s not > hitting quite the performance that it should be in my opinion, a `rados > bench` run gives something along these numbers: > > Maintainin

Re: [ceph-users] All SSD cluster performance

2017-01-13 Thread Mohammed Naser
> On Jan 13, 2017, at 12:37 PM, Wido den Hollander wrote: > > >> On 13 January 2017 at 18:18, Mohammed Naser wrote: >> >> >> Hi everyone, >> >> We have a deployment with 90 OSDs at the moment which is all SSD that’s not >> hitting quite the performance that it should be in my opinion, a `

Re: [ceph-users] All SSD cluster performance

2017-01-13 Thread Wido den Hollander
> On 13 January 2017 at 18:39, Mohammed Naser wrote: > > > > > On Jan 13, 2017, at 12:37 PM, Wido den Hollander wrote: > > > > > >> On 13 January 2017 at 18:18, Mohammed Naser wrote: > >> > >> > >> Hi everyone, > >> > >> We have a deployment with 90 OSDs at the moment which is all SSD

Re: [ceph-users] All SSD cluster performance

2017-01-13 Thread Mohammed Naser
> On Jan 13, 2017, at 12:41 PM, Wido den Hollander wrote: > > >> On 13 January 2017 at 18:39, Mohammed Naser wrote: >> >> >> >>> On Jan 13, 2017, at 12:37 PM, Wido den Hollander wrote: >>> >>> On 13 January 2017 at 18:18, Mohammed Naser wrote: Hi everyone,

Re: [ceph-users] All SSD cluster performance

2017-01-13 Thread Somnath Roy
<< Both OSDs are pinned to two cores on the system Is there any reason you are pinning OSDs like that? I would say for an object workload there is no need to pin OSDs. With the configuration you mentioned, Ceph with 4M object PUTs should be saturating your network first. Have you run, say, a 4M object G

Re: [ceph-users] All SSD cluster performance

2017-01-13 Thread Somnath Roy
Also, there has been a lot of discussion in the community about SSDs not being suitable for the Ceph write workload (with filestore), as some are not good at O_DIRECT/O_DSYNC kinds of writes. Hope your SSDs are tolerant of that. -Original Message- From: Somnath Roy Sent: Friday, January 13, 2017 10:06 AM To: '
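The usual quick test for that is a small O_DIRECT/O_DSYNC write run against the raw device, for example (a destructive sketch; /dev/sdX is a placeholder and must not hold data you care about):

  fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
      --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based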

Re: [ceph-users] All SSD cluster performance

2017-01-13 Thread Mohammed Naser
These Intel SSDs are more than capable of handling the workload; in addition, this cluster is used as an RBD backend for an OpenStack cluster. Sent from my iPhone > On Jan 13, 2017, at 1:08 PM, Somnath Roy wrote: > > Also, there are lot of discussion about SSDs not suitable for Ceph write >

Re: [ceph-users] All SSD cluster performance

2017-01-13 Thread Wido den Hollander
> On 13 January 2017 at 18:50, Mohammed Naser wrote: > > > > > On Jan 13, 2017, at 12:41 PM, Wido den Hollander wrote: > > > > > >> On 13 January 2017 at 18:39, Mohammed Naser wrote: > >> > >> > >> > >>> On Jan 13, 2017, at 12:37 PM, Wido den Hollander wrote: > >>> > >>> > On

Re: [ceph-users] rgw leaking data, orphan search loop

2017-01-13 Thread Wido den Hollander
> On 24 December 2016 at 13:47, Wido den Hollander wrote: > > > > > On 23 December 2016 at 16:05, Wido den Hollander wrote: > > > > > > > > > On 22 December 2016 at 19:00, Orit Wasserman wrote: > > > > > > > > > Hi Marius, > > > > > > On Thu, Dec 22, 2016 at 12:00 PM, Marius Vaitiek

Re: [ceph-users] All SSD cluster performance

2017-01-13 Thread Mohammed Naser
> On Jan 13, 2017, at 1:34 PM, Wido den Hollander wrote: > >> >> On 13 January 2017 at 18:50, Mohammed Naser wrote: >> >> >> >>> On Jan 13, 2017, at 12:41 PM, Wido den Hollander wrote: >>> >>> On 13 January 2017 at 18:39, Mohammed Naser wrote: > On Jan 13, 2

Re: [ceph-users] Calamari or Alternative

2017-01-13 Thread Brian Godette
We're using: https://github.com/rochaporto/collectd-ceph for time-series, with a slightly modified Grafana dashboard from the one referenced. https://github.com/Crapworks/ceph-dash for quick health status. Both took a small bit of modification to make them work with Jewel at the time, not

Re: [ceph-users] All SSD cluster performance

2017-01-13 Thread Wido den Hollander
> On 13 January 2017 at 20:33, Mohammed Naser wrote: > > > > > On Jan 13, 2017, at 1:34 PM, Wido den Hollander wrote: > > > >> > >> On 13 January 2017 at 18:50, Mohammed Naser wrote: > >> > >> > >> > >>> On Jan 13, 2017, at 12:41 PM, Wido den Hollander wrote: > >>> > >>> > On

[ceph-users] Ceph Monitoring

2017-01-13 Thread Chris Jones
General question/survey: for those that have larger clusters, how are you doing alerting/monitoring? Meaning, do you trigger off of 'HEALTH_WARN', etc.? Not really talking about collectd-related metrics, but more about initial alerts of an issue or potential issue. What thresholds do you use, basically? Just trying

Re: [ceph-users] Ceph Monitoring

2017-01-13 Thread David Turner
We don't use many critical alerts (that will have our NOC wake up an engineer), but the main one that we do have is a check that tells us if there are 2 or more hosts with osds that are down. We have clusters with 60 servers in them, so having an osd die and backfilling off of it isn't something to w

Re: [ceph-users] Ceph Monitoring

2017-01-13 Thread Chris Jones
Thanks. What about 'NN ops > 32 sec' (blocked ops) type alerts? Does anyone monitor for that type, and if so, what criteria do you use? Thanks again! On Fri, Jan 13, 2017 at 3:28 PM, David Turner wrote: > We don't use many critical alerts (that will have our NOC wake up an > engineer), but the

Re: [ceph-users] Ceph Monitoring

2017-01-13 Thread David Turner
We don't currently monitor that, but my todo list has an item to alert critical on blocked requests longer than 500 seconds. You can see how long they've been blocked from `ceph health detail`. Our cluster doesn't need to be super fast at any given point, but it does need to be pr
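A trivial sketch of such a check against the Hammer/Jewel health output (the grep patterns and any thresholds are up to you):

  ceph health detail | grep blocked               # summary plus per-osd "ops are blocked" lines
  ceph health detail | grep -c 'ops are blocked'  # rough count to feed into an alert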

Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-01-13 Thread Nick Fisk
We're on Jewel and you're right, I'm pretty sure the snap stuff is also now handled in the op thread. The dump_historic_ops admin socket command showed a 10s delay at the "Reached PG" stage; from Greg's response [1], it would suggest that the OSD itself isn't blocking but the PG it's currently sleeping
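For reference, the admin socket command in question (osd.0 and the socket path are placeholders; run it on the host where that OSD lives):

  ceph daemon osd.0 dump_historic_ops
  # equivalently, via the socket path:
  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops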

Re: [ceph-users] Ceph Monitoring

2017-01-13 Thread Paweł Sadowski
We monitor a few things: - cluster health (errors only, ignoring warnings since we have separate checks for the interesting things) - if all PGs are active (number of active replicas >= min_size) - if there are any blocked requests (it's a good indicator, in our case, that some disk is going to fail s
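A minimal CLI sketch of those checks (only an approximation of the setup described; real monitoring would wrap these in whatever alerting system is in use):

  ceph health                          # cluster health; treat HEALTH_ERR as critical
  ceph pg dump_stuck inactive          # rough stand-in for the "all PGs active" check
  ceph health detail | grep blocked    # blocked requests, often an early sign of a dying disk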

[ceph-users] ceph radosgw - 500 errors -- odd

2017-01-13 Thread Sean Sullivan
I am sorry for posting this if this has been addressed already. I am not sure how to search through old ceph-users mailing list posts; I used to use gmane.org but that seems to be down. My setup: I have a moderate ceph cluster (ceph hammer 0.94.9 - fe6d859066244b97b24f09d46552afc2071e6f90). Th

Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-13 Thread Robert Longstaff
FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on CentOS 7 w/ elrepo kernel 4.8.10. ceph-mgr is currently tearing through CPU and has allocated ~11GB of RAM after a single day of usage. Only the active manager is performing this way. The growth is linear and reproducible. The cluste

Re: [ceph-users] Re: Re: Pipe "deadlock" in Hammer, 0.94.5

2017-01-13 Thread Gregory Farnum
On Thu, Jan 12, 2017 at 7:58 PM, 许雪寒 wrote: > Thank you for your continuous help. > > > > We are using the hammer 0.94.5 version, and the source code I read is from that > version. > > However, on the other hand, if Pipe::do_recv does act as blocked, is it > reasonable for the Pipe::reader_thread to

Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-13 Thread Brad Hubbard
Want to install the debuginfo packages and use something like this to try and find out where it is spending most of its time? https://poormansprofiler.org/ Note that you may need to do multiple runs to get a "feel" for where it is spending most of its time. Also note that likely only one or two thread
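For reference, the poor man's profiler boils down to grabbing a few full thread backtraces with gdb and seeing where the threads sit (a sketch; it assumes a single ceph-mgr process on the host, and the debuginfo packages make the frames readable):

  for i in 1 2 3; do
    gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof ceph-mgr) > mgr-bt.$i
    sleep 5
  done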

Re: [ceph-users] All SSD cluster performance

2017-01-13 Thread Christian Balzer
Hello, On Fri, 13 Jan 2017 13:18:35 -0500 Mohammed Naser wrote: > These Intel SSDs are more than capable of handling the workload, in addition, > this cluster is used as an RBD backend for an OpenStack cluster. > I haven't tested the S3520s yet, them being the first 3D NAND offering from Inte