Re: [ceph-users] OSD crash after change of osd_memory_target

2020-01-21 Thread Stefan Kooman
pect something around processing config values. I've just set the same config setting on a test cluster and restarted an OSD without problem. So, not sure what is going on there. Gr. Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6

Re: [ceph-users] OSD crash after change of osd_memory_target

2020-01-21 Thread Stefan Kooman
m I doing something wrong? I wonder if they would still crash if the OSDs dropped their caches beforehand. There is support for this in master, but it doesn't look like it has been backported to nautilus: https://tracker.ceph.com/issues/24176 Gr. Stefan -- | BIT BV https://www.bit.nl/

Re: [ceph-users] Luminous Bluestore OSDs crashing with ASSERT

2020-01-21 Thread Stefan Priebe - Profihost AG
Hello Igor, thanks for all your feedback and all your help. The first thing I'll try is to upgrade a bunch of systems from the 4.19.66 kernel to 4.19.97 and see what happens. I'll report back in 7-10 days to verify whether this helps. Greets, Stefan On 20.01.20 at 13:12, Igor Fed

Re: [ceph-users] Luminous Bluestore OSDs crashing with ASSERT

2020-01-19 Thread Stefan Priebe - Profihost AG
480) Put( Prefix = O key = 0x7f8001cc45c881217262'd_data.4303206b8b4567.9632!='0xfffe'o' Value size = 510) on the right side I always see 0xfffeffff on all failed OSDs. Greets, Stefan On 19.01.20 at 14:07, Stefan Priebe -

Re: [ceph-users] Luminous Bluestore OSDs crashing with ASSERT

2020-01-19 Thread Stefan Priebe - Profihost AG
Yes, except that this happens on 8 different clusters with different hardware but the same ceph version and the same kernel version. Greets, Stefan > On 19.01.2020 at 11:53, Igor Fedotov wrote: > > So the intermediate summary is: > > Any OSD in the cluster can experience interim RocksDB c

Re: [ceph-users] Luminous Bluestore OSDs crashing with ASSERT

2020-01-17 Thread Stefan Priebe - Profihost AG
ioned PR denotes high memory pressure as a potential trigger for these > read errors. So if such pressure happens the hypothesis becomes more valid. We already do this heavily and have around 10GB of memory per OSD. Also, none of those machines show any I/O pressure at all. All hosts show a constant ra

Re: [ceph-users] ceph nautilus cluster name

2020-01-16 Thread Stefan Kooman
if naming support is already removed from the code, but in any case don't try to name it anything else. Gr. Stefan -- | BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___

Re: [ceph-users] Luminous Bluestore OSDs crashing with ASSERT

2020-01-16 Thread Stefan Priebe - Profihost AG
onth between those failures - most probably logs are already deleted. > Also please note that patch you mentioned doesn't fix previous issues > (i.e. duplicate allocations), it prevents from new ones only. > > But fsck should show them if any... None showed. Stefan > Thanks

Re: [ceph-users] Luminous Bluestore OSDs crashing with ASSERT

2020-01-16 Thread Stefan Priebe - Profihost AG
c_thread()' thread 7f3350a14700 time 2020-01-16 01:10:13.404113 /build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0) ceph version 12.2.12-11-gd3eae83543 (d3eae83543bffc0fc6c43823feb637fa851b6213) luminous (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, in

[ceph-users] Luminous Bluestore OSDs crashing with ASSERT

2020-01-16 Thread Stefan Priebe - Profihost AG
(BlueStore::KVSyncThread::entry()+0xd) [0x55e6df8a208d] 4: (()+0x7494) [0x7f8c50190494] 5: (clone()+0x3f) [0x7f8c4f217acf] all bluestore OSDs are randomly crashing sometimes (once a week). Greets, Stefan ___ ceph-users mailing list ceph-users@lists.cep

[ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-15 Thread Stefan Bauer
OSDs and could reduce latency from 2.5ms to 0.7ms now. :p Cheers Stefan -Original Message- From: Виталий Филиппов  Sent: Tuesday 14 January 2020 10:28 To: Wido den Hollander ; Stefan Bauer CC: ceph-users@lists.ceph.com Subject: Re: [ceph-users] low io with

Re: [ceph-users] units of metrics

2020-01-14 Thread Stefan Kooman
choosing the wrong new > value, or we misunderstood what the old value really was and have been > plotting it wrong all this time. I think the last one: not plotting what you think you did. We are using the telegraf plugin from the manager and using "mds.request" from "ceph_da

Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-14 Thread Stefan Bauer
Thank you all, performance is indeed better now. Can now go back to sleep ;) KR Stefan -Original Message- From: Виталий Филиппов  Sent: Tuesday 14 January 2020 10:28 To: Wido den Hollander ; Stefan Bauer CC: ceph-users@lists.ceph.com Subject: Re: [ceph-users] low io

Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-14 Thread Stefan Bauer
Hi Vitaliy, thank you for your time. Do you mean cephx sign messages = false with "disable signatures"? KR Stefan -Original Message- From: Виталий Филиппов  Sent: Tuesday 14 January 2020 10:28 To: Wido den Hollander ; Stefan Bauer CC: ceph-users@list

Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-14 Thread Stefan Bauer
Hi Stefan, thank you for your time. "temporary write through" does not seem to be a legit parameter. However write through is already set: root@proxmox61:~# echo "temporary write through" > /sys/block/sdb/device/scsi_disk/*/cache_type root@proxmox61:~# ca
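As a minimal sketch of the same check/change (device names are examples; on most drives the sysfs setting does not survive a reboot):

    # show the current cache mode of every SCSI disk
    for f in /sys/block/sd*/device/scsi_disk/*/cache_type; do
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
    # disable the volatile write cache on sdb (example device)
    echo "write through" > /sys/block/sdb/device/scsi_disk/*/cache_type
    # roughly equivalent via hdparm, if installed
    hdparm -W0 /dev/sdb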

Re: [ceph-users] block db sizing and calculation

2020-01-14 Thread Stefan Priebe - Profihost AG
Hello, does anybody have real-life experience with an external block db? Greets, Stefan On 13.01.20 at 08:09, Stefan Priebe - Profihost AG wrote: > Hello, > > i'm planning to split the block db to a separate flash device which i > also would like to use as an OSD for erasure co

Re: [ceph-users] units of metrics

2020-01-14 Thread Stefan Kooman
metric is needed to perform calculations to obtain "avgtime" (sum/avgcount). Gr. Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users mailin
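Spelled out as a quick example (assumes jq is available, that the MDS name equals the short hostname, and that the counter sits at .mds.reply_latency — the exact path may differ per release):

    # average MDS reply latency in seconds = sum / avgcount
    ceph daemon mds.$(hostname -s) perf dump \
      | jq '.mds.reply_latency | if .avgcount > 0 then .sum / .avgcount else 0 end'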

Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-13 Thread Stefan Priebe - Profihost AG
Hi Stefan, On 13.01.20 at 17:09, Stefan Bauer wrote: > Hi, > > > we're playing around with ceph but are not quite happy with the IOs. > > > 3 node ceph / proxmox cluster with each: > > > LSI HBA 3008 controller > > 4 x MZILT960HAHQ/007 Samsung

[ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-13 Thread Stefan Bauer
e on average 13000 iops / read. We're expecting more. :( Any ideas, or is that all we can expect? Money is not a problem for this test-bed; any ideas on how to gain more IOPS are greatly appreciated. Thank you. Stefan ___ ceph-users mailing list

[ceph-users] block db sizing and calculation

2020-01-12 Thread Stefan Priebe - Profihost AG
nds a minimum size of 140GB per 14TB HDD. Is there any recommendation on how many OSDs a single flash device can serve? The Optane ones can do 2000MB/s write + 500,000 IOPS. Greets, Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com
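Back-of-the-envelope only (the 1% ratio implied by 140GB per 14TB and the 960GB device size are assumptions, not an official rule):

    # 14 TB * 1% = 140 GB of DB space per HDD
    # a 960 GB flash device could then hold the DBs of roughly 6 such HDDs,
    # but it also becomes a single point of failure for those 6 OSDs
    echo $(( 960 / 140 ))    # -> 6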

[ceph-users] Hardware selection for ceph backup on ceph

2020-01-10 Thread Stefan Priebe - Profihost AG
4K native Dual 25Gb network Does it fit? Does anybody have experience with the drives? Can we use EC or do we need to use normal replication? Greets, Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users

Re: [ceph-users] Looking for experience

2020-01-10 Thread Stefan Priebe - Profihost AG
> On 10.01.2020 at 07:10, Mainor Daly wrote: > >  > Hi Stefan, > > before I give some suggestions, can you first describe your use case for which > you want to use that setup? Also which aspects are important for you. It's just the backup target of another ceph Clus

Re: [ceph-users] Looking for experience

2020-01-09 Thread Stefan Priebe - Profihost AG
DB or not? Since we started using ceph we're mostly subscribed to SSDs - so no knowledge about HDD in place. Greets, Stefan On 09.01.20 at 16:49, Stefan Priebe - Profihost AG wrote: > >> On 09.01.2020 at 16:10, Wido den Hollander wrote: >> >>  >> >>> O

Re: [ceph-users] RBD EC images for a ZFS pool

2020-01-09 Thread Stefan Kooman
recordsize: https://blog.programster.org/zfs-record-size, https://blogs.oracle.com/roch/tuning-zfs-recordsize Gr. Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl

Re: [ceph-users] Looking for experience

2020-01-09 Thread Stefan Priebe - Profihost AG
> On 09.01.2020 at 16:10, Wido den Hollander wrote: > >  > >> On 1/9/20 2:27 PM, Stefan Priebe - Profihost AG wrote: >> Hi Wido, >>> On 09.01.20 at 14:18, Wido den Hollander wrote: >>> >>> >>> On 1/9/20 2:07 PM, Daniel Aberger -

Re: [ceph-users] RBD EC images for a ZFS pool

2020-01-09 Thread Stefan Kooman
Quoting Kyriazis, George (george.kyria...@intel.com): > > > > On Jan 9, 2020, at 8:00 AM, Stefan Kooman wrote: > > > > Quoting Kyriazis, George (george.kyria...@intel.com): > > > >> The source pool has mainly big files, but there are quite a few >

Re: [ceph-users] RBD EC images for a ZFS pool

2020-01-09 Thread Stefan Kooman
Quoting Kyriazis, George (george.kyria...@intel.com): > The source pool has mainly big files, but there are quite a few > smaller (<4KB) files that I’m afraid will create waste if I create the > destination zpool with ashift > 12 (>4K blocks). I am not sure, > though, if ZFS will actually write b

Re: [ceph-users] Looking for experience

2020-01-09 Thread Stefan Priebe - Profihost AG
about this and most probably some overhead we currently have in those numbers. Those values come from our old classic raid storage boxes. Those use btrfs + zlib compression + subvolumes for those backups and we've collected those numbers from all of them. The new system should just replicate snapshot

Re: [ceph-users] CRUSH rebalance all at once or host-by-host?

2020-01-09 Thread Stefan Kooman
ts/blob/master/tools/upmap/upmap-remapped.py This way you can pause the process or get in "HEALTH_OK" state when you want to. Gr. Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl
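The usual workflow with that script looks roughly like this (paths are examples; always review the generated commands before applying them):

    ceph osd set norebalance          # stop data movement first
    # ... apply the CRUSH / OSD changes ...
    ./upmap-remapped.py               # prints ceph osd pg-upmap-items commands
    ./upmap-remapped.py | sh          # apply them, pinning PGs to their current OSDs
    ceph osd unset norebalance        # then let the balancer move PGs gradually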

Re: [ceph-users] Log format in Ceph

2020-01-08 Thread Stefan Kooman
istoric_slow_ops" on the storage node hosting this OSD and you will get JSON output with the reason (flag_point) of the slow op and the series of events. Gr. Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i..
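For example (OSD id is illustrative, jq and the usual ops/type_data layout assumed):

    # on the node hosting osd.12
    ceph daemon osd.12 dump_historic_slow_ops \
      | jq '.ops[] | {description, duration, flag_point: .type_data.flag_point}'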

Re: [ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging

2020-01-07 Thread Stefan Kooman
> (well, except for that scrub bug, but my work-around for that is in all > release versions). What scrub bug are you talking about? Gr. Stefan -- | BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +

Re: [ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging

2020-01-07 Thread Stefan Kooman
dos bench is > stable again. > > apt-get install irqbalance nftables ^^ Are these some of these changes? Do you need those packages in order to unload / blacklist them? I don't get what your fixes are, or what the problem was. Firewall issues? What Ceph version did you upgr

Re: [ceph-users] Architecture - Recommendations

2020-01-06 Thread Stefan Kooman
ything in containers. It makes (performance) debugging *a lot* easier as you can actually isolate things. Something which is way more difficult to achieve in servers where you have a complex workload going on ... I guess (no proof of that) that performance will be more consistent as well. Gr. Stefan --

Re: [ceph-users] ceph luminous bluestore poor random write performances

2020-01-02 Thread Stefan Kooman
use case? Low latency generally matters most. Gr. Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users mailing list ceph-users@lists.ceph.com

Re: [ceph-users] Architecture - Recommendations

2019-12-31 Thread Stefan Kooman
es. Are you planning on dedicated monitor nodes (I would definitely do that)? Gr. Stefan -- | BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users mailing list

Re: [ceph-users] Architecture - Recommendations

2019-12-31 Thread Stefan Kooman
PN VXLAN network is not trivial ... I advise getting networking expertise in your team. Gr. Stefan -- | BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users

Re: [ceph-users] HEALTH_ERR, size and min_size

2019-12-29 Thread Stefan Kooman
Quoting Ml Ml (mliebher...@googlemail.com): > Hello Stefan, > > The status was "HEALTH_OK" before i ran those commands. \o/ > root@ceph01:~# ceph osd crush rule dump > [ > { > "rule_id": 0, > "rule_name": "repli

Re: [ceph-users] HEALTH_ERR, size and min_size

2019-12-29 Thread Stefan Kooman
r "OSD" and not host. What does a "ceph osd crush rule dump" shows? Gr. Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users mail

Re: [ceph-users] Consumer-grade SSD in Ceph

2019-12-27 Thread Stefan Kooman
ur not concerned about lifetime this is just fine. We use quite a lot of them and even after ~ 2 years the most used SSD is at 4.4% write capacity. Gr. Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318
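A quick way to check that wear figure yourself (attribute names vary per vendor, so the patterns and device names below are examples):

    # SATA SSD: vendor wear / written attributes
    smartctl -A /dev/sda | egrep -i 'wear|percent|written'
    # NVMe: percentage_used in the SMART log
    nvme smart-log /dev/nvme0 | grep -i percentage_used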

Re: [ceph-users] cephfs kernel client io performance decreases extremely

2019-12-27 Thread Stefan Kooman
of mds within > five seconds as follow, You should run this iostat -x 1 on the OSD nodes ... MDS is not doing any IO in and of itself as far as Ceph is concerned. Gr. Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6

Re: [ceph-users] Architecture - Recommendations

2019-12-27 Thread Stefan Kooman
GP EVPN (over VXLAN)? In that case you would have the ceph nodes in the overlay ... You can put a LB / proxy up front (Varnish, ha-proxy, nginx, relayd, etc.) ... (outside of the Ceph network) and connect over HTTP to the RGW nodes ... which can reach the Ceph network (or are even part of it) on the b

Re: [ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12

2019-12-16 Thread Stefan Kooman
now how to migrate > without inactive pgs and slow requests? Several users reported that setting the following parameters: osd op queue = wpq osd op queue cut off = high helped in cases like this. Your mileage may vary ... Gr. Stefan -- | BIT BV https://www.bit.nl/ Kamer van Koop
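Roughly, for a Luminous cluster like the one in this thread (both options need an OSD restart to take effect; the OSD id is an example):

    # /etc/ceph/ceph.conf on every OSD node, in the [osd] section:
    #   osd op queue = wpq
    #   osd op queue cut off = high
    # then restart the OSDs one at a time, waiting for HEALTH_OK in between
    systemctl restart ceph-osd@12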

Re: [ceph-users] Use telegraf/influx to detect problems is very difficult

2019-12-12 Thread Stefan Kooman
want to know what it is used for: https://tracker.ceph.com/issues/35947 TL;DR: it's not what you think it is. Gr. Stefan -- | BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl

Re: [ceph-users] //: // ceph-mon is blocked after shutting down and ip address changed

2019-12-11 Thread Stefan Kooman
s to connect to 3300 ... you might get a timeout as well. Not sure if messenger falls back to v1. What happens when you change ceph.conf (first without restarting the mon) and try a "ceph -s" again with a ceph client on the monitor node? Gr. Stefan -- | BIT

Re: [ceph-users] Ceph assimilated configuration - unable to remove item

2019-12-11 Thread Stefan Kooman
rkaround for now if you want to override the config store: just put that in your config file and restart the daemon(s). Gr. Stefan -- | BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl __

Re: [ceph-users] 回复: ceph-mon is blocked after shutting down and ip address changed

2019-12-10 Thread Stefan Kooman
$hostname quorum_status If there is no monitor in quorum ... then that's your problem. See [1] for more info on debugging the monitor. Gr. Stefan [1]: https://docs.ceph.com/docs/nautilus/rados/troubleshooting/troubleshooting-mon/ -- | BIT BV https://www.bit.nl/Kamer van Koophandel
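Spelled out (run on the monitor node itself; the admin socket answers even without quorum, and the mon name is assumed to equal the short hostname):

    ceph daemon mon.$(hostname -s) quorum_status
    ceph daemon mon.$(hostname -s) mon_status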

Re: [ceph-users] Ceph-mgr :: Grafana + Telegraf / influxdb metrics format

2019-12-10 Thread Stefan Kooman
might be able to use a Prometheus dashboard and convert that to an InfluxDB-compatible dashboard in Grafana. I think I would do that if I were to do it all over again. And / or use Prometheus with InfluxDB as the backend for long(er) term storage. With the new InfluxDB query language "flux" [5],

Re: [ceph-users] ceph-mon is blocked after shutting down and ip address changed

2019-12-10 Thread Stefan Kooman
14s > > ... > > > I changed the IP back to 192.168.0.104 yesterday, but all the same. Just checking here: do you run a firewall? Is port 3300 open (besides 6789)? What do you see in the logs on the MDS and the OSDs? There are timers configured in the MON / OSD in case they cannot rea

Re: [ceph-users] Missing Ceph perf-counters in Ceph-Dashboard or Prometheus/InfluxDB...?

2019-12-09 Thread Stefan Kooman
af exporters). > > While changing that is rather trivial, it could make sense to get > users' feedback and come up with a list of missing perf-counters to be > exposed. I made https://tracker.ceph.com/issues/4188 a while ago: missing metrics in all but prometheus module. Gr

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-12-06 Thread Stefan Kooman
Quoting Stefan Kooman (ste...@bit.nl): > 13.2.6 with this patch is running production now. We will continue the > cleanup process that *might* have triggered this tomorrow morning. For what it's worth ... that process completed successfully ... Time will tell if it's really fix

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-12-05 Thread Stefan Kooman
Hi, Quoting Yan, Zheng (uker...@gmail.com): > Please check if https://github.com/ceph/ceph/pull/32020 works Thanks! 13.2.6 with this patch is running production now. We will continue the cleanup process that *might* have triggered this tomorrow morning. Gr. Stefan -- | BIT BV ht

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-12-04 Thread Stefan Kooman
Quoting Stefan Kooman (ste...@bit.nl): > and it crashed again (and again) ... until we stopped the mds and > deleted the mds0_openfiles.0 from the metadata pool. > > Here is the (debug) output: > > A specific workload that *might* have triggered this: recursively deletin

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-12-04 Thread Stefan Kooman
Hi, Quoting Stefan Kooman (ste...@bit.nl): > > please apply following patch, thanks. > > > > diff --git a/src/mds/OpenFileTable.cc b/src/mds/OpenFileTable.cc > > index c0f72d581d..2ca737470d 100644 > > --- a/src/mds/OpenFileTable.cc > > +++ b/src/mds/Op

Re: [ceph-users] Failed to encode map errors

2019-12-04 Thread Stefan Kooman
emons running with different ceph versions. What does "ceph versions" give you? Gr. Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-u

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-11-24 Thread Stefan Kooman
omap_num_items.resize(omap_num_objs); > omap_updates.resize(omap_num_objs); > omap_updates.back().clear = true; It took a while but an MDS server with this debug patch is now live (and up:active). FYI, Gr. Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090

Re: [ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]

2019-11-13 Thread Stefan Bauer
    "avgtime": 0.004992133 because the communication partner is slow in writing/commiting? Dont want to follow the red hering :/ We have the following times on our 11 osds. Attached image. -Ursprüngliche Nachricht- Von: Paul Emmerich  Gesendet: Donnerstag 7 Novemb

Re: [ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]

2019-11-07 Thread Stefan Bauer
110   110  10 94    94  11 24    24 Stefan From: Paul Emmerich  You can have a look at subop_latency in "ceph daemon osd.XX perf dump", it tells you how long an OSD took to reply to another OSD. That's usually
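A small loop to compare that counter across the OSDs on one node (assumes jq, the default admin socket paths, and that the counter sits under the "osd" section of perf dump):

    for sock in /var/run/ceph/ceph-osd.*.asok; do
        id=${sock##*ceph-osd.}; id=${id%.asok}
        lat=$(ceph daemon osd."$id" perf dump \
              | jq '.osd.subop_latency | if .avgcount > 0 then .sum / .avgcount else 0 end')
        echo "osd.$id avg subop latency: $lat s"
    done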

[ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]

2019-11-07 Thread Stefan Bauer
+scrubbing+deep     io:     client:   4.99MiB/s rd, 1.36MiB/s wr, 678op/s rd, 105op/s wr Thank you. Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Stefan Kooman
you already have the patch (on github) somewhere? Thanks, Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users mailing list ceph-users@lis

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Stefan Kooman
e openfiles list (object) becomes corrupted? As in: have a bugfix in place? Thanks! Gr. Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users mailing

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Stefan Kooman
tate (although it has been crashing at least 10 times now). Is the following what you want me to do, and safe to do in this situation? 1) Stop the running (active) MDS 2) delete object 'mdsX_openfiles.0' from the cephfs metadata pool Thanks, Stefan -- | BIT BV https://www.bit.nl/

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-19 Thread Stefan Kooman
Dear list, Quoting Stefan Kooman (ste...@bit.nl): > I wonder if this situation is more likely to be hit on Mimic 13.2.6 than > on any other system. > > Any hints / help to prevent this from happening? We have had this happening another two times now. In both cases the MDS recov

[ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-19 Thread Stefan Kooman
the same issue on a Mimic 13.2.6 system: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036702.html I wonder if this situation is more likely to be hit on Mimic 13.2.6 than on any other system. Any hints / help to prevent this from happening? Thanks, Stefan -- | BIT BV

Re: [ceph-users] ceph version 14.2.3-OSD fails

2019-10-11 Thread Stefan Priebe - Profihost AG
sue or not. > > This time it reminds the issue shared in this mailing list a while ago by > Stefan Priebe. The post caption is "Bluestore OSDs keep crashing in > BlueStore.cc: 8808: FAILED assert(r == 0)" > > So first of all I'd suggest to distinguish these issues

[ceph-users] MDS Stability with lots of CAPS

2019-10-02 Thread Stefan Kooman
start the MDS to make the "mds_cache_memory_limit" effective, is that correct? Gr. Stefan [1]: https://ceph.com/community/nautilus-cephfs/ -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl _
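A sketch, assuming a Mimic/Nautilus-style central config store (the 16 GiB value is just an example, and whether a restart is needed is exactly the open question here, so verify via the admin socket afterwards):

    # current value as seen by the running MDS
    ceph daemon mds.$(hostname -s) config get mds_cache_memory_limit
    # raise it to 16 GiB for all MDS daemons
    ceph config set mds mds_cache_memory_limit 17179869184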

Re: [ceph-users] Have you enabled the telemetry module yet?

2019-10-02 Thread Stefan Kooman
d error message is gone. Either way it makes sense to enable the crash module anyway. Thanks, Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ cep

Re: [ceph-users] MDS / CephFS behaviour with unusual directory layout

2019-10-01 Thread Stefan Kooman
Quoting Stefan Kooman (ste...@bit.nl): > Hi List, > > We are planning to move a filesystem workload (currently nfs) to CephFS. > It's around 29 TB. The unusual thing here is the amount of directories > in use to host the files. In order to combat a "too many files in on

Re: [ceph-users] Have you enabled the telemetry module yet?

2019-10-01 Thread Stefan Kooman
;, line 214, in gather_crashinfo errno, crashids, err = self.remote('crash', 'do_ls', '', '') File "/usr/lib/ceph/mgr/mgr_module.py", line 845, in remote args, kwargs) ImportError: Module not found Running 13.2.6 on Ub

Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-09-15 Thread Stefan Priebe - Profihost AG
Hi Igor, On 12.09.19 at 19:34, Igor Fedotov wrote: > Hi Stefan, > > thanks for the update. > > Relevant PR from Paul mentions kernels (4.9+): > https://github.com/ceph/ceph/pull/23273 > > Not sure how correct this is. That's all I have.. > > Try asking Sage

Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-09-12 Thread Stefan Priebe - Profihost AG
Hello Igor, I can now confirm that this is indeed a kernel bug. The issue no longer happens on upgraded nodes. Do you know more about it? I really would like to know in which version it was fixed, to prevent rebooting all ceph nodes. Greets, Stefan On 27.08.19 at 16:20, Igor Fedotov wrote

[ceph-users] cephfs: apache locks up after parallel reloads on multiple nodes

2019-09-12 Thread Stefan Kooman
o investigate this issue further are highly appreciated. Gr. Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] units of metrics

2019-09-12 Thread Stefan Kooman
lugin does not (yet?) provide mds metrics though. Ideally we would *only* use the ceph mgr telegraf module to collect *all the things*. Not sure what difference in the python code between the modules could explain this. Gr. Stefan -- | BIT BV https://www.bit.nl/ Kamer van

Re: [ceph-users] How to add 100 new OSDs...

2019-09-11 Thread Stefan Kooman
andard deviation. If that is quite high it makes sense to use the balancer to equalize and obtain higher utilization. Either PG optimized or capacity optimized (or a mix of both, the default balancer settings). Gr. Stefan -- | BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351 | GPG: 0xD148

Re: [ceph-users] How to add 100 new OSDs...

2019-09-11 Thread Stefan Kooman
2) The balancer moves the data more efficiently. 3) the balancer will avoid putting PGs on OSDs that are already full ... you might avoid "too full" PG situations. Gr. Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31
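A minimal sequence for turning that on (upmap mode assumes all clients are Luminous or newer):

    ceph osd df            # the summary line shows the utilization STDDEV
    ceph features          # check client feature level first
    ceph balancer mode upmap
    ceph balancer on
    ceph balancer status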

Re: [ceph-users] regurlary 'no space left on device' when deleting on cephfs

2019-09-06 Thread Stefan Kooman
Quoting Kenneth Waegeman (kenneth.waege...@ugent.be): > The cluster is healthy at this moment, and we have certainly enough space > (see also osd df below) It's not well balanced though ... do you use ceph balancer (with balancer in upmap mode)? Gr. Stefan -- | BIT BV https:/

[ceph-users] 14.2.2 -> 14.2.3 upgrade [WRN] failed to encode map e905 with expected crc

2019-09-06 Thread Stefan Kooman
this? Gr. Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users

[ceph-users] units of metrics

2019-09-04 Thread Stefan Kooman
'avgtime' in seconds, with "avgtime": 0.000328972 representing 0.328972 ms? As far as I can see, the logs collected by the telegraf manager plugin only send "sum". So how would I calculate the average reply latency for mds requests? Thanks, Gr. Stefan -- | BI

Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Stefan Priebe - Profihost AG
occasional invalid > data reads under high memory pressure/swapping: > https://tracker.ceph.com/issues/22464 We have a current 4.19.X kernel and no memory limit. Mem avail is pretty constant at 32GB. Greets, Stefan > > IMO memory usage worth checking as well... > > > Igor

Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Stefan Priebe - Profihost AG
see inline On 27.08.19 at 15:43, Igor Fedotov wrote: > see inline > > On 8/27/2019 4:41 PM, Stefan Priebe - Profihost AG wrote: >> Hi Igor, >> >> On 27.08.19 at 14:11, Igor Fedotov wrote: >>> Hi Stefan, >>> >>> this looks like a dupli

Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Stefan Priebe - Profihost AG
Hi Igor, On 27.08.19 at 14:11, Igor Fedotov wrote: > Hi Stefan, > > this looks like a duplicate for > > https://tracker.ceph.com/issues/37282 > > Actually the root cause selection might be quite wide. > > From HW issues to broken logic in RocksDB/BlueStore/B

[ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Stefan Priebe - Profihost AG
fb1ab2f6494] 5: (clone()+0x3f) [0x7fb1aa37dacf] I already opened a tracker: https://tracker.ceph.com/issues/41367 Can anybody help? Is this known? Greets, Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] MDS / CephFS behaviour with unusual directory layout

2019-07-26 Thread Stefan Kooman
metadata on only HDDs, it's going to be slow. Only SSD for OSD data pool and NVMe for metadata pool, so that should be fine. Besides the initial loading of that many files / directories this workload shouldn't be any problem. Thanks for your feedback. Gr. Stefan -- | BIT BV https://w

Re: [ceph-users] How to add 100 new OSDs...

2019-07-26 Thread Stefan Kooman
you are using cephfs kernel client it might report as not compatible (jewel) but recent linux distributions work well (Ubuntu 18.04 / CentOS 7). Gr. Stefan -- | BIT BV https://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl

[ceph-users] MDS / CephFS behaviour with unusual directory layout

2019-07-26 Thread Stefan Kooman
. We are wondering if this kind of directory structure is suitable for CephFS. Might the MDS have difficulties keeping up with that many inodes / dentries, or doesn't it care at all? The amount of metadata overhead might be horrible, but we will test that out. Thanks, Stefan -- | BIT

[ceph-users] To backport or not to backport

2019-07-04 Thread Stefan Kooman
ver needs "dangerous" updates. This is my view on the matter, please let me know what you think of this. Gr. Stefan P.s. Just to make things clear: this thread is in _no way_ intended to pick on anybody. [1]: https://pad.ceph.com/p/ceph-day-nl-2019-panel -- | BIT BV https://www.

Re: [ceph-users] Ceph Upgrades - sanity check - MDS steps

2019-06-19 Thread Stefan Kooman
the same active and standby as before the upgrades, both up to date with as little downtime as possible. That said ... I've accidentally updated a standby MDS to a newer version than the Active one ... and this didn't cause any issues (12.2.8 -> 12.2.11) ... but I would not recommen

Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2019-06-11 Thread Stefan Kooman
Quoting Patrick Donnelly (pdonn...@redhat.com): > Hi Stefan, > > Sorry I couldn't get back to you sooner. NP. > Looks like you hit the infinite loop bug in OpTracker. It was fixed in > 12.2.11: https://tracker.ceph.com/issues/37977 > > The problem was introduced in

Re: [ceph-users] OSD hanging on 12.2.12 by message worker

2019-06-10 Thread Stefan Kooman
knowing about it. Gr. Stefan -- | BIT BV http://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] OSD hanging on 12.2.12 by message worker

2019-06-07 Thread Stefan Kooman
ded before I can do that? It's safe to use in production. We have test clusters running it, and recently put it in production as well. As Igor noted this might not help in your situation, but it might prevent you from running into decreased performance (increased latency) over time.

Re: [ceph-users] OSD hanging on 12.2.12 by message worker

2019-06-06 Thread Stefan Kooman
as been identified to be caused by the "stupid allocator" memory allocator. Gr. Stefan -- | BIT BV http://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2019-05-27 Thread Stefan Kooman
Quoting Stefan Kooman (ste...@bit.nl): > Hi Patrick, > > Quoting Stefan Kooman (ste...@bit.nl): > > Quoting Stefan Kooman (ste...@bit.nl): > > > Quoting Patrick Donnelly (pdonn...@redhat.com): > > > > Thanks for the detailed notes. It looks like the MDS is s

Re: [ceph-users] performance in a small cluster

2019-05-27 Thread Stefan Kooman
presentation by Wido/Piotr that might be useful: https://static.sched.com/hosted_files/cephalocon2019/d6/ceph%20on%20nvme%20barcelona%202019.pdf Gr. Stefan -- | BIT BV http://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl _

Re: [ceph-users] Cephfs free space vs ceph df free space disparity

2019-05-27 Thread Stefan Kooman
4096 bluestore min alloc size hdd = 4096 You will have to rebuild _all_ of your OSDs though. Here is another thread about this: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/thread.html#24801 Gr. Stefan -- | BIT BV http://www.bit.nl/Kamer van Koophandel 09090351
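A sketch of those settings (they only apply to OSDs created after the change, which is why all existing OSDs have to be rebuilt; the OSD id in the check is an example):

    # /etc/ceph/ceph.conf on the OSD nodes, [osd] section:
    #   bluestore min alloc size ssd = 4096
    #   bluestore min alloc size hdd = 4096
    # verify what a (re)created OSD actually uses:
    ceph daemon osd.12 config get bluestore_min_alloc_size_hdd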

Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-18 Thread Stefan Kooman
also be able to balance way better. Math: ((100 (PG/OSD) * 192 (# OSDs)) - 750) / 3 = 6150 for 3-replica pools. You might have a lot of contention going on on your OSDs; they are probably under-performing. Gr. Stefan ___ ceph-users mailing list ceph-users@

Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-18 Thread Stefan Kooman
Quoting Frank Schilder (fr...@dtu.dk): > Dear Yan and Stefan, > > it happened again and there were only very few ops in the queue. I > pulled the ops list and the cache. Please find a zip file here: > "https://files.dtu.dk/u/w6nnVOsp51nRqedU/mds-stuck-dirfrag.zip?l" . >

Re: [ceph-users] How do you deal with "clock skew detected"?

2019-05-16 Thread Stefan Kooman
ds of HEALTH_WARN "clock skew > detected". > > I guess the workaround for now is to ignore the warning, and wait > for two minutes before rebooting another mon. You can tune the "mon_timecheck_skew_interval", which by default is set to 30 seconds.
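For example (assuming a Mimic/Nautilus-style config store; the value is illustrative):

    ceph time-sync-status        # per-mon clock skew as seen by the leader
    # re-check interval while a skew is present (default 30 s, per the post above);
    # lowering it makes the warning clear sooner once the clock has settled
    ceph config set mon mon_timecheck_skew_interval 10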

Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-16 Thread Stefan Kooman
Quoting Frank Schilder (fr...@dtu.dk): > Dear Stefan, > > thanks for the fast reply. We encountered the problem again, this time in a > much simpler situation; please see below. However, let me start with your > questions first: > > What bug? -- In a single-active MDS set-

Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-14 Thread Stefan Kooman
rring to based on info below. It does seem to work as designed. Gr. Stefan -- | BIT BV http://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users mailing list ceph-users@lists.

Re: [ceph-users] Slow requests from bluestore osds

2019-05-14 Thread Stefan Kooman
r, which (also) might result in slow ops after $period of OSD uptime. Gr. Stefan -- | BIT BV http://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users mailing list ceph-use
