[ceph-users] cephadm upgrade from v15.11 to pacific fails all the times
Dear gents, to get familiar with the cephadm upgrade path and with cephadm in general (we heavily use old-style "ceph-deploy" Octopus-based production clusters), we decided to do some tests with a vanilla cluster running 15.2.11, based on CentOS 8 on top of vSphere. Deployment of the Octopus cluster runs very well and we are excited about this new technique and all the possibilities. No errors, no clues... :-)

Unfortunately the upgrade to Pacific (16.2.0 or 16.2.1) fails every time, with either the original Docker image or the quay.ceph.io/ceph-ci/ceph:pacific image. We use a small setup (3 mons, 2 mgrs, some OSDs).

This is the upgrade behaviour: the upgrade of both MGRs seems to be OK, but we get this:

2021-04-29T15:35:19.903111+0200 mgr.c0n00.vnxaqu [DBG] daemon mgr.c0n00.vnxaqu container digest correct
2021-04-29T15:35:19.903206+0200 mgr.c0n00.vnxaqu [DBG] daemon mgr.c0n00.vnxaqu deployed by correct version
2021-04-29T15:35:19.903298+0200 mgr.c0n00.vnxaqu [DBG] daemon mgr.c0n01.gstlmw container digest correct
2021-04-29T15:35:19.903378+0200 mgr.c0n00.vnxaqu [DBG] daemon mgr.c0n01.gstlmw *not deployed by correct version*

After this the upgrade process gets stuck completely, although you still have a running cluster (minus one monitor daemon):

[root@c0n00 ~]# ceph -s
  cluster:
    id:     5541c866-a8fe-11eb-b604-005056b8f1bf
    health: HEALTH_WARN
            3 hosts fail cephadm check

  services:
    mon: 2 daemons, quorum c0n00,c0n02 (age 68m)
    mgr: c0n00.bmtvpr(active, since 68m), standbys: c0n01.jwfuca
    osd: 4 osds: 4 up (since 63m), 4 in (since 62m)

  [..]

  progress:
    Upgrade to 16.2.1-257-g717ce59b (0s)
      [=...]

{
    "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:d0f624287378fe63fc4c30bccc9f82bfe0e42e62381c0a3d0d3d86d985f5d788",
    "in_progress": true,
    "services_complete": [
        "mgr"
    ],
    "progress": "2/19 ceph daemons upgraded",
    "message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an unexpected exception"
}

[root@c0n00 ~]# ceph orch ps
NAME                 HOST   PORTS        STATUS           REFRESHED  AGE  VERSION               IMAGE ID      CONTAINER ID
alertmanager.c0n00   c0n00               running (56m)    4m ago     16h  0.20.0                0881eb8f169f  30d9eff06ce2
crash.c0n00          c0n00               running (56m)    4m ago     16h  15.2.11               9d01da634b8f  91d3e4d0e14d
crash.c0n01          c0n01               host is offline  16h ago    16h  15.2.11               9d01da634b8f  0ff4a20021df
crash.c0n02          c0n02               host is offline  16h ago    16h  15.2.11               9d01da634b8f  0253e6bb29a0
crash.c0n03          c0n03               host is offline  16h ago    16h  15.2.11               9d01da634b8f  291ce4f8b854
grafana.c0n00        c0n00               running (56m)    4m ago     16h  6.7.4                 80728b29ad3f  46d77b695da5
mgr.c0n00.bmtvpr     c0n00  *:8443,9283  running (56m)    4m ago     16h  16.2.1-257-g717ce59b  3be927f015dd  94a7008ccb4f
mgr.c0n01.jwfuca     c0n01               host is offline  16h ago    16h  16.2.1-257-g717ce59b  3be927f015dd  766ada65efa9
mon.c0n00            c0n00               running (56m)    4m ago     16h  15.2.11               9d01da634b8f  b9f270cd99e2
mon.c0n02            c0n02               host is offline  16h ago    16h  15.2.11               9d01da634b8f  a90c21bfd49e
node-exporter.c0n00  c0n00               running (56m)    4m ago     16h  0.18.1                e5a616e4b9cf  eb1306811c6c
node-exporter.c0n01  c0n01               host is offline  16h ago    16h  0.18.1                e5a616e4b9cf  093a72542d3e
node-exporter.c0n02  c0n02               host is offline  16h ago    16h  0.18.1                e5a616e4b9cf  785531f5d6cf
node-exporter.c0n03  c0n03               host is offline  16h ago    16h  0.18.1                e5a616e4b9cf  074fac77e17c
osd.0                c0n02               host is offline  16h ago    16h  15.2.11               9d01da634b8f  c075bd047c0a
osd.1                c0n01               host is offline  16h ago    16h  15.2.11               9d01da634b8f  616aeda28504
osd.2                c0n03               host is offline  16h ago    16h  15.2.11               9d01da634b8f  b36453730c83
osd.3                c0n00               running (56m)    4m ago     16h  15.2.11               9d01da634b8f  e043abf53206
prometheus.c0n00     c0n00               running (56m)    4m ago     16h  2.18.1                de242295e225  7cb50c04e26a

After some digging into the daemon logs we found Tracebacks (please see below).
We also noticed that we can successfully reach each host via ssh -F !!! We've done tcpdumps while upgrading and every SYN gets its SYN/ACK... ;-) Because we get no errors while deploying a fresh Octopus cluster with cephadm (from https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm, and cephadm prepare-host is always OK), it might be a missing Python library or something else that isn't checked by cephadm itself? Thank you for any hint. Christoph A
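For anyone else chasing a stalled cephadm upgrade like this, a minimal troubleshooting sketch (host names are taken from the output above, and the image tag is simply the one mentioned in the post):

# current target image and overall progress
ceph orch upgrade status
# recent cephadm/orchestrator events, including per-host ssh failures
ceph log last cephadm
# expands the "hosts fail cephadm check" warning
ceph health detail

# on an affected node (e.g. c0n01), let cephadm validate the host itself
cephadm check-host

# if the run is wedged, stop it and retry against the same target
ceph orch upgrade stop
ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:pacific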
[ceph-users] Re: Host ceph version in dashboard incorrect after upgrade
Thank you for the command. I successfully stopped and started the mgr daemon on that node but still the version number on the ceph dashboard is stuck on the old version 15.2.10. On that node I also have the mon daemon running, should I also restart the mon?

‐‐‐ Original Message ‐‐‐
On Thursday, April 29, 2021 8:20 PM, Eugen Block wrote:

> Try this:
>
> ceph orch daemon stop mgr.
>
> and then after another daemon took over its role start it again:
>
> ceph orch daemon start mgr.
>
> Zitat von mabim...@protonmail.ch:
>
> > I also thought about restarting the MGR service but I am new to ceph
> > and could not find the "cephadm orch" command in order to do that...
> > What would be the command to restart the mgr service on a specific
> > node?
> > ‐‐‐ Original Message ‐‐‐
> > On Thursday, April 29, 2021 7:23 PM, Eugen Block ebl...@nde.ag wrote:
> >
> > > I would restart the active MGR, that should resolve it.
> > > Zitat von mabi m...@protonmail.ch:
> > >
> > > > Hello,
> > > > I upgraded my Octopus test cluster which has 5 hosts because one of
> > > > the nodes (a mon/mgr node) was still on version 15.2.10 but all the
> > > > others were on 15.2.11.
> > > > For the upgrade I used the following command:
> > > > ceph orch upgrade start --ceph-version 15.2.11
> > > > The upgrade worked correctly and I did not see any errors in the
> > > > logs, but the host version in the ceph dashboard (under the
> > > > navigation Cluster -> Hosts) still shows 15.2.10 for that specific
> > > > node.
> > > > The output of "ceph versions" shows that every component is on
> > > > 15.2.11 as you can see below:
> > > > {
> > > >     "mon": {
> > > >         "ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)": 3
> > > >     },
> > > >     "mgr": {
> > > >         "ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)": 2
> > > >     },
> > > >     "osd": {
> > > >         "ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)": 2
> > > >     },
> > > >     "mds": {},
> > > >     "overall": {
> > > >         "ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)": 7
> > > >     }
> > > > }
> > > > So why is it still stuck on 15.2.10 in the dashboard?
> > > > Best regards,
> > > > Mabi
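For the archives, a short sketch of bouncing the mgr through the orchestrator; the daemon name mgr.node5.abcdef is only a placeholder, take the real one from ceph orch ps. The dashboard's host version column presumably comes from metadata the active mgr caches, which would explain why failing over the mgr is the usual advice here.

# list mgr daemons and see which one is active
ceph orch ps --daemon-type mgr
ceph -s | grep mgr

# fail over to a standby, then restart the previously active daemon
ceph mgr fail node5.abcdef                  # placeholder name
ceph orch daemon restart mgr.node5.abcdef   # placeholder name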
[ceph-users] Specify monitor IP when CIDR detection fails
I'm running some specialized routing in my environment such that CIDR detection is failing when trying to add monitors. Is there a way to specify the monitor IP address to bind to when adding a monitor if "public_network = 0.0.0.0/0"? Setting "public_network = 0.0.0.0/0" is the only way I could find to bypass CIDR detection but then new monitors are added with the wrong IP address in the monitor map :( I'm running the latest version of Octopus.
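In case it helps, this is roughly how cephadm lets you pin monitor addresses explicitly instead of relying on public_network detection (host names and IPs below are examples only):

# stop cephadm from (re)placing mons on its own
ceph orch apply mon --unmanaged

# add each mon with an explicit IP (an IP:port or a network also works)
ceph orch daemon add mon mon2:10.1.2.12
ceph orch daemon add mon mon3:10.1.2.13

# and set a real public_network instead of 0.0.0.0/0 if one applies
ceph config set mon public_network 10.1.2.0/24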
[ceph-users] Cannot create issue in bugtracker
Hello, Is it only me that's been getting an Internal error when trying to create issues in the bugtracker for the last day or two? https://tracker.ceph.com/issues/new Best regards
[ceph-users] Re: Performance questions - 4 node (commodity) cluster - what to expect (and what not ;-)
Can you collect the output of this command on all 4 servers while your test is running:

iostat -mtxy 1

This should show how busy the CPUs are as well as how busy each drive is.

On Thu, Apr 29, 2021 at 7:52 AM Schmid, Michael wrote:
>
> Hello folks,
>
> I am new to ceph and at the moment I am doing some performance tests with a 4 node ceph-cluster (pacific, 16.2.1).
>
> Node hardware (4 identical nodes):
>
> * DELL 3620 workstation
> * Intel Quad-Core i7-6700@3.4 GHz
> * 8 GB RAM
> * Debian Buster (base system, installed on a dedicated Patriot Burst 120 GB SATA-SSD)
> * HP 530SPF+ 10 GBit dual-port NIC (tested with iperf to 9.4 GBit/s from node to node)
> * 1 x Kingston KC2500 M2 NVMe PCIe SSD (500 GB, NO power loss protection!)
> * 3 x Seagate Barracuda SATA disk drives (7200 rpm, 500 GB)
>
> After bootstrapping a containerized (docker) ceph-cluster, I did some performance tests on the NVMe storage by creating a storage pool called „ssdpool“, consisting of 4 OSDs (one NVMe device per node). A first write-performance test yields
>
> =
> root@ceph1:~# rados bench -p ssdpool 10 write -b 4M -t 16 --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_ceph1_78
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -          0
>     1      16        30        14    55.997        56    0.0209977   0.493427
>     2      16        53        37   73.9903        92    0.0264305   0.692179
>     3      16        76        60   79.9871        92     0.559505   0.664204
>     4      16        99        83   82.9879        92     0.609332   0.721016
>     5      16       116       100   79.9889        68     0.686093   0.698084
>     6      16       132       116   77.3224        64      1.19715   0.731808
>     7      16       153       137   78.2741        84     0.622646   0.755812
>     8      16       171       155    77.486        72      0.25409   0.764022
>     9      16       192       176   78.2076        84     0.968321   0.775292
>    10      16       214       198   79.1856        88     0.401339   0.766764
>    11       1       214       213   77.4408        60     0.969693   0.784002
> Total time run:         11.0698
> Total writes made:      214
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     77.3272
> Stddev Bandwidth:       13.7722
> Max bandwidth (MB/sec): 92
> Min bandwidth (MB/sec): 56
> Average IOPS:           19
> Stddev IOPS:            3.44304
> Max IOPS:               23
> Min IOPS:               14
> Average Latency(s):     0.785372
> Stddev Latency(s):      0.49011
> Max latency(s):         2.16532
> Min latency(s):         0.0144995
> =
>
> ... and I think that 80 MB/s throughput is a very poor result in conjunction with NVMe devices and 10 GBit NICs.
>
> A bare write-test (with the fsync=0 option) of the NVMe drives yields a write throughput of round about 800 MB/s per device ... the second test (with fsync=1) drops performance to 200 MB/s.
>
> =
> root@ceph1:/home/mschmid# fio --rw=randwrite --name=IOPS-write --bs=1024k --direct=1 --filename=/dev/nvme0n1 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --group_reporting --runtime=30 --time_based --fsync=0
> IOPS-write: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32...
> fio-3.12
> Starting 4 processes
> Jobs: 4 (f=4): [w(4)][100.0%][w=723MiB/s][w=722 IOPS][eta 00m:00s]
> IOPS-write: (groupid=0, jobs=4): err= 0: pid=31585: Thu Apr 29 15:15:03 2021
>   write: IOPS=740, BW=740MiB/s (776MB/s)(21.8GiB/30206msec); 0 zone resets
>     slat (usec): min=16, max=810, avg=106.48, stdev=30.48
>     clat (msec): min=7, max=1110, avg=172.09, stdev=120.18
>      lat (msec): min=7, max=1110, avg=172.19, stdev=120.18
>     clat percentiles (msec):
>      |  1.00th=[   32],  5.00th=[   48], 10.00th=[   53], 20.00th=[   63],
>      | 30.00th=[  115], 40.00th=[  161], 50.00th=[  169], 60.00th=[  178],
>      | 70.00th=[  190], 80.00th=[  220], 90.00th=[  264], 95.00th=[  368],
>      | 99.00th=[  667], 99.50th=[  751], 99.90th=[  894], 99.95th=[  986],
>      | 99.99th=[ 1036]
>    bw (  KiB/s): min=22528, max=639744, per=25.02%, avg=189649.94, stdev=113845.69, samples=240
>    iops        : min=   22, max=  624, avg=185.11, stdev=111.18, samples=240
>   lat (msec)   : 10=0.01%, 20=0.19%, 50=6.43%, 100=20.29%, 250=61.52%
>   lat (msec)   : 500=8.21%, 750=2.85%, 1000=0.47%
>   cpu          : usr=11.87%, sys=2.05%, ctx=13141, majf=0, minf=45
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.3%, 32=99.4%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete
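Regarding the fsync=1 number: what BlueStore's WAL sees is closer to small synchronous writes at queue depth 1 than to 1M buffered writes, so the fio variant usually quoted on this list looks roughly like the sketch below (the device path is only an example, and like the test above it will overwrite the target device):

fio --name=sync-write --filename=/dev/nvme0n1 \
    --rw=write --bs=4k --direct=1 --fsync=1 \
    --numjobs=1 --iodepth=1 --runtime=30 --time_based --group_reporting

Consumer NVMe drives without power loss protection typically drop sharply on this pattern, which in turn caps what a single OSD on that drive can deliver.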
[ceph-users] Re: Specify monitor IP when CIDR detection fails
At the moment I'm using "ceph orch mon apply mon1,mon2,mon3" and the hostnames "mon1,mon2,mon3" on all nodes resolve to the IP address I would like the monitor to bind to. mon1 is the initial bootstrap monitor, which is being created with "--mon-ip" (it in turn binds to the appropriate IP). Is there a way to specify "--public-addr" when using the orchestrator plugin and adding a monitor?

- Original message -
From: Michael Moyles
To: Stephen Smith6
Subject: [EXTERNAL] Re: [ceph-users] Specify monitor IP when CIDR detection fails
Date: Fri, Apr 30, 2021 9:33 AM

What do the monitor logs say? I would think that 0.0.0.0/0 tells the monitor that it can bind to any address it finds on its host. If you know the specific interface or address you want it to bind to you can pass that with --public-addr.

On Fri, 30 Apr 2021 at 13:50, Stephen Smith6 wrote:

I'm running some specialized routing in my environment such that CIDR detection is failing when trying to add monitors. Is there a way to specify the monitor IP address to bind to when adding a monitor if "public_network = 0.0.0.0/0"? Setting "public_network = 0.0.0.0/0" is the only way I could find to bypass CIDR detection but then new monitors are added with the wrong IP address in the monitor map :( I'm running the latest version of Octopus.

--
Michael Moyles | Linux Engineer
michael.moy...@mavensecurities.com
Maven Securities Ltd
140 Leadenhall Street, London, EC3V 4QT
mavensecurities.com
[ceph-users] Failed cephadm Upgrade - ValueError
Hello All,

I was running 15.2.8 via cephadm on Docker on Ubuntu 20.04. I just attempted to upgrade to 16.2.1 via the automated method; it successfully upgraded the mon/mgr/mds and some OSDs, however it then failed on an OSD and hasn't been able to get past it even after stopping and restarting the upgrade.

It reported the following: "message": "Error: UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.35 on host sn-s01 failed."

If I run 'ceph health detail' I get lots of the following error throughout the detail report: "ValueError: not enough values to unpack (expected 2, got 1)"

Upon googling, it looks like I am hitting something along the lines of https://158.69.68.89/issues/48924 & https://tracker.ceph.com/issues/49522

What do I need to do to either get around this bug, or manually upgrade the remaining OSDs to 16.2.1? Currently my cluster is working, but the last OSD it failed to upgrade is currently offline (I guess no image is attached to it now, as it failed to pull it), and I have a cluster with OSDs on both 15.2.8 and 16.2.1.

Thanks

Sent via MXlogin
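Not a definitive fix, but a hedged sketch of nudging the stuck daemon along with the orchestrator (osd.35 and the version come from the report above; whether this clears the ValueError depends on the tracker issues linked):

# stop the wedged upgrade run
ceph orch upgrade stop

# ask cephadm to redeploy just the failed daemon
ceph orch daemon redeploy osd.35

# then resume the rolling upgrade
ceph orch upgrade start --ceph-version 16.2.1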
[ceph-users] Re: Performance questions - 4 node (commodity) cluster - what to expect (and what not ;-)
On 29/04/2021 11:52 pm, Schmid, Michael wrote:
> I am new to ceph and at the moment I am doing some performance tests with a 4 node ceph-cluster (pacific, 16.2.1).

Ceph doesn't do well with small numbers, 4 OSD's is really marginal. Your latency isn't crash hot either. What size are you running on the pool?

The amount of RAM per node (8GB) would be the bare minimum as well, your ceph setup is really constrained.

Do your OSD's have access to the raw device? are they bluestore?

Same test on my Cluster

* 5 Node
* 20 OSD's (Total)
  o Mix of SATA and SAS Spinners
  o WAL/DB on SSD
* 64GB RAM Per node
* 4 * 1GB Bond

rados bench -p ceph 10 write -b 4M -t 16 --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_vnh_3642327
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -          0
    1      16        58        42    167.99       168      0.21848   0.329228
    2      16       102        86   171.986       176     0.456715   0.325869
    3      16       154       138   183.983       208     0.109888   0.319586
    4      16       206       190   189.981       208     0.188891   0.320275
    5      16       258       242   193.581       208     0.261014   0.319318
    6      16       308       292   194.647       200     0.450672   0.319268
    7      16       358       342   195.408       200     0.127415   0.316999
    8      16       406       390    194.98       192     0.176382   0.321384
    9      16       456       440   195.535       200     0.287347   0.318749
   10      16       508       492   196.779       208     0.279796   0.318067
Total time run:         10.2741
Total writes made:      508
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     197.78
Stddev Bandwidth:       14.2111
Max bandwidth (MB/sec): 208
Min bandwidth (MB/sec): 168
Average IOPS:           49
Stddev IOPS:            3.55278
Max IOPS:               52
Min IOPS:               42
Average Latency(s):     0.318968
Stddev Latency(s):      0.137534
Max latency(s):         0.913779
Min latency(s):         0.0933294

--
Lindsay
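For completeness, the questions above can be answered with something like the following (the pool name ssdpool is from the earlier mail; osd.0 is just an example id):

# replication size of the benchmarked pool
ceph osd pool get ssdpool size

# per-OSD utilisation and layout
ceph osd df tree

# objectstore type and backing devices of one OSD
ceph osd metadata 0 | grep -E 'osd_objectstore|devices'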
[ceph-users] Re: one of 3 monitors keeps going down
Have you checked for disk failure? dmesg, smartctl etc. ?

Zitat von "Robert W. Eckert" :

I worked through that workflow - but it seems like the one monitor will run for a while - anywhere from an hour to a day, then just stop.

This machine is running on AMD hardware (3600X CPU on X570 chipset) while my other two are running on old intel.

I did find this in the service logs

2021-04-30T16:02:40.135+ 7f5d0a94f700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 395334538, got 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst offset 36769734 size 84730 code = 2 Rocksdb transaction:

I am attaching the output of
journalctl -u ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08...@mon.cube.service

The error appears to be here:

Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -61> 2021-04-30T16:02:38.700+ 7f5d21332700 4 mon.cube@-1(???).mgr e702 active server: [v2:192.168.2.199:6834/1641928541,v1:192.168.2.199:6835/1641928541](2184157)
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -60> 2021-04-30T16:02:38.700+ 7f5d21332700 4 mon.cube@-1(???).mgr e702 mkfs or daemon transitioned to available, loading commands
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -59> 2021-04-30T16:02:38.701+ 7f5d21332700 4 set_mon_vals no callback set
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -58> 2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals client_cache_size = 32768
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -57> 2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals container_image = docker.io/ceph/ceph@sha256:15b15fb7a708970f1b734285ac08aef45dcd76e86866af37412d041e00853743
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -56> 2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals log_to_syslog = true
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -55> 2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals mon_data_avail_warn = 10
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -54> 2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals mon_warn_on_insecure_global_id_reclaim_allowed = true
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -53> 2021-04-30T16:02:38.701+ 7f5d21332700 4 set_mon_vals no callback set
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -52> 2021-04-30T16:02:38.702+ 7f5d21332700 2 auth: KeyRing::load: loaded key file /var/lib/ceph/mon/ceph-cube/keyring
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -51> 2021-04-30T16:02:38.702+ 7f5d1095b700 3 rocksdb: [db_impl/db_impl_compaction_flush.cc:2808] Compaction error: Corruption: block checksum mismatch: expected 395334538, got 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst offset 36769734 size 84730
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -50> 2021-04-30T16:02:38.702+ 7f5d21332700 5 asok(0x56327d226000) register_command compact hook 0x56327e028700
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -49> 2021-04-30T16:02:38.702+ 7f5d1095b700 4 rocksdb: (Original Log Time 2021/04/30-16:02:38.703267) [compaction/compaction_job.cc:760] [default] compacted to: base level 6 level multiplier 10.00 max bytes base 268435456 files[5 0 0 0 0 0 2] max score 0.00, MB/sec: 11035.6 rd, 0.0 wr, level 6, files in(5, 2) out(1) MB in(32.1, 126.7) out(0.0), read-write-amplify(5.0) write-amplify(0.0) Corruption: block checksum mismatch: expected 395334538, got 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst offset 36769734 size 84730, records in: 7670, records dropped: 6759 output_compres
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -48> 2021-04-30T16:02:38.702+ 7f5d1095b700 4 rocksdb: (Original Log Time 2021/04/30-16:02:38.703283) EVENT_LOG_v1 {"time_micros": 1619798558703277, "job": 3, "event": "compaction_finished", "compaction_time_micros": 15085, "compaction_time_cpu_micros": 11937, "output_level": 6, "num_output_files": 1, "total_output_size": 12627499, "num_input_records": 7670, "num_output_records": 911, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [5, 0, 0, 0, 0, 0, 2]}
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -47> 2021-04-30T16:02:38.702+ 7f5d1095b700 2 rocksdb: [db_impl/db_impl_compaction_flush.cc:2344] Waiting after background compaction error: Corruption: block checksum mismatch: expected 395334538, got 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst offset 36769734 size 84730, Accumulated background error counts: 1
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -46> 2021-04-30T16:02:38.702+ 7f5d21332700 5 asok(0x56327d226000) register_command smart hook 0x56327e028700

This is run
[ceph-users] Re: one of 3 monitors keeps going down
Nothing is appearing in dmesg. Smartctl shows no issues either.

I did find this issue https://tracker.ceph.com/issues/24968 which showed something that may be memory related, so I will try testing that next.

-Original Message-
From: Eugen Block
Sent: Friday, April 30, 2021 1:36 PM
To: Robert W. Eckert
Cc: ceph-users@ceph.io; Sebastian Wagner
Subject: Re: [ceph-users] Re: one of 3 monitors keeps going down

Have you checked for disk failure? dmesg, smartctl etc. ?

Zitat von "Robert W. Eckert" :

> I worked through that workflow- but it seems like the one monitor will
> run for a while - anywhere from an hour to a day, then just stop.
>
> This machine is running on AMD hardware (3600X CPU on X570 chipset)
> while my other two are running on old intel.
>
> I did find this in the service logs
>
> 2021-04-30T16:02:40.135+ 7f5d0a94f700 -1 rocksdb: submit_common
> error: Corruption: block checksum mismatch: expected 395334538, got
> 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst
> offset 36769734 size 84730 code = 2 Rocksdb transaction:
>
> I am attaching the output of
> journalctl -u
> ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08...@mon.cube.service
>
> The error appears to be here:
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -61>
> 2021-04-30T16:02:38.700+ 7f5d21332700 4 mon.cube@-1(???).mgr
> e702 active server:
> [v2:192.168.2.199:6834/1641928541,v1:192.168.2.199:6835/1641928541](2184157)
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -60>
> 2021-04-30T16:02:38.700+ 7f5d21332700 4 mon.cube@-1(???).mgr
> e702 mkfs or daemon transitioned to available, loading commands
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -59>
> 2021-04-30T16:02:38.701+ 7f5d21332700 4 set_mon_vals no callback
> set
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -58>
> 2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals
> client_cache_size = 32768
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -57>
> 2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals
> container_image =
> docker.io/ceph/ceph@sha256:15b15fb7a708970f1b734285ac08aef45dcd76e86866af37412d041e00853743
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -56>
> 2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals
> log_to_syslog = true
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -55>
> 2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals
> mon_data_avail_warn = 10
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -54>
> 2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals
> mon_warn_on_insecure_global_id_reclaim_allowed = true
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -53>
> 2021-04-30T16:02:38.701+ 7f5d21332700 4 set_mon_vals no callback
> set
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -52>
> 2021-04-30T16:02:38.702+ 7f5d21332700 2 auth: KeyRing::load:
> loaded key file /var/lib/ceph/mon/ceph-cube/keyring
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -51>
> 2021-04-30T16:02:38.702+ 7f5d1095b700 3 rocksdb:
> [db_impl/db_impl_compaction_flush.cc:2808] Compaction error:
> Corruption: block checksum mismatch: expected 395334538, got
> 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst
> offset 36769734 size 84730
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -50>
> 2021-04-30T16:02:38.702+ 7f5d21332700 5 asok(0x56327d226000)
> register_command compact hook 0x56327e028700
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -49>
> 2021-04-30T16:02:38.702+ 7f5d1095b700 4 rocksdb: (Original Log
> Time 2021/04/30-16:02:38.703267) [compaction/compaction_job.cc:760]
> [default] compacted to: base level 6 level multiplier 10.00 max
> bytes base 268435456 files[5 0 0 0 0 0 2] max score 0.00, MB/sec:
> 11035.6 rd, 0.0 wr, level 6, files in(5, 2) out(1) MB in(32.1,
> 126.7) out(0.0), read-write-amplify(5.0) write-amplify(0.0)
> Corruption: block checksum mismatch: expected 395334538, got
> 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst
> offset 36769734 size 84730, records in: 7670, records dropped: 6759
> output_compres
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -48>
> 2021-04-30T16:02:38.702+ 7f5d1095b700 4 rocksdb: (Original Log
> Time 2021/04/30-16:02:38.703283) EVENT_LOG_v1 {"time_micros":
> 1619798558703277, "job": 3, "event": "compaction_finished",
> "compaction_time_micros": 15085, "compaction_time_cpu_micros":
> 11937, "output_level": 6, "num_output_files": 1,
> "total_output_size": 12627499, "num_input_records": 7670,
> "num_output_records": 911, "num_subcompactions": 1,
> "output_compression": "NoCompression",
> "num_single_delete_mismatches": 0, "num_single_delete_fallthrough":
> 0, "ls
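Assuming the other two mons keep quorum and mon.cube's store really is corrupted, one commonly suggested path is to recreate that one mon rather than try to repair RocksDB; a hedged sketch follows, assuming the cluster is cephadm-managed (which the fsid-based systemd unit name above suggests). If the root cause turns out to be bad RAM, as the tracker issue hints, the corruption will of course come back.

# confirm quorum is healthy without mon.cube
ceph -s
ceph mon dump

# remove the broken mon daemon via cephadm
# (check that its old store under /var/lib/ceph/<fsid>/mon.cube is gone before re-adding)
ceph orch daemon rm mon.cube --force

# re-add a mon on the same host; it resyncs its store from the surviving mons
# (<mon-ip> is a placeholder for the monitor's address)
ceph orch daemon add mon cube:<mon-ip>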
[ceph-users] Best distro to run ceph.
I'm trying to set up a new ceph cluster, and I've hit a bit of a blank.

I started off with CentOS 7 and cephadm. It worked fine up to a point, except I had to upgrade podman, but it mostly worked with Octopus.

Since this is a fresh cluster and hence no data is at risk, I decided to jump straight into Pacific when it came out and upgrade. Which is where my trouble began, mostly because Pacific needs a version of lvm later than what's in CentOS 7.

I can't upgrade to CentOS 8 as my boot drives are not supported by CentOS 8 due to the way Red Hat disabled lots of disk drivers. I think I'm looking at Ubuntu or Debian.

Given cephadm has a very limited set of dependencies, it would be good to have a support matrix. It would also be good to have a check in cephadm on upgrade that says "no, I won't upgrade" if the version of lvm2 is too low on any host, and lets the admin fix the issue and try again.

I was thinking to upgrade to CentOS 8 for this project anyway, until I realised that CentOS 8 can't support the hardware I've inherited. But currently I've got a broken cluster unless I can work out some way to upgrade lvm in CentOS 7.

Peter.
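Until something like that exists in cephadm itself, a crude manual preflight can be run on every host before starting the upgrade (just a sketch; as far as I can tell cephadm check-host validates the basics such as container runtime, systemd and time sync rather than specific package versions):

# cephadm's own host validation
cephadm check-host

# the pieces Pacific is stricter about on older distros
lvm version
podman --version || docker --version
python3 --version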
[ceph-users] Large OSD Performance: osd_op_num_shards, osd_op_num_threads_per_shard
Hello,

I noticed a couple of unanswered questions on this topic from a while back. It seems, however, worth asking whether adjusting either or both of the subject attributes could improve performance with large HDD OSDs (mine are 12TB SAS).

In the previous posts on this topic the writers indicated that they had experimented with increasing either or both of osd_op_num_shards and osd_op_num_threads_per_shard and had seen performance improvement. Like myself, the writers wondered about any limitations or pitfalls relating to such adjustments. Since I would rather not take chances with a 500TB production cluster I am asking for guidance from this list.

BTW, my cluster is currently running Nautilus 14.2.6 (stock Debian packages).

Thank you.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu
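In case it helps frame the experiment, this is roughly how to inspect the current values and stage a cluster-wide override on Nautilus; the value 16 is purely illustrative, not a recommendation, and as far as I know these options are only read at OSD start-up, so a restart is needed for them to take effect:

# what a given OSD is actually running with (run on that OSD's host, via the admin socket)
ceph daemon osd.0 config show | grep -E 'osd_op_num_shards|osd_op_num_threads_per_shard'

# the rotational/flash-specific defaults as resolved for one OSD
ceph config get osd.0 osd_op_num_shards_hdd
ceph config get osd.0 osd_op_num_threads_per_shard_hdd

# stage an override for all OSDs in the mon config database (illustrative value)
ceph config set osd osd_op_num_shards_hdd 16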
[ceph-users] Re: Best distro to run ceph.
I've had good luck with the Ubuntu LTS releases - no need to add extra repos. 20.04 uses Octopus.

On Fri, Apr 30, 2021 at 1:14 PM Peter Childs wrote:
>
> I'm trying to set up a new ceph cluster, and I've hit a bit of a blank.
>
> I started off with centos7 and cephadm. Worked fine to a point, except I
> had to upgrade podman but it mostly worked with octopus.
>
> Since this is a fresh cluster and hence no data at risk, I decided to jump
> straight into Pacific when it came out and upgrade. Which is where my
> trouble began. Mostly because Pacific needs a version on lvm later than
> what's in centos7.
>
> I can't upgrade to centos8 as my boot drives are not supported by centos8
> due to the way redhst disabled lots of disk drivers. I think I'm looking at
> Ubuntu or debian.
>
> Given cephadm has a very limited set of depends it would be good to have a
> supported matrix, it would also be good to have a check in cephadm on
> upgrade, that says no I won't upgrade if the version of lvm2 is too low on
> any host and let's the admin fix the issue and try again.
>
> I was thinking to upgrade to centos8 for this project anyway until I
> relised that centos8 can't support my hardware I've inherited. But
> currently I've got a broken cluster unless I can workout some way to
> upgrade lvm in centos7.
>
> Peter.
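If it helps the comparison, the version the distro itself would install can be checked before committing (Ubuntu example; newer releases are also published in the upstream repositories at download.ceph.com if the distro version is too old):

# Ceph version shipped in the enabled Ubuntu repositories
apt update
apt policy ceph-common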