[ceph-users] cephadm upgrade from v15.11 to pacific fails all the times

2021-04-30 Thread Ackermann, Christoph
Dear gents,

to get familiar with the cephadm upgrade path and with cephadm in general (we
heavily use old-style "ceph-deploy" Octopus-based production clusters), we
decided to do some tests with a vanilla cluster running 15.2.11, based on
CentOS 8 on top of vSphere.  Deployment of the Octopus cluster went very well
and we are excited about this new technique and all its possibilities.  No
errors, no clues... :-)

Unfortunately, the upgrade to Pacific (16.2.0 or 16.2.1) fails every time,
with both the original Docker images and the quay.ceph.io/ceph-ci/ceph:pacific
image.  We use a small setup (3 mons, 2 mgrs, some OSDs).  This is the upgrade
behaviour:

The upgrade of both MGRs seems to be OK, but then we get this:

2021-04-29T15:35:19.903111+0200 mgr.c0n00.vnxaqu [DBG] daemon
mgr.c0n00.vnxaqu container digest correct
2021-04-29T15:35:19.903206+0200 mgr.c0n00.vnxaqu [DBG] daemon
mgr.c0n00.vnxaqu deployed by correct version
2021-04-29T15:35:19.903298+0200 mgr.c0n00.vnxaqu [DBG] daemon
mgr.c0n01.gstlmw container digest correct
2021-04-29T15:35:19.903378+0200 mgr.c0n00.vnxaqu [DBG] daemon
mgr.c0n01.gstlmw *not deployed by correct version*

After this the upgrade process gets stuck completely, although the cluster
keeps running (minus one monitor daemon):

[root@c0n00 ~]# ceph -s
  cluster:
id: 5541c866-a8fe-11eb-b604-005056b8f1bf
health: HEALTH_WARN
   * 3 hosts fail cephadm check*
  services:
mon: 2 daemons, quorum c0n00,c0n02 (age 68m)
mgr: c0n00.bmtvpr(active, since 68m), standbys: c0n01.jwfuca
osd: 4 osds: 4 up (since 63m), 4 in (since 62m)
[..]
  progress:
Upgrade to 16.2.1-257-g717ce59b (0s)

  [=...]


{
    "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:d0f624287378fe63fc4c30bccc9f82bfe0e42e62381c0a3d0d3d86d985f5d788",
    "in_progress": true,
    "services_complete": [
        "mgr"
    ],
    "progress": "2/19 ceph daemons upgraded",
    "message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an unexpected exception"
}

[root@c0n00 ~]# ceph orch ps
NAME                 HOST   PORTS        STATUS           REFRESHED  AGE  VERSION               IMAGE ID      CONTAINER ID
alertmanager.c0n00   c0n00               running (56m)    4m ago     16h  0.20.0                0881eb8f169f  30d9eff06ce2
crash.c0n00          c0n00               running (56m)    4m ago     16h  15.2.11               9d01da634b8f  91d3e4d0e14d
crash.c0n01          c0n01               host is offline  16h ago    16h  15.2.11               9d01da634b8f  0ff4a20021df
crash.c0n02          c0n02               host is offline  16h ago    16h  15.2.11               9d01da634b8f  0253e6bb29a0
crash.c0n03          c0n03               host is offline  16h ago    16h  15.2.11               9d01da634b8f  291ce4f8b854
grafana.c0n00        c0n00               running (56m)    4m ago     16h  6.7.4                 80728b29ad3f  46d77b695da5
mgr.c0n00.bmtvpr     c0n00  *:8443,9283  running (56m)    4m ago     16h  16.2.1-257-g717ce59b  3be927f015dd  94a7008ccb4f
mgr.c0n01.jwfuca     c0n01               host is offline  16h ago    16h  16.2.1-257-g717ce59b  3be927f015dd  766ada65efa9
mon.c0n00            c0n00               running (56m)    4m ago     16h  15.2.11               9d01da634b8f  b9f270cd99e2
mon.c0n02            c0n02               host is offline  16h ago    16h  15.2.11               9d01da634b8f  a90c21bfd49e
node-exporter.c0n00  c0n00               running (56m)    4m ago     16h  0.18.1                e5a616e4b9cf  eb1306811c6c
node-exporter.c0n01  c0n01               host is offline  16h ago    16h  0.18.1                e5a616e4b9cf  093a72542d3e
node-exporter.c0n02  c0n02               host is offline  16h ago    16h  0.18.1                e5a616e4b9cf  785531f5d6cf
node-exporter.c0n03  c0n03               host is offline  16h ago    16h  0.18.1                e5a616e4b9cf  074fac77e17c
osd.0                c0n02               host is offline  16h ago    16h  15.2.11               9d01da634b8f  c075bd047c0a
osd.1                c0n01               host is offline  16h ago    16h  15.2.11               9d01da634b8f  616aeda28504
osd.2                c0n03               host is offline  16h ago    16h  15.2.11               9d01da634b8f  b36453730c83
osd.3                c0n00               running (56m)    4m ago     16h  15.2.11               9d01da634b8f  e043abf53206
prometheus.c0n00     c0n00               running (56m)    4m ago     16h  2.18.1                de242295e225  7cb50c04e26a

After some digging into the daemon logs we found tracebacks (please see
below).  We also noticed that we can successfully reach each host via ssh -F !!!
We ran tcpdumps while upgrading and every SYN gets its SYN-ACK... ;-)

Because we get no errors while deploying a fresh Octopus cluster with cephadm
(from https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm, and
cephadm prepare-host is always OK), could it be a missing Python library or
something else that cephadm itself does not check?
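
A minimal sanity check we can still run on every node (just a sketch; it
assumes root ssh to the hosts and that the standalone cephadm binary is
present there) would be:

for h in c0n00 c0n01 c0n02 c0n03; do
    ssh root@$h "cephadm check-host && python3 --version"
done

As far as we can tell, cephadm check-host only verifies things like the
container runtime, systemd, chrony and lvm2 on each host, not every Python
module the Pacific cephadm code imports.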

Thank you for any hint.

Christoph A

[ceph-users] Re: Host ceph version in dashboard incorrect after upgrade

2021-04-30 Thread mabi
Thank you for the command. I successfully stopped and started the mgr daemon on
that node, but the version number on the Ceph dashboard is still stuck on the
old version 15.2.10. On that node I also have the mon daemon running; should I
also restart the mon?
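
For reference, this is roughly the sequence I ran (the daemon name below is a
placeholder for whatever "ceph orch ps" shows for the mgr on that node):

ceph orch ps | grep mgr                     # find the exact mgr daemon name
ceph orch daemon stop mgr.<node>.<id>       # stop the mgr on the affected node
ceph orch daemon start mgr.<node>.<id>      # start it again after the standby took over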

‐‐‐ Original Message ‐‐‐
On Thursday, April 29, 2021 8:20 PM, Eugen Block  wrote:

> Try this:
>
> ceph orch daemon stop mgr.
>
> and then after another daemon took over its role start it again:
>
> ceph orch daemon start mgr.
>
> Zitat von mabi m...@protonmail.ch:
>
> > I also thought about restarting the MGR service but I am new to ceph
> > and could not find the "cephadm orch" command in order to do that...
> > What would be the command to restart the mgr service on a specific
> > node?
> > ‐‐‐ Original Message ‐‐‐
> > On Thursday, April 29, 2021 7:23 PM, Eugen Block ebl...@nde.ag wrote:
> >
> > > I would restart the active MGR, that should resolve it.
> > > Zitat von mabi m...@protonmail.ch:
> > >
> > > > Hello,
> > > > I upgraded my Octopus test cluster which has 5 hosts because one of
> > > > the nodes (a mon/mgr node) was still on version 15.2.10 but all the
> > > > others on 15.2.11.
> > > > For the upgrade I used the following command:
> > > > ceph orch upgrade start --ceph-version 15.2.11
> > > > The upgrade worked correctly and I did not see any errors in the
> > > > logs but the host version in the ceph dashboard (under the
> > > > navigation Cluster -> Hosts) still shows 15.2.10 for that specific
> > > > node.
> > > > The output of "ceph versions", shows that every component is on
> > > > 15.2.11 as you can see below:
> > > > {
> > > > "mon": {
> > > > "ceph version 15.2.11
> > > > (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)": 3
> > > > },
> > > > "mgr": {
> > > > "ceph version 15.2.11
> > > > (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)": 2
> > > > },
> > > > "osd": {
> > > > "ceph version 15.2.11
> > > > (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)": 2
> > > > },
> > > > "mds": {},
> > > > "overall": {
> > > > "ceph version 15.2.11
> > > > (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)": 7
> > > > }
> > > > }
> > > > So why is it still stuck on 15.2.10 in the dashboard?
> > > > Best regards,
> > > > Mabi



[ceph-users] Specify monitor IP when CIDR detection fails

2021-04-30 Thread Stephen Smith6
I'm running some specialized routing in my environment such that CIDR detection is failing when trying to add monitors. Is there a way to specify the monitor IP address to bind to when adding a monitor if "public_network = 0.0.0.0/0"? Setting "public_network = 0.0.0.0/0" is the only way I could find to bypass CIDR detection but then new monitors are added with the wrong IP address in the monitor map :( I'm running the latest version of Octopus.


[ceph-users] Cannot create issue in bugtracker

2021-04-30 Thread Tobias Urdin
Hello,


Is it only me that's getting an "Internal error" when trying to create issues in the
bugtracker for the last day or two?

https://tracker.ceph.com/issues/new


Best regards


[ceph-users] Re: Performance questions - 4 node (commodity) cluster - what to expect (and what not ;-)

2021-04-30 Thread Mark Lehrer
Can you collect the output of this command on all 4 servers while your
test is running:

iostat -mtxy 1

This should show how busy the CPUs are as well as how busy each drive is.
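
If the drives look mostly idle in iostat, it may also be worth running a
low-queue-depth sync write test directly against the NVMe device, since that
is much closer to what BlueStore's WAL/RocksDB traffic looks like than a
QD32/1M test (sketch only; the device path is a placeholder and this will
overwrite data on it):

fio --name=sync-write --filename=/dev/nvme0n1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --direct=1 --sync=1 \
    --runtime=30 --time_based --group_reporting

Consumer NVMe drives without power loss protection often drop to a few
thousand IOPS or less in this kind of test, which would explain a large part
of the gap.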


On Thu, Apr 29, 2021 at 7:52 AM Schmid, Michael
 wrote:
>
> Hello folks,
>
> I am new to ceph and at the moment I am doing some performance tests with a 4 
> node ceph-cluster (pacific, 16.2.1).
>
> Node hardware (4 identical nodes):
>
>   *   DELL 3620 workstation
>   *   Intel Quad-Core i7-6700@3.4 GHz
>   *   8 GB RAM
>   *   Debian Buster (base system, installed on a dedicated Patriot Burst 120 GB SATA SSD)
>   *   HP 530SFP+ 10 GBit dual-port NIC (tested with iperf at 9.4 GBit/s from node to node)
>   *   1 x Kingston KC2500 M2 NVMe PCIe SSD (500 GB, NO power loss protection!)
>   *   3 x Seagate Barracuda SATA disk drives (7200 rpm, 500 GB)
>
> After bootstrapping a containerized (docker) ceph cluster, I did some
> performance tests on the NVMe storage by creating a storage pool called
> "ssdpool", consisting of 4 OSDs, one per NVMe device (one per node). A first
> write-performance test yields:
>
> =
> root@ceph1:~# rados bench -p ssdpool 10 write -b 4M -t 16 --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 
> for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_ceph1_78
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -          0
>     1      16        30        14    55.997        56    0.0209977   0.493427
>     2      16        53        37   73.9903        92    0.0264305   0.692179
>     3      16        76        60   79.9871        92     0.559505   0.664204
>     4      16        99        83   82.9879        92     0.609332   0.721016
>     5      16       116       100   79.9889        68     0.686093   0.698084
>     6      16       132       116   77.3224        64      1.19715   0.731808
>     7      16       153       137   78.2741        84     0.622646   0.755812
>     8      16       171       155    77.486        72      0.25409   0.764022
>     9      16       192       176   78.2076        84     0.968321   0.775292
>    10      16       214       198   79.1856        88     0.401339   0.766764
>    11       1       214       213   77.4408        60     0.969693   0.784002
> Total time run: 11.0698
> Total writes made:  214
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 77.3272
> Stddev Bandwidth:   13.7722
> Max bandwidth (MB/sec): 92
> Min bandwidth (MB/sec): 56
> Average IOPS:   19
> Stddev IOPS:3.44304
> Max IOPS:   23
> Min IOPS:   14
> Average Latency(s): 0.785372
> Stddev Latency(s):  0.49011
> Max latency(s): 2.16532
> Min latency(s): 0.0144995
> =
>
> ... and I think that 80 MB/s throughput is a very poor result in conjunction 
> with NVMe devices and 10 GBit nics.
>
> A bare write test (with the fsync=0 option) of the NVMe drives yields a write
> throughput of roughly 800 MB/s per device ... the second test (with fsync=1)
> drops performance to 200 MB/s.
>
> =
> root@ceph1:/home/mschmid# fio --rw=randwrite --name=IOPS-write --bs=1024k 
> --direct=1 --filename=/dev/nvme0n1 --numjobs=4 --ioengine=libaio --iodepth=32 
> --refill_buffers --group_reporting --runtime=30 --time_based --fsync=0
> IOPS-write: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, 
> (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32...
> fio-3.12
> Starting 4 processes
> Jobs: 4 (f=4): [w(4)][100.0%][w=723MiB/s][w=722 IOPS][eta 00m:00s]
> IOPS-write: (groupid=0, jobs=4): err= 0: pid=31585: Thu Apr 29 15:15:03 2021
>   write: IOPS=740, BW=740MiB/s (776MB/s)(21.8GiB/30206msec); 0 zone resets
> slat (usec): min=16, max=810, avg=106.48, stdev=30.48
> clat (msec): min=7, max=1110, avg=172.09, stdev=120.18
>  lat (msec): min=7, max=1110, avg=172.19, stdev=120.18
> clat percentiles (msec):
>  |  1.00th=[   32],  5.00th=[   48], 10.00th=[   53], 20.00th=[   63],
>  | 30.00th=[  115], 40.00th=[  161], 50.00th=[  169], 60.00th=[  178],
>  | 70.00th=[  190], 80.00th=[  220], 90.00th=[  264], 95.00th=[  368],
>  | 99.00th=[  667], 99.50th=[  751], 99.90th=[  894], 99.95th=[  986],
>  | 99.99th=[ 1036]
>bw (  KiB/s): min=22528, max=639744, per=25.02%, avg=189649.94, 
> stdev=113845.69, samples=240
>iops: min=   22, max=  624, avg=185.11, stdev=111.18, samples=240
>   lat (msec)   : 10=0.01%, 20=0.19%, 50=6.43%, 100=20.29%, 250=61.52%
>   lat (msec)   : 500=8.21%, 750=2.85%, 1000=0.47%
>   cpu  : usr=11.87%, sys=2.05%, ctx=13141, majf=0, minf=45
>   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.3%, 32=99.4%, >=64=0.0%
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>  complete  

[ceph-users] Re: Specify monitor IP when CIDR detection fails

2021-04-30 Thread Stephen Smith6
At the moment I'm using "ceph orch mon apply mon1,mon2,mon3", and the hostnames "mon1,mon2,mon3" resolve on all nodes to the IP addresses I would like the monitors to bind to.
 
mon1 is the initial bootstrap monitor, which is created with "--mon-ip" (it in turn binds to the appropriate IP).
 
Is there a way to specify "--public-addr" when using the orchestrator plugin and adding a monitor?
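
In case it helps anyone who finds this thread later: the closest thing I've
found in the docs so far (untested here; host names and IPs below are
placeholders) is to take mon placement out of the orchestrator's hands and
add each daemon with an explicit address:

ceph orch apply mon --unmanaged              # stop the orchestrator from scheduling mons itself
ceph orch daemon add mon mon2:192.168.10.12  # bind this mon to an explicit IP
ceph orch daemon add mon mon3:192.168.10.13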
 
- Original message -
From: Michael Moyles 
To: Stephen Smith6 
Cc:
Subject: [EXTERNAL] Re: [ceph-users] Specify monitor IP when CIDR detection fails
Date: Fri, Apr 30, 2021 9:33 AM

What do the monitor logs say? I would think that 0.0.0.0/0 tells the monitor that it can bind to any address it finds on its host. If you know the specific interface or address you want it to bind to you can pass that with --public-addr.

  

On Fri, 30 Apr 2021 at 13:50, Stephen Smith6  wrote:
I'm running some specialized routing in my environment such that CIDR detection is failing when trying to add monitors. Is there a way to specify the monitor IP address to bind to when adding a monitor if "public_network = 0.0.0.0/0"? Setting "public_network = 0.0.0.0/0" is the only way I could find to bypass CIDR detection but then new monitors are added with the wrong IP address in the monitor map :( I'm running the latest version of Octopus.

 --


Michael Moyles | Linux Engineer | michael.moy...@mavensecurities.com
Maven Securities Ltd, 140 Leadenhall Street, London, EC3V 4QT | mavensecurities.com




[ceph-users] Failed cephadm Upgrade - ValueError

2021-04-30 Thread Ashley Merrick
Hello All,

I was running 15.2.8 via cephadm on Docker on Ubuntu 20.04. I just attempted
to upgrade to 16.2.1 via the automated method; it successfully upgraded the
mon/mgr/mds and some OSDs, however it then failed on an OSD and hasn't been
able to get past it even after stopping and restarting the upgrade. It
reported the following: ""message": "Error: UPGRADE_REDEPLOY_DAEMON: Upgrading
daemon osd.35 on host sn-s01 failed.""

If I run 'ceph health detail' I get lots of the following error throughout
the detail report: "ValueError: not enough values to unpack (expected 2, got
1)". Upon googling, it looks like I am hitting something along the lines of
https://158.69.68.89/issues/48924 & https://tracker.ceph.com/issues/49522.

What do I need to do to either get around this bug, or manually upgrade the
remaining OSDs to 16.2.1? Currently my cluster is working, but the last OSD it
failed to upgrade is currently offline (I guess because no image is attached
to it now, as it failed to pull it), and I have a cluster with OSDs on both
15.2.8 and 16.2.1.

Thanks
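
In case it helps, the workaround I'm about to try (untested, and the image tag
is my assumption of the right Pacific tag) is to redeploy the failed OSD
daemon manually with the target image and then resume the upgrade:

ceph orch upgrade pause
ceph orch daemon redeploy osd.35 docker.io/ceph/ceph:v16.2.1
ceph orch upgrade resume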
 
Sent via MXlogin


[ceph-users] Re: Performance questions - 4 node (commodity) cluster - what to expect (and what not ;-)

2021-04-30 Thread Lindsay Mathieson

On 29/04/2021 11:52 pm, Schmid, Michael wrote:

I am new to ceph and at the moment I am doing some performance tests with a 4 
node ceph-cluster (pacific, 16.2.1).


Ceph doesn't do well with small numbers; 4 OSDs is really marginal.
Your latency isn't crash hot either. What size (replica count) are you
running on the pool? The amount of RAM per node (8 GB) is the bare minimum
as well, so your ceph setup is really constrained.



Do your OSDs have access to the raw device? Are they BlueStore?
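
If you're not sure, something along these lines should answer both questions
(pool name from your mail, OSD id 0 just as an example):

ceph osd pool get ssdpool size       # replica count of the test pool
ceph osd pool get ssdpool min_size
ceph osd metadata 0 | grep -e osd_objectstore -e rotational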


Same test on my Cluster

 * 5 Node,
 * 20 OSD's (Total)
 o Mix of SATA and SAS Spinners
 o WAL/DB on SSD
 * 64GB RAM Per node
 * 4 * 1GB Bond


rados bench -p ceph 10 write -b 4M -t 16 --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 
4194304 for up to 10 seconds or 0 objects

Object prefix: benchmark_data_vnh_3642327
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)

    0   0 0 0 0 0   -   0
    1  16    58    42 167.99   168 0.21848    0.329228
    2  16   102    86 171.986   176    0.456715    0.325869
    3  16   154   138 183.983   208    0.109888    0.319586
    4  16   206   190 189.981   208    0.188891    0.320275
    5  16   258   242 193.581   208    0.261014    0.319318
    6  16   308   292 194.647   200    0.450672    0.319268
    7  16   358   342 195.408   200    0.127415    0.316999
    8  16   406   390 194.98   192    0.176382    0.321384
    9  16   456   440 195.535   200    0.287347    0.318749
   10  16   508   492 196.779   208    0.279796    0.318067
Total time run: 10.2741
Total writes made:  508
Write size: 4194304
Object size:    4194304
Bandwidth (MB/sec): 197.78
Stddev Bandwidth:   14.2111
Max bandwidth (MB/sec): 208
Min bandwidth (MB/sec): 168
Average IOPS:   49
Stddev IOPS:    3.55278
Max IOPS:   52
Min IOPS:   42
Average Latency(s): 0.318968
Stddev Latency(s):  0.137534
Max latency(s): 0.913779
Min latency(s): 0.0933294


--
Lindsay



[ceph-users] Re: one of 3 monitors keeps going down

2021-04-30 Thread Eugen Block

Have you checked for disk failure? dmesg, smartctl etc. ?
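
Something along these lines, for example (the device is a placeholder for
whatever backs /var/lib/ceph on that host):

dmesg -T | grep -i -e 'i/o error' -e ata -e nvme
smartctl -a /dev/sdX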


Zitat von "Robert W. Eckert" :

I worked through that workflow- but it seems like the one monitor  
will run for a while - anywhere from an hour to a day, then just stop.


This machine is running on AMD hardware (3600X CPU on X570 chipset)  
while my other two are running on old intel.


I did find this in the service logs

2021-04-30T16:02:40.135+ 7f5d0a94f700 -1 rocksdb: submit_common  
error: Corruption: block checksum mismatch: expected 395334538, got  
4289108204  in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst  
offset 36769734 size 84730 code = 2 Rocksdb transaction:


I am attaching the output of
journalctl -u ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08...@mon.cube.service

The error appears to be here:
	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-61>  
2021-04-30T16:02:38.700+ 7f5d21332700  4 mon.cube@-1(???).mgr  
e702 active server:  
[v2:192.168.2.199:6834/1641928541,v1:192.168.2.199:6835/1641928541](2184157)
	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-60>  
2021-04-30T16:02:38.700+ 7f5d21332700  4 mon.cube@-1(???).mgr  
e702 mkfs or daemon transitioned to available, loading commands
	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-59>  
2021-04-30T16:02:38.701+ 7f5d21332700  4 set_mon_vals no  
callback set
	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-58>  
2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals  
client_cache_size = 32768
	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-57>  
2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals  
container_image =  
docker.io/ceph/ceph@sha256:15b15fb7a708970f1b734285ac08aef45dcd76e86866af37412d041e00853743
	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-56>  
2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals  
log_to_syslog = true
	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-55>  
2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals  
mon_data_avail_warn = 10
	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-54>  
2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals  
mon_warn_on_insecure_global_id_reclaim_allowed = true
	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-53>  
2021-04-30T16:02:38.701+ 7f5d21332700  4 set_mon_vals no  
callback set
	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-52>  
2021-04-30T16:02:38.702+ 7f5d21332700  2 auth: KeyRing::load:  
loaded key file /var/lib/ceph/mon/ceph-cube/keyring
	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-51>  
2021-04-30T16:02:38.702+ 7f5d1095b700  3 rocksdb:  
[db_impl/db_impl_compaction_flush.cc:2808] Compaction error:  
Corruption: block checksum mismatch: expected 395334538, got  
4289108204  in /var/lib/ceph/mon/ceph-	cube/store.db/073501.sst  
offset 36769734 size 84730
	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-50>  
2021-04-30T16:02:38.702+ 7f5d21332700  5 asok(0x56327d226000)  
register_command compact hook 0x56327e028700
	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-49>  
2021-04-30T16:02:38.702+ 7f5d1095b700  4 rocksdb: (Original Log  
Time 2021/04/30-16:02:38.703267) [compaction/compaction_job.cc:760]  
[default] compacted to: base level 6 level multiplier 10.00 max  
bytes base 268435456 files[5 0 	0 0 0 0 2] max score 0.00, MB/sec:  
11035.6 rd, 0.0 wr, level 6, files in(5, 2) out(1) MB in(32.1,  
126.7) out(0.0), read-write-amplify(5.0) write-amplify(0.0)  
Corruption: block checksum mismatch: expected 395334538, got  
4289108204  in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst  
offset 36769734 size 	84730, records in: 7670, records dropped: 6759  
output_compres
	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-48>  
2021-04-30T16:02:38.702+ 7f5d1095b700  4 rocksdb: (Original Log  
Time 2021/04/30-16:02:38.703283) EVENT_LOG_v1 {"time_micros":  
1619798558703277, "job": 3, "event": "compaction_finished",  
"compaction_time_micros": 15085, 	"compaction_time_cpu_micros":  
11937, "output_level": 6, "num_output_files": 1,  
"total_output_size": 12627499, "num_input_records": 7670,  
"num_output_records": 911, "num_subcompactions": 1,  
"output_compression": "NoCompression",  
"num_single_delete_mismatches": 0, 	"num_single_delete_fallthrough":  
0, "lsm_state": [5, 0, 0, 0, 0, 0, 2]}
	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-47>  
2021-04-30T16:02:38.702+ 7f5d1095b700  2 rocksdb:  
[db_impl/db_impl_compaction_flush.cc:2344] Waiting after background  
compaction error: Corruption: block checksum mismatch: expected  
395334538, got 4289108204  in  
	/var/lib/ceph/mon/ceph-cube/store.db/073501.sst offset 36769734  
size 84730, Accumulated background error counts: 1
	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-46>  
2021-04-30T16:02:38.702+ 7f5d21332700  5 asok(0x56327d226000)  
register_command smart hook 0x56327e028700



This is run

[ceph-users] Re: one of 3 monitors keeps going down

2021-04-30 Thread Robert W. Eckert
Nothing is appearing in dmesg.  Smartctl shows no issues either.  

I did find this issue https://tracker.ceph.com/issues/24968 which showed 
something that may be memory related, so I will try testing that next.
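
If the memory test comes back clean, my fallback plan (untested; names are from
my cluster, and cephadm may want the full host name it knows, e.g.
cube.robeckert.us) is to throw away the corrupted mon store and let cephadm
redeploy that monitor so it resyncs from the two healthy ones:

ceph orch daemon rm mon.cube --force
ceph orch daemon add mon cube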


-Original Message-
From: Eugen Block  
Sent: Friday, April 30, 2021 1:36 PM
To: Robert W. Eckert 
Cc: ceph-users@ceph.io; Sebastian Wagner 
Subject: Re: [ceph-users] Re: one of 3 monitors keeps going down

Have you checked for disk failure? dmesg, smartctl etc. ?


Zitat von "Robert W. Eckert" :

> I worked through that workflow- but it seems like the one monitor will 
> run for a while - anywhere from an hour to a day, then just stop.
>
> This machine is running on AMD hardware (3600X CPU on X570 chipset) 
> while my other two are running on old intel.
>
> I did find this in the service logs
>
> 2021-04-30T16:02:40.135+ 7f5d0a94f700 -1 rocksdb: submit_common
> error: Corruption: block checksum mismatch: expected 395334538, got
> 4289108204  in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst
> offset 36769734 size 84730 code = 2 Rocksdb transaction:
>
> I am attaching the output of
> journalctl -u 
> ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08...@mon.cube.service
>
> The error appears to be here:
>   Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-61>  
> 2021-04-30T16:02:38.700+ 7f5d21332700  4 mon.cube@-1(???).mgr
> e702 active server:  
> [v2:192.168.2.199:6834/1641928541,v1:192.168.2.199:6835/1641928541](2184157)
>   Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-60>  
> 2021-04-30T16:02:38.700+ 7f5d21332700  4 mon.cube@-1(???).mgr
> e702 mkfs or daemon transitioned to available, loading commands
>   Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-59>  
> 2021-04-30T16:02:38.701+ 7f5d21332700  4 set_mon_vals no callback 
> set
>   Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-58>  
> 2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals 
> client_cache_size = 32768
>   Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-57>  
> 2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals 
> container_image =
> docker.io/ceph/ceph@sha256:15b15fb7a708970f1b734285ac08aef45dcd76e86866af37412d041e00853743
>   Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-56>  
> 2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals 
> log_to_syslog = true
>   Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-55>  
> 2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals 
> mon_data_avail_warn = 10
>   Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-54>  
> 2021-04-30T16:02:38.701+ 7f5d21332700 10 set_mon_vals 
> mon_warn_on_insecure_global_id_reclaim_allowed = true
>   Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-53>  
> 2021-04-30T16:02:38.701+ 7f5d21332700  4 set_mon_vals no callback 
> set
>   Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-52>  
> 2021-04-30T16:02:38.702+ 7f5d21332700  2 auth: KeyRing::load:  
> loaded key file /var/lib/ceph/mon/ceph-cube/keyring
>   Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-51>  
> 2021-04-30T16:02:38.702+ 7f5d1095b700  3 rocksdb:  
> [db_impl/db_impl_compaction_flush.cc:2808] Compaction error:  
> Corruption: block checksum mismatch: expected 395334538, got  
> 4289108204  in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst  
> offset 36769734 size 84730
>   Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-50>  
> 2021-04-30T16:02:38.702+ 7f5d21332700  5 asok(0x56327d226000) 
> register_command compact hook 0x56327e028700
>   Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-49>  
> 2021-04-30T16:02:38.702+ 7f5d1095b700  4 rocksdb: (Original Log 
> Time 2021/04/30-16:02:38.703267) [compaction/compaction_job.cc:760]
> [default] compacted to: base level 6 level multiplier 10.00 max  
> bytes base 268435456 files[5 00 0 0 0 2] max score 0.00, MB/sec:  
> 11035.6 rd, 0.0 wr, level 6, files in(5, 2) out(1) MB in(32.1,
> 126.7) out(0.0), read-write-amplify(5.0) write-amplify(0.0)
> Corruption: block checksum mismatch: expected 395334538, got
> 4289108204  in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst  
> offset 36769734 size  84730, records in: 7670, records dropped: 6759  
> output_compres
>   Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug-48>  
> 2021-04-30T16:02:38.702+ 7f5d1095b700  4 rocksdb: (Original Log 
> Time 2021/04/30-16:02:38.703283) EVENT_LOG_v1 {"time_micros":
> 1619798558703277, "job": 3, "event": "compaction_finished",  
> "compaction_time_micros": 15085,  "compaction_time_cpu_micros":  
> 11937, "output_level": 6, "num_output_files": 1,
> "total_output_size": 12627499, "num_input_records": 7670,
> "num_output_records": 911, "num_subcompactions": 1,
> "output_compression": "NoCompression",  
> "num_single_delete_mismatches": 0,"num_single_delete_fallthrough":  
> 0, "ls

[ceph-users] Best distro to run ceph.

2021-04-30 Thread Peter Childs
I'm trying to set up a new ceph cluster, and I've hit a bit of a blank.

I started off with centos7 and cephadm. Worked fine to a point, except I
had to upgrade podman but it mostly worked with octopus.

Since this is a fresh cluster and hence no data at risk, I decided to jump
straight into Pacific when it came out and upgrade. Which is where my
trouble began, mostly because Pacific needs a version of lvm later than
what's in CentOS 7.

I can't upgrade to CentOS 8 as my boot drives are not supported there, due to
the way Red Hat disabled lots of disk drivers. I think I'm looking at Ubuntu
or Debian.

Given cephadm has a very limited set of dependencies, it would be good to
have a support matrix. It would also be good to have a check in cephadm on
upgrade that says "no, I won't upgrade" if the version of lvm2 is too low on
any host, and lets the admin fix the issue and try again.

I was thinking of upgrading to CentOS 8 for this project anyway, until I
realised that CentOS 8 can't support the hardware I've inherited. But
currently I've got a broken cluster unless I can work out some way to upgrade
lvm in CentOS 7.
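
For now I'm checking the lvm2 version on every host by hand before trying
anything else, roughly like this (a sketch; assumes password-less root ssh to
the hosts and that jq is installed on the admin node):

for h in $(ceph orch host ls --format json | jq -r '.[].hostname'); do
    ssh root@$h 'rpm -q lvm2 2>/dev/null || dpkg-query -W lvm2'
done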

Peter.


[ceph-users] Large OSD Performance: osd_op_num_shards, osd_op_num_threads_per_shard

2021-04-30 Thread Dave Hall
Hello,

I noticed a couple unanswered questions on this topic from a while back.
It seems, however, worth asking whether adjusting either or both of the
subject attributes could improve performance with large HDD OSDs (mine are
12TB SAS).

In the previous posts on this topic the writers indicated that they had
experimented with increasing either or both of osd_op_num_shards and
osd_op_num_threads_per_shard and had seen performance improvements.  Like
myself, the writers were wondering about any limitations or pitfalls relating
to such adjustments.

Since I would rather not take chances with a 500TB production cluster I am
asking for guidance from this list.

BTW, my cluster is currently running Nautilus 14.2.6 (stock Debian
packages).
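
To make the question concrete, what I have in mind is something along these
lines (values purely illustrative, not a recommendation; as I understand it
the OSDs need to be restarted for the change to take effect):

ceph config get osd osd_op_num_shards_hdd
ceph config get osd osd_op_num_threads_per_shard_hdd
ceph config set osd osd_op_num_shards_hdd 8               # illustrative only
ceph config set osd osd_op_num_threads_per_shard_hdd 2    # illustrative only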

Thank you.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu


[ceph-users] Re: Best distro to run ceph.

2021-04-30 Thread Mark Lehrer
I've had good luck with the Ubuntu LTS releases - no need to add extra
repos.  20.04 uses Octopus.

On Fri, Apr 30, 2021 at 1:14 PM Peter Childs  wrote:
>
> I'm trying to set up a new ceph cluster, and I've hit a bit of a blank.
>
> I started off with centos7 and cephadm. Worked fine to a point, except I
> had to upgrade podman but it mostly worked with octopus.
>
> Since this is a fresh cluster and hence no data at risk, I decided to jump
> straight into Pacific when it came out and upgrade. Which is where my
> trouble began, mostly because Pacific needs a version of lvm later than
> what's in CentOS 7.
>
> I can't upgrade to CentOS 8 as my boot drives are not supported there, due to
> the way Red Hat disabled lots of disk drivers. I think I'm looking at Ubuntu
> or Debian.
>
> Given cephadm has a very limited set of dependencies, it would be good to
> have a support matrix. It would also be good to have a check in cephadm on
> upgrade that says "no, I won't upgrade" if the version of lvm2 is too low on
> any host, and lets the admin fix the issue and try again.
>
> I was thinking of upgrading to CentOS 8 for this project anyway, until I
> realised that CentOS 8 can't support the hardware I've inherited. But
> currently I've got a broken cluster unless I can work out some way to upgrade
> lvm in CentOS 7.
>
> Peter.