[ceph-users] Re: bluefs _allocate unable to allocate

2021-10-12 Thread José H . Freidhof
Hello Igor

"Does single OSD startup (after if's experiencing "unable to allocate)
takes 20 mins as well?"
A: YES

Here the example log of the startup and recovery of a problematic osd.
https://paste.ubuntu.com/p/2WVJbg7cBy/

Here the example log of a problematic osd
https://paste.ubuntu.com/p/qbB6y7663f/

I found this post about a similar error and a bug in 16.2.4... we are
running 16.2.5...maybe the bug is not really fixed???
https://tracker.ceph.com/issues/50656
https://forum.proxmox.com/threads/ceph-16-2-pacific-cluster-crash.92367/
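
For reference, one way to watch whether the bluefs/WAL space on a live OSD keeps
growing (a rough sketch, assuming access to the OSD admin socket; osd.2 is only an
example id):

ceph daemon osd.2 bluefs stats
ceph daemon osd.2 perf dump bluefs | grep -E 'wal|db|slow'

The second command filters the bluefs perf counters (total/used bytes per device),
which should make it visible when WAL files start filling the DB volume.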



On Mon, Oct 11, 2021 at 11:53, Igor Fedotov <igor.fedo...@croit.io> wrote:

> hmm... so it looks like RocksDB still doesn't perform WAL cleanup during
> regular operation but applies it on OSD startup
>
> Does a single OSD startup (after it's experiencing "unable to allocate")
> take 20 mins as well?
>
> Could you please share OSD log containing both that long startup and
> following (e.g. 1+ hour) regular operation?
>
> Preferably for OSD.2 (or whichever one has been using the default
> settings from the deployment).
>
>
> Thanks,
>
> Igor
>
>
> On 10/9/2021 12:18 AM, José H. Freidhof wrote:
>
> Hi Igor,
>
> "And was osd.2 redeployed AFTER settings had been reset to defaults ?"
> A: YES
>
> "Anything particular about current cluster use cases?"
> A: we are using it temporarily as an iSCSI target for a VMware ESXi cluster
> with 6 hosts. We created two 10 TB iSCSI images/LUNs for VMware, because the
> other datastores are at 90%.
> We plan in the future, once Ceph is working right and is stable, to install
> OpenStack and KVM, and we want to convert all VMs into RBD images.
> As I told you, it is a three-OSD-node cluster with 32 cores and 256 GB RAM
> and two 10G bonded network cards on a 10G network.
>
> "E.g. is it a sort of regular usage (with load flukes and peak) or may be
> some permanently running stress load testing. The latter might tend to hold
> the resources and e.g. prevent from internal house keeping...
> A: Its a SAN for vmware and there are running 43 VMs at the moment... at
> the daytime is more stress on the disks because the people are working and
> in the afternoon the iops goes down because the users are at home
> noting speculative...
>
> There is something else that I noticed... if I reboot one OSD node with
> 20 OSDs, it takes 20 min for them to come up... if I tail the logs of the OSDs
> I can see a lot of "recovery log mode 2" on all OSDs.
> After the 20 min the OSDs come up one after another, the WAL DBs are small and
> there is no error in the logs about bluefs _allocate unable to allocate...
>
> It seems that the problem only shows up again after a longer time (12h).
>
>
> On Fri, Oct 8, 2021 at 15:24, Igor Fedotov <igor.fedo...@croit.io> wrote:
>
>> And was osd.2 redeployed AFTER settings had been reset to defaults ?
>>
>> Anything particular about current cluster use cases?
>>
>> E.g. is it a sort of regular usage (with load fluctuations and peaks) or maybe
>> some permanently running stress load testing. The latter might tend to hold
>> the resources and e.g. prevent internal housekeeping...
>>
>> Igor
>>
>>
>> On 10/8/2021 12:16 AM, José H. Freidhof wrote:
>>
>> Hi Igor,
>>
>> yes the same problem is on osd.2
>>
>> we have 3 OSD Nodes... Each Node has 20 Bluestore OSDs ... in total we
>> have 60 OSDs
>> i checked right now one node... and 15 of 20 OSDs have this problem and
>> error in the log.
>>
>> The settings that you complained about a few emails ago I have reverted
>> to defaults.
>>
>> ceph.conf file:
>>
>> [global]
>> fsid = 462c44b4-eed6-11eb-8b2c-a1ad45f88a97
>> mon_host = [v2:10.50.50.21:3300/0,v1:10.50.50.21:6789/0] [v2:
>> 10.50.50.22:3300/0,v1:10.50.50.22:6789/0] [v2:
>> 10.50.50.20:3300/0,v1:10.50.50.20:6789/0]
>> log file = /var/log/ceph/$cluster-$type-$id.log
>> max open files = 131072
>> mon compact on trim = False
>> osd deep scrub interval = 137438953472
>> osd max scrubs = 16
>> osd objectstore = bluestore
>> osd op threads = 2
>> osd scrub load threshold = 0.01
>> osd scrub max interval = 137438953472
>> osd scrub min interval = 137438953472
>> perf = True
>> rbd readahead disable after bytes = 0
>> rbd readahead max bytes = 4194304
>> throttler perf counter = False
>>
>> [client]
>> rbd cache = False
>>
>>
>> [mon]
>> mon health preluminous compat = True
>> mon osd down out interval = 300
>>
>> [osd]
>> bluestore cache autotune = 0
>> bluestore cache kv ratio = 0.2
>> bluestore cache meta ratio = 0.8
>> bluestore extent map shard max size = 200
>> bluestore extent map shard min size = 50
>> bluestore extent map shard target size = 100
>> bluestore rocksdb options =
>> compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_siz

[ceph-users] Where is my free space?

2021-10-12 Thread Szabo, Istvan (Agoda)
Hi,

377 TiB is the total cluster size, the data pool is 4+2 EC, and 66 TiB is stored;
how can the data pool be at 60% used?!


Some output:
ceph df detail
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
nvme   12 TiB   11 TiB   128 MiB  1.2 TiB        9.81
ssd    377 TiB  269 TiB  100 TiB  108 TiB       28.65
TOTAL  389 TiB  280 TiB  100 TiB  109 TiB       28.06

--- POOLS ---
POOL                    ID  PGS  STORED   (DATA)   (OMAP)   OBJECTS  USED     (DATA)   (OMAP)   %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY   USED COMPR  UNDER COMPR
device_health_metrics    1    1   49 MiB      0 B   49 MiB       50   98 MiB      0 B   98 MiB      0     73 TiB            N/A          N/A      50         0 B          0 B
.rgw.root                2   32  1.1 MiB  1.1 MiB  4.5 KiB      159  3.9 MiB  3.9 MiB   12 KiB      0     56 TiB            N/A          N/A     159         0 B          0 B
ash.rgw.log              6   32  1.8 GiB   46 KiB  1.8 GiB   73.83k  4.3 GiB  4.4 MiB  4.3 GiB      0     59 TiB            N/A          N/A  73.83k         0 B          0 B
ash.rgw.control          7   32  2.9 KiB      0 B  2.9 KiB        8  7.7 KiB      0 B  7.7 KiB      0     56 TiB            N/A          N/A       8         0 B          0 B
ash.rgw.meta             8    8  554 KiB  531 KiB   23 KiB    1.93k   22 MiB   22 MiB   70 KiB      0    3.4 TiB            N/A          N/A   1.93k         0 B          0 B
ash.rgw.buckets.index   10  128  406 GiB      0 B  406 GiB   58.69k  1.2 TiB      0 B  1.2 TiB  10.33    3.4 TiB            N/A          N/A  58.69k         0 B          0 B
ash.rgw.buckets.data    11   32   66 TiB   66 TiB      0 B    1.21G   86 TiB   86 TiB      0 B  37.16    111 TiB            N/A          N/A   1.21G         0 B          0 B
ash.rgw.buckets.non-ec  15   32  8.4 MiB    653 B  8.4 MiB       22   23 MiB  264 KiB   23 MiB      0     54 TiB            N/A          N/A      22         0 B          0 B




rados df
POOL_NAME               USED     OBJECTS     CLONES  COPIES      MISSING_ON_PRIMARY  UNFOUND  DEGRADED   RD_OPS       RD       WR_OPS       WR       USED COMPR  UNDER COMPR
.rgw.root               3.9 MiB         159       0         477                   0        0        60      8905420   20 GiB         8171   19 MiB         0 B          0 B
ash.rgw.buckets.data     86 TiB  1205539864       0  7233239184                   0        0 904168110  36125678580  153 TiB  55724221429  174 TiB         0 B          0 B
ash.rgw.buckets.index   1.2 TiB       58688       0      176064                   0        0         0  65848675184   62 TiB  10672532772  6.8 TiB         0 B          0 B
ash.rgw.buckets.non-ec   23 MiB          22       0          66                   0        0         6      3999256  2.3 GiB      1369730  944 MiB         0 B          0 B
ash.rgw.control         7.7 KiB           8       0          24                   0        0         3            0      0 B            8      0 B         0 B          0 B
ash.rgw.log             4.3 GiB       73830       0      221490                   0        0     39282  36922450608   34 TiB   5420884130  1.8 TiB         0 B          0 B
ash.rgw.meta             22 MiB        1931       0        5793                   0        0         0    692302142  528 GiB      4274154  2.0 GiB         0 B          0 B
device_health_metrics    98 MiB          50       0         150                   0        0        50        13588   40 MiB        17758   46 MiB         0 B          0 B

total_objects    1205674552
total_used       109 TiB
total_avail      280 TiB
total_space      389 TiB



4 OSDs are down because I am migrating the DB to the block device.

ceph osd tree
ID   CLASS  WEIGHT     TYPE NAME          STATUS  REWEIGHT  PRI-AFF
 -1         398.17001  root default
-11          61.12257      host server01
 24   nvme    1.74660          osd.24         up   1.0  1.0
  0   ssd    14.84399          osd.0        down   1.0  1.0
 10   ssd    14.84399          osd.10       down   1.0  1.0
 14   ssd    14.84399          osd.14       down   1.0  1.0
 20   ssd    14.84399          osd.20       down   1.0  1.0
 -5          61.12257      host server02
 25   nvme    1.74660          osd.25         up   1.0  1.0
  1   ssd    14.84399          osd.1          up   1.0  1.0
  7   ssd    14.84399          osd.7          up   1.0  1.0
 13   ssd    14.84399          osd.13         up   1.0  1.0
 19   ssd    14.84399          osd.19         up   1.0  1.0
 -9          61.12257      host server03
 26   nvme    1.74660          osd.26         up   1.0  1.0
  3   ssd    14.84399          osd.3          up   1.0  1.0
  9   ssd    14.84399          osd.9          up   1.0  1.0
 16   ssd    14.84399          osd.16         up   1.0  1

[ceph-users] Re: bluefs _allocate unable to allocate

2021-10-12 Thread José H . Freidhof
Hi Igor,

Thx for checking the logs... but what the hell is going on here? :-)
Yes, it's true, I tested and created the OSDs with three
different RocksDB options.
I cannot understand why the OSDs don't all have the same RocksDB options,
because I created ALL OSDs anew after setting and testing those settings.

Maybe I am doing something wrong with the re-deployment of the OSDs?
What I do:
ceph osd out osd.x
ceph osd down osd.x
systemctl stop ceph-osd@x
ceph osd rm osd.x
ceph osd crush rm osd.x
ceph auth del osd.x
ceph-volume lvm zap --destroy /dev/ceph-block-0/block-0 (lvm hdd partition)
ceph-volume lvm zap --destroy /dev/ceph-db-0/db-0 (lvm ssd partition)
ceph-volume lvm zap --destroy /dev/ceph-wal-0/wal-db-0 (lvm nvme  partition)
...

Later I recreate the OSDs with:
cephadm shell -m /var/lib/ceph
ceph auth export client.bootstrap-osd
vi /var/lib/ceph/bootstrap-osd/ceph.keyring
ceph-volume lvm prepare --no-systemd --bluestore --data
ceph-block-4/block-4 --block.wal ceph-wal-0/waldb-4 --block.db
ceph-db-0/db-4
cp -r /var/lib/ceph/osd /mnt/ceph/
Exit the shell in the container.
cephadm --image ceph/ceph:v16.2.5 adopt --style legacy --name osd.X
systemctl start ceph-462c44b4-eed6-11eb-8b2c-a1ad45f88...@osd.xx.service


Igor, one question:
is there actually an easier way to recreate the OSDs? Maybe via the
dashboard?
Can you recommend something?

I have no problem creating the OSDs on the nodes again, but I need to be
sure that no old setting stays on the OSD.
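
For comparison, a sketch of the orchestrator-based way to replace an OSD on a
cephadm-managed cluster (not an authoritative recommendation; the id, hostname and
device path below are placeholders/examples):

ceph orch osd rm 5 --replace                             # drain and remove, keep the id free for reuse
ceph orch device zap cd88-ceph-osdh-01 /dev/sdX --force  # wipe the old LVs on that device
ceph orch daemon add osd cd88-ceph-osdh-01:/dev/sdX      # simple case: whole device, no separate db/wal

With separate WAL/DB devices like here, the usual route is an OSD service
specification applied via "ceph orch apply osd -i osd_spec.yml" instead of the last
command, so the db/wal placement is recreated consistently every time.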



On Tue, Oct 12, 2021 at 12:03, Igor Fedotov <igor.fedo...@croit.io> wrote:

> Hey Jose,
>
> your rocksdb settings are still different from the default ones.
>
> These are options you shared originally:
>
>
> compression=kNoCompression,max_write_buffer_number=64,min_write_buffer_number_to_merge=32,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target_file_size_base=4MB,max_background_compactions=64,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,max_bytes_for_level_base=6GB,compaction_threads=32,flusher_threads=8,compaction_readahead_size=2MB
>
> These are ones I could find in  osd.5 startup log, note e.g.
> max_write_buffer_number:
>
> Oct 12 09:09:30 cd88-ceph-osdh-01 bash[1572206]: debug
> 2021-10-12T07:09:30.686+ 7f16d24a0080  1
> bluestore(/var/lib/ceph/osd/ceph-5) _open_db opened rocksdb path db options
> compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
>
> And here are the ones I'd expect as defaults - again please note
> max_write_buffer_number:
>
>
> compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,max_total_wal_size=1073741824
>
>
> And here is the source code for v16.2.5 where the expected default line
> comes from:
>
>
> https://github.com/ceph/ceph/blob/0883bdea7337b95e4b611c768c0279868462204a/src/common/options.cc#L4644
>
>
> Not that I'm absolutely sure this is the actual root cause, but I'd
> suggest reverting to the baseline before proceeding with the
> troubleshooting...
>
> So please adjust properly and restart the OSDs!!! Hopefully it won't need a
> redeployment...
>
>
> As for https://tracker.ceph.com/issues/50656 - it's irrelevant to your
> case. It was unexpected ENOSPC result from an allocator which still had
> enough free space. But in your case bluefs allocator doesn't have free
> space at all as the latter is totally wasted by tons of WAL files.
>
>
> Thanks,
>
> Igor
>
>
>
> On 10/12/2021 10:51 AM, José H. Freidhof wrote:
>
> Hello Igor
>
> "Does single OSD startup (after if's experiencing "unable to allocate)
> takes 20 mins as well?"
> A: YES
>
> Here the example log of the startup and recovery of a problematic osd.
> https://paste.ubuntu.com/p/2WVJbg7cBy/
>
> Here the example log of a problematic osd
> https://paste.ubuntu.com/p/qbB6y7663f/
>
> I found this post about a similar error and a bug in 16.2.4... we are
> running 16.2.5...maybe the bug is not really fixed???
> https://tracker.ceph.com/issues/50656
> https://forum.proxmox.com/threads/ceph-16-2-pacific-cluster-crash.92367/
>
>
>
> On Mon, Oct 11, 2021 at 11:53, Igor Fedotov <igor.fedo...@croit.io> wrote:
>
>> hmm... so it looks like RocksDB still doesn't perform WAL cleanup during
>> regular operation but applies it on OSD startup
>>
>> Does a single OSD startup (after it's experiencing "unable to allocate")
>> take 20 mins as well?
>>
>> Could you please s

[ceph-users] get_health_metrics reporting slow ops and gw outage

2021-10-12 Thread Szabo, Istvan (Agoda)
Hi,

Many of my OSDs are having this issue, which causes 10-15 ms OSD write operation
latency and more than 60 ms read operation latency.
This causes RGW to wait for operations, and after a while the RGWs just restarted
(all of them in my cluster) and only became available again after the slow ops
disappeared.

I see a similar issue but haven't really seen a solution anywhere:
https://tracker.ceph.com/issues/44184

I'm facing this issue in 2 of my 3 clusters in a multisite environment
(Octopus 15.2.14). Some background information on where I'm facing this issue:
before this I had many flapping OSDs and even some unfound objects; not sure
whether that would be related to this.

2021-10-12T09:59:45.542+0700 7fa0445a7700 -1 osd.46 32739 get_health_metrics 
reporting 205 slow ops, oldest is osd_op(client.115442393.0:1420913395 28.23s0 
28:c4b40264:::9213182a-14ba-48ad-bde9-289a1c0c0de8.29868038.12_geo%2fpoi%2f1718955%2f7fc1308d421939a23614908dda8ff659.jpg:head
 [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e32739)
2021-10-12T09:59:46.583+0700 7fa0445a7700 -1 osd.46 32739 get_health_metrics 
reporting 205 slow ops, oldest is osd_op(client.115442393.0:1420913395 28.23s0 
28:c4b40264:::9213182a-14ba-48ad-bde9-289a1c0c0de8.29868038.12_geo%2fpoi%2f1718955%2f7fc1308d421939a23614908dda8ff659.jpg:head
 [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e32739)
2021-10-12T09:59:47.581+0700 7fa0445a7700 -1 osd.46 32739 get_health_metrics 
reporting 205 slow ops, oldest is osd_op(client.115442393.0:1420913395 28.23s0 
28:c4b40264:::9213182a-14ba-48ad-bde9-289a1c0c0de8.29868038.12_geo%2fpoi%2f1718955%2f7fc1308d421939a23614908dda8ff659.jpg:head
 [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e32739)
2021-10-12T09:59:48.551+0700 7fa0445a7700 -1 osd.46 32739 get_health_metrics 
reporting 205 slow ops, oldest is osd_op(client.115442393.0:1420913395 28.23s0 
28:c4b40264:::9213182a-14ba-48ad-bde9-289a1c0c0de8.29868038.12_geo%2fpoi%2f1718955%2f7fc1308d421939a23614908dda8ff659.jpg:head
 [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e32739)
2021-10-12T09:59:49.592+0700 7fa0445a7700 -1 osd.46 32739 get_health_metrics 
reporting 205 slow ops, oldest is osd_op(client.115442393.0:1420913395 28.23s0 
28:c4b40264:::9213182a-14ba-48ad-bde9-289a1c0c0de8.29868038.12_geo%2fpoi%2f1718955%2f7fc1308d421939a23614908dda8ff659.jpg:head
 [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e32739)

Haven't really found anything about this on the mailing list either :/
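
For what it's worth, a way to see what the blocked requests actually are while this
is happening (a sketch, assuming shell access to the OSD host and its admin socket;
osd.46 is taken from the log above):

ceph daemon osd.46 dump_ops_in_flight
ceph daemon osd.46 dump_blocked_ops
ceph daemon osd.46 dump_historic_slow_ops

The historic dump keeps the slowest recent ops with per-stage timestamps, which
usually narrows down whether the time is spent on the disk, on peer OSDs, or
waiting for the PG.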

Thank you
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: bluefs _allocate unable to allocate

2021-10-12 Thread José H . Freidhof
Hi Igor

the reason why I tested different RocksDB options is that I was having
really bad write performance with the default settings (30-60 MB/s) on the
cluster...

Currently I have 200 MB/s read and 180 MB/s write performance.

Now I don't know which of the two settings is the good one.

Another question:
which of the two can you recommend?

https://gist.github.com/likid0/1b52631ff5d0d649a22a3f30106ccea7
bluestore rocksdb options =
compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB

https://yourcmc.ru/wiki/Ceph_performance
bluestore_rocksdb_options =
compression=kNoCompression,max_write_buffer_number=64,min_write_buffer_number_to_merge=32,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target_file_size_base=4MB,max_background_compactions=64,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,max_bytes_for_level_base=6GB,compaction_threads=32,flusher_threads=8,compaction_readahead_size=2MB
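
Not a recommendation for either of those strings, but if the goal is to get back to
the built-in defaults Igor quoted, one way (assuming the override lives in the
monitor config database and/or the [osd] section of ceph.conf) would be:

ceph config rm osd bluestore_rocksdb_options     # drop a config-db override, if one exists
# ...and remove the "bluestore rocksdb options = ..." line from ceph.conf on the hosts
ceph config get osd bluestore_rocksdb_options    # check what value would now apply
# then restart the OSDs, since the option is only read at startup

The startup log line Igor grepped ("_open_db opened rocksdb path db options ...")
is the final word on what a given OSD is actually running with.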



On Tue, Oct 12, 2021 at 12:35, José H. Freidhof <harald.freid...@googlemail.com> wrote:

> Hi Igor,
>
> Thx for checking the logs.. but what the hell is going on here? :-)
> Yes its true i tested the and created the osd´s with three
> different rockdb options.
> I can not understand why the osd dont have the same rockdb option, because
> i have created ALL OSDs new after set and test those settings.
>
> Maybe i do something wrong with the re-deployment of the osds?
> What i do:
> ceph osd out osd.x
> ceph osd down osd.x
> systemctl stop ceph-osd@x
> ceph osd rm osd.x
> ceph osd crush rm osd.x
> ceph auth del osd.x
> ceph-volume lvm zap --destroy /dev/ceph-block-0/block-0 (lvm hdd partition)
> ceph-volume lvm zap --destroy /dev/ceph-db-0/db-0 (lvm ssd partition)
> ceph-volume lvm zap --destroy /dev/ceph-wal-0/wal-db-0 (lvm nvme
> partition)
> ...
>
> Later i recreate the osds with:
> cephadm shell -m /var/lib/ceph
> ceph auth export client.bootstrap-osd
> vi /var/lib/ceph/bootstrap-osd/ceph.keyring
> ceph-volume lvm prepare --no-systemd --bluestore --data
> ceph-block-4/block-4 --block.wal ceph-wal-0/waldb-4 --block.db
> ceph-db-0/db-4
> cp -r /var/lib/ceph/osd /mnt/ceph/
> Exit the shell in the container.
> cephadm --image ceph/ceph:v16.2.5 adopt --style legacy --name osd.X
> systemctl start ceph-462c44b4-eed6-11eb-8b2c-a1ad45f88...@osd.xx.service
>
>
> Igor one question:
> is there actually an easier way to recreate the osd? maybe over the
> dashboard?
> can you recommend something?
>
> i have no problem to create the osd on the nodes again, but i need to be
> sure that no old setting stays on the osd.
>
>
>
> On Tue, Oct 12, 2021 at 12:03, Igor Fedotov <igor.fedo...@croit.io> wrote:
>
>> Hey Jose,
>>
>> your rocksdb settings are still different from the default ones.
>>
>> These are options you shared originally:
>>
>>
>> compression=kNoCompression,max_write_buffer_number=64,min_write_buffer_number_to_merge=32,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target_file_size_base=4MB,max_background_compactions=64,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,max_bytes_for_level_base=6GB,compaction_threads=32,flusher_threads=8,compaction_readahead_size=2MB
>>
>> These are ones I could find in  osd.5 startup log, note e.g.
>> max_write_buffer_number:
>>
>> Oct 12 09:09:30 cd88-ceph-osdh-01 bash[1572206]: debug
>> 2021-10-12T07:09:30.686+ 7f16d24a0080  1
>> bluestore(/var/lib/ceph/osd/ceph-5) _open_db opened rocksdb path db options
>> compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
>>
>> And here are the ones I'd expect as defaults - again please note
>> max_write_buffer_number:
>>
>>
>> compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,max_total_wal_size=1073741824
>>
>>
>> And here is the source code for v16.2.5 where the expected default line
>> comes from:
>>
>>
>> https://github.com/ceph/ceph

[ceph-users] Re: Where is my free space?

2021-10-12 Thread Stefan Kooman

On 10/12/21 07:21, Szabo, Istvan (Agoda) wrote:

Hi,

377TiB is the total cluster size, data pool 4:2 ec, stored 66TiB, how can be 
the data pool on 60% used??!!


Space amplification? It depends, among other things (like object size), on the
min_alloc size you use for the OSDs. See this thread [1] and this
spreadsheet [2].
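
As a rough back-of-the-envelope illustration of that effect (example numbers only,
assuming a 64 KiB min_alloc_size, the old HDD default, and the 4+2 EC profile of
the data pool):

  4 MiB object  -> 6 chunks of 1 MiB               = 6 MiB raw   (1.5x, the nominal EC overhead)
  16 KiB object -> 6 chunks, each padded to 64 KiB = 384 KiB raw (24x)

So with a large share of small RGW objects the USED/STORED ratio can end up well
above the nominal 1.5x, while with a 4 KiB SSD min_alloc_size the padding effect
is much smaller.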


Gr. Stefan

[1]: 
https://lists.ceph.io/hyperkitty/list/d...@ceph.io/thread/NIVVTSR2YW22VELM4BW4S6NQUCS3T4XW/
[2]: 
https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-12 Thread Igor Fedotov

Istvan,

you're bitten by https://github.com/ceph/ceph/pull/43140

It's not fixed in 15.2.14. This has got a backport to the upcoming Octopus
minor release. Please do not use the 'migrate' command from WAL/DB to the slow
volume if some data is already present there...


Thanks,

Igor


On 10/12/2021 12:13 PM, Szabo, Istvan (Agoda) wrote:

Hi Igor,

I’ve attached here, thank you in advance.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Igor Fedotov 
Sent: Monday, October 11, 2021 10:40 PM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io; Eugen Block ; 胡 玮文 
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


No,

that's just backtrace of the crash - I'd like to see the full OSD log from the 
process startup till the crash instead...
On 10/8/2021 4:02 PM, Szabo, Istvan (Agoda) wrote:
Hi Igor,

Here is a bluestore tool fsck output:
https://justpaste.it/7igrb

Is this that you are looking for?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Igor Fedotov 
Sent: Tuesday, October 5, 2021 10:02 PM
To: Szabo, Istvan (Agoda) ; 胡 玮文 

Cc: ceph-users@ceph.io; Eugen Block 

Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Not sure dmcrypt is a culprit here.

Could you please set debug-bluefs to 20 and collect an OSD startup log.


On 10/5/2021 4:43 PM, Szabo, Istvan (Agoda) wrote:
Hmm, tried another one whose disk hadn't spilled over, still coredumped ☹
Is there any special thing we need to do before we migrate the DB next to the
block device? Our OSDs are using dmcrypt, is that an issue?

{
 "backtrace": [
 "(()+0x12b20) [0x7f310aa49b20]",
 "(gsignal()+0x10f) [0x7f31096aa37f]",
 "(abort()+0x127) [0x7f3109694db5]",
 "(()+0x9009b) [0x7f310a06209b]",
 "(()+0x9653c) [0x7f310a06853c]",
 "(()+0x95559) [0x7f310a067559]",
 "(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",
 "(()+0x10b03) [0x7f3109a48b03]",
 "(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",
 "(__cxa_throw()+0x3b) [0x7f310a0687eb]",
 "(()+0x19fa4) [0x7f310b7b6fa4]",
 "(tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0x146) 
[0x7f310b7d8c96]",
 "(()+0x10d0f8e) [0x55ffa520df8e]",
 "(rocksdb::Version::~Version()+0x104) [0x55ffa521d174]",
 "(rocksdb::Version::Unref()+0x21) [0x55ffa521d221]",
 "(rocksdb::ColumnFamilyData::~ColumnFamilyData()+0x5a) 
[0x55ffa52efcca]",
 "(rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x88) [0x55ffa52f0568]",
 "(rocksdb::VersionSet::~VersionSet()+0x5e) [0x55ffa520e01e]",
 "(rocksdb::VersionSet::~VersionSet()+0x11) [0x55ffa520e261]",
 "(rocksdb::DBImpl::CloseHelper()+0x616) [0x55ffa5155ed6]",
 "(rocksdb::DBImpl::~DBImpl()+0x83b) [0x55ffa515c35b]",
 "(rocksdb::DBImplReadOnly::~DBImplReadOnly()+0x11) [0x55ffa51a3bc1]",
 "(rocksdb::DB::OpenForReadOnly(rocksdb::DBOptions const&, std::__cxx11::basic_string, 
std::allocator > const&, std::vector > const&, std::vector >*, rocksdb::DB**, bool)+0x1089) [0x55ffa51a57e9]",
 "(RocksDBStore::do_open(std::ostream&, bool, bool, 
std::vector > 
const*)+0x14ca) [0x55ffa51285ca]",
 "(BlueStore::_open_db(bool, bool, bool)+0x1314) [0x55ffa4bc27e4]",
 "(BlueStore::_open_db_and_around(bool)+0x4c) [0x55ffa4bd4c5c]",
 "(BlueStore::_mount(bool, bool)+0x847) [0x55ffa4c2e047]",
 "(OSD::init()+0x380) [0x55ffa4753a70]",
 "(main()+0x47f1) [0x55ffa46a6901]",
 "(__libc_start_main()+0xf3) [0x7f3109696493]",
 "(_start()+0x2e) [0x55ffa46d4e3e]"
 ],
 "ceph_version": "15.2.14",
 "crash_id": 
"2021-10-05T13:31:28.513463Z_b6818598-4960-4ed6-942a-d4a7ff37a758",
 "entity_name": "osd.48",
 "os_id": "centos",
 "os_name": "CentOS Linux",
 "os_version": "8",
 "os_version_id": "8",
 "process_name": "ceph-osd",
 "stack_sig": 
"6a43b6c219adac393b239fbea4a53ff87c4185bcd213724f0d721b452b81ddbf",
 "timestamp": "2021-10-05T13:31:28.513463Z",
 "utsname_hostname": "server-2s07",
 "utsname_machine": "x86_64",
 "utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
 "utsname_sysname": "Linux

[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-12 Thread Igor Fedotov
You mean you ran migrate for these 72 OSDs and all of them aren't
starting any more? Or did you just upgrade them to Octopus and are experiencing
performance issues?


In the latter case, and if you have enough space on the DB device, you might
want to try to migrate data from slow to DB first. Run fsck (just in
case) and then migrate from DB/WAL back to slow.


Theoretically this should help in avoiding the before-mentioned bug.
But I haven't tried that personally...


And this wouldn't fix the corrupted OSDs if any though...


Thanks,

Igor

On 10/12/2021 2:36 PM, Szabo, Istvan (Agoda) wrote:

Omg, I’ve already migrated 24 OSDs in each DC (altogether 72).
What should I do then? 12 are left (altogether 36). In my case the slow device
is faster in random-write IOPS than the one which is serving it.


Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com 
---


On 2021. Oct 12., at 13:21, Igor Fedotov  wrote:

Email received from the internet. If in doubt, don't click any link 
nor open any attachment !



Istvan,

you're bitten by

It's not fixed in 15.2.14. This has got a backport to upcoming Octopus
minor release. Please do not use 'migrate' command from WAL/DB to slow
volume if some data is already present there...

Thanks,

Igor


On 10/12/2021 12:13 PM, Szabo, Istvan (Agoda) wrote:

Hi Igor,

I’ve attached here, thank you in advance.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Igor Fedotov 
Sent: Monday, October 11, 2021 10:40 PM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io; Eugen Block ; 胡 玮文 

Subject: Re: [ceph-users] Re: is it possible to remove the db+wal 
from an external device (nvme)


Email received from the internet. If in doubt, don't click any link 
nor open any attachment !



No,

that's just backtrace of the crash - I'd like to see the full OSD 
log from the process startup till the crash instead...

On 10/8/2021 4:02 PM, Szabo, Istvan (Agoda) wrote:
Hi Igor,

Here is a bluestore tool fsck output:
https://justpaste.it/7igrb

Is this that you are looking for?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Igor Fedotov 
Sent: Tuesday, October 5, 2021 10:02 PM
To: Szabo, Istvan (Agoda) 
; 胡 玮文 

Cc: ceph-users@ceph.io; Eugen Block 

Subject: Re: [ceph-users] Re: is it possible to remove the db+wal 
from an external device (nvme)


Email received from the internet. If in doubt, don't click any link 
nor open any attachment !



Not sure dmcrypt is a culprit here.

Could you please set debug-bluefs to 20 and collect an OSD startup log.


On 10/5/2021 4:43 PM, Szabo, Istvan (Agoda) wrote:
Hmm, tried another one which hasn’t been spilledover disk, still 
coredumped ☹
Is there any special thing that we need to do before we migrate db 
next to the block? Our osds are using dmcrypt, is it an issue?


{
"backtrace": [
"(()+0x12b20) [0x7f310aa49b20]",
"(gsignal()+0x10f) [0x7f31096aa37f]",
"(abort()+0x127) [0x7f3109694db5]",
"(()+0x9009b) [0x7f310a06209b]",
"(()+0x9653c) [0x7f310a06853c]",
"(()+0x95559) [0x7f310a067559]",
"(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",
"(()+0x10b03) [0x7f3109a48b03]",
"(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",
"(__cxa_throw()+0x3b) [0x7f310a0687eb]",
"(()+0x19fa4) [0x7f310b7b6fa4]",
"(tcmalloc::allocate_full_cpp_throw_oom(unsigned 
long)+0x146) [0x7f310b7d8c96]",

"(()+0x10d0f8e) [0x55ffa520df8e]",
"(rocksdb::Version::~Version()+0x104) [0x55ffa521d174]",
"(rocksdb::Version::Unref()+0x21) [0x55ffa521d221]",
"(rocksdb::ColumnFamilyData::~ColumnFamilyData()+0x5a) 
[0x55ffa52efcca]",
"(rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x88) 
[0x55ffa52f0568]",

"(rocksdb::VersionSet::~VersionSet()+0x5e) [0x55ffa520e01e]",
"(rocksdb::VersionSet::~VersionSet()+0x11) [0x55ffa520e261]",
"(rocksdb::DBImpl::CloseHelper()+0x616) [0x55ffa5155ed6]",
"(rocksdb::DBImpl::~DBImpl()+0x83b) [0x55ffa515c35b]",
"(rocksdb::DBImplReadOnly::~DBImplReadOnly()+0x11) 
[0x55ffa51a3bc1]",
"(rocksdb::DB::OpenForReadOnly(rocksdb::DBOptions const&, 
std::__cxx11::basic_string, 
std::allocator > const&, 
std::vectorstd::allocator > const&, 
st

[ceph-users] ceph full-object read crc != expected on xxx:head

2021-10-12 Thread Frank Schilder
Is there a way (mimic latest) to find out which PG contains the object that 
caused this error:

2021-10-11 23:46:19.631006 osd.335 osd.335 192.168.32.87:6838/8605 623 : 
cluster [ERR]  full-object read crc 0x6c3a7719 != expected 0xd27f7a2c on 
19:28b9843f:::3b43237.:head

In all references I could find the error message contains the PG. The above 
doesn't. There is no additional information in the OSD log of 335.

The above read error did not create a health warn/error state. Is this error 
automatically fixed?
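
Not authoritative, but the leading "19" in the hobject ("19:28b9843f:...") should
be the pool id, and the PG for a given object can be looked up by name, so
something like the following may answer both questions (pool and object names are
placeholders, since the object name is truncated above):

ceph osd pool ls detail | grep "^pool 19 "        # which pool has id 19
ceph osd map <poolname> <objectname>              # prints the PG and the acting set
ceph pg deep-scrub <pgid>                         # re-verify that PG
rados list-inconsistent-obj <pgid> --format=json-pretty   # after the scrub, list inconsistencies

A deep scrub of that PG is the usual way to confirm whether anything is actually
left inconsistent.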

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Where is my free space?

2021-10-12 Thread Szabo, Istvan (Agoda)
I see. I'm using SSDs, so it shouldn't be a problem I guess, because the
"bluestore_min_alloc_size": "0" is overridden by the
"bluestore_min_alloc_size_ssd": "4096"?

-Original Message-
From: Stefan Kooman  
Sent: Tuesday, October 12, 2021 2:19 PM
To: Szabo, Istvan (Agoda) ; ceph-users@ceph.io
Subject: Re: [ceph-users] Where is my free space?

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


On 10/12/21 07:21, Szabo, Istvan (Agoda) wrote:
> Hi,
>
> 377TiB is the total cluster size, data pool 4:2 ec, stored 66TiB, how can be 
> the data pool on 60% used??!!

Space amplification? It depends on, among others (like object size), the 
min_alloc size you use for the OSDs. See this thread [1], and this spreadsheet 
[2].

Gr. Stefan

[1]:
https://lists.ceph.io/hyperkitty/list/d...@ceph.io/thread/NIVVTSR2YW22VELM4BW4S6NQUCS3T4XW/
[2]:
https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Metrics for object sizes

2021-10-12 Thread Szabo, Istvan (Agoda)
Hi,

Just got the chance to have a look, but I see Lua scripting is new in
Pacific ☹
I have Octopus 15.2.14; will it be backported, or is there no chance?
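
In the meantime, a rough per-bucket approximation that already works on Octopus
(hedged: it only gives an average object size per bucket, not a distribution):

radosgw-admin bucket stats --bucket=<bucket-name>
# "size_actual" divided by "num_objects" in the "usage" section gives the average
# object size for that bucket; without --bucket it reports all buckets.

That is obviously much coarser than the Lua/Prometheus approach, but it needs no
RGW-side changes.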

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Yuval Lifshitz 
Sent: Tuesday, September 14, 2021 7:38 PM
To: Szabo, Istvan (Agoda) 
Cc: Wido den Hollander ; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Metrics for object sizes

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !

Hi Istvan,
Hope this is still relevant... but you may want to have a look at this example:

https://github.com/ceph/ceph/blob/master/examples/lua/prometheus_adapter.lua
https://github.com/ceph/ceph/blob/master/examples/lua/prometheus_adapter.md

where we log RGW object sizes to Prometheus.
would be easy to change it so it is per bucket and not per operation type.

Yuval

On Fri, Apr 23, 2021 at 4:53 AM Szabo, Istvan (Agoda)
<istvan.sz...@agoda.com> wrote:
Objects inside RGW buckets. Like in the Couchbase software, which has its own
metrics and exposes this information.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: 
istvan.sz...@agoda.com>
---

On 2021. Apr 22., at 14:00, Wido den Hollander <w...@42on.com> wrote:



On 21/04/2021 11:46, Szabo, Istvan (Agoda) wrote:
Hi,
Is there any cluster-wide metric regarding object sizes?
I'd like to collect some information about what object sizes the users have
in their buckets.

Are you talking about RADOS objects or objects inside RGW buckets?

I think you are talking about RGW, but I just wanted to check.

Afaik this information is not available for both RADOS and RGW.

Do keep in mind that small objects are much more expensive than large objects.
The metadata overhead becomes costly and can even become problematic if you
have millions of tiny (few KB) objects.

Wido


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-12 Thread Igor Fedotov

Istvan,

So things with migrations are clear at the moment, right? As I mentioned,
the migrate command in 15.2.14 has a bug which corrupts the OSD if a
db->slow migration occurs on a spilled-over OSD. To work around that you
might want to migrate slow to db first, or try manual compaction. Please
make sure there is no spilled-over data left after either of them, via
ceph-bluestore-tool's bluefs-bdev-sizes command, before proceeding with the
db->slow migrate...
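
For completeness, a command-level sketch of that order of operations (ids, paths
and the target LVs are placeholders, and the OSD has to be stopped first):

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-<id> fsck
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-<id> bluefs-bdev-sizes   # shows how much BlueFS data sits on the slow device
ceph-volume lvm migrate --osd-id <id> --osd-fsid <fsid> --from data --target <db-vg/db-lv>        # slow -> db first
ceph-volume lvm migrate --osd-id <id> --osd-fsid <fsid> --from db wal --target <block-vg/block-lv> # then db/wal -> slow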


Just a side note - IMO it sounds a bit contradictory that you're
expecting/experiencing better performance without a standalone DB and at
the same time spillovers cause performance issues... Spillover means
some data goes to the main device (which you're trying to achieve by
migrating as well), hence it would rather improve things... Or the root
cause of your performance issues is different... Just want to share my
thoughts - I don't have any better ideas about that so far...



Thanks,

Igor

On 10/12/2021 2:54 PM, Szabo, Istvan (Agoda) wrote:


I’m having 1 billions of objects in the cluster and we are still 
increasing and faced spillovers allover the clusters.


After 15-18 spilledover osds (out of the 42-50) the osds started to 
die, flapping.


Tried to compact manually the spilleovered ones, but didn’t help, 
however the not spilled osds less frequently crashed.


In our design 3 ssd was used 1 nvme for db+wal, but this nvme has 30k 
iops on random write, however the ssds behind this nvme have 
individually 67k so actually the SSDs are faster in write than the 
nvme which means our config suboptimal.


I’ve decided to update the cluster to 15.2.14 to be able to run this 
ceph-volume lvm migrate command and started to use it.


10-20% is the failed migration at the moment, 80-90% is successful.

I want to avoid this spillover in the future so I’ll use bare SSDs as 
osds without wal+db. At the moment my iowait decreased  a lot without 
nvme drives, I just hope didn’t do anything wrong with this migration 
right?


The failed ones I’m removing from the cluster and add it back after 
cleaned up.


Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com 
---

*From:* Igor Fedotov 
*Sent:* Tuesday, October 12, 2021 6:45 PM
*To:* Szabo, Istvan (Agoda) 
*Cc:* ceph-users@ceph.io; 胡 玮文 
*Subject:* Re: [ceph-users] Re: is it possible to remove the db+wal 
from an external device (nvme)


Email received from the internet. If in doubt, don't click any link 
nor open any attachment !




You mean you run migrate for these 72 OSDs and all of them aren't 
starting any more? Or you just upgraded them to Octopus and 
experiencing performance issues.


In the latter case and if you have enough space at DB device you might 
want to try to migrate data from slow to db first. Run fsck (just in 
case) and then migrate from DB/WAl back to slow.


Theoretically this should help in avoiding the before-mentioned bug. 
But  I haven't try that personally...


And this wouldn't fix the corrupted OSDs if any though...

Thanks,

Igor

On 10/12/2021 2:36 PM, Szabo, Istvan (Agoda) wrote:

Omg, I’ve already migrated 24x osds in each dc-s (altogether 72).

What should I do then? 12 left (altogether 36). In my case slow
device is faster in random write iops than the one which is
serving it.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com 
---



On 2021. Oct 12., at 13:21, Igor Fedotov
  wrote:

Email received from the internet. If in doubt, don't click
any link nor open any attachment !


Istvan,

you're bitten by

It's not fixed in 15.2.14. This has got a backport to upcoming
Octopus
minor release. Please do not use 'migrate' command from WAL/DB
to slow
volume if some data is already present there...

Thanks,

Igor


On 10/12/2021 12:13 PM, Szabo, Istvan (Agoda) wrote:

Hi Igor,

I’ve attached here, thank you in advance.

Istvan Szabo

Senior Infrastructure Engineer

---

Agoda Services Co., Ltd.

e: istvan.sz...@agoda.com



---

From: Igor Fedotov 


Sent: Monday, October 11, 2021 10:40 

[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-12 Thread Szabo, Istvan (Agoda)
Hi Igor,

I’ve attached here, thank you in advance.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Igor Fedotov 
Sent: Monday, October 11, 2021 10:40 PM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io; Eugen Block ; 胡 玮文 
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


No,

that's just backtrace of the crash - I'd like to see the full OSD log from the 
process startup till the crash instead...
On 10/8/2021 4:02 PM, Szabo, Istvan (Agoda) wrote:
Hi Igor,

Here is a bluestore tool fsck output:
https://justpaste.it/7igrb

Is this that you are looking for?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Igor Fedotov 
Sent: Tuesday, October 5, 2021 10:02 PM
To: Szabo, Istvan (Agoda) 
; 胡 玮文 

Cc: ceph-users@ceph.io; Eugen Block 

Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Not sure dmcrypt is a culprit here.

Could you please set debug-bluefs to 20 and collect an OSD startup log.


On 10/5/2021 4:43 PM, Szabo, Istvan (Agoda) wrote:
Hmm, tried another one which hasn’t been spilledover disk, still coredumped ☹
Is there any special thing that we need to do before we migrate db next to the 
block? Our osds are using dmcrypt, is it an issue?

{
"backtrace": [
"(()+0x12b20) [0x7f310aa49b20]",
"(gsignal()+0x10f) [0x7f31096aa37f]",
"(abort()+0x127) [0x7f3109694db5]",
"(()+0x9009b) [0x7f310a06209b]",
"(()+0x9653c) [0x7f310a06853c]",
"(()+0x95559) [0x7f310a067559]",
"(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",
"(()+0x10b03) [0x7f3109a48b03]",
"(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",
"(__cxa_throw()+0x3b) [0x7f310a0687eb]",
"(()+0x19fa4) [0x7f310b7b6fa4]",
"(tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0x146) 
[0x7f310b7d8c96]",
"(()+0x10d0f8e) [0x55ffa520df8e]",
"(rocksdb::Version::~Version()+0x104) [0x55ffa521d174]",
"(rocksdb::Version::Unref()+0x21) [0x55ffa521d221]",
"(rocksdb::ColumnFamilyData::~ColumnFamilyData()+0x5a) 
[0x55ffa52efcca]",
"(rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x88) [0x55ffa52f0568]",
"(rocksdb::VersionSet::~VersionSet()+0x5e) [0x55ffa520e01e]",
"(rocksdb::VersionSet::~VersionSet()+0x11) [0x55ffa520e261]",
"(rocksdb::DBImpl::CloseHelper()+0x616) [0x55ffa5155ed6]",
"(rocksdb::DBImpl::~DBImpl()+0x83b) [0x55ffa515c35b]",
"(rocksdb::DBImplReadOnly::~DBImplReadOnly()+0x11) [0x55ffa51a3bc1]",
"(rocksdb::DB::OpenForReadOnly(rocksdb::DBOptions const&, 
std::__cxx11::basic_string, std::allocator > 
const&, std::vector > const&, 
std::vector >*, rocksdb::DB**, bool)+0x1089) 
[0x55ffa51a57e9]",
"(RocksDBStore::do_open(std::ostream&, bool, bool, 
std::vector 
> const*)+0x14ca) [0x55ffa51285ca]",
"(BlueStore::_open_db(bool, bool, bool)+0x1314) [0x55ffa4bc27e4]",
"(BlueStore::_open_db_and_around(bool)+0x4c) [0x55ffa4bd4c5c]",
"(BlueStore::_mount(bool, bool)+0x847) [0x55ffa4c2e047]",
"(OSD::init()+0x380) [0x55ffa4753a70]",
"(main()+0x47f1) [0x55ffa46a6901]",
"(__libc_start_main()+0xf3) [0x7f3109696493]",
"(_start()+0x2e) [0x55ffa46d4e3e]"
],
"ceph_version": "15.2.14",
"crash_id": 
"2021-10-05T13:31:28.513463Z_b6818598-4960-4ed6-942a-d4a7ff37a758",
"entity_name": "osd.48",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig": 
"6a43b6c219adac393b239fbea4a53ff87c4185bcd213724f0d721b452b81ddbf",
"timestamp": "2021-10-05T13:31:28.513463Z",
"utsname_hostname": "server-2s07",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Wed Sep 15 15:39:39 UTC 2021"
}
Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: 胡 玮文 
Sent: Monday, October 4, 2021 12:13 AM

[ceph-users] Announcing go-ceph v0.12.0

2021-10-12 Thread John Mulligan
I'm happy to announce another release of the go-ceph API 
library. This is a regular release following our every-two-months release 
cadence.

https://github.com/ceph/go-ceph/releases/tag/v0.12.0

Changes include additions to the rbd, rbd admin, and rgw admin 
packages. More details are available at the link above.

The library includes bindings that aim to play a similar role to the "pybind" 
python bindings in the ceph tree but for the Go language. The library also 
includes additional APIs that can be used to administer cephfs, rbd, and rgw 
subsystems.
There are already a few consumers of this library in the wild, including the 
ceph-csi project.


-- 
John Mulligan

phlogistonj...@asynchrono.us
jmulli...@redhat.com




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph cluster Sync

2021-10-12 Thread Michel Niyoyita
Dear team

I want to build two different clusters: one for the primary site and the second
for a DR site. I would like to ask if these two clusters can
communicate (synchronize) with each other, so that data written to the primary
site is synchronized to the DR site, and if we run into trouble on the primary
site the DR site automatically takes over.

Please help me with a solution or advise me on how to proceed.

Best Regards
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-12 Thread Szabo, Istvan (Agoda)
Omg, I’ve already migrated 24 OSDs in each DC (altogether 72).
What should I do then? 12 are left (altogether 36). In my case the slow device is
faster in random-write IOPS than the one which is serving it.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2021. Oct 12., at 13:21, Igor Fedotov  wrote:

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Istvan,

you're bitten by https://github.com/ceph/ceph/pull/43140

It's not fixed in 15.2.14. This has got a backport to upcoming Octopus
minor release. Please do not use 'migrate' command from WAL/DB to slow
volume if some data is already present there...

Thanks,

Igor


On 10/12/2021 12:13 PM, Szabo, Istvan (Agoda) wrote:
Hi Igor,

I’ve attached here, thank you in advance.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Igor Fedotov 
Sent: Monday, October 11, 2021 10:40 PM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io; Eugen Block ; 胡 玮文 
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


No,

that's just backtrace of the crash - I'd like to see the full OSD log from the 
process startup till the crash instead...
On 10/8/2021 4:02 PM, Szabo, Istvan (Agoda) wrote:
Hi Igor,

Here is a bluestore tool fsck output:
https://justpaste.it/7igrb

Is this that you are looking for?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Igor Fedotov 
Sent: Tuesday, October 5, 2021 10:02 PM
To: Szabo, Istvan (Agoda) 
; 胡 玮文 

Cc: ceph-users@ceph.io; Eugen Block 

Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Not sure dmcrypt is a culprit here.

Could you please set debug-bluefs to 20 and collect an OSD startup log.


On 10/5/2021 4:43 PM, Szabo, Istvan (Agoda) wrote:
Hmm, tried another one which hasn’t been spilledover disk, still coredumped ☹
Is there any special thing that we need to do before we migrate db next to the 
block? Our osds are using dmcrypt, is it an issue?

{
"backtrace": [
"(()+0x12b20) [0x7f310aa49b20]",
"(gsignal()+0x10f) [0x7f31096aa37f]",
"(abort()+0x127) [0x7f3109694db5]",
"(()+0x9009b) [0x7f310a06209b]",
"(()+0x9653c) [0x7f310a06853c]",
"(()+0x95559) [0x7f310a067559]",
"(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",
"(()+0x10b03) [0x7f3109a48b03]",
"(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",
"(__cxa_throw()+0x3b) [0x7f310a0687eb]",
"(()+0x19fa4) [0x7f310b7b6fa4]",
"(tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0x146) 
[0x7f310b7d8c96]",
"(()+0x10d0f8e) [0x55ffa520df8e]",
"(rocksdb::Version::~Version()+0x104) [0x55ffa521d174]",
"(rocksdb::Version::Unref()+0x21) [0x55ffa521d221]",
"(rocksdb::ColumnFamilyData::~ColumnFamilyData()+0x5a) 
[0x55ffa52efcca]",
"(rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x88) [0x55ffa52f0568]",
"(rocksdb::VersionSet::~VersionSet()+0x5e) [0x55ffa520e01e]",
"(rocksdb::VersionSet::~VersionSet()+0x11) [0x55ffa520e261]",
"(rocksdb::DBImpl::CloseHelper()+0x616) [0x55ffa5155ed6]",
"(rocksdb::DBImpl::~DBImpl()+0x83b) [0x55ffa515c35b]",
"(rocksdb::DBImplReadOnly::~DBImplReadOnly()+0x11) [0x55ffa51a3bc1]",
"(rocksdb::DB::OpenForReadOnly(rocksdb::DBOptions const&, 
std::__cxx11::basic_string, std::allocator > 
const&, std::vector > const&, 
std::vector >*, rocksdb::DB**, bool)+0x1089) 
[0x55ffa51a57e9]",
"(RocksDBStore::do_open(std::ostream&, bool, bool, 
std::vector 
> const*)+0x14ca) [0x55ffa51285ca]",
"(BlueStore::_open_db(bool, bool, bool)+0x1314) [0x55ffa4bc27e4]",
"(BlueStore::_open_db_and_around(bool)+0x4c) [0x55ffa4bd4c5c]",
"(BlueStore::_mount(bool, bool)+0x847) [0x55ffa4c2e047]",
"(OSD::init()+0x380) [0x55ffa4753a70]",
"(main()+0x47f1) [0x55ffa46a6901]",
"(__libc_start_main()+0xf3) [0x7f3109696493]",
"(_start()+0x2e) [0x55ffa46d4e3e]"
],
   

[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-12 Thread Szabo, Istvan (Agoda)
I’m having 1 billion objects in the cluster, we are still growing, and I faced
spillovers all over the clusters.
After 15-18 spilled-over OSDs (out of the 42-50), the OSDs started to die and
flap.
I tried to compact the spilled-over ones manually, but it didn’t help; however,
the OSDs that had not spilled over crashed less frequently.
In our design 3 SSDs shared 1 NVMe for DB+WAL, but this NVMe has 30k IOPS on
random write while the SSDs behind this NVMe individually have 67k, so the SSDs
are actually faster at writes than the NVMe, which means our config is
suboptimal.

I’ve decided to update the cluster to 15.2.14 to be able to run this
ceph-volume lvm migrate command and started to use it.

10-20% of the migrations are failing at the moment, 80-90% are successful.
I want to avoid this spillover in the future, so I’ll use bare SSDs as OSDs
without WAL+DB. At the moment my iowait has decreased a lot without the NVMe
drives; I just hope I didn’t do anything wrong with this migration, right?

The failed ones I’m removing from the cluster and adding back after they are
cleaned up.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Igor Fedotov 
Sent: Tuesday, October 12, 2021 6:45 PM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io; 胡 玮文 
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


You mean you run migrate for these 72 OSDs and all of them aren't starting any 
more? Or you just upgraded them to Octopus and experiencing performance issues.

In the latter case and if you have enough space at DB device you might want to 
try to migrate data from slow to db first. Run fsck (just in case) and then 
migrate from DB/WAl back to slow.
Theoretically this should help in avoiding the before-mentioned bug. But  I 
haven't try that personally...

And this wouldn't fix the corrupted OSDs if any though...



Thanks,

Igor
On 10/12/2021 2:36 PM, Szabo, Istvan (Agoda) wrote:
Omg, I’ve already migrated 24x osds in each dc-s (altogether 72).
What should I do then? 12 left (altogether 36). In my case slow device is 
faster in random write iops than the one which is serving it.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---


On 2021. Oct 12., at 13:21, Igor Fedotov 
 wrote:
Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Istvan,

you're bitten by

It's not fixed in 15.2.14. This has got a backport to upcoming Octopus
minor release. Please do not use 'migrate' command from WAL/DB to slow
volume if some data is already present there...

Thanks,

Igor


On 10/12/2021 12:13 PM, Szabo, Istvan (Agoda) wrote:

Hi Igor,

I’ve attached here, thank you in advance.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: 
istvan.sz...@agoda.com
---

From: Igor Fedotov 
Sent: Monday, October 11, 2021 10:40 PM
To: Szabo, Istvan (Agoda) 

Cc: ceph-users@ceph.io; Eugen Block 
; 胡 玮文 

Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


No,

that's just backtrace of the crash - I'd like to see the full OSD log from the 
process startup till the crash instead...
On 10/8/2021 4:02 PM, Szabo, Istvan (Agoda) wrote:
Hi Igor,

Here is a bluestore tool fsck output:
https://justpaste.it/7igrb

Is this that you are looking for?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: 
istvan.sz...@agoda.com
---

From: Igor Fedotov 

Sent: Tuesday, October 5, 2021 10:02 PM
To: Szabo, Istvan (Agoda) 
;
 胡 玮文 

Cc: 
ceph-users@ceph.io

[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-12 Thread Szabo, Istvan (Agoda)
One more thing, here is what I’m doing at the moment (rough command sketch below):

Set noout and norebalance, then on 1 host:
Stop all the OSDs
Compact all the OSDs
Migrate the DBs 1 by 1
Start the OSDs 1 by 1
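
Roughly, in commands (a sketch only; the unit names differ between package-based
and cephadm installs, and <id>, <fsid> and the target LV are placeholders):

ceph osd set noout && ceph osd set norebalance
systemctl stop ceph-osd@<id>                                   # for every OSD on that host
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
ceph-volume lvm migrate --osd-id <id> --osd-fsid <fsid> --from db wal --target <block-vg/block-lv>
systemctl start ceph-osd@<id>
ceph osd unset norebalance && ceph osd unset noout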

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Szabo, Istvan (Agoda)
Sent: Tuesday, October 12, 2021 6:54 PM
To: Igor Fedotov 
Cc: ceph-users@ceph.io; 胡 玮文 
Subject: RE: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

I’m having 1 billions of objects in the cluster and we are still increasing and 
faced spillovers allover the clusters.
After 15-18 spilledover osds (out of the 42-50) the osds started to die, 
flapping.
Tried to compact manually the spilleovered ones, but didn’t help, however the 
not spilled osds less frequently crashed.
In our design 3 ssd was used 1 nvme for db+wal, but this nvme has 30k iops on 
random write, however the ssds behind this nvme have individually 67k so 
actually the SSDs are faster in write than the nvme which means our config 
suboptimal.

I’ve decided to update the cluster to 15.2.14 to be able to run this 
ceph-volume lvm migrate command and started to use it.

10-20% is the failed migration at the moment, 80-90% is successful.
I want to avoid this spillover in the future so I’ll use bare SSDs as osds 
without wal+db. At the moment my iowait decreased  a lot without nvme drives, I 
just hope didn’t do anything wrong with this migration right?

The failed ones I’m removing from the cluster and add it back after cleaned up.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Igor Fedotov mailto:igor.fedo...@croit.io>>
Sent: Tuesday, October 12, 2021 6:45 PM
To: Szabo, Istvan (Agoda) 
mailto:istvan.sz...@agoda.com>>
Cc: ceph-users@ceph.io; 胡 玮文 
mailto:huw...@outlook.com>>
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


You mean you run migrate for these 72 OSDs and all of them aren't starting any 
more? Or you just upgraded them to Octopus and experiencing performance issues.

In the latter case and if you have enough space at DB device you might want to 
try to migrate data from slow to db first. Run fsck (just in case) and then 
migrate from DB/WAl back to slow.
Theoretically this should help in avoiding the before-mentioned bug. But  I 
haven't try that personally...

And this wouldn't fix the corrupted OSDs if any though...



Thanks,

Igor
On 10/12/2021 2:36 PM, Szabo, Istvan (Agoda) wrote:
Omg, I’ve already migrated 24x osds in each dc-s (altogether 72).
What should I do then? 12 left (altogether 36). In my case slow device is 
faster in random write iops than the one which is serving it.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2021. Oct 12., at 13:21, Igor Fedotov 
 wrote:
Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Istvan,

you're bitten by

It's not fixed in 15.2.14. This has got a backport to the upcoming Octopus
minor release. Please do not use the 'migrate' command from WAL/DB to slow
volume if some data is already present there...

Thanks,

Igor


On 10/12/2021 12:13 PM, Szabo, Istvan (Agoda) wrote:
Hi Igor,

I’ve attached here, thank you in advance.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Igor Fedotov
Sent: Monday, October 11, 2021 10:40 PM
To: Szabo, Istvan (Agoda)
Cc: ceph-users@ceph.io; Eugen Block; 胡 玮文

Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)



No,

that's just the backtrace of the crash - I'd like to see the full OSD log from
the process startup till the crash instead...
On 10/8/2021 4:02 PM, Szabo, Istvan (Agoda) wrote:
Hi Igor,

Here is a bluestore tool fsck output:
https://justpaste.it/7igrb

Is this what you are looking for?

[ceph-users] Re: Ceph cluster Sync

2021-10-12 Thread DHilsbos
Michel;

I am neither a Ceph evangelist, nor a Ceph expert, but here is my current 
understanding:
Ceph clusters do not have in-built cross cluster synchronization.  That said, 
there are several things which might meet your needs.

1) If you're just planning your Ceph deployment, then the latest release 
(Pacific) introduced the concept of a stretch cluster, essentially a cluster 
which is stretched across datacenters (i.e. a relatively low-bandwidth, 
high-latency link)[1].

2) RADOSGW allows for uni-directional as well as bi-directional synchronization 
of the data that it handles.[2]

3) RBD provides mirroring functionality for the data it handles.[3]

Thank you,

Dominic L. Hilsbos, MBA
Vice President - Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com

[1] https://docs.ceph.com/en/latest/rados/operations/stretch-mode/
[2] https://docs.ceph.com/en/latest/radosgw/sync-modules/
[3] https://docs.ceph.com/en/latest/rbd/rbd-mirroring/
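
For what it's worth, the RBD mirroring option (3) can be sketched roughly like
this for two clusters called "pr" and "dr" and a pool named "rbd" (all names
are illustrative; see the rbd-mirroring link above for the full procedure,
including running the rbd-mirror daemon on the DR side):

  # enable per-image mirroring on the pool, on both clusters
  rbd mirror pool enable rbd image

  # on the PR cluster: create a bootstrap token for the peer relationship
  rbd mirror pool peer bootstrap create --site-name pr rbd > /tmp/peer-token

  # on the DR cluster (where the rbd-mirror daemon runs): import the token
  rbd --cluster dr mirror pool peer bootstrap import --site-name dr rbd /tmp/peer-token

  # enable snapshot-based mirroring for a specific image
  rbd mirror image enable rbd/myimage snapshot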


-Original Message-
From: Michel Niyoyita [mailto:mico...@gmail.com] 
Sent: Tuesday, October 12, 2021 8:35 AM
To: ceph-users
Subject: [ceph-users] Ceph cluster Sync

Dear team

I want to build two different cluster: one for primary site and the second
for DR site. I would like to ask if these two cluster can
communicate(synchronized) each other and data written to the PR site be
synchronized to the DR site ,  if once we got trouble for the PR site the
DR automatically takeover.

Please help me for the solution or advise me how to proceed

Best Regards
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OSD Crashes in 16.2.6

2021-10-12 Thread Marco Pizzolo
Hello everyone,

We are seeing instability in 20.04.3 using HWE kernel and Ceph 16.2.6
w/Podman.

We have OSDs that fail after <24 hours and I'm not sure why.

Seeing this:

ceph crash info
2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763
{
"backtrace": [
"/lib64/libpthread.so.0(+0x12b20) [0x7f4d31099b20]",
"pthread_cond_wait()",

"(std::condition_variable::wait(std::unique_lock&)+0x10)
[0x7f4d306de8f0]",
"(Throttle::_wait(long, std::unique_lock&)+0x10d)
[0x55c52a0f077d]",
"(Throttle::get(long, long)+0xb9) [0x55c52a0f1199]",
"(BlueStore::BlueStoreThrottle::try_start_transaction(KeyValueDB&,
BlueStore::TransContext&, std::chrono::time_point > >)+0x29)
[0x55c529f362c9]",

"(BlueStore::queue_transactions(boost::intrusive_ptr&,
std::vector
>&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x854)
[0x55c529fb7664]",
"(non-virtual thunk to
PrimaryLogPG::queue_transactions(std::vector >&,
boost::intrusive_ptr)+0x58) [0x55c529c0ee98]",
"(ReplicatedBackend::submit_transaction(hobject_t const&,
object_stat_sum_t const&, eversion_t const&, std::unique_ptr >&&, eversion_t const&, eversion_t
const&, std::vector >&&,
std::optional&, Context*, unsigned long, osd_reqid_t,
boost::intrusive_ptr)+0xcad) [0x55c529dfbedd]",
"(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
PrimaryLogPG::OpContext*)+0xcf0) [0x55c529b7a630]",
"(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x115d)
[0x55c529bd65ed]",
"(PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2de2)
[0x55c529bdf162]",
"(PrimaryLogPG::do_request(boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0xd1c) [0x55c529be64ac]",
"(OSD::dequeue_op(boost::intrusive_ptr,
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x309)
[0x55c529a6f1b9]",
"(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x68) [0x55c529ccc868]",
"(OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xa58) [0x55c529a8f1e8]",
"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
[0x55c52a0fa6c4]",
"(ShardedThreadPool::WorkThreadSharded::entry()+0x14)
[0x55c52a0fd364]",
"/lib64/libpthread.so.0(+0x814a) [0x7f4d3108f14a]",
"clone()"
],
"ceph_version": "16.2.6",
"crash_id":
"2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763",
"entity_name": "osd.14",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig":
"46b81ca079908da081327cbc114a9c1801dfdbb81303b85fff0d4107a1aeeabe",
"timestamp": "2021-10-12T14:32:49.169552Z",
"utsname_hostname": "",
"utsname_machine": "x86_64",
"utsname_release": "5.11.0-37-generic",
"utsname_sysname": "Linux",
"utsname_version": "#41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC 2021"

dmesg on host shows:

[66258.080040] BUG: kernel NULL pointer dereference, address:
00c0
[66258.080067] #PF: supervisor read access in kernel mode
[66258.080081] #PF: error_code(0x) - not-present page
[66258.080093] PGD 0 P4D 0
[66258.080105] Oops:  [#1] SMP NOPTI
[66258.080115] CPU: 35 PID: 4955 Comm: zabbix_agentd Not tainted
5.11.0-37-generic #41~20.04.2-Ubuntu
[66258.080137] Hardware name: Supermicro SSG-6049P-E1CR60L+/X11DSC+, BIOS
3.3 02/21/2020
[66258.080154] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
[66258.080171] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b
41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10
<48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
[66258.080210] RSP: 0018:a251249fbbc0 EFLAGS: 00010283
[66258.080224] RAX:  RBX: a251249fbc48 RCX:
0002
[66258.080240] RDX: 0001 RSI: 0202 RDI:
9382f4d8e000
[66258.080256] RBP: a251249fbbf8 R08:  R09:
0035
[66258.080272] R10: abcc77118461cefd R11: 93b382257076 R12:
9382f4d8e000
[66258.080288] R13: 9382f5670c00 R14:  R15:
0001
[66258.080304] FS:  7fedc0ef46c0() GS:93e13f4c()
knlGS:
[66258.080322] CS:  0010 DS:  ES:  CR0: 80050033
[66258.080335] CR2: 00c0 CR3: 00011cc44005 CR4:
007706e0
[66258.080351] DR0:  DR1:  DR2:

[66258.080367] DR3:  DR6: fffe0ff0 DR7:
0400
[66258.080383] PKRU: 5554
[66258.080391] Call Trace:
[66258.080401]  ? bt_iter+0x54/0x90
[66258.080413]  blk_mq_queue_tag_busy_iter+0x18b/0x2d0
[66258.080427]  ? blk_mq_hctx_mark_pending+0x70/0x70
[66258.080440]  ? blk_mq_hctx_mark_pending+0x70/0x70
[66258.080452]  blk_mq_in_flight+0x38/0x60
[66258.080463]  diskstats_show+0x75/0x2b0
[66258.080475]  traverse+0x78/0x200
[66258.080485]  seq_lseek+0x61/0xd0
[66258.080495]  proc_reg_llseek+0x77/0xa0
[66258.08

[ceph-users] Re: OSD Crashes in 16.2.6

2021-10-12 Thread Zakhar Kirpichenko
Hi,

This could be kernel-related, as I've seen similar reports in Proxmox
forum. Specifically, 5.11.x with Ceph seems to be hitting kernel NULL
pointer dereference. Perhaps a newer kernel would help. If not, I'm running
16.2.6 with kernel 5.4.x without any issues.

Best regards,
Z

On Tue, Oct 12, 2021 at 8:31 PM Marco Pizzolo 
wrote:

> Hello everyone,
>
> We are seeing instability in 20.04.3 using HWE kernel and Ceph 16.2.6
> w/Podman.
>
> We have OSDs that fail after <24 hours and I'm not sure why.
>
> Seeing this:
>
> ceph crash info
> 2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763
> {
> "backtrace": [
> "/lib64/libpthread.so.0(+0x12b20) [0x7f4d31099b20]",
> "pthread_cond_wait()",
>
> "(std::condition_variable::wait(std::unique_lock&)+0x10)
> [0x7f4d306de8f0]",
> "(Throttle::_wait(long, std::unique_lock&)+0x10d)
> [0x55c52a0f077d]",
> "(Throttle::get(long, long)+0xb9) [0x55c52a0f1199]",
> "(BlueStore::BlueStoreThrottle::try_start_transaction(KeyValueDB&,
> BlueStore::TransContext&, std::chrono::time_point std::chrono::duration > >)+0x29)
> [0x55c529f362c9]",
>
>
> "(BlueStore::queue_transactions(boost::intrusive_ptr&,
> std::vector
> >&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x854)
> [0x55c529fb7664]",
> "(non-virtual thunk to
> PrimaryLogPG::queue_transactions(std::vector std::allocator >&,
> boost::intrusive_ptr)+0x58) [0x55c529c0ee98]",
> "(ReplicatedBackend::submit_transaction(hobject_t const&,
> object_stat_sum_t const&, eversion_t const&, std::unique_ptr std::default_delete >&&, eversion_t const&, eversion_t
> const&, std::vector >&&,
> std::optional&, Context*, unsigned long, osd_reqid_t,
> boost::intrusive_ptr)+0xcad) [0x55c529dfbedd]",
> "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
> PrimaryLogPG::OpContext*)+0xcf0) [0x55c529b7a630]",
> "(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x115d)
> [0x55c529bd65ed]",
> "(PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2de2)
> [0x55c529bdf162]",
> "(PrimaryLogPG::do_request(boost::intrusive_ptr&,
> ThreadPool::TPHandle&)+0xd1c) [0x55c529be64ac]",
> "(OSD::dequeue_op(boost::intrusive_ptr,
> boost::intrusive_ptr, ThreadPool::TPHandle&)+0x309)
> [0x55c529a6f1b9]",
> "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
> boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x68) [0x55c529ccc868]",
> "(OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0xa58) [0x55c529a8f1e8]",
> "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
> [0x55c52a0fa6c4]",
> "(ShardedThreadPool::WorkThreadSharded::entry()+0x14)
> [0x55c52a0fd364]",
> "/lib64/libpthread.so.0(+0x814a) [0x7f4d3108f14a]",
> "clone()"
> ],
> "ceph_version": "16.2.6",
> "crash_id":
> "2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763",
> "entity_name": "osd.14",
> "os_id": "centos",
> "os_name": "CentOS Linux",
> "os_version": "8",
> "os_version_id": "8",
> "process_name": "ceph-osd",
> "stack_sig":
> "46b81ca079908da081327cbc114a9c1801dfdbb81303b85fff0d4107a1aeeabe",
> "timestamp": "2021-10-12T14:32:49.169552Z",
> "utsname_hostname": "",
> "utsname_machine": "x86_64",
> "utsname_release": "5.11.0-37-generic",
> "utsname_sysname": "Linux",
> "utsname_version": "#41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC
> 2021"
>
> dmesg on host shows:
>
> [66258.080040] BUG: kernel NULL pointer dereference, address:
> 00c0
> [66258.080067] #PF: supervisor read access in kernel mode
> [66258.080081] #PF: error_code(0x) - not-present page
> [66258.080093] PGD 0 P4D 0
> [66258.080105] Oops:  [#1] SMP NOPTI
> [66258.080115] CPU: 35 PID: 4955 Comm: zabbix_agentd Not tainted
> 5.11.0-37-generic #41~20.04.2-Ubuntu
> [66258.080137] Hardware name: Supermicro SSG-6049P-E1CR60L+/X11DSC+, BIOS
> 3.3 02/21/2020
> [66258.080154] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
> [66258.080171] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b
> 41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10
> <48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
> [66258.080210] RSP: 0018:a251249fbbc0 EFLAGS: 00010283
> [66258.080224] RAX:  RBX: a251249fbc48 RCX:
> 0002
> [66258.080240] RDX: 0001 RSI: 0202 RDI:
> 9382f4d8e000
> [66258.080256] RBP: a251249fbbf8 R08:  R09:
> 0035
> [66258.080272] R10: abcc77118461cefd R11: 93b382257076 R12:
> 9382f4d8e000
> [66258.080288] R13: 9382f5670c00 R14:  R15:
> 0001
> [66258.080304] FS:  7fedc0ef46c0() GS:93e13f4c()
> knlGS:
> [66258.080322] CS:  0010 DS:  ES:  CR0: 80050033
> [66258.080335] CR2: 00c0 CR3: 00011cc44005 CR4:
> 007706e0
> [66258.08035

[ceph-users] Re: OSD Crashes in 16.2.6

2021-10-12 Thread Marco Pizzolo
Hi Zakhar,

Thanks for the quick response.  I was coming across some of those Proxmox
forum posts as well.  I'm not sure if going to the 5.4 kernel will create
any other challenges for us, as we're using dual port mellanox connectx-6
200G nics in the hosts, but it is definitely something we can try.

Marco

On Tue, Oct 12, 2021 at 1:53 PM Zakhar Kirpichenko  wrote:

> Hi,
>
> This could be kernel-related, as I've seen similar reports in Proxmox
> forum. Specifically, 5.11.x with Ceph seems to be hitting kernel NULL
> pointer dereference. Perhaps a newer kernel would help. If not, I'm running
> 16.2.6 with kernel 5.4.x without any issues.
>
> Best regards,
> Z
>
> On Tue, Oct 12, 2021 at 8:31 PM Marco Pizzolo 
> wrote:
>
>> Hello everyone,
>>
>> We are seeing instability in 20.04.3 using HWE kernel and Ceph 16.2.6
>> w/Podman.
>>
>> We have OSDs that fail after <24 hours and I'm not sure why.
>>
>> Seeing this:
>>
>> ceph crash info
>> 2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763
>> {
>> "backtrace": [
>> "/lib64/libpthread.so.0(+0x12b20) [0x7f4d31099b20]",
>> "pthread_cond_wait()",
>>
>> "(std::condition_variable::wait(std::unique_lock&)+0x10)
>> [0x7f4d306de8f0]",
>> "(Throttle::_wait(long, std::unique_lock&)+0x10d)
>> [0x55c52a0f077d]",
>> "(Throttle::get(long, long)+0xb9) [0x55c52a0f1199]",
>> "(BlueStore::BlueStoreThrottle::try_start_transaction(KeyValueDB&,
>> BlueStore::TransContext&, std::chrono::time_point> std::chrono::duration >
>> >)+0x29)
>> [0x55c529f362c9]",
>>
>>
>> "(BlueStore::queue_transactions(boost::intrusive_ptr&,
>> std::vector
>> >&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x854)
>> [0x55c529fb7664]",
>> "(non-virtual thunk to
>> PrimaryLogPG::queue_transactions(std::vector> std::allocator >&,
>> boost::intrusive_ptr)+0x58) [0x55c529c0ee98]",
>> "(ReplicatedBackend::submit_transaction(hobject_t const&,
>> object_stat_sum_t const&, eversion_t const&,
>> std::unique_ptr> std::default_delete >&&, eversion_t const&, eversion_t
>> const&, std::vector >&&,
>> std::optional&, Context*, unsigned long,
>> osd_reqid_t,
>> boost::intrusive_ptr)+0xcad) [0x55c529dfbedd]",
>> "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
>> PrimaryLogPG::OpContext*)+0xcf0) [0x55c529b7a630]",
>> "(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x115d)
>> [0x55c529bd65ed]",
>> "(PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2de2)
>> [0x55c529bdf162]",
>> "(PrimaryLogPG::do_request(boost::intrusive_ptr&,
>> ThreadPool::TPHandle&)+0xd1c) [0x55c529be64ac]",
>> "(OSD::dequeue_op(boost::intrusive_ptr,
>> boost::intrusive_ptr, ThreadPool::TPHandle&)+0x309)
>> [0x55c529a6f1b9]",
>> "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
>> boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x68) [0x55c529ccc868]",
>> "(OSD::ShardedOpWQ::_process(unsigned int,
>> ceph::heartbeat_handle_d*)+0xa58) [0x55c529a8f1e8]",
>> "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
>> [0x55c52a0fa6c4]",
>> "(ShardedThreadPool::WorkThreadSharded::entry()+0x14)
>> [0x55c52a0fd364]",
>> "/lib64/libpthread.so.0(+0x814a) [0x7f4d3108f14a]",
>> "clone()"
>> ],
>> "ceph_version": "16.2.6",
>> "crash_id":
>> "2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763",
>> "entity_name": "osd.14",
>> "os_id": "centos",
>> "os_name": "CentOS Linux",
>> "os_version": "8",
>> "os_version_id": "8",
>> "process_name": "ceph-osd",
>> "stack_sig":
>> "46b81ca079908da081327cbc114a9c1801dfdbb81303b85fff0d4107a1aeeabe",
>> "timestamp": "2021-10-12T14:32:49.169552Z",
>> "utsname_hostname": "",
>> "utsname_machine": "x86_64",
>> "utsname_release": "5.11.0-37-generic",
>> "utsname_sysname": "Linux",
>> "utsname_version": "#41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC
>> 2021"
>>
>> dmesg on host shows:
>>
>> [66258.080040] BUG: kernel NULL pointer dereference, address:
>> 00c0
>> [66258.080067] #PF: supervisor read access in kernel mode
>> [66258.080081] #PF: error_code(0x) - not-present page
>> [66258.080093] PGD 0 P4D 0
>> [66258.080105] Oops:  [#1] SMP NOPTI
>> [66258.080115] CPU: 35 PID: 4955 Comm: zabbix_agentd Not tainted
>> 5.11.0-37-generic #41~20.04.2-Ubuntu
>> [66258.080137] Hardware name: Supermicro SSG-6049P-E1CR60L+/X11DSC+, BIOS
>> 3.3 02/21/2020
>> [66258.080154] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
>> [66258.080171] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b
>> 41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10
>> <48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
>> [66258.080210] RSP: 0018:a251249fbbc0 EFLAGS: 00010283
>> [66258.080224] RAX:  RBX: a251249fbc48 RCX:
>> 0002
>> [66258.080240] RDX: 0001 RSI: 0202 RDI:
>> 9382f4d8e000
>> [66258.0802

[ceph-users] Re: OSD Crashes in 16.2.6

2021-10-12 Thread Igor Fedotov

Hi Marco,

this reminds me the following ticket: https://tracker.ceph.com/issues/52234


Unfortunately that's all we have so far about that issue. Could you 
please answer some questions:


1) Is this a new or upgraded cluster?

2) If you upgraded it - what was the previous Ceph version  and did you 
see the bug before?


3) How often does it fail this way? OSDs are able to recover afterwards 
I presume, aren't they?


4) Would you share performance counters dump for some of your OSDs after 
they have been working for a while?
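
For example, something along these lines per OSD would be enough (osd.14 is
just the id taken from the crash report above):

  ceph daemon osd.14 perf dump > osd.14-perf.json   # on the OSD host, or
  ceph tell osd.14 perf dump > osd.14-perf.json     # from an admin node on recent releases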



Thanks in advance,

Igor


On 10/12/2021 8:30 PM, Marco Pizzolo wrote:

Hello everyone,

We are seeing instability in 20.04.3 using HWE kernel and Ceph 16.2.6
w/Podman.

We have OSDs that fail after <24 hours and I'm not sure why.

Seeing this:

ceph crash info
2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763
{
 "backtrace": [
 "/lib64/libpthread.so.0(+0x12b20) [0x7f4d31099b20]",
 "pthread_cond_wait()",

"(std::condition_variable::wait(std::unique_lock&)+0x10)
[0x7f4d306de8f0]",
 "(Throttle::_wait(long, std::unique_lock&)+0x10d)
[0x55c52a0f077d]",
 "(Throttle::get(long, long)+0xb9) [0x55c52a0f1199]",
 "(BlueStore::BlueStoreThrottle::try_start_transaction(KeyValueDB&,
BlueStore::TransContext&, std::chrono::time_point > >)+0x29)
[0x55c529f362c9]",

"(BlueStore::queue_transactions(boost::intrusive_ptr&,
std::vector

&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x854)

[0x55c529fb7664]",
 "(non-virtual thunk to
PrimaryLogPG::queue_transactions(std::vector >&,
boost::intrusive_ptr)+0x58) [0x55c529c0ee98]",
 "(ReplicatedBackend::submit_transaction(hobject_t const&,
object_stat_sum_t const&, eversion_t const&, std::unique_ptr >&&, eversion_t const&, eversion_t
const&, std::vector >&&,
std::optional&, Context*, unsigned long, osd_reqid_t,
boost::intrusive_ptr)+0xcad) [0x55c529dfbedd]",
 "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
PrimaryLogPG::OpContext*)+0xcf0) [0x55c529b7a630]",
 "(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x115d)
[0x55c529bd65ed]",
 "(PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2de2)
[0x55c529bdf162]",
 "(PrimaryLogPG::do_request(boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0xd1c) [0x55c529be64ac]",
 "(OSD::dequeue_op(boost::intrusive_ptr,
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x309)
[0x55c529a6f1b9]",
 "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x68) [0x55c529ccc868]",
 "(OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xa58) [0x55c529a8f1e8]",
 "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
[0x55c52a0fa6c4]",
 "(ShardedThreadPool::WorkThreadSharded::entry()+0x14)
[0x55c52a0fd364]",
 "/lib64/libpthread.so.0(+0x814a) [0x7f4d3108f14a]",
 "clone()"
 ],
 "ceph_version": "16.2.6",
 "crash_id":
"2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763",
 "entity_name": "osd.14",
 "os_id": "centos",
 "os_name": "CentOS Linux",
 "os_version": "8",
 "os_version_id": "8",
 "process_name": "ceph-osd",
 "stack_sig":
"46b81ca079908da081327cbc114a9c1801dfdbb81303b85fff0d4107a1aeeabe",
 "timestamp": "2021-10-12T14:32:49.169552Z",
 "utsname_hostname": "",
 "utsname_machine": "x86_64",
 "utsname_release": "5.11.0-37-generic",
 "utsname_sysname": "Linux",
 "utsname_version": "#41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC 2021"

dmesg on host shows:

[66258.080040] BUG: kernel NULL pointer dereference, address:
00c0
[66258.080067] #PF: supervisor read access in kernel mode
[66258.080081] #PF: error_code(0x) - not-present page
[66258.080093] PGD 0 P4D 0
[66258.080105] Oops:  [#1] SMP NOPTI
[66258.080115] CPU: 35 PID: 4955 Comm: zabbix_agentd Not tainted
5.11.0-37-generic #41~20.04.2-Ubuntu
[66258.080137] Hardware name: Supermicro SSG-6049P-E1CR60L+/X11DSC+, BIOS
3.3 02/21/2020
[66258.080154] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
[66258.080171] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b
41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10
<48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
[66258.080210] RSP: 0018:a251249fbbc0 EFLAGS: 00010283
[66258.080224] RAX:  RBX: a251249fbc48 RCX:
0002
[66258.080240] RDX: 0001 RSI: 0202 RDI:
9382f4d8e000
[66258.080256] RBP: a251249fbbf8 R08:  R09:
0035
[66258.080272] R10: abcc77118461cefd R11: 93b382257076 R12:
9382f4d8e000
[66258.080288] R13: 9382f5670c00 R14:  R15:
0001
[66258.080304] FS:  7fedc0ef46c0() GS:93e13f4c()
knlGS:
[66258.080322] CS:  0010 DS:  ES:  CR0: 80050033
[66258.080335] CR2: 00c0 CR3: 00011cc44005 C

[ceph-users] Re: OSD Crashes in 16.2.6

2021-10-12 Thread Zakhar Kirpichenko
Can't say much about kernel 5.4 and connectx-6, as we have no experience
with this combination. 5.4 + connectx-5 works well though :-)

/ Z

On Tue, Oct 12, 2021 at 9:06 PM Marco Pizzolo 
wrote:

> Hi Zakhar,
>
> Thanks for the quick response.  I was coming across some of those Proxmox
> forum posts as well.  I'm not sure if going to the 5.4 kernel will create
> any other challenges for us, as we're using dual port mellanox connectx-6
> 200G nics in the hosts, but it is definitely something we can try.
>
> Marco
>
> On Tue, Oct 12, 2021 at 1:53 PM Zakhar Kirpichenko 
> wrote:
>
>> Hi,
>>
>> This could be kernel-related, as I've seen similar reports in Proxmox
>> forum. Specifically, 5.11.x with Ceph seems to be hitting kernel NULL
>> pointer dereference. Perhaps a newer kernel would help. If not, I'm running
>> 16.2.6 with kernel 5.4.x without any issues.
>>
>> Best regards,
>> Z
>>
>> On Tue, Oct 12, 2021 at 8:31 PM Marco Pizzolo 
>> wrote:
>>
>>> Hello everyone,
>>>
>>> We are seeing instability in 20.04.3 using HWE kernel and Ceph 16.2.6
>>> w/Podman.
>>>
>>> We have OSDs that fail after <24 hours and I'm not sure why.
>>>
>>> Seeing this:
>>>
>>> ceph crash info
>>> 2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763
>>> {
>>> "backtrace": [
>>> "/lib64/libpthread.so.0(+0x12b20) [0x7f4d31099b20]",
>>> "pthread_cond_wait()",
>>>
>>> "(std::condition_variable::wait(std::unique_lock&)+0x10)
>>> [0x7f4d306de8f0]",
>>> "(Throttle::_wait(long, std::unique_lock&)+0x10d)
>>> [0x55c52a0f077d]",
>>> "(Throttle::get(long, long)+0xb9) [0x55c52a0f1199]",
>>>
>>> "(BlueStore::BlueStoreThrottle::try_start_transaction(KeyValueDB&,
>>> BlueStore::TransContext&, std::chrono::time_point>> std::chrono::duration >
>>> >)+0x29)
>>> [0x55c529f362c9]",
>>>
>>>
>>> "(BlueStore::queue_transactions(boost::intrusive_ptr&,
>>> std::vector
>>> >&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x854)
>>> [0x55c529fb7664]",
>>> "(non-virtual thunk to
>>> PrimaryLogPG::queue_transactions(std::vector>> std::allocator >&,
>>> boost::intrusive_ptr)+0x58) [0x55c529c0ee98]",
>>> "(ReplicatedBackend::submit_transaction(hobject_t const&,
>>> object_stat_sum_t const&, eversion_t const&,
>>> std::unique_ptr>> std::default_delete >&&, eversion_t const&, eversion_t
>>> const&, std::vector >&&,
>>> std::optional&, Context*, unsigned long,
>>> osd_reqid_t,
>>> boost::intrusive_ptr)+0xcad) [0x55c529dfbedd]",
>>> "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
>>> PrimaryLogPG::OpContext*)+0xcf0) [0x55c529b7a630]",
>>> "(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x115d)
>>> [0x55c529bd65ed]",
>>> "(PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2de2)
>>> [0x55c529bdf162]",
>>> "(PrimaryLogPG::do_request(boost::intrusive_ptr&,
>>> ThreadPool::TPHandle&)+0xd1c) [0x55c529be64ac]",
>>> "(OSD::dequeue_op(boost::intrusive_ptr,
>>> boost::intrusive_ptr, ThreadPool::TPHandle&)+0x309)
>>> [0x55c529a6f1b9]",
>>> "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
>>> boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x68)
>>> [0x55c529ccc868]",
>>> "(OSD::ShardedOpWQ::_process(unsigned int,
>>> ceph::heartbeat_handle_d*)+0xa58) [0x55c529a8f1e8]",
>>> "(ShardedThreadPool::shardedthreadpool_worker(unsigned
>>> int)+0x5c4)
>>> [0x55c52a0fa6c4]",
>>> "(ShardedThreadPool::WorkThreadSharded::entry()+0x14)
>>> [0x55c52a0fd364]",
>>> "/lib64/libpthread.so.0(+0x814a) [0x7f4d3108f14a]",
>>> "clone()"
>>> ],
>>> "ceph_version": "16.2.6",
>>> "crash_id":
>>> "2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763",
>>> "entity_name": "osd.14",
>>> "os_id": "centos",
>>> "os_name": "CentOS Linux",
>>> "os_version": "8",
>>> "os_version_id": "8",
>>> "process_name": "ceph-osd",
>>> "stack_sig":
>>> "46b81ca079908da081327cbc114a9c1801dfdbb81303b85fff0d4107a1aeeabe",
>>> "timestamp": "2021-10-12T14:32:49.169552Z",
>>> "utsname_hostname": "",
>>> "utsname_machine": "x86_64",
>>> "utsname_release": "5.11.0-37-generic",
>>> "utsname_sysname": "Linux",
>>> "utsname_version": "#41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC
>>> 2021"
>>>
>>> dmesg on host shows:
>>>
>>> [66258.080040] BUG: kernel NULL pointer dereference, address:
>>> 00c0
>>> [66258.080067] #PF: supervisor read access in kernel mode
>>> [66258.080081] #PF: error_code(0x) - not-present page
>>> [66258.080093] PGD 0 P4D 0
>>> [66258.080105] Oops:  [#1] SMP NOPTI
>>> [66258.080115] CPU: 35 PID: 4955 Comm: zabbix_agentd Not tainted
>>> 5.11.0-37-generic #41~20.04.2-Ubuntu
>>> [66258.080137] Hardware name: Supermicro SSG-6049P-E1CR60L+/X11DSC+, BIOS
>>> 3.3 02/21/2020
>>> [66258.080154] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
>>> [66258.080171] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff
>>> 5b
>>> 41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00

[ceph-users] Re: OSD Crashes in 16.2.6

2021-10-12 Thread Igor Fedotov
FYI: telemetry reports that triggered the above-mentioned ticket 
creation indicate kernel v4.18...


"utsname_release": "4.18.0-305.10.2.el8_4.x86_64"


On 10/12/2021 8:53 PM, Zakhar Kirpichenko wrote:

Hi,

This could be kernel-related, as I've seen similar reports in Proxmox
forum. Specifically, 5.11.x with Ceph seems to be hitting kernel NULL
pointer dereference. Perhaps a newer kernel would help. If not, I'm running
16.2.6 with kernel 5.4.x without any issues.

Best regards,
Z

On Tue, Oct 12, 2021 at 8:31 PM Marco Pizzolo 
wrote:


Hello everyone,

We are seeing instability in 20.04.3 using HWE kernel and Ceph 16.2.6
w/Podman.

We have OSDs that fail after <24 hours and I'm not sure why.

Seeing this:

ceph crash info
2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763
{
 "backtrace": [
 "/lib64/libpthread.so.0(+0x12b20) [0x7f4d31099b20]",
 "pthread_cond_wait()",

"(std::condition_variable::wait(std::unique_lock&)+0x10)
[0x7f4d306de8f0]",
 "(Throttle::_wait(long, std::unique_lock&)+0x10d)
[0x55c52a0f077d]",
 "(Throttle::get(long, long)+0xb9) [0x55c52a0f1199]",
 "(BlueStore::BlueStoreThrottle::try_start_transaction(KeyValueDB&,
BlueStore::TransContext&, std::chrono::time_point > >)+0x29)
[0x55c529f362c9]",


"(BlueStore::queue_transactions(boost::intrusive_ptr&,
std::vector

&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x854)

[0x55c529fb7664]",
 "(non-virtual thunk to
PrimaryLogPG::queue_transactions(std::vector >&,
boost::intrusive_ptr)+0x58) [0x55c529c0ee98]",
 "(ReplicatedBackend::submit_transaction(hobject_t const&,
object_stat_sum_t const&, eversion_t const&, std::unique_ptr >&&, eversion_t const&, eversion_t
const&, std::vector >&&,
std::optional&, Context*, unsigned long, osd_reqid_t,
boost::intrusive_ptr)+0xcad) [0x55c529dfbedd]",
 "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
PrimaryLogPG::OpContext*)+0xcf0) [0x55c529b7a630]",
 "(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x115d)
[0x55c529bd65ed]",
 "(PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2de2)
[0x55c529bdf162]",
 "(PrimaryLogPG::do_request(boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0xd1c) [0x55c529be64ac]",
 "(OSD::dequeue_op(boost::intrusive_ptr,
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x309)
[0x55c529a6f1b9]",
 "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x68) [0x55c529ccc868]",
 "(OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xa58) [0x55c529a8f1e8]",
 "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
[0x55c52a0fa6c4]",
 "(ShardedThreadPool::WorkThreadSharded::entry()+0x14)
[0x55c52a0fd364]",
 "/lib64/libpthread.so.0(+0x814a) [0x7f4d3108f14a]",
 "clone()"
 ],
 "ceph_version": "16.2.6",
 "crash_id":
"2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763",
 "entity_name": "osd.14",
 "os_id": "centos",
 "os_name": "CentOS Linux",
 "os_version": "8",
 "os_version_id": "8",
 "process_name": "ceph-osd",
 "stack_sig":
"46b81ca079908da081327cbc114a9c1801dfdbb81303b85fff0d4107a1aeeabe",
 "timestamp": "2021-10-12T14:32:49.169552Z",
 "utsname_hostname": "",
 "utsname_machine": "x86_64",
 "utsname_release": "5.11.0-37-generic",
 "utsname_sysname": "Linux",
 "utsname_version": "#41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC
2021"

dmesg on host shows:

[66258.080040] BUG: kernel NULL pointer dereference, address:
00c0
[66258.080067] #PF: supervisor read access in kernel mode
[66258.080081] #PF: error_code(0x) - not-present page
[66258.080093] PGD 0 P4D 0
[66258.080105] Oops:  [#1] SMP NOPTI
[66258.080115] CPU: 35 PID: 4955 Comm: zabbix_agentd Not tainted
5.11.0-37-generic #41~20.04.2-Ubuntu
[66258.080137] Hardware name: Supermicro SSG-6049P-E1CR60L+/X11DSC+, BIOS
3.3 02/21/2020
[66258.080154] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
[66258.080171] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b
41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10
<48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
[66258.080210] RSP: 0018:a251249fbbc0 EFLAGS: 00010283
[66258.080224] RAX:  RBX: a251249fbc48 RCX:
0002
[66258.080240] RDX: 0001 RSI: 0202 RDI:
9382f4d8e000
[66258.080256] RBP: a251249fbbf8 R08:  R09:
0035
[66258.080272] R10: abcc77118461cefd R11: 93b382257076 R12:
9382f4d8e000
[66258.080288] R13: 9382f5670c00 R14:  R15:
0001
[66258.080304] FS:  7fedc0ef46c0() GS:93e13f4c()
knlGS:
[66258.080322] CS:  0010 DS:  ES:  CR0: 80050033
[66258.080335] CR2: 00c0 CR3: 00011cc44005 CR4:
007706e0
[66258.080351] DR0:  DR1: 0

[ceph-users] Re: OSD Crashes in 16.2.6

2021-10-12 Thread Igor Fedotov

Zakhar,

could you please point me to the similar reports at Proxmox forum?

Curious what's the Ceph release mentioned there...

Thanks,

Igor

On 10/12/2021 8:53 PM, Zakhar Kirpichenko wrote:

Hi,

This could be kernel-related, as I've seen similar reports in Proxmox
forum. Specifically, 5.11.x with Ceph seems to be hitting kernel NULL
pointer dereference. Perhaps a newer kernel would help. If not, I'm running
16.2.6 with kernel 5.4.x without any issues.

Best regards,
Z

On Tue, Oct 12, 2021 at 8:31 PM Marco Pizzolo 
wrote:


Hello everyone,

We are seeing instability in 20.04.3 using HWE kernel and Ceph 16.2.6
w/Podman.

We have OSDs that fail after <24 hours and I'm not sure why.

Seeing this:

ceph crash info
2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763
{
 "backtrace": [
 "/lib64/libpthread.so.0(+0x12b20) [0x7f4d31099b20]",
 "pthread_cond_wait()",

"(std::condition_variable::wait(std::unique_lock&)+0x10)
[0x7f4d306de8f0]",
 "(Throttle::_wait(long, std::unique_lock&)+0x10d)
[0x55c52a0f077d]",
 "(Throttle::get(long, long)+0xb9) [0x55c52a0f1199]",
 "(BlueStore::BlueStoreThrottle::try_start_transaction(KeyValueDB&,
BlueStore::TransContext&, std::chrono::time_point > >)+0x29)
[0x55c529f362c9]",


"(BlueStore::queue_transactions(boost::intrusive_ptr&,
std::vector

&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x854)

[0x55c529fb7664]",
 "(non-virtual thunk to
PrimaryLogPG::queue_transactions(std::vector >&,
boost::intrusive_ptr)+0x58) [0x55c529c0ee98]",
 "(ReplicatedBackend::submit_transaction(hobject_t const&,
object_stat_sum_t const&, eversion_t const&, std::unique_ptr >&&, eversion_t const&, eversion_t
const&, std::vector >&&,
std::optional&, Context*, unsigned long, osd_reqid_t,
boost::intrusive_ptr)+0xcad) [0x55c529dfbedd]",
 "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
PrimaryLogPG::OpContext*)+0xcf0) [0x55c529b7a630]",
 "(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x115d)
[0x55c529bd65ed]",
 "(PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2de2)
[0x55c529bdf162]",
 "(PrimaryLogPG::do_request(boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0xd1c) [0x55c529be64ac]",
 "(OSD::dequeue_op(boost::intrusive_ptr,
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x309)
[0x55c529a6f1b9]",
 "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x68) [0x55c529ccc868]",
 "(OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xa58) [0x55c529a8f1e8]",
 "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
[0x55c52a0fa6c4]",
 "(ShardedThreadPool::WorkThreadSharded::entry()+0x14)
[0x55c52a0fd364]",
 "/lib64/libpthread.so.0(+0x814a) [0x7f4d3108f14a]",
 "clone()"
 ],
 "ceph_version": "16.2.6",
 "crash_id":
"2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763",
 "entity_name": "osd.14",
 "os_id": "centos",
 "os_name": "CentOS Linux",
 "os_version": "8",
 "os_version_id": "8",
 "process_name": "ceph-osd",
 "stack_sig":
"46b81ca079908da081327cbc114a9c1801dfdbb81303b85fff0d4107a1aeeabe",
 "timestamp": "2021-10-12T14:32:49.169552Z",
 "utsname_hostname": "",
 "utsname_machine": "x86_64",
 "utsname_release": "5.11.0-37-generic",
 "utsname_sysname": "Linux",
 "utsname_version": "#41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC
2021"

dmesg on host shows:

[66258.080040] BUG: kernel NULL pointer dereference, address:
00c0
[66258.080067] #PF: supervisor read access in kernel mode
[66258.080081] #PF: error_code(0x) - not-present page
[66258.080093] PGD 0 P4D 0
[66258.080105] Oops:  [#1] SMP NOPTI
[66258.080115] CPU: 35 PID: 4955 Comm: zabbix_agentd Not tainted
5.11.0-37-generic #41~20.04.2-Ubuntu
[66258.080137] Hardware name: Supermicro SSG-6049P-E1CR60L+/X11DSC+, BIOS
3.3 02/21/2020
[66258.080154] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
[66258.080171] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b
41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10
<48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
[66258.080210] RSP: 0018:a251249fbbc0 EFLAGS: 00010283
[66258.080224] RAX:  RBX: a251249fbc48 RCX:
0002
[66258.080240] RDX: 0001 RSI: 0202 RDI:
9382f4d8e000
[66258.080256] RBP: a251249fbbf8 R08:  R09:
0035
[66258.080272] R10: abcc77118461cefd R11: 93b382257076 R12:
9382f4d8e000
[66258.080288] R13: 9382f5670c00 R14:  R15:
0001
[66258.080304] FS:  7fedc0ef46c0() GS:93e13f4c()
knlGS:
[66258.080322] CS:  0010 DS:  ES:  CR0: 80050033
[66258.080335] CR2: 00c0 CR3: 00011cc44005 CR4:
007706e0
[66258.080351] DR0:  DR1: 00

[ceph-users] Re: OSD Crashes in 16.2.6

2021-10-12 Thread Marco Pizzolo
Igor,

Thanks for the response.  One that I found was:
https://forum.proxmox.com/threads/pve-7-0-bug-kernel-null-pointer-dereference-address-00c0-pf-error_code-0x-no-web-access-no-ssh.96598/

In regards to your questions, this is a new cluster deployed at 16.2.6.

It currently has less than 40TB of data, and is dedicated to CephFS.  We
are copying data over from a second Ceph cluster running Nautilus to
mitigate the risk of an in place upgrade.



On Tue, Oct 12, 2021 at 2:22 PM Igor Fedotov  wrote:

> Zakhar,
>
> could you please point me to the similar reports at Proxmox forum?
>
> Curious what's the Ceph release mentioned there...
>
> Thanks,
>
> Igor
>
> On 10/12/2021 8:53 PM, Zakhar Kirpichenko wrote:
> > Hi,
> >
> > This could be kernel-related, as I've seen similar reports in Proxmox
> > forum. Specifically, 5.11.x with Ceph seems to be hitting kernel NULL
> > pointer dereference. Perhaps a newer kernel would help. If not, I'm
> running
> > 16.2.6 with kernel 5.4.x without any issues.
> >
> > Best regards,
> > Z
> >
> > On Tue, Oct 12, 2021 at 8:31 PM Marco Pizzolo 
> > wrote:
> >
> >> Hello everyone,
> >>
> >> We are seeing instability in 20.04.3 using HWE kernel and Ceph 16.2.6
> >> w/Podman.
> >>
> >> We have OSDs that fail after <24 hours and I'm not sure why.
> >>
> >> Seeing this:
> >>
> >> ceph crash info
> >> 2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763
> >> {
> >>  "backtrace": [
> >>  "/lib64/libpthread.so.0(+0x12b20) [0x7f4d31099b20]",
> >>  "pthread_cond_wait()",
> >>
> >> "(std::condition_variable::wait(std::unique_lock&)+0x10)
> >> [0x7f4d306de8f0]",
> >>  "(Throttle::_wait(long, std::unique_lock&)+0x10d)
> >> [0x55c52a0f077d]",
> >>  "(Throttle::get(long, long)+0xb9) [0x55c52a0f1199]",
> >>
> "(BlueStore::BlueStoreThrottle::try_start_transaction(KeyValueDB&,
> >> BlueStore::TransContext&, std::chrono::time_point >> std::chrono::duration >
> >)+0x29)
> >> [0x55c529f362c9]",
> >>
> >>
> >>
> "(BlueStore::queue_transactions(boost::intrusive_ptr&,
> >> std::vector
> >>> &, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x854)
> >> [0x55c529fb7664]",
> >>  "(non-virtual thunk to
> >> PrimaryLogPG::queue_transactions(std::vector >> std::allocator >&,
> >> boost::intrusive_ptr)+0x58) [0x55c529c0ee98]",
> >>  "(ReplicatedBackend::submit_transaction(hobject_t const&,
> >> object_stat_sum_t const&, eversion_t const&,
> std::unique_ptr >> std::default_delete >&&, eversion_t const&, eversion_t
> >> const&, std::vector >&&,
> >> std::optional&, Context*, unsigned long,
> osd_reqid_t,
> >> boost::intrusive_ptr)+0xcad) [0x55c529dfbedd]",
> >>  "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
> >> PrimaryLogPG::OpContext*)+0xcf0) [0x55c529b7a630]",
> >>  "(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x115d)
> >> [0x55c529bd65ed]",
> >>  "(PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2de2)
> >> [0x55c529bdf162]",
> >>  "(PrimaryLogPG::do_request(boost::intrusive_ptr&,
> >> ThreadPool::TPHandle&)+0xd1c) [0x55c529be64ac]",
> >>  "(OSD::dequeue_op(boost::intrusive_ptr,
> >> boost::intrusive_ptr, ThreadPool::TPHandle&)+0x309)
> >> [0x55c529a6f1b9]",
> >>  "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
> >> boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x68)
> [0x55c529ccc868]",
> >>  "(OSD::ShardedOpWQ::_process(unsigned int,
> >> ceph::heartbeat_handle_d*)+0xa58) [0x55c529a8f1e8]",
> >>  "(ShardedThreadPool::shardedthreadpool_worker(unsigned
> int)+0x5c4)
> >> [0x55c52a0fa6c4]",
> >>  "(ShardedThreadPool::WorkThreadSharded::entry()+0x14)
> >> [0x55c52a0fd364]",
> >>  "/lib64/libpthread.so.0(+0x814a) [0x7f4d3108f14a]",
> >>  "clone()"
> >>  ],
> >>  "ceph_version": "16.2.6",
> >>  "crash_id":
> >> "2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763",
> >>  "entity_name": "osd.14",
> >>  "os_id": "centos",
> >>  "os_name": "CentOS Linux",
> >>  "os_version": "8",
> >>  "os_version_id": "8",
> >>  "process_name": "ceph-osd",
> >>  "stack_sig":
> >> "46b81ca079908da081327cbc114a9c1801dfdbb81303b85fff0d4107a1aeeabe",
> >>  "timestamp": "2021-10-12T14:32:49.169552Z",
> >>  "utsname_hostname": "",
> >>  "utsname_machine": "x86_64",
> >>  "utsname_release": "5.11.0-37-generic",
> >>  "utsname_sysname": "Linux",
> >>  "utsname_version": "#41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC
> >> 2021"
> >>
> >> dmesg on host shows:
> >>
> >> [66258.080040] BUG: kernel NULL pointer dereference, address:
> >> 00c0
> >> [66258.080067] #PF: supervisor read access in kernel mode
> >> [66258.080081] #PF: error_code(0x) - not-present page
> >> [66258.080093] PGD 0 P4D 0
> >> [66258.080105] Oops:  [#1] SMP NOPTI
> >> [66258.080115] CPU: 35 PID: 4955 Comm: zabbix_agentd Not tainted
> >> 5.11.0-37-generic #41~20.04.2-Ubuntu
> >> [66258.080137

[ceph-users] Re: ceph full-object read crc != expected on xxx:head

2021-10-12 Thread Gregory Farnum
On Tue, Oct 12, 2021 at 12:52 AM Frank Schilder  wrote:
>
> Is there a way (mimic latest) to find out which PG contains the object that 
> caused this error:
>
> 2021-10-11 23:46:19.631006 osd.335 osd.335 192.168.32.87:6838/8605 623 : 
> cluster [ERR]  full-object read crc 0x6c3a7719 != expected 0xd27f7a2c on 
> 19:28b9843f:::3b43237.:head

19:28b9843f:::3b43237. contains the pool (19), the object
name (3b43237.), and I don't remember if the middle bit
28b9843f is the pg or the hash or the nibble-reversed hash — but you
should be able to figure it out by looking at which of those
characters actually exist in pg names in pool 19.
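
If it helps, the quickest way to resolve it is to let Ceph do the mapping
(pool and object names below are placeholders; note the object name in the
log line above is truncated in this digest, so use the full one):

  # prints the PG id plus up/acting OSD sets for that object
  ceph osd map <pool-name> <object-name>

  # cross-check against the PGs that actually exist in pool 19
  ceph pg ls-by-pool <pool-name>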

>
> In all references I could find the error message contains the PG. The above 
> doesn't. There is no additional information in the OSD log of 335.
>
> The above read error did not create a health warn/error state. Is this error 
> automatically fixed?

Which version are you running? I *think* this marks the object as
needing repair and goes off to do so, but it may depend on the release
you're running.
-Greg

>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD Crashes in 16.2.6

2021-10-12 Thread Zakhar Kirpichenko
Indeed, this is the PVE forum post I saw earlier.

/Z

On Tue, Oct 12, 2021 at 9:27 PM Marco Pizzolo 
wrote:

> Igor,
>
> Thanks for the response.  One that I found was:
> https://forum.proxmox.com/threads/pve-7-0-bug-kernel-null-pointer-dereference-address-00c0-pf-error_code-0x-no-web-access-no-ssh.96598/
>
> In regards to your questions, this is a new cluster deployed at 16.2.6.
>
> It currently has less than 40TB of data, and is dedicated to CephFS.  We
> are copying data over from a second Ceph cluster running Nautilus to
> mitigate the risk of an in place upgrade.
>
>
>
> On Tue, Oct 12, 2021 at 2:22 PM Igor Fedotov 
> wrote:
>
>> Zakhar,
>>
>> could you please point me to the similar reports at Proxmox forum?
>>
>> Curious what's the Ceph release mentioned there...
>>
>> Thanks,
>>
>> Igor
>>
>> On 10/12/2021 8:53 PM, Zakhar Kirpichenko wrote:
>> > Hi,
>> >
>> > This could be kernel-related, as I've seen similar reports in Proxmox
>> > forum. Specifically, 5.11.x with Ceph seems to be hitting kernel NULL
>> > pointer dereference. Perhaps a newer kernel would help. If not, I'm
>> running
>> > 16.2.6 with kernel 5.4.x without any issues.
>> >
>> > Best regards,
>> > Z
>> >
>> > On Tue, Oct 12, 2021 at 8:31 PM Marco Pizzolo 
>> > wrote:
>> >
>> >> Hello everyone,
>> >>
>> >> We are seeing instability in 20.04.3 using HWE kernel and Ceph 16.2.6
>> >> w/Podman.
>> >>
>> >> We have OSDs that fail after <24 hours and I'm not sure why.
>> >>
>> >> Seeing this:
>> >>
>> >> ceph crash info
>> >> 2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763
>> >> {
>> >>  "backtrace": [
>> >>  "/lib64/libpthread.so.0(+0x12b20) [0x7f4d31099b20]",
>> >>  "pthread_cond_wait()",
>> >>
>> >> "(std::condition_variable::wait(std::unique_lock&)+0x10)
>> >> [0x7f4d306de8f0]",
>> >>  "(Throttle::_wait(long, std::unique_lock&)+0x10d)
>> >> [0x55c52a0f077d]",
>> >>  "(Throttle::get(long, long)+0xb9) [0x55c52a0f1199]",
>> >>
>> "(BlueStore::BlueStoreThrottle::try_start_transaction(KeyValueDB&,
>> >> BlueStore::TransContext&, std::chrono::time_point> >> std::chrono::duration >
>> >)+0x29)
>> >> [0x55c529f362c9]",
>> >>
>> >>
>> >>
>> "(BlueStore::queue_transactions(boost::intrusive_ptr&,
>> >> std::vector> std::allocator
>> >>> &, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x854)
>> >> [0x55c529fb7664]",
>> >>  "(non-virtual thunk to
>> >> PrimaryLogPG::queue_transactions(std::vector> >> std::allocator >&,
>> >> boost::intrusive_ptr)+0x58) [0x55c529c0ee98]",
>> >>  "(ReplicatedBackend::submit_transaction(hobject_t const&,
>> >> object_stat_sum_t const&, eversion_t const&,
>> std::unique_ptr> >> std::default_delete >&&, eversion_t const&, eversion_t
>> >> const&, std::vector >&&,
>> >> std::optional&, Context*, unsigned long,
>> osd_reqid_t,
>> >> boost::intrusive_ptr)+0xcad) [0x55c529dfbedd]",
>> >>  "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
>> >> PrimaryLogPG::OpContext*)+0xcf0) [0x55c529b7a630]",
>> >>  "(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x115d)
>> >> [0x55c529bd65ed]",
>> >>
>> "(PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2de2)
>> >> [0x55c529bdf162]",
>> >>  "(PrimaryLogPG::do_request(boost::intrusive_ptr&,
>> >> ThreadPool::TPHandle&)+0xd1c) [0x55c529be64ac]",
>> >>  "(OSD::dequeue_op(boost::intrusive_ptr,
>> >> boost::intrusive_ptr, ThreadPool::TPHandle&)+0x309)
>> >> [0x55c529a6f1b9]",
>> >>  "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
>> >> boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x68)
>> [0x55c529ccc868]",
>> >>  "(OSD::ShardedOpWQ::_process(unsigned int,
>> >> ceph::heartbeat_handle_d*)+0xa58) [0x55c529a8f1e8]",
>> >>  "(ShardedThreadPool::shardedthreadpool_worker(unsigned
>> int)+0x5c4)
>> >> [0x55c52a0fa6c4]",
>> >>  "(ShardedThreadPool::WorkThreadSharded::entry()+0x14)
>> >> [0x55c52a0fd364]",
>> >>  "/lib64/libpthread.so.0(+0x814a) [0x7f4d3108f14a]",
>> >>  "clone()"
>> >>  ],
>> >>  "ceph_version": "16.2.6",
>> >>  "crash_id":
>> >> "2021-10-12T14:32:49.169552Z_d1ee94f7-1aaa-4221-abeb-68bd56d3c763",
>> >>  "entity_name": "osd.14",
>> >>  "os_id": "centos",
>> >>  "os_name": "CentOS Linux",
>> >>  "os_version": "8",
>> >>  "os_version_id": "8",
>> >>  "process_name": "ceph-osd",
>> >>  "stack_sig":
>> >> "46b81ca079908da081327cbc114a9c1801dfdbb81303b85fff0d4107a1aeeabe",
>> >>  "timestamp": "2021-10-12T14:32:49.169552Z",
>> >>  "utsname_hostname": "",
>> >>  "utsname_machine": "x86_64",
>> >>  "utsname_release": "5.11.0-37-generic",
>> >>  "utsname_sysname": "Linux",
>> >>  "utsname_version": "#41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC
>> >> 2021"
>> >>
>> >> dmesg on host shows:
>> >>
>> >> [66258.080040] BUG: kernel NULL pointer dereference, address:
>> >> 00c0
>> >> [66258.080067] #PF: supervisor read access in kernel mode
>> >> [

[ceph-users] Re: Where is my free space?

2021-10-12 Thread Gregory Farnum
On Mon, Oct 11, 2021 at 10:22 PM Szabo, Istvan (Agoda)
 wrote:
>
> Hi,
>
> 377TiB is the total cluster size, data pool 4:2 EC, stored 66TiB; how can
> the data pool be at 60% used??!!

Since you have an EC pool, you presumably have a CRUSH rule demanding
6 hosts. Among your seven hosts, 2 of them have only 3 SSDs for a
per-host size of ~44TB. Ceph is accounting for that imbalance and
apparently can only actually store ~110TiB while satisfying your
placement rules.
Plus one of your hosts has all the SSD OSDs down, so if it is down long
enough to be marked out you're going to become severely constricted in
usage.
-Greg
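
A couple of commands make the constraint visible, using only names already
shown in the output below:

  # which CRUSH rule the data pool uses, and its failure domain / chunk count
  ceph osd pool get ash.rgw.buckets.data crush_rule
  ceph osd crush rule dump

  # per-host raw capacity and usage; the two hosts with only 3 SSDs bound
  # what a 6-chunk EC placement can use
  ceph osd df tree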

>
>
> Some output:
> ceph df detail
> --- RAW STORAGE ---
> CLASS  SIZE AVAILUSED RAW USED  %RAW USED
> nvme12 TiB   11 TiB  128 MiB   1.2 TiB   9.81
> ssd377 TiB  269 TiB  100 TiB   108 TiB  28.65
> TOTAL  389 TiB  280 TiB  100 TiB   109 TiB  28.06
>
> --- POOLS ---
> POOLID  PGS  STORED   (DATA)   (OMAP)   OBJECTS  USED 
> (DATA)   (OMAP)   %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY   USED 
> COMPR  UNDER COMPR
> device_health_metrics11   49 MiB  0 B   49 MiB   50   98 MiB  
> 0 B   98 MiB  0 73 TiB  N/AN/A  50
>  0 B  0 B
> .rgw.root2   32  1.1 MiB  1.1 MiB  4.5 KiB  159  3.9 MiB  
> 3.9 MiB   12 KiB  0 56 TiB  N/AN/A 159
>  0 B  0 B
> ash.rgw.log  6   32  1.8 GiB   46 KiB  1.8 GiB   73.83k  4.3 GiB  
> 4.4 MiB  4.3 GiB  0 59 TiB  N/AN/A  73.83k
>  0 B  0 B
> ash.rgw.control  7   32  2.9 KiB  0 B  2.9 KiB8  7.7 KiB  
> 0 B  7.7 KiB  0 56 TiB  N/AN/A   8
>  0 B  0 B
> ash.rgw.meta 88  554 KiB  531 KiB   23 KiB1.93k   22 MiB  
>  22 MiB   70 KiB  03.4 TiB  N/AN/A   1.93k
>  0 B  0 B
> ash.rgw.buckets.index   10  128  406 GiB  0 B  406 GiB   58.69k  1.2 TiB  
> 0 B  1.2 TiB  10.333.4 TiB  N/AN/A  58.69k
>  0 B  0 B
> ash.rgw.buckets.data11   32   66 TiB   66 TiB  0 B1.21G   86 TiB  
>  86 TiB  0 B  37.16111 TiB  N/AN/A   1.21G
>  0 B  0 B
> ash.rgw.buckets.non-ec  15   32  8.4 MiB653 B  8.4 MiB   22   23 MiB  
> 264 KiB   23 MiB  0 54 TiB  N/AN/A  22
>  0 B  0 B
>
>
>
>
> rados df
> POOL_NAME  USED OBJECTS  CLONES  COPIES  
> MISSING_ON_PRIMARY  UNFOUND   DEGRADED   RD_OPS   RD   WR_OPS 
>   WR  USED COMPR  UNDER COMPR
> .rgw.root   3.9 MiB 159   0 477   
> 00 60  8905420   20 GiB 8171   19 MiB 
> 0 B  0 B
> ash.rgw.buckets.data 86 TiB  1205539864   0  7233239184   
> 00  904168110  36125678580  153 TiB  55724221429  174 TiB 
> 0 B  0 B
> ash.rgw.buckets.index   1.2 TiB   58688   0  176064   
> 00  0  65848675184   62 TiB  10672532772  6.8 TiB 
> 0 B  0 B
> ash.rgw.buckets.non-ec   23 MiB  22   0  66   
> 00  6  3999256  2.3 GiB  1369730  944 MiB 
> 0 B  0 B
> ash.rgw.control 7.7 KiB   8   0  24   
> 00  30  0 B8  0 B 
> 0 B  0 B
> ash.rgw.log 4.3 GiB   73830   0  221490   
> 00  39282  36922450608   34 TiB   5420884130  1.8 TiB 
> 0 B  0 B
> ash.rgw.meta 22 MiB1931   05793   
> 00  0692302142  528 GiB  4274154  2.0 GiB 
> 0 B  0 B
> device_health_metrics98 MiB  50   0 150   
> 00 5013588   40 MiB17758   46 MiB 
> 0 B  0 B
>
> total_objects1205674552
> total_used   109 TiB
> total_avail  280 TiB
> total_space  389 TiB
>
>
>
> 4 osd down because migrating the db to block.
>
> ceph osd tree
> ID   CLASS  WEIGHT TYPE NAME STATUS  REWEIGHT  PRI-AFF
> -1 398.17001  root default
> -11  61.12257  host server01
> 24   nvme1.74660  osd.24up   1.0  1.0
>   0ssd   14.84399  osd.0   down   1.0  1.0
> 10ssd   14.84399  osd.10  down   1.0  1.0
> 14ssd   14.84399  osd.14  down   1.0  1.0
> 20ssd   14.84399  osd.20  down   1.0  1.0
> -5  61.12257  host server02
> 25   nvme1.74660  

[ceph-users] Re: Ceph cluster Sync

2021-10-12 Thread Manuel Holtgrewe
To chime in here, there is

https://github.com/45Drives/cephgeorep

That allows CephFS replication pre-Pacific.

There is a mail thread somewhere on the list where a Ceph developer warns
about semantics issues of recursive mtime even on Pacific. However,
according to 45Drives they have never had an issue, so YMMV.

HTH

 wrote on Tue., Oct 12, 2021, 18:55:

> Michel;
>
> I am neither a Ceph evangelist, nor a Ceph expert, but here is my current
> understanding:
> Ceph clusters do not have in-built cross cluster synchronization.  That
> said, there are several things which might meet your needs.
>
> 1) If you're just planning your Ceph deployment, then the latest release
> (Pacific) introduced the concept of a stretch cluster, essentially a
> cluster which is stretched across datacenters (i.e. a relatively
> low-bandwidth, high-latency link)[1].
>
> 2) RADOSGW allows for uni-directional as well as bi-directional
> synchronization of the data that it handles.[2]
>
> 3) RBD provides mirroring functionality for the data it handles.[3]
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Vice President - Information Technology
> Perform Air International Inc.
> dhils...@performair.com
> www.PerformAir.com
>
> [1] https://docs.ceph.com/en/latest/rados/operations/stretch-mode/
> [2] https://docs.ceph.com/en/latest/radosgw/sync-modules/
> [3] https://docs.ceph.com/en/latest/rbd/rbd-mirroring/
>
>
> -Original Message-
> From: Michel Niyoyita [mailto:mico...@gmail.com]
> Sent: Tuesday, October 12, 2021 8:35 AM
> To: ceph-users
> Subject: [ceph-users] Ceph cluster Sync
>
> Dear team
>
> I want to build two different cluster: one for primary site and the second
> for DR site. I would like to ask if these two cluster can
> communicate(synchronized) each other and data written to the PR site be
> synchronized to the DR site ,  if once we got trouble for the PR site the
> DR automatically takeover.
>
> Please help me for the solution or advise me how to proceed
>
> Best Regards
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Broken mon state after (attempted) 16.2.5 -> 16.2.6 upgrade

2021-10-12 Thread Patrick Donnelly
I found the problem, thanks.

There is a tracker ticket: https://tracker.ceph.com/issues/52820

On Fri, Oct 8, 2021 at 8:01 AM Jonathan D. Proulx  wrote:
>
> Hi Patrick,
>
> Yes we had been successfully running on Pacific  v16.2.5
>
> Thanks for the pointer to the bug, we eventually ended up taking
> everything down and rebuilding the monstore using
> monstore-tool. Perhaps a longer and less pleasant path than necessary
> but it was effective.
>
> -Jon
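
For readers hitting something similar, the generic "recovery using OSDs"
procedure from the Ceph docs looks roughly like the following (paths and the
mon id are placeholders, and this is not necessarily exactly what was done
here):

  ms=/tmp/monstore; mkdir -p $ms
  # with the OSDs stopped, collect cluster maps from every OSD on every host
  for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path $osd --no-mon-config \
        --op update-mon-db --mon-store-path $ms
  done
  # rebuild the mon store, re-creating auth entries from the admin keyring
  ceph-monstore-tool $ms rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring
  # back up the broken store.db and move the rebuilt one into place
  mv /var/lib/ceph/mon/ceph-<mon-id>/store.db /var/lib/ceph/mon/ceph-<mon-id>/store.db.bak
  cp -r $ms/store.db /var/lib/ceph/mon/ceph-<mon-id>/store.db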
>
> On Thu, Oct 07, 2021 at 09:11:21PM -0400, Patrick Donnelly wrote:
> :Hello Jonathan,
> :
> :On Tue, Oct 5, 2021 at 9:13 AM Jonathan D. Proulx  wrote:
> :>
> :> In the middle of a normal cephadm upgrade from 16.2.5 to 16.2.6, after the 
> mgrs had successfully upgraded, 2/5 mons didn’t come back up (and the upgrade 
> stopped at that point). Attempting to manually restart the crashed mons 
> resulted in **all** of the other mons crashing too, usually with:
> :>
> :> terminate called after throwing an instance of 
> 'ceph::buffer::v15_2_0::malformed_input' what(): void 
> FSMap::decode(ceph::buffer::v15_2_0::list::const_iterator&) no longer 
> understand old encoding version v < 7: Malformed input
> :
> :You upgraded from v16.2.5 and not Octopus? I would expect your cluster
> :to crash when upgrading to any version of Pacific:
> :
> :https://tracker.ceph.com/issues/51673
> :
> :Only the crash error has changed from an assertion to an exception.
> :
> :--
> :Patrick Donnelly, Ph.D.
> :He / Him / His
> :Principal Software Engineer
> :Red Hat Sunnyvale, CA
> :GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
> :
>
> --
>


-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph cluster Sync

2021-10-12 Thread Tony Liu
For the PR-DR case, I am using RGW multi-site support to replicate backup images.

Tony
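
Heavily abbreviated, the multi-site setup behind that looks something like
the following (realm/zonegroup/zone names and endpoints are illustrative, and
the full procedure, including the synchronization user and restarting the
gateways, is in the RGW multisite docs):

  # on the primary (PR) site: realm, master zonegroup and master zone
  radosgw-admin realm create --rgw-realm=backup --default
  radosgw-admin zonegroup create --rgw-zonegroup=global \
      --endpoints=http://rgw-pr:8080 --master --default
  radosgw-admin zone create --rgw-zonegroup=global --rgw-zone=pr \
      --endpoints=http://rgw-pr:8080 --master --default
  radosgw-admin period update --commit

  # on the DR site: pull the realm and create the secondary zone
  radosgw-admin realm pull --url=http://rgw-pr:8080 \
      --access-key=<key> --secret=<secret>
  radosgw-admin zone create --rgw-zonegroup=global --rgw-zone=dr \
      --endpoints=http://rgw-dr:8080 --access-key=<key> --secret=<secret>
  radosgw-admin period update --commit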

From: Manuel Holtgrewe 
Sent: October 12, 2021 11:40 AM
To: dhils...@performair.com
Cc: mico...@gmail.com; ceph-users
Subject: [ceph-users] Re: Ceph cluster Sync

To chime in here, there is

https://github.com/45Drives/cephgeorep

That allows CephFS replication pre-Pacific.

There is a mail thread somewhere on the list where a Ceph developer warns
about semantics issues of recursive mtime even on Pacific. However,
according to 45Drives they have never had an issue, so YMMV.

HTH

 wrote on Tue., Oct 12, 2021, 18:55:

> Michel;
>
> I am neither a Ceph evangelist, nor a Ceph expert, but here is my current
> understanding:
> Ceph clusters do not have in-built cross cluster synchronization.  That
> said, there are several things which might meet your needs.
>
> 1) If you're just planning your Ceph deployment, then the latest release
> (Pacific) introduced the concept of a stretch cluster, essentially a
> cluster which is stretched across datacenters (i.e. a relatively
> low-bandwidth, high-latency link)[1].
>
> 2) RADOSGW allows for uni-directional as well as bi-directional
> synchronization of the data that it handles.[2]
>
> 3) RBD provides mirroring functionality for the data it handles.[3]
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Vice President - Information Technology
> Perform Air International Inc.
> dhils...@performair.com
> www.PerformAir.com
>
> [1] https://docs.ceph.com/en/latest/rados/operations/stretch-mode/
> [2] https://docs.ceph.com/en/latest/radosgw/sync-modules/
> [3] https://docs.ceph.com/en/latest/rbd/rbd-mirroring/
>
>
> -Original Message-
> From: Michel Niyoyita [mailto:mico...@gmail.com]
> Sent: Tuesday, October 12, 2021 8:35 AM
> To: ceph-users
> Subject: [ceph-users] Ceph cluster Sync
>
> Dear team
>
> I want to build two different cluster: one for primary site and the second
> for DR site. I would like to ask if these two cluster can
> communicate(synchronized) each other and data written to the PR site be
> synchronized to the DR site ,  if once we got trouble for the PR site the
> DR automatically takeover.
>
> Please help me for the solution or advise me how to proceed
>
> Best Regards
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io