Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-13 Thread Harald Staub



On 13.06.19 00:29, Sage Weil wrote:

On Thu, 13 Jun 2019, Simon Leinen wrote:

Sage Weil writes:

2019-06-12 23:40:43.555 7f724b27f0c0  1 rocksdb: do_open column families: 
[default]
Unrecognized command: stats
ceph-kvstore-tool: /build/ceph-14.2.1/src/rocksdb/db/version_set.cc:356: 
rocksdb::Version::~Version(): Assertion `path_id < 
cfd_->ioptions()->cf_paths.size()' failed.
*** Caught signal (Aborted) **



Ah, this looks promising.. it looks like it got it open and has some
problem with the error/teardown path.



Try 'compact' instead of 'stats'?
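(Presumably with the same invocation as the earlier runs against the
extracted copy, i.e. something like

  ceph-kvstore-tool rocksdb /mnt/ceph/db compact

where /mnt/ceph/db is the path used for the list/stats attempts above.)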


That ran for a while and then crashed, also in the destructor for
rocksdb::Version, but with an otherwise different backtrace.  I'm
attaching the log again.


Hmm, I'm pretty sure this is a shutdown problem, but not certain.  If you
do

  ceph-kvstore-tool rocksdb /mnt/ceph/db list > keys

is the keys file huge?  Can you send the head and tail of it so we can
make sure it looks complete?


It wrote 16 GB, then it aborted. Attached are key-list.log and keys.tail;
keys.head was uploaded separately:
ceph-post-file: 5ee5ff3c-ec89-4c2c-aaf9-1ae8ab3f59e3


One last thing to check:

  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-NNN list > keys

and see if that behaves similarly or crashes in the way it did before when
the OSD was starting.


This aborts without writing any keys. Attached keys-bs.log.

I tried this once more, after having added a big swap partition, but it 
looks very much the same. It aborted after exactly the same time. So it 
is not clear whether more RAM would help.


Thanks!
 Harry


If the exported version looks intact, I have a workaround that will
make the osd use that external rocksdb db instead of the embedded one...
basically,

  - symlink the db, db.wal, db.slow files from the osd dir
(/var/lib/ceph/osd/ceph-NNN/db -> ... etc)
  - ceph-bluestore-tool --dev /var/lib/ceph/osd/ceph-NNN/block set-label-key -k 
bluefs -v 0
  - start osd

but be warned this is fragile: there isn't a bluefs import function, so
this OSD will be permanently in that weird state.  The goal will be to get
it up and the PG/cluster behaving, and then eventually let rados recover
elsewhere and reprovision this osd.
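A minimal sketch of those steps, assuming the extracted DB lives under
/mnt/ceph (the db.wal/db.slow paths and the osd id NNN are placeholders):

  ln -s /mnt/ceph/db /var/lib/ceph/osd/ceph-NNN/db
  ln -s /mnt/ceph/db.wal /var/lib/ceph/osd/ceph-NNN/db.wal
  ln -s /mnt/ceph/db.slow /var/lib/ceph/osd/ceph-NNN/db.slow
  ceph-bluestore-tool --dev /var/lib/ceph/osd/ceph-NNN/block set-label-key -k bluefs -v 0
  systemctl start ceph-osd@NNN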

But first, let's make sure the external rocksdb has a complete set of
keys!

sage



root@unil0047:/mnt/ceph# ceph-kvstore-tool rocksdb /mnt/ceph/db list > keys
2019-06-13 07:46:52.905 7f070f8c30c0  1 rocksdb: do_open column families: [default]
ceph-kvstore-tool: /build/ceph-14.2.1/src/rocksdb/db/version_set.cc:356: rocksdb::Version::~Version(): Assertion `path_id < cfd_->ioptions()->cf_paths.size()' failed.
*** Caught signal (Aborted) **
 in thread 7f070f8c30c0 thread_name:ceph-kvstore-to
 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
 1: (()+0x12890) [0x7f07052b3890]
 2: (gsignal()+0xc7) [0x7f07041a3e97]
 3: (abort()+0x141) [0x7f07041a5801]
 4: (()+0x3039a) [0x7f070419539a]
 5: (()+0x30412) [0x7f0704195412]
 6: (rocksdb::Version::~Version()+0x224) [0x557b0961ffe4]
 7: (rocksdb::Version::Unref()+0x35) [0x557b09620065]
 8: (rocksdb::SuperVersion::Cleanup()+0x68) [0x557b09705328]
 9: (rocksdb::ColumnFamilyData::~ColumnFamilyData()+0xf4) [0x557b097083d4]
 10: (rocksdb::ColumnFamilySet::~ColumnFamilySet()+0xb8) [0x557b09708ba8]
 11: (rocksdb::VersionSet::~VersionSet()+0x4d) [0x557b09613a5d]
 12: (rocksdb::DBImpl::CloseHelper()+0x6a8) [0x557b09540868]
 13: (rocksdb::DBImpl::~DBImpl()+0x65b) [0x557b0954bdeb]
 14: (rocksdb::DBImpl::~DBImpl()+0x11) [0x557b0954be21]
 15: (RocksDBStore::~RocksDBStore()+0xe9) [0x557b0935b349]
 16: (RocksDBStore::~RocksDBStore()+0x9) [0x557b0935b599]
 17: (main()+0x307) [0x557b091abfb7]
 18: (__libc_start_main()+0xe7) [0x7f0704186b97]
 19: (_start()+0x2a) [0x557b0928403a]
2019-06-13 07:55:42.091 7f070f8c30c0 -1 *** Caught signal (Aborted) **
 in thread 7f070f8c30c0 thread_name:ceph-kvstore-to

 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
 1: (()+0x12890) [0x7f07052b3890]
 2: (gsignal()+0xc7) [0x7f07041a3e97]
 3: (abort()+0x141) [0x7f07041a5801]
 4: (()+0x3039a) [0x7f070419539a]
 5: (()+0x30412) [0x7f0704195412]
 6: (rocksdb::Version::~Version()+0x224) [0x557b0961ffe4]
 7: (rocksdb::Version::Unref()+0x35) [0x557b09620065]
 8: (rocksdb::SuperVersion::Cleanup()+0x68) [0x557b09705328]
 9: (rocksdb::ColumnFamilyData::~ColumnFamilyData()+0xf4) [0x557b097083d4]
 10: (rocksdb::ColumnFamilySet::~ColumnFamilySet()+0xb8) [0x557b09708ba8]
 11: (rocksdb::VersionSet::~VersionSet()+0x4d) [0x557b09613a5d]
 12: (rocksdb::DBImpl::CloseHelper()+0x6a8) [0x557b09540868]
 13: (rocksdb::DBImpl::~DBImpl()+0x65b) [0x557b0954bdeb]
 14: (rocksdb::DBImpl::~DBImpl()+0x11) [0x557b0954be21]
 15: (RocksDBStore::~RocksDBStore()+0xe9) [0x557b0935b349]
 16: (RocksDBStore::~RocksDBStore()+0x9) [0x557b0935b599]
 17: (main()+0x307) [0x557b091abfb7]
 18: (__libc_start_main()+0xe7) [0x7f0704186b97]
 19: (_start()+0x2a) [0x557b0928403a]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-13 Thread Harald Staub

On 13.06.19 00:33, Sage Weil wrote:
[...]

One other thing to try before taking any drastic steps (as described
below):

  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-NNN fsck


This gives: fsck success

and the large alloc warnings:

tcmalloc: large alloc 2145263616 bytes == 0x562412e1 @ 
0x7fed890d6887 0x562385370229 0x5623853703a3 0x5623856c51ec 
0x56238566dce2 0x56238566fa05 0x562385681d41 0x562385476201 
0x5623853d5737 0x5623853ef418 0x562385420ae1 0x5623852901c2 
0x7fed7b97 0x56238536977a
tcmalloc: large alloc 4290519040 bytes == 0x562492bf2000 @ 
0x7fed890d6887 0x562385370229 0x5623853703a3 0x5623856c51ec 
0x56238566dce2 0x56238566fa05 0x562385681d41 0x562385476201 
0x5623853d5737 0x5623853ef418 0x562385420ae1 0x5623852901c2 
0x7fed7b97 0x56238536977a
tcmalloc: large alloc 8581029888 bytes == 0x562593068000 @ 
0x7fed890d6887 0x562385370229 0x5623853703a3 0x5623856c51ec 
0x56238566dce2 0x56238566fa05 0x562385681d41 0x562385476201 
0x5623853d5737 0x5623853ef418 0x562385420ae1 0x5623852901c2 
0x7fed7b97 0x56238536977a
tcmalloc: large alloc 17162051584 bytes == 0x562792fea000 @ 
0x7fed890d6887 0x562385370229 0x5623853703a3 0x5623856c51ec 
0x56238566dce2 0x56238566fa05 0x562385681d41 0x562385476201 
0x5623853d5737 0x5623853ef418 0x562385420ae1 0x5623852901c2 
0x7fed7b97 0x56238536977a
tcmalloc: large alloc 13559291904 bytes == 0x562b92eec000 @ 
0x7fed890d6887 0x562385370229 0x56238537181b 0x562385723a99 
0x56238566dd25 0x56238566fa05 0x562385681d41 0x562385476201 
0x5623853d5737 0x5623853ef418 0x562385420ae1 0x5623852901c2 
0x7fed7b97 0x56238536977a


Thanks!
 Harry

[...]


[ceph-users] num of objects degraded

2019-06-13 Thread zhanrzh...@teamsun.com.cn
hi everyone,
I am a bit confused about the number of degraded objects that ceph -s shows
when ceph is recovering.

ceph -s output is as follows:
[root@ceph-25 src]# ./ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
cluster 3d52f70a-d82f-46e3-9f03-be03e5e68e33
 health HEALTH_WARN
127 pgs degraded
56 pgs recovering
127 pgs stuck unclean
recovery 55478/60201 objects degraded (92.155%)
 monmap e1: 3 mons at 
{a=172.30.250.25:6789/0,b=172.30.250.25:6790/0,c=172.30.250.25:6791/0}
election epoch 8, quorum 0,1,2 a,b,c
 mdsmap e12: 3/3/3 up {0=a=up:active,1=c=up:active,2=b=up:active}
 osdmap e85: 18 osds: 18 up, 18 in
  pgmap v381: 144 pgs, 3 pools, 20012 MB data, 20067 objects
1401 GB used, 18688 GB / 20089 GB avail
55478/60201 objects degraded (92.155%)
  71 active+degraded
  56 active+recovering+degraded
  17 active+clean
recovery io 36040 kB/s, 35 objects/s

There are 3 pools in the cluster:
[root@ceph-25 src]# ./ceph osd lspools
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
0 rbd,1 cephfs_data,2 cephfs_metadata,

And the total number of objects is 20067:
[root@ceph-25 src]# ./rados -p rbd ls| wc -l
20013
[root@ceph-25 src]# ./rados -p cephfs_data ls | wc -l
0
[root@ceph-25 src]# ./rados -p cephfs_metadata ls | wc -l
54

But the number of objects that ceph -s shows is 60201.
I can't understand it. Can someone explain it to me?
Thanks!!!



Re: [ceph-users] num of objects degraded

2019-06-13 Thread Simon Ironside

Hi,

20067 objects of actual data
with 3x replication = 60201 object copies
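In other words: 20013 + 0 + 54 = 20067 objects, and 20067 x 3 replicas =
60201 object copies, which is the denominator ceph -s uses; 55478 / 60201
is the reported 92.155% degraded.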

On 13/06/2019 08:36, zhanrzh...@teamsun.com.cn wrote:


And the total number of objects is 20067:
[root@ceph-25 src]# ./rados -p rbd ls| wc -l
20013
[root@ceph-25 src]# ./rados -p cephfs_data ls | wc -l
0
[root@ceph-25 src]# ./rados -p cephfs_metadata ls | wc -l
54

But the number of objects that ceph -s shows is 60201.
I can't understand it. Can someone explain it to me?
Thanks!!!





[ceph-users] one pg blocked at active+undersized+degraded+remapped+backfilling

2019-06-13 Thread Brian Chang-Chien
We want to change the index pool (radosgw) crush rule from SATA to SSD. When we run
ceph osd pool set default.rgw.buckets.index crush_ruleset x
all of the index PGs migrated to SSD, but one PG is still stuck on SATA and
cannot be migrated;
its status is active+undersized+degraded+remapped+backfilling


ceph version : 10.2.5
default.rgw.buckets.index  size=2 min_size=1


How can I solve the problem of continuous backfilling?


 health HEALTH_WARN
 1 pgs backfilling
 1 pgs degraded
 1 pgs stuck unclean
 1 pgs undersized
 1 requests are blocked > 32 sec
 recovery 13/548944664 objects degraded (0.000%)
 recovery 31/548944664 objects misplaced (0.000%)
 monmap e1: 3 mons at {sa101=
192.168.8.71:6789/0,sa102=192.168.8.72:6789/0,sa103=192.168.8.73:6789/0}
election epoch 198, quorum 0,1,2 sa101,sa102,sa103
 osdmap e113094: 311 osds: 295 up, 295 in; 1 remapped pgs
flags noout,noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
  pgmap v62454723: 4752 pgs, 15 pools, 134 TB data, 174 Mobjects
409 TB used, 1071 TB / 1481 TB avail
13/548944664 objects degraded (0.000%)
31/548944664 objects misplaced (0.000%)
4751 active+clean
   1 active+undersized+degraded+remapped+backfilling

[sa101 ~]# ceph pg map 11.28
osdmap e113094 pg 11.28 (11.28) -> up [251,254] acting [192]

[sa101 ~]# ceph health detail
HEALTH_WARN 1 pgs backfilling; 1 pgs degraded; 1 pgs stuck unclean; 1 pgs
undersized; 1 requests are blocked > 32 sec; 1 osds have slow requests;
recovery 13/548949428 objects degraded (0.000%); recovery 31/548949428
objects misplaced (0.000%);
noout,noscrub,nodeep-scrub,sortbitwise,require_jewel_osds flag(s) set
pg 11.28 is stuck unclean for 624019.077931, current state
active+undersized+degraded+remapped+backfilling, last acting [192]
pg 11.28 is active+undersized+degraded+remapped+backfilling, acting [192]
1 ops are blocked > 32.768 sec on osd.192
1 osds have slow requests
recovery 13/548949428 objects degraded (0.000%)
recovery 31/548949428 objects misplaced (0.000%)
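A few generic commands that are commonly used to dig further into a PG stuck
like this (ids taken from the output above; a sketch, not a prescription):

  ceph pg 11.28 query
  ceph osd crush rule dump
  ceph osd find 192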


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-13 Thread Harald Staub

Idea received from Wido den Hollander:
bluestore rocksdb options = "compaction_readahead_size=0"

With this option, I just tried to start 1 of the 3 crashing OSDs, and it 
came up! I did this with "ceph osd set noin" for now.
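As a ceph.conf sketch this would look something like the following (the
[osd] section placement is an assumption, it could equally go under
[global]):

  [osd]
  bluestore rocksdb options = "compaction_readahead_size=0"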


Later it aborted:

2019-06-13 13:11:11.862 7f2a19f5f700  1 heartbeat_map reset_timeout 
'OSD::osd_op_tp thread 0x7f2a19f5f700' had timed out after 15
2019-06-13 13:11:11.862 7f2a19f5f700  1 heartbeat_map reset_timeout 
'OSD::osd_op_tp thread 0x7f2a19f5f700' had suicide timed out after 150
2019-06-13 13:11:11.862 7f2a37982700  0 --1- 
v1:[2001:620:5ca1:201::119]:6809/3426631 >> 
v1:[2001:620:5ca1:201::144]:6821/3627456 conn(0x564f65c0c000 
0x564f26d6d800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=18075 cs=1 
l=0).handle_connect_reply_2 connect got RESETSESSION

2019-06-13 13:11:11.862 7f2a19f5f700 -1 *** Caught signal (Aborted) **
 in thread 7f2a19f5f700 thread_name:tp_osd_tp

 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) 
nautilus (stable)

 1: (()+0x12890) [0x7f2a3a818890]
 2: (pthread_kill()+0x31) [0x7f2a3a8152d1]
 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char 
const*, unsigned long)+0x24b) [0x564d732ca2bb]
 4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, 
unsigned long, unsigned long)+0x255) [0x564d732ca895]
 5: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5a0) 
[0x564d732eb560]

 6: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x564d732ed5d0]
 7: (()+0x76db) [0x7f2a3a80d6db]
 8: (clone()+0x3f) [0x7f2a395ad88f]

I guess that this is because of pending backfilling and the noin flag? 
Afterwards it restarted by itself and came up. I stopped it again for now.


It looks healthy so far:
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
fsck success

Now we have to choose how to continue, trying to reduce the risk of 
losing data (most bucket indexes are intact currently). My guess would 
be to let this OSD (which was not the primary) go in and hope that it 
recovers. In case of a problem, maybe we could still use the other OSDs 
"somehow"? In case of success, we would bring back the other OSDs as well?


OTOH we could try to continue with the key dump from earlier today.

Any opinions?

Thanks!
 Harry

On 13.06.19 09:32, Harald Staub wrote:

On 13.06.19 00:33, Sage Weil wrote:
[...]

One other thing to try before taking any drastic steps (as described
below):

  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-NNN fsck


This gives: fsck success

and the large alloc warnings:

tcmalloc: large alloc 2145263616 bytes == 0x562412e1 @ 
0x7fed890d6887 0x562385370229 0x5623853703a3 0x5623856c51ec 
0x56238566dce2 0x56238566fa05 0x562385681d41 0x562385476201 
0x5623853d5737 0x5623853ef418 0x562385420ae1 0x5623852901c2 
0x7fed7b97 0x56238536977a
tcmalloc: large alloc 4290519040 bytes == 0x562492bf2000 @ 
0x7fed890d6887 0x562385370229 0x5623853703a3 0x5623856c51ec 
0x56238566dce2 0x56238566fa05 0x562385681d41 0x562385476201 
0x5623853d5737 0x5623853ef418 0x562385420ae1 0x5623852901c2 
0x7fed7b97 0x56238536977a
tcmalloc: large alloc 8581029888 bytes == 0x562593068000 @ 
0x7fed890d6887 0x562385370229 0x5623853703a3 0x5623856c51ec 
0x56238566dce2 0x56238566fa05 0x562385681d41 0x562385476201 
0x5623853d5737 0x5623853ef418 0x562385420ae1 0x5623852901c2 
0x7fed7b97 0x56238536977a
tcmalloc: large alloc 17162051584 bytes == 0x562792fea000 @ 
0x7fed890d6887 0x562385370229 0x5623853703a3 0x5623856c51ec 
0x56238566dce2 0x56238566fa05 0x562385681d41 0x562385476201 
0x5623853d5737 0x5623853ef418 0x562385420ae1 0x5623852901c2 
0x7fed7b97 0x56238536977a
tcmalloc: large alloc 13559291904 bytes == 0x562b92eec000 @ 
0x7fed890d6887 0x562385370229 0x56238537181b 0x562385723a99 
0x56238566dd25 0x56238566fa05 0x562385681d41 0x562385476201 
0x5623853d5737 0x5623853ef418 0x562385420ae1 0x5623852901c2 
0x7fed7b97 0x56238536977a


Thanks!
  Harry

[...]


[ceph-users] OSD: bind unable to bind on any port in range 6800-7300

2019-06-13 Thread Carlos Valiente
I'm running Ceph 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972)
nautilus (stable) on a Kubernetes cluster using Rook
(https://github.com/rook/rook), and my OSD daemons do not start.

Each OSD process runs inside a Kubernetes pod, and each pod gets its
own IP address. I spotted the following log lines in the output of one
of the OSD pods:

Processor -- bind unable to bind to v2:10.34.0.0:7300/17824 on any
port in range 6800-7300: (99) Cannot assign requested address

The pod in question has an IP address of 10.34.0.9, but for some
reason (if I'm interpreting the log message correctly), the OSD
process seems to attempt to listen on IP address 10.34.0.0 instead.

Is that the expected behaviour?

The full logs of the OSD process, together with the network
configuration of the Kubernetes pod they were running in, can be found
here:

https://github.com/rook/rook/issues/3140#issuecomment-501660446

I'm more than happy to provide additional information, in case anyone
here in this list is able to help.

C


[ceph-users] radosgw-admin list bucket based on "last modified"

2019-06-13 Thread M Ranga Swami Reddy
hello - Can we list the objects in rgw by last modified date?

For example, I want to list all the objects which were modified on 01 Jun
2019.

Thanks
Swami


Re: [ceph-users] Enable buffered write for bluestore

2019-06-13 Thread Tarek Zegar

http://docs.ceph.com/docs/master/rbd/rbd-config-ref/
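A minimal sketch of setting the option itself, assuming the intent is simply
to flip bluestore_default_buffered_write for the OSDs (whether that is a good
idea is a separate question):

  ceph config set osd bluestore_default_buffered_write true

or, in ceph.conf:

  [osd]
  bluestore default buffered write = true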





From: Trilok Agarwal
To: ceph-users@lists.ceph.com
Date: 06/12/2019 07:31 PM
Subject: [EXTERNAL] [ceph-users] Enable buffered write for bluestore
Sent by: "ceph-users"



Hi
How can we enable bluestore_default_buffered_write using ceph-conf utility
Any pointers would be appreciated






Re: [ceph-users] radosgw-admin list bucket based on "last modified"

2019-06-13 Thread Paul Emmerich
There's no (useful) internal ordering of these entries, so there isn't a
more efficient way than getting everything and sorting it :(
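A brute-force sketch of "get everything and filter", assuming the per-entry
metadata in the bucket listing carries an mtime (field names may differ by
release, so treat this as illustrative only):

  radosgw-admin bucket list --bucket=<bucket> | \
    jq -r '.[] | select(.meta.mtime | startswith("2019-06-01")) | .name'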


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Jun 13, 2019 at 3:33 PM M Ranga Swami Reddy 
wrote:

> hello - Can we list the objects in rgw, via last modified date?
>
> For example - I wanted to list all the objects which were modified 01 Jun
> 2019.
>
> Thanks
> Swami


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-13 Thread Sage Weil
On Thu, 13 Jun 2019, Harald Staub wrote:
> Idea received from Wido den Hollander:
> bluestore rocksdb options = "compaction_readahead_size=0"
> 
> With this option, I just tried to start 1 of the 3 crashing OSDs, and it came
> up! I did with "ceph osd set noin" for now.

Yay!

> Later it aborted:
> 
> 2019-06-13 13:11:11.862 7f2a19f5f700  1 heartbeat_map reset_timeout
> 'OSD::osd_op_tp thread 0x7f2a19f5f700' had timed out after 15
> 2019-06-13 13:11:11.862 7f2a19f5f700  1 heartbeat_map reset_timeout
> 'OSD::osd_op_tp thread 0x7f2a19f5f700' had suicide timed out after 150
> 2019-06-13 13:11:11.862 7f2a37982700  0 --1-
> v1:[2001:620:5ca1:201::119]:6809/3426631 >>
> v1:[2001:620:5ca1:201::144]:6821/3627456 conn(0x564f65c0c000 0x564f26d6d800
> :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=18075 cs=1 l=0).handle_connect_reply_2
> connect got RESETSESSION
> 2019-06-13 13:11:11.862 7f2a19f5f700 -1 *** Caught signal (Aborted) **
>  in thread 7f2a19f5f700 thread_name:tp_osd_tp
> 
>  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
> (stable)
>  1: (()+0x12890) [0x7f2a3a818890]
>  2: (pthread_kill()+0x31) [0x7f2a3a8152d1]
>  3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*,
> unsigned long)+0x24b) [0x564d732ca2bb]
>  4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned
> long, unsigned long)+0x255) [0x564d732ca895]
>  5: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5a0)
> [0x564d732eb560]
>  6: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x564d732ed5d0]
>  7: (()+0x76db) [0x7f2a3a80d6db]
>  8: (clone()+0x3f) [0x7f2a395ad88f]
> 
> I guess that this is because of pending backfilling and the noin flag?
> Afterwards it restarted by itself and came up. I stopped it again for now.

I think that increasing the various suicide timeout options will allow 
it to stay up long enough to clean up the ginormous objects:

 ceph config set osd.NNN osd_op_thread_suicide_timeout 2h

> It looks healthy so far:
> ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
> fsck success
> 
> Now we have to choose how to continue, trying to reduce the risk of losing
> data (most bucket indexes are intact currently). My guess would be to let this
> OSD (which was not the primary) go in and hope that it recovers. In case of a
> problem, maybe we could still use the other OSDs "somehow"? In case of
> success, we would bring back the other OSDs as well?
> 
> OTOH we could try to continue with the key dump from earlier today.

I would start all three osds the same way, with 'noout' set on the 
cluster.  You should try to avoid triggering recovery because it will have 
a hard time getting through the big index object on that bucket (i.e., it 
will take a long time, and might trigger some blocked ios and so forth).

(Side note that since you started the OSD read-write using the internal 
copy of rocksdb, don't forget that the external copy you extracted 
(/mnt/ceph/db?) is now stale!)

sage

> 
> Any opinions?
> 
> Thanks!
>  Harry
> 
> On 13.06.19 09:32, Harald Staub wrote:
> > On 13.06.19 00:33, Sage Weil wrote:
> > [...]
> > > One other thing to try before taking any drastic steps (as described
> > > below):
> > > 
> > >   ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-NNN fsck
> > 
> > This gives: fsck success
> > 
> > and the large alloc warnings:
> > 
> > tcmalloc: large alloc 2145263616 bytes == 0x562412e1 @ 0x7fed890d6887
> > 0x562385370229 0x5623853703a3 0x5623856c51ec 0x56238566dce2 0x56238566fa05
> > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418 0x562385420ae1
> > 0x5623852901c2 0x7fed7b97 0x56238536977a
> > tcmalloc: large alloc 4290519040 bytes == 0x562492bf2000 @ 0x7fed890d6887
> > 0x562385370229 0x5623853703a3 0x5623856c51ec 0x56238566dce2 0x56238566fa05
> > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418 0x562385420ae1
> > 0x5623852901c2 0x7fed7b97 0x56238536977a
> > tcmalloc: large alloc 8581029888 bytes == 0x562593068000 @ 0x7fed890d6887
> > 0x562385370229 0x5623853703a3 0x5623856c51ec 0x56238566dce2 0x56238566fa05
> > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418 0x562385420ae1
> > 0x5623852901c2 0x7fed7b97 0x56238536977a
> > tcmalloc: large alloc 17162051584 bytes == 0x562792fea000 @ 0x7fed890d6887
> > 0x562385370229 0x5623853703a3 0x5623856c51ec 0x56238566dce2 0x56238566fa05
> > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418 0x562385420ae1
> > 0x5623852901c2 0x7fed7b97 0x56238536977a
> > tcmalloc: large alloc 13559291904 bytes == 0x562b92eec000 @ 0x7fed890d6887
> > 0x562385370229 0x56238537181b 0x562385723a99 0x56238566dd25 0x56238566fa05
> > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418 0x562385420ae1
> > 0x5623852901c2 0x7fed7b97 0x56238536977a
> > 
> > Thanks!
> >   Harry
> > 
> > [...]

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-13 Thread Paul Emmerich
Something I had suggested off-list (repeated here if anyone else finds
themselves in a similar scenario):

since only one PG is dead and the OSD now seems to be alive enough to
start/mount: consider taking a backup of the affected PG with

ceph-objectstore-tool --op export --pgid X.YY

(That might also take a loong time)

That export can later be imported into any other OSD if these three dead
OSDs turn out to be a lost cause.
(Risk: importing the PG somewhere else might kill that OSD as well,
depending on the nature of the problem; I suggested new OSDs as import
target)
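A sketch of the export/import round trip (paths and ids are placeholders; the
OSD must be stopped while ceph-objectstore-tool runs against it):

  systemctl stop ceph-osd@NNN
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NNN \
    --op export --pgid X.YY --file /root/pg-X.YY.export

  # later, if needed, into a fresh OSD:
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-MMM \
    --op import --file /root/pg-X.YY.export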

Paul

On Thu, Jun 13, 2019 at 3:52 PM Sage Weil  wrote:

> On Thu, 13 Jun 2019, Harald Staub wrote:
> > Idea received from Wido den Hollander:
> > bluestore rocksdb options = "compaction_readahead_size=0"
> >
> > With this option, I just tried to start 1 of the 3 crashing OSDs, and it
> came
> > up! I did with "ceph osd set noin" for now.
>
> Yay!
>
> > Later it aborted:
> >
> > 2019-06-13 13:11:11.862 7f2a19f5f700  1 heartbeat_map reset_timeout
> > 'OSD::osd_op_tp thread 0x7f2a19f5f700' had timed out after 15
> > 2019-06-13 13:11:11.862 7f2a19f5f700  1 heartbeat_map reset_timeout
> > 'OSD::osd_op_tp thread 0x7f2a19f5f700' had suicide timed out after 150
> > 2019-06-13 13:11:11.862 7f2a37982700  0 --1-
> > v1:[2001:620:5ca1:201::119]:6809/3426631 >>
> > v1:[2001:620:5ca1:201::144]:6821/3627456 conn(0x564f65c0c000
> 0x564f26d6d800
> > :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=18075 cs=1
> l=0).handle_connect_reply_2
> > connect got RESETSESSION
> > 2019-06-13 13:11:11.862 7f2a19f5f700 -1 *** Caught signal (Aborted) **
> >  in thread 7f2a19f5f700 thread_name:tp_osd_tp
> >
> >  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
> > (stable)
> >  1: (()+0x12890) [0x7f2a3a818890]
> >  2: (pthread_kill()+0x31) [0x7f2a3a8152d1]
> >  3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char
> const*,
> > unsigned long)+0x24b) [0x564d732ca2bb]
> >  4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*,
> unsigned
> > long, unsigned long)+0x255) [0x564d732ca895]
> >  5: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5a0)
> > [0x564d732eb560]
> >  6: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x564d732ed5d0]
> >  7: (()+0x76db) [0x7f2a3a80d6db]
> >  8: (clone()+0x3f) [0x7f2a395ad88f]
> >
> > I guess that this is because of pending backfilling and the noin flag?
> > Afterwards it restarted by itself and came up. I stopped it again for
> now.
>
> I think that increasing the various suicide timeout options will allow
> it to stay up long enough to clean up the ginormous objects:
>
>  ceph config set osd.NNN osd_op_thread_suicide_timeout 2h
>
> > It looks healthy so far:
> > ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
> > fsck success
> >
> > Now we have to choose how to continue, trying to reduce the risk of
> losing
> > data (most bucket indexes are intact currently). My guess would be to
> let this
> > OSD (which was not the primary) go in and hope that it recovers. In case
> of a
> > problem, maybe we could still use the other OSDs "somehow"? In case of
> > success, we would bring back the other OSDs as well?
> >
> > OTOH we could try to continue with the key dump from earlier today.
>
> I would start all three osds the same way, with 'noout' set on the
> cluster.  You should try to avoid triggering recovery because it will have
> a hard time getting through the big index object on that bucket (i.e., it
> will take a long time, and might trigger some blocked ios and so forth).
>
> (Side note that since you started the OSD read-write using the internal
> copy of rocksdb, don't forget that the external copy you extracted
> (/mnt/ceph/db?) is now stale!)
>
> sage
>
> >
> > Any opinions?
> >
> > Thanks!
> >  Harry
> >
> > On 13.06.19 09:32, Harald Staub wrote:
> > > On 13.06.19 00:33, Sage Weil wrote:
> > > [...]
> > > > One other thing to try before taking any drastic steps (as described
> > > > below):
> > > >
> > > >   ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-NNN fsck
> > >
> > > This gives: fsck success
> > >
> > > and the large alloc warnings:
> > >
> > > tcmalloc: large alloc 2145263616 bytes == 0x562412e1 @
> 0x7fed890d6887
> > > 0x562385370229 0x5623853703a3 0x5623856c51ec 0x56238566dce2
> 0x56238566fa05
> > > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418
> 0x562385420ae1
> > > 0x5623852901c2 0x7fed7b97 0x56238536977a
> > > tcmalloc: large alloc 4290519040 bytes == 0x562492bf2000 @
> 0x7fed890d6887
> > > 0x562385370229 0x5623853703a3 0x5623856c51ec 0x56238566dce2
> 0x56238566fa05
> > > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418
> 0x562385420ae1
> > > 0x5623852901c2 0x7fed7b97 0x56238536977a
> > > tcmalloc: large alloc 8581029888 bytes == 0x562593068000 @
> 0x7fed890d6887
> > > 0x562385370229 0x5623853703a3 0x5623856c51ec 0x56238566dce2
> 0x56238566fa05
> > > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-13 Thread Sage Weil
On Thu, 13 Jun 2019, Paul Emmerich wrote:
> Something I had suggested off-list (repeated here if anyone else finds
> themselves in a similar scenario):
> 
> since only one PG is dead and the OSD now seems to be alive enough to
> start/mount: consider taking a backup of the affected PG with
> 
> ceph-objectstore-tool --op export --pgid X.YY
> 
> (That might also take a loong time)
> 
> That export can later be imported into any other OSD if these three dead
> OSDs turn out to be a lost cause.

Yes--this is a great suggestion!

There may also be other PGs that are stale because all 3 copies land on 
these 3 OSDs... and for those less-problematic PGs, importing them 
elsewhere is comparatively safe.  But doing those imports on fresh OSD(s) 
is always a good practice!

sage


> (Risk: importing the PG somewhere else might kill that OSD as well,
> depending on the nature of the problem; I suggested new OSDs as import
> target)
> 
> Paul
> 
> On Thu, Jun 13, 2019 at 3:52 PM Sage Weil  wrote:
> 
> > On Thu, 13 Jun 2019, Harald Staub wrote:
> > > Idea received from Wido den Hollander:
> > > bluestore rocksdb options = "compaction_readahead_size=0"
> > >
> > > With this option, I just tried to start 1 of the 3 crashing OSDs, and it
> > came
> > > up! I did with "ceph osd set noin" for now.
> >
> > Yay!
> >
> > > Later it aborted:
> > >
> > > 2019-06-13 13:11:11.862 7f2a19f5f700  1 heartbeat_map reset_timeout
> > > 'OSD::osd_op_tp thread 0x7f2a19f5f700' had timed out after 15
> > > 2019-06-13 13:11:11.862 7f2a19f5f700  1 heartbeat_map reset_timeout
> > > 'OSD::osd_op_tp thread 0x7f2a19f5f700' had suicide timed out after 150
> > > 2019-06-13 13:11:11.862 7f2a37982700  0 --1-
> > > v1:[2001:620:5ca1:201::119]:6809/3426631 >>
> > > v1:[2001:620:5ca1:201::144]:6821/3627456 conn(0x564f65c0c000
> > 0x564f26d6d800
> > > :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=18075 cs=1
> > l=0).handle_connect_reply_2
> > > connect got RESETSESSION
> > > 2019-06-13 13:11:11.862 7f2a19f5f700 -1 *** Caught signal (Aborted) **
> > >  in thread 7f2a19f5f700 thread_name:tp_osd_tp
> > >
> > >  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
> > > (stable)
> > >  1: (()+0x12890) [0x7f2a3a818890]
> > >  2: (pthread_kill()+0x31) [0x7f2a3a8152d1]
> > >  3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char
> > const*,
> > > unsigned long)+0x24b) [0x564d732ca2bb]
> > >  4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*,
> > unsigned
> > > long, unsigned long)+0x255) [0x564d732ca895]
> > >  5: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5a0)
> > > [0x564d732eb560]
> > >  6: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x564d732ed5d0]
> > >  7: (()+0x76db) [0x7f2a3a80d6db]
> > >  8: (clone()+0x3f) [0x7f2a395ad88f]
> > >
> > > I guess that this is because of pending backfilling and the noin flag?
> > > Afterwards it restarted by itself and came up. I stopped it again for
> > now.
> >
> > I think that increasing the various suicide timeout options will allow
> > it to stay up long enough to clean up the ginormous objects:
> >
> >  ceph config set osd.NNN osd_op_thread_suicide_timeout 2h
> >
> > > It looks healthy so far:
> > > ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
> > > fsck success
> > >
> > > Now we have to choose how to continue, trying to reduce the risk of
> > losing
> > > data (most bucket indexes are intact currently). My guess would be to
> > let this
> > > OSD (which was not the primary) go in and hope that it recovers. In case
> > of a
> > > problem, maybe we could still use the other OSDs "somehow"? In case of
> > > success, we would bring back the other OSDs as well?
> > >
> > > OTOH we could try to continue with the key dump from earlier today.
> >
> > I would start all three osds the same way, with 'noout' set on the
> > cluster.  You should try to avoid triggering recovery because it will have
> > a hard time getting through the big index object on that bucket (i.e., it
> > will take a long time, and might trigger some blocked ios and so forth).
> >
> > (Side note that since you started the OSD read-write using the internal
> > copy of rocksdb, don't forget that the external copy you extracted
> > (/mnt/ceph/db?) is now stale!)
> >
> > sage
> >
> > >
> > > Any opinions?
> > >
> > > Thanks!
> > >  Harry
> > >
> > > On 13.06.19 09:32, Harald Staub wrote:
> > > > On 13.06.19 00:33, Sage Weil wrote:
> > > > [...]
> > > > > One other thing to try before taking any drastic steps (as described
> > > > > below):
> > > > >
> > > > >   ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-NNN fsck
> > > >
> > > > This gives: fsck success
> > > >
> > > > and the large alloc warnings:
> > > >
> > > > tcmalloc: large alloc 2145263616 bytes == 0x562412e1 @
> > 0x7fed890d6887
> > > > 0x562385370229 0x5623853703a3 0x5623856c51ec 0x56238566dce2
> > 0x56238566fa05
> > > > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418
> > 0x562385420ae1
> > > > 0x5

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-13 Thread Harald Staub

On 13.06.19 15:52, Sage Weil wrote:

On Thu, 13 Jun 2019, Harald Staub wrote:

[...]

I think that increasing the various suicide timeout options will allow
it to stay up long enough to clean up the ginormous objects:

  ceph config set osd.NNN osd_op_thread_suicide_timeout 2h


ok


It looks healthy so far:
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
fsck success

Now we have to choose how to continue, trying to reduce the risk of losing
data (most bucket indexes are intact currently). My guess would be to let this
OSD (which was not the primary) go in and hope that it recovers. In case of a
problem, maybe we could still use the other OSDs "somehow"? In case of
success, we would bring back the other OSDs as well?

OTOH we could try to continue with the key dump from earlier today.


I would start all three osds the same way, with 'noout' set on the
cluster.  You should try to avoid triggering recovery because it will have
a hard time getting through the big index object on that bucket (i.e., it
will take a long time, and might trigger some blocked ios and so forth).


This I do not understand, how would I avoid recovery?


(Side note that since you started the OSD read-write using the internal
copy of rocksdb, don't forget that the external copy you extracted
(/mnt/ceph/db?) is now stale!)


As suggested by Paul Emmerich (see next E-mail in this thread), I 
exported this PG. It took not that long (20 minutes).


Thank you!
 Harry
[...]


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-13 Thread Sage Weil
On Thu, 13 Jun 2019, Harald Staub wrote:
> On 13.06.19 15:52, Sage Weil wrote:
> > On Thu, 13 Jun 2019, Harald Staub wrote:
> [...]
> > I think that increasing the various suicide timeout options will allow
> > it to stay up long enough to clean up the ginormous objects:
> > 
> >   ceph config set osd.NNN osd_op_thread_suicide_timeout 2h
> 
> ok
> 
> > > It looks healthy so far:
> > > ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
> > > fsck success
> > > 
> > > Now we have to choose how to continue, trying to reduce the risk of losing
> > > data (most bucket indexes are intact currently). My guess would be to let
> > > this
> > > OSD (which was not the primary) go in and hope that it recovers. In case
> > > of a
> > > problem, maybe we could still use the other OSDs "somehow"? In case of
> > > success, we would bring back the other OSDs as well?
> > > 
> > > OTOH we could try to continue with the key dump from earlier today.
> > 
> > I would start all three osds the same way, with 'noout' set on the
> > cluster.  You should try to avoid triggering recovery because it will have
> > a hard time getting through the big index object on that bucket (i.e., it
> > will take a long time, and might trigger some blocked ios and so forth).
> 
> This I do not understand, how would I avoid recovery?

Well, simply doing 'ceph osd set noout' is sufficient to avoid 
recovery, I suppose.  But in any case, getting at least 2 of the 
existing copies/OSDs online (assuming your pool's min_size=2) will mean 
you can finish the reshard process and clean up the big object without 
copying the PG anywhere.

I think you may as well do all 3 OSDs this way, then clean up the big 
object--that way in the end no data will have to move.

This is Nautilus, right?  If you scrub the PGs in question, that will also 
now raise the health alert if there are any remaining big omap objects... 
if that warning goes away you'll know you're done cleaning up.  A final 
rocksdb compaction should then be enough to remove any remaining weirdness 
from rocksdb's internal layout.
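For the reshard part, the status and progress side of this is normally
visible via radosgw-admin, e.g. (bucket name is a placeholder):

  radosgw-admin reshard status --bucket=<bucket>
  radosgw-admin reshard list
  radosgw-admin reshard process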
 
> > (Side note that since you started the OSD read-write using the internal
> > copy of rocksdb, don't forget that the external copy you extracted
> > (/mnt/ceph/db?) is now stale!)
> 
> As suggested by Paul Emmerich (see next E-mail in this thread), I exported
> this PG. It took not that long (20 minutes).

Great :)

sage


[ceph-users] Any way to modify Bluestore label ?

2019-06-13 Thread Vincent Pharabot
Hello,

I would like to modify the Bluestore label of an OSD, is there a way to do
this?

I saw that we could display them with "ceph-bluestore-tool show-label" but I
did not find any way to modify them...

Is it possible?
I changed LVM tags but that doesn't help with bluestore labels..

# ceph-bluestore-tool show-label --dev
/dev/ceph-dd64f696-4908-4088-8bea-9ed5e15dd3ce/osd-block-3a5e0c27-0bc3-4fb2-90d3-c6d2cd4e2f2e
{
"/dev/ceph-dd64f696-4908-4088-8bea-9ed5e15dd3ce/osd-block-3a5e0c27-0bc3-4fb2-90d3-c6d2cd4e2f2e":
{
"osd_uuid": "3a5e0c27-0bc3-4fb2-90d3-c6d2cd4e2f2e",
"size": 1073737629696,
"btime": "2019-06-11 17:18:12.935690",
"description": "main",
"bluefs": "1",
"ceph_fsid": "cf7017d0-bb78-44d9-9d99-dfe2c210a4fa",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"osd_key": "xx",
"ready": "ready",
"whoami": "1"
}
}

Thank you !
Vincent


Re: [ceph-users] Any way to modify Bluestore label ?

2019-06-13 Thread Konstantin Shalygin

Hello,

I would like to modify Bluestore label of an OSD, is there a way to do this
?

I so that we could diplay them with  "ceph-bluestore-tool show-label" but i
did not find anyway to modify them...

Is it possible ?
I changed LVM tags but that don't help with bluestore labels..

# ceph-bluestore-tool show-label --dev
/dev/ceph-dd64f696-4908-4088-8bea-9ed5e15dd3ce/osd-block-3a5e0c27-0bc3-4fb2-90d3-c6d2cd4e2f2e
{
"/dev/ceph-dd64f696-4908-4088-8bea-9ed5e15dd3ce/osd-block-3a5e0c27-0bc3-4fb2-90d3-c6d2cd4e2f2e":
{
"osd_uuid": "3a5e0c27-0bc3-4fb2-90d3-c6d2cd4e2f2e",
"size": 1073737629696,
"btime": "2019-06-11 17:18:12.935690",
"description": "main",
"bluefs": "1",
"ceph_fsid": "cf7017d0-bb78-44d9-9d99-dfe2c210a4fa",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"osd_key": "xx",
"ready": "ready",
"whoami": "1"
}
}


This is possible like this:

ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-1/block 
--key <key> --value <123>
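and presumably verified afterwards with the same show-label command, e.g.:

ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-1/block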




k



Re: [ceph-users] Any way to modify Bluestore label ?

2019-06-13 Thread Vincent Pharabot
Wow, ok, thanks a lot, I missed that in the doc...

On Thu, Jun 13, 2019 at 16:49, Konstantin Shalygin wrote:

> Hello,
>
> I would like to modify Bluestore label of an OSD, is there a way to do this
> ?
>
> I so that we could diplay them with  "ceph-bluestore-tool show-label" but i
> did not find anyway to modify them...
>
> Is it possible ?
> I changed LVM tags but that don't help with bluestore labels..
>
> # ceph-bluestore-tool show-label --dev
> /dev/ceph-dd64f696-4908-4088-8bea-9ed5e15dd3ce/osd-block-3a5e0c27-0bc3-4fb2-90d3-c6d2cd4e2f2e
> {
> "/dev/ceph-dd64f696-4908-4088-8bea-9ed5e15dd3ce/osd-block-3a5e0c27-0bc3-4fb2-90d3-c6d2cd4e2f2e":
> {
> "osd_uuid": "3a5e0c27-0bc3-4fb2-90d3-c6d2cd4e2f2e",
> "size": 1073737629696,
> "btime": "2019-06-11 17:18:12.935690",
> "description": "main",
> "bluefs": "1",
> "ceph_fsid": "cf7017d0-bb78-44d9-9d99-dfe2c210a4fa",
> "kv_backend": "rocksdb",
> "magic": "ceph osd volume v026",
> "mkfs_done": "yes",
> "osd_key": "xx",
> "ready": "ready",
> "whoami": "1"
> }
> }
>
> This possible like this:
> ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-1/block
> --key  --value <123>
>
>
>
> k
>


Re: [ceph-users] Verifying current configuration values

2019-06-13 Thread Jorge Garcia

I'm using mimic, which I thought was supported. Here's the full version:

# ceph -v
ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic 
(stable)

# ceph daemon osd.0 config show | grep memory
    "debug_deliberately_leak_memory": "false",
    "mds_cache_memory_limit": "1073741824",
    "ms_dpdk_memory_channel": "4",
    "rocksdb_collect_memory_stats": "false",

The mimic installation was from the RPM packages at ceph.com
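A way to query a single option from the running daemon rather than grepping
the full dump (generic admin socket command; on a release where the option
exists it returns its current value):

  ceph daemon osd.0 config get osd_memory_target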

On 6/12/19 3:51 PM, Jorge Garcia wrote:
I'm following the bluestore config reference guide and trying to 
change the value for osd_memory_target. I added the following entry in 
the /etc/ceph/ceph.conf file:


  [osd]
  osd_memory_target = 2147483648

and restarted the osd daemons doing "systemctl restart 
ceph-osd.target". Now, how do I verify that the value has changed? I 
have tried "ceph daemon osd.0 config show" and it lists many settings, 
but osd_memory_target isn't one of them. What am I doing wrong?


Thanks!





Re: [ceph-users] Verifying current configuration values

2019-06-13 Thread Paul Emmerich
I think this option was added in 13.2.4 (or 13.2.5?)

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Jun 13, 2019 at 7:00 PM Jorge Garcia  wrote:

> I'm using mimic, which I thought was supported. Here's the full version:
>
> # ceph -v
> ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic
> (stable)
> # ceph daemon osd.0 config show | grep memory
>  "debug_deliberately_leak_memory": "false",
>  "mds_cache_memory_limit": "1073741824",
>  "ms_dpdk_memory_channel": "4",
>  "rocksdb_collect_memory_stats": "false",
>
> The mimic installation was from the RPM packages at ceph.com
>
> On 6/12/19 3:51 PM, Jorge Garcia wrote:
> > I'm following the bluestore config reference guide and trying to
> > change the value for osd_memory_target. I added the following entry in
> > the /etc/ceph/ceph.conf file:
> >
> >   [osd]
> >   osd_memory_target = 2147483648
> >
> > and restarted the osd daemons doing "systemctl restart
> > ceph-osd.target". Now, how do I verify that the value has changed? I
> > have tried "ceph daemon osd.0 config show" and it lists many settings,
> > but osd_memory_target isn't one of them. What am I doing wrong?
> >
> > Thanks!
> >
> >


[ceph-users] Ceph Day Netherlands Schedule Now Available!

2019-06-13 Thread Mike Perez
Hi everyone,

The Ceph Day Netherlands schedule is now available!

https://ceph.com/cephdays/netherlands-2019/

Registration is free and still open, so please come join us for some
great content and discussion with members of the community of all
levels!

https://www.eventbrite.com/e/ceph-day-netherlands-tickets-62122673589

--
Mike Perez (thingee)


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-13 Thread Harald Staub

Looks fine (at least so far), thank you all!

After having exported all 3 copies of the bad PG, we decided to try it 
in-place. We also set norebalance to make sure that no data is moved. 
When the PG was up, the resharding finished with a "success" message. 
The large omap warning is gone after deep-scrubbing the PG.


Then we set the 3 OSDs to out. Soon after, one after the other was down 
(maybe for 2 minutes) and we got degraded PGs, but only once.


Thank you!
 Harry

On 13.06.19 16:14, Sage Weil wrote:

On Thu, 13 Jun 2019, Harald Staub wrote:

On 13.06.19 15:52, Sage Weil wrote:

On Thu, 13 Jun 2019, Harald Staub wrote:

[...]

I think that increasing the various suicide timeout options will allow
it to stay up long enough to clean up the ginormous objects:

   ceph config set osd.NNN osd_op_thread_suicide_timeout 2h


ok


It looks healthy so far:
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
fsck success

Now we have to choose how to continue, trying to reduce the risk of losing
data (most bucket indexes are intact currently). My guess would be to let
this
OSD (which was not the primary) go in and hope that it recovers. In case
of a
problem, maybe we could still use the other OSDs "somehow"? In case of
success, we would bring back the other OSDs as well?

OTOH we could try to continue with the key dump from earlier today.


I would start all three osds the same way, with 'noout' set on the
cluster.  You should try to avoid triggering recovery because it will have
a hard time getting through the big index object on that bucket (i.e., it
will take a long time, and might trigger some blocked ios and so forth).


This I do not understand, how would I avoid recovery?


Well, simply doing 'ceph osd set noout' is sufficient to avoid
recovery, I suppose.  But in any case, getting at least 2 of the
existing copies/OSDs online (assuming your pool's min_size=2) will mean
you can finish the reshard process and clean up the big object without
copying the PG anywhere.

I think you may as well do all 3 OSDs this way, then clean up the big
object--that way in the end no data will have to move.

This is Nautilus, right?  If you scrub the PGs in question, that will also
now raise the health alert if there are any remaining big omap objects...
if that warning goes away you'll know you're done cleaning up.  A final
rocksdb compaction should then be enough to remove any remaining weirdness
from rocksdb's internal layout.

(Side note that since you started the OSD read-write using the internal
copy of rocksdb, don't forget that the external copy you extracted
(/mnt/ceph/db?) is now stale!)


As suggested by Paul Emmerich (see next E-mail in this thread), I exported
this PG. It took not that long (20 minutes).


Great :)

sage




[ceph-users] Octopus roadmap planning series is now available

2019-06-13 Thread Mike Perez
In case you missed these events on the community calendar, here are
the recordings:

https://www.youtube.com/playlist?list=PLrBUGiINAakPCrcdqjbBR_VlFa5buEW2J

--
Mike Perez (thingee)


Re: [ceph-users] Verifying current configuration values

2019-06-13 Thread Jorge Garcia
Thanks! That's the correct solution. I upgraded to 13.2.6 (latest mimic) 
and the option is now there...


On 6/13/19 10:22 AM, Paul Emmerich wrote:

I think this option was added in 13.2.4 (or 13.2.5?)

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io 
Tel: +49 89 1896585 90


On Thu, Jun 13, 2019 at 7:00 PM Jorge Garcia wrote:


I'm using mimic, which I thought was supported. Here's the full
version:

# ceph -v
ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic
(stable)
# ceph daemon osd.0 config show | grep memory
 "debug_deliberately_leak_memory": "false",
 "mds_cache_memory_limit": "1073741824",
 "ms_dpdk_memory_channel": "4",
 "rocksdb_collect_memory_stats": "false",

The mimic installation was from the RPM packages at ceph.com


On 6/12/19 3:51 PM, Jorge Garcia wrote:
> I'm following the bluestore config reference guide and trying to
> change the value for osd_memory_target. I added the following
entry in
> the /etc/ceph/ceph.conf file:
>
>   [osd]
>   osd_memory_target = 2147483648
>
> and restarted the osd daemons doing "systemctl restart
> ceph-osd.target". Now, how do I verify that the value has
changed? I
> have tried "ceph daemon osd.0 config show" and it lists many
settings,
> but osd_memory_target isn't one of them. What am I doing wrong?
>
> Thanks!
>
>


[ceph-users] mutable health warnings

2019-06-13 Thread Neha Ojha
Hi everyone,

There has been some interest in a feature that helps users to mute
health warnings. There is a trello card[1] associated with it and
we've had some discussion[2] in the past in a CDM about it. In
general, we want to understand a few things:

1. what is the level of interest in this feature
2. for how long should we mute these warnings - should the period be
decided by us or the user
3. possible misuse of this feature and negative impacts of muting some warnings

Let us know what you think.

[1] https://trello.com/c/vINMkfTf/358-mute-health-warnings
[2] https://pad.ceph.com/p/cephalocon-usability-brainstorming

Thanks,
Neha


[ceph-users] strange osd beacon

2019-06-13 Thread Rafał Wądołowski
Hi,

Is it normal that an osd beacon can have no pgs, like below? This
drive contains data, but I cannot get it to run.

Ceph v.12.2.4


 {
    "description": "osd_beacon(pgs [] lec 857158 v869771)",
    "initiated_at": "2019-06-14 06:39:37.972795",
    "age": 189.310037,
    "duration": 189.453167,
    "type_data": {
    "events": [
    {
    "time": "2019-06-14 06:39:37.972795",
    "event": "initiated"
    },
    {
    "time": "2019-06-14 06:39:37.972954",
    "event": "mon:_ms_dispatch"
    },
    {
    "time": "2019-06-14 06:39:37.972956",
    "event": "mon:dispatch_op"
    },
    {
    "time": "2019-06-14 06:39:37.972956",
    "event": "psvc:dispatch"
    },
    {
    "time": "2019-06-14 06:39:37.972976",
    "event": "osdmap:preprocess_query"
    },
    {
    "time": "2019-06-14 06:39:37.972978",
    "event": "osdmap:preprocess_beacon"
    },
    {
    "time": "2019-06-14 06:39:37.972982",
    "event": "forward_request_leader"
    },
    {
    "time": "2019-06-14 06:39:37.973064",
    "event": "forwarded"
    }
    ],
    "info": {
    "seq": 22378,
    "src_is_mon": false,
    "source": "osd.1092 10.11.2.33:6842/159188",
    "forwarded_to_leader": true
    }
    }
    }


Best Regards,

Rafał Wądołowski
