It didn't work; I've emailed the logs to you.
> On 2.10.2018, at 14:43, Igor Fedotov <ifedo...@suse.de> wrote:
>
> The major change is in the get_bluefs_rebalance_txn function, which lacked the bluefs_rebalance_txn assignment.
>
> On 10/2/2018 2:40 PM, Sergey Malinin wrote:
>> The PR doesn't seem to have changed since yesterday. Am I missing something?
>>
>>> On 2.10.2018, at 14:15, Igor Fedotov <ifedo...@suse.de> wrote:
>>>
>>> Please update the patch from the PR - previously it didn't update the bluefs extents list.
>>>
>>> Also, please set debug bluestore to 20 when re-running repair and collect the log.
>>>
>>> If repair doesn't help, would you send the repair and startup logs directly to me, as I have some issues accessing ceph-post-file uploads.
>>>
>>> Thanks,
>>> Igor
>>>
>>> On 10/2/2018 11:39 AM, Sergey Malinin wrote:
>>>> Yes, I repaired all OSDs and it finished with 'repair success'. I backed up the OSDs, so now I have more room to play.
>>>> I posted log files using ceph-post-file with the following IDs:
>>>> 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
>>>> 20df7df5-f0c9-4186-aa21-4e5c0172cd93
>>>>
>>>>> On 2.10.2018, at 11:26, Igor Fedotov <ifedo...@suse.de> wrote:
>>>>>
>>>>> You did run repair on these OSDs, didn't you? On all of them?
>>>>>
>>>>> Would you please provide logs for both types of failing OSDs (failed on mount and failed with enospc). Prior to collecting, please remove the existing logs and set debug bluestore to 20.
>>>>>
>>>>> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
>>>>>> I was able to apply the patches to mimic, but nothing changed. The OSD whose space I expanded fails with a bluefs mount IO error; the others keep failing with enospc.
>>>>>>
>>>>>>> On 1.10.2018, at 19:26, Igor Fedotov <ifedo...@suse.de> wrote:
>>>>>>>
>>>>>>> So you should call repair, which rebalances (i.e. allocates additional) BlueFS space, hence allowing the OSD to start.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Igor
>>>>>>>
>>>>>>> On 10/1/2018 7:22 PM, Igor Fedotov wrote:
>>>>>>>> Not exactly. The rebalancing from this kv_sync_thread might still be deferred due to the nature of this thread (not 100% sure though).
>>>>>>>>
>>>>>>>> Here is my PR showing the idea (still untested and perhaps unfinished!!!)
>>>>>>>> >>>>>>>> https://github.com/ceph/ceph/pull/24353 >>>>>>>> >>>>>>>> >>>>>>>> Igor >>>>>>>> >>>>>>>> >>>>>>>> On 10/1/2018 7:07 PM, Sergey Malinin wrote: >>>>>>>>> Can you please confirm whether I got this right: >>>>>>>>> >>>>>>>>> --- BlueStore.cc.bak 2018-10-01 18:54:45.096836419 +0300 >>>>>>>>> +++ BlueStore.cc 2018-10-01 19:01:35.937623861 +0300 >>>>>>>>> @@ -9049,22 +9049,17 @@ >>>>>>>>> throttle_bytes.put(costs); >>>>>>>>> PExtentVector bluefs_gift_extents; >>>>>>>>> - if (bluefs && >>>>>>>>> - after_flush - bluefs_last_balance > >>>>>>>>> - cct->_conf->bluestore_bluefs_balance_interval) { >>>>>>>>> - bluefs_last_balance = after_flush; >>>>>>>>> - int r = _balance_bluefs_freespace(&bluefs_gift_extents); >>>>>>>>> - assert(r >= 0); >>>>>>>>> - if (r > 0) { >>>>>>>>> - for (auto& p : bluefs_gift_extents) { >>>>>>>>> - bluefs_extents.insert(p.offset, p.length); >>>>>>>>> - } >>>>>>>>> - bufferlist bl; >>>>>>>>> - encode(bluefs_extents, bl); >>>>>>>>> - dout(10) << __func__ << " bluefs_extents now 0x" << std::hex >>>>>>>>> - << bluefs_extents << std::dec << dendl; >>>>>>>>> - synct->set(PREFIX_SUPER, "bluefs_extents", bl); >>>>>>>>> + int r = _balance_bluefs_freespace(&bluefs_gift_extents); >>>>>>>>> + ceph_assert(r >= 0); >>>>>>>>> + if (r > 0) { >>>>>>>>> + for (auto& p : bluefs_gift_extents) { >>>>>>>>> + bluefs_extents.insert(p.offset, p.length); >>>>>>>>> } >>>>>>>>> + bufferlist bl; >>>>>>>>> + encode(bluefs_extents, bl); >>>>>>>>> + dout(10) << __func__ << " bluefs_extents now 0x" << std::hex >>>>>>>>> + << bluefs_extents << std::dec << dendl; >>>>>>>>> + synct->set(PREFIX_SUPER, "bluefs_extents", bl); >>>>>>>>> } >>>>>>>>> // cleanup sync deferred keys >>>>>>>>> >>>>>>>>>> On 1.10.2018, at 18:39, Igor Fedotov <ifedo...@suse.de> wrote: >>>>>>>>>> >>>>>>>>>> So you have just a single main device per OSD.... >>>>>>>>>> >>>>>>>>>> Then bluestore-tool wouldn't help, it's unable to expand BlueFS >>>>>>>>>> partition at main device, standalone devices are supported only. >>>>>>>>>> >>>>>>>>>> Given that you're able to rebuild the code I can suggest to make a >>>>>>>>>> patch that triggers BlueFS rebalance (see code snippet below) on >>>>>>>>>> repairing. >>>>>>>>>> PExtentVector bluefs_gift_extents; >>>>>>>>>> int r = _balance_bluefs_freespace(&bluefs_gift_extents); >>>>>>>>>> ceph_assert(r >= 0); >>>>>>>>>> if (r > 0) { >>>>>>>>>> for (auto& p : bluefs_gift_extents) { >>>>>>>>>> bluefs_extents.insert(p.offset, p.length); >>>>>>>>>> } >>>>>>>>>> bufferlist bl; >>>>>>>>>> encode(bluefs_extents, bl); >>>>>>>>>> dout(10) << __func__ << " bluefs_extents now 0x" << std::hex >>>>>>>>>> << bluefs_extents << std::dec << dendl; >>>>>>>>>> synct->set(PREFIX_SUPER, "bluefs_extents", bl); >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> If it waits I can probably make a corresponding PR tomorrow. >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Igor >>>>>>>>>> On 10/1/2018 6:16 PM, Sergey Malinin wrote: >>>>>>>>>>> I have rebuilt the tool, but none of my OSDs no matter dead or >>>>>>>>>>> alive have any symlinks other than 'block' pointing to LVM. >>>>>>>>>>> I adjusted main device size but it looks like it needs even more >>>>>>>>>>> space for db compaction. After executing bluefs-bdev-expand OSD >>>>>>>>>>> fails to start, however 'fsck' and 'repair' commands finished >>>>>>>>>>> successfully. 
>>>>>>>>>>> >>>>>>>>>>> 2018-10-01 18:02:39.755 7fc9226c6240 1 freelist init >>>>>>>>>>> 2018-10-01 18:02:39.763 7fc9226c6240 1 >>>>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc opening allocation >>>>>>>>>>> metadata >>>>>>>>>>> 2018-10-01 18:02:40.907 7fc9226c6240 1 >>>>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc loaded 285 GiB in >>>>>>>>>>> 2249899 extents >>>>>>>>>>> 2018-10-01 18:02:40.951 7fc9226c6240 -1 >>>>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace >>>>>>>>>>> bluefs extra 0x[6d6f000000~50c800000] >>>>>>>>>>> 2018-10-01 18:02:40.951 7fc9226c6240 1 stupidalloc >>>>>>>>>>> 0x0x55d053fb9180 shutdown >>>>>>>>>>> 2018-10-01 18:02:40.963 7fc9226c6240 1 freelist shutdown >>>>>>>>>>> 2018-10-01 18:02:40.963 7fc9226c6240 4 rocksdb: >>>>>>>>>>> [/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:252] Shutdown: >>>>>>>>>>> canceling all background work >>>>>>>>>>> 2018-10-01 18:02:40.967 7fc9226c6240 4 rocksdb: >>>>>>>>>>> [/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:397] Shutdown complete >>>>>>>>>>> 2018-10-01 18:02:40.971 7fc9226c6240 1 bluefs umount >>>>>>>>>>> 2018-10-01 18:02:40.975 7fc9226c6240 1 stupidalloc >>>>>>>>>>> 0x0x55d053883800 shutdown >>>>>>>>>>> 2018-10-01 18:02:40.975 7fc9226c6240 1 bdev(0x55d053c32e00 >>>>>>>>>>> /var/lib/ceph/osd/ceph-1/block) close >>>>>>>>>>> 2018-10-01 18:02:41.267 7fc9226c6240 1 bdev(0x55d053c32a80 >>>>>>>>>>> /var/lib/ceph/osd/ceph-1/block) close >>>>>>>>>>> 2018-10-01 18:02:41.443 7fc9226c6240 -1 osd.1 0 OSD:init: unable to >>>>>>>>>>> mount object store >>>>>>>>>>> 2018-10-01 18:02:41.443 7fc9226c6240 -1 ** ERROR: osd init failed: >>>>>>>>>>> (5) Input/output error >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> On 1.10.2018, at 18:09, Igor Fedotov <ifedo...@suse.de> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Well, actually you can avoid bluestore-tool rebuild. >>>>>>>>>>>> >>>>>>>>>>>> You'll need to edit the first chunk of blocks.db where labels are >>>>>>>>>>>> stored. (Please make a backup first!!!) >>>>>>>>>>>> >>>>>>>>>>>> Size label is stored at offset 0x52 and is 8 bytes long - >>>>>>>>>>>> little-endian 64bit integer encoding. (Please verify that old >>>>>>>>>>>> value at this offset exactly corresponds to you original volume >>>>>>>>>>>> size and/or 'size' label reported by ceph-bluestore-tool). >>>>>>>>>>>> >>>>>>>>>>>> So you have to put new DB volume size there. Or you can send the >>>>>>>>>>>> first 4K chunk (e.g. extracted with dd) along with new DB volume >>>>>>>>>>>> size (in bytes) to me and I'll do that for you. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> >>>>>>>>>>>> Igor >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On 10/1/2018 5:32 PM, Igor Fedotov wrote: >>>>>>>>>>>>> On 10/1/2018 5:03 PM, Sergey Malinin wrote: >>>>>>>>>>>>>> Before I received your response, I had already added 20GB to the >>>>>>>>>>>>>> OSD (by epanding LV followed by bluefs-bdev-expand) and ran >>>>>>>>>>>>>> "ceph-kvstore-tool bluestore-kv <path> compact", however it >>>>>>>>>>>>>> still needs more space. >>>>>>>>>>>>>> Is that because I didn't update DB size with set-label-key? >>>>>>>>>>>>> In mimic you need to run both "bluefs-bdev-expand" and >>>>>>>>>>>>> "set-label-key" command to commit bluefs volume expansion. >>>>>>>>>>>>> Unfortunately the last command doesn't handle "size" label >>>>>>>>>>>>> properly. That's why you might need to backport and rebuild with >>>>>>>>>>>>> the mentioned commits. 
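For reference, a minimal, untested sketch of the manual label edit described above, assuming the first 4 KiB of the DB device has first been copied out with dd and a backup kept. The offset (0x52) and the 8-byte little-endian encoding are taken from Igor's note; the file name and NEW_SIZE below are placeholders only.

import struct

LABEL_COPY = "label-4k.bin"   # placeholder: first 4 KiB copied out with dd beforehand
NEW_SIZE = 0                  # placeholder: new DB volume size in bytes
OFFSET = 0x52                 # per the note above: 8-byte little-endian size label

with open(LABEL_COPY, "r+b") as f:
    f.seek(OFFSET)
    (old_size,) = struct.unpack("<Q", f.read(8))
    # this must match the old volume size / the 'size' label shown by ceph-bluestore-tool
    print("current size label: %d bytes" % old_size)
    if NEW_SIZE:
        f.seek(OFFSET)
        f.write(struct.pack("<Q", NEW_SIZE))

Once patched and double-checked, the 4 KiB copy could presumably be written back over the start of the device with dd and conv=notrunc.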
>>>>>>>>>>>>> >>>>>>>>>>>>>> What exactly is the label-key that needs to be updated, as I >>>>>>>>>>>>>> couldn't find which one is related to DB: >>>>>>>>>>>>>> >>>>>>>>>>>>>> # ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1 >>>>>>>>>>>>>> inferring bluefs devices from bluestore path >>>>>>>>>>>>>> { >>>>>>>>>>>>>> "/var/lib/ceph/osd/ceph-1/block": { >>>>>>>>>>>>>> "osd_uuid": "f8f122ee-70a6-4c54-8eb0-9b42205b1ecc", >>>>>>>>>>>>>> "size": 471305551872, >>>>>>>>>>>>>> "btime": "2018-07-31 03:06:43.751243", >>>>>>>>>>>>>> "description": "main", >>>>>>>>>>>>>> "bluefs": "1", >>>>>>>>>>>>>> "ceph_fsid": "7d320499-5b3f-453e-831f-60d4db9a4533", >>>>>>>>>>>>>> "kv_backend": "rocksdb", >>>>>>>>>>>>>> "magic": "ceph osd volume v026", >>>>>>>>>>>>>> "mkfs_done": "yes", >>>>>>>>>>>>>> "osd_key": "XXX", >>>>>>>>>>>>>> "ready": "ready", >>>>>>>>>>>>>> "whoami": "1" >>>>>>>>>>>>>> } >>>>>>>>>>>>>> } >>>>>>>>>>>>> 'size' label but your output is for block(aka slow) device. >>>>>>>>>>>>> >>>>>>>>>>>>> It should return labels for db/wal devices as well (block.db and >>>>>>>>>>>>> block.wal symlinks respectively). It works for me in master, >>>>>>>>>>>>> can't verify with mimic at the moment though.. >>>>>>>>>>>>> Here is output for master: >>>>>>>>>>>>> >>>>>>>>>>>>> # bin/ceph-bluestore-tool show-label --path dev/osd0 >>>>>>>>>>>>> inferring bluefs devices from bluestore path >>>>>>>>>>>>> { >>>>>>>>>>>>> "dev/osd0/block": { >>>>>>>>>>>>> "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75", >>>>>>>>>>>>> "size": 21474836480, >>>>>>>>>>>>> "btime": "2018-09-10 15:55:09.044039", >>>>>>>>>>>>> "description": "main", >>>>>>>>>>>>> "bluefs": "1", >>>>>>>>>>>>> "ceph_fsid": "56eddc15-11b9-4e0b-9192-e391fbae551c", >>>>>>>>>>>>> "kv_backend": "rocksdb", >>>>>>>>>>>>> "magic": "ceph osd volume v026", >>>>>>>>>>>>> "mkfs_done": "yes", >>>>>>>>>>>>> "osd_key": "AQCsaZZbYTxXJBAAe3jJI4p6WbMjvA8CBBUJbA==", >>>>>>>>>>>>> "ready": "ready", >>>>>>>>>>>>> "whoami": "0" >>>>>>>>>>>>> }, >>>>>>>>>>>>> "dev/osd0/block.wal": { >>>>>>>>>>>>> "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75", >>>>>>>>>>>>> "size": 1048576000, >>>>>>>>>>>>> "btime": "2018-09-10 15:55:09.044985", >>>>>>>>>>>>> "description": "bluefs wal" >>>>>>>>>>>>> }, >>>>>>>>>>>>> "dev/osd0/block.db": { >>>>>>>>>>>>> "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75", >>>>>>>>>>>>> "size": 1048576000, >>>>>>>>>>>>> "btime": "2018-09-10 15:55:09.044469", >>>>>>>>>>>>> "description": "bluefs db" >>>>>>>>>>>>> } >>>>>>>>>>>>> } >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> You can try --dev option instead of --path, e.g. >>>>>>>>>>>>> ceph-bluestore-tool show-label --dev <path-to-block.db> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>> On 1.10.2018, at 16:48, Igor Fedotov <ifedo...@suse.de> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This looks like a sort of deadlock when BlueFS needs some >>>>>>>>>>>>>>> additional space to replay the log left after the crash. Which >>>>>>>>>>>>>>> happens during BlueFS open. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> But such a space (at slow device as DB is full) is gifted in >>>>>>>>>>>>>>> background during bluefs rebalance procedure which will occur >>>>>>>>>>>>>>> after the open. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hence OSDs stuck in permanent crashing.. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The only way to recover I can suggest for now is to expand DB >>>>>>>>>>>>>>> volumes. You can do that with lvm tools if you have any spare >>>>>>>>>>>>>>> space for that. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Once resized you'll need ceph-bluestore-tool to indicate volume >>>>>>>>>>>>>>> expansion to BlueFS (bluefs-bdev-expand command ) and finally >>>>>>>>>>>>>>> update DB volume size label with set-label-key command. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The latter is a bit tricky for mimic - you might need to >>>>>>>>>>>>>>> backport >>>>>>>>>>>>>>> https://github.com/ceph/ceph/pull/22085/commits/ffac450da5d6e09cf14b8363b35f21819b48f38b >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> and rebuild ceph-bluestore-tool. Alternatively you can backport >>>>>>>>>>>>>>> https://github.com/ceph/ceph/pull/22085/commits/71c3b58da4e7ced3422bce2b1da0e3fa9331530b >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> then bluefs expansion and label updates will occur in a single >>>>>>>>>>>>>>> step. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I'll do these backports in upstream but this will take some >>>>>>>>>>>>>>> time to pass all the procedures and get into official mimic >>>>>>>>>>>>>>> release. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Will fire a ticket to fix the original issue as well. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Igor >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On 10/1/2018 3:28 PM, Sergey Malinin wrote: >>>>>>>>>>>>>>>> These are LVM bluestore NVMe SSDs created with "ceph-volume >>>>>>>>>>>>>>>> --lvm prepare --bluestore /dev/nvme0n1p3" i.e. without >>>>>>>>>>>>>>>> specifying wal/db devices. >>>>>>>>>>>>>>>> OSDs were created with bluestore_min_alloc_size_ssd=4096, >>>>>>>>>>>>>>>> another modified setting is bluestore_cache_kv_max=1073741824 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> DB/block usage collected by prometheus module for 3 failed and >>>>>>>>>>>>>>>> 1 survived OSDs: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.0"} 65493008384.0 >>>>>>>>>>>>>>>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.1"} 49013587968.0 >>>>>>>>>>>>>>>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.2"} 76834406400.0 >>>>>>>>>>>>>>>> --> this one has survived >>>>>>>>>>>>>>>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.3"} 63726157824.0 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.0"} 65217232896.0 >>>>>>>>>>>>>>>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.1"} 48944381952.0 >>>>>>>>>>>>>>>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.2"} 68093476864.0 >>>>>>>>>>>>>>>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.3"} 63632834560.0 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> ceph_osd_stat_bytes{ceph_daemon="osd.0"} 471305551872.0 >>>>>>>>>>>>>>>> ceph_osd_stat_bytes{ceph_daemon="osd.1"} 471305551872.0 >>>>>>>>>>>>>>>> ceph_osd_stat_bytes{ceph_daemon="osd.2"} 471305551872.0 >>>>>>>>>>>>>>>> ceph_osd_stat_bytes{ceph_daemon="osd.3"} 471305551872.0 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> ceph_osd_stat_bytes_used{ceph_daemon="osd.0"} 222328213504.0 >>>>>>>>>>>>>>>> ceph_osd_stat_bytes_used{ceph_daemon="osd.1"} 214472544256.0 >>>>>>>>>>>>>>>> ceph_osd_stat_bytes_used{ceph_daemon="osd.2"} 163603996672.0 >>>>>>>>>>>>>>>> ceph_osd_stat_bytes_used{ceph_daemon="osd.3"} 212806815744.0 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> First crashed OSD was doing DB compaction, others crashed >>>>>>>>>>>>>>>> shortly after during backfilling. Workload was "ceph-data-scan >>>>>>>>>>>>>>>> scan_inodes" filling metadata pool located on these OSDs at >>>>>>>>>>>>>>>> the rate close to 10k objects/second. 
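For quick reference, here is what those prometheus figures work out to (a small sketch; the byte values are copied verbatim from the output above):

# BlueFS DB utilisation vs. overall OSD utilisation, from the stats above.
db = {  # daemon: (ceph_bluefs_db_total_bytes, ceph_bluefs_db_used_bytes)
    "osd.0": (65493008384, 65217232896),
    "osd.1": (49013587968, 48944381952),
    "osd.2": (76834406400, 68093476864),  # the OSD that survived
    "osd.3": (63726157824, 63632834560),
}
osd_used = {"osd.0": 222328213504, "osd.1": 214472544256,
            "osd.2": 163603996672, "osd.3": 212806815744}
osd_total = 471305551872  # same for all four OSDs

for name in sorted(db):
    total, used = db[name]
    print("%s  bluefs db %5.1f%% full   osd %5.1f%% full"
          % (name, 100.0 * used / total, 100.0 * osd_used[name] / osd_total))
# -> the three crashed OSDs sit at ~99.6-99.9% BlueFS DB usage while the
#    survivor is at ~88.6%, and none of them is above ~48% used overall.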
>>>>>>>>>>>>>>>> Here is the log excerpt of the first crash occurrence: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 2018-10-01 03:27:12.762 7fbf16dd6700 0 >>>>>>>>>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-1) _balance_bluefs_freespace >>>>>>>>>>>>>>>> no allocate on 0x80000000 min_alloc_size 0x1000 >>>>>>>>>>>>>>>> 2018-10-01 03:27:12.886 7fbf1e5e5700 4 rocksdb: >>>>>>>>>>>>>>>> [/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] >>>>>>>>>>>>>>>> [default] [JOB 24] Generated table #89741: 106356 keys, >>>>>>>>>>>>>>>> 68110589 bytes >>>>>>>>>>>>>>>> 2018-10-01 03:27:12.886 7fbf1e5e5700 4 rocksdb: EVENT_LOG_v1 >>>>>>>>>>>>>>>> {"time_micros": 1538353632892744, "cf_name": "default", "job": >>>>>>>>>>>>>>>> 24, "event": "table_file_creation", "file_number": 89741, >>>>>>>>>>>>>>>> "file_size": 68110589, "table_properties": {"data_size": >>>>>>>>>>>>>>>> 67112903, "index_size": 579319, "filter_size": 417316, >>>>>>>>>>>>>>>> "raw_key_size": 6733561, "raw_average_key_size": 63, >>>>>>>>>>>>>>>> "raw_value_size": 60994583, "raw_average_value_size": 573, >>>>>>>>>>>>>>>> "num_data_blocks": 16336, "num_entries": 106356, >>>>>>>>>>>>>>>> "filter_policy_name": "rocksdb.BuiltinBloomFilter", >>>>>>>>>>>>>>>> "kDeletedKeys": "14444", "kMergeOperands": "0"}} >>>>>>>>>>>>>>>> 2018-10-01 03:27:12.934 7fbf1e5e5700 4 rocksdb: >>>>>>>>>>>>>>>> [/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] >>>>>>>>>>>>>>>> [default] [JOB 24] Generated table #89742: 23214 keys, >>>>>>>>>>>>>>>> 16352315 bytes >>>>>>>>>>>>>>>> 2018-10-01 03:27:12.934 7fbf1e5e5700 4 rocksdb: EVENT_LOG_v1 >>>>>>>>>>>>>>>> {"time_micros": 1538353632938670, "cf_name": "default", "job": >>>>>>>>>>>>>>>> 24, "event": "table_file_creation", "file_number": 89742, >>>>>>>>>>>>>>>> "file_size": 16352315, "table_properties": {"data_size": >>>>>>>>>>>>>>>> 16116986, "index_size": 139894, "filter_size": 94386, >>>>>>>>>>>>>>>> "raw_key_size": 1470883, "raw_average_key_size": 63, >>>>>>>>>>>>>>>> "raw_value_size": 14775006, "raw_average_value_size": 636, >>>>>>>>>>>>>>>> "num_data_blocks": 3928, "num_entries": 23214, >>>>>>>>>>>>>>>> "filter_policy_name": "rocksdb.BuiltinBloomFilter", >>>>>>>>>>>>>>>> "kDeletedKeys": "90", "kMergeOperands": "0"}} >>>>>>>>>>>>>>>> 2018-10-01 03:27:13.042 7fbf1e5e5700 1 bluefs _allocate >>>>>>>>>>>>>>>> failed to allocate 0x4100000 on bdev 1, free 0x1a00000; >>>>>>>>>>>>>>>> fallback to bdev 2 >>>>>>>>>>>>>>>> 2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _allocate >>>>>>>>>>>>>>>> failed to allocate 0x4100000 on bdev 2, dne >>>>>>>>>>>>>>>> 2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _flush_range >>>>>>>>>>>>>>>> allocated: 0x0 offset: 0x0 length: 0x40ea9f1 >>>>>>>>>>>>>>>> 2018-10-01 03:27:13.046 7fbf1e5e5700 -1 >>>>>>>>>>>>>>>> /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function >>>>>>>>>>>>>>>> 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, >>>>>>>>>>>>>>>> uint64_t)' thread 7fbf1e5e5700 time 2018-10-01 03:27:13.048298 >>>>>>>>>>>>>>>> /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED >>>>>>>>>>>>>>>> assert(0 == "bluefs enospc") >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> ceph version 13.2.2 >>>>>>>>>>>>>>>> (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable) >>>>>>>>>>>>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, >>>>>>>>>>>>>>>> char const*)+0x102) [0x7fbf2d4fe5c2] >>>>>>>>>>>>>>>> 2: (()+0x26c787) [0x7fbf2d4fe787] >>>>>>>>>>>>>>>> 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned >>>>>>>>>>>>>>>> long, unsigned long)+0x1ab4) [0x5619325114b4] >>>>>>>>>>>>>>>> 
4: (BlueRocksWritableFile::Flush()+0x3d) [0x561932527c1d] >>>>>>>>>>>>>>>> 5: (rocksdb::WritableFileWriter::Flush()+0x1b9) >>>>>>>>>>>>>>>> [0x56193271c399] >>>>>>>>>>>>>>>> 6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) >>>>>>>>>>>>>>>> [0x56193271d42b] >>>>>>>>>>>>>>>> 7: >>>>>>>>>>>>>>>> (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status >>>>>>>>>>>>>>>> const&, rocksdb::CompactionJob::SubcompactionState*, >>>>>>>>>>>>>>>> rocksdb::RangeDelAggregator*, CompactionIterationStats*, >>>>>>>>>>>>>>>> rocksdb::Slice const*)+0x3db) [0x56193276098b] >>>>>>>>>>>>>>>> 8: >>>>>>>>>>>>>>>> (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d9) >>>>>>>>>>>>>>>> [0x561932763da9] >>>>>>>>>>>>>>>> 9: (rocksdb::CompactionJob::Run()+0x314) [0x561932765504] >>>>>>>>>>>>>>>> 10: (rocksdb::DBImpl::BackgroundCompaction(bool*, >>>>>>>>>>>>>>>> rocksdb::JobContext*, rocksdb::LogBuffer*, >>>>>>>>>>>>>>>> rocksdb::DBImpl::PrepickedCompaction*)+0xc54) [0x5619325b5c44] >>>>>>>>>>>>>>>> 11: >>>>>>>>>>>>>>>> (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, >>>>>>>>>>>>>>>> rocksdb::Env::Priority)+0x397) [0x5619325b8557] >>>>>>>>>>>>>>>> 12: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) >>>>>>>>>>>>>>>> [0x5619325b8cd7] >>>>>>>>>>>>>>>> 13: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned >>>>>>>>>>>>>>>> long)+0x266) [0x5619327a5e36] >>>>>>>>>>>>>>>> 14: >>>>>>>>>>>>>>>> (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x47) >>>>>>>>>>>>>>>> [0x5619327a5fb7] >>>>>>>>>>>>>>>> 15: (()+0xbe733) [0x7fbf2b500733] >>>>>>>>>>>>>>>> 16: (()+0x76db) [0x7fbf2bbf86db] >>>>>>>>>>>>>>>> 17: (clone()+0x3f) [0x7fbf2abbc88f] >>>>>>>>>>>>>>>> NOTE: a copy of the executable, or `objdump -rdS >>>>>>>>>>>>>>>> <executable>` is needed to interpret this. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On 1.10.2018, at 15:01, Igor Fedotov <ifedo...@suse.de> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Sergey, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> could you please provide more details on your OSDs ? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> What are sizes for DB/block devices? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Do you have any modifications in BlueStore config settings? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Can you share stats you're referring to? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Igor >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On 10/1/2018 12:29 PM, Sergey Malinin wrote: >>>>>>>>>>>>>>>>>> Hello, >>>>>>>>>>>>>>>>>> 3 of 4 NVME OSDs crashed at the same time on assert(0 == >>>>>>>>>>>>>>>>>> "bluefs enospc") and no longer start. >>>>>>>>>>>>>>>>>> Stats collected just before crash show that >>>>>>>>>>>>>>>>>> ceph_bluefs_db_used_bytes is 100% used. Although OSDs have >>>>>>>>>>>>>>>>>> over 50% of free space, it is not reallocated for DB usage. 
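For readability, the hex sizes in the "bluefs _allocate failed" lines (the log excerpt just below, plus the compaction crash quoted further up) convert as follows. If I read the bdev numbering right, bdev 1 here is the BlueFS allocation on the single main device and bdev 2 would be a separate slow device, which these OSDs don't have (hence "dne") - a rough interpretation only:

# Converting the hex figures from the allocation-failure log lines to bytes/MiB.
for what, val in [
    ("requested at startup", 0x100000),
    ("free on bdev 1 at startup", 0x0),
    ("pending flush length at startup", 0xa8700),
    ("requested during compaction", 0x4100000),
    ("free on bdev 1 during compaction", 0x1a00000),
]:
    print("%-35s %11d bytes (%6.2f MiB)" % (what, val, val / 2.0**20))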
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 2018-10-01 12:18:06.744 7f1d6a04d240 1 bluefs _allocate >>>>>>>>>>>>>>>>>> failed to allocate 0x100000 on bdev 1, free 0x0; fallback to >>>>>>>>>>>>>>>>>> bdev 2 >>>>>>>>>>>>>>>>>> 2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _allocate >>>>>>>>>>>>>>>>>> failed to allocate 0x100000 on bdev 2, dne >>>>>>>>>>>>>>>>>> 2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _flush_range >>>>>>>>>>>>>>>>>> allocated: 0x0 offset: 0x0 length: 0xa8700 >>>>>>>>>>>>>>>>>> 2018-10-01 12:18:06.748 7f1d6a04d240 -1 >>>>>>>>>>>>>>>>>> /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function >>>>>>>>>>>>>>>>>> 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, >>>>>>>>>>>>>>>>>> uint64_t)' thread 7f1d6a04d240 time 2018-10-01 >>>>>>>>>>>>>>>>>> 12:18:06.746800 >>>>>>>>>>>>>>>>>> /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED >>>>>>>>>>>>>>>>>> assert(0 == "bluefs enospc") >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> ceph version 13.2.2 >>>>>>>>>>>>>>>>>> (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable) >>>>>>>>>>>>>>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, >>>>>>>>>>>>>>>>>> int, char const*)+0x102) [0x7f1d6146f5c2] >>>>>>>>>>>>>>>>>> 2: (()+0x26c787) [0x7f1d6146f787] >>>>>>>>>>>>>>>>>> 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned >>>>>>>>>>>>>>>>>> long, unsigned long)+0x1ab4) [0x5586b22684b4] >>>>>>>>>>>>>>>>>> 4: (BlueRocksWritableFile::Flush()+0x3d) [0x5586b227ec1d] >>>>>>>>>>>>>>>>>> 5: (rocksdb::WritableFileWriter::Flush()+0x1b9) >>>>>>>>>>>>>>>>>> [0x5586b2473399] >>>>>>>>>>>>>>>>>> 6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) >>>>>>>>>>>>>>>>>> [0x5586b247442b] >>>>>>>>>>>>>>>>>> 7: (rocksdb::BuildTable(std::__cxx11::basic_string<char, >>>>>>>>>>>>>>>>>> std::char_traits<char>, std::allocator<char> > const&, >>>>>>>>>>>>>>>>>> rocksdb::Env*, rocksdb::ImmutableCFOptions const&, >>>>>>>>>>>>>>>>>> rocksdb::MutableCFOptions const&, rocksdb::EnvOptions >>>>>>>>>>>>>>>>>> const&, rock >>>>>>>>>>>>>>>>>> sdb::TableCache*, rocksdb::InternalIterator*, >>>>>>>>>>>>>>>>>> std::unique_ptr<rocksdb::InternalIterator, >>>>>>>>>>>>>>>>>> std::default_delete<rocksdb::InternalIterator> >, >>>>>>>>>>>>>>>>>> rocksdb::FileMetaData*, rocksdb::InternalKeyComparator >>>>>>>>>>>>>>>>>> const&, std::vector<std::unique_ptr< >>>>>>>>>>>>>>>>>> rocksdb::IntTblPropCollectorFactory, >>>>>>>>>>>>>>>>>> std::default_delete<rocksdb::IntTblPropCollectorFactory> >, >>>>>>>>>>>>>>>>>> std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, >>>>>>>>>>>>>>>>>> std::default_delete<rocksdb::IntTblPropCollectorFactory> > >>>>>>>>>>>>>>>>>> > > co >>>>>>>>>>>>>>>>>> nst*, unsigned int, std::__cxx11::basic_string<char, >>>>>>>>>>>>>>>>>> std::char_traits<char>, std::allocator<char> > const&, >>>>>>>>>>>>>>>>>> std::vector<unsigned long, std::allocator<unsigned long> >, >>>>>>>>>>>>>>>>>> unsigned long, rocksdb::SnapshotChecker*, >>>>>>>>>>>>>>>>>> rocksdb::Compression >>>>>>>>>>>>>>>>>> Type, rocksdb::CompressionOptions const&, bool, >>>>>>>>>>>>>>>>>> rocksdb::InternalStats*, rocksdb::TableFileCreationReason, >>>>>>>>>>>>>>>>>> rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, >>>>>>>>>>>>>>>>>> rocksdb::TableProperties*, int, unsigned long, unsigned >>>>>>>>>>>>>>>>>> long, rocksdb >>>>>>>>>>>>>>>>>> ::Env::WriteLifeTimeHint)+0x1e24) [0x5586b249ef94] >>>>>>>>>>>>>>>>>> 8: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, >>>>>>>>>>>>>>>>>> rocksdb::ColumnFamilyData*, rocksdb::MemTable*, >>>>>>>>>>>>>>>>>> rocksdb::VersionEdit*)+0xcb7) 
[0x5586b2321457] >>>>>>>>>>>>>>>>>> 9: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned >>>>>>>>>>>>>>>>>> long, std::allocator<unsigned long> > const&, unsigned >>>>>>>>>>>>>>>>>> long*, bool)+0x19de) [0x5586b232373e] >>>>>>>>>>>>>>>>>> 10: >>>>>>>>>>>>>>>>>> (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, >>>>>>>>>>>>>>>>>> std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, >>>>>>>>>>>>>>>>>> bool, bool, bool)+0x5d4) [0x5586b23242f4] >>>>>>>>>>>>>>>>>> 11: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, >>>>>>>>>>>>>>>>>> std::__cxx11::basic_string<char, std::char_traits<char>, >>>>>>>>>>>>>>>>>> std::allocator<char> > const&, >>>>>>>>>>>>>>>>>> std::vector<rocksdb::ColumnFamilyDescriptor, >>>>>>>>>>>>>>>>>> std::allocator<rocksdb::ColumnFamilyDescri >>>>>>>>>>>>>>>>>> ptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, >>>>>>>>>>>>>>>>>> std::allocator<rocksdb::ColumnFamilyHandle*> >*, >>>>>>>>>>>>>>>>>> rocksdb::DB**, bool)+0x68b) [0x5586b232559b] >>>>>>>>>>>>>>>>>> 12: (rocksdb::DB::Open(rocksdb::DBOptions const&, >>>>>>>>>>>>>>>>>> std::__cxx11::basic_string<char, std::char_traits<char>, >>>>>>>>>>>>>>>>>> std::allocator<char> > const&, >>>>>>>>>>>>>>>>>> std::vector<rocksdb::ColumnFamilyDescriptor, >>>>>>>>>>>>>>>>>> std::allocator<rocksdb::ColumnFamilyDescriptor >>>>>>>>>>>>>>>>>>>> const&, std::vector<rocksdb::ColumnFamilyHandle*, >>>>>>>>>>>>>>>>>>>> std::allocator<rocksdb::ColumnFamilyHandle*> >*, >>>>>>>>>>>>>>>>>>>> rocksdb::DB**)+0x22) [0x5586b2326e72] >>>>>>>>>>>>>>>>>> 13: (RocksDBStore::do_open(std::ostream&, bool, >>>>>>>>>>>>>>>>>> std::vector<KeyValueDB::ColumnFamily, >>>>>>>>>>>>>>>>>> std::allocator<KeyValueDB::ColumnFamily> > const*)+0x170c) >>>>>>>>>>>>>>>>>> [0x5586b220219c] >>>>>>>>>>>>>>>>>> 14: (BlueStore::_open_db(bool, bool)+0xd8e) >>>>>>>>>>>>>>>>>> [0x5586b218ee1e] >>>>>>>>>>>>>>>>>> 15: (BlueStore::_mount(bool, bool)+0x4b7) [0x5586b21bf807] >>>>>>>>>>>>>>>>>> 16: (OSD::init()+0x295) [0x5586b1d673c5] >>>>>>>>>>>>>>>>>> 17: (main()+0x268d) [0x5586b1c554ed] >>>>>>>>>>>>>>>>>> 18: (__libc_start_main()+0xe7) [0x7f1d5ea2db97] >>>>>>>>>>>>>>>>>> 19: (_start()+0x2a) [0x5586b1d1d7fa] >>>>>>>>>>>>>>>>>> NOTE: a copy of the executable, or `objdump -rdS >>>>>>>>>>>>>>>>>> <executable>` is needed to interpret this. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>>> ceph-users mailing list >>>>>>>>>>>>>>>>>> ceph-users@lists.ceph.com >>>>>>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> ceph-users mailing list >>>>>>>>>>>>> ceph-users@lists.ceph.com >>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>>>> _______________________________________________ >>>>>>>> ceph-users mailing list >>>>>>>> ceph-users@lists.ceph.com >>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > _______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com