It didn't work; I emailed the logs to you.

> On 2.10.2018, at 14:43, Igor Fedotov <ifedo...@suse.de> wrote:
> 
> The major change is in the get_bluefs_rebalance_txn function; it lacked the
> bluefs_rebalance_txn assignment.
> 
> 
> 
> On 10/2/2018 2:40 PM, Sergey Malinin wrote:
>> PR doesn't seem to have changed since yesterday. Am I missing something?
>> 
>> 
>>> On 2.10.2018, at 14:15, Igor Fedotov <ifedo...@suse.de> wrote:
>>> 
>>> Please update the patch from the PR; it didn't update the bluefs extents list 
>>> before.
>>> 
>>> Also please set debug bluestore 20 when re-running repair and collect the 
>>> log.
>>> 
>>> If repair doesn't help, would you please send the repair and startup logs 
>>> directly to me, as I have some issues accessing ceph-post-file uploads.
>>> 
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> 
>>> On 10/2/2018 11:39 AM, Sergey Malinin wrote:
>>>> Yes, I ran repair on all OSDs and it finished with 'repair success'. I backed 
>>>> up the OSDs, so now I have more room to play.
>>>> I posted log files using ceph-post-file with the following IDs:
>>>> 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
>>>> 20df7df5-f0c9-4186-aa21-4e5c0172cd93
>>>> 
>>>> 
>>>>> On 2.10.2018, at 11:26, Igor Fedotov <ifedo...@suse.de> wrote:
>>>>> 
>>>>> You did run repair on these OSDs, didn't you? On all of them?
>>>>> 
>>>>> 
>>>>> Would you please provide logs for both types of failing OSDs (failed on 
>>>>> mount and failed with enospc). Prior to collecting, please remove the 
>>>>> existing logs and set debug bluestore to 20.
>>>>> 
>>>>> 
>>>>> 
>>>>> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
>>>>>> I was able to apply the patches to mimic, but nothing changed. The one OSD 
>>>>>> that I had expanded space on fails with a bluefs mount I/O error; the 
>>>>>> others keep failing with enospc.
>>>>>> 
>>>>>> 
>>>>>>> On 1.10.2018, at 19:26, Igor Fedotov <ifedo...@suse.de> wrote:
>>>>>>> 
>>>>>>> So you should call repair, which rebalances BlueFS space (i.e. allocates 
>>>>>>> additional space to it), hence allowing the OSD to start.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Igor
>>>>>>> 
>>>>>>> 
>>>>>>> On 10/1/2018 7:22 PM, Igor Fedotov wrote:
>>>>>>>> Not exactly. The rebalancing from this kv_sync_thread might still be 
>>>>>>>> deferred due to the nature of this thread (I'm not 100% sure, though).
>>>>>>>> 
>>>>>>>> Here is my PR showing the idea (still untested and perhaps 
>>>>>>>> unfinished!!!)
>>>>>>>> 
>>>>>>>> https://github.com/ceph/ceph/pull/24353
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Igor
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 10/1/2018 7:07 PM, Sergey Malinin wrote:
>>>>>>>>> Can you please confirm whether I got this right:
>>>>>>>>> 
>>>>>>>>> --- BlueStore.cc.bak    2018-10-01 18:54:45.096836419 +0300
>>>>>>>>> +++ BlueStore.cc    2018-10-01 19:01:35.937623861 +0300
>>>>>>>>> @@ -9049,22 +9049,17 @@
>>>>>>>>>         throttle_bytes.put(costs);
>>>>>>>>>           PExtentVector bluefs_gift_extents;
>>>>>>>>> -      if (bluefs &&
>>>>>>>>> -      after_flush - bluefs_last_balance >
>>>>>>>>> -      cct->_conf->bluestore_bluefs_balance_interval) {
>>>>>>>>> -    bluefs_last_balance = after_flush;
>>>>>>>>> -    int r = _balance_bluefs_freespace(&bluefs_gift_extents);
>>>>>>>>> -    assert(r >= 0);
>>>>>>>>> -    if (r > 0) {
>>>>>>>>> -      for (auto& p : bluefs_gift_extents) {
>>>>>>>>> -        bluefs_extents.insert(p.offset, p.length);
>>>>>>>>> -      }
>>>>>>>>> -      bufferlist bl;
>>>>>>>>> -      encode(bluefs_extents, bl);
>>>>>>>>> -      dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>>>>>>>>> -           << bluefs_extents << std::dec << dendl;
>>>>>>>>> -      synct->set(PREFIX_SUPER, "bluefs_extents", bl);
>>>>>>>>> +      int r = _balance_bluefs_freespace(&bluefs_gift_extents);
>>>>>>>>> +      ceph_assert(r >= 0);
>>>>>>>>> +      if (r > 0) {
>>>>>>>>> +    for (auto& p : bluefs_gift_extents) {
>>>>>>>>> +      bluefs_extents.insert(p.offset, p.length);
>>>>>>>>>       }
>>>>>>>>> +    bufferlist bl;
>>>>>>>>> +    encode(bluefs_extents, bl);
>>>>>>>>> +    dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>>>>>>>>> +         << bluefs_extents << std::dec << dendl;
>>>>>>>>> +    synct->set(PREFIX_SUPER, "bluefs_extents", bl);
>>>>>>>>>         }
>>>>>>>>>           // cleanup sync deferred keys
>>>>>>>>> 
>>>>>>>>>> On 1.10.2018, at 18:39, Igor Fedotov <ifedo...@suse.de> wrote:
>>>>>>>>>> 
>>>>>>>>>> So you have just a single main device per OSD....
>>>>>>>>>> 
>>>>>>>>>> Then ceph-bluestore-tool won't help; it is unable to expand the BlueFS 
>>>>>>>>>> partition on the main device, as only standalone devices are supported.
>>>>>>>>>> 
>>>>>>>>>> Given that you're able to rebuild the code, I can suggest making a 
>>>>>>>>>> patch that triggers a BlueFS rebalance (see the code snippet below) 
>>>>>>>>>> during repair.
>>>>>>>>>>      PExtentVector bluefs_gift_extents;
>>>>>>>>>>      int r = _balance_bluefs_freespace(&bluefs_gift_extents);
>>>>>>>>>>      ceph_assert(r >= 0);
>>>>>>>>>>      if (r > 0) {
>>>>>>>>>>        for (auto& p : bluefs_gift_extents) {
>>>>>>>>>>          bluefs_extents.insert(p.offset, p.length);
>>>>>>>>>>        }
>>>>>>>>>>        bufferlist bl;
>>>>>>>>>>        encode(bluefs_extents, bl);
>>>>>>>>>>        dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>>>>>>>>>>             << bluefs_extents << std::dec << dendl;
>>>>>>>>>>        synct->set(PREFIX_SUPER, "bluefs_extents", bl);
>>>>>>>>>>      }
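>>>>>>>>>> 
>>>>>>>>>> Roughly, wired into the repair path it could be wrapped like this (just 
>>>>>>>>>> an illustration of the idea, untested; the helper name and the 
>>>>>>>>>> transaction handling are my assumptions, not the actual PR):
>>>>>>>>>> 
>>>>>>>>>>      // Hypothetical helper in BlueStore.cc: gift extents to BlueFS and
>>>>>>>>>>      // persist the updated bluefs_extents in its own KV transaction.
>>>>>>>>>>      int BlueStore::_gift_bluefs_extents_on_repair()
>>>>>>>>>>      {
>>>>>>>>>>        KeyValueDB::Transaction t = db->get_transaction();
>>>>>>>>>>        PExtentVector bluefs_gift_extents;
>>>>>>>>>>        int r = _balance_bluefs_freespace(&bluefs_gift_extents);
>>>>>>>>>>        ceph_assert(r >= 0);
>>>>>>>>>>        if (r > 0) {
>>>>>>>>>>          for (auto& p : bluefs_gift_extents) {
>>>>>>>>>>            bluefs_extents.insert(p.offset, p.length);
>>>>>>>>>>          }
>>>>>>>>>>          bufferlist bl;
>>>>>>>>>>          encode(bluefs_extents, bl);
>>>>>>>>>>          t->set(PREFIX_SUPER, "bluefs_extents", bl);
>>>>>>>>>>          // submit synchronously so the gift is persisted before shutdown
>>>>>>>>>>          r = db->submit_transaction_sync(t);
>>>>>>>>>>        }
>>>>>>>>>>        return r;
>>>>>>>>>>      }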
>>>>>>>>>> 
>>>>>>>>>> If it can wait, I can probably make a corresponding PR tomorrow.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Igor
>>>>>>>>>> On 10/1/2018 6:16 PM, Sergey Malinin wrote:
>>>>>>>>>>> I have rebuilt the tool, but none of my OSDs, dead or alive, have any 
>>>>>>>>>>> symlinks other than 'block' pointing to LVM.
>>>>>>>>>>> I adjusted the main device size, but it looks like it needs even more 
>>>>>>>>>>> space for DB compaction. After executing bluefs-bdev-expand the OSD 
>>>>>>>>>>> fails to start; however, the 'fsck' and 'repair' commands finished 
>>>>>>>>>>> successfully.
>>>>>>>>>>> 
>>>>>>>>>>> 2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
>>>>>>>>>>> 2018-10-01 18:02:39.763 7fc9226c6240  1 
>>>>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc opening allocation 
>>>>>>>>>>> metadata
>>>>>>>>>>> 2018-10-01 18:02:40.907 7fc9226c6240  1 
>>>>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc loaded 285 GiB in 
>>>>>>>>>>> 2249899 extents
>>>>>>>>>>> 2018-10-01 18:02:40.951 7fc9226c6240 -1 
>>>>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace 
>>>>>>>>>>> bluefs extra 0x[6d6f000000~50c800000]
>>>>>>>>>>> 2018-10-01 18:02:40.951 7fc9226c6240  1 stupidalloc 
>>>>>>>>>>> 0x0x55d053fb9180 shutdown
>>>>>>>>>>> 2018-10-01 18:02:40.963 7fc9226c6240  1 freelist shutdown
>>>>>>>>>>> 2018-10-01 18:02:40.963 7fc9226c6240  4 rocksdb: 
>>>>>>>>>>> [/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:252] Shutdown: 
>>>>>>>>>>> canceling all background work
>>>>>>>>>>> 2018-10-01 18:02:40.967 7fc9226c6240  4 rocksdb: 
>>>>>>>>>>> [/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:397] Shutdown complete
>>>>>>>>>>> 2018-10-01 18:02:40.971 7fc9226c6240  1 bluefs umount
>>>>>>>>>>> 2018-10-01 18:02:40.975 7fc9226c6240  1 stupidalloc 
>>>>>>>>>>> 0x0x55d053883800 shutdown
>>>>>>>>>>> 2018-10-01 18:02:40.975 7fc9226c6240  1 bdev(0x55d053c32e00 
>>>>>>>>>>> /var/lib/ceph/osd/ceph-1/block) close
>>>>>>>>>>> 2018-10-01 18:02:41.267 7fc9226c6240  1 bdev(0x55d053c32a80 
>>>>>>>>>>> /var/lib/ceph/osd/ceph-1/block) close
>>>>>>>>>>> 2018-10-01 18:02:41.443 7fc9226c6240 -1 osd.1 0 OSD:init: unable to 
>>>>>>>>>>> mount object store
>>>>>>>>>>> 2018-10-01 18:02:41.443 7fc9226c6240 -1  ** ERROR: osd init failed: 
>>>>>>>>>>> (5) Input/output error
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On 1.10.2018, at 18:09, Igor Fedotov <ifedo...@suse.de> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Well, actually you can avoid bluestore-tool rebuild.
>>>>>>>>>>>> 
>>>>>>>>>>>> You'll need to edit the first chunk of block.db, where the labels are 
>>>>>>>>>>>> stored. (Please make a backup first!!!)
>>>>>>>>>>>> 
>>>>>>>>>>>> The size label is stored at offset 0x52 and is 8 bytes long, encoded 
>>>>>>>>>>>> as a little-endian 64-bit integer. (Please verify that the old value 
>>>>>>>>>>>> at this offset exactly corresponds to your original volume size 
>>>>>>>>>>>> and/or the 'size' label reported by ceph-bluestore-tool.)
>>>>>>>>>>>> 
>>>>>>>>>>>> So you have to put the new DB volume size there. Or you can send the 
>>>>>>>>>>>> first 4K chunk (e.g. extracted with dd) along with the new DB volume 
>>>>>>>>>>>> size (in bytes) to me and I'll do that for you.
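>>>>>>>>>>>> 
>>>>>>>>>>>> For illustration, a minimal standalone patcher for that label could 
>>>>>>>>>>>> look like this (an untested sketch of mine, not a supported tool; it 
>>>>>>>>>>>> assumes the 0x52 offset and little-endian 64-bit encoding described 
>>>>>>>>>>>> above, and a little-endian host, and it prints the old value so you 
>>>>>>>>>>>> can check it before writing):
>>>>>>>>>>>> 
>>>>>>>>>>>>   #include <cstdint>
>>>>>>>>>>>>   #include <cstdlib>
>>>>>>>>>>>>   #include <fstream>
>>>>>>>>>>>>   #include <iostream>
>>>>>>>>>>>> 
>>>>>>>>>>>>   int main(int argc, char** argv) {
>>>>>>>>>>>>     if (argc != 3) {
>>>>>>>>>>>>       std::cerr << "usage: " << argv[0]
>>>>>>>>>>>>                 << " <block.db device> <new size in bytes>\n";
>>>>>>>>>>>>       return 1;
>>>>>>>>>>>>     }
>>>>>>>>>>>>     const std::streamoff size_offset = 0x52;  // per the description above
>>>>>>>>>>>>     uint64_t new_size = std::strtoull(argv[2], nullptr, 10);
>>>>>>>>>>>> 
>>>>>>>>>>>>     std::fstream f(argv[1], std::ios::in | std::ios::out | std::ios::binary);
>>>>>>>>>>>>     if (!f) { std::cerr << "cannot open " << argv[1] << "\n"; return 1; }
>>>>>>>>>>>> 
>>>>>>>>>>>>     // Read back the old label so it can be compared against the original
>>>>>>>>>>>>     // volume size / 'size' label reported by ceph-bluestore-tool.
>>>>>>>>>>>>     uint64_t old_size = 0;
>>>>>>>>>>>>     f.seekg(size_offset);
>>>>>>>>>>>>     f.read(reinterpret_cast<char*>(&old_size), sizeof(old_size));
>>>>>>>>>>>>     std::cout << "old size label: " << old_size << "\n";
>>>>>>>>>>>> 
>>>>>>>>>>>>     // Overwrite the 8-byte size label in place.
>>>>>>>>>>>>     f.seekp(size_offset);
>>>>>>>>>>>>     f.write(reinterpret_cast<const char*>(&new_size), sizeof(new_size));
>>>>>>>>>>>>     f.flush();
>>>>>>>>>>>>     std::cout << "new size label: " << new_size << "\n";
>>>>>>>>>>>>     return 0;
>>>>>>>>>>>>   }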
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> 
>>>>>>>>>>>> Igor
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 10/1/2018 5:32 PM, Igor Fedotov wrote:
>>>>>>>>>>>>> On 10/1/2018 5:03 PM, Sergey Malinin wrote:
>>>>>>>>>>>>>> Before I received your response, I had already added 20GB to the 
>>>>>>>>>>>>>> OSD (by expanding the LV followed by bluefs-bdev-expand) and ran 
>>>>>>>>>>>>>> "ceph-kvstore-tool bluestore-kv <path> compact"; however, it still 
>>>>>>>>>>>>>> needs more space.
>>>>>>>>>>>>>> Is that because I didn't update the DB size with set-label-key?
>>>>>>>>>>>>> In mimic you need to run both the "bluefs-bdev-expand" and 
>>>>>>>>>>>>> "set-label-key" commands to commit the bluefs volume expansion.
>>>>>>>>>>>>> Unfortunately, the latter command doesn't handle the "size" label 
>>>>>>>>>>>>> properly. That's why you might need to backport and rebuild with the 
>>>>>>>>>>>>> mentioned commits.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> What exactly is the label key that needs to be updated? I couldn't 
>>>>>>>>>>>>>> find which one is related to the DB:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> # ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1
>>>>>>>>>>>>>> inferring bluefs devices from bluestore path
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>       "/var/lib/ceph/osd/ceph-1/block": {
>>>>>>>>>>>>>>           "osd_uuid": "f8f122ee-70a6-4c54-8eb0-9b42205b1ecc",
>>>>>>>>>>>>>>           "size": 471305551872,
>>>>>>>>>>>>>>           "btime": "2018-07-31 03:06:43.751243",
>>>>>>>>>>>>>>           "description": "main",
>>>>>>>>>>>>>>           "bluefs": "1",
>>>>>>>>>>>>>>           "ceph_fsid": "7d320499-5b3f-453e-831f-60d4db9a4533",
>>>>>>>>>>>>>>           "kv_backend": "rocksdb",
>>>>>>>>>>>>>>           "magic": "ceph osd volume v026",
>>>>>>>>>>>>>>           "mkfs_done": "yes",
>>>>>>>>>>>>>>           "osd_key": "XXX",
>>>>>>>>>>>>>>           "ready": "ready",
>>>>>>>>>>>>>>           "whoami": "1"
>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>> }
>>>>>>>>>>>>> The 'size' label, but your output is for the block (aka slow) device.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It should return labels for the db/wal devices as well (the block.db 
>>>>>>>>>>>>> and block.wal symlinks respectively). It works for me in master; I 
>>>>>>>>>>>>> can't verify with mimic at the moment, though.
>>>>>>>>>>>>> Here is output for master:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> # bin/ceph-bluestore-tool show-label --path dev/osd0
>>>>>>>>>>>>> inferring bluefs devices from bluestore path
>>>>>>>>>>>>> {
>>>>>>>>>>>>>      "dev/osd0/block": {
>>>>>>>>>>>>>          "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
>>>>>>>>>>>>>          "size": 21474836480,
>>>>>>>>>>>>>          "btime": "2018-09-10 15:55:09.044039",
>>>>>>>>>>>>>          "description": "main",
>>>>>>>>>>>>>          "bluefs": "1",
>>>>>>>>>>>>>          "ceph_fsid": "56eddc15-11b9-4e0b-9192-e391fbae551c",
>>>>>>>>>>>>>          "kv_backend": "rocksdb",
>>>>>>>>>>>>>          "magic": "ceph osd volume v026",
>>>>>>>>>>>>>          "mkfs_done": "yes",
>>>>>>>>>>>>>          "osd_key": "AQCsaZZbYTxXJBAAe3jJI4p6WbMjvA8CBBUJbA==",
>>>>>>>>>>>>>          "ready": "ready",
>>>>>>>>>>>>>          "whoami": "0"
>>>>>>>>>>>>>      },
>>>>>>>>>>>>>      "dev/osd0/block.wal": {
>>>>>>>>>>>>>          "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
>>>>>>>>>>>>>          "size": 1048576000,
>>>>>>>>>>>>>          "btime": "2018-09-10 15:55:09.044985",
>>>>>>>>>>>>>          "description": "bluefs wal"
>>>>>>>>>>>>>      },
>>>>>>>>>>>>>      "dev/osd0/block.db": {
>>>>>>>>>>>>>          "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
>>>>>>>>>>>>>          "size": 1048576000,
>>>>>>>>>>>>>          "btime": "2018-09-10 15:55:09.044469",
>>>>>>>>>>>>>          "description": "bluefs db"
>>>>>>>>>>>>>      }
>>>>>>>>>>>>> }
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> You can try --dev option instead of --path, e.g.
>>>>>>>>>>>>> ceph-bluestore-tool show-label --dev <path-to-block.db>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 1.10.2018, at 16:48, Igor Fedotov <ifedo...@suse.de> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This looks like a sort of deadlock: BlueFS needs some additional 
>>>>>>>>>>>>>>> space to replay the log left after the crash, which happens during 
>>>>>>>>>>>>>>> BlueFS open.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> But such space (on the slow device, as the DB is full) is gifted in 
>>>>>>>>>>>>>>> the background during the bluefs rebalance procedure, which would 
>>>>>>>>>>>>>>> only occur after the open.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hence the OSDs are stuck permanently crashing.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The only way to recover that I can suggest for now is to expand the 
>>>>>>>>>>>>>>> DB volumes. You can do that with LVM tools if you have any spare 
>>>>>>>>>>>>>>> space for that.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Once resized, you'll need ceph-bluestore-tool to indicate the 
>>>>>>>>>>>>>>> volume expansion to BlueFS (the bluefs-bdev-expand command) and 
>>>>>>>>>>>>>>> finally update the DB volume size label with the set-label-key 
>>>>>>>>>>>>>>> command.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The latter is a bit tricky for mimic - you might need to 
>>>>>>>>>>>>>>> backport 
>>>>>>>>>>>>>>> https://github.com/ceph/ceph/pull/22085/commits/ffac450da5d6e09cf14b8363b35f21819b48f38b
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> and rebuild ceph-bluestore-tool. Alternatively, you can backport 
>>>>>>>>>>>>>>> https://github.com/ceph/ceph/pull/22085/commits/71c3b58da4e7ced3422bce2b1da0e3fa9331530b
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> and then the bluefs expansion and label update will occur in a 
>>>>>>>>>>>>>>> single step.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I'll do these backports upstream, but it will take some time for 
>>>>>>>>>>>>>>> them to pass all the procedures and get into an official mimic 
>>>>>>>>>>>>>>> release.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I will file a ticket to fix the original issue as well.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Igor
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 10/1/2018 3:28 PM, Sergey Malinin wrote:
>>>>>>>>>>>>>>>> These are LVM bluestore NVMe SSDs created with "ceph-volume 
>>>>>>>>>>>>>>>> --lvm prepare --bluestore /dev/nvme0n1p3", i.e. without 
>>>>>>>>>>>>>>>> specifying wal/db devices.
>>>>>>>>>>>>>>>> The OSDs were created with bluestore_min_alloc_size_ssd=4096; 
>>>>>>>>>>>>>>>> another modified setting is bluestore_cache_kv_max=1073741824
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> DB/block usage collected by the prometheus module for the 3 failed 
>>>>>>>>>>>>>>>> OSDs and the 1 surviving OSD:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.0"} 65493008384.0
>>>>>>>>>>>>>>>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.1"} 49013587968.0
>>>>>>>>>>>>>>>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.2"} 76834406400.0 
>>>>>>>>>>>>>>>> --> this one has survived
>>>>>>>>>>>>>>>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.3"} 63726157824.0
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.0"} 65217232896.0
>>>>>>>>>>>>>>>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.1"} 48944381952.0
>>>>>>>>>>>>>>>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.2"} 68093476864.0
>>>>>>>>>>>>>>>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.3"} 63632834560.0
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ceph_osd_stat_bytes{ceph_daemon="osd.0"} 471305551872.0
>>>>>>>>>>>>>>>> ceph_osd_stat_bytes{ceph_daemon="osd.1"} 471305551872.0
>>>>>>>>>>>>>>>> ceph_osd_stat_bytes{ceph_daemon="osd.2"} 471305551872.0
>>>>>>>>>>>>>>>> ceph_osd_stat_bytes{ceph_daemon="osd.3"} 471305551872.0
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ceph_osd_stat_bytes_used{ceph_daemon="osd.0"} 222328213504.0
>>>>>>>>>>>>>>>> ceph_osd_stat_bytes_used{ceph_daemon="osd.1"} 214472544256.0
>>>>>>>>>>>>>>>> ceph_osd_stat_bytes_used{ceph_daemon="osd.2"} 163603996672.0
>>>>>>>>>>>>>>>> ceph_osd_stat_bytes_used{ceph_daemon="osd.3"} 212806815744.0
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The first crashed OSD was doing DB compaction; the others crashed 
>>>>>>>>>>>>>>>> shortly after, during backfilling. The workload was 
>>>>>>>>>>>>>>>> "cephfs-data-scan scan_inodes" filling the metadata pool located 
>>>>>>>>>>>>>>>> on these OSDs at a rate close to 10k objects/second.
>>>>>>>>>>>>>>>> Here is the log excerpt of the first crash occurrence:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 2018-10-01 03:27:12.762 7fbf16dd6700  0 
>>>>>>>>>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-1) _balance_bluefs_freespace 
>>>>>>>>>>>>>>>> no allocate on 0x80000000 min_alloc_size 0x1000
>>>>>>>>>>>>>>>> 2018-10-01 03:27:12.886 7fbf1e5e5700  4 rocksdb: 
>>>>>>>>>>>>>>>> [/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] 
>>>>>>>>>>>>>>>> [default] [JOB 24] Generated table #89741: 106356 keys, 
>>>>>>>>>>>>>>>> 68110589 bytes
>>>>>>>>>>>>>>>> 2018-10-01 03:27:12.886 7fbf1e5e5700  4 rocksdb: EVENT_LOG_v1 
>>>>>>>>>>>>>>>> {"time_micros": 1538353632892744, "cf_name": "default", "job": 
>>>>>>>>>>>>>>>> 24, "event": "table_file_creation", "file_number": 89741, 
>>>>>>>>>>>>>>>> "file_size": 68110589, "table_properties": {"data_size": 
>>>>>>>>>>>>>>>> 67112903, "index_size": 579319, "filter_size": 417316, 
>>>>>>>>>>>>>>>> "raw_key_size": 6733561, "raw_average_key_size": 63, 
>>>>>>>>>>>>>>>> "raw_value_size": 60994583, "raw_average_value_size": 573, 
>>>>>>>>>>>>>>>> "num_data_blocks": 16336, "num_entries": 106356, 
>>>>>>>>>>>>>>>> "filter_policy_name": "rocksdb.BuiltinBloomFilter", 
>>>>>>>>>>>>>>>> "kDeletedKeys": "14444", "kMergeOperands": "0"}}
>>>>>>>>>>>>>>>> 2018-10-01 03:27:12.934 7fbf1e5e5700  4 rocksdb: 
>>>>>>>>>>>>>>>> [/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] 
>>>>>>>>>>>>>>>> [default] [JOB 24] Generated table #89742: 23214 keys, 
>>>>>>>>>>>>>>>> 16352315 bytes
>>>>>>>>>>>>>>>> 2018-10-01 03:27:12.934 7fbf1e5e5700  4 rocksdb: EVENT_LOG_v1 
>>>>>>>>>>>>>>>> {"time_micros": 1538353632938670, "cf_name": "default", "job": 
>>>>>>>>>>>>>>>> 24, "event": "table_file_creation", "file_number": 89742, 
>>>>>>>>>>>>>>>> "file_size": 16352315, "table_properties": {"data_size": 
>>>>>>>>>>>>>>>> 16116986, "index_size": 139894, "filter_size": 94386, 
>>>>>>>>>>>>>>>> "raw_key_size": 1470883, "raw_average_key_size": 63, 
>>>>>>>>>>>>>>>> "raw_value_size": 14775006, "raw_average_value_size": 636, 
>>>>>>>>>>>>>>>> "num_data_blocks": 3928, "num_entries": 23214, 
>>>>>>>>>>>>>>>> "filter_policy_name": "rocksdb.BuiltinBloomFilter", 
>>>>>>>>>>>>>>>> "kDeletedKeys": "90", "kMergeOperands": "0"}}
>>>>>>>>>>>>>>>> 2018-10-01 03:27:13.042 7fbf1e5e5700  1 bluefs _allocate 
>>>>>>>>>>>>>>>> failed to allocate 0x4100000 on bdev 1, free 0x1a00000; 
>>>>>>>>>>>>>>>> fallback to bdev 2
>>>>>>>>>>>>>>>> 2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _allocate 
>>>>>>>>>>>>>>>> failed to allocate 0x4100000 on bdev 2, dne
>>>>>>>>>>>>>>>> 2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _flush_range 
>>>>>>>>>>>>>>>> allocated: 0x0 offset: 0x0 length: 0x40ea9f1
>>>>>>>>>>>>>>>> 2018-10-01 03:27:13.046 7fbf1e5e5700 -1 
>>>>>>>>>>>>>>>> /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 
>>>>>>>>>>>>>>>> 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, 
>>>>>>>>>>>>>>>> uint64_t)' thread 7fbf1e5e5700 time 2018-10-01 03:27:13.048298
>>>>>>>>>>>>>>>> /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED 
>>>>>>>>>>>>>>>> assert(0 == "bluefs enospc")
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>    ceph version 13.2.2 
>>>>>>>>>>>>>>>> (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
>>>>>>>>>>>>>>>>    1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
>>>>>>>>>>>>>>>> char const*)+0x102) [0x7fbf2d4fe5c2]
>>>>>>>>>>>>>>>>    2: (()+0x26c787) [0x7fbf2d4fe787]
>>>>>>>>>>>>>>>>    3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned 
>>>>>>>>>>>>>>>> long, unsigned long)+0x1ab4) [0x5619325114b4]
>>>>>>>>>>>>>>>>    4: (BlueRocksWritableFile::Flush()+0x3d) [0x561932527c1d]
>>>>>>>>>>>>>>>>    5: (rocksdb::WritableFileWriter::Flush()+0x1b9) 
>>>>>>>>>>>>>>>> [0x56193271c399]
>>>>>>>>>>>>>>>>    6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) 
>>>>>>>>>>>>>>>> [0x56193271d42b]
>>>>>>>>>>>>>>>>    7: 
>>>>>>>>>>>>>>>> (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status
>>>>>>>>>>>>>>>>  const&, rocksdb::CompactionJob::SubcompactionState*, 
>>>>>>>>>>>>>>>> rocksdb::RangeDelAggregator*, CompactionIterationStats*, 
>>>>>>>>>>>>>>>> rocksdb::Slice const*)+0x3db) [0x56193276098b]
>>>>>>>>>>>>>>>>    8: 
>>>>>>>>>>>>>>>> (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d9)
>>>>>>>>>>>>>>>>  [0x561932763da9]
>>>>>>>>>>>>>>>>    9: (rocksdb::CompactionJob::Run()+0x314) [0x561932765504]
>>>>>>>>>>>>>>>>    10: (rocksdb::DBImpl::BackgroundCompaction(bool*, 
>>>>>>>>>>>>>>>> rocksdb::JobContext*, rocksdb::LogBuffer*, 
>>>>>>>>>>>>>>>> rocksdb::DBImpl::PrepickedCompaction*)+0xc54) [0x5619325b5c44]
>>>>>>>>>>>>>>>>    11: 
>>>>>>>>>>>>>>>> (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*,
>>>>>>>>>>>>>>>>  rocksdb::Env::Priority)+0x397) [0x5619325b8557]
>>>>>>>>>>>>>>>>    12: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) 
>>>>>>>>>>>>>>>> [0x5619325b8cd7]
>>>>>>>>>>>>>>>>    13: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned 
>>>>>>>>>>>>>>>> long)+0x266) [0x5619327a5e36]
>>>>>>>>>>>>>>>>    14: 
>>>>>>>>>>>>>>>> (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x47) 
>>>>>>>>>>>>>>>> [0x5619327a5fb7]
>>>>>>>>>>>>>>>>    15: (()+0xbe733) [0x7fbf2b500733]
>>>>>>>>>>>>>>>>    16: (()+0x76db) [0x7fbf2bbf86db]
>>>>>>>>>>>>>>>>    17: (clone()+0x3f) [0x7fbf2abbc88f]
>>>>>>>>>>>>>>>>    NOTE: a copy of the executable, or `objdump -rdS 
>>>>>>>>>>>>>>>> <executable>` is needed to interpret this.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 1.10.2018, at 15:01, Igor Fedotov <ifedo...@suse.de> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Sergey,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> could you please provide more details on your OSDs?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> What are the sizes of the DB/block devices?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Do you have any modifications in BlueStore config settings?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Can you share stats you're referring to?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Igor
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 10/1/2018 12:29 PM, Sergey Malinin wrote:
>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>> 3 of 4 NVMe OSDs crashed at the same time on assert(0 == 
>>>>>>>>>>>>>>>>>> "bluefs enospc") and no longer start.
>>>>>>>>>>>>>>>>>> Stats collected just before the crash show that 
>>>>>>>>>>>>>>>>>> ceph_bluefs_db_used_bytes is at 100%. Although the OSDs have 
>>>>>>>>>>>>>>>>>> over 50% free space, it is not reallocated for DB usage.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 2018-10-01 12:18:06.744 7f1d6a04d240  1 bluefs _allocate 
>>>>>>>>>>>>>>>>>> failed to allocate 0x100000 on bdev 1, free 0x0; fallback to 
>>>>>>>>>>>>>>>>>> bdev 2
>>>>>>>>>>>>>>>>>> 2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _allocate 
>>>>>>>>>>>>>>>>>> failed to allocate 0x100000 on bdev 2, dne
>>>>>>>>>>>>>>>>>> 2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _flush_range 
>>>>>>>>>>>>>>>>>> allocated: 0x0 offset: 0x0 length: 0xa8700
>>>>>>>>>>>>>>>>>> 2018-10-01 12:18:06.748 7f1d6a04d240 -1 
>>>>>>>>>>>>>>>>>> /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 
>>>>>>>>>>>>>>>>>> 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, 
>>>>>>>>>>>>>>>>>> uint64_t)' thread 7f1d6a04d240 time 2018-10-01 
>>>>>>>>>>>>>>>>>> 12:18:06.746800
>>>>>>>>>>>>>>>>>> /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED 
>>>>>>>>>>>>>>>>>> assert(0 == "bluefs enospc")
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>    ceph version 13.2.2 
>>>>>>>>>>>>>>>>>> (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
>>>>>>>>>>>>>>>>>>    1: (ceph::__ceph_assert_fail(char const*, char const*, 
>>>>>>>>>>>>>>>>>> int, char const*)+0x102) [0x7f1d6146f5c2]
>>>>>>>>>>>>>>>>>>    2: (()+0x26c787) [0x7f1d6146f787]
>>>>>>>>>>>>>>>>>>    3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned 
>>>>>>>>>>>>>>>>>> long, unsigned long)+0x1ab4) [0x5586b22684b4]
>>>>>>>>>>>>>>>>>>    4: (BlueRocksWritableFile::Flush()+0x3d) [0x5586b227ec1d]
>>>>>>>>>>>>>>>>>>    5: (rocksdb::WritableFileWriter::Flush()+0x1b9) 
>>>>>>>>>>>>>>>>>> [0x5586b2473399]
>>>>>>>>>>>>>>>>>>    6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) 
>>>>>>>>>>>>>>>>>> [0x5586b247442b]
>>>>>>>>>>>>>>>>>>    7: (rocksdb::BuildTable(std::__cxx11::basic_string<char, 
>>>>>>>>>>>>>>>>>> std::char_traits<char>, std::allocator<char> > const&, 
>>>>>>>>>>>>>>>>>> rocksdb::Env*, rocksdb::ImmutableCFOptions const&, 
>>>>>>>>>>>>>>>>>> rocksdb::MutableCFOptions const&, rocksdb::EnvOptions 
>>>>>>>>>>>>>>>>>> const&, rock
>>>>>>>>>>>>>>>>>> sdb::TableCache*, rocksdb::InternalIterator*, 
>>>>>>>>>>>>>>>>>> std::unique_ptr<rocksdb::InternalIterator, 
>>>>>>>>>>>>>>>>>> std::default_delete<rocksdb::InternalIterator> >, 
>>>>>>>>>>>>>>>>>> rocksdb::FileMetaData*, rocksdb::InternalKeyComparator 
>>>>>>>>>>>>>>>>>> const&, std::vector<std::unique_ptr<
>>>>>>>>>>>>>>>>>> rocksdb::IntTblPropCollectorFactory, 
>>>>>>>>>>>>>>>>>> std::default_delete<rocksdb::IntTblPropCollectorFactory> >, 
>>>>>>>>>>>>>>>>>> std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory,
>>>>>>>>>>>>>>>>>>  std::default_delete<rocksdb::IntTblPropCollectorFactory> > 
>>>>>>>>>>>>>>>>>> > > co
>>>>>>>>>>>>>>>>>> nst*, unsigned int, std::__cxx11::basic_string<char, 
>>>>>>>>>>>>>>>>>> std::char_traits<char>, std::allocator<char> > const&, 
>>>>>>>>>>>>>>>>>> std::vector<unsigned long, std::allocator<unsigned long> >, 
>>>>>>>>>>>>>>>>>> unsigned long, rocksdb::SnapshotChecker*, 
>>>>>>>>>>>>>>>>>> rocksdb::Compression
>>>>>>>>>>>>>>>>>> Type, rocksdb::CompressionOptions const&, bool, 
>>>>>>>>>>>>>>>>>> rocksdb::InternalStats*, rocksdb::TableFileCreationReason, 
>>>>>>>>>>>>>>>>>> rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, 
>>>>>>>>>>>>>>>>>> rocksdb::TableProperties*, int, unsigned long, unsigned 
>>>>>>>>>>>>>>>>>> long, rocksdb
>>>>>>>>>>>>>>>>>> ::Env::WriteLifeTimeHint)+0x1e24) [0x5586b249ef94]
>>>>>>>>>>>>>>>>>>    8: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, 
>>>>>>>>>>>>>>>>>> rocksdb::ColumnFamilyData*, rocksdb::MemTable*, 
>>>>>>>>>>>>>>>>>> rocksdb::VersionEdit*)+0xcb7) [0x5586b2321457]
>>>>>>>>>>>>>>>>>>    9: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned 
>>>>>>>>>>>>>>>>>> long, std::allocator<unsigned long> > const&, unsigned 
>>>>>>>>>>>>>>>>>> long*, bool)+0x19de) [0x5586b232373e]
>>>>>>>>>>>>>>>>>>    10: 
>>>>>>>>>>>>>>>>>> (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor,
>>>>>>>>>>>>>>>>>>  std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, 
>>>>>>>>>>>>>>>>>> bool, bool, bool)+0x5d4) [0x5586b23242f4]
>>>>>>>>>>>>>>>>>>    11: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, 
>>>>>>>>>>>>>>>>>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>>>>>>>>>>>>>>>>>> std::allocator<char> > const&, 
>>>>>>>>>>>>>>>>>> std::vector<rocksdb::ColumnFamilyDescriptor, 
>>>>>>>>>>>>>>>>>> std::allocator<rocksdb::ColumnFamilyDescri
>>>>>>>>>>>>>>>>>> ptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, 
>>>>>>>>>>>>>>>>>> std::allocator<rocksdb::ColumnFamilyHandle*> >*, 
>>>>>>>>>>>>>>>>>> rocksdb::DB**, bool)+0x68b) [0x5586b232559b]
>>>>>>>>>>>>>>>>>>    12: (rocksdb::DB::Open(rocksdb::DBOptions const&, 
>>>>>>>>>>>>>>>>>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>>>>>>>>>>>>>>>>>> std::allocator<char> > const&, 
>>>>>>>>>>>>>>>>>> std::vector<rocksdb::ColumnFamilyDescriptor, 
>>>>>>>>>>>>>>>>>> std::allocator<rocksdb::ColumnFamilyDescriptor
>>>>>>>>>>>>>>>>>>>> const&, std::vector<rocksdb::ColumnFamilyHandle*, 
>>>>>>>>>>>>>>>>>>>> std::allocator<rocksdb::ColumnFamilyHandle*> >*, 
>>>>>>>>>>>>>>>>>>>> rocksdb::DB**)+0x22) [0x5586b2326e72]
>>>>>>>>>>>>>>>>>>    13: (RocksDBStore::do_open(std::ostream&, bool, 
>>>>>>>>>>>>>>>>>> std::vector<KeyValueDB::ColumnFamily, 
>>>>>>>>>>>>>>>>>> std::allocator<KeyValueDB::ColumnFamily> > const*)+0x170c) 
>>>>>>>>>>>>>>>>>> [0x5586b220219c]
>>>>>>>>>>>>>>>>>>    14: (BlueStore::_open_db(bool, bool)+0xd8e) 
>>>>>>>>>>>>>>>>>> [0x5586b218ee1e]
>>>>>>>>>>>>>>>>>>    15: (BlueStore::_mount(bool, bool)+0x4b7) [0x5586b21bf807]
>>>>>>>>>>>>>>>>>>    16: (OSD::init()+0x295) [0x5586b1d673c5]
>>>>>>>>>>>>>>>>>>    17: (main()+0x268d) [0x5586b1c554ed]
>>>>>>>>>>>>>>>>>>    18: (__libc_start_main()+0xe7) [0x7f1d5ea2db97]
>>>>>>>>>>>>>>>>>>    19: (_start()+0x2a) [0x5586b1d1d7fa]
>>>>>>>>>>>>>>>>>>    NOTE: a copy of the executable, or `objdump -rdS 
>>>>>>>>>>>>>>>>>> <executable>` is needed to interpret this.
>>>>>>>>>>>>>>>>>> 
> 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
