Hello,

Today I ran a lot of read I/O with a simple rsync... and again, an OSD crashed:

But as before, I can't restart the OSD: it keeps crashing. So the OSD is out and the cluster is recovering.

I just had time to increase the OSD log level:

# ceph tell osd.14 injectargs '--debug-osd 5/5'

Log attached; the crash context can be extracted with:

# grep -B100 -100 objdump /var/log/ceph/ceph-osd.14.log

If I run another read like this, another OSD will probably crash.

Any idea?

I will probably plan to move the data from the erasure-coded pool to a 3x replicated pool: the cluster is becoming unstable without any change on my side.
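For the migration itself, a per-image copy should do it (pool and image names below are only examples, and snapshots would have to be handled separately):

# rbd cp ec-pool/myimage rep3-pool/myimage

or, streamed:

# rbd export ec-pool/myimage - | rbd import - rep3-pool/myimage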

Regards,

PS: Last Sunday, I lost an RBD header while removing the cache tier... many thanks to http://fnordahl.com/2017/04/17/ceph-rbd-volume-header-recovery/ for helping me recreate it and resurrect the RBD disk :)

On 19/10/2017 at 00:19, Brad Hubbard wrote:
On Wed, Oct 18, 2017 at 11:16 PM, pascal.pu...@pci-conseil.net
<pascal.pu...@pci-conseil.net> wrote:
hello,

For 2 weeks, I have occasionally been losing OSDs.
Here is the trace:

     0> 2017-10-18 05:16:40.873511 7f7c1e497700 -1 osd/ReplicatedPG.cc: In
function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&,
unsigned int)' thread 7f7c1e497700 time 2017-10-18 05:16:40.869962
osd/ReplicatedPG.cc: 11782: FAILED assert(obc)
Can you try to capture a log with debug_osd set to 10 or greater as
per http://tracker.ceph.com/issues/19185 ?

This will allow us to see the output from the
PrimaryLogPG::get_object_context() function which may help identify
the problem.
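For example, assuming osd.14 is the affected OSD (adjust the id and level as needed):

# ceph tell osd.14 injectargs '--debug-osd 10/10'

If the daemon dies before the setting can be injected, it can instead be set in ceph.conf on that node before restarting the OSD:

[osd]
debug osd = 10/10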

Please also check your machines all have the same time zone set and
their clocks are in sync.
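For example, on each node (depending on whether ntpd or chrony is in use):

# timedatectl
# ntpq -p          (ntpd)
# chronyc sources  (chrony)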

  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0x55eec15a09e5]
  2: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext,
std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x6dd)
[0x55eec107a52d]
  3: (ReplicatedPG::hit_set_persist()+0xd7c) [0x55eec107d1bc]
  4: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1a92)
[0x55eec109bbe2]
  5: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x747) [0x55eec10588a7]
  6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>,
ThreadPool::TPHandle&)+0x41d) [0x55eec0f0bbad]
  7: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6d)
[0x55eec0f0bdfd]
  8: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x77b) [0x55eec0f0f7db]
  9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x887)
[0x55eec1590987]
  10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55eec15928f0]
  11: (()+0x7e25) [0x7f7c4fd52e25]
  12: (clone()+0x6d) [0x7f7c4e3dc34d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
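For example, on the node hosting the crashed OSD, something along these lines (assuming the standard package path for the ceph-osd binary):

# objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump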

I am using Jewel 10.2.10

I am using an erasure-coded pool (2+1) + an NVMe cache tier (writeback, 3
replicas) with simple RBD disks.
(12 SATA OSD disks per node on 4 nodes + 1 NVMe on each node = 48 SATA OSDs +
8 NVMe OSDs, as I split each NVMe in 2.)
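Roughly, the layout was created with commands like these (profile name, pool names and PG counts are only examples, not my exact values, and the cache pool is pinned to the NVMe OSDs with a dedicated CRUSH rule):

# ceph osd erasure-code-profile set ec-2-1 k=2 m=1
# ceph osd pool create ec-pool 1024 1024 erasure ec-2-1
# ceph osd pool create nvme-cache 128 128 replicated
# ceph osd pool set nvme-cache size 3
# ceph osd tier add ec-pool nvme-cache
# ceph osd tier cache-mode nvme-cache writeback
# ceph osd tier set-overlay ec-pool nvme-cache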
Last week, only NVMe OSDs were crashing, so I unmapped all the disks, destroyed
the cache tier and recreated it.
Since then it had been working fine. Today an OSD crashed again, but this time
it was not an NVMe OSD, just a normal SATA OSD.

Any idea? What is this void ReplicatedPG::hit_set_trim about?
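As far as I understand, hit_set_trim removes old hit-set objects from the cache pool, and the assert fires when an expected hit-set object cannot be found. The hit-set settings of the cache pool can be checked with something like (pool name is only an example):

# ceph osd pool get nvme-cache hit_set_type
# ceph osd pool get nvme-cache hit_set_count
# ceph osd pool get nvme-cache hit_set_period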

Thanks for your help,
Regards,





--
Performance Conseil Informatique
Pascal Pucci
Consultant Infrastructure
pascal.pu...@pci-conseil.net
Mobile: 06 51 47 84 98
Office: 02 85 52 41 81
http://www.performance-conseil-informatique.net
News: DataCore partnership - PCI is a DataCore Silver Partner. Very happy to have been delivering storage continuity projects with DataCore since 2008. Thanks to DataCore... read more: <http://www.performance-conseil-informatique.net/2017/06/02/partenaire-datacore/>

Attachment: log-ceph-osd.14.log.gz
Description: GNU Zip compressed data

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
