2017-10-22 17:32:56.031086 7f3acaff5700  1 osd.14 pg_epoch: 72024 pg[37.1c(
v 71593'41657 (60849'38594,71593'41657] local-les=72023 n=13 ec=7037
les/c/f 72023/72023/66447 72022/72022/72022) [14,1,41] r=0 lpr=72022
crt=71593'41657 lcod 0'
0 mlcod 0'0 active+clean] hit_set_trim 37:38000000:.ceph-internal::
hit_set_37.1c_archive_2017-08-31 01%3a03%3a24.697717Z_2017-08-31
01%3a52%3a34.767197Z:head not found
2017-10-22 17:32:56.033936 7f3acaff5700 -1 osd/ReplicatedPG.cc: In function
'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&, unsigned
int)' thread 7f3acaff5700 time 2017-10-22 17:32:56.031105
osd/ReplicatedPG.cc: 11782: FAILED assert(obc)

It appears to be looking for (and failing to find) a hit set object with a
timestamp from August. Does that sound right to you? Evidently an object for
that timestamp no longer exists.

What are the settings for this cache tier?
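
For example (substitute the name of your cache pool for <cache-pool>),
something like this would show the relevant hit set settings:

# ceph osd pool ls detail | grep <cache-pool>
# ceph osd pool get <cache-pool> hit_set_type
# ceph osd pool get <cache-pool> hit_set_count
# ceph osd pool get <cache-pool> hit_set_period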

Could you check your logs for any errors from the 'agent_load_hit_sets'
function?
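
For example (adjust the OSD id / log path to the OSD that crashed):

# grep -i agent_load_hit_sets /var/log/ceph/ceph-osd.14.log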


On Mon, Oct 23, 2017 at 2:41 AM, pascal.pu...@pci-conseil.net <
pascal.pu...@pci-conseil.net> wrote:

> Hello,
>
> Today I ran a lot of read IO with a simple rsync... and again, an OSD
> crashed.
>
> As before, I can't restart the OSD: it keeps crashing. So the OSD is out
> and the cluster is recovering.
>
> I only had time to increase the OSD log level:
>
> # ceph tell osd.14 injectargs --debug-osd 5/5
>
> Log attached:
>
> # grep -B100 -100 objdump /var/log/ceph/ceph-osd.14.log
>
> If I run another read, another OSD will probably crash.
>
> Any idea?
>
> I will probably plan to move the data from the erasure-coded pool to a 3x
> replicated pool. It is becoming unstable without any change on our side.
>
> Regards,
>
> PS: Last Sunday, I lost an RBD header while removing the cache tier... Many
> thanks to http://fnordahl.com/2017/04/17/ceph-rbd-volume-header-recovery/,
> which helped me recreate it and resurrect the RBD disk :)
> On 19/10/2017 at 00:19, Brad Hubbard wrote:
>
> On Wed, Oct 18, 2017 at 11:16 PM, pascal.pu...@pci-conseil.net
> <pascal.pu...@pci-conseil.net> wrote:
>
> Hello,
>
> For 2 weeks, I have occasionally been losing OSDs.
> Here is the trace:
>
>     0> 2017-10-18 05:16:40.873511 7f7c1e497700 -1 osd/ReplicatedPG.cc: In
> function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&,
> unsigned int)' thread 7f7c1e497700 time 2017-10-18 05:16:40.869962
> osd/ReplicatedPG.cc: 11782: FAILED assert(obc)
>
> Can you try to capture a log with debug_osd set to 10 or greater as
> per http://tracker.ceph.com/issues/19185 ?
>
> This will allow us to see the output from the
> PrimaryLogPG::get_object_context() function which may help identify
> the problem.
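>
> For example, something along these lines (osd.N is a placeholder for the
> crashing OSD; remember to lower the level again afterwards, it is verbose):
>
> # ceph tell osd.N injectargs '--debug_osd 10/10'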
>
> Please also check your machines all have the same time zone set and
> their clocks are in sync.
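>
> For example, on each node:
>
> # timedatectl
> # ntpq -p        # or 'chronyc sources' if chrony is used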
>
>
>  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0x55eec15a09e5]
>  2: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext,
> std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x6dd)
> [0x55eec107a52d]
>  3: (ReplicatedPG::hit_set_persist()+0xd7c) [0x55eec107d1bc]
>  4: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1a92)
> [0x55eec109bbe2]
>  5: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> ThreadPool::TPHandle&)+0x747) [0x55eec10588a7]
>  6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>,
> ThreadPool::TPHandle&)+0x41d) [0x55eec0f0bbad]
>  7: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6d)
> [0x55eec0f0bdfd]
>  8: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x77b) [0x55eec0f0f7db]
>  9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x887)
> [0x55eec1590987]
>  10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55eec15928f0]
>  11: (()+0x7e25) [0x7f7c4fd52e25]
>  12: (clone()+0x6d) [0x7f7c4e3dc34d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
> I am using Jewel 10.2.10
>
> I am using an erasure-coded pool (2+1) plus an NVMe cache tier (writeback,
> 3x replicated) serving simple RBD disks.
> (12 SATA OSD disks on each of 4 nodes + 1 NVMe per node = 48 SATA OSDs + 8
> NVMe OSDs, since I split each NVMe in 2.)
> Last week it was only NVMe OSDs that crashed, so I unmapped all the disks,
> destroyed the cache tier and recreated it.
> Since then it worked fine. Today an OSD crashed again, but this time it was
> not an NVMe OSD but a normal (SATA) OSD.
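>
> (For reference, the usual sequence for removing and then recreating a
> writeback tier looks roughly like this; pool names are placeholders:
>
> # ceph osd tier cache-mode <cache-pool> forward
> # rados -p <cache-pool> cache-flush-evict-all
> # ceph osd tier remove-overlay <base-pool>
> # ceph osd tier remove <base-pool> <cache-pool>
>
> # ceph osd tier add <base-pool> <cache-pool>
> # ceph osd tier cache-mode <cache-pool> writeback
> # ceph osd tier set-overlay <base-pool> <cache-pool>
>
> Recent releases may ask for --yes-i-really-mean-it when switching the
> cache-mode to forward.)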
>
> Any idea? What about this 'void ReplicatedPG::hit_set_trim'?
>
> Thanks for your help,
>
> Regards,
>
>
> --
> Performance Conseil Informatique
> Pascal Pucci
> Infrastructure Consultant
> pascal.pu...@pci-conseil.net
> Mobile: 06 51 47 84 98
> Office: 02 85 52 41 81
> http://www.performance-conseil-informatique.net
>



-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
