Re: [ceph-users] OSD turned itself off

Josef Johansson Mon, 16 Feb 2015 13:30:07 -0800

And yeah, it’s the same EIO 5 error.

So OK, the errors don’t show anything useful about the OSD crash.
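
For readers following along, here is a minimal Python sketch of the condition in the assert quoted at the bottom of this thread (`allow_eio || !m_filestore_fail_eio || got != -5`). It is only an illustration of the boolean logic, not Ceph's code: the function name and parameter names are made up for this example, and the mapping of 5 to EIO assumes a Linux-style errno table.

```python
import errno
import os


def read_assert_holds(got: int, allow_eio: bool, fail_eio: bool) -> bool:
    """Hypothetical mirror of the quoted assert condition.

    got       -- return value of the read (negative errno on failure)
    allow_eio -- stands in for the allow_eio argument in the trace
    fail_eio  -- stands in for m_filestore_fail_eio in the trace
    """
    return allow_eio or (not fail_eio) or got != -errno.EIO


if __name__ == "__main__":
    # errno 5 is EIO; the message text depends on the platform.
    print(errno.errorcode[5], "-", os.strerror(5))
    # A read that returns -5 while EIO is neither allowed nor ignored
    # makes the condition false, which is what trips the assert.
    print(read_assert_holds(got=-5, allow_eio=False, fail_eio=True))
    # A successful read (positive byte count) keeps the condition true.
    print(read_assert_holds(got=4096, allow_eio=False, fail_eio=True))
```

In other words, the OSD only aborts when the read came back with EIO and nothing told it to tolerate that.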



> On 16 Feb 2015, at 21:58, Josef Johansson <jo...@oderland.se> wrote:
> 
> Well, I knew it had all the correct information since earlier so gave it a 
> shot :)
> 
> Anyway, I think it may be just a bad controller as well. New enterprise 
> drives shouldn’t be giving read errors this early in deployment tbh.
> 
> Cheers,
> Josef
>> On 16 Feb 2015, at 17:37, Greg Farnum <gfar...@redhat.com> wrote:
>> 
>> Woah, major thread necromancy! :)
>> 
>> On Feb 13, 2015, at 3:03 PM, Josef Johansson <jo...@oderland.se> wrote:
>>> 
>>> Hi,
>>> 
>>> I skimmed the logs again, as we’ve had more of these kinds of errors.
>>> 
>>> I saw a lot of lossy connection errors:
>>> -2567> 2014-11-24 11:49:40.028755 7f6d49367700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.54:0/1011446 pipe(0x19321b80 sd=44 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x110d2b00).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2564> 2014-11-24 11:49:42.000463 7f6d51df1700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.51:0/1015676 pipe(0x22d6000 sd=204 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x16e218c0).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2563> 2014-11-24 11:49:47.704467 7f6d4d1a5700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.52:0/3029106 pipe(0x231f6780 sd=158 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x136bd1e0).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2562> 2014-11-24 11:49:48.180604 7f6d4cb9f700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.52:0/2027138 pipe(0x1657f180 sd=254 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x13273340).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2561> 2014-11-24 11:49:48.808604 7f6d4c498700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.52:0/2023529 pipe(0x12831900 sd=289 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x12401600).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2559> 2014-11-24 11:49:50.128379 7f6d4b88c700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.53:0/1023180 pipe(0x11cb2280 sd=309 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x1280a000).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2558> 2014-11-24 11:49:52.472867 7f6d425eb700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.52:0/3019692 pipe(0x18eb4a00 sd=311 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x10df6b00).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2556> 2014-11-24 11:49:55.100208 7f6d49e72700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.51:0/3021273 pipe(0x1bacf680 sd=353 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x164ae2c0).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2555> 2014-11-24 11:49:55.776568 7f6d49468700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.51:0/3024351 pipe(0x1bacea00 sd=20 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x1887ba20).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2554> 2014-11-24 11:49:57.704437 7f6d49165700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.52:0/1023529 pipe(0x1a32ac80 sd=213 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0xfe93b80).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2553> 2014-11-24 11:49:58.694246 7f6d47549700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.51:0/3017204 pipe(0x102e5b80 sd=370 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0xfb5a000).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2551> 2014-11-24 11:50:00.412242 7f6d4673b700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.52:0/3027138 pipe(0x1b83b400 sd=250 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x12922dc0).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2387> 2014-11-24 11:50:22.761490 7f6d44fa4700  0 -- 
>>> 10.168.7.23:6840/4010217 >> 10.168.7.25:0/27131 pipe(0xfc60c80 sd=300 :6840 
>>> s=0 pgs=0 cs=0 l=1 c=0x1241d080).accept replacing existing (lossy) channel 
>>> (new one lossy=1)
>>> -2300> 2014-11-24 11:50:31.366214 7f6d517eb700  0 -- 
>>> 10.168.7.23:6840/4010217 >> 10.168.7.22:0/15549 pipe(0x193b3180 sd=214 
>>> :6840 s=0 pgs=0 cs=0 l=1 c=0x10ebbe40).accept replacing existing (lossy) 
>>> channel (new one lossy=1)
>>> -2247> 2014-11-24 11:50:37.372934 7f6d4a276700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.51:0/1013890 pipe(0x25d4780 sd=112 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x10666580).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2246> 2014-11-24 11:50:37.738539 7f6d4f6ca700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.51:0/3026502 pipe(0x1338ea00 sd=230 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x123f11e0).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2245> 2014-11-24 11:50:38.390093 7f6d48c60700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.51:0/2026502 pipe(0x16ba7400 sd=276 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x7d4fb80).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2242> 2014-11-24 11:50:40.505458 7f6d3e43a700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.53:0/1012682 pipe(0x12a53180 sd=183 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x10537080).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2198> 2014-11-24 11:51:14.273025 7f6d44ea3700  0 -- 
>>> 10.168.7.23:6865/5010217 >> 10.168.7.25:0/30755 pipe(0x162bb680 sd=327 
>>> :6865 s=0 pgs=0 cs=0 l=1 c=0x16e21600).accept replacing existing (lossy) 
>>> channel (new one lossy=1)
>>> -1881> 2014-11-29 00:45:42.247394 7f6d5c155700  0 -- 10.168.7.23:6819/10217 
>>> submit_message osd_op_reply(949861 
>>> rbd_data.1c56a792eb141f2.0000000000006200 [stat,write 2228224~12288] ondisk 
>>> = 0) v4 remote, 10.168.7.54:0/1025735, failed lossy con, dropping message 
>>> 0x1bc00400
>>> -976> 2015-01-05 07:10:01.763055 7f6d5c155700  0 -- 10.168.7.23:6819/10217 
>>> submit_message osd_op_reply(11034565 
>>> rbd_data.1cc69562eb141f2.00000000000003ce [stat,write 1925120~4096] ondisk 
>>> = 0) v4 remote, 10.168.7.54:0/2007323, failed lossy con, dropping message 
>>> 0x12989400
>>> -855> 2015-01-10 22:01:36.589036 7f6d5b954700  0 -- 10.168.7.23:6819/10217 
>>> submit_message osd_op_reply(727627 
>>> rbd_data.1cc69413d1b58ba.0000000000000055 [stat,write 2289664~4096] ondisk 
>>> = 0) v4 remote, 10.168.7.54:0/1007323, failed lossy con, dropping message 
>>> 0x24f68800
>>> -819> 2015-01-12 05:25:06.229753 7f6d3646c700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.53:0/2019809 pipe(0x1f0e9680 sd=460 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x13090420).accept replacing existing (lossy) channel (new one lossy=1)
>>> -818> 2015-01-12 05:25:06.581703 7f6d37534700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.53:0/1025252 pipe(0x1b67a780 sd=71 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x16311e40).accept replacing existing (lossy) channel (new one lossy=1)
>>> -817> 2015-01-12 05:25:21.342998 7f6d41167700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.53:0/1025579 pipe(0x114e8000 sd=502 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x16310160).accept replacing existing (lossy) channel (new one lossy=1)
>>> -808> 2015-01-12 16:01:35.783534 7f6d5b954700  0 -- 10.168.7.23:6819/10217 
>>> submit_message osd_op_reply(752034 
>>> rbd_data.1cc69413d1b58ba.0000000000000055 [stat,write 2387968~8192] ondisk 
>>> = 0) v4 remote, 10.168.7.54:0/1007323, failed lossy con, dropping message 
>>> 0x1fde9a00
>>> -515> 2015-01-25 18:44:23.303855 7f6d5b954700  0 -- 10.168.7.23:6819/10217 
>>> submit_message osd_op_reply(46402240 
>>> rbd_data.4b8e9b3d1b58ba.0000000000000471 [read 1310720~4096] ondisk = 0) v4 
>>> remote, 10.168.7.51:0/1017204, failed lossy con, dropping message 0x250bce00
>>> -303> 2015-02-02 22:30:03.140599 7f6d5c155700  0 -- 10.168.7.23:6819/10217 
>>> submit_message osd_op_reply(17710313 
>>> rbd_data.1cc69562eb141f2.00000000000003ce [stat,write 4145152~4096] ondisk 
>>> = 0) v4 remote, 10.168.7.54:0/2007323, failed lossy con, dropping message 
>>> 0x1c5d4200
>>> -236> 2015-02-05 15:29:04.945660 7f6d3d357700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.51:0/1026961 pipe(0x1c63e780 sd=203 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x11dc8dc0).accept replacing existing (lossy) channel (new one lossy=1)
>>>  -66> 2015-02-10 20:20:36.673969 7f6d5b954700  0 -- 10.168.7.23:6819/10217 
>>> submit_message osd_op_reply(11088 rbd_data.10b8c82eb141f2.0000000000004459 
>>> [stat,write 749568~8192] ondisk = 0) v4 remote, 10.168.7.55:0/1005630, 
>>> failed lossy con, dropping message 0x138db200
>>> 
>>> Could this have led to the data being erroneous, or is the -5 return code 
>>> just a sign of a broken hard drive?
>>> 
>> 
>> These are the OSDs creating new connections to each other because the 
>> previous ones failed. That's not necessarily a problem (although here it's 
>> probably a symptom of some kind of issue, given the frequency) and cannot 
>> introduce data corruption of any kind.
>> I’m not seeing any -5 return codes as part of that messenger debug output, 
>> so unless you were referring to your EIO from last June I’m not sure what 
>> that’s about? (If you do mean EIOs, yes, they’re still a sign of a broken 
>> hard drive or local FS.)
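
To judge the frequency mentioned above, one could tally the two kinds of messenger lines per peer address. The following is a rough Python sketch, not a Ceph tool: the regular expressions are written against the lines pasted in this thread, they assume each log entry sits on a single line (as in the raw OSD log, not the wrapped email), and the names `tally` and `SAMPLE` are invented for the example.

```python
import re
from collections import Counter

# Patterns derived from the log lines quoted in this thread (an assumption,
# not an official log format description).
ACCEPT_RE = re.compile(
    r">> (\d+\.\d+\.\d+\.\d+):\d+/\d+ pipe\(.*"
    r"accept replacing existing \(lossy\) channel"
)
DROP_RE = re.compile(
    r"remote, (\d+\.\d+\.\d+\.\d+):\d+/\d+, failed lossy con, dropping message"
)

# Three entries from the thread, joined back onto single lines.
SAMPLE = [
    "-2567> 2014-11-24 11:49:40.028755 7f6d49367700  0 -- 10.168.7.23:6819/10217 "
    ">> 10.168.7.54:0/1011446 pipe(0x19321b80 sd=44 :6819 s=0 pgs=0 cs=0 l=1 "
    "c=0x110d2b00).accept replacing existing (lossy) channel (new one lossy=1)",
    "-2564> 2014-11-24 11:49:42.000463 7f6d51df1700  0 -- 10.168.7.23:6819/10217 "
    ">> 10.168.7.51:0/1015676 pipe(0x22d6000 sd=204 :6819 s=0 pgs=0 cs=0 l=1 "
    "c=0x16e218c0).accept replacing existing (lossy) channel (new one lossy=1)",
    "-1881> 2014-11-29 00:45:42.247394 7f6d5c155700  0 -- 10.168.7.23:6819/10217 "
    "submit_message osd_op_reply(949861 rbd_data.1c56a792eb141f2.0000000000006200 "
    "[stat,write 2228224~12288] ondisk = 0) v4 remote, 10.168.7.54:0/1025735, "
    "failed lossy con, dropping message 0x1bc00400",
]


def tally(lines):
    """Count replaced channels and dropped replies per peer IP address."""
    replaced, dropped = Counter(), Counter()
    for line in lines:
        match = ACCEPT_RE.search(line)
        if match:
            replaced[match.group(1)] += 1
            continue
        match = DROP_RE.search(line)
        if match:
            dropped[match.group(1)] += 1
    return replaced, dropped


if __name__ == "__main__":
    replaced, dropped = tally(SAMPLE)
    print("replaced channels:", dict(replaced))
    print("dropped replies:  ", dict(dropped))
```

A peer that dominates either count would be a reasonable place to start looking for a network or host problem, though the counts alone say nothing about data integrity.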
>> 
>>> Cheers,
>>> Josef
>>> 
>>>> On 14 Jun 2014, at 02:38, Josef Johansson <jo...@oderland.se> wrote:
>>>> 
>>>> Thanks for the quick response.
>>>> 
>>>> Cheers,
>>>> Josef
>>>> 
>>>> Gregory Farnum wrote 2014-06-14 02:36:
>>>>> On Fri, Jun 13, 2014 at 5:25 PM, Josef Johansson <jo...@oderland.se> wrote:
>>>>>> Hi Greg,
>>>>>> 
>>>>>> Thanks for the clarification. I believe the OSD was in the middle of a
>>>>>> deep scrub (sorry for not mentioning this straight away), so could it
>>>>>> have been a silent error that surfaced during the scrub?
>>>>> Yeah.
>>>>> 
>>>>>> What's best practice when the store is corrupted like this?
>>>>> Remove the OSD from the cluster, and either reformat the disk or
>>>>> replace as you judge appropriate.
>>>>> -Greg
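
As a rough sketch of what "remove the OSD from the cluster" usually involves, the Python snippet below only builds the command lines and does not run anything. The sequence follows the manual OSD removal steps as I understand them from the Ceph documentation (`ceph osd out`, `ceph osd crush remove`, `ceph auth del`, `ceph osd rm`); check it against the docs for your release before use. The OSD id 12 is a made-up example, since the thread never names the OSD, and `osd_removal_commands` is a hypothetical helper.

```python
from typing import List


def osd_removal_commands(osd_id: int) -> List[List[str]]:
    """Return the CLI invocations for retiring one OSD, in order.

    Nothing is executed here. The OSD daemon is assumed to be stopped
    already (in this thread it had killed itself), and data should have
    finished migrating after the 'out' step before continuing.
    """
    name = f"osd.{osd_id}"
    return [
        ["ceph", "osd", "out", str(osd_id)],       # stop placing data on it
        ["ceph", "osd", "crush", "remove", name],  # drop it from the CRUSH map
        ["ceph", "auth", "del", name],             # remove its auth key
        ["ceph", "osd", "rm", str(osd_id)],        # remove it from the OSD map
    ]


if __name__ == "__main__":
    # 12 is a hypothetical OSD id used only for illustration.
    for command in osd_removal_commands(12):
        print(" ".join(command))
```

Reformatting or replacing the disk, as suggested above, happens after these steps and outside of Ceph.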
>>>>> 
>>>>>> Cheers,
>>>>>> Josef
>>>>>> 
>>>>>> Gregory Farnum wrote 2014-06-14 02:21:
>>>>>> 
>>>>>>> The OSD did a read off of the local filesystem and it got back the EIO
>>>>>>> error code. That means the store got corrupted or something, so it
>>>>>>> killed itself to avoid spreading bad data to the rest of the cluster.
>>>>>>> -Greg
>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Jun 13, 2014 at 5:16 PM, Josef Johansson <jo...@oderland.se> wrote:
>>>>>>>> Hey,
>>>>>>>> 
>>>>>>>> Just examining what happened to an OSD that was just turned off. Data
>>>>>>>> has been moved away from it, so I’m hesitating to turn it back on.
>>>>>>>> 
>>>>>>>> Got the below in the logs, any clues as to what the assert is about?
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Josef
>>>>>>>> 
>>>>>>>> -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t,
>>>>>>>> const hobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread
>>>>>>>> 7fdacb88c700 time 2014-06-11 21:13:54.036982
>>>>>>>> os/FileStore.cc: 2992: FAILED assert(allow_eio ||
>>>>>>>> !m_filestore_fail_eio || got != -5)
>>>>>>>> 
>>>>>>>> ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
>>>>>>>> 1: (FileStore::read(coll_t, hobject_t const&, unsigned long, unsigned
>>>>>>>> long,
>>>>>>>> ceph::buffer::list&, bool)+0x653) [0x8ab6c3]
>>>>>>>> 2: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*,
>>>>>>>> std::vector<OSDOp,
>>>>>>>> std::allocator<OSDOp> >&)+0x350) [0x708230]
>>>>>>>> 3: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x86)
>>>>>>>> [0x713366]
>>>>>>>> 4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x3095)
>>>>>>>> [0x71acb5]
>>>>>>>> 5: (PG::do_request(std::tr1::shared_ptr<OpRequest>,
>>>>>>>> ThreadPool::TPHandle&)+0x3f0) [0x812340]
>>>>>>>> 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
>>>>>>>> std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2ea) 
>>>>>>>> [0x75c80a]
>>>>>>>> 7: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>,
>>>>>>>> ThreadPool::TPHandle&)+0x198) [0x770da8]
>>>>>>>> 8: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>,
>>>>>>>> std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG>
>>>>>>>> >::_void_process(void*, ThreadPool::TPHandle&)+0xae) [0x7a89ce]
>>>>>>>> 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x9b5dea]
>>>>>>>> 10: (ThreadPool::WorkThread::entry()+0x10) [0x9b7040]
>>>>>>>> 11: (()+0x6b50) [0x7fdadffdfb50]
>>>>>>>> 12: (clone()+0x6d) [0x7fdade53b0ed]
>>>>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>>>>>> needed to interpret this.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list
>>>>>>>> ceph-users@lists.ceph.com
>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>> 
>>>> 
