Re: [ceph-users] Some OSD and MDS crash
Can you reproduce with

  debug osd = 20
  debug filestore = 20
  debug ms = 1

?
-Sam

On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU wrote:
> Hi,
>
> I am attaching:
> - osd.20, one of the OSDs that I identified as making other OSDs crash.
> - osd.23, one of the OSDs that crashes when I start osd.20.
> - mds, one of my MDSes.
>
> I truncated the log files because they are too big. Everything is here:
> https://blondeau.users.greyc.fr/cephlog/
>
> Regards
>
> On 30/06/2014 17:35, Gregory Farnum wrote:
>
>> What's the backtrace from the crashing OSDs?
>>
>> Keep in mind that as a dev release, it's generally best not to upgrade
>> to unnamed versions like 0.82 (but it's probably too late to go back
>> now).
>
> I will remember that next time ;)
>
>> -Greg
>>
>> On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU wrote:
>>>
>>> Hi,
>>>
>>> After the upgrade to firefly, I have some PGs stuck in the peering state.
>>> I saw the announcement of 0.82, so I tried upgrading to it to solve my
>>> problem.
>>>
>>> My three MDSes crash, and some OSDs trigger a chain reaction that kills
>>> other OSDs. I think my MDSes will not start because their metadata are
>>> on the OSDs.
>>>
>>> I have 36 OSDs on three servers, and I identified 5 OSDs which make the
>>> others crash. If I do not start them, the cluster goes into a recovery
>>> state with 31 OSDs up, but I still have 378 PGs in down+peering state.
>>>
>>> What can I do? Would you like more information (OS, crash logs, etc.)?
>>>
>>> Regards
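For reference, those debug levels can also be raised at runtime without editing ceph.conf, using the injectargs form that Pierre suggests further down the thread; treat this as a sketch and substitute the relevant osd id:

  # bump logging on a running osd via the tell interface
  ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'

Settings injected this way only last until the daemon restarts, which makes them handy for capturing a single reproduction.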
Re: [ceph-users] Some OSD and MDS crash
You should add

  debug osd = 20
  debug filestore = 20
  debug ms = 1

to the [osd] section of ceph.conf and restart the osds. I'd like
all three logs if possible.

Thanks
-Sam

On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU wrote:
> Yes, but how do I do that?
>
> With a command like this?
>
>   ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
>
> Or by modifying /etc/ceph/ceph.conf? That file is really sparse because I
> use udev detection.
>
> Once I have made these changes, do you want the three log files or only
> osd.20's?
>
> Thank you so much for the help.
>
> Regards
> Pierre
>
> On 01/07/2014 23:51, Samuel Just wrote:
>> Can you reproduce with
>>   debug osd = 20
>>   debug filestore = 20
>>   debug ms = 1
>> ?
>> -Sam
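Concretely, a minimal version of the change Sam describes might look like the snippet below; the restart command follows the "service ceph restart" form used elsewhere in this thread, and the exact init invocation depends on the distribution:

  # /etc/ceph/ceph.conf (on each osd host)
  [osd]
      debug osd = 20
      debug filestore = 20
      debug ms = 1

  # then restart the osds so the new levels take effect
  service ceph restart osd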
Re: [ceph-users] Some OSD and MDS crash
Ok, in current/meta on osd 20 and osd 23, please attach all files matching

^osdmap.13258.*

There should be one such file on each osd (it should look something like
osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory; you'll
want to use find).

What version of ceph is running on your mons? How many mons do you have?
-Sam

On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU wrote:
> Hi,
>
> I did it; the log files are available here:
> https://blondeau.users.greyc.fr/cephlog/debug20/
>
> The OSD log files are really big, +/- 80M each.
>
> After starting osd.20, some other OSDs crashed. I went from 31 OSDs up to
> only 16 up. I noticed that after this the number of down+peering PGs
> decreased from 367 to 248. Is that "normal"? Maybe it's temporary, while
> the cluster verifies all the PGs?
>
> Regards
> Pierre
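For reference, locating those osdmap files with find might look like the following sketch; the /var/lib/ceph/osd/ceph-NN paths assume the default OSD data directory layout:

  # on the host carrying osd.20 (and likewise ceph-23 for osd.23)
  find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap.13258*'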
Re: [ceph-users] Some OSD and MDS crash
Also, what version did you upgrade from, and how did you upgrade?
-Sam
Re: [ceph-users] Some OSD and MDS crash
Joao: this looks like divergent osdmaps; osd 20 and osd 23 have differing
ideas of the acting set for pg 2.11. Did we add hashes to the incremental
maps? What would you want to know from the mons?
-Sam
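As an aside, the monitors' current view of that placement group can be queried directly, which is one way to see which of the two divergent acting sets the mons agree with; this is only a sketch of the obvious check, not something requested in the thread:

  # prints the osdmap epoch plus the up and acting sets as the mons see them
  ceph pg map 2.11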
Re: [ceph-users] Some OSD and MDS crash
Yeah, divergent osdmaps:

555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU wrote:
> The files are attached.
>
> When I upgraded:
>   ceph-deploy install --stable firefly servers...
>   on each server: service ceph restart mon
>   on each server: service ceph restart osd
>   on each server: service ceph restart mds
>
> I upgraded from emperor to firefly. After repair, remap, replace, etc., I
> still had some PGs stuck in the peering state.
>
> I thought, why not try version 0.82, it could solve my problem (that was
> my mistake). So I upgraded from firefly to 0.83 with:
>   ceph-deploy install --testing servers...
>   ..
>
> Now, all daemons are at version 0.82.
> I have 3 mons, 36 OSDs and 3 MDSes.
>
> Pierre
>
> PS: I also find "inc\uosdmap.13258__0_469271DE__none" in each meta
> directory.
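One way to see which of the two divergent copies matches the monitors, assuming the cluster can still answer commands and that "ceph osd getmap" is available on this version, is to pull epoch 13258 from the mons and diff the decoded maps; the /tmp paths are just placeholders:

  # fetch the monitors' copy of osdmap epoch 13258
  ceph osd getmap 13258 -o /tmp/osdmap.13258.mon

  # decode each copy to text and diff them
  osdmaptool --print /tmp/osdmap.13258.mon > /tmp/map.mon.txt
  osdmaptool --print osd-20_osdmap.13258__0_4E62BB79__none > /tmp/map.osd20.txt
  osdmaptool --print osd-23_osdmap.13258__0_4E62BB79__none > /tmp/map.osd23.txt
  diff /tmp/map.osd20.txt /tmp/map.osd23.txt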
Re: [ceph-users] Some OSD and MDS crash
Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i > /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

Looks like the chooseleaf_vary_r tunable somehow ended up divergent?

Pierre: do you recall how and when that got set?
-Sam
Re: [ceph-users] Some OSD and MDS crash
Can you confirm from the admin socket that all monitors are running
the same version?
-Sam

On Wed, Jul 2, 2014 at 4:15 PM, Pierre BLONDEAU wrote:
> On 03/07/2014 00:55, Samuel Just wrote:
>> Looks like the chooseleaf_vary_r tunable somehow ended up divergent?
>>
>> Pierre: do you recall how and when that got set?
>
> I am not sure I understand, but if I remember correctly, after the update
> to firefly I was in the state "HEALTH_WARN crush map has legacy tunables"
> and I saw "feature set mismatch" in the logs.
>
> So, if I remember correctly, I ran "ceph osd crush tunables optimal" to
> deal with the crush map warning, and I updated my client and server
> kernels to 3.16rc.
>
> Could it be that?
>
> Pierre
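As a side note, the tunables currently encoded in the cluster's crush map can be inspected directly; on a firefly-era cluster something like the following should show whether chooseleaf_vary_r is set (a sketch only, the exact fields printed vary by version):

  # dump the tunables the monitors currently have
  ceph osd crush show-tunables

  # or decompile the in-use crush map and look at the tunable lines
  ceph osd getcrushmap -o /tmp/crush.current
  crushtool -d /tmp/crush.current | grep '^tunable'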
Re: [ceph-users] Some OSD and MDS crash
Yes, thanks.
-Sam

On Wed, Jul 2, 2014 at 4:21 PM, Pierre BLONDEAU wrote:
> Like that?
>
> # ceph --admin-daemon /var/run/ceph/ceph-mon.william.asok version
> {"version":"0.82"}
> # ceph --admin-daemon /var/run/ceph/ceph-mon.jack.asok version
> {"version":"0.82"}
> # ceph --admin-daemon /var/run/ceph/ceph-mon.joe.asok version
> {"version":"0.82"}
>
> Pierre
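The same admin-socket check works for the other daemons; a sketch for one of the OSDs, assuming the default /var/run/ceph socket naming, would be:

  # on the host carrying osd.20
  ceph --admin-daemon /var/run/ceph/ceph-osd.20.asok version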
Re: [ceph-users] scrub error on firefly
Can you attach your ceph.conf for your osds?
-Sam

On Thu, Jul 10, 2014 at 8:01 AM, Christian Eichelmann wrote:
> I can also confirm that after upgrading to firefly, both of our clusters
> (test and live) went from 0 scrub errors each for about 6 months to about
> 9-12 per week...
> This also makes me kind of nervous, since as far as I know all "ceph pg
> repair" does is copy the primary object to all replicas, no matter which
> object is the correct one.
> Of course the described method of manual checking works (for pools with
> more than 2 replicas), but doing this in a large cluster nearly every week
> is horribly time-consuming and error prone.
> It would be great to get an explanation for the increased number of scrub
> errors since firefly. Were they just not detected correctly in previous
> versions? Or is there maybe something wrong with the new code?
>
> Actually, our company is currently preventing our projects from moving to
> ceph because of this problem.
>
> Regards,
> Christian
>
> On Thursday, July 10, 2014 16:24, Travis Rhoden wrote:
>
> And actually, just to follow up, it does seem like there are some
> additional smarts beyond just using the primary to overwrite the
> secondaries... Since I captured md5 sums before and after the repair, I
> can say that in this particular instance, the secondary copy was used to
> overwrite the primary. So, I'm just trusting Ceph to do the right thing,
> and so far it seems to, but the comments here about needing to determine
> the correct object and place it on the primary PG make me wonder if I've
> been missing something.
>
> - Travis
>
> On Thu, Jul 10, 2014 at 10:19 AM, Travis Rhoden wrote:
>> I can also say that after a recent upgrade to Firefly, I have experienced
>> a massive uptick in scrub errors. The cluster was on cuttlefish for about
>> a year and had maybe one or two scrub errors. After upgrading to Firefly,
>> we've probably seen 3 to 4 dozen in the last month or so (we were getting
>> 2-3 a day for a few weeks until the whole cluster was rescrubbed, it
>> seemed).
>>
>> What I cannot determine, however, is how to know which object is busted.
>> For example, just today I ran into a scrub error. The object has two
>> copies and is an 8MB piece of an RBD, and has identical timestamps and
>> identical xattr names and values. But it definitely has a different MD5
>> sum. How to know which one is correct?
>>
>> I've been just kicking off pg repair each time, which seems to just use
>> the primary copy to overwrite the others. I haven't run into any issues
>> with that so far, but it does make me nervous.
>>
>> - Travis
>>
>> On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum wrote:
>>> It's not very intuitive or easy to look at right now (there are plans
>>> from the recent developer summit to improve things), but the central
>>> log should have output about exactly what objects are busted. You'll
>>> then want to compare the copies manually to determine which ones are
>>> good or bad, get the good copy on the primary (make sure you preserve
>>> xattrs), and run repair.
>>> -Greg
>>>
>>> On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith wrote:
>>> > Greetings,
>>> >
>>> > I upgraded to firefly last week and I suddenly received this error:
>>> >
>>> >   health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>>> >
>>> > ceph health detail shows the following:
>>> >
>>> >   HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>>> >   pg 3.c6 is active+clean+inconsistent, acting [2,5]
>>> >   1 scrub errors
>>> >
>>> > The docs say that I can run `ceph pg repair 3.c6` to fix this. What I
>>> > want to know is what are the risks of data loss if I run that command
>>> > in this state, and how can I mitigate them?
>>> >
>>> > Randall Smith
>>> > Adams State University
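For readers wondering what "compare the copies manually" looks like in practice, a rough sketch for Randy's pg 3.c6 follows; it assumes the default OSD data paths and the acting set [2,5] from the health output, and the object name pattern is taken from the scrub errors reported later in this thread:

  # on each host in the acting set ([2,5] here), locate the object reported by scrub
  find /var/lib/ceph/osd/ceph-2/current/3.c6_head -name 'rb.0.b0ce3*'

  # then, on each host, checksum the replica and list its xattrs for comparison
  md5sum <path-to-object-file>
  attr -l <path-to-object-file>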
Re: [ceph-users] scrub error on firefly
Repair, I think, will tend to choose the copy with the lowest osd number
which is not obviously corrupted. Even with three replicas, it does not do
any kind of voting at this time.
-Sam

On Thu, Jul 10, 2014 at 10:39 AM, Chahal, Sudip wrote:
> I have a basic related question re: Firefly operation - would appreciate
> any insights:
>
> With three replicas, if checksum inconsistencies across replicas are found
> during deep-scrub, then:
>   a. does the majority win, or is the primary always the winner and used
>      to overwrite the secondaries?
>   b. is this reconciliation done automatically during deep-scrub, or does
>      each reconciliation have to be executed manually by the administrator?
>
> With 2 replicas, how are things different (if at all):
>   a. The primary is declared the winner - correct?
>   b. is this reconciliation done automatically during deep-scrub, or does
>      it have to be done "manually" because there is no majority?
>
> Thanks,
>
> -Sudip
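For completeness: in the workflow discussed here, deep-scrub only detects and reports the inconsistency, and the repair itself is kicked off by the operator per placement group, along these lines (pg id taken from this thread):

  # re-run a deep scrub on the pg to re-check it
  ceph pg deep-scrub 3.c6

  # explicitly ask the primary to repair the reported inconsistency
  ceph pg repair 3.c6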
Re: [ceph-users] scrub error on firefly
It could be an indication of a problem on osd 5, but the timing is
worrying. Can you attach your ceph.conf? Have there been any osds going
down, new osds added, anything to cause recovery? Anything in dmesg to
indicate an fs problem? Have you recently changed any settings?
-Sam

On Thu, Jul 10, 2014 at 2:58 PM, Randy Smith wrote:
> Greetings,
>
> Just a follow-up on my original issue. `ceph pg repair ...` fixed the
> problem. However, today I got another inconsistent pg. It's interesting to
> me that this second error is in the same rbd image and appears to be
> "close" to the previously inconsistent pg. (Even more fun, osd.5 was the
> secondary in the first error and is the primary here, though the other osd
> is different.)
>
> Is this indicative of a problem on osd.5, or perhaps a clue into what's
> causing firefly to be so inconsistent?
>
> The relevant log entries are below.
>
> 2014-07-07 18:50:48.646407 osd.2 192.168.253.70:6801/56987 163 : [ERR] 3.c6
> shard 2: soid 34dc35c6/rb.0.b0ce3.238e1f29.000b/head//3 digest
> 2256074002 != known digest 3998068918
> 2014-07-07 18:51:36.936076 osd.2 192.168.253.70:6801/56987 164 : [ERR] 3.c6
> deep-scrub 0 missing, 1 inconsistent objects
> 2014-07-07 18:51:36.936082 osd.2 192.168.253.70:6801/56987 165 : [ERR] 3.c6
> deep-scrub 1 errors
>
> 2014-07-10 15:38:53.990328 osd.5 192.168.253.81:6800/10013 257 : [ERR] 3.41
> shard 1: soid e183cc41/rb.0.b0ce3.238e1f29.024c/head//3 digest
> 3224286363 != known digest 3409342281
> 2014-07-10 15:39:11.701276 osd.5 192.168.253.81:6800/10013 258 : [ERR] 3.41
> deep-scrub 0 missing, 1 inconsistent objects
> 2014-07-10 15:39:11.701281 osd.5 192.168.253.81:6800/10013 259 : [ERR] 3.41
> deep-scrub 1 errors
>
> On Thu, Jul 10, 2014 at 12:05 PM, Chahal, Sudip wrote:
>> Thanks - so it appears that the advantage of the 3rd replica (relative to
>> 2 replicas) has much more to do with recovering from two concurrent OSD
>> failures than with inconsistencies found during deep scrub - would you
>> agree?
>>
>> Re: repair - do you mean the "repair" process during deep scrub - if yes,
>> this is automatic - correct?
>> Or
>> Are you referring to the explicit manually initiated repair commands?
>>
>> Thanks,
>>
>> -Sudip
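A quick way to act on Sam's dmesg question on the suspect host might be the following; this is only a sketch, the grep pattern is illustrative, the device name is a placeholder for the disk backing osd.5, and smartctl assumes smartmontools is installed:

  # look for filesystem or block-layer complaints
  dmesg | grep -iE 'xfs|i/o error|sector'

  # check the drive's own error counters
  smartctl -a /dev/sdX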
Re: [ceph-users] scrub error on firefly
When you get the next inconsistency, can you copy the actual objects
from the osd store trees and get them to us? That might provide a clue.
-Sam

On Fri, Jul 11, 2014 at 6:52 AM, Randy Smith wrote:
> On Thu, Jul 10, 2014 at 4:40 PM, Samuel Just wrote:
>> It could be an indication of a problem on osd 5, but the timing is
>> worrying. Can you attach your ceph.conf?
>
> Attached.
>
>> Have there been any osds going down, new osds added, anything to cause
>> recovery?
>
> I upgraded to firefly last week. As part of the upgrade I, obviously, had
> to restart every osd. Also, I attempted to switch to the optimal tunables,
> but doing so degraded 27% of my cluster and made most of my VMs
> unresponsive. I switched back to the legacy tunables and everything was
> happy again. Both of those operations, of course, caused recoveries. I
> have made no changes since then.
>
>> Anything in dmesg to indicate an fs problem?
>
> Nothing. The system went inconsistent again this morning, again on the
> same rbd image but different osds this time.
>
> 2014-07-11 05:48:12.857657 osd.1 192.168.253.77:6801/12608 904 : [ERR] 3.76
> shard 1: soid 1280076/rb.0.b0ce3.238e1f29.025c/head//3 digest
> 2198242284 != known digest 3879754377
> 2014-07-11 05:49:29.020024 osd.1 192.168.253.77:6801/12608 905 : [ERR] 3.76
> deep-scrub 0 missing, 1 inconsistent objects
> 2014-07-11 05:49:29.020029 osd.1 192.168.253.77:6801/12608 906 : [ERR] 3.76
> deep-scrub 1 errors
>
> $ ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> pg 3.76 is active+clean+inconsistent, acting [1,2]
> 1 scrub errors
>
>> Have you recently changed any settings?
>
> I upgraded from bobtail to dumpling to firefly.
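A sketch of what "copy the actual objects from the osd store trees" could look like for pg 3.76, assuming the default data paths and the acting set [1,2] from Randy's output (the hashed subdirectory and exact file name will differ):

  # on the host carrying osd.1 (repeat on osd.2's host with ceph-2)
  find /var/lib/ceph/osd/ceph-1/current/3.76_head -name 'rb.0.b0ce3*'

  # copy the replica somewhere safe for comparison, leaving the original in place
  cp -a <found-path> /tmp/osd-1_rb.0.b0ce3.238e1f29.025c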
Re: [ceph-users] scrub error on firefly
Also, what filesystem are you using?
-Sam

On Fri, Jul 11, 2014 at 10:37 AM, Sage Weil wrote:
> One other thing we might also try is catching this earlier (on first read
> of corrupt data) instead of waiting for scrub. If you are not super
> performance sensitive, you can add
>
>   filestore sloppy crc = true
>   filestore sloppy crc block size = 524288
>
> That will track and verify CRCs on any large (>512k) writes. Smaller
> block sizes will give more precision and more checks, but will generate
> larger xattrs and have a bigger impact on performance...
>
> sage
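Presumably these options would sit alongside the other osd settings discussed earlier in the thread; a sketch of the ceph.conf change Sage describes, followed by an osd restart, might be:

  # /etc/ceph/ceph.conf
  [osd]
      filestore sloppy crc = true
      filestore sloppy crc block size = 524288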
>> >> > >> >> > 2014-07-07 18:50:48.646407 osd.2 192.168.253.70:6801/56987 163 : [ERR] >> >> > 3.c6 >> >> > shard 2: soid 34dc35c6/rb.0.b0ce3.238e1f29.000b/head//3 digest >> >> > 2256074002 != known digest 3998068918 >> >> > 2014-07-07 18:51:36.936076 osd.2 192.168.253.70:6801/56987 164 : [ERR] >> >> > 3.c6 >> >> > deep-scrub 0 missing, 1 inconsistent objects >> >> > 2014-07-07 18:51:36.936082 osd.2 192.168.253.70:6801/56987 165 : [ERR] >> >> > 3.c6 >> >> > deep-scrub 1 errors >> >> > >> >> > >> >> > 2014-07-10 15:38:53.990328 osd.5 192.168.253.81:6800/100
Re: [ceph-users] scrub error on firefly
Right.
-Sam

On Fri, Jul 11, 2014 at 2:05 PM, Randy Smith wrote:
> Greetings,
>
> I'm using xfs.
>
> Also, when, in a previous email, you asked if I could send the object, do
> you mean the files from each server named something like this:
> ./3.c6_head/DIR_6/DIR_C/DIR_5/rb.0.b0ce3.238e1f29.000b__head_34DC35C6__3
> ?
Re: [ceph-users] scrub error on firefly
And grab the xattrs as well.
-Sam
Re: [ceph-users] scrub error on firefly
When you see another one, can you include the xattrs on the files as well (you can use the attr(1) utility)? -Sam On Sat, Jul 12, 2014 at 9:51 AM, Randy Smith wrote: > That image is the root file system for a linux ldap server. > > -- > Randall Smith > Adams State University > www.adams.edu > 719-587-7741 > > On Jul 12, 2014 10:34 AM, "Samuel Just" wrote: >> >> Here's a diff of the two files. One of the two files appears to >> contain ceph leveldb keys? Randy, do you have an idea of what this >> rbd image is being used for (rb.0.b0ce3.238e1f29, that is). >> -Sam >> >> On Fri, Jul 11, 2014 at 7:25 PM, Randy Smith wrote: >> > Greetings, >> > >> > Well it happened again with two pgs this time, still in the same rbd >> > image. >> > They are at http://people.adams.edu/~rbsmith/osd.tar. I think I grabbed >> > the >> > files correctly. If not, let me know and I'll try again on the next >> > failure. >> > It certainly is happening often enough. >> > >> > >> > On Fri, Jul 11, 2014 at 3:39 PM, Samuel Just >> > wrote: >> >> >> >> And grab the xattrs as well. >> >> -Sam >> >> >> >> On Fri, Jul 11, 2014 at 2:39 PM, Samuel Just >> >> wrote: >> >> > Right. >> >> > -Sam >> >> > >> >> > On Fri, Jul 11, 2014 at 2:05 PM, Randy Smith >> >> > wrote: >> >> >> Greetings, >> >> >> >> >> >> I'm using xfs. >> >> >> >> >> >> Also, when, in a previous email, you asked if I could send the >> >> >> object, >> >> >> do >> >> >> you mean the files from each server named something like this: >> >> >> >> >> >> >> >> >> ./3.c6_head/DIR_6/DIR_C/DIR_5/rb.0.b0ce3.238e1f29.000b__head_34DC35C6__3 >> >> >> ? >> >> >> >> >> >> >> >> >> On Fri, Jul 11, 2014 at 2:00 PM, Samuel Just >> >> >> wrote: >> >> >>> >> >> >>> Also, what filesystem are you using? >> >> >>> -Sam >> >> >>> >> >> >>> On Fri, Jul 11, 2014 at 10:37 AM, Sage Weil >> >> >>> wrote: >> >> >>> > One other thing we might also try is catching this earlier (on >> >> >>> > first >> >> >>> > read >> >> >>> > of corrupt data) instead of waiting for scrub. If you are not >> >> >>> > super >> >> >>> > performance sensitive, you can add >> >> >>> > >> >> >>> > filestore sloppy crc = true >> >> >>> > filestore sloppy crc block size = 524288 >> >> >>> > >> >> >>> > That will track and verify CRCs on any large (>512k) writes. >> >> >>> > Smaller >> >> >>> > block sizes will give more precision and more checks, but will >> >> >>> > generate >> >> >>> > larger xattrs and have a bigger impact on performance... >> >> >>> > >> >> >>> > sage >> >> >>> > >> >> >>> > >> >> >>> > On Fri, 11 Jul 2014, Samuel Just wrote: >> >> >>> > >> >> >>> >> When you get the next inconsistency, can you copy the actual >> >> >>> >> objects >> >> >>> >> from the osd store trees and get them to us? That might provide >> >> >>> >> a >> >> >>> >> clue. >> >> >>> >> -Sam >> >> >>> >> >> >> >>> >> On Fri, Jul 11, 2014 at 6:52 AM, Randy Smith >> >> >>> >> wrote: >> >> >>> >> > >> >> >>> >> > >> >> >>> >> > >> >> >>> >> > On Thu, Jul 10, 2014 at 4:40 PM, Samuel Just >> >> >>> >> > >> >> >>> >> > wrote: >> >> >>> >> >> >> >> >>> >> >> It could be an indication of a problem on osd 5, but the >> >> >>> >> >> timing >> >> >
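A rough sketch of how the object copy and its xattrs could be gathered on each OSD that holds the inconsistent PG, reusing the example path from earlier in the thread (the osd data directory, PG and filenames will differ per OSD, so treat the exact paths as placeholders):

    cd /var/lib/ceph/osd/ceph-1/current
    attr -l 3.c6_head/DIR_6/DIR_C/DIR_5/rb.0.b0ce3.238e1f29.000b__head_34DC35C6__3
    getfattr -d -m- -e hex 3.c6_head/DIR_6/DIR_C/DIR_5/rb.0.b0ce3.238e1f29.000b__head_34DC35C6__3 > xattrs.osd1.txt
    tar czf pg-object-osd1.tgz 3.c6_head/DIR_6/DIR_C/DIR_5/rb.0.b0ce3.238e1f29.000b__head_34DC35C6__3 xattrs.osd1.txt

attr -l lists the attribute names, while getfattr -d -e hex dumps the values in a form that survives being mailed around.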
Re: [ceph-users] OSD is crashing while running admin socket
That seems reasonable. Bug away! -Sam On Mon, Sep 8, 2014 at 5:11 PM, Somnath Roy wrote: > Hi Sage/Sam, > > > > I faced a crash in OSD with latest Ceph master. Here is the log trace for > the same. > > > > ceph version 0.85-677-gd5777c4 (d5777c421548e7f039bb2c77cb0df2e9c7404723) > > 1: ceph-osd() [0x990def] > > 2: (()+0xfbb0) [0x7f72ae6e6bb0] > > 3: (gsignal()+0x37) [0x7f72acc08f77] > > 4: (abort()+0x148) [0x7f72acc0c5e8] > > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f72ad5146e5] > > 6: (()+0x5e856) [0x7f72ad512856] > > 7: (()+0x5e883) [0x7f72ad512883] > > 8: (()+0x5eaae) [0x7f72ad512aae] > > 9: (ceph::buffer::list::substr_of(ceph::buffer::list const&, unsigned int, > unsigned int)+0x277) [0xa88747] > > 10: (ceph::buffer::list::write(int, int, std::ostream&) const+0x81) > [0xa89541] > > 11: (operator<<(std::ostream&, OSDOp const&)+0x1f6) [0x717a16] > > 12: (MOSDOp::print(std::ostream&) const+0x172) [0x6e5e32] > > 13: (TrackedOp::dump(utime_t, ceph::Formatter*) const+0x223) [0x6b6483] > > 14: (OpTracker::dump_ops_in_flight(ceph::Formatter*)+0xa7) [0x6b7057] > > 15: (OSD::asok_command(std::string, std::map boost::variant std::allocator >, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_>, > std::less, std::allocator boost::variant std::allocator >, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_> > > >&, > std::string, std::ostream&)+0x1d7) [0x612cb7] > > 16: (OSDSocketHook::call(std::string, std::map boost::variant std::allocator >, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_>, > std::less, std::allocator boost::variant std::allocator >, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_, > boost::detail::variant::void_, boost::detail::variant::void_> > > >&, > std::string, ceph::buffer::list&)+0x67) [0x67c8b7] > > 17: (AdminSocket::do_accept()+0x1007) [0xa79817] > > 18: (AdminSocket::entry()+0x258) [0xa7b448] > > 19: (()+0x7f6e) [0x7f72ae6def6e] > > 
20: (clone()+0x6d) [0x7f72a9cd] > > NOTE: a copy of the executable, or `objdump -rdS ` is needed to > interpret this. > > Steps to reproduce: > > --- > > 1. Run client IOs. > > 2. While the IOs are running, run the following command continuously. > > “ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight” > > 3. At some point the OSD will crash. > > I think I have root-caused it: > > 1. OpTracker::RemoveOnDelete::operator() is calling > op->_unregistered(), which clears out message->data() and the payload. > > 2. After that, if optracking is enabled, we call > unregister_inflight_op(), which removes the op from the xlist. > > 3. Now, while dumping ops, we call > _dump_op_descriptor_unlocked() from TrackedOP::dump, which tries to print > the message. > > 4. So, there is a race condition when it tries to print a message > whose ops (data) field has already been cleared. > > The fix could be to call op->_unregistered() (in case optracking is enabled) > only after the op has been removed from the xlist. > > With this fix, I am not getting the crash anymore. > > If my observation is correct, please let me know. I will raise a bug and > will fix it as part of the overall optracker performance improvement (I > will submit that pull request soon). > > Thanks & Regards > > Somnath > > > PLEASE NOTE: T
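A minimal way to reproduce along the lines of the steps above (a sketch only; the pool name and admin socket path are assumptions that need adjusting to the local setup):

    # terminal 1: generate client IO, e.g. against the rbd pool
    rados bench -p rbd 300 write
    # terminal 2: hit the admin socket of one OSD in a tight loop
    while true; do ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight; done

On an affected build with op tracking enabled, this combination may eventually hit the same race.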
Re: [ceph-users] OpTracker optimization
Added a comment about the approach. -Sam On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy wrote: > Hi Sam/Sage, > > As we discussed earlier, enabling the present OpTracker code degrades > performance severely. For example, in my setup a single OSD node with 10 > clients is reaching ~103K read iops with io served from memory while > optracking is disabled, but with optracker enabled it is reduced to ~39K iops. > Running the OSD without OpTracker enabled is probably not an option for many > Ceph users. > > Now, by sharding the Optracker:: ops_in_flight_lock (thus the xlist > ops_in_flight) and removing some other bottlenecks, I am able to match the > performance of an OpTracking-enabled OSD with OpTracking disabled, but at the > expense of ~1 extra cpu core. > > In this process I have also fixed the following tracker. > > > > http://tracker.ceph.com/issues/9384 > > > > and probably http://tracker.ceph.com/issues/8885 too. > > > > I have created the following pull request for the same. Please review it. > > > > https://github.com/ceph/ceph/pull/2440 > > > > Thanks & Regards > > Somnath > > > > > > > PLEASE NOTE: The information contained in this electronic mail message is > intended only for the use of the designated recipient(s) named above. If the > reader of this message is not the intended recipient, you are hereby > notified that you have received this message in error and that any review, > dissemination, distribution, or copying of this message is strictly > prohibited. If you have received this communication in error, please notify > the sender by telephone or e-mail (as shown above) immediately and destroy > any and all copies of this message in your possession (whether hard copies > or electronically stored copies). > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
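For anyone who wants to reproduce the comparison Somnath describes, op tracking can be switched off for a benchmark run. Assuming the build has the osd_enable_op_tracker option, the safest way is to set it in ceph.conf and restart the OSDs rather than rely on runtime injection:

    [osd]
        osd enable op tracker = false

Comparing iops with the option at true versus false gives a rough measure of the overhead the sharded-lock work is trying to remove.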
Re: [ceph-users] OpTracker optimization
Responded with cosmetic nonsense. Once you've got that and the other comments addressed, I can put it in wip-sam-testing. -Sam On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy wrote: > Thanks Sam..I responded back :-) > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just > Sent: Wednesday, September 10, 2014 11:17 AM > To: Somnath Roy > Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; > ceph-users@lists.ceph.com > Subject: Re: OpTracker optimization > > Added a comment about the approach. > -Sam > > On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy wrote: >> Hi Sam/Sage, >> >> As we discussed earlier, enabling the present OpTracker code degrading >> performance severely. For example, in my setup a single OSD node with >> 10 clients is reaching ~103K read iops with io served from memory >> while optracking is disabled but enabling optracker it is reduced to ~39K >> iops. >> Probably, running OSD without enabling OpTracker is not an option for >> many of Ceph users. >> >> Now, by sharding the Optracker:: ops_in_flight_lock (thus xlist >> ops_in_flight) and removing some other bottlenecks I am able to match >> the performance of OpTracking enabled OSD with OpTracking disabled, >> but with the expense of ~1 extra cpu core. >> >> In this process I have also fixed the following tracker. >> >> >> >> http://tracker.ceph.com/issues/9384 >> >> >> >> and probably http://tracker.ceph.com/issues/8885 too. >> >> >> >> I have created following pull request for the same. Please review it. >> >> >> >> https://github.com/ceph/ceph/pull/2440 >> >> >> >> Thanks & Regards >> >> Somnath >> >> >> >> >> >> >> PLEASE NOTE: The information contained in this electronic mail message >> is intended only for the use of the designated recipient(s) named >> above. If the reader of this message is not the intended recipient, >> you are hereby notified that you have received this message in error >> and that any review, dissemination, distribution, or copying of this >> message is strictly prohibited. If you have received this >> communication in error, please notify the sender by telephone or >> e-mail (as shown above) immediately and destroy any and all copies of >> this message in your possession (whether hard copies or electronically >> stored copies). >> > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the > body of a message to majord...@vger.kernel.org More majordomo info at > http://vger.kernel.org/majordomo-info.html > > > > PLEASE NOTE: The information contained in this electronic mail message is > intended only for the use of the designated recipient(s) named above. If the > reader of this message is not the intended recipient, you are hereby notified > that you have received this message in error and that any review, > dissemination, distribution, or copying of this message is strictly > prohibited. If you have received this communication in error, please notify > the sender by telephone or e-mail (as shown above) immediately and destroy > any and all copies of this message in your possession (whether hard copies or > electronically stored copies). > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OpTracker optimization
I don't quite understand. -Sam On Wed, Sep 10, 2014 at 2:38 PM, Somnath Roy wrote: > Thanks Sam. > So, you want me to go with optracker/shadedopWq , right ? > > Regards > Somnath > > -Original Message- > From: Samuel Just [mailto:sam.j...@inktank.com] > Sent: Wednesday, September 10, 2014 2:36 PM > To: Somnath Roy > Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; > ceph-users@lists.ceph.com > Subject: Re: OpTracker optimization > > Responded with cosmetic nonsense. Once you've got that and the other > comments addressed, I can put it in wip-sam-testing. > -Sam > > On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy wrote: >> Thanks Sam..I responded back :-) >> >> -Original Message- >> From: ceph-devel-ow...@vger.kernel.org >> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just >> Sent: Wednesday, September 10, 2014 11:17 AM >> To: Somnath Roy >> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; >> ceph-users@lists.ceph.com >> Subject: Re: OpTracker optimization >> >> Added a comment about the approach. >> -Sam >> >> On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy wrote: >>> Hi Sam/Sage, >>> >>> As we discussed earlier, enabling the present OpTracker code >>> degrading performance severely. For example, in my setup a single OSD >>> node with >>> 10 clients is reaching ~103K read iops with io served from memory >>> while optracking is disabled but enabling optracker it is reduced to ~39K >>> iops. >>> Probably, running OSD without enabling OpTracker is not an option for >>> many of Ceph users. >>> >>> Now, by sharding the Optracker:: ops_in_flight_lock (thus xlist >>> ops_in_flight) and removing some other bottlenecks I am able to match >>> the performance of OpTracking enabled OSD with OpTracking disabled, >>> but with the expense of ~1 extra cpu core. >>> >>> In this process I have also fixed the following tracker. >>> >>> >>> >>> http://tracker.ceph.com/issues/9384 >>> >>> >>> >>> and probably http://tracker.ceph.com/issues/8885 too. >>> >>> >>> >>> I have created following pull request for the same. Please review it. >>> >>> >>> >>> https://github.com/ceph/ceph/pull/2440 >>> >>> >>> >>> Thanks & Regards >>> >>> Somnath >>> >>> >>> >>> >>> >>> >>> PLEASE NOTE: The information contained in this electronic mail >>> message is intended only for the use of the designated recipient(s) >>> named above. If the reader of this message is not the intended >>> recipient, you are hereby notified that you have received this >>> message in error and that any review, dissemination, distribution, or >>> copying of this message is strictly prohibited. If you have received >>> this communication in error, please notify the sender by telephone or >>> e-mail (as shown above) immediately and destroy any and all copies of >>> this message in your possession (whether hard copies or electronically >>> stored copies). >>> >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" >> in the body of a message to majord...@vger.kernel.org More majordomo >> info at http://vger.kernel.org/majordomo-info.html >> >> >> >> PLEASE NOTE: The information contained in this electronic mail message is >> intended only for the use of the designated recipient(s) named above. If the >> reader of this message is not the intended recipient, you are hereby >> notified that you have received this message in error and that any review, >> dissemination, distribution, or copying of this message is strictly >> prohibited. 
If you have received this communication in error, please notify >> the sender by telephone or e-mail (as shown above) immediately and destroy >> any and all copies of this message in your possession (whether hard copies >> or electronically stored copies). >> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OpTracker optimization
Oh, I changed my mind, your approach is fine. I was unclear. Currently, I just need you to address the other comments. -Sam On Wed, Sep 10, 2014 at 3:13 PM, Somnath Roy wrote: > As I understand, you want me to implement the following. > > 1. Keep this implementation one sharded optracker for the ios going through > ms_dispatch path. > > 2. Additionally, for ios going through ms_fast_dispatch, you want me to > implement optracker (without internal shard) per opwq shard > > Am I right ? > > Thanks & Regards > Somnath > > -Original Message- > From: Samuel Just [mailto:sam.j...@inktank.com] > Sent: Wednesday, September 10, 2014 3:08 PM > To: Somnath Roy > Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; > ceph-users@lists.ceph.com > Subject: Re: OpTracker optimization > > I don't quite understand. > -Sam > > On Wed, Sep 10, 2014 at 2:38 PM, Somnath Roy wrote: >> Thanks Sam. >> So, you want me to go with optracker/shadedopWq , right ? >> >> Regards >> Somnath >> >> -Original Message- >> From: Samuel Just [mailto:sam.j...@inktank.com] >> Sent: Wednesday, September 10, 2014 2:36 PM >> To: Somnath Roy >> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; >> ceph-users@lists.ceph.com >> Subject: Re: OpTracker optimization >> >> Responded with cosmetic nonsense. Once you've got that and the other >> comments addressed, I can put it in wip-sam-testing. >> -Sam >> >> On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy wrote: >>> Thanks Sam..I responded back :-) >>> >>> -Original Message- >>> From: ceph-devel-ow...@vger.kernel.org >>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just >>> Sent: Wednesday, September 10, 2014 11:17 AM >>> To: Somnath Roy >>> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; >>> ceph-users@lists.ceph.com >>> Subject: Re: OpTracker optimization >>> >>> Added a comment about the approach. >>> -Sam >>> >>> On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy wrote: >>>> Hi Sam/Sage, >>>> >>>> As we discussed earlier, enabling the present OpTracker code >>>> degrading performance severely. For example, in my setup a single >>>> OSD node with >>>> 10 clients is reaching ~103K read iops with io served from memory >>>> while optracking is disabled but enabling optracker it is reduced to ~39K >>>> iops. >>>> Probably, running OSD without enabling OpTracker is not an option >>>> for many of Ceph users. >>>> >>>> Now, by sharding the Optracker:: ops_in_flight_lock (thus xlist >>>> ops_in_flight) and removing some other bottlenecks I am able to >>>> match the performance of OpTracking enabled OSD with OpTracking >>>> disabled, but with the expense of ~1 extra cpu core. >>>> >>>> In this process I have also fixed the following tracker. >>>> >>>> >>>> >>>> http://tracker.ceph.com/issues/9384 >>>> >>>> >>>> >>>> and probably http://tracker.ceph.com/issues/8885 too. >>>> >>>> >>>> >>>> I have created following pull request for the same. Please review it. >>>> >>>> >>>> >>>> https://github.com/ceph/ceph/pull/2440 >>>> >>>> >>>> >>>> Thanks & Regards >>>> >>>> Somnath >>>> >>>> >>>> >>>> >>>> >>>> >>>> PLEASE NOTE: The information contained in this electronic mail >>>> message is intended only for the use of the designated recipient(s) >>>> named above. If the reader of this message is not the intended >>>> recipient, you are hereby notified that you have received this >>>> message in error and that any review, dissemination, distribution, >>>> or copying of this message is strictly prohibited. 
If you have >>>> received this communication in error, please notify the sender by >>>> telephone or e-mail (as shown above) immediately and destroy any and >>>> all copies of this message in your possession (whether hard copies or >>>> electronically stored copies). >>>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" >>> in the body of a message to majord...@vger.kernel.org More majordomo >>> info at http://vger.kernel.org/majordomo-info.html >>> >>> >>> >>> PLEASE NOTE: The information contained in this electronic mail message is >>> intended only for the use of the designated recipient(s) named above. If >>> the reader of this message is not the intended recipient, you are hereby >>> notified that you have received this message in error and that any review, >>> dissemination, distribution, or copying of this message is strictly >>> prohibited. If you have received this communication in error, please notify >>> the sender by telephone or e-mail (as shown above) immediately and destroy >>> any and all copies of this message in your possession (whether hard copies >>> or electronically stored copies). >>> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OpTracker optimization
Just added it to wip-sam-testing. -Sam On Thu, Sep 11, 2014 at 11:30 AM, Somnath Roy wrote: > Sam/Sage, > I have addressed all of your comments and pushed the changes to the same pull > request. > > https://github.com/ceph/ceph/pull/2440 > > Thanks & Regards > Somnath > > -Original Message- > From: Sage Weil [mailto:sw...@redhat.com] > Sent: Wednesday, September 10, 2014 8:33 PM > To: Somnath Roy > Cc: Samuel Just; ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com > Subject: RE: OpTracker optimization > > I had two substantiative comments on the first patch and then some trivial > whitespace nits.Otherwise looks good! > > tahnks- > sage > > On Thu, 11 Sep 2014, Somnath Roy wrote: > >> Sam/Sage, >> I have incorporated all of your comments. Please have a look at the same >> pull request. >> >> https://github.com/ceph/ceph/pull/2440 >> >> Thanks & Regards >> Somnath >> >> -Original Message- >> From: Samuel Just [mailto:sam.j...@inktank.com] >> Sent: Wednesday, September 10, 2014 3:25 PM >> To: Somnath Roy >> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; >> ceph-users@lists.ceph.com >> Subject: Re: OpTracker optimization >> >> Oh, I changed my mind, your approach is fine. I was unclear. >> Currently, I just need you to address the other comments. >> -Sam >> >> On Wed, Sep 10, 2014 at 3:13 PM, Somnath Roy wrote: >> > As I understand, you want me to implement the following. >> > >> > 1. Keep this implementation one sharded optracker for the ios going >> > through ms_dispatch path. >> > >> > 2. Additionally, for ios going through ms_fast_dispatch, you want me >> > to implement optracker (without internal shard) per opwq shard >> > >> > Am I right ? >> > >> > Thanks & Regards >> > Somnath >> > >> > -Original Message- >> > From: Samuel Just [mailto:sam.j...@inktank.com] >> > Sent: Wednesday, September 10, 2014 3:08 PM >> > To: Somnath Roy >> > Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; >> > ceph-users@lists.ceph.com >> > Subject: Re: OpTracker optimization >> > >> > I don't quite understand. >> > -Sam >> > >> > On Wed, Sep 10, 2014 at 2:38 PM, Somnath Roy >> > wrote: >> >> Thanks Sam. >> >> So, you want me to go with optracker/shadedopWq , right ? >> >> >> >> Regards >> >> Somnath >> >> >> >> -Original Message- >> >> From: Samuel Just [mailto:sam.j...@inktank.com] >> >> Sent: Wednesday, September 10, 2014 2:36 PM >> >> To: Somnath Roy >> >> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; >> >> ceph-users@lists.ceph.com >> >> Subject: Re: OpTracker optimization >> >> >> >> Responded with cosmetic nonsense. Once you've got that and the other >> >> comments addressed, I can put it in wip-sam-testing. >> >> -Sam >> >> >> >> On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy >> >> wrote: >> >>> Thanks Sam..I responded back :-) >> >>> >> >>> -Original Message- >> >>> From: ceph-devel-ow...@vger.kernel.org >> >>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just >> >>> Sent: Wednesday, September 10, 2014 11:17 AM >> >>> To: Somnath Roy >> >>> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; >> >>> ceph-users@lists.ceph.com >> >>> Subject: Re: OpTracker optimization >> >>> >> >>> Added a comment about the approach. >> >>> -Sam >> >>> >> >>> On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy >> >>> wrote: >> >>>> Hi Sam/Sage, >> >>>> >> >>>> As we discussed earlier, enabling the present OpTracker code >> >>>> degrading performance severely. 
For example, in my setup a single >> >>>> OSD node with >> >>>> 10 clients is reaching ~103K read iops with io served from memory >> >>>> while optracking is disabled but enabling optracker it is reduced to >> >>>> ~39K iops. >> >>>> Probably, running OSD without enabling OpTracker is not an option >> >>>> for many of Ceph users. >> >>>>
Re: [ceph-users] v0.92 released
There seems to be a bug with the transaction encoding when upgrading from v0.91 to v0.92. Users probably want to hold off on upgrading to v0.92 until http://tracker.ceph.com/issues/10734 is resolved. -Sam On Tue, Feb 3, 2015 at 7:40 AM, Sage Weil wrote: > This is the second-to-last chunk of new stuff before Hammer. Big items > include additional checksums on OSD objects, proxied reads in the > cache tier, image locking in RBD, optimized OSD Transaction and > replication messages, and a big pile of RGW and MDS bug fixes. > > Upgrading > - > > * The experimental 'keyvaluestore-dev' OSD backend has been renamed > 'keyvaluestore' (for simplicity) and marked as experimental. To > enable this untested feature and acknowledge that you understand > that it is untested and may destroy data, you need to add the > following to your ceph.conf:: > > enable experimental unrecoverable data corrupting featuers = > keyvaluestore > > * The following librados C API function calls take a 'flags' argument > whose value is now correctly interpreted: > > rados_write_op_operate() > rados_aio_write_op_operate() > rados_read_op_operate() > rados_aio_read_op_operate() > > The flags were not correctly being translated from the librados > constants to the internal values. Now they are. Any code that is > passing flags to these methods should be audited to ensure that they are > using the correct LIBRADOS_OP_FLAG_* constants. > > * The 'rados' CLI 'copy' and 'cppool' commands now use the copy-from > operation, which means the latest CLI cannot run these commands against > pre-firefly OSDs. > > * The librados watch/notify API now includes a watch_flush() operation to > flush the async queue of notify operations. This should be called by > any watch/notify user prior to rados_shutdown(). 
> > Notable Changes > --- > > * add experimental features option (Sage Weil) > * build: fix 'make check' races (#10384 Loic Dachary) > * build: fix pkg names when libkeyutils is missing (Pankag Garg, Ken > Dreyer) > * ceph: make 'ceph -s' show PG state counts in sorted order (Sage Weil) > * ceph: make 'ceph tell mon.* version' work (Mykola Golub) > * ceph-monstore-tool: fix/improve CLI (Joao Eduardo Luis) > * ceph: show primary-affinity in 'ceph osd tree' (Mykola Golub) > * common: add TableFormatter (Andreas Peters) > * common: check syncfs() return code (Jianpeng Ma) > * doc: do not suggest dangerous XFS nobarrier option (Dan van der Ster) > * doc: misc updates (Nilamdyuti Goswami, John Wilkins) > * install-deps.sh: do not require sudo when root (Loic Dachary) > * libcephfs: fix dirfrag trimming (#10387 Yan, Zheng) > * libcephfs: fix mount timeout (#10041 Yan, Zheng) > * libcephfs: fix test (#10415 Yan, Zheng) > * libcephfs: fix use-afer-free on umount (#10412 Yan, Zheng) > * libcephfs: include ceph and git version in client metadata (Sage Weil) > * librados: add watch_flush() operation (Sage Weil, Haomai Wang) > * librados: avoid memcpy on getxattr, read (Jianpeng Ma) > * librados: create ioctx by pool id (Jason Dillaman) > * librados: do notify completion in fast-dispatch (Sage Weil) > * librados: remove shadowed variable (Kefu Chain) > * librados: translate op flags from C APIs (Matthew Richards) > * librbd: differentiate between R/O vs R/W features (Jason Dillaman) > * librbd: exclusive image locking (Jason Dillaman) > * librbd: fix write vs import race (#10590 Jason Dillaman) > * librbd: gracefully handle deleted/renamed pools (#10270 Jason Dillaman) > * mds: asok command for fetching subtree map (John Spray) > * mds: constify MDSCacheObjects (John Spray) > * misc: various valgrind fixes and cleanups (Danny Al-Gaaf) > * mon: fix 'mds fail' for standby MDSs (John Spray) > * mon: fix stashed monmap encoding (#5203 Xie Rui) > * mon: implement 'fs reset' command (John Spray) > * mon: respect down flag when promoting standbys (John Spray) > * mount.ceph: fix suprious error message (#10351 Yan, Zheng) > * msgr: async: many fixes, unit tests (Haomai Wang) > * msgr: simple: retry binding to port on failure (#10029 Wido den > Hollander) > * osd: add fadvise flags to ObjectStore API (Jianpeng Ma) > * osd: add get_latest_osdmap asok command (#9483 #9484 Mykola Golub) > * osd: EIO on whole-object reads when checksum is wrong (Sage Weil) > * osd: filejournal: don't cache journal when not using direct IO (Jianpeng > Ma) > * osd: fix ioprio option (Mykola Golub) > * osd: fix scrub delay bug (#10693 Samuel Just) > * osd: fix watch reconnect race (#10441 Sage Weil) > * osd: handle no-op
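Before or during such a rollout it is worth confirming which version each daemon is actually running, for example:

    ceph tell osd.* version
    ceph tell mon.a version      # per mon; the mon.* wildcard form only works from v0.92 on, per the notes above
    ceph --version               # the locally installed binaries

That makes it easy to verify that no daemon has already been moved to v0.92 before the transaction-encoding issue above is resolved.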
Re: [ceph-users] Unexpected OSD down during deep-scrub
The fix for this should be in 0.93, so this must be something different, can you reproduce with debug osd = 20 debug ms = 1 debug filestore = 20 and post the log to http://tracker.ceph.com/issues/11027? On Wed, 2015-03-04 at 00:04 +0100, Yann Dupont wrote: > Le 03/03/2015 22:03, Italo Santos a écrit : > > > > I realised that when the first OSD goes down, the cluster was > > performing a deep-scrub and I found the bellow trace on the logs of > > osd.8, anyone can help me understand why the osd.8, and other osds, > > unexpected goes down? > > > > I'm afraid I've seen this this afternoon too on my test cluster, just > after upgrading from 0.87 to 0.93. After an initial migration success, > some OSD started to go down : All presented similar stack traces , with > magic word "scrub" in it : > > ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4) > 1: /usr/bin/ceph-osd() [0xbeb3dc] > 2: (()+0xf0a0) [0x7f8f3ca130a0] > 3: (gsignal()+0x35) [0x7f8f3b37d165] > 4: (abort()+0x180) [0x7f8f3b3803e0] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f8f3bbd389d] > 6: (()+0x63996) [0x7f8f3bbd1996] > 7: (()+0x639c3) [0x7f8f3bbd19c3] > 8: (()+0x63bee) [0x7f8f3bbd1bee] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x220) [0xcd74f0] > 10: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, > utime_t)+0x1fc) [0x97259c] > 11: (ReplicatedPG::simple_repop_submit(ReplicatedPG::RepGather*)+0x7a) > [0x97344a] > 12: (ReplicatedPG::_scrub(ScrubMap&, std::map std::pair, std::less, > std::allocator ir > > > const&)+0x2e4d) [0x9a5ded] > 13: (PG::scrub_compare_maps()+0x658) [0x916378] > 14: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x202) [0x917ee2] > 15: (PG::scrub(ThreadPool::TPHandle&)+0x3a3) [0x919f83] > 16: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x7eff93] > 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xcc8c49] > 18: (ThreadPool::WorkThread::entry()+0x10) [0xccac40] > 19: (()+0x6b50) [0x7f8f3ca0ab50] > 20: (clone()+0x6d) [0x7f8f3b42695d] > > As a temporary measure, noscrub and nodeep-scrub are now set for this > cluster, and all is working fine right now. > > So there is probably something wrong here. Need to investigate further. > > Cheers, > > > > > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
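For reference, the temporary workaround mentioned above and the debug capture being asked for look roughly like this (osd.8 is taken from the log excerpt and is only an example):

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # when ready to capture a log for the tracker:
    ceph tell osd.8 injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'
    ceph osd unset nodeep-scrub        # or kick a single pg with: ceph pg deep-scrub <pgid>

The resulting /var/log/ceph/ceph-osd.8.log can then be attached to http://tracker.ceph.com/issues/11027.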
Re: [ceph-users] Stuck PGs blocked_by non-existent OSDs
You'll probably have to recreate osds with the same ids (empty ones), let them boot, stop them, and mark them lost. There is a feature in the tracker to improve this behavior: http://tracker.ceph.com/issues/10976 -Sam On Mon, 2015-03-09 at 12:24 +, joel.merr...@gmail.com wrote: > Hi, > > I'm trying to fix an issue within 0.93 on our internal cloud related > to incomplete pg's (yes, I realise the folly of having the dev release > - it's a not-so-test env now, so I need to recover this really). I'll > detail the current outage info; > > 72 initial (now 65) OSDs > 6 nodes > > * Update to 0.92 from Giant. > * Fine for a day > * MDS outage overnight and subsequent node failure > * Massive increase in RAM utilisation (10G per OSD!) > * More failure > * OSD's 'out' to try to alleviate new large cluster requirements and a > couple died under additional load > * 'superfluous and faulty' OSD's rm, auth keys deleted > * RAM added to nodes (96GB each - serving 10-12 OSDs) > * Ugrade to 0.93 > * Fix broken journals due to 0.92 update > * No more missing objects or degredation > > So, that brings me to today, I still have 73/2264 PGs listed as stuck > incomplete/inactive. I also have requests that are blocked. > > Upon querying said placement groups, I notice that they are > 'blocked_by' non-existent OSDs (ones I have removed due to issues). > I have no way to tell them the OSD is lost (as it'a already been > removed, both from osdmap and crushmap). > Exporting the crushmap shows non-existant OSDs as deviceN (i.e. > device36 for the removed osd.36) > Deleting those and reimporting crush map makes no affect > > Some further pg detail - https://gist.github.com/joelio/cecca9b48aca6d44451b > > > So I'm stuck, I can't recover the pg's as I can't remove a > non-existent OSD that the PG think's blocking it. > > Help graciously accepted! > Joel > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
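A very rough sketch of the recreate/boot/stop/mark-lost sequence for one of the removed ids (osd.36 appears in the crush map above and is used as the example; a throwaway directory-backed OSD is enough, since it only needs to exist long enough to be marked lost, and the exact auth caps and paths depend on the deployment):

    ceph osd create                      # repeat until it hands back id 36; it reuses the lowest free id
    mkdir -p /var/lib/ceph/osd/ceph-36
    ceph-osd -i 36 --mkfs --mkkey
    ceph auth add osd.36 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-36/keyring
    ceph-osd -i 36                       # let it boot and be seen by the monitors, then stop it
    ceph osd lost 36 --yes-i-really-mean-it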
Re: [ceph-users] Stuck PGs blocked_by non-existent OSDs
What do you mean by "unblocked" but still "stuck"? -Sam On Mon, 2015-03-09 at 22:54 +, joel.merr...@gmail.com wrote: > On Mon, Mar 9, 2015 at 2:28 PM, Samuel Just wrote: > > You'll probably have to recreate osds with the same ids (empty ones), > > let them boot, stop them, and mark them lost. There is a feature in the > > tracker to improve this behavior: http://tracker.ceph.com/issues/10976 > > -Sam > > Thanks Sam, I've readded the OSDs, they became unblocked but there are > still the same number of pgs stuck. I looked at them in some more > detail and it seems they all have num_bytes='0'. Tried a repair too, > for good measure. Still nothing I'm afraid. > > Does this mean some underlying catastrophe has happened and they are > never going to recover? Following on, would that cause data loss. > There are no missing objects and I'm hoping there's appropriate > checksumming / replicas to balance that out, but now I'm not so sure. > > Thanks again, > Joel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Issues with fresh 0.93 OSD adding to existing cluster
Can you reproduce this with debug osd = 20 debug filestore = 20 debug ms = 1 on the crashing osd? Also, what sha1 are the other osds and mons running? -Sam - Original Message - From: "Malcolm Haak" To: ceph-users@lists.ceph.com Sent: Tuesday, March 10, 2015 3:28:26 AM Subject: [ceph-users] Issues with fresh 0.93 OSD adding to existing cluster Hi all, I've just attempted to add a new node and OSD to an existing ceph cluster (it's a small one I use as a NAS at home, not like the big production ones I normally work on) and it seems to be throwing some odd errors... Just looking for where to poke it next... Log is below, It's a two node cluster with 3 osd's in node A and one osd in the new node (It's going to have more eventually and node one will be retired after node three gets added) And I've hit a weird snag. I was running 0.80 but I ran into the 'Invalid Command' bug on the new node so I opted to jump to the latest code with the required patches already. Please let me know what else you need.. This is the log content when attempting to start the new OSD: 2015-03-10 19:28:48.795318 7f0774108880 0 ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4), process ceph-osd, pid 10810 2015-03-10 19:28:48.817803 7f0774108880 0 filestore(/var/lib/ceph/osd/ceph-3) backend xfs (magic 0x58465342) 2015-03-10 19:28:48.866862 7f0774108880 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_features: FIEMAP ioctl is supported and appears to work 2015-03-10 19:28:48.866920 7f0774108880 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option 2015-03-10 19:28:48.905069 7f0774108880 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_features: syncfs(2) syscall fully supported (by glibc and kernel) 2015-03-10 19:28:48.905467 7f0774108880 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_feature: extsize is supported and kernel 3.18.3-1-desktop >= 3.5 2015-03-10 19:28:49.077872 7f0774108880 0 filestore(/var/lib/ceph/osd/ceph-3) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled 2015-03-10 19:28:49.078321 7f0774108880 -1 journal FileJournal::_open: disabling aio for non-block journal. 
Use journal_force_aio to force use of aio anyway 2015-03-10 19:28:49.078328 7f0774108880 1 journal _open /var/lib/ceph/osd/ceph-3/journal fd 19: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0 2015-03-10 19:28:49.079721 7f0774108880 1 journal _open /var/lib/ceph/osd/ceph-3/journal fd 19: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0 2015-03-10 19:28:49.080948 7f0774108880 0 cls/hello/cls_hello.cc:271: loading cls_hello 2015-03-10 19:28:49.094194 7f0774108880 0 osd.3 2757 crush map has features 33816576, adjusting msgr requires for clients 2015-03-10 19:28:49.094211 7f0774108880 0 osd.3 2757 crush map has features 33816576 was 8705, adjusting msgr requires for mons 2015-03-10 19:28:49.094217 7f0774108880 0 osd.3 2757 crush map has features 33816576, adjusting msgr requires for osds 2015-03-10 19:28:49.094235 7f0774108880 0 osd.3 2757 load_pgs 2015-03-10 19:28:49.094279 7f0774108880 0 osd.3 2757 load_pgs opened 0 pgs 2015-03-10 19:28:49.095121 7f0774108880 -1 osd.3 2757 log_to_monitors {default=true} 2015-03-10 19:28:49.134104 7f0774108880 0 osd.3 2757 done with init, starting boot process 2015-03-10 19:28:49.149994 7f076384c700 -1 *** Caught signal (Aborted) ** in thread 7f076384c700 ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4) 1: /usr/bin/ceph-osd() [0xac7cea] 2: (()+0x10050) [0x7f0773013050] 3: (gsignal()+0x37) [0x7f07714e60f7] 4: (abort()+0x13a) [0x7f07714e74ca] 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f0771dcbfe5] 6: (()+0x63186) [0x7f0771dca186] 7: (()+0x631b3) [0x7f0771dca1b3] 8: (()+0x633d2) [0x7f0771dca3d2] 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) [0xc2cea7] 10: (OSDMap::decode_classic(ceph::buffer::list::iterator&)+0x605) [0xb7b7b5] 11: (OSDMap::decode(ceph::buffer::list::iterator&)+0x8c) [0xb7bebc] 12: (OSDMap::decode(ceph::buffer::list&)+0x3f) [0xb7dfbf] 13: (OSD::handle_osd_map(MOSDMap*)+0xd37) [0x6cd9a7] 14: (OSD::_dispatch(Message*)+0x3eb) [0x6d0afb] 15: (OSD::ms_dispatch(Message*)+0x257) [0x6d1007] 16: (DispatchQueue::entry()+0x649) [0xc6fe09] 17: (DispatchQueue::DispatchThread::entry()+0xd) [0xb9dd7d] 18: (()+0x83a4) [0x7f077300b3a4] 19: (clone()+0x6d) [0x7f0771595a4d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. --- begin dump of recent events --- -135> 2015-03-10 19:28:48.790490 7f0774108880 5 asok(0x420) register_command perfcounters_dump hook 0x41b4030 -134> 2015-03-10 19:28:48.790565 7f0774108880 5 asok(0x420) register_command 1 hook 0x41b4030 -133> 2015-03-10 19:28:48.790571 7f0774108880 5 asok(0x420) register_command perf dump hook 0x41b4030 -132> 2015-03-10 19:28:48.790
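To capture what is being asked for here, one approach (filenames are arbitrary) is to raise the logging in ceph.conf on the new node, restart the crashing OSD, and grab both the log and the current full osdmap:

    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1

    /etc/init.d/ceph start osd.3         # or however osd.3 is normally started on that host
    ceph osd getmap -o /tmp/osdmap.bin

The crash itself will land in /var/log/ceph/ceph-osd.3.log.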
Re: [ceph-users] Issues with fresh 0.93 OSD adding to existing cluster
Joao, it looks like map 2759 is causing trouble, how would he get the full and incremental maps for that out of the mons? -Sam On Tue, 2015-03-10 at 14:12 +, Malcolm Haak wrote: > Hi Samuel, > > The sha1? I'm going to admit ignorance as to what you are looking for. They > are all running the same release if that is what you are asking. > Same tarball built into rpms using rpmbuild on both nodes... > Only difference being that the other node has been upgraded and the problem > node is fresh. > > added the requested config here is the command line output > > microserver-1:/etc # /etc/init.d/ceph start osd.3 > === osd.3 === > Mounting xfs on microserver-1:/var/lib/ceph/osd/ceph-3 > 2015-03-11 01:00:13.492279 7f05b2f72700 1 -- :/0 messenger.start > 2015-03-11 01:00:13.492823 7f05b2f72700 1 -- :/1002795 --> > 192.168.0.10:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- ?+0 > 0x7f05ac0290b0 con 0x7f05ac027c40 > 2015-03-11 01:00:13.510814 7f05b07ef700 1 -- 192.168.0.250:0/1002795 learned > my addr 192.168.0.250:0/1002795 > 2015-03-11 01:00:13.527653 7f05abfff700 1 -- 192.168.0.250:0/1002795 <== > mon.0 192.168.0.10:6789/0 1 mon_map magic: 0 v1 191+0+0 (1112175541 > 0 0) 0x7f05aab0 con 0x7f05ac027c40 > 2015-03-11 01:00:13.527899 7f05abfff700 1 -- 192.168.0.250:0/1002795 <== > mon.0 192.168.0.10:6789/0 2 auth_reply(proto 1 0 (0) Success) v1 > 24+0+0 (3859410672 0 0) 0x7f05ae70 con 0x7f05ac027c40 > 2015-03-11 01:00:13.527973 7f05abfff700 1 -- 192.168.0.250:0/1002795 --> > 192.168.0.10:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x7f05ac029730 > con 0x7f05ac027c40 > 2015-03-11 01:00:13.528124 7f05b2f72700 1 -- 192.168.0.250:0/1002795 --> > 192.168.0.10:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 > 0x7f05ac029a50 con 0x7f05ac027c40 > 2015-03-11 01:00:13.528265 7f05b2f72700 1 -- 192.168.0.250:0/1002795 --> > 192.168.0.10:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 > 0x7f05ac029f20 con 0x7f05ac027c40 > 2015-03-11 01:00:13.530359 7f05abfff700 1 -- 192.168.0.250:0/1002795 <== > mon.0 192.168.0.10:6789/0 3 mon_map magic: 0 v1 191+0+0 (1112175541 > 0 0) 0x7f05aab0 con 0x7f05ac027c40 > 2015-03-11 01:00:13.530548 7f05abfff700 1 -- 192.168.0.250:0/1002795 <== > mon.0 192.168.0.10:6789/0 4 mon_subscribe_ack(300s) v1 20+0+0 > (3648139960 0 0) 0x7f05afb0 con 0x7f05ac027c40 > 2015-03-11 01:00:13.531114 7f05abfff700 1 -- 192.168.0.250:0/1002795 <== > mon.0 192.168.0.10:6789/0 5 osd_map(3277..3277 src has 2757..3277) v3 > 5366+0+0 (3110999244 0 0) 0x7f05a0002800 con 0x7f05ac027c40 > 2015-03-11 01:00:13.531772 7f05abfff700 1 -- 192.168.0.250:0/1002795 <== > mon.0 192.168.0.10:6789/0 6 mon_subscribe_ack(300s) v1 20+0+0 > (3648139960 0 0) 0x7f05afb0 con 0x7f05ac027c40 > 2015-03-11 01:00:13.532186 7f05abfff700 1 -- 192.168.0.250:0/1002795 <== > mon.0 192.168.0.10:6789/0 7 osd_map(3277..3277 src has 2757..3277) v3 > 5366+0+0 (3110999244 0 0) 0x7f05a0001250 con 0x7f05ac027c40 > 2015-03-11 01:00:13.532260 7f05abfff700 1 -- 192.168.0.250:0/1002795 <== > mon.0 192.168.0.10:6789/0 8 mon_subscribe_ack(300s) v1 20+0+0 > (3648139960 0 0) 0x7f05afb0 con 0x7f05ac027c40 > 2015-03-11 01:00:13.556748 7f05b2f72700 1 -- 192.168.0.250:0/1002795 --> > 192.168.0.10:6789/0 -- mon_command({"prefix": "get_command_descriptions"} v > 0) v1 -- ?+0 0x7f05ac016ac0 con 0x7f05ac027c40 > 2015-03-11 01:00:13.564968 7f05abfff700 1 -- 192.168.0.250:0/1002795 <== > mon.0 192.168.0.10:6789/0 9 mon_command_ack([{"prefix": > "get_command_descriptions"}]=0 v0) v1 72+0+34995 (1092875540 0 > 1727986498) 0x7f05aa70 con 0x7f05ac027c40 
> 2015-03-11 01:00:13.770122 7f05b2f72700 1 -- 192.168.0.250:0/1002795 --> > 192.168.0.10:6789/0 -- mon_command({"prefix": "osd crush create-or-move", > "args": ["host=microserver-1", "root=default"], "id": 3, "weight": 1.81} v 0) > v1 -- ?+0 0x7f05ac016ac0 con 0x7f05ac027c40 > 2015-03-11 01:00:13.772299 7f05abfff700 1 -- 192.168.0.250:0/1002795 <== > mon.0 192.168.0.10:6789/0 10 mon_command_ack([{"prefix": "osd crush > create-or-move", "args": ["host=microserver-1", "root=default"], "id": 3, > "weight": 1.81}]=0 create-or-move updated item name 'osd.3' weight 1.81 at > location {host=microserver-1,root=default} to crush map v3277) v1 > 256+0+0 (1191546821 0 0) 0x7f05a0001000 con 0x7f05ac027c40 > create-or-move updated item name 'osd.3' weight 1.81 at location > {host=microserver-1,root=default} to crush map > 2015-03-11 01:00:13.776891 7f05b2f72700 1 -- 192.168.0.250:0/1002795 > mark_down 0x7f05ac027c40 -- 0x7f05ac0239a0 > 2015-03-11 01:00:13.777212 7f05b2f72700 1 -- 192.168.0.250:0/1002795 > mark_down_all > 2015-03-11 01:00:13.778120 7f05b2f72700 1 -- 192.168.0.250:0/1002795 > shutdown complete. > Starting Ceph osd.3 on microserver-1... > microserver-1:/etc # > > > Log file > > > 2015-03-11 01:00:13.876152 7f41a1ba
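If the monitors still have the suspect epoch (the osd_map lines above show 'src has 2757..3277', so they should), the full maps around it can be pulled with something like the following; the output filenames are arbitrary:

    ceph osd getmap 2759 -o /tmp/osdmap.2759
    ceph osd getmap 2758 -o /tmp/osdmap.2758
    ceph osd dump 2759                   # human-readable view of the same epoch

As far as I know the incremental maps are not exposed through a simple CLI, so per-epoch full maps are the easiest thing to hand over.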
Re: [ceph-users] Stuck PGs blocked_by non-existent OSDs
Yeah, get a ceph pg query on one of the stuck ones. -Sam On Tue, 2015-03-10 at 14:41 +, joel.merr...@gmail.com wrote: > Stuck unclean and stuck inactive. I can fire up a full query and > health dump somewhere useful if you want (full pg query info on ones > listed in health detail, tree, osd dump etc). There were blocked_by > operations that no longer exist after doing the OSD addition. > > Side note, spent some time yesterday writing some bash to do this > programatically (might be useful to others, will throw on github) > > On Tue, Mar 10, 2015 at 1:41 PM, Samuel Just wrote: > > What do you mean by "unblocked" but still "stuck"? > > -Sam > > > > On Mon, 2015-03-09 at 22:54 +0000, joel.merr...@gmail.com wrote: > >> On Mon, Mar 9, 2015 at 2:28 PM, Samuel Just wrote: > >> > You'll probably have to recreate osds with the same ids (empty ones), > >> > let them boot, stop them, and mark them lost. There is a feature in the > >> > tracker to improve this behavior: http://tracker.ceph.com/issues/10976 > >> > -Sam > >> > >> Thanks Sam, I've readded the OSDs, they became unblocked but there are > >> still the same number of pgs stuck. I looked at them in some more > >> detail and it seems they all have num_bytes='0'. Tried a repair too, > >> for good measure. Still nothing I'm afraid. > >> > >> Does this mean some underlying catastrophe has happened and they are > >> never going to recover? Following on, would that cause data loss. > >> There are no missing objects and I'm hoping there's appropriate > >> checksumming / replicas to balance that out, but now I'm not so sure. > >> > >> Thanks again, > >> Joel > > > > > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Stuck PGs blocked_by non-existent OSDs
Ok, you lost all copies from an interval where the pgs went active. The recovery from this is going to be complicated and fragile. Are the pools valuable? -Sam On 03/11/2015 03:35 AM, joel.merr...@gmail.com wrote: For clarity too, I've tried to drop the min_size before as suggested, doesn't make a difference unfortunately On Wed, Mar 11, 2015 at 9:50 AM, joel.merr...@gmail.com wrote: Sure thing, n.b. I increased pg count to see if it would help. Alas not. :) Thanks again! health_detail https://gist.github.com/199bab6d3a9fe30fbcae osd_dump https://gist.github.com/499178c542fa08cc33bb osd_tree https://gist.github.com/02b62b2501cbd684f9b2 Random selected queries: queries/0.19.query https://gist.github.com/f45fea7c85d6e665edf8 queries/1.a1.query https://gist.github.com/dd68fbd5e862f94eb3be queries/7.100.query https://gist.github.com/d4fd1fb030c6f2b5e678 queries/7.467.query https://gist.github.com/05dbcdc9ee089bd52d0c On Tue, Mar 10, 2015 at 2:49 PM, Samuel Just wrote: Yeah, get a ceph pg query on one of the stuck ones. -Sam On Tue, 2015-03-10 at 14:41 +, joel.merr...@gmail.com wrote: Stuck unclean and stuck inactive. I can fire up a full query and health dump somewhere useful if you want (full pg query info on ones listed in health detail, tree, osd dump etc). There were blocked_by operations that no longer exist after doing the OSD addition. Side note, spent some time yesterday writing some bash to do this programatically (might be useful to others, will throw on github) On Tue, Mar 10, 2015 at 1:41 PM, Samuel Just wrote: What do you mean by "unblocked" but still "stuck"? -Sam On Mon, 2015-03-09 at 22:54 +, joel.merr...@gmail.com wrote: On Mon, Mar 9, 2015 at 2:28 PM, Samuel Just wrote: You'll probably have to recreate osds with the same ids (empty ones), let them boot, stop them, and mark them lost. There is a feature in the tracker to improve this behavior: http://tracker.ceph.com/issues/10976 -Sam Thanks Sam, I've readded the OSDs, they became unblocked but there are still the same number of pgs stuck. I looked at them in some more detail and it seems they all have num_bytes='0'. Tried a repair too, for good measure. Still nothing I'm afraid. Does this mean some underlying catastrophe has happened and they are never going to recover? Following on, would that cause data loss. There are no missing objects and I'm hoping there's appropriate checksumming / replicas to balance that out, but now I'm not so sure. Thanks again, Joel -- $ echo "kpfmAdpoofdufevq/dp/vl" | perl -pe 's/(.)/chr(ord($1)-1)/ge' ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Stuck PGs blocked_by non-existent OSDs
For each of those pgs, you'll need to identify the pg copy you want to be the winner and either 1) Remove all of the other ones using ceph-objectstore-tool and hopefully the winner you left alone will allow the pg to recover and go active. 2) Export the winner using ceph-objectstore-tool, use ceph-objectstore-tool to delete *all* copies of the pg, use force_create_pg to recreate the pg empty, use ceph-objectstore-tool to do a rados import on the exported pg copy. Also, the pgs which are still down still have replicas which need to be brought back or marked lost. -Sam On 03/11/2015 07:29 AM, joel.merr...@gmail.com wrote: I'd like to not have to null them if possible, there's nothing outlandishly valuable, its more the time to reprovision (users have stuff on there, mainly testing but I have a nasty feeling some users won't have backed up their test instances). When you say complicated and fragile, could you expand? Thanks again! Joel On Wed, Mar 11, 2015 at 1:21 PM, Samuel Just wrote: Ok, you lost all copies from an interval where the pgs went active. The recovery from this is going to be complicated and fragile. Are the pools valuable? -Sam On 03/11/2015 03:35 AM, joel.merr...@gmail.com wrote: For clarity too, I've tried to drop the min_size before as suggested, doesn't make a difference unfortunately On Wed, Mar 11, 2015 at 9:50 AM, joel.merr...@gmail.com wrote: Sure thing, n.b. I increased pg count to see if it would help. Alas not. :) Thanks again! health_detail https://gist.github.com/199bab6d3a9fe30fbcae osd_dump https://gist.github.com/499178c542fa08cc33bb osd_tree https://gist.github.com/02b62b2501cbd684f9b2 Random selected queries: queries/0.19.query https://gist.github.com/f45fea7c85d6e665edf8 queries/1.a1.query https://gist.github.com/dd68fbd5e862f94eb3be queries/7.100.query https://gist.github.com/d4fd1fb030c6f2b5e678 queries/7.467.query https://gist.github.com/05dbcdc9ee089bd52d0c On Tue, Mar 10, 2015 at 2:49 PM, Samuel Just wrote: Yeah, get a ceph pg query on one of the stuck ones. -Sam On Tue, 2015-03-10 at 14:41 +, joel.merr...@gmail.com wrote: Stuck unclean and stuck inactive. I can fire up a full query and health dump somewhere useful if you want (full pg query info on ones listed in health detail, tree, osd dump etc). There were blocked_by operations that no longer exist after doing the OSD addition. Side note, spent some time yesterday writing some bash to do this programatically (might be useful to others, will throw on github) On Tue, Mar 10, 2015 at 1:41 PM, Samuel Just wrote: What do you mean by "unblocked" but still "stuck"? -Sam On Mon, 2015-03-09 at 22:54 +, joel.merr...@gmail.com wrote: On Mon, Mar 9, 2015 at 2:28 PM, Samuel Just wrote: You'll probably have to recreate osds with the same ids (empty ones), let them boot, stop them, and mark them lost. There is a feature in the tracker to improve this behavior: http://tracker.ceph.com/issues/10976 -Sam Thanks Sam, I've readded the OSDs, they became unblocked but there are still the same number of pgs stuck. I looked at them in some more detail and it seems they all have num_bytes='0'. Tried a repair too, for good measure. Still nothing I'm afraid. Does this mean some underlying catastrophe has happened and they are never going to recover? Following on, would that cause data loss. There are no missing objects and I'm hoping there's appropriate checksumming / replicas to balance that out, but now I'm not so sure. 
Thanks again, Joel -- $ echo "kpfmAdpoofdufevq/dp/vl" | perl -pe 's/(.)/chr(ord($1)-1)/ge' ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
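A sketch of what option 2 could look like for one of the stuck pgs, using 7.100 from the queries above; the osd ids and paths are made up, the OSDs involved have to be stopped while ceph-objectstore-tool runs against them, and given how fragile this is it is worth rehearsing on copies first:

    # on the osd holding the copy chosen as the winner (say osd.12), with that osd stopped:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --pgid 7.100 --op export --file /tmp/7.100.export
    # on every osd that still has a copy of 7.100, each stopped in turn:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
        --journal-path /var/lib/ceph/osd/ceph-NN/journal \
        --pgid 7.100 --op remove
    # recreate the pg empty, then import the saved copy back on the winner:
    ceph pg force_create_pg 7.100
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --op import --file /tmp/7.100.export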
Re: [ceph-users] Random OSD failures - FAILED assert
Most likely fixed in firefly. -Sam - Original Message - From: "Kostis Fardelas" To: "ceph-users" Sent: Tuesday, March 17, 2015 12:30:43 PM Subject: [ceph-users] Random OSD failures - FAILED assert Hi, we are running Ceph v.0.72.2 (emperor) from the ceph emperor repo. The latest week we had 2 random OSD crashes (one during cluster recovery and one while in healthy state) with the same symptom: osd process crashes, logs the following trace on its log and gets down and out. We are in the process of preparing our cluster upgrade to firefly, but we would like to know if this is a known bug fixed in more recent versions and more about troubleshooting the specific failure. On which subsystems could we increase their debugging level to provide more info? 2015-03-16 20:44:18.768488 7f516d4c9700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::sub_op_modify(OpRequestRef)' thread 7f516d4c9700 time 2015-03-16 20:44:18.764353 osd/ReplicatedPG.cc: 5570: FAILED assert(!pg_log.get_missing().is_missing(soid)) ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60) 1: (ReplicatedPG::sub_op_modify(std::tr1::shared_ptr)+0xae0) [0x9182c0] 2: (ReplicatedPG::do_sub_op(std::tr1::shared_ptr)+0x117) [0x9184f7] 3: (ReplicatedPG::do_request(std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x381) [0x8f12a1] 4: (OSD::dequeue_op(boost::intrusive_ptr, std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x316) [0x6f7096] 5: (OSD::OpWQ::_process(boost::intrusive_ptr, ThreadPool::TPHandle&)+0x198) [0x70e048] 6: (ThreadPool::WorkQueueVal, std::tr1::shared_ptr >, boost::intrusive_ptr >::_void_process(void*, ThreadPool::TPHandle&)+0xae) [0x7494ce] 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xa517fa] 8: (ThreadPool::WorkThread::entry()+0x10) [0xa52a50] 9: (()+0x6b50) [0x7f5199f52b50] 10: (clone()+0x6d) [0x7f519871e70d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. --- begin dump of recent events --- Regards, Kostis ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
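On the debugging question: the subsystems that usually matter for this kind of replication assert are osd, filestore and ms, and they can be raised at runtime with something like:

    ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'

or with the equivalent settings in the [osd] section of ceph.conf if the extra logging needs to survive a restart. Expect a large increase in log volume at these levels.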
Re: [ceph-users] monitor 0.87.1 crashes
You'll want to at least include the backtrace. -Sam On 03/27/2015 10:55 AM, samuel wrote: Hi all, In a fully functional ceph installation, today we hit a problem with the ceph monitors, which started crashing with the following error: include/interval_set.h: 340: FAILED assert(0) Is there any related bug? Thanks a lot in advance, Samuel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
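The backtrace should already be in the monitor log, typically /var/log/ceph/ceph-mon.<id>.log. If more context is needed, and since the monitors themselves are crashing, the debug levels are best set in ceph.conf and the daemons restarted, for example:

    [mon]
        debug mon = 20
        debug ms = 1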
Re: [ceph-users] OSDs failing on upgrade from Giant to Hammer
I have a suspicion about what caused this. Can you restart one of the problem osds with debug osd = 20 debug filestore = 20 debug ms = 1 and attach the resulting log from startup to crash along with the osdmap binary (ceph osd getmap -o ). -Sam - Original Message - From: "Scott Laird" To: "Robert LeBlanc" Cc: "'ceph-users@lists.ceph.com' (ceph-users@lists.ceph.com)" Sent: Sunday, April 19, 2015 6:13:55 PM Subject: Re: [ceph-users] OSDs failing on upgrade from Giant to Hammer Nope. Straight from 0.87 to 0.94.1. FWIW, at someone's suggestion, I just upgraded the kernel on one of the boxes from 3.14 to 3.18; no improvement. Rebooting didn't help, either. Still failing with the same error in the logs. On Sun, Apr 19, 2015 at 2:06 PM Robert LeBlanc < rob...@leblancnet.us > wrote: Did you upgrade from 0.92? If you did, did you flush the logs before upgrading? On Sun, Apr 19, 2015 at 1:02 PM, Scott Laird < sc...@sigkill.org > wrote: I'm upgrading from Giant to Hammer (0.94.1), and I'm seeing a ton of OSDs die (and stay dead) with this error in the logs: 2015-04-19 11:53:36.796847 7f61fa900900 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f61fa900900 time 2015-04-19 11:53:36.794951 osd/OSD.h: 716: FAILED assert(ret) ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc271b] 2: (OSDService::get_map(unsigned int)+0x3f) [0x70923f] 3: (OSD::load_pgs()+0x1769) [0x6c35d9] 4: (OSD::init()+0x71f) [0x6c4c7f] 5: (main()+0x2860) [0x651fc0] 6: (__libc_start_main()+0xf5) [0x7f61f7a3fec5] 7: /usr/bin/ceph-osd() [0x66aff7] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. This is on a small cluster, with ~40 OSDs on 5 servers running Ubuntu 14.04. So far, every single server that I've upgraded has had at least one disk that has failed to restart with this error, and one has had several disks in this state. Restarting the OSD after it dies with this doesn't help. I haven't lost any data through this due to my slow rollout, but it's really annoying. Here are two full logs from OSDs on two different machines: https://dl.dropboxusercontent.com/u/104949139/ceph-osd.25.log https://dl.dropboxusercontent.com/u/104949139/ceph-osd.34.log Any suggestions? Scott ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
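For completeness, the osdmap binary mentioned above can be captured with a one-liner (output filename arbitrary):

    ceph osd getmap -o /tmp/osdmap.bin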
Re: [ceph-users] OSDs failing on upgrade from Giant to Hammer
Yep, you have hit bug 11429. At some point, you removed a pool and then restarted these osds. Due to the original bug, 10617, those osds never actually removed the pgs in that pool. I'm working on a fix, or you can manually remove pgs corresponding to pools which no longer exist from the crashing osds using the ceph-objectstore-tool. -Sam - Original Message - From: "Scott Laird" To: "Samuel Just" Cc: "Robert LeBlanc" , "'ceph-users@lists.ceph.com' (ceph-users@lists.ceph.com)" Sent: Monday, April 20, 2015 6:13:06 AM Subject: Re: [ceph-users] OSDs failing on upgrade from Giant to Hammer They're kind of big; here are links: https://dl.dropboxusercontent.com/u/104949139/osdmap https://dl.dropboxusercontent.com/u/104949139/ceph-osd.36.log On Sun, Apr 19, 2015 at 8:42 PM Samuel Just wrote: > I have a suspicion about what caused this. Can you restart one of the > problem osds with > > debug osd = 20 > debug filestore = 20 > debug ms = 1 > > and attach the resulting log from startup to crash along with the osdmap > binary (ceph osd getmap -o ). > -Sam > > - Original Message - > From: "Scott Laird" > To: "Robert LeBlanc" > Cc: "'ceph-users@lists.ceph.com' (ceph-users@lists.ceph.com)" < > ceph-users@lists.ceph.com> > Sent: Sunday, April 19, 2015 6:13:55 PM > Subject: Re: [ceph-users] OSDs failing on upgrade from Giant to Hammer > > Nope. Straight from 0.87 to 0.94.1. FWIW, at someone's suggestion, I just > upgraded the kernel on one of the boxes from 3.14 to 3.18; no improvement. > Rebooting didn't help, either. Still failing with the same error in the > logs. > > On Sun, Apr 19, 2015 at 2:06 PM Robert LeBlanc < rob...@leblancnet.us > > wrote: > > > > Did you upgrade from 0.92? If you did, did you flush the logs before > upgrading? > > On Sun, Apr 19, 2015 at 1:02 PM, Scott Laird < sc...@sigkill.org > wrote: > > > > I'm upgrading from Giant to Hammer (0.94.1), and I'm seeing a ton of OSDs > die (and stay dead) with this error in the logs: > > 2015-04-19 11:53:36.796847 7f61fa900900 -1 osd/OSD.h: In function > 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f61fa900900 time > 2015-04-19 11:53:36.794951 > osd/OSD.h: 716: FAILED assert(ret) > > ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff) > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x8b) [0xbc271b] > 2: (OSDService::get_map(unsigned int)+0x3f) [0x70923f] > 3: (OSD::load_pgs()+0x1769) [0x6c35d9] > 4: (OSD::init()+0x71f) [0x6c4c7f] > 5: (main()+0x2860) [0x651fc0] > 6: (__libc_start_main()+0xf5) [0x7f61f7a3fec5] > 7: /usr/bin/ceph-osd() [0x66aff7] > NOTE: a copy of the executable, or `objdump -rdS ` is needed > to interpret this. > > This is on a small cluster, with ~40 OSDs on 5 servers running Ubuntu > 14.04. So far, every single server that I've upgraded has had at least one > disk that has failed to restart with this error, and one has had several > disks in this state. > > Restarting the OSD after it dies with this doesn't help. > > I haven't lost any data through this due to my slow rollout, but it's > really annoying. > > Here are two full logs from OSDs on two different machines: > > https://dl.dropboxusercontent.com/u/104949139/ceph-osd.25.log > https://dl.dropboxusercontent.com/u/104949139/ceph-osd.34.log > > Any suggestions? 
> > > Scott > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
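A sketch of how those leftover pgs can be spotted with ceph-objectstore-tool, assuming the affected osd is stopped first; osd 36 is just the example from the log above, and 7.1a stands in for a pgid whose pool id no longer exists:

    ceph osd lspools        # note which pool ids still exist
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36 \
        --journal-path /var/lib/ceph/osd/ceph-36/journal --op list-pgs
    # any pgid whose pool prefix (the part before the dot) is not in lspools
    # belongs to a deleted pool and is a candidate for removal:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36 \
        --journal-path /var/lib/ceph/osd/ceph-36/journal --pgid 7.1a --op remove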
Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
Can you explain exactly what you mean by: "Also I created one pool for tier to be able to move data without outage." -Sam - Original Message - From: "tuomas juntunen" To: "Ian Colle" Cc: ceph-users@lists.ceph.com Sent: Monday, April 27, 2015 4:23:44 AM Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down Hi Any solution for this yet? Br, Tuomas > It looks like you may have hit http://tracker.ceph.com/issues/7915 > > Ian R. Colle > Global Director > of Software Engineering > Red Hat (Inktank is now part of Red Hat!) > http://www.linkedin.com/in/ircolle > http://www.twitter.com/ircolle > Cell: +1.303.601.7713 > Email: ico...@redhat.com > > - Original Message - > From: "tuomas juntunen" > To: ceph-users@lists.ceph.com > Sent: Monday, April 27, 2015 1:56:29 PM > Subject: [ceph-users] Upgrade from Giant to Hammer and after some basic > operations most of the OSD's went down > > > > I upgraded Ceph from 0.87 Giant to 0.94.1 Hammer > > Then created new pools and deleted some old ones. Also I created one pool for > tier to be able to move data without outage. > > After these operations all but 10 OSD's are down and creating this kind of > messages to logs, I get more than 100gb of these in a night: > > -19> 2015-04-27 10:17:08.808584 7fd8e748d700 5 osd.23 pg_epoch: 17882 > pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 > 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod > 0'0 inactive NOTIFY] enter Started >-18> 2015-04-27 10:17:08.808596 7fd8e748d700 5 osd.23 pg_epoch: 17882 > pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 > 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod > 0'0 inactive NOTIFY] enter Start >-17> 2015-04-27 10:17:08.808608 7fd8e748d700 1 osd.23 pg_epoch: 17882 > pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 > 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod > 0'0 inactive NOTIFY] state: transitioning to Stray >-16> 2015-04-27 10:17:08.808621 7fd8e748d700 5 osd.23 pg_epoch: 17882 > pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 > 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod > 0'0 inactive NOTIFY] exit Start 0.25 0 0.00 >-15> 2015-04-27 10:17:08.808637 7fd8e748d700 5 osd.23 pg_epoch: 17882 > pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 > 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod > 0'0 inactive NOTIFY] enter Started/Stray >-14> 2015-04-27 10:17:08.808796 7fd8e748d700 5 osd.23 pg_epoch: 17882 > pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 > 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] exit > Reset 0.119467 4 0.37 >-13> 2015-04-27 10:17:08.808817 7fd8e748d700 5 osd.23 pg_epoch: 17882 > pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 > 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] enter > Started >-12> 2015-04-27 10:17:08.808828 7fd8e748d700 5 osd.23 pg_epoch: 17882 > pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 > 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] enter > Start >-11> 2015-04-27 10:17:08.808838 7fd8e748d700 1 osd.23 pg_epoch: 17882 > pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 > 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] > state: transitioning to Stray >-10> 
2015-04-27 10:17:08.808849 7fd8e748d700 5 osd.23 pg_epoch: 17882 > pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 > 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] exit > Start 0.20 0 0.00 > -9> 2015-04-27 10:17:08.808861 7fd8e748d700 5 osd.23 pg_epoch: 17882 > pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 > 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] enter > Started/Stray > -8> 2015-04-27 10:17:08.809427 7fd8e748d700 5 osd.23 pg_epoch: 17882 > pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 > 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] exit > Reset 7.511623 45 0.000165 > -7> 2015-04-27 10:17:08.809445 7fd8e748d700 5 osd.23 pg_epoch: 17882 > pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 > 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter > Started > -6> 2015-04-27 10:17:08.809456 7fd8e748d700 5 osd.23 pg_epoch: 17882 > pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 > 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter > Start > -5> 2015-04-27 10:17:08.809468 7fd8e748d700 1 osd.23 pg_epoch: 17882 > pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 > 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] > sta
Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
So, the base tier is what determines the snapshots for the cache/base pool amalgam. You added a populated pool complete with snapshots on top of a base tier without snapshots. Apparently, it caused an existential crisis for the snapshot code. That's one of the reasons why there is a --force-nonempty flag for that operation, I think. I think the immediate answer is probably to disallow pools with snapshots as a cache tier altogether until we think of a good way to make it work. -Sam - Original Message - From: "tuomas juntunen" To: "Samuel Just" Cc: ceph-users@lists.ceph.com Sent: Monday, April 27, 2015 4:56:58 AM Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down The following: ceph osd tier add img images --force-nonempty ceph osd tier cache-mode images forward ceph osd tier set-overlay img images Idea was to make images as a tier to img, move data to img then change clients to use the new img pool. Br, Tuomas > Can you explain exactly what you mean by: > > "Also I created one pool for tier to be able to move data without outage." > > -Sam > - Original Message - > From: "tuomas juntunen" > To: "Ian Colle" > Cc: ceph-users@lists.ceph.com > Sent: Monday, April 27, 2015 4:23:44 AM > Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some basic > operations most of the OSD's went down > > Hi > > Any solution for this yet? > > Br, > Tuomas > >> It looks like you may have hit http://tracker.ceph.com/issues/7915 >> >> Ian R. Colle >> Global Director >> of Software Engineering >> Red Hat (Inktank is now part of Red Hat!) >> http://www.linkedin.com/in/ircolle >> http://www.twitter.com/ircolle >> Cell: +1.303.601.7713 >> Email: ico...@redhat.com >> >> - Original Message - >> From: "tuomas juntunen" >> To: ceph-users@lists.ceph.com >> Sent: Monday, April 27, 2015 1:56:29 PM >> Subject: [ceph-users] Upgrade from Giant to Hammer and after some basic >> operations most of the OSD's went down >> >> >> >> I upgraded Ceph from 0.87 Giant to 0.94.1 Hammer >> >> Then created new pools and deleted some old ones. Also I created one pool for >> tier to be able to move data without outage. 
>> >> After these operations all but 10 OSD's are down and creating this kind of >> messages to logs, I get more than 100gb of these in a night: >> >> -19> 2015-04-27 10:17:08.808584 7fd8e748d700 5 osd.23 pg_epoch: 17882 >> pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 >> 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod >> 0'0 inactive NOTIFY] enter Started >>-18> 2015-04-27 10:17:08.808596 7fd8e748d700 5 osd.23 pg_epoch: 17882 >> pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 >> 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod >> 0'0 inactive NOTIFY] enter Start >>-17> 2015-04-27 10:17:08.808608 7fd8e748d700 1 osd.23 pg_epoch: 17882 >> pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 >> 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod >> 0'0 inactive NOTIFY] state: transitioning to Stray >>-16> 2015-04-27 10:17:08.808621 7fd8e748d700 5 osd.23 pg_epoch: 17882 >> pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 >> 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod >> 0'0 inactive NOTIFY] exit Start 0.25 0 0.00 >>-15> 2015-04-27 10:17:08.808637 7fd8e748d700 5 osd.23 pg_epoch: 17882 >> pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 >> 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod >> 0'0 inactive NOTIFY] enter Started/Stray >>-14> 2015-04-27 10:17:08.808796 7fd8e748d700 5 osd.23 pg_epoch: 17882 >> pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] exit >> Reset 0.119467 4 0.37 >>-13> 2015-04-27 10:17:08.808817 7fd8e748d700 5 osd.23 pg_epoch: 17882 >> pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] enter >> Started >>-12> 2015-04-27 10:17:08.808828 7fd8e748d700 5 osd.23 pg_epoch: 17882 >> pg[1
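For reference, the usual direction for cache tiering is the other way around: a brand-new, empty pool goes on top of the existing data pool, so the cache never brings its own snapshot history with it. A sketch with placeholder names, where images is the existing data pool and images-cache is a freshly created empty pool:

    ceph osd pool create images-cache 128
    ceph osd tier add images images-cache
    ceph osd tier cache-mode images-cache writeback
    ceph osd tier set-overlay images images-cache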
[ceph-users] straw vs straw2 mapping differences
I took a bit of time to get a feel for how different the straw2 mappings are vs straw1 mappings. For a bucket in which all weights are the same, I saw no changed mappings, which is as expected. However, on a map with 3 hosts each of which has 4 osds with weights 1,2,3, and 4 (crush-different-weight.straw), 1360/1 mappings wind up different. Some good news is that this effect is smaller with smaller weight differences. When the osd weights are 1, 1.1, 1.2, and 1.3 (crush-different-weight-close.straw), only 377/1 single osd mappings are different. I've attached a tarball with a script for reproducing the results and the maps I tested. -Sam strawtest.tgz Description: application/compressed-tar ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
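For anyone who wants to reproduce a comparison like this without the attached script, crushtool can emulate the mappings directly. Roughly (assuming a straw and a straw2 variant of the same compiled map; the rule id, replica count and the 10000 test inputs are arbitrary choices):

    crushtool -i crush-different-weight.straw  --test --show-mappings \
        --rule 0 --num-rep 1 --min-x 0 --max-x 9999 > straw1.out
    crushtool -i crush-different-weight.straw2 --test --show-mappings \
        --rule 0 --num-rep 1 --min-x 0 --max-x 9999 > straw2.out
    diff straw1.out straw2.out | grep -c '^>'   # rough count of changed mappings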
Re: [ceph-users] Write freeze when writing to rbd image and rebooting one of the nodes
In short, the drawback is false positives which can cause unnecessary cluster churn. -Sam - Original Message - From: "Robert LeBlanc" To: "Vasiliy Angapov" Cc: "Sage Weil" , "ceph-users" Sent: Wednesday, May 13, 2015 12:21:16 PM Subject: Re: [ceph-users] Write freeze when writing to rbd image and rebooting one of the nodes -BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Don't let me deter you from pointing a gun at your foot. I think you may be confusing Ceph for a SAN, which it certainly is not. In an expensive SAN, 5 seconds it very unacceptable for a failover event between controllers. However there are many safeguards put in place to ensure that there is no corruption using custom communication channels and protocols to ensure fencing mis-behaving controllers. Ceph is a distributed storage system built on commodity hardware and is striving to fill a different role. In the case of distributed systems you have to get a consensus of the state of each member to know what is really going on. Since there are more external forces in a distributed system, it usually can't be as quick to make decisions. Since the direct communication channels don't exist and things like flaky network cables, cards, loops, etc call all impact the ability of the cluster to have a good idea of the state of each node, you usually have to wait some time. This is partly due to the protocols used like TCP which has certain time-out periods, etc. Recovery in Ceph is also more expensive than a SAN. SANs rely on controllers being able to access the same disks on the backend, Ceph replicates the data to provide availability. This can stress even very high bandwidth networks, disks, CPUs, etc. You don't want to jump into recovery too quickly as it can cause a cascading effect. This gets to answering your question directly. If you have very short timing for detecting failures, then you can get into a feedback loop that takes your whole cluster down and it can't get healthy. If a node fails and recovery starts, other nodes may be too busy to respond to heart beats fast enough or the heartbeats are lost in transit and then they start getting marked down incorrectly and another round of recoveries start stressing the remaining OSDs causing a downward spiral. At the same time, OSDs that were wrongly marked down start responding and are brought back in the cluster and the recovery restarts and the you get this constant OSD flapping. While 30 seconds seems like a long time, it is certainly better than being offline for a month due to corrupted data (true story, I have the scars to prove it). I make it a habit for any virtualized machines to increase the timeout of the file system to 300 seconds to help weather the storms that may happen anywhere in the pipe between the VM OS -> hypervisor -> storage system. 30 seconds is well within the time that the typical end user will just punch the refresh button if they are impatient and the page will load pretty quick after that and they won't think twice. However, if you really need 5 seconds or less for failover of unexpected failures, I would suggest that you reevaluate the things that are important to you and the trade-offs you are willing to make. Although Ceph is not a perfect storage system, I have been very happy with the resiliency and performance it provides and at a wonderful price point. Even spending lots of $$ on Fibre Channel SANs does not guarantee great reliability or performance. 
I've had my fair share of outages on these where failovers did not work properly or some bug that has left their engineers scratching their heads for years, never to be fixed. - Robert LeBlanc GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Wed, May 13, 2015 at 10:29 AM, Vasiliy Angapov wrote: Thanks, Sage! In the meanwhile I asked the same question in #Ceph IRC channel and Be_El gave me exactly the same answer, which helped. I also realized that in http://ceph.com/docs/master/rados/configuration/mon-osd-interaction/ it is stated: "You may change this grace period by adding an osd heartbeatgrace setting under the [osd] section of your Ceph configuration file, or by setting the value at runtime.". But in reality you must add this option to the [global] sections. Settinng this value in [osd] section only influenced only osd daemons, but not monitors. Anyway, now IO resumes after only 5 seconds freeze. Thanks for help, guys! Regarding Ceph failure detection: in real environment it seems for me like 20-30 seconds of freeze after a single storage node outage is very expensive. Even when we talk about data consistency... 5 seconds is acceptable threshold. But, Sage, can you please explain in brief, what are the drawbacks of lowering the timeout? If for example I got stable 10 gig cluster network which is not likely to lag or interrupt - is 5 seconds dangerous anyhow? How OSDs can report false positives
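To gather the working configuration in one place: the grace value has to be visible to the monitors as well as the osds, which is why it belongs under [global]; a sketch (the 5 second value is the one discussed in this thread, not a general recommendation):

    [global]
        osd heartbeat grace = 5

    # at runtime the same value can be injected into the osds; the mons
    # also need to pick it up (restart them if injection is not an option)
    ceph tell osd.\* injectargs '--osd-heartbeat-grace 5'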
Re: [ceph-users] OSD unable to start (giant -> hammer)
You have most likely hit http://tracker.ceph.com/issues/11429. There are some workarounds in the bugs marked as duplicates of that bug, or you can wait for the next hammer point release. -Sam - Original Message - From: "Berant Lemmenes" To: ceph-users@lists.ceph.com Sent: Monday, May 18, 2015 10:24:38 AM Subject: [ceph-users] OSD unable to start (giant -> hammer) Hello all, I've encountered a problem when upgrading my single node home cluster from giant to hammer, and I would greatly appreciate any insight. I upgraded the packages like normal, then proceeded to restart the mon and once that came back restarted the first OSD (osd.3). However it subsequently won't start and crashes with the following failed assertion: osd/OSD.h: 716: FAILED assert(ret) ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xb1784f] 2: (OSD::load_pgs()+0x277b) [0x6850fb] 3: (OSD::init()+0x1448) [0x6930b8] 4: (main()+0x26b9) [0x62fd89] 5: (__libc_start_main()+0xed) [0x7f2345bc976d] 6: ceph-osd() [0x635679] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 keyvaluestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 xio -2/-2 (syslog threshold) 99/99 (stderr threshold) max_recent 1 max_new 1000 log_file --- end dump of recent events --- terminate called after throwing an instance of 'ceph::FailedAssertion' *** Caught signal (Aborted) ** in thread 7f2347f71780 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff) 1: ceph-osd() [0xa1fe55] 2: (()+0xfcb0) [0x7f2346fb1cb0] 3: (gsignal()+0x35) [0x7f2345bde0d5] 4: (abort()+0x17b) [0x7f2345be183b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d] 6: (()+0xb5846) [0x7f234652d846] 7: (()+0xb5873) [0x7f234652d873] 8: (()+0xb596e) [0x7f234652d96e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x259) [0xb17a29] 10: (OSD::load_pgs()+0x277b) [0x6850fb] 11: (OSD::init()+0x1448) [0x6930b8] 12: (main()+0x26b9) [0x62fd89] 13: (__libc_start_main()+0xed) [0x7f2345bc976d] 14: ceph-osd() [0x635679] 2015-05-18 13:02:33.643064 7f2347f71780 -1 *** Caught signal (Aborted) ** in thread 7f2347f71780 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff) 1: ceph-osd() [0xa1fe55] 2: (()+0xfcb0) [0x7f2346fb1cb0] 3: (gsignal()+0x35) [0x7f2345bde0d5] 4: (abort()+0x17b) [0x7f2345be183b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d] 6: (()+0xb5846) [0x7f234652d846] 7: (()+0xb5873) [0x7f234652d873] 8: (()+0xb596e) [0x7f234652d96e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x259) [0xb17a29] 10: (OSD::load_pgs()+0x277b) [0x6850fb] 11: (OSD::init()+0x1448) [0x6930b8] 12: (main()+0x26b9) [0x62fd89] 13: (__libc_start_main()+0xed) [0x7f2345bc976d] 14: ceph-osd() [0x635679] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 
--- begin dump of recent events --- 0> 2015-05-18 13:02:33.643064 7f2347f71780 -1 *** Caught signal (Aborted) ** in thread 7f2347f71780 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff) 1: ceph-osd() [0xa1fe55] 2: (()+0xfcb0) [0x7f2346fb1cb0] 3: (gsignal()+0x35) [0x7f2345bde0d5] 4: (abort()+0x17b) [0x7f2345be183b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d] 6: (()+0xb5846) [0x7f234652d846] 7: (()+0xb5873) [0x7f234652d873] 8: (()+0xb596e) [0x7f234652d96e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x259) [0xb17a29] 10: (OSD::load_pgs()+0x277b) [0x6850fb] 11: (OSD::init()+0x1448) [0x6930b8] 12: (main()+0x26b9) [0x62fd89] 13: (__libc_start_main()+0xed) [0x7f2345bc976d] 14: ceph-osd() [0x635679] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 keyval
Re: [ceph-users] OSD crashing over and over, taking cluster down
You appear to be using pool snapshots with radosgw, I suspect that's what is causing the issue. Can you post a longer log? Preferably with debug osd = 20 debug filestore = 20 debug ms = 1 from startup to crash on an osd? -Sam - Original Message - From: "Daniel Schneller" To: ceph-users@lists.ceph.com Sent: Tuesday, May 19, 2015 4:29:42 AM Subject: [ceph-users] OSD crashing over and over, taking cluster down Last night our Hammer cluster suffered a series of OSD crashes on all cluster nodes. We were running Hammer (0.94.1-98-g7df3eb5, built because we had a major problem a week ago which we suspected to be related to bugs we found in the tracker, that were not yet in 0.94.1). In the meantime we downgraded to the official 0.94.1 to rule out we introduced any instability by taking a HEAD version. No change. Around 22.00h our users started reporting a down web application, and at the same time we found lots of OSD crashes and restarts. The stack trace in the log looks like this on all of them (taken at a later time, when I had increased debug_osd=20): 2015-05-19 03:38:35.449476 7f7aed260700 -1 osd/osd_types.h: In function 'void ObjectContext::RWState::put_excl(std::list >*)' thread 7f7aed260700 time 2015-05-19 03:38:35.445434 osd/osd_types.h: 3167: FAILED assert(state == RWEXCL) ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc2b8b] 2: /usr/bin/ceph-osd() [0x8b9b05] 3: (ReplicatedPG::remove_repop(ReplicatedPG::RepGather*)+0xec) [0x84516c] 4: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x912) [0x857082] 5: (ReplicatedPG::repop_all_applied(ReplicatedPG::RepGather*)+0x16d) [0x857bbd] 6: (Context::complete(int)+0x9) [0x6caf09] 7: (ReplicatedBackend::op_applied(ReplicatedBackend::InProgressOp*)+0x1ec) [0xa081dc] 8: (Context::complete(int)+0x9) [0x6caf09] 9: (ReplicatedPG::BlessedContext::finish(int)+0x94) [0x8af634] 10: (Context::complete(int)+0x9) [0x6caf09] 11: (void finish_contexts(CephContext*, std::list >&, int)+0x94) [0x70b764] 12: (C_ContextsBase::complete(int)+0x9) [0x6cb759] 13: (Finisher::finisher_thread_entry()+0x158) [0xaef528] 14: (()+0x8182) [0x7f7afd540182] 15: (clone()+0x6d) [0x7f7afbaabfbd] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 
Right before the crash, log lines like this were output (see also: http://pastebin.com/W0srF0bW ) 2015-05-19 03:38:35.445346 7f7aed260700 15 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23683 active+clean] do_osd_op_effects client.141679501 con 0x13fcd180 2015-05-19 03:38:35.445356 7f7aed260700 10 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean] removing repgather(0x52f63900 34697'23687 rep_tid=656357 committed?=1 applied?=1 lock=2 op=osd_op(client.141679501.0:19791 default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5 [call refcount.put] 81.ce36d9f9 ondisk+write+known_if_redirected e34697) v5) 2015-05-19 03:38:35.445370 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean]q front is repgather(0x52f63900 34697'23687 rep_tid=656357 committed?=1 applied?=1 lock=2 op=osd_op(client.141679501.0:19791 default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5 [call refcount.put] 81.ce36d9f9 ondisk+write+known_if_redirected e34697) v5) 2015-05-19 03:38:35.445381 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean] remove_repop repgather(0x52f63900 34697'23687 rep_tid=656357 committed?=1 applied?=1 lock=2 op=osd_op(client.141679501.0:19791 default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5 [call refcount.put] 81.ce36d9f9 ondisk+write+known_if_redirected e34697) v5) 2015-05-19 03:38:35.445393 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean] obc obc(ce36d9f9/default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5/head//81(dne) rwstate(excl n=1 w=0)) 2015-05-19 03:38:35.445401 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local
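A quick, hedged way to check which pools carry pool-level snapshots (pool names come from your own cluster):

    for p in $(rados lspools); do echo "== $p"; rados -p "$p" lssnap; done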
Re: [ceph-users] OSD unable to start (giant -> hammer)
If 2.14 is part of a non-existent pool, you should be able to rename it out of current/ in the osd directory to prevent the osd from seeing it on startup. -Sam - Original Message - From: "Berant Lemmenes" To: "Samuel Just" Cc: ceph-users@lists.ceph.com Sent: Tuesday, May 19, 2015 12:58:30 PM Subject: Re: [ceph-users] OSD unable to start (giant -> hammer) Hello, So here are the steps I performed and where I sit now. Step 1) Using 'ceph-objectstore-tool list' to create a list of all PGs not associated with the 3 pools (rbd, data, metadata) that are actually in use on this cluster. Step 2) I then did a 'ceph-objectstore-tool remove' of those PGs Then when starting the OSD it would complain about PGs that were NOT in the list of 'ceph-objectstore-tool list' but WERE present on the filesystem of the OSD in question. Step 3) Iterating over all of the PGs that were on disk and using 'ceph-objectstore-tool info' I made a list of all PGs that returned ENOENT, Step 4) 'ceph-objectstore-tool remove' to remove all those as well. Now when starting osd.3 I get an "unable to load metadata' error for a PG that according to 'ceph pg 2.14 query' is not present (and shouldn't be) on osd.3. Shown below with OSD debugging at 20: -23> 2015-05-19 15:15:12.712036 7fb079a20780 20 read_log 39533'174051 (39533'174050) modify 49277412/rb.0.100f.2ae8944a.00029945/head//2 by client.18119.0:2811937 2015-05-18 07:18:42.859501 -22> 2015-05-19 15:15:12.712066 7fb079a20780 20 read_log 39533'174052 (39533'174051) modify 49277412/rb.0.100f.2ae8944a.00029945/head//2 by client.18119.0:2812374 2015-05-18 07:33:21.973157 -21> 2015-05-19 15:15:12.712096 7fb079a20780 20 read_log 39533'174053 (39533'174052) modify 49277412/rb.0.100f.2ae8944a.00029945/head//2 by client.18119.0:2812861 2015-05-18 07:48:23.098343 -20> 2015-05-19 15:15:12.712127 7fb079a20780 20 read_log 39533'174054 (39533'174053) modify 49277412/rb.0.100f.2ae8944a.00029945/head//2 by client.18119.0:2813371 2015-05-18 08:03:54.226512 -19> 2015-05-19 15:15:12.712157 7fb079a20780 20 read_log 39533'174055 (39533'174054) modify 49277412/rb.0.100f.2ae8944a.00029945/head//2 by client.18119.0:2813922 2015-05-18 08:18:20.351421 -18> 2015-05-19 15:15:12.712187 7fb079a20780 20 read_log 39533'174056 (39533'174055) modify 49277412/rb.0.100f.2ae8944a.00029945/head//2 by client.18119.0:2814396 2015-05-18 08:33:56.476035 -17> 2015-05-19 15:15:12.712221 7fb079a20780 20 read_log 39533'174057 (39533'174056) modify 49277412/rb.0.100f.2ae8944a.00029945/head//2 by client.18119.0:2814971 2015-05-18 08:48:22.605674 -16> 2015-05-19 15:15:12.712252 7fb079a20780 20 read_log 39533'174058 (39533'174057) modify 49277412/rb.0.100f.2ae8944a.00029945/head//2 by client.18119.0:2815407 2015-05-18 09:02:48.720181 -15> 2015-05-19 15:15:12.712282 7fb079a20780 20 read_log 39533'174059 (39533'174058) modify 49277412/rb.0.100f.2ae8944a.00029945/head//2 by client.18119.0:2815434 2015-05-18 09:03:43.727839 -14> 2015-05-19 15:15:12.712312 7fb079a20780 20 read_log 39533'174060 (39533'174059) modify 49277412/rb.0.100f.2ae8944a.00029945/head//2 by client.18119.0:2815889 2015-05-18 09:17:49.846406 -13> 2015-05-19 15:15:12.712342 7fb079a20780 20 read_log 39533'174061 (39533'174060) modify 49277412/rb.0.100f.2ae8944a.00029945/head//2 by client.18119.0:2816358 2015-05-18 09:32:50.969457 -12> 2015-05-19 15:15:12.712372 7fb079a20780 20 read_log 39533'174062 (39533'174061) modify 49277412/rb.0.100f.2ae8944a.00029945/head//2 by client.18119.0:2816840 2015-05-18 09:47:52.091524 -11> 2015-05-19 15:15:12.712403 7fb079a20780 
20 read_log 39533'174063 (39533'174062) modify 49277412/rb.0.100f.2ae8944a.00029945/head//2 by client.18119.0:2816861 2015-05-18 09:48:22.096309 -10> 2015-05-19 15:15:12.712433 7fb079a20780 20 read_log 39533'174064 (39533'174063) modify 49277412/rb.0.100f.2ae8944a.00029945/head//2 by client.18119.0:2817714 2015-05-18 10:02:53.222749 -9> 2015-05-19 15:15:12.713130 7fb079a20780 10 read_log done -8> 2015-05-19 15:15:12.713550 7fb079a20780 10 osd.3 pg_epoch: 39533 pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101 ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=0 pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive] handle_loaded -7> 2015-05-19 15:15:12.713570 7fb079a20780 5 osd.3 pg_epoch: 39533 pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101 ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=0 pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIF
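A sketch of that rename, using the osd and pg ids from this thread; stop osd.3 first, and keep the directory around rather than deleting it, in case it turns out to be needed:

    # with osd.3 stopped:
    mv /var/lib/ceph/osd/ceph-3/current/2.14_head \
       /var/lib/ceph/osd/ceph-3/2.14_head.stray
    # (move a matching 2.14_TEMP directory too, if one exists)
    # then start osd.3 again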
[ceph-users] Discuss: New default recovery config settings
Many people have reported that they need to lower the osd recovery config options to minimize the impact of recovery on client io. We are talking about changing the defaults as follows:

osd_max_backfills to 1 (from 10)
osd_recovery_max_active to 3 (from 15)
osd_recovery_op_priority to 1 (from 10)
osd_recovery_max_single_start to 1 (from 5)

We'd like a bit of feedback first though. Is anyone happy with the current configs? Is anyone using something between these values and the current defaults? What kind of workload? I'd guess that lowering osd_max_backfills to 1 is probably a good idea, but I wonder whether lowering osd_recovery_max_active and osd_recovery_max_single_start will cause small objects to recover unacceptably slowly. Thoughts? -Sam ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
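For anyone who wants to try the proposed values ahead of any change to the defaults, a sketch of the persistent and the runtime forms:

    [osd]
        osd max backfills = 1
        osd recovery max active = 3
        osd recovery op priority = 1
        osd recovery max single start = 1

    ceph tell osd.\* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 3 --osd-recovery-op-priority 1 --osd-recovery-max-single-start 1'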
[ceph-users] librados clone_range
ObjectWriteOperations currently allow you to perform a clone_range from another object with the same object locator. Years ago, rgw used this as part of multipart upload. Today, the implementation complicates the OSD considerably, and it doesn't appear to have any users left. Is there anyone who would be sad to see it removed from the librados interface? -Sam ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] PGs going inconsistent after stopping the primary
Looks like it's just a stat error. The primary appears to have the correct stats, but the replica for some reason doesn't (thinks there's an object for some reason). I bet it clears itself it you perform a write on the pg since the primary will send over its stats. We'd need information from when the stat error originally occurred to debug further. -Sam - Original Message - From: "Dan van der Ster" To: ceph-users@lists.ceph.com Sent: Wednesday, July 22, 2015 7:49:00 AM Subject: [ceph-users] PGs going inconsistent after stopping the primary Hi Ceph community, Env: hammer 0.94.2, Scientific Linux 6.6, kernel 2.6.32-431.5.1.el6.x86_64 We wanted to post here before the tracker to see if someone else has had this problem. We have a few PGs (different pools) which get marked inconsistent when we stop the primary OSD. The problem is strange because once we restart the primary, then scrub the PG, the PG is marked active+clean. But inevitably next time we stop the primary OSD, the same PG is marked inconsistent again. There is no user activity on this PG, and nothing interesting is logged in any of the 2nd/3rd OSDs (with debug_osd=20, the first line mentioning the PG already says inactive+inconsistent). We suspect this is related to garbage files left in the PG folder. One of our PGs is acting basically like above, except it goes through this cycle: active+clean -> (deep-scrub) -> active+clean+inconsistent -> (repair) -> active+clean -> (restart primary OSD) -> (deep-scrub) -> active+clean+inconsistent. This one at least logs: 2015-07-22 16:42:41.821326 osd.303 [INF] 55.10d deep-scrub starts 2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes. 2015-07-22 16:42:41.823842 osd.303 [ERR] 55.10d deep-scrub 1 errors and this should be debuggable because there is only one object in the pool: tapetest 55 0 073575G 1 even though rados ls returns no objects: # rados ls -p tapetest # Any ideas? Cheers, Dan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
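One way to trigger such a write, using the pool from this thread; the object name is arbitrary, and 'ceph osd map' shows which pg a given name lands in, so you may need to try a few names until one hits the affected pg:

    ceph osd map tapetest test-object-1     # shows the pg this object name maps to
    rados -p tapetest put test-object-1 /etc/hostname
    rados -p tapetest rm test-object-1      # optional cleanup once the stats agree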
Re: [ceph-users] PGs going inconsistent after stopping the primary
Annoying that we don't know what caused the replica's stat structure to get out of sync. Let us know if you see it recur. What were those pools used for? -Sam - Original Message - From: "Dan van der Ster" To: "Samuel Just" Cc: ceph-users@lists.ceph.com Sent: Wednesday, July 22, 2015 12:36:53 PM Subject: Re: [ceph-users] PGs going inconsistent after stopping the primary Cool, writing some objects to the affected PGs has stopped the consistent/inconsistent cycle. I'll keep an eye on them but this seems to have fixed the problem. Thanks!! Dan On Wed, Jul 22, 2015 at 6:07 PM, Samuel Just wrote: > Looks like it's just a stat error. The primary appears to have the correct > stats, but the replica for some reason doesn't (thinks there's an object for > some reason). I bet it clears itself it you perform a write on the pg since > the primary will send over its stats. We'd need information from when the > stat error originally occurred to debug further. > -Sam > > - Original Message - > From: "Dan van der Ster" > To: ceph-users@lists.ceph.com > Sent: Wednesday, July 22, 2015 7:49:00 AM > Subject: [ceph-users] PGs going inconsistent after stopping the primary > > Hi Ceph community, > > Env: hammer 0.94.2, Scientific Linux 6.6, kernel 2.6.32-431.5.1.el6.x86_64 > > We wanted to post here before the tracker to see if someone else has > had this problem. > > We have a few PGs (different pools) which get marked inconsistent when > we stop the primary OSD. The problem is strange because once we > restart the primary, then scrub the PG, the PG is marked active+clean. > But inevitably next time we stop the primary OSD, the same PG is > marked inconsistent again. > > There is no user activity on this PG, and nothing interesting is > logged in any of the 2nd/3rd OSDs (with debug_osd=20, the first line > mentioning the PG already says inactive+inconsistent). > > > We suspect this is related to garbage files left in the PG folder. One > of our PGs is acting basically like above, except it goes through this > cycle: active+clean -> (deep-scrub) -> active+clean+inconsistent -> > (repair) -> active+clean -> (restart primary OSD) -> (deep-scrub) -> > active+clean+inconsistent. This one at least logs: > > 2015-07-22 16:42:41.821326 osd.303 [INF] 55.10d deep-scrub starts > 2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat > mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 > hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes. > 2015-07-22 16:42:41.823842 osd.303 [ERR] 55.10d deep-scrub 1 errors > > and this should be debuggable because there is only one object in the pool: > > tapetest 55 0 073575G 1 > > even though rados ls returns no objects: > > # rados ls -p tapetest > # > > Any ideas? > > Cheers, Dan > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] PGs going inconsistent after stopping the primary
Oh, if you were running dev releases, it's not super surprising that the stat tracking was at some point buggy. -Sam - Original Message - From: "Dan van der Ster" To: "Samuel Just" Cc: ceph-users@lists.ceph.com Sent: Thursday, July 23, 2015 8:21:07 AM Subject: Re: [ceph-users] PGs going inconsistent after stopping the primary Those pools were a few things: rgw.buckets plus a couple pools we use for developing new librados clients. But the source of this issue is likely related to the few pre-hammer development releases (and crashes) we upgraded through whilst running a large scale test. Anyway, now I'll know how to better debug this in future so we'll let you know if it reoccurs. Cheers, Dan On Wed, Jul 22, 2015 at 9:42 PM, Samuel Just wrote: > Annoying that we don't know what caused the replica's stat structure to get > out of sync. Let us know if you see it recur. What were those pools used > for? > -Sam > > - Original Message - > From: "Dan van der Ster" > To: "Samuel Just" > Cc: ceph-users@lists.ceph.com > Sent: Wednesday, July 22, 2015 12:36:53 PM > Subject: Re: [ceph-users] PGs going inconsistent after stopping the primary > > Cool, writing some objects to the affected PGs has stopped the > consistent/inconsistent cycle. I'll keep an eye on them but this seems > to have fixed the problem. > Thanks!! > Dan > > On Wed, Jul 22, 2015 at 6:07 PM, Samuel Just wrote: >> Looks like it's just a stat error. The primary appears to have the correct >> stats, but the replica for some reason doesn't (thinks there's an object for >> some reason). I bet it clears itself it you perform a write on the pg since >> the primary will send over its stats. We'd need information from when the >> stat error originally occurred to debug further. >> -Sam >> >> - Original Message - >> From: "Dan van der Ster" >> To: ceph-users@lists.ceph.com >> Sent: Wednesday, July 22, 2015 7:49:00 AM >> Subject: [ceph-users] PGs going inconsistent after stopping the primary >> >> Hi Ceph community, >> >> Env: hammer 0.94.2, Scientific Linux 6.6, kernel 2.6.32-431.5.1.el6.x86_64 >> >> We wanted to post here before the tracker to see if someone else has >> had this problem. >> >> We have a few PGs (different pools) which get marked inconsistent when >> we stop the primary OSD. The problem is strange because once we >> restart the primary, then scrub the PG, the PG is marked active+clean. >> But inevitably next time we stop the primary OSD, the same PG is >> marked inconsistent again. >> >> There is no user activity on this PG, and nothing interesting is >> logged in any of the 2nd/3rd OSDs (with debug_osd=20, the first line >> mentioning the PG already says inactive+inconsistent). >> >> >> We suspect this is related to garbage files left in the PG folder. One >> of our PGs is acting basically like above, except it goes through this >> cycle: active+clean -> (deep-scrub) -> active+clean+inconsistent -> >> (repair) -> active+clean -> (restart primary OSD) -> (deep-scrub) -> >> active+clean+inconsistent. This one at least logs: >> >> 2015-07-22 16:42:41.821326 osd.303 [INF] 55.10d deep-scrub starts >> 2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat >> mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 >> hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes. 
>> 2015-07-22 16:42:41.823842 osd.303 [ERR] 55.10d deep-scrub 1 errors >> >> and this should be debuggable because there is only one object in the pool: >> >> tapetest 55 0 073575G 1 >> >> even though rados ls returns no objects: >> >> # rados ls -p tapetest >> # >> >> Any ideas? >> >> Cheers, Dan >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] why are there "degraded" PGs when adding OSDs?
Hmm, that's odd. Can you attach the osdmap and ceph pg dump prior to the addition (with all pgs active+clean), then the osdmap and ceph pg dump afterwards? -Sam - Original Message - From: "Chad William Seys" To: "Samuel Just" , "ceph-users" Sent: Monday, July 27, 2015 12:57:23 PM Subject: Re: [ceph-users] why are there "degraded" PGs when adding OSDs? Hi Sam, > The pg might also be degraded right after a map change which changes the > up/acting sets since the few objects updated right before the map change > might be new on some replicas and old on the other replicas. While in that > state, those specific objects are degraded, and the pg would report > degraded until they are recovered (which would happen asap, prior to > backfilling the new replica). -Sam That sounds like only a few PGs should be degraded. I instead have about 45% (and higher earlier). # ceph -s cluster 7797e50e-f4b3-42f6-8454-2e2b19fa41d6 health HEALTH_WARN 2081 pgs backfill 6745 pgs degraded 17 pgs recovering 6728 pgs recovery_wait 6745 pgs stuck degraded 8826 pgs stuck unclean recovery 2530124/5557452 objects degraded (45.527%) recovery 33594/5557452 objects misplaced (0.604%) monmap e5: 3 mons at {mon01=128.104.164.197:6789/0,mon02=128.104.164.198:6789/0,mon03=10.128.198.51:6789/0} election epoch 16458, quorum 0,1,2 mon03,mon01,mon02 mdsmap e3032: 1/1/1 up {0=mds01.hep.wisc.edu=up:active} osdmap e149761: 27 osds: 27 up, 27 in; 2083 remapped pgs pgmap v13464928: 18432 pgs, 9 pools, 5401 GB data, 1364 kobjects 11122 GB used, 11786 GB / 22908 GB avail 2530124/5557452 objects degraded (45.527%) 33594/5557452 objects misplaced (0.604%) 9606 active+clean 6726 active+recovery_wait+degraded 2081 active+remapped+wait_backfill 17 active+recovering+degraded 2 active+recovery_wait+degraded+remapped recovery io 24861 kB/s, 6 objects/s Chad. > > - Original Message - > From: "Chad William Seys" > To: "ceph-users" > Sent: Monday, July 27, 2015 12:27:26 PM > Subject: [ceph-users] why are there "degraded" PGs when adding OSDs? > > Hi All, > > I recently added some OSDs to the Ceph cluster (0.94.2). I noticed that > 'ceph -s' reported both misplaced AND degraded PGs. > > Why should any PGs become degraded? Seems as though Ceph should only be > reporting misplaced PGs? > > From the Giant release notes: > Degraded vs misplaced: the Ceph health reports from ‘ceph -s’ and related > commands now make a distinction between data that is degraded (there are > fewer than the desired number of copies) and data that is misplaced (stored > in the wrong location in the cluster). The distinction is important because > the latter does not compromise data safety. > > Does Ceph delete some replicas of the PGs (leading to degradation) before > re- replicating on the new OSD? > > This does not seem to be the safest algorithm. > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] why are there "degraded" PGs when adding OSDs?
If it wouldn't be too much trouble, I'd actually like the binary osdmap as well (it contains the crushmap, but also a bunch of other stuff). There is a command that lets you get old osdmaps from the mon by epoch as long as they haven't been trimmed. -Sam - Original Message - From: "Chad William Seys" To: "Samuel Just" Cc: "ceph-users" Sent: Tuesday, July 28, 2015 7:40:31 AM Subject: Re: [ceph-users] why are there "degraded" PGs when adding OSDs? Hi Sam, Trying again today with crush tunables set to firefly. Degraded peaked around 46.8%. I've attached the ceph pg dump and the crushmap (same as osdmap) from before and after the OSD additions. 3 osds were added on host osd03. This added 5TB to about 17TB for a total of around 22TB. 5TB/22TB = 22.7% Is it expected for 46.8% of PGs to be degraded after adding 22% of the storage? Another weird thing is that the kernel RBD clients froze up after the OSDs were added, but worked fine after reboot. (Debian kernel 3.16.7) Thanks for checking! C. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
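That command is 'ceph osd getmap' with an explicit epoch; for example, for the epoch shown in the earlier status output (filename arbitrary):

    ceph osd stat                                 # shows the current osdmap epoch
    ceph osd getmap 149761 -o osdmap.149761.bin   # fetch a specific (older) epoch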
Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix
Hrm, that's certainly supposed to work. Can you make a bug? Be sure to note what version you are running (output of ceph-osd -v). -Sam On Mon, Aug 3, 2015 at 12:34 PM, Andras Pataki wrote: > Summary: I am having problems with inconsistent PG's that the 'ceph pg > repair' command does not fix. Below are the details. Any help would be > appreciated. > > # Find the inconsistent PG's > ~# ceph pg dump | grep inconsistent > dumped all in format plain > 2.439 42080 00 017279507143 31033103 active+clean+inconsistent2015-08-03 > 14:49:17.29288477323'2250145 77480:890566 [78,54]78 [78,54]78 > 77323'22501452015-08-03 14:49:17.29253877323'2250145 2015-08-03 > 14:49:17.292538 > 2.8b9 40830 00 016669590823 30513051 active+clean+inconsistent2015-08-03 > 14:46:05.14006377323'2249886 77473:897325 [7,72]7 [7,72]7 > 77323'22498862015-08-03 14:22:47.83406377323'2249886 2015-08-03 > 14:22:47.834063 > > # Look at the first one: > ~# ceph pg deep-scrub 2.439 > instructing pg 2.439 on osd.78 to deep-scrub > > # The logs of osd.78 show: > 2015-08-03 15:16:34.409738 7f09ec04a700 0 log_channel(cluster) log [INF] : > 2.439 deep-scrub starts > 2015-08-03 15:16:51.364229 7f09ec04a700 -1 log_channel(cluster) log [ERR] : > deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest > 0xb3d78a6e != 0xa3944ad0 > 2015-08-03 15:16:52.763977 7f09ec04a700 -1 log_channel(cluster) log [ERR] : > 2.439 deep-scrub 1 errors > > # Finding the object in question: > ~# find ~ceph/osd/ceph-78/current/2.439_head -name 1022d93.0f0c* -ls > 21510412310 4100 -rw-r--r-- 1 root root 4194304 Jun 30 17:09 > /var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1022d93.0f0c__head_B029E439__2 > ~# md5sum > /var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1022d93.0f0c__head_B029E439__2 > 4e4523244deec051cfe53dd48489a5db > /var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1022d93.0f0c__head_B029E439__2 > > # The object on the backup osd: > ~# find ~ceph/osd/ceph-54/current/2.439_head -name 1022d93.0f0c* -ls > 6442614367 4100 -rw-r--r-- 1 root root 4194304 Jun 30 17:09 > /var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1022d93.0f0c__head_B029E439__2 > ~# md5sum > /var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1022d93.0f0c__head_B029E439__2 > 4e4523244deec051cfe53dd48489a5db > /var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1022d93.0f0c__head_B029E439__2 > > # They don't seem to be different. > # When I try repair: > ~# ceph pg repair 2.439 > instructing pg 2.439 on osd.78 to repair > > # The osd.78 logs show: > 2015-08-03 15:19:21.775933 7f09ec04a700 0 log_channel(cluster) log [INF] : > 2.439 repair starts > 2015-08-03 15:19:38.088673 7f09ec04a700 -1 log_channel(cluster) log [ERR] : > repair 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest > 0xb3d78a6e != 0xa3944ad0 > 2015-08-03 15:19:39.958019 7f09ec04a700 -1 log_channel(cluster) log [ERR] : > 2.439 repair 1 errors, 0 fixed > 2015-08-03 15:19:39.962406 7f09ec04a700 0 log_channel(cluster) log [INF] : > 2.439 deep-scrub starts > 2015-08-03 15:19:56.510874 7f09ec04a700 -1 log_channel(cluster) log [ERR] : > deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest > 0xb3d78a6e != 0xa3944ad0 > 2015-08-03 15:19:58.348083 7f09ec04a700 -1 log_channel(cluster) log [ERR] : > 2.439 deep-scrub 1 errors > > The inconsistency is not fixed. Any hints of what should be done next? 
> I have tried a few things: > * Stop the primary osd, remove the object from the filesystem, restart the > OSD and issue a repair. It didn't work - it sais that one object is > missing, but did not copy it from the backup. > * I tried the same on the backup (remove the file) - it also didn't get > copied back from the primary in a repair. > > Any help would be appreciated. > > Thanks, > > Andras > apat...@simonsfoundation.org > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] C++11 and librados C++
It seems like it's about time for us to make the jump to C++11. This is probably going to have an impact on users of the librados C++ bindings. It seems like such users would have to recompile code using the librados C++ libraries after upgrading the librados library version. Is that reasonable? What do people expect here? -Sam ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Is it safe to increase pg number in a production environment
It will cause a large amount of data movement. Each new pg after the split will relocate. It might be ok if you do it slowly. Experiment on a test cluster. -Sam On Mon, Aug 3, 2015 at 12:57 AM, 乔建峰 wrote: > Hi Cephers, > > This is a greeting from Jevon. Currently, I'm experiencing an issue which > suffers me a lot, so I'm writing to ask for your comments/help/suggestions. > More details are provided bellow. > > Issue: > I set up a cluster having 24 OSDs and created one pool with 1024 placement > groups on it for a small startup company. The number 1024 was calculated per > the equation 'OSDs * 100'/pool size. The cluster have been running quite > well for a long time. But recently, our monitoring system always complains > that some disks' usage exceed 85%. I log into the system and find out that > some disks' usage are really very high, but some are not(less than 60%). > Each time when the issue happens, I have to manually re-balance the > distribution. This is a short-term solution, I'm not willing to do it all > the time. > > Two long-term solutions come in my mind, > 1) Ask the customers to expand their clusters by adding more OSDs. But I > think they will ask me to explain the reason of the imbalance data > distribution. We've already done some analysis on the environment, we > learned that the most imbalance part in the CRUSH is the mapping between > object and pg. The biggest pg has 613 objects, while the smallest pg only > has 226 objects. > > 2) Increase the number of placement groups. It can be of great help for > statistically uniform data distribution, but it can also incur significant > data movement as PGs are effective being split. I just cannot do it in our > customers' environment before we 100% understand the consequence. So anyone > did this under a production environment? How much does this operation affect > the performance of Clients? > > Any comments/help/suggestions will be highly appreciated. > > -- > Best Regards > Jevon > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
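A rough sketch of the slow approach, with placeholder numbers; each step adds a modest number of pgs, and the next step only starts once the cluster is healthy again:

    ceph osd pool set <pool> pg_num 1152     # small step up from the current 1024
    # wait until the new pgs are created and peered (watch ceph -s)
    ceph osd pool set <pool> pgp_num 1152    # now the data actually starts moving
    # wait for backfill to finish and the cluster to return to HEALTH_OK,
    # then repeat with the next increment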
Re: [ceph-users] Repair inconsistent pgs..
Is the number of inconsistent objects growing? Can you attach the whole ceph.log from the 6 hours before and after the snippet you linked above? Are you using cache/tiering? Can you attach the osdmap (ceph osd getmap -o )? -Sam On Tue, Aug 18, 2015 at 4:15 AM, Voloshanenko Igor wrote: > ceph - 0.94.2 > Its happen during rebalancing > > I thought too, that some OSD miss copy, but looks like all miss... > So any advice in which direction i need to go > > 2015-08-18 14:14 GMT+03:00 Gregory Farnum : >> >> From a quick peek it looks like some of the OSDs are missing clones of >> objects. I'm not sure how that could happen and I'd expect the pg >> repair to handle that but if it's not there's probably something >> wrong; what version of Ceph are you running? Sam, is this something >> you've seen, a new bug, or some kind of config issue? >> -Greg >> >> On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor >> wrote: >> > Hi all, at our production cluster, due high rebalancing ((( we have 2 >> > pgs in >> > inconsistent state... >> > >> > root@temp:~# ceph health detail | grep inc >> > HEALTH_ERR 2 pgs inconsistent; 18 scrub errors >> > pg 2.490 is active+clean+inconsistent, acting [56,15,29] >> > pg 2.c4 is active+clean+inconsistent, acting [56,10,42] >> > >> > From OSD logs, after recovery attempt: >> > >> > root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do >> > ceph pg repair ${i} ; done >> > dumped all in format plain >> > instructing pg 2.490 on osd.56 to repair >> > instructing pg 2.c4 on osd.56 to repair >> > >> > /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 >> > -1 >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >> > f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone >> > 90c59490/rbd_data.eb486436f2beb.7a65/141//2 >> > /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 >> > -1 >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >> > fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone >> > f5759490/rbd_data.1631755377d7e.04da/141//2 >> > /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 >> > -1 >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >> > a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone >> > fee49490/rbd_data.12483d3ba0794b.522f/141//2 >> > /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 >> > -1 >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >> > bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone >> > a9b39490/rbd_data.12483d3ba0794b.37b3/141//2 >> > /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 >> > -1 >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >> > 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone >> > bac19490/rbd_data.1238e82ae8944a.032e/141//2 >> > /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 >> > -1 >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >> > c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone >> > 98519490/rbd_data.123e9c2ae8944a.0807/141//2 >> > /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 >> > -1 >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >> > 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone >> > c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2 >> > /var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 >> > -1 >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >> > e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 
>> > 28809490/rbd_data.edea7460fe42b.01d9/141//2 >> > /var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 >> > -1 >> > log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors >> > >> > So, how i can solve "expected clone" situation by hand? >> > Thank in advance! >> > >> > >> > >> > ___ >> > ceph-users mailing list >> > ceph-users@lists.ceph.com >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
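A short sketch of gathering what is being asked for above; the log path on the monitor host and the exact time window are assumptions:

# Binary OSD map
~# ceph osd getmap -o /tmp/osdmap
# Cluster log covering roughly 6 hours either side of the scrub errors
~# awk '/2015-08-18 01:/,/2015-08-18 13:/' /var/log/ceph/ceph.log > /tmp/ceph-log-window.txt
# Whether cache tiering is in use shows up in the pool dump
~# ceph osd dump | grep pool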
Re: [ceph-users] Repair inconsistent pgs..
Also, what command are you using to take snapshots? -Sam On Tue, Aug 18, 2015 at 8:48 AM, Samuel Just wrote: > Is the number of inconsistent objects growing? Can you attach the > whole ceph.log from the 6 hours before and after the snippet you > linked above? Are you using cache/tiering? Can you attach the osdmap > (ceph osd getmap -o )? > -Sam > > On Tue, Aug 18, 2015 at 4:15 AM, Voloshanenko Igor > wrote: >> ceph - 0.94.2 >> Its happen during rebalancing >> >> I thought too, that some OSD miss copy, but looks like all miss... >> So any advice in which direction i need to go >> >> 2015-08-18 14:14 GMT+03:00 Gregory Farnum : >>> >>> From a quick peek it looks like some of the OSDs are missing clones of >>> objects. I'm not sure how that could happen and I'd expect the pg >>> repair to handle that but if it's not there's probably something >>> wrong; what version of Ceph are you running? Sam, is this something >>> you've seen, a new bug, or some kind of config issue? >>> -Greg >>> >>> On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor >>> wrote: >>> > Hi all, at our production cluster, due high rebalancing ((( we have 2 >>> > pgs in >>> > inconsistent state... >>> > >>> > root@temp:~# ceph health detail | grep inc >>> > HEALTH_ERR 2 pgs inconsistent; 18 scrub errors >>> > pg 2.490 is active+clean+inconsistent, acting [56,15,29] >>> > pg 2.c4 is active+clean+inconsistent, acting [56,10,42] >>> > >>> > From OSD logs, after recovery attempt: >>> > >>> > root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do >>> > ceph pg repair ${i} ; done >>> > dumped all in format plain >>> > instructing pg 2.490 on osd.56 to repair >>> > instructing pg 2.c4 on osd.56 to repair >>> > >>> > /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 >>> > -1 >>> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >>> > f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone >>> > 90c59490/rbd_data.eb486436f2beb.7a65/141//2 >>> > /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 >>> > -1 >>> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >>> > fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone >>> > f5759490/rbd_data.1631755377d7e.04da/141//2 >>> > /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 >>> > -1 >>> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >>> > a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone >>> > fee49490/rbd_data.12483d3ba0794b.522f/141//2 >>> > /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 >>> > -1 >>> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >>> > bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone >>> > a9b39490/rbd_data.12483d3ba0794b.37b3/141//2 >>> > /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 >>> > -1 >>> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >>> > 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone >>> > bac19490/rbd_data.1238e82ae8944a.032e/141//2 >>> > /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 >>> > -1 >>> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >>> > c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone >>> > 98519490/rbd_data.123e9c2ae8944a.0807/141//2 >>> > /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 >>> > -1 >>> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >>> > 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone >>> > c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2 >>> > 
/var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 >>> > -1 >>> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >>> > e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone >>> > 28809490/rbd_data.edea7460fe42b.01d9/141//2 >>> > /var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 >>> > -1 >>> > log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors >>> > >>> > So, how i can solve "expected clone" situation by hand? >>> > Thank in advance! >>> > >>> > >>> > >>> > ___ >>> > ceph-users mailing list >>> > ceph-users@lists.ceph.com >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> > >> >> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] any recommendation of using EnhanceIO?
1. We've kicked this around a bit. What kind of failure semantics would you be comfortable with here (that is, what would be reasonable behavior if the client side cache fails)? 2. We've got a branch which should merge soon (tomorrow probably) which actually does allow writes to be proxied, so that should alleviate some of these pain points somewhat. I'm not sure it is clever enough to allow through writefulls for an ec base tier though (but it would be a good idea!) -Sam On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk wrote: > > > > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Mark Nelson >> Sent: 18 August 2015 18:51 >> To: Nick Fisk ; 'Jan Schermer' >> Cc: ceph-users@lists.ceph.com >> Subject: Re: [ceph-users] any recommendation of using EnhanceIO? >> >> >> >> On 08/18/2015 11:52 AM, Nick Fisk wrote: >> > >> >> Here's kind of how I see the field right now: >> >> 1) Cache at the client level. Likely fastest but obvious issues like > above. >> RAID1 might be an option at increased cost. Lack of barriers in >> some implementations scary. >> >>> >> >>> Agreed. >> >>> >> >> 2) Cache below the OSD. Not much recent data on this. Not likely >> as fast as client side cache, but likely cheaper (fewer OSD nodes >> than client >> >> nodes?). >> Lack of barriers in some implementations scary. >> >>> >> >>> This also has the benefit of caching the leveldb on the OSD, so get >> >>> a big >> >> performance gain from there too for small sequential writes. I looked >> >> at using Flashcache for this too but decided it was adding to much >> >> complexity and risk. >> >>> >> >>> I thought I read somewhere that RocksDB allows you to move its WAL >> >>> to >> >> SSD, is there anything in the pipeline for something like moving the >> >> filestore to use RocksDB? >> >> >> >> I believe you can already do this, though I haven't tested it. You >> >> can certainly move the monitors to rocksdb (tested) and newstore uses >> rocksdb as well. >> >> >> > >> > Interesting, I might have a look into this. >> > >> >>> >> >> 3) Ceph Cache Tiering. Network overhead and write amplification on >> promotion makes this primarily useful when workloads fit mostly >> into the cache tier. Overall safe design but care must be taken to >> not over- >> >> promote. >> >> 4) separate SSD pool. Manual and not particularly flexible, but >> perhaps >> >> best >> for applications that need consistently high performance. >> >>> >> >>> I think it depends on the definition of performance. Currently even >> >>> very >> >> fast CPU's and SSD's in their own pool will still struggle to get >> >> less than 1ms of write latency. If your performance requirements are >> >> for large queue depths then you will probably be alright. If you >> >> require something that mirrors the performance of traditional write >> >> back cache, then even pure SSD Pools can start to struggle. >> >> >> >> Agreed. This is definitely the crux of the problem. The example >> >> below is a great start! It'd would be fantastic if we could get more >> >> feedback from the list on the relative importance of low latency >> >> operations vs high IOPS through concurrency. We have general >> >> suspicions but not a ton of actual data regarding what folks are >> >> seeing in practice and under what scenarios. >> >> >> > >> > If you have any specific questions that you think I might be able to > answer, >> please let me know. 
The only other main app that I can really think of > where >> these sort of write latency is critical is SQL, particularly the > transaction logs. >> >> Probably the big question is what are the pain points? The most common >> answer we get when asking folks what applications they run on top of Ceph >> is "everything!". This is wonderful, but not helpful when trying to > figure out >> what performance issues matter most! :) > > Sort of like someone telling you their pc is broken and when asked for > details getting "It's not working" in return. > > In general I think a lot of it comes down to people not appreciating the > differences between Ceph and say a Raid array. For most things like larger > block IO performance tends to scale with cluster size and the cost > effectiveness of Ceph makes this a no brainer not to just add a handful of > extra OSD's. > > I will try and be more precise. Here is my list of pain points / wishes that > I have come across in the last 12 months of running Ceph. > > 1. Improve small IO write latency > As discussed in depth in this thread. If it's possible just to make Ceph a > lot faster then great, but I fear even a doubling in performance will still > fall short compared to if you are caching writes at the client. Most things > in Ceph tend to improve with scale, but write latency is the same with 2 > OSD's as it is with 2000. I would urge some sort of investigation into t
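For the write-latency point above, one way to put a number on the client-visible round trip is a queue-depth-1 small-write rados bench against the pool in question. The pool name and runtime are only examples, and this measures raw object writes rather than RBD, but it isolates the cluster-side latency being discussed:

# 30 seconds of 4 KB writes at queue depth 1; the average latency figure is
# the one relevant to this discussion
$ rados bench -p rbd 30 write -b 4096 -t 1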
Re: [ceph-users] Repair inconsistent pgs..
Ok, you appear to be using a replicated cache tier in front of a replicated base tier. Please scrub both inconsistent pgs and post the ceph.log from before when you started the scrub until after. Also, what command are you using to take snapshots? -Sam On Thu, Aug 20, 2015 at 3:59 AM, Voloshanenko Igor wrote: > Hi Samuel, we try to fix it in trick way. > > we check all rbd_data chunks from logs (OSD) which are affected, then query > rbd info to compare which rbd consist bad rbd_data, after that we mount this > rbd as rbd0, create empty rbd, and DD all info from bad volume to new one. > > But after that - scrub errors growing... Was 15 errors.. .Now 35... We laos > try to out OSD which was lead, but after rebalancing this 2 pgs still have > 35 scrub errors... > > ceph osd getmap -o - attached > > > 2015-08-18 18:48 GMT+03:00 Samuel Just : >> >> Is the number of inconsistent objects growing? Can you attach the >> whole ceph.log from the 6 hours before and after the snippet you >> linked above? Are you using cache/tiering? Can you attach the osdmap >> (ceph osd getmap -o )? >> -Sam >> >> On Tue, Aug 18, 2015 at 4:15 AM, Voloshanenko Igor >> wrote: >> > ceph - 0.94.2 >> > Its happen during rebalancing >> > >> > I thought too, that some OSD miss copy, but looks like all miss... >> > So any advice in which direction i need to go >> > >> > 2015-08-18 14:14 GMT+03:00 Gregory Farnum : >> >> >> >> From a quick peek it looks like some of the OSDs are missing clones of >> >> objects. I'm not sure how that could happen and I'd expect the pg >> >> repair to handle that but if it's not there's probably something >> >> wrong; what version of Ceph are you running? Sam, is this something >> >> you've seen, a new bug, or some kind of config issue? >> >> -Greg >> >> >> >> On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor >> >> wrote: >> >> > Hi all, at our production cluster, due high rebalancing ((( we have 2 >> >> > pgs in >> >> > inconsistent state... 
>> >> > >> >> > root@temp:~# ceph health detail | grep inc >> >> > HEALTH_ERR 2 pgs inconsistent; 18 scrub errors >> >> > pg 2.490 is active+clean+inconsistent, acting [56,15,29] >> >> > pg 2.c4 is active+clean+inconsistent, acting [56,10,42] >> >> > >> >> > From OSD logs, after recovery attempt: >> >> > >> >> > root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; >> >> > do >> >> > ceph pg repair ${i} ; done >> >> > dumped all in format plain >> >> > instructing pg 2.490 on osd.56 to repair >> >> > instructing pg 2.c4 on osd.56 to repair >> >> > >> >> > /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 >> >> > 7f94663b3700 >> >> > -1 >> >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >> >> > f5759490/rbd_data.1631755377d7e.04da/head//2 expected >> >> > clone >> >> > 90c59490/rbd_data.eb486436f2beb.7a65/141//2 >> >> > /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 >> >> > 7f94663b3700 >> >> > -1 >> >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >> >> > fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected >> >> > clone >> >> > f5759490/rbd_data.1631755377d7e.04da/141//2 >> >> > /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 >> >> > 7f94663b3700 >> >> > -1 >> >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >> >> > a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected >> >> > clone >> >> > fee49490/rbd_data.12483d3ba0794b.522f/141//2 >> >> > /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 >> >> > 7f94663b3700 >> >> > -1 >> >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >> >> > bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected >> >> > clone >> >> > a9b39490/rbd_data.12483d3ba0794b.37b3/141//2 >> >> > /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 >> >> > 7f94663b3700 >> >> > -1 >> >> > log_channel(cluster) log [ERR] : dee
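The copy trick described above amounts to something like the following, sketched with made-up image names; the /dev/rbdX numbering is whatever the kernel hands out:

# Map the affected image and an empty image of the same size
~# rbd map cold-storage/broken-image        # say it shows up as /dev/rbd0
~# rbd create cold-storage/new-image --size 10240
~# rbd map cold-storage/new-image           # say it shows up as /dev/rbd1
# Raw block copy, then unmap both
~# dd if=/dev/rbd0 of=/dev/rbd1 bs=4M
~# rbd unmap /dev/rbd0
~# rbd unmap /dev/rbd1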
Re: [ceph-users] Repair inconsistent pgs..
Also, was there at any point a power failure/power cycle event, perhaps on osd 56? -Sam On Thu, Aug 20, 2015 at 9:23 AM, Samuel Just wrote: > Ok, you appear to be using a replicated cache tier in front of a > replicated base tier. Please scrub both inconsistent pgs and post the > ceph.log from before when you started the scrub until after. Also, > what command are you using to take snapshots? > -Sam > > On Thu, Aug 20, 2015 at 3:59 AM, Voloshanenko Igor > wrote: >> Hi Samuel, we try to fix it in trick way. >> >> we check all rbd_data chunks from logs (OSD) which are affected, then query >> rbd info to compare which rbd consist bad rbd_data, after that we mount this >> rbd as rbd0, create empty rbd, and DD all info from bad volume to new one. >> >> But after that - scrub errors growing... Was 15 errors.. .Now 35... We laos >> try to out OSD which was lead, but after rebalancing this 2 pgs still have >> 35 scrub errors... >> >> ceph osd getmap -o - attached >> >> >> 2015-08-18 18:48 GMT+03:00 Samuel Just : >>> >>> Is the number of inconsistent objects growing? Can you attach the >>> whole ceph.log from the 6 hours before and after the snippet you >>> linked above? Are you using cache/tiering? Can you attach the osdmap >>> (ceph osd getmap -o )? >>> -Sam >>> >>> On Tue, Aug 18, 2015 at 4:15 AM, Voloshanenko Igor >>> wrote: >>> > ceph - 0.94.2 >>> > Its happen during rebalancing >>> > >>> > I thought too, that some OSD miss copy, but looks like all miss... >>> > So any advice in which direction i need to go >>> > >>> > 2015-08-18 14:14 GMT+03:00 Gregory Farnum : >>> >> >>> >> From a quick peek it looks like some of the OSDs are missing clones of >>> >> objects. I'm not sure how that could happen and I'd expect the pg >>> >> repair to handle that but if it's not there's probably something >>> >> wrong; what version of Ceph are you running? Sam, is this something >>> >> you've seen, a new bug, or some kind of config issue? >>> >> -Greg >>> >> >>> >> On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor >>> >> wrote: >>> >> > Hi all, at our production cluster, due high rebalancing ((( we have 2 >>> >> > pgs in >>> >> > inconsistent state... 
>>> >> > >>> >> > root@temp:~# ceph health detail | grep inc >>> >> > HEALTH_ERR 2 pgs inconsistent; 18 scrub errors >>> >> > pg 2.490 is active+clean+inconsistent, acting [56,15,29] >>> >> > pg 2.c4 is active+clean+inconsistent, acting [56,10,42] >>> >> > >>> >> > From OSD logs, after recovery attempt: >>> >> > >>> >> > root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; >>> >> > do >>> >> > ceph pg repair ${i} ; done >>> >> > dumped all in format plain >>> >> > instructing pg 2.490 on osd.56 to repair >>> >> > instructing pg 2.c4 on osd.56 to repair >>> >> > >>> >> > /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 >>> >> > 7f94663b3700 >>> >> > -1 >>> >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >>> >> > f5759490/rbd_data.1631755377d7e.04da/head//2 expected >>> >> > clone >>> >> > 90c59490/rbd_data.eb486436f2beb.7a65/141//2 >>> >> > /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 >>> >> > 7f94663b3700 >>> >> > -1 >>> >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >>> >> > fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected >>> >> > clone >>> >> > f5759490/rbd_data.1631755377d7e.04da/141//2 >>> >> > /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 >>> >> > 7f94663b3700 >>> >> > -1 >>> >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >>> >> > a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected >>> >> > clone >>> >> > fee49490/rbd_data.12483d3ba0794b.522f/141//2 >>> >> > /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 >>> >> > 7f94663b37
Re: [ceph-users] Repair inconsistent pgs..
What was the issue? -Sam On Thu, Aug 20, 2015 at 9:41 AM, Voloshanenko Igor wrote: > Samuel, we turned off cache layer few hours ago... > I will post ceph.log in few minutes > > For snap - we found issue, was connected with cache tier.. > > 2015-08-20 19:23 GMT+03:00 Samuel Just : >> >> Ok, you appear to be using a replicated cache tier in front of a >> replicated base tier. Please scrub both inconsistent pgs and post the >> ceph.log from before when you started the scrub until after. Also, >> what command are you using to take snapshots? >> -Sam >> >> On Thu, Aug 20, 2015 at 3:59 AM, Voloshanenko Igor >> wrote: >> > Hi Samuel, we try to fix it in trick way. >> > >> > we check all rbd_data chunks from logs (OSD) which are affected, then >> > query >> > rbd info to compare which rbd consist bad rbd_data, after that we mount >> > this >> > rbd as rbd0, create empty rbd, and DD all info from bad volume to new >> > one. >> > >> > But after that - scrub errors growing... Was 15 errors.. .Now 35... We >> > laos >> > try to out OSD which was lead, but after rebalancing this 2 pgs still >> > have >> > 35 scrub errors... >> > >> > ceph osd getmap -o - attached >> > >> > >> > 2015-08-18 18:48 GMT+03:00 Samuel Just : >> >> >> >> Is the number of inconsistent objects growing? Can you attach the >> >> whole ceph.log from the 6 hours before and after the snippet you >> >> linked above? Are you using cache/tiering? Can you attach the osdmap >> >> (ceph osd getmap -o )? >> >> -Sam >> >> >> >> On Tue, Aug 18, 2015 at 4:15 AM, Voloshanenko Igor >> >> wrote: >> >> > ceph - 0.94.2 >> >> > Its happen during rebalancing >> >> > >> >> > I thought too, that some OSD miss copy, but looks like all miss... >> >> > So any advice in which direction i need to go >> >> > >> >> > 2015-08-18 14:14 GMT+03:00 Gregory Farnum : >> >> >> >> >> >> From a quick peek it looks like some of the OSDs are missing clones >> >> >> of >> >> >> objects. I'm not sure how that could happen and I'd expect the pg >> >> >> repair to handle that but if it's not there's probably something >> >> >> wrong; what version of Ceph are you running? Sam, is this something >> >> >> you've seen, a new bug, or some kind of config issue? >> >> >> -Greg >> >> >> >> >> >> On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor >> >> >> wrote: >> >> >> > Hi all, at our production cluster, due high rebalancing ((( we >> >> >> > have 2 >> >> >> > pgs in >> >> >> > inconsistent state... 
>> >> >> > >> >> >> > root@temp:~# ceph health detail | grep inc >> >> >> > HEALTH_ERR 2 pgs inconsistent; 18 scrub errors >> >> >> > pg 2.490 is active+clean+inconsistent, acting [56,15,29] >> >> >> > pg 2.c4 is active+clean+inconsistent, acting [56,10,42] >> >> >> > >> >> >> > From OSD logs, after recovery attempt: >> >> >> > >> >> >> > root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read >> >> >> > i; >> >> >> > do >> >> >> > ceph pg repair ${i} ; done >> >> >> > dumped all in format plain >> >> >> > instructing pg 2.490 on osd.56 to repair >> >> >> > instructing pg 2.c4 on osd.56 to repair >> >> >> > >> >> >> > /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 >> >> >> > 7f94663b3700 >> >> >> > -1 >> >> >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >> >> >> > f5759490/rbd_data.1631755377d7e.04da/head//2 expected >> >> >> > clone >> >> >> > 90c59490/rbd_data.eb486436f2beb.7a65/141//2 >> >> >> > /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 >> >> >> > 7f94663b3700 >> >> >> > -1 >> >> >> > log_channel(cluster) log [ERR] : deep-scrub 2.490 >> >> >> > fee49490/rbd_data.12483d3ba0794b.
Re: [ceph-users] Repair inconsistent pgs..
Is there a bug for this in the tracker? -Sam On Thu, Aug 20, 2015 at 9:54 AM, Voloshanenko Igor wrote: > Issue, that in forward mode, fstrim doesn't work proper, and when we take > snapshot - data not proper update in cache layer, and client (ceph) see > damaged snap.. As headers requested from cache layer. > > 2015-08-20 19:53 GMT+03:00 Samuel Just : >> >> What was the issue? >> -Sam >> >> On Thu, Aug 20, 2015 at 9:41 AM, Voloshanenko Igor >> wrote: >> > Samuel, we turned off cache layer few hours ago... >> > I will post ceph.log in few minutes >> > >> > For snap - we found issue, was connected with cache tier.. >> > >> > 2015-08-20 19:23 GMT+03:00 Samuel Just : >> >> >> >> Ok, you appear to be using a replicated cache tier in front of a >> >> replicated base tier. Please scrub both inconsistent pgs and post the >> >> ceph.log from before when you started the scrub until after. Also, >> >> what command are you using to take snapshots? >> >> -Sam >> >> >> >> On Thu, Aug 20, 2015 at 3:59 AM, Voloshanenko Igor >> >> wrote: >> >> > Hi Samuel, we try to fix it in trick way. >> >> > >> >> > we check all rbd_data chunks from logs (OSD) which are affected, then >> >> > query >> >> > rbd info to compare which rbd consist bad rbd_data, after that we >> >> > mount >> >> > this >> >> > rbd as rbd0, create empty rbd, and DD all info from bad volume to new >> >> > one. >> >> > >> >> > But after that - scrub errors growing... Was 15 errors.. .Now 35... >> >> > We >> >> > laos >> >> > try to out OSD which was lead, but after rebalancing this 2 pgs still >> >> > have >> >> > 35 scrub errors... >> >> > >> >> > ceph osd getmap -o - attached >> >> > >> >> > >> >> > 2015-08-18 18:48 GMT+03:00 Samuel Just : >> >> >> >> >> >> Is the number of inconsistent objects growing? Can you attach the >> >> >> whole ceph.log from the 6 hours before and after the snippet you >> >> >> linked above? Are you using cache/tiering? Can you attach the >> >> >> osdmap >> >> >> (ceph osd getmap -o )? >> >> >> -Sam >> >> >> >> >> >> On Tue, Aug 18, 2015 at 4:15 AM, Voloshanenko Igor >> >> >> wrote: >> >> >> > ceph - 0.94.2 >> >> >> > Its happen during rebalancing >> >> >> > >> >> >> > I thought too, that some OSD miss copy, but looks like all miss... >> >> >> > So any advice in which direction i need to go >> >> >> > >> >> >> > 2015-08-18 14:14 GMT+03:00 Gregory Farnum : >> >> >> >> >> >> >> >> From a quick peek it looks like some of the OSDs are missing >> >> >> >> clones >> >> >> >> of >> >> >> >> objects. I'm not sure how that could happen and I'd expect the pg >> >> >> >> repair to handle that but if it's not there's probably something >> >> >> >> wrong; what version of Ceph are you running? Sam, is this >> >> >> >> something >> >> >> >> you've seen, a new bug, or some kind of config issue? >> >> >> >> -Greg >> >> >> >> >> >> >> >> On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor >> >> >> >> wrote: >> >> >> >> > Hi all, at our production cluster, due high rebalancing ((( we >> >> >> >> > have 2 >> >> >> >> > pgs in >> >> >> >> > inconsistent state... >> >> >> >> > >> >> >> >> > root@temp:~# ceph health detail | grep inc >> >> >> >> > HEALTH_ERR 2 pgs inconsistent; 18 scrub errors >> >> >> >> > pg 2.490 is active+clean+inconsistent, acting [56,15,29] >> >> >> >> > pg 2.c4 is active+clean+inconsistent, acting [56,10,42] >> >> >> >> > >> >> >> >> > From OSD logs, after recovery attempt: >> >> >> >> > >> >> >> >
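For context on the cache-mode change mentioned above, the knobs involved are roughly the following; the cache pool name is an example, and flushing everything out of a busy tier can take a long time:

# Stop new writes landing in the cache tier
~# ceph osd tier cache-mode cache-pool forward
# Flush and evict whatever the cache tier still holds
~# rados -p cache-pool cache-flush-evict-all
# Return to writeback once the tier is drained
~# ceph osd tier cache-mode cache-pool writeback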
Re: [ceph-users] Repair inconsistent pgs..
Which docs? -Sam On Thu, Aug 20, 2015 at 9:57 AM, Voloshanenko Igor wrote: > Not yet. I will create. > But according to mail lists and Inktank docs - it's expected behaviour when > cache enable > > 2015-08-20 19:56 GMT+03:00 Samuel Just : >> >> Is there a bug for this in the tracker? >> -Sam >> >> On Thu, Aug 20, 2015 at 9:54 AM, Voloshanenko Igor >> wrote: >> > Issue, that in forward mode, fstrim doesn't work proper, and when we >> > take >> > snapshot - data not proper update in cache layer, and client (ceph) see >> > damaged snap.. As headers requested from cache layer. >> > >> > 2015-08-20 19:53 GMT+03:00 Samuel Just : >> >> >> >> What was the issue? >> >> -Sam >> >> >> >> On Thu, Aug 20, 2015 at 9:41 AM, Voloshanenko Igor >> >> wrote: >> >> > Samuel, we turned off cache layer few hours ago... >> >> > I will post ceph.log in few minutes >> >> > >> >> > For snap - we found issue, was connected with cache tier.. >> >> > >> >> > 2015-08-20 19:23 GMT+03:00 Samuel Just : >> >> >> >> >> >> Ok, you appear to be using a replicated cache tier in front of a >> >> >> replicated base tier. Please scrub both inconsistent pgs and post >> >> >> the >> >> >> ceph.log from before when you started the scrub until after. Also, >> >> >> what command are you using to take snapshots? >> >> >> -Sam >> >> >> >> >> >> On Thu, Aug 20, 2015 at 3:59 AM, Voloshanenko Igor >> >> >> wrote: >> >> >> > Hi Samuel, we try to fix it in trick way. >> >> >> > >> >> >> > we check all rbd_data chunks from logs (OSD) which are affected, >> >> >> > then >> >> >> > query >> >> >> > rbd info to compare which rbd consist bad rbd_data, after that we >> >> >> > mount >> >> >> > this >> >> >> > rbd as rbd0, create empty rbd, and DD all info from bad volume to >> >> >> > new >> >> >> > one. >> >> >> > >> >> >> > But after that - scrub errors growing... Was 15 errors.. .Now >> >> >> > 35... >> >> >> > We >> >> >> > laos >> >> >> > try to out OSD which was lead, but after rebalancing this 2 pgs >> >> >> > still >> >> >> > have >> >> >> > 35 scrub errors... >> >> >> > >> >> >> > ceph osd getmap -o - attached >> >> >> > >> >> >> > >> >> >> > 2015-08-18 18:48 GMT+03:00 Samuel Just : >> >> >> >> >> >> >> >> Is the number of inconsistent objects growing? Can you attach >> >> >> >> the >> >> >> >> whole ceph.log from the 6 hours before and after the snippet you >> >> >> >> linked above? Are you using cache/tiering? Can you attach the >> >> >> >> osdmap >> >> >> >> (ceph osd getmap -o )? >> >> >> >> -Sam >> >> >> >> >> >> >> >> On Tue, Aug 18, 2015 at 4:15 AM, Voloshanenko Igor >> >> >> >> wrote: >> >> >> >> > ceph - 0.94.2 >> >> >> >> > Its happen during rebalancing >> >> >> >> > >> >> >> >> > I thought too, that some OSD miss copy, but looks like all >> >> >> >> > miss... >> >> >> >> > So any advice in which direction i need to go >> >> >> >> > >> >> >> >> > 2015-08-18 14:14 GMT+03:00 Gregory Farnum : >> >> >> >> >> >> >> >> >> >> From a quick peek it looks like some of the OSDs are missing >> >> >> >> >> clones >> >> >> >> >> of >> >> >> >> >> objects. I'm not sure how that could happen and I'd expect the >> >> >> >> >> pg >> >> >> >> >> repair to handle that but if it's not there's probably >> >> >> >> >> something >> >> >>
Re: [ceph-users] Repair inconsistent pgs..
Ah, this is kind of silly. I think you don't have 37 errors, but 2 errors. pg 2.490 object 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 is missing snap 141. If you look at the objects after that in the log: 2015-08-20 20:15:44.865670 osd.19 10.12.2.6:6838/1861727 298 : cluster [ERR] repair 2.490 68c89490/rbd_data.16796a3d1b58ba.0047/head//2 expected clone 2d7b9490/rbd_data.18f92c3d1b58ba.6167/141//2 2015-08-20 20:15:44.865817 osd.19 10.12.2.6:6838/1861727 299 : cluster [ERR] repair 2.490 ded49490/rbd_data.11a25c7934d3d4.8a8a/head//2 expected clone 68c89490/rbd_data.16796a3d1b58ba.0047/141//2 The clone from the second line matches the head object from the previous line, and they have the same clone id. I *think* that the first error is real, and the subsequent ones are just scrub being dumb. Same deal with pg 2.c4. I just opened http://tracker.ceph.com/issues/12738. The original problem is that 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 and 22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir//2 are both missing a clone. Not sure how that happened, my money is on a cache/tiering evict racing with a snap trim. If you have any logging or relevant information from when that happened, you should open a bug. The 'snapdir' in the two object names indicates that the head object has actually been deleted (which makes sense if you moved the image to a new image and deleted the old one) and is only being kept around since there are live snapshots. I suggest you leave the snapshots for those images alone for the time being -- removing them might cause the osd to crash trying to clean up the wierd on disk state. Other than the leaked space from those two image snapshots and the annoying spurious scrub errors, I think no actual corruption is going on though. I created a tracker ticket for a feature that would let ceph-objectstore-tool remove the spurious clone from the head/snapdir metadata. Am I right that you haven't actually seen any osd crashes or user visible corruption (except possibly on snapshots of those two images)? -Sam On Thu, Aug 20, 2015 at 10:07 AM, Voloshanenko Igor wrote: > Inktank: > https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf > > Mail-list: > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg18338.html > > 2015-08-20 20:06 GMT+03:00 Samuel Just : >> >> Which docs? >> -Sam >> >> On Thu, Aug 20, 2015 at 9:57 AM, Voloshanenko Igor >> wrote: >> > Not yet. I will create. >> > But according to mail lists and Inktank docs - it's expected behaviour >> > when >> > cache enable >> > >> > 2015-08-20 19:56 GMT+03:00 Samuel Just : >> >> >> >> Is there a bug for this in the tracker? >> >> -Sam >> >> >> >> On Thu, Aug 20, 2015 at 9:54 AM, Voloshanenko Igor >> >> wrote: >> >> > Issue, that in forward mode, fstrim doesn't work proper, and when we >> >> > take >> >> > snapshot - data not proper update in cache layer, and client (ceph) >> >> > see >> >> > damaged snap.. As headers requested from cache layer. >> >> > >> >> > 2015-08-20 19:53 GMT+03:00 Samuel Just : >> >> >> >> >> >> What was the issue? >> >> >> -Sam >> >> >> >> >> >> On Thu, Aug 20, 2015 at 9:41 AM, Voloshanenko Igor >> >> >> wrote: >> >> >> > Samuel, we turned off cache layer few hours ago... >> >> >> > I will post ceph.log in few minutes >> >> >> > >> >> >> > For snap - we found issue, was connected with cache tier.. 
>> >> >> > >> >> >> > 2015-08-20 19:23 GMT+03:00 Samuel Just : >> >> >> >> >> >> >> >> Ok, you appear to be using a replicated cache tier in front of a >> >> >> >> replicated base tier. Please scrub both inconsistent pgs and >> >> >> >> post >> >> >> >> the >> >> >> >> ceph.log from before when you started the scrub until after. >> >> >> >> Also, >> >> >> >> what command are you using to take snapshots? >> >> >> >> -Sam >> >> >> >> >> >> >> >> On Thu, Aug 20, 2015 at 3:59 AM, Voloshanenko Igor >> >> >> >> wrote: >> >> >> >> > Hi Samuel, we try to fix it in trick way. >> >> >> >> > >> >> >
Re: [ceph-users] Repair inconsistent pgs..
The feature bug for the tool is http://tracker.ceph.com/issues/12740. -Sam On Thu, Aug 20, 2015 at 2:52 PM, Samuel Just wrote: > Ah, this is kind of silly. I think you don't have 37 errors, but 2 > errors. pg 2.490 object > 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 is missing > snap 141. If you look at the objects after that in the log: > > 2015-08-20 20:15:44.865670 osd.19 10.12.2.6:6838/1861727 298 : cluster > [ERR] repair 2.490 > 68c89490/rbd_data.16796a3d1b58ba.0047/head//2 expected > clone 2d7b9490/rbd_data.18f92c3d1b58ba.6167/141//2 > 2015-08-20 20:15:44.865817 osd.19 10.12.2.6:6838/1861727 299 : cluster > [ERR] repair 2.490 > ded49490/rbd_data.11a25c7934d3d4.8a8a/head//2 expected > clone 68c89490/rbd_data.16796a3d1b58ba.0047/141//2 > > The clone from the second line matches the head object from the > previous line, and they have the same clone id. I *think* that the > first error is real, and the subsequent ones are just scrub being > dumb. Same deal with pg 2.c4. I just opened > http://tracker.ceph.com/issues/12738. > > The original problem is that > 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 and > 22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir//2 are both > missing a clone. Not sure how that happened, my money is on a > cache/tiering evict racing with a snap trim. If you have any logging > or relevant information from when that happened, you should open a > bug. The 'snapdir' in the two object names indicates that the head > object has actually been deleted (which makes sense if you moved the > image to a new image and deleted the old one) and is only being kept > around since there are live snapshots. I suggest you leave the > snapshots for those images alone for the time being -- removing them > might cause the osd to crash trying to clean up the wierd on disk > state. Other than the leaked space from those two image snapshots and > the annoying spurious scrub errors, I think no actual corruption is > going on though. I created a tracker ticket for a feature that would > let ceph-objectstore-tool remove the spurious clone from the > head/snapdir metadata. > > Am I right that you haven't actually seen any osd crashes or user > visible corruption (except possibly on snapshots of those two images)? > -Sam > > On Thu, Aug 20, 2015 at 10:07 AM, Voloshanenko Igor > wrote: >> Inktank: >> https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf >> >> Mail-list: >> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg18338.html >> >> 2015-08-20 20:06 GMT+03:00 Samuel Just : >>> >>> Which docs? >>> -Sam >>> >>> On Thu, Aug 20, 2015 at 9:57 AM, Voloshanenko Igor >>> wrote: >>> > Not yet. I will create. >>> > But according to mail lists and Inktank docs - it's expected behaviour >>> > when >>> > cache enable >>> > >>> > 2015-08-20 19:56 GMT+03:00 Samuel Just : >>> >> >>> >> Is there a bug for this in the tracker? >>> >> -Sam >>> >> >>> >> On Thu, Aug 20, 2015 at 9:54 AM, Voloshanenko Igor >>> >> wrote: >>> >> > Issue, that in forward mode, fstrim doesn't work proper, and when we >>> >> > take >>> >> > snapshot - data not proper update in cache layer, and client (ceph) >>> >> > see >>> >> > damaged snap.. As headers requested from cache layer. >>> >> > >>> >> > 2015-08-20 19:53 GMT+03:00 Samuel Just : >>> >> >> >>> >> >> What was the issue? >>> >> >> -Sam >>> >> >> >>> >> >> On Thu, Aug 20, 2015 at 9:41 AM, Voloshanenko Igor >>> >> >> wrote: >>> >> >> > Samuel, we turned off cache layer few hours ago... 
>>> >> >> > I will post ceph.log in few minutes >>> >> >> > >>> >> >> > For snap - we found issue, was connected with cache tier.. >>> >> >> > >>> >> >> > 2015-08-20 19:23 GMT+03:00 Samuel Just : >>> >> >> >> >>> >> >> >> Ok, you appear to be using a replicated cache tier in front of a >>> >> >> >> replicated base tier. Please scrub both inconsistent pgs and >>> >> >> >> post >>> >> >> >> the >>> >> >
Re: [ceph-users] Repair inconsistent pgs..
Actually, now that I think about it, you probably didn't remove the images for 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 and 22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir//2, but other images (that's why the scrub errors went down briefly, those objects -- which were fine -- went away). You might want to export and reimport those two images into new images, but leave the old ones alone until you can clean up the on disk state (image and snapshots) and clear the scrub errors. You probably don't want to read the snapshots for those images either. Everything else is, I think, harmless. The ceph-objectstore-tool feature would probably not be too hard, actually. Each head/snapdir image has two attrs (possibly stored in leveldb -- that's why you want to modify the ceph-objectstore-tool and use its interfaces rather than mucking about with the files directly) '_' and 'snapset' which contain encoded representations of object_info_t and SnapSet (both can be found in src/osd/osd_types.h). SnapSet has a set of clones and related metadata -- you want to read the SnapSet attr off disk and commit a transaction writing out a new version with that clone removed. I'd start by cloning the repo, starting a vstart cluster locally, and reproducing the issue. Next, get familiar with using ceph-objectstore-tool on the osds in that vstart cluster. A good first change would be creating a ceph-objectstore-tool op that lets you dump json for the object_info_t and SnapSet (both types have format() methods which make that easy) on an object to stdout so you can confirm what's actually there. oftc #ceph-devel or the ceph-devel mailing list would be the right place to ask questions. Otherwise, it'll probably get done in the next few weeks. -Sam On Thu, Aug 20, 2015 at 3:10 PM, Voloshanenko Igor wrote: > thank you Sam! > I also noticed this linked errors during scrub... > > Now all lools like reasonable! > > So we will wait for bug to be closed. > > do you need any help on it? > > I mean i can help with coding/testing/etc... > > 2015-08-21 0:52 GMT+03:00 Samuel Just : >> >> Ah, this is kind of silly. I think you don't have 37 errors, but 2 >> errors. pg 2.490 object >> 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 is missing >> snap 141. If you look at the objects after that in the log: >> >> 2015-08-20 20:15:44.865670 osd.19 10.12.2.6:6838/1861727 298 : cluster >> [ERR] repair 2.490 >> 68c89490/rbd_data.16796a3d1b58ba.0047/head//2 expected >> clone 2d7b9490/rbd_data.18f92c3d1b58ba.6167/141//2 >> 2015-08-20 20:15:44.865817 osd.19 10.12.2.6:6838/1861727 299 : cluster >> [ERR] repair 2.490 >> ded49490/rbd_data.11a25c7934d3d4.8a8a/head//2 expected >> clone 68c89490/rbd_data.16796a3d1b58ba.0047/141//2 >> >> The clone from the second line matches the head object from the >> previous line, and they have the same clone id. I *think* that the >> first error is real, and the subsequent ones are just scrub being >> dumb. Same deal with pg 2.c4. I just opened >> http://tracker.ceph.com/issues/12738. >> >> The original problem is that >> 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 and >> 22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir//2 are both >> missing a clone. Not sure how that happened, my money is on a >> cache/tiering evict racing with a snap trim. If you have any logging >> or relevant information from when that happened, you should open a >> bug. 
The 'snapdir' in the two object names indicates that the head >> object has actually been deleted (which makes sense if you moved the >> image to a new image and deleted the old one) and is only being kept >> around since there are live snapshots. I suggest you leave the >> snapshots for those images alone for the time being -- removing them >> might cause the osd to crash trying to clean up the wierd on disk >> state. Other than the leaked space from those two image snapshots and >> the annoying spurious scrub errors, I think no actual corruption is >> going on though. I created a tracker ticket for a feature that would >> let ceph-objectstore-tool remove the spurious clone from the >> head/snapdir metadata. >> >> Am I right that you haven't actually seen any osd crashes or user >> visible corruption (except possibly on snapshots of those two images)? >> -Sam >> >> On Thu, Aug 20, 2015 at 10:07 AM, Voloshanenko Igor >> wrote: >> > Inktank: >> > >> > https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf &g
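A rough, read-only illustration of the metadata discussed above, assuming a filestore OSD where the attrs still fit in plain xattrs (on this vintage they are usually stored as user.ceph._ and user.ceph.snapset) and that ceph-dencoder is installed; the object path is a placeholder, and none of this modifies anything:

# With the OSD stopped, dump all xattrs on the object file
~# getfattr -d -m '.*' -e hex /var/lib/ceph/osd/ceph-56/current/2.490_head/.../OBJECT_FILE
# Decode the object_info_t ('_') and SnapSet ('snapset') attrs
~# getfattr -n user.ceph._ --only-values OBJECT_FILE > /tmp/oi
~# ceph-dencoder type object_info_t import /tmp/oi decode dump_json
~# getfattr -n user.ceph.snapset --only-values OBJECT_FILE > /tmp/ss
~# ceph-dencoder type SnapSet import /tmp/ss decode dump_json

The dump_json output is essentially the same information the proposed ceph-objectstore-tool op would print, which makes it a handy way to confirm what is actually on disk before writing any modification code.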
Re: [ceph-users] Repair inconsistent pgs..
Interesting. How often do you delete an image? I'm wondering if whatever this is happened when you deleted these two images. -Sam On Thu, Aug 20, 2015 at 3:42 PM, Voloshanenko Igor wrote: > Sam, i try to understand which rbd contain this chunks.. but no luck. No rbd > images block names started with this... > >> Actually, now that I think about it, you probably didn't remove the >> images for 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 >> and 22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir//2 > > > > > 2015-08-21 1:36 GMT+03:00 Samuel Just : >> >> Actually, now that I think about it, you probably didn't remove the >> images for 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 >> and 22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir//2, but >> other images (that's why the scrub errors went down briefly, those >> objects -- which were fine -- went away). You might want to export >> and reimport those two images into new images, but leave the old ones >> alone until you can clean up the on disk state (image and snapshots) >> and clear the scrub errors. You probably don't want to read the >> snapshots for those images either. Everything else is, I think, >> harmless. >> >> The ceph-objectstore-tool feature would probably not be too hard, >> actually. Each head/snapdir image has two attrs (possibly stored in >> leveldb -- that's why you want to modify the ceph-objectstore-tool and >> use its interfaces rather than mucking about with the files directly) >> '_' and 'snapset' which contain encoded representations of >> object_info_t and SnapSet (both can be found in src/osd/osd_types.h). >> SnapSet has a set of clones and related metadata -- you want to read >> the SnapSet attr off disk and commit a transaction writing out a new >> version with that clone removed. I'd start by cloning the repo, >> starting a vstart cluster locally, and reproducing the issue. Next, >> get familiar with using ceph-objectstore-tool on the osds in that >> vstart cluster. A good first change would be creating a >> ceph-objectstore-tool op that lets you dump json for the object_info_t >> and SnapSet (both types have format() methods which make that easy) on >> an object to stdout so you can confirm what's actually there. oftc >> #ceph-devel or the ceph-devel mailing list would be the right place to >> ask questions. >> >> Otherwise, it'll probably get done in the next few weeks. >> -Sam >> >> On Thu, Aug 20, 2015 at 3:10 PM, Voloshanenko Igor >> wrote: >> > thank you Sam! >> > I also noticed this linked errors during scrub... >> > >> > Now all lools like reasonable! >> > >> > So we will wait for bug to be closed. >> > >> > do you need any help on it? >> > >> > I mean i can help with coding/testing/etc... >> > >> > 2015-08-21 0:52 GMT+03:00 Samuel Just : >> >> >> >> Ah, this is kind of silly. I think you don't have 37 errors, but 2 >> >> errors. pg 2.490 object >> >> 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 is missing >> >> snap 141. 
If you look at the objects after that in the log: >> >> >> >> 2015-08-20 20:15:44.865670 osd.19 10.12.2.6:6838/1861727 298 : cluster >> >> [ERR] repair 2.490 >> >> 68c89490/rbd_data.16796a3d1b58ba.0047/head//2 expected >> >> clone 2d7b9490/rbd_data.18f92c3d1b58ba.6167/141//2 >> >> 2015-08-20 20:15:44.865817 osd.19 10.12.2.6:6838/1861727 299 : cluster >> >> [ERR] repair 2.490 >> >> ded49490/rbd_data.11a25c7934d3d4.8a8a/head//2 expected >> >> clone 68c89490/rbd_data.16796a3d1b58ba.0047/141//2 >> >> >> >> The clone from the second line matches the head object from the >> >> previous line, and they have the same clone id. I *think* that the >> >> first error is real, and the subsequent ones are just scrub being >> >> dumb. Same deal with pg 2.c4. I just opened >> >> http://tracker.ceph.com/issues/12738. >> >> >> >> The original problem is that >> >> 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 and >> >> 22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir//2 are both >> >> missing a clone. Not sure how that happened, my money is on a >> >> cache/tiering evict racing with a snap trim. If you have any logging >> >> or rele
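Mapping an rbd_data prefix back to an image is a matter of checking each image's block_name_prefix; a small sketch, using the pool and one of the prefixes from the log excerpts above:

# Which image (if any) owns objects named rbd_data.eb5f22eb141f2.*?
~# for img in $(rbd ls -p cold-storage); do
       rbd info -p cold-storage "$img" | grep -q 'block_name_prefix: rbd_data.eb5f22eb141f2' && echo "$img"
   done
# No output would be consistent with the image having been deleted, leaving
# only the snapdir objects behind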
Re: [ceph-users] Repair inconsistent pgs..
Ok, so images are regularly removed. In that case, these two objects probably are left over from previously removed images. Once ceph-objectstore-tool can dump the SnapSet from those two objects, you will probably find that those two snapdir objects each have only one bogus clone, in which case you'll probably just remove the images. -Sam On Thu, Aug 20, 2015 at 3:45 PM, Voloshanenko Igor wrote: > Image? One? > > We start deleting images only to fix thsi (export/import)m before - 1-4 > times per day (when VM destroyed)... > > > > 2015-08-21 1:44 GMT+03:00 Samuel Just : >> >> Interesting. How often do you delete an image? I'm wondering if >> whatever this is happened when you deleted these two images. >> -Sam >> >> On Thu, Aug 20, 2015 at 3:42 PM, Voloshanenko Igor >> wrote: >> > Sam, i try to understand which rbd contain this chunks.. but no luck. No >> > rbd >> > images block names started with this... >> > >> >> Actually, now that I think about it, you probably didn't remove the >> >> images for 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 >> >> and 22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir//2 >> > >> > >> > >> > >> > 2015-08-21 1:36 GMT+03:00 Samuel Just : >> >> >> >> Actually, now that I think about it, you probably didn't remove the >> >> images for 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 >> >> and 22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir//2, but >> >> other images (that's why the scrub errors went down briefly, those >> >> objects -- which were fine -- went away). You might want to export >> >> and reimport those two images into new images, but leave the old ones >> >> alone until you can clean up the on disk state (image and snapshots) >> >> and clear the scrub errors. You probably don't want to read the >> >> snapshots for those images either. Everything else is, I think, >> >> harmless. >> >> >> >> The ceph-objectstore-tool feature would probably not be too hard, >> >> actually. Each head/snapdir image has two attrs (possibly stored in >> >> leveldb -- that's why you want to modify the ceph-objectstore-tool and >> >> use its interfaces rather than mucking about with the files directly) >> >> '_' and 'snapset' which contain encoded representations of >> >> object_info_t and SnapSet (both can be found in src/osd/osd_types.h). >> >> SnapSet has a set of clones and related metadata -- you want to read >> >> the SnapSet attr off disk and commit a transaction writing out a new >> >> version with that clone removed. I'd start by cloning the repo, >> >> starting a vstart cluster locally, and reproducing the issue. Next, >> >> get familiar with using ceph-objectstore-tool on the osds in that >> >> vstart cluster. A good first change would be creating a >> >> ceph-objectstore-tool op that lets you dump json for the object_info_t >> >> and SnapSet (both types have format() methods which make that easy) on >> >> an object to stdout so you can confirm what's actually there. oftc >> >> #ceph-devel or the ceph-devel mailing list would be the right place to >> >> ask questions. >> >> >> >> Otherwise, it'll probably get done in the next few weeks. >> >> -Sam >> >> >> >> On Thu, Aug 20, 2015 at 3:10 PM, Voloshanenko Igor >> >> wrote: >> >> > thank you Sam! >> >> > I also noticed this linked errors during scrub... >> >> > >> >> > Now all lools like reasonable! >> >> > >> >> > So we will wait for bug to be closed. >> >> > >> >> > do you need any help on it? >> >> > >> >> > I mean i can help with coding/testing/etc... 
>> >> > >> >> > 2015-08-21 0:52 GMT+03:00 Samuel Just : >> >> >> >> >> >> Ah, this is kind of silly. I think you don't have 37 errors, but 2 >> >> >> errors. pg 2.490 object >> >> >> 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 is >> >> >> missing >> >> >> snap 141. If you look at the objects after that in the log: >> >> >> >> >> >> 2015-08-20 20:15:44.865670 osd.19 10.12.2.6:6838/1861727 298 : >> >> >>
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
Snapshotting with cache/tiering *is* supposed to work. Can you open a bug? -Sam On Thu, Aug 20, 2015 at 3:36 PM, Andrija Panic wrote: > This was related to the caching layer, which doesnt support snapshooting per > docs...for sake of closing the thread. > > On 17 August 2015 at 21:15, Voloshanenko Igor > wrote: >> >> Hi all, can you please help me with unexplained situation... >> >> All snapshot inside ceph broken... >> >> So, as example, we have VM template, as rbd inside ceph. >> We can map it and mount to check that all ok with it >> >> root@test:~# rbd map cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5 >> /dev/rbd0 >> root@test:~# parted /dev/rbd0 print >> Model: Unknown (unknown) >> Disk /dev/rbd0: 10.7GB >> Sector size (logical/physical): 512B/512B >> Partition Table: msdos >> >> Number Start End SizeType File system Flags >> 1 1049kB 525MB 524MB primary ext4 boot >> 2 525MB 10.7GB 10.2GB primary lvm >> >> Than i want to create snap, so i do: >> root@test:~# rbd snap create >> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >> >> And now i want to map it: >> >> root@test:~# rbd map >> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >> /dev/rbd1 >> root@test:~# parted /dev/rbd1 print >> Warning: Unable to open /dev/rbd1 read-write (Read-only file system). >> /dev/rbd1 has been opened read-only. >> Warning: Unable to open /dev/rbd1 read-write (Read-only file system). >> /dev/rbd1 has been opened read-only. >> Error: /dev/rbd1: unrecognised disk label >> >> Even md5 different... >> root@ix-s2:~# md5sum /dev/rbd0 >> 9a47797a07fee3a3d71316e22891d752 /dev/rbd0 >> root@ix-s2:~# md5sum /dev/rbd1 >> e450f50b9ffa0073fae940ee858a43ce /dev/rbd1 >> >> >> Ok, now i protect snap and create clone... but same thing... >> md5 for clone same as for snap,, >> >> root@test:~# rbd unmap /dev/rbd1 >> root@test:~# rbd snap protect >> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >> root@test:~# rbd clone >> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >> cold-storage/test-image >> root@test:~# rbd map cold-storage/test-image >> /dev/rbd1 >> root@test:~# md5sum /dev/rbd1 >> e450f50b9ffa0073fae940ee858a43ce /dev/rbd1 >> >> but it's broken... >> root@test:~# parted /dev/rbd1 print >> Error: /dev/rbd1: unrecognised disk label >> >> >> = >> >> tech details: >> >> root@test:~# ceph -v >> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3) >> >> We have 2 inconstistent pgs, but all images not placed on this pgs... 
>> >> root@test:~# ceph health detail >> HEALTH_ERR 2 pgs inconsistent; 18 scrub errors >> pg 2.490 is active+clean+inconsistent, acting [56,15,29] >> pg 2.c4 is active+clean+inconsistent, acting [56,10,42] >> 18 scrub errors >> >> >> >> root@test:~# ceph osd map cold-storage >> 0e23c701-401d-4465-b9b4-c02939d57bb5 >> osdmap e16770 pool 'cold-storage' (2) object >> '0e23c701-401d-4465-b9b4-c02939d57bb5' -> pg 2.74458f70 (2.770) -> up >> ([37,15,14], p37) acting ([37,15,14], p37) >> root@test:~# ceph osd map cold-storage >> 0e23c701-401d-4465-b9b4-c02939d57bb5@snap >> osdmap e16770 pool 'cold-storage' (2) object >> '0e23c701-401d-4465-b9b4-c02939d57bb5@snap' -> pg 2.793cd4a3 (2.4a3) -> up >> ([12,23,17], p12) acting ([12,23,17], p12) >> root@test:~# ceph osd map cold-storage >> 0e23c701-401d-4465-b9b4-c02939d57bb5@test-image >> osdmap e16770 pool 'cold-storage' (2) object >> '0e23c701-401d-4465-b9b4-c02939d57bb5@test-image' -> pg 2.9519c2a9 (2.2a9) >> -> up ([12,44,23], p12) acting ([12,44,23], p12) >> >> >> Also we use cache layer, which in current moment - in forward mode... >> >> Can you please help me with this.. As my brain stop to understand what is >> going on... >> >> Thank in advance! >> >> >> >> >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > > > > -- > > Andrija Panić > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
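For the bug report being asked for, the usual minimum to attach alongside the reproduction steps already shown is the version and the pool/tier configuration, for example:

~# ceph -v
~# uname -r
~# ceph osd dump | grep pool        # shows cache_mode / tier_of for the tiered pools
~# ceph health detail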
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
Also, can you include the kernel version? -Sam On Thu, Aug 20, 2015 at 3:51 PM, Samuel Just wrote: > Snapshotting with cache/tiering *is* supposed to work. Can you open a bug? > -Sam > > On Thu, Aug 20, 2015 at 3:36 PM, Andrija Panic > wrote: >> This was related to the caching layer, which doesnt support snapshooting per >> docs...for sake of closing the thread. >> >> On 17 August 2015 at 21:15, Voloshanenko Igor >> wrote: >>> >>> Hi all, can you please help me with unexplained situation... >>> >>> All snapshot inside ceph broken... >>> >>> So, as example, we have VM template, as rbd inside ceph. >>> We can map it and mount to check that all ok with it >>> >>> root@test:~# rbd map cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5 >>> /dev/rbd0 >>> root@test:~# parted /dev/rbd0 print >>> Model: Unknown (unknown) >>> Disk /dev/rbd0: 10.7GB >>> Sector size (logical/physical): 512B/512B >>> Partition Table: msdos >>> >>> Number Start End SizeType File system Flags >>> 1 1049kB 525MB 524MB primary ext4 boot >>> 2 525MB 10.7GB 10.2GB primary lvm >>> >>> Than i want to create snap, so i do: >>> root@test:~# rbd snap create >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >>> >>> And now i want to map it: >>> >>> root@test:~# rbd map >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >>> /dev/rbd1 >>> root@test:~# parted /dev/rbd1 print >>> Warning: Unable to open /dev/rbd1 read-write (Read-only file system). >>> /dev/rbd1 has been opened read-only. >>> Warning: Unable to open /dev/rbd1 read-write (Read-only file system). >>> /dev/rbd1 has been opened read-only. >>> Error: /dev/rbd1: unrecognised disk label >>> >>> Even md5 different... >>> root@ix-s2:~# md5sum /dev/rbd0 >>> 9a47797a07fee3a3d71316e22891d752 /dev/rbd0 >>> root@ix-s2:~# md5sum /dev/rbd1 >>> e450f50b9ffa0073fae940ee858a43ce /dev/rbd1 >>> >>> >>> Ok, now i protect snap and create clone... but same thing... >>> md5 for clone same as for snap,, >>> >>> root@test:~# rbd unmap /dev/rbd1 >>> root@test:~# rbd snap protect >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >>> root@test:~# rbd clone >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >>> cold-storage/test-image >>> root@test:~# rbd map cold-storage/test-image >>> /dev/rbd1 >>> root@test:~# md5sum /dev/rbd1 >>> e450f50b9ffa0073fae940ee858a43ce /dev/rbd1 >>> >>> but it's broken... >>> root@test:~# parted /dev/rbd1 print >>> Error: /dev/rbd1: unrecognised disk label >>> >>> >>> = >>> >>> tech details: >>> >>> root@test:~# ceph -v >>> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3) >>> >>> We have 2 inconstistent pgs, but all images not placed on this pgs... 
>>> >>> root@test:~# ceph health detail >>> HEALTH_ERR 2 pgs inconsistent; 18 scrub errors >>> pg 2.490 is active+clean+inconsistent, acting [56,15,29] >>> pg 2.c4 is active+clean+inconsistent, acting [56,10,42] >>> 18 scrub errors >>> >>> >>> >>> root@test:~# ceph osd map cold-storage >>> 0e23c701-401d-4465-b9b4-c02939d57bb5 >>> osdmap e16770 pool 'cold-storage' (2) object >>> '0e23c701-401d-4465-b9b4-c02939d57bb5' -> pg 2.74458f70 (2.770) -> up >>> ([37,15,14], p37) acting ([37,15,14], p37) >>> root@test:~# ceph osd map cold-storage >>> 0e23c701-401d-4465-b9b4-c02939d57bb5@snap >>> osdmap e16770 pool 'cold-storage' (2) object >>> '0e23c701-401d-4465-b9b4-c02939d57bb5@snap' -> pg 2.793cd4a3 (2.4a3) -> up >>> ([12,23,17], p12) acting ([12,23,17], p12) >>> root@test:~# ceph osd map cold-storage >>> 0e23c701-401d-4465-b9b4-c02939d57bb5@test-image >>> osdmap e16770 pool 'cold-storage' (2) object >>> '0e23c701-401d-4465-b9b4-c02939d57bb5@test-image' -> pg 2.9519c2a9 (2.2a9) >>> -> up ([12,44,23], p12) acting ([12,44,23], p12) >>> >>> >>> Also we use cache layer, which in current moment - in forward mode... >>> >>> Can you please help me with this.. As my brain stop to understand what is >>> going on... >>> >>> Thank in advance! >>> >>> >>> >>> >>> >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >> >> >> >> -- >> >> Andrija Panić >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
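For completeness, a small sketch of the version and tiering details that are worth pasting alongside the kernel version; nothing here is specific to this cluster, and "ceph osd dump" prints the cache mode and tier relationships on the pool lines:

  uname -r                                    # kernel (krbd) version
  ceph -v                                     # ceph version on the nodes
  ceph osd dump | grep -E 'tier|cache_mode'   # tiering configuration of the affected pools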
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
Hmm, that might actually be client side. Can you attempt to reproduce with rbd-fuse (different client side implementation from the kernel)? -Sam On Thu, Aug 20, 2015 at 3:56 PM, Voloshanenko Igor wrote: > root@test:~# uname -a > Linux ix-s5 4.0.4-040004-generic #201505171336 SMP Sun May 17 17:37:22 UTC > 2015 x86_64 x86_64 x86_64 GNU/Linux > > 2015-08-21 1:54 GMT+03:00 Samuel Just : >> >> Also, can you include the kernel version? >> -Sam >> >> On Thu, Aug 20, 2015 at 3:51 PM, Samuel Just wrote: >> > Snapshotting with cache/tiering *is* supposed to work. Can you open a >> > bug? >> > -Sam >> > >> > On Thu, Aug 20, 2015 at 3:36 PM, Andrija Panic >> > wrote: >> >> This was related to the caching layer, which doesnt support >> >> snapshooting per >> >> docs...for sake of closing the thread. >> >> >> >> On 17 August 2015 at 21:15, Voloshanenko Igor >> >> >> >> wrote: >> >>> >> >>> Hi all, can you please help me with unexplained situation... >> >>> >> >>> All snapshot inside ceph broken... >> >>> >> >>> So, as example, we have VM template, as rbd inside ceph. >> >>> We can map it and mount to check that all ok with it >> >>> >> >>> root@test:~# rbd map cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5 >> >>> /dev/rbd0 >> >>> root@test:~# parted /dev/rbd0 print >> >>> Model: Unknown (unknown) >> >>> Disk /dev/rbd0: 10.7GB >> >>> Sector size (logical/physical): 512B/512B >> >>> Partition Table: msdos >> >>> >> >>> Number Start End SizeType File system Flags >> >>> 1 1049kB 525MB 524MB primary ext4 boot >> >>> 2 525MB 10.7GB 10.2GB primary lvm >> >>> >> >>> Than i want to create snap, so i do: >> >>> root@test:~# rbd snap create >> >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >> >>> >> >>> And now i want to map it: >> >>> >> >>> root@test:~# rbd map >> >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >> >>> /dev/rbd1 >> >>> root@test:~# parted /dev/rbd1 print >> >>> Warning: Unable to open /dev/rbd1 read-write (Read-only file system). >> >>> /dev/rbd1 has been opened read-only. >> >>> Warning: Unable to open /dev/rbd1 read-write (Read-only file system). >> >>> /dev/rbd1 has been opened read-only. >> >>> Error: /dev/rbd1: unrecognised disk label >> >>> >> >>> Even md5 different... >> >>> root@ix-s2:~# md5sum /dev/rbd0 >> >>> 9a47797a07fee3a3d71316e22891d752 /dev/rbd0 >> >>> root@ix-s2:~# md5sum /dev/rbd1 >> >>> e450f50b9ffa0073fae940ee858a43ce /dev/rbd1 >> >>> >> >>> >> >>> Ok, now i protect snap and create clone... but same thing... >> >>> md5 for clone same as for snap,, >> >>> >> >>> root@test:~# rbd unmap /dev/rbd1 >> >>> root@test:~# rbd snap protect >> >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >> >>> root@test:~# rbd clone >> >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >> >>> cold-storage/test-image >> >>> root@test:~# rbd map cold-storage/test-image >> >>> /dev/rbd1 >> >>> root@test:~# md5sum /dev/rbd1 >> >>> e450f50b9ffa0073fae940ee858a43ce /dev/rbd1 >> >>> >> >>> but it's broken... >> >>> root@test:~# parted /dev/rbd1 print >> >>> Error: /dev/rbd1: unrecognised disk label >> >>> >> >>> >> >>> = >> >>> >> >>> tech details: >> >>> >> >>> root@test:~# ceph -v >> >>> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3) >> >>> >> >>> We have 2 inconstistent pgs, but all images not placed on this pgs... 
>> >>> >> >>> root@test:~# ceph health detail >> >>> HEALTH_ERR 2 pgs inconsistent; 18 scrub errors >> >>> pg 2.490 is active+clean+inconsistent, acting [56,15,29] >> >>> pg 2.c4 is active+clean+inconsistent, acting [56,10,42] >> >>> 18 scrub err
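One way to take the kernel client out of the picture for the md5 comparison, sketched with the pool/image/snapshot names used earlier in this thread: read the image through rbd-fuse and the snapshot through rbd export, so no krbd mapping is involved:

  mkdir -p /mnt/rbdfuse
  rbd-fuse -p cold-storage -c /etc/ceph/ceph.conf /mnt/rbdfuse
  md5sum /mnt/rbdfuse/0e23c701-401d-4465-b9b4-c02939d57bb5
  fusermount -u /mnt/rbdfuse
  rbd export cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap /tmp/snap.img
  md5sum /tmp/snap.img

If the snapshot still checksums differently from the image through this path too, the kernel client is probably not the culprit.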
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
What's supposed to happen is that the client transparently directs all requests to the cache pool rather than the cold pool when there is a cache pool. If the kernel is sending requests to the cold pool, that's probably where the bug is. Odd. It could also be a bug specific 'forward' mode either in the client or on the osd. Why did you have it in that mode? -Sam On Thu, Aug 20, 2015 at 3:58 PM, Voloshanenko Igor wrote: > We used 4.x branch, as we have "very good" Samsung 850 pro in production, > and they don;t support ncq_trim... > > And 4,x first branch which include exceptions for this in libsata.c. > > sure we can backport this 1 line to 3.x branch, but we prefer no to go > deeper if packege for new kernel exist. > > 2015-08-21 1:56 GMT+03:00 Voloshanenko Igor : >> >> root@test:~# uname -a >> Linux ix-s5 4.0.4-040004-generic #201505171336 SMP Sun May 17 17:37:22 UTC >> 2015 x86_64 x86_64 x86_64 GNU/Linux >> >> 2015-08-21 1:54 GMT+03:00 Samuel Just : >>> >>> Also, can you include the kernel version? >>> -Sam >>> >>> On Thu, Aug 20, 2015 at 3:51 PM, Samuel Just wrote: >>> > Snapshotting with cache/tiering *is* supposed to work. Can you open a >>> > bug? >>> > -Sam >>> > >>> > On Thu, Aug 20, 2015 at 3:36 PM, Andrija Panic >>> > wrote: >>> >> This was related to the caching layer, which doesnt support >>> >> snapshooting per >>> >> docs...for sake of closing the thread. >>> >> >>> >> On 17 August 2015 at 21:15, Voloshanenko Igor >>> >> >>> >> wrote: >>> >>> >>> >>> Hi all, can you please help me with unexplained situation... >>> >>> >>> >>> All snapshot inside ceph broken... >>> >>> >>> >>> So, as example, we have VM template, as rbd inside ceph. >>> >>> We can map it and mount to check that all ok with it >>> >>> >>> >>> root@test:~# rbd map >>> >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5 >>> >>> /dev/rbd0 >>> >>> root@test:~# parted /dev/rbd0 print >>> >>> Model: Unknown (unknown) >>> >>> Disk /dev/rbd0: 10.7GB >>> >>> Sector size (logical/physical): 512B/512B >>> >>> Partition Table: msdos >>> >>> >>> >>> Number Start End SizeType File system Flags >>> >>> 1 1049kB 525MB 524MB primary ext4 boot >>> >>> 2 525MB 10.7GB 10.2GB primary lvm >>> >>> >>> >>> Than i want to create snap, so i do: >>> >>> root@test:~# rbd snap create >>> >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >>> >>> >>> >>> And now i want to map it: >>> >>> >>> >>> root@test:~# rbd map >>> >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >>> >>> /dev/rbd1 >>> >>> root@test:~# parted /dev/rbd1 print >>> >>> Warning: Unable to open /dev/rbd1 read-write (Read-only file system). >>> >>> /dev/rbd1 has been opened read-only. >>> >>> Warning: Unable to open /dev/rbd1 read-write (Read-only file system). >>> >>> /dev/rbd1 has been opened read-only. >>> >>> Error: /dev/rbd1: unrecognised disk label >>> >>> >>> >>> Even md5 different... >>> >>> root@ix-s2:~# md5sum /dev/rbd0 >>> >>> 9a47797a07fee3a3d71316e22891d752 /dev/rbd0 >>> >>> root@ix-s2:~# md5sum /dev/rbd1 >>> >>> e450f50b9ffa0073fae940ee858a43ce /dev/rbd1 >>> >>> >>> >>> >>> >>> Ok, now i protect snap and create clone... but same thing... 
>>> >>> md5 for clone same as for snap,, >>> >>> >>> >>> root@test:~# rbd unmap /dev/rbd1 >>> >>> root@test:~# rbd snap protect >>> >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >>> >>> root@test:~# rbd clone >>> >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >>> >>> cold-storage/test-image >>> >>> root@test:~# rbd map cold-storage/test-image >>> >>> /dev/rbd1 >>> >>> root@test:~# md5sum /dev/rbd1 >>> >&
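For anyone following along, the mode being discussed is a per-tier setting; a sketch of how to inspect and switch it (the pool name is a placeholder):

  ceph osd dump | grep cache_mode                    # what mode is the tier in right now
  ceph osd tier cache-mode <cache-pool> writeback    # normal operation
  ceph osd tier cache-mode <cache-pool> forward      # usually only while draining the tier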
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
Certainly, don't reproduce this with a cluster you care about :). -Sam On Thu, Aug 20, 2015 at 4:02 PM, Samuel Just wrote: > What's supposed to happen is that the client transparently directs all > requests to the cache pool rather than the cold pool when there is a > cache pool. If the kernel is sending requests to the cold pool, > that's probably where the bug is. Odd. It could also be a bug > specific 'forward' mode either in the client or on the osd. Why did > you have it in that mode? > -Sam > > On Thu, Aug 20, 2015 at 3:58 PM, Voloshanenko Igor > wrote: >> We used 4.x branch, as we have "very good" Samsung 850 pro in production, >> and they don;t support ncq_trim... >> >> And 4,x first branch which include exceptions for this in libsata.c. >> >> sure we can backport this 1 line to 3.x branch, but we prefer no to go >> deeper if packege for new kernel exist. >> >> 2015-08-21 1:56 GMT+03:00 Voloshanenko Igor : >>> >>> root@test:~# uname -a >>> Linux ix-s5 4.0.4-040004-generic #201505171336 SMP Sun May 17 17:37:22 UTC >>> 2015 x86_64 x86_64 x86_64 GNU/Linux >>> >>> 2015-08-21 1:54 GMT+03:00 Samuel Just : >>>> >>>> Also, can you include the kernel version? >>>> -Sam >>>> >>>> On Thu, Aug 20, 2015 at 3:51 PM, Samuel Just wrote: >>>> > Snapshotting with cache/tiering *is* supposed to work. Can you open a >>>> > bug? >>>> > -Sam >>>> > >>>> > On Thu, Aug 20, 2015 at 3:36 PM, Andrija Panic >>>> > wrote: >>>> >> This was related to the caching layer, which doesnt support >>>> >> snapshooting per >>>> >> docs...for sake of closing the thread. >>>> >> >>>> >> On 17 August 2015 at 21:15, Voloshanenko Igor >>>> >> >>>> >> wrote: >>>> >>> >>>> >>> Hi all, can you please help me with unexplained situation... >>>> >>> >>>> >>> All snapshot inside ceph broken... >>>> >>> >>>> >>> So, as example, we have VM template, as rbd inside ceph. >>>> >>> We can map it and mount to check that all ok with it >>>> >>> >>>> >>> root@test:~# rbd map >>>> >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5 >>>> >>> /dev/rbd0 >>>> >>> root@test:~# parted /dev/rbd0 print >>>> >>> Model: Unknown (unknown) >>>> >>> Disk /dev/rbd0: 10.7GB >>>> >>> Sector size (logical/physical): 512B/512B >>>> >>> Partition Table: msdos >>>> >>> >>>> >>> Number Start End SizeType File system Flags >>>> >>> 1 1049kB 525MB 524MB primary ext4 boot >>>> >>> 2 525MB 10.7GB 10.2GB primary lvm >>>> >>> >>>> >>> Than i want to create snap, so i do: >>>> >>> root@test:~# rbd snap create >>>> >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >>>> >>> >>>> >>> And now i want to map it: >>>> >>> >>>> >>> root@test:~# rbd map >>>> >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >>>> >>> /dev/rbd1 >>>> >>> root@test:~# parted /dev/rbd1 print >>>> >>> Warning: Unable to open /dev/rbd1 read-write (Read-only file system). >>>> >>> /dev/rbd1 has been opened read-only. >>>> >>> Warning: Unable to open /dev/rbd1 read-write (Read-only file system). >>>> >>> /dev/rbd1 has been opened read-only. >>>> >>> Error: /dev/rbd1: unrecognised disk label >>>> >>> >>>> >>> Even md5 different... >>>> >>> root@ix-s2:~# md5sum /dev/rbd0 >>>> >>> 9a47797a07fee3a3d71316e22891d752 /dev/rbd0 >>>> >>> root@ix-s2:~# md5sum /dev/rbd1 >>>> >>> e450f50b9ffa0073fae940ee858a43ce /dev/rbd1 >>>> >>> >>>> >>> >>>> >>> Ok, now i protect snap and create clone... but same thing... >>>> >>> md5 for clone same as for snap,, >>>> >>> >>>> >>> root@test:~# rbd unmap /dev/rbd1
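A disposable cluster for this kind of experiment can be spun up from a source checkout with vstart.sh; a minimal sketch, assuming you are in the built src/ directory (the same developer-mode setup that appears later in this thread):

  MON=1 OSD=3 MDS=0 ./vstart.sh -n -x
  ./ceph -s       # the wrapper picks up the vstart-generated ceph.conf in src/
  ./stop.sh       # tear it all down afterwards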
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
So you started draining the cache pool before you saw either the inconsistent pgs or the anomalous snap behavior? (That is, writeback mode was working correctly?) -Sam On Thu, Aug 20, 2015 at 4:07 PM, Voloshanenko Igor wrote: > Good joke ) > > 2015-08-21 2:06 GMT+03:00 Samuel Just : >> >> Certainly, don't reproduce this with a cluster you care about :). >> -Sam >> >> On Thu, Aug 20, 2015 at 4:02 PM, Samuel Just wrote: >> > What's supposed to happen is that the client transparently directs all >> > requests to the cache pool rather than the cold pool when there is a >> > cache pool. If the kernel is sending requests to the cold pool, >> > that's probably where the bug is. Odd. It could also be a bug >> > specific 'forward' mode either in the client or on the osd. Why did >> > you have it in that mode? >> > -Sam >> > >> > On Thu, Aug 20, 2015 at 3:58 PM, Voloshanenko Igor >> > wrote: >> >> We used 4.x branch, as we have "very good" Samsung 850 pro in >> >> production, >> >> and they don;t support ncq_trim... >> >> >> >> And 4,x first branch which include exceptions for this in libsata.c. >> >> >> >> sure we can backport this 1 line to 3.x branch, but we prefer no to go >> >> deeper if packege for new kernel exist. >> >> >> >> 2015-08-21 1:56 GMT+03:00 Voloshanenko Igor >> >> : >> >>> >> >>> root@test:~# uname -a >> >>> Linux ix-s5 4.0.4-040004-generic #201505171336 SMP Sun May 17 17:37:22 >> >>> UTC >> >>> 2015 x86_64 x86_64 x86_64 GNU/Linux >> >>> >> >>> 2015-08-21 1:54 GMT+03:00 Samuel Just : >> >>>> >> >>>> Also, can you include the kernel version? >> >>>> -Sam >> >>>> >> >>>> On Thu, Aug 20, 2015 at 3:51 PM, Samuel Just >> >>>> wrote: >> >>>> > Snapshotting with cache/tiering *is* supposed to work. Can you >> >>>> > open a >> >>>> > bug? >> >>>> > -Sam >> >>>> > >> >>>> > On Thu, Aug 20, 2015 at 3:36 PM, Andrija Panic >> >>>> > wrote: >> >>>> >> This was related to the caching layer, which doesnt support >> >>>> >> snapshooting per >> >>>> >> docs...for sake of closing the thread. >> >>>> >> >> >>>> >> On 17 August 2015 at 21:15, Voloshanenko Igor >> >>>> >> >> >>>> >> wrote: >> >>>> >>> >> >>>> >>> Hi all, can you please help me with unexplained situation... >> >>>> >>> >> >>>> >>> All snapshot inside ceph broken... >> >>>> >>> >> >>>> >>> So, as example, we have VM template, as rbd inside ceph. >> >>>> >>> We can map it and mount to check that all ok with it >> >>>> >>> >> >>>> >>> root@test:~# rbd map >> >>>> >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5 >> >>>> >>> /dev/rbd0 >> >>>> >>> root@test:~# parted /dev/rbd0 print >> >>>> >>> Model: Unknown (unknown) >> >>>> >>> Disk /dev/rbd0: 10.7GB >> >>>> >>> Sector size (logical/physical): 512B/512B >> >>>> >>> Partition Table: msdos >> >>>> >>> >> >>>> >>> Number Start End SizeType File system Flags >> >>>> >>> 1 1049kB 525MB 524MB primary ext4 boot >> >>>> >>> 2 525MB 10.7GB 10.2GB primary lvm >> >>>> >>> >> >>>> >>> Than i want to create snap, so i do: >> >>>> >>> root@test:~# rbd snap create >> >>>> >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >> >>>> >>> >> >>>> >>> And now i want to map it: >> >>>> >>> >> >>>> >>> root@test:~# rbd map >> >>>> >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap >> >>>> >>> /dev/rbd1 >> >&
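If it helps reconstruct the timeline, the cluster log on a monitor node records when the scrub errors first appeared; a sketch, assuming the default log location:

  grep -iE 'scrub.*error|inconsistent' /var/log/ceph/ceph.log | head -n 20
  ceph health detail | grep inconsistent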
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
Created a ticket to improve our testing here -- this appears to be a hole. http://tracker.ceph.com/issues/12742 -Sam On Thu, Aug 20, 2015 at 4:09 PM, Samuel Just wrote: > So you started draining the cache pool before you saw either the > inconsistent pgs or the anomalous snap behavior? (That is, writeback > mode was working correctly?) > -Sam > > On Thu, Aug 20, 2015 at 4:07 PM, Voloshanenko Igor > wrote: >> Good joke ) >> >> 2015-08-21 2:06 GMT+03:00 Samuel Just : >>> >>> Certainly, don't reproduce this with a cluster you care about :). >>> -Sam >>> >>> On Thu, Aug 20, 2015 at 4:02 PM, Samuel Just wrote: >>> > What's supposed to happen is that the client transparently directs all >>> > requests to the cache pool rather than the cold pool when there is a >>> > cache pool. If the kernel is sending requests to the cold pool, >>> > that's probably where the bug is. Odd. It could also be a bug >>> > specific 'forward' mode either in the client or on the osd. Why did >>> > you have it in that mode? >>> > -Sam >>> > >>> > On Thu, Aug 20, 2015 at 3:58 PM, Voloshanenko Igor >>> > wrote: >>> >> We used 4.x branch, as we have "very good" Samsung 850 pro in >>> >> production, >>> >> and they don;t support ncq_trim... >>> >> >>> >> And 4,x first branch which include exceptions for this in libsata.c. >>> >> >>> >> sure we can backport this 1 line to 3.x branch, but we prefer no to go >>> >> deeper if packege for new kernel exist. >>> >> >>> >> 2015-08-21 1:56 GMT+03:00 Voloshanenko Igor >>> >> : >>> >>> >>> >>> root@test:~# uname -a >>> >>> Linux ix-s5 4.0.4-040004-generic #201505171336 SMP Sun May 17 17:37:22 >>> >>> UTC >>> >>> 2015 x86_64 x86_64 x86_64 GNU/Linux >>> >>> >>> >>> 2015-08-21 1:54 GMT+03:00 Samuel Just : >>> >>>> >>> >>>> Also, can you include the kernel version? >>> >>>> -Sam >>> >>>> >>> >>>> On Thu, Aug 20, 2015 at 3:51 PM, Samuel Just >>> >>>> wrote: >>> >>>> > Snapshotting with cache/tiering *is* supposed to work. Can you >>> >>>> > open a >>> >>>> > bug? >>> >>>> > -Sam >>> >>>> > >>> >>>> > On Thu, Aug 20, 2015 at 3:36 PM, Andrija Panic >>> >>>> > wrote: >>> >>>> >> This was related to the caching layer, which doesnt support >>> >>>> >> snapshooting per >>> >>>> >> docs...for sake of closing the thread. >>> >>>> >> >>> >>>> >> On 17 August 2015 at 21:15, Voloshanenko Igor >>> >>>> >> >>> >>>> >> wrote: >>> >>>> >>> >>> >>>> >>> Hi all, can you please help me with unexplained situation... >>> >>>> >>> >>> >>>> >>> All snapshot inside ceph broken... >>> >>>> >>> >>> >>>> >>> So, as example, we have VM template, as rbd inside ceph. >>> >>>> >>> We can map it and mount to check that all ok with it >>> >>>> >>> >>> >>>> >>> root@test:~# rbd map >>> >>>> >>> cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5 >>> >>>> >>> /dev/rbd0 >>> >>>> >>> root@test:~# parted /dev/rbd0 print >>> >>>> >>> Model: Unknown (unknown) >>> >>>> >>> Disk /dev/rbd0: 10.7GB >>> >>>> >>> Sector size (logical/physical): 512B/512B >>> >>>> >>> Partition Table: msdos >>> >>>> >>> >>> >>>> >>> Number Start End SizeType File system Flags >>> >>>> >>> 1 1049kB 525MB 524MB primary ext4 boot >>> >>>> >>> 2 525MB 10.7GB 10.2GB primary lvm >>> >>>> >>> >>> >>>> >>> Than i want to create snap, so i do: >>&
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
Not sure what you mean by: but it's stop to work in same moment, when cache layer fulfilled with data and evict/flush started... -Sam On Thu, Aug 20, 2015 at 4:11 PM, Voloshanenko Igor wrote: > No, when we start draining cache - bad pgs was in place... > We have big rebalance (disk by disk - to change journal side on both > hot/cold layers).. All was Ok, but after 2 days - arrived scrub errors and 2 > pgs inconsistent... > > In writeback - yes, looks like snapshot works good. but it's stop to work in > same moment, when cache layer fulfilled with data and evict/flush started... > > > > 2015-08-21 2:09 GMT+03:00 Samuel Just : >> >> So you started draining the cache pool before you saw either the >> inconsistent pgs or the anomalous snap behavior? (That is, writeback >> mode was working correctly?) >> -Sam >> >> On Thu, Aug 20, 2015 at 4:07 PM, Voloshanenko Igor >> wrote: >> > Good joke ) >> > >> > 2015-08-21 2:06 GMT+03:00 Samuel Just : >> >> >> >> Certainly, don't reproduce this with a cluster you care about :). >> >> -Sam >> >> >> >> On Thu, Aug 20, 2015 at 4:02 PM, Samuel Just wrote: >> >> > What's supposed to happen is that the client transparently directs >> >> > all >> >> > requests to the cache pool rather than the cold pool when there is a >> >> > cache pool. If the kernel is sending requests to the cold pool, >> >> > that's probably where the bug is. Odd. It could also be a bug >> >> > specific 'forward' mode either in the client or on the osd. Why did >> >> > you have it in that mode? >> >> > -Sam >> >> > >> >> > On Thu, Aug 20, 2015 at 3:58 PM, Voloshanenko Igor >> >> > wrote: >> >> >> We used 4.x branch, as we have "very good" Samsung 850 pro in >> >> >> production, >> >> >> and they don;t support ncq_trim... >> >> >> >> >> >> And 4,x first branch which include exceptions for this in libsata.c. >> >> >> >> >> >> sure we can backport this 1 line to 3.x branch, but we prefer no to >> >> >> go >> >> >> deeper if packege for new kernel exist. >> >> >> >> >> >> 2015-08-21 1:56 GMT+03:00 Voloshanenko Igor >> >> >> : >> >> >>> >> >> >>> root@test:~# uname -a >> >> >>> Linux ix-s5 4.0.4-040004-generic #201505171336 SMP Sun May 17 >> >> >>> 17:37:22 >> >> >>> UTC >> >> >>> 2015 x86_64 x86_64 x86_64 GNU/Linux >> >> >>> >> >> >>> 2015-08-21 1:54 GMT+03:00 Samuel Just : >> >> >>>> >> >> >>>> Also, can you include the kernel version? >> >> >>>> -Sam >> >> >>>> >> >> >>>> On Thu, Aug 20, 2015 at 3:51 PM, Samuel Just >> >> >>>> wrote: >> >> >>>> > Snapshotting with cache/tiering *is* supposed to work. Can you >> >> >>>> > open a >> >> >>>> > bug? >> >> >>>> > -Sam >> >> >>>> > >> >> >>>> > On Thu, Aug 20, 2015 at 3:36 PM, Andrija Panic >> >> >>>> > wrote: >> >> >>>> >> This was related to the caching layer, which doesnt support >> >> >>>> >> snapshooting per >> >> >>>> >> docs...for sake of closing the thread. >> >> >>>> >> >> >> >>>> >> On 17 August 2015 at 21:15, Voloshanenko Igor >> >> >>>> >> >> >> >>>> >> wrote: >> >> >>>> >>> >> >> >>>> >>> Hi all, can you please help me with unexplained situation... >> >> >>>> >>> >> >> >>>> >>> All snapshot inside ceph broken... >> >> >>>> >>> >> >> >>>> >>> So, as example, we have VM template, as rbd inside ceph. >> >> >>>> >>> We can map it and mount to check that all ok with it >> >> >>>> >>> >> >> >>>> >>> root@test:~# rbd map >> >
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
Also, what do you mean by "change journal side"? -Sam On Thu, Aug 20, 2015 at 4:15 PM, Samuel Just wrote: > Not sure what you mean by: > > but it's stop to work in same moment, when cache layer fulfilled with > data and evict/flush started... > -Sam > > On Thu, Aug 20, 2015 at 4:11 PM, Voloshanenko Igor > wrote: >> No, when we start draining cache - bad pgs was in place... >> We have big rebalance (disk by disk - to change journal side on both >> hot/cold layers).. All was Ok, but after 2 days - arrived scrub errors and 2 >> pgs inconsistent... >> >> In writeback - yes, looks like snapshot works good. but it's stop to work in >> same moment, when cache layer fulfilled with data and evict/flush started... >> >> >> >> 2015-08-21 2:09 GMT+03:00 Samuel Just : >>> >>> So you started draining the cache pool before you saw either the >>> inconsistent pgs or the anomalous snap behavior? (That is, writeback >>> mode was working correctly?) >>> -Sam >>> >>> On Thu, Aug 20, 2015 at 4:07 PM, Voloshanenko Igor >>> wrote: >>> > Good joke ))))) >>> > >>> > 2015-08-21 2:06 GMT+03:00 Samuel Just : >>> >> >>> >> Certainly, don't reproduce this with a cluster you care about :). >>> >> -Sam >>> >> >>> >> On Thu, Aug 20, 2015 at 4:02 PM, Samuel Just wrote: >>> >> > What's supposed to happen is that the client transparently directs >>> >> > all >>> >> > requests to the cache pool rather than the cold pool when there is a >>> >> > cache pool. If the kernel is sending requests to the cold pool, >>> >> > that's probably where the bug is. Odd. It could also be a bug >>> >> > specific 'forward' mode either in the client or on the osd. Why did >>> >> > you have it in that mode? >>> >> > -Sam >>> >> > >>> >> > On Thu, Aug 20, 2015 at 3:58 PM, Voloshanenko Igor >>> >> > wrote: >>> >> >> We used 4.x branch, as we have "very good" Samsung 850 pro in >>> >> >> production, >>> >> >> and they don;t support ncq_trim... >>> >> >> >>> >> >> And 4,x first branch which include exceptions for this in libsata.c. >>> >> >> >>> >> >> sure we can backport this 1 line to 3.x branch, but we prefer no to >>> >> >> go >>> >> >> deeper if packege for new kernel exist. >>> >> >> >>> >> >> 2015-08-21 1:56 GMT+03:00 Voloshanenko Igor >>> >> >> : >>> >> >>> >>> >> >>> root@test:~# uname -a >>> >> >>> Linux ix-s5 4.0.4-040004-generic #201505171336 SMP Sun May 17 >>> >> >>> 17:37:22 >>> >> >>> UTC >>> >> >>> 2015 x86_64 x86_64 x86_64 GNU/Linux >>> >> >>> >>> >> >>> 2015-08-21 1:54 GMT+03:00 Samuel Just : >>> >> >>>> >>> >> >>>> Also, can you include the kernel version? >>> >> >>>> -Sam >>> >> >>>> >>> >> >>>> On Thu, Aug 20, 2015 at 3:51 PM, Samuel Just >>> >> >>>> wrote: >>> >> >>>> > Snapshotting with cache/tiering *is* supposed to work. Can you >>> >> >>>> > open a >>> >> >>>> > bug? >>> >> >>>> > -Sam >>> >> >>>> > >>> >> >>>> > On Thu, Aug 20, 2015 at 3:36 PM, Andrija Panic >>> >> >>>> > wrote: >>> >> >>>> >> This was related to the caching layer, which doesnt support >>> >> >>>> >> snapshooting per >>> >> >>>> >> docs...for sake of closing the thread. >>> >> >>>> >> >>> >> >>>> >> On 17 August 2015 at 21:15, Voloshanenko Igor >>> >> >>>> >> >>> >> >>>> >> wrote: >>> >> >>>> >>> >>> >> >>>> >>> Hi all, can you please help me with unexplained situation... >>> >> >>>> >>> >>> &
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
But that was still in writeback mode, right? -Sam On Thu, Aug 20, 2015 at 4:18 PM, Voloshanenko Igor wrote: > WE haven't set values for max_bytes / max_objects.. and all data initially > writes only to cache layer and not flushed at all to cold layer. > > Then we received notification from monitoring that we collect about 750GB in > hot pool ) So i changed values for max_object_bytes to be 0,9 of disk > size... And then evicting/flushing started... > > And issue with snapshots arrived > > 2015-08-21 2:15 GMT+03:00 Samuel Just : >> >> Not sure what you mean by: >> >> but it's stop to work in same moment, when cache layer fulfilled with >> data and evict/flush started... >> -Sam >> >> On Thu, Aug 20, 2015 at 4:11 PM, Voloshanenko Igor >> wrote: >> > No, when we start draining cache - bad pgs was in place... >> > We have big rebalance (disk by disk - to change journal side on both >> > hot/cold layers).. All was Ok, but after 2 days - arrived scrub errors >> > and 2 >> > pgs inconsistent... >> > >> > In writeback - yes, looks like snapshot works good. but it's stop to >> > work in >> > same moment, when cache layer fulfilled with data and evict/flush >> > started... >> > >> > >> > >> > 2015-08-21 2:09 GMT+03:00 Samuel Just : >> >> >> >> So you started draining the cache pool before you saw either the >> >> inconsistent pgs or the anomalous snap behavior? (That is, writeback >> >> mode was working correctly?) >> >> -Sam >> >> >> >> On Thu, Aug 20, 2015 at 4:07 PM, Voloshanenko Igor >> >> wrote: >> >> > Good joke ) >> >> > >> >> > 2015-08-21 2:06 GMT+03:00 Samuel Just : >> >> >> >> >> >> Certainly, don't reproduce this with a cluster you care about :). >> >> >> -Sam >> >> >> >> >> >> On Thu, Aug 20, 2015 at 4:02 PM, Samuel Just >> >> >> wrote: >> >> >> > What's supposed to happen is that the client transparently directs >> >> >> > all >> >> >> > requests to the cache pool rather than the cold pool when there is >> >> >> > a >> >> >> > cache pool. If the kernel is sending requests to the cold pool, >> >> >> > that's probably where the bug is. Odd. It could also be a bug >> >> >> > specific 'forward' mode either in the client or on the osd. Why >> >> >> > did >> >> >> > you have it in that mode? >> >> >> > -Sam >> >> >> > >> >> >> > On Thu, Aug 20, 2015 at 3:58 PM, Voloshanenko Igor >> >> >> > wrote: >> >> >> >> We used 4.x branch, as we have "very good" Samsung 850 pro in >> >> >> >> production, >> >> >> >> and they don;t support ncq_trim... >> >> >> >> >> >> >> >> And 4,x first branch which include exceptions for this in >> >> >> >> libsata.c. >> >> >> >> >> >> >> >> sure we can backport this 1 line to 3.x branch, but we prefer no >> >> >> >> to >> >> >> >> go >> >> >> >> deeper if packege for new kernel exist. >> >> >> >> >> >> >> >> 2015-08-21 1:56 GMT+03:00 Voloshanenko Igor >> >> >> >> : >> >> >> >>> >> >> >> >>> root@test:~# uname -a >> >> >> >>> Linux ix-s5 4.0.4-040004-generic #201505171336 SMP Sun May 17 >> >> >> >>> 17:37:22 >> >> >> >>> UTC >> >> >> >>> 2015 x86_64 x86_64 x86_64 GNU/Linux >> >> >> >>> >> >> >> >>> 2015-08-21 1:54 GMT+03:00 Samuel Just : >> >> >> >>>> >> >> >> >>>> Also, can you include the kernel version? >> >> >> >>>> -Sam >> >> >> >>>> >> >> >> >>>> On Thu, Aug 20, 2015 at 3:51 PM, Samuel Just >> >> >> >>>> wrote: >> >> >> >>>> > Snapshotting with cache/tiering *is* supposed to work. Can >> >> >> >>>&
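Presumably "max_object_bytes" above refers to the cache-sizing settings on the hot pool. For reference, a sketch of the usual knobs (pool name and values are placeholders, not recommendations):

  ceph osd pool set <cache-pool> target_max_bytes 750000000000    # absolute byte cap for the tier
  ceph osd pool set <cache-pool> target_max_objects 1000000       # absolute object cap
  ceph osd pool set <cache-pool> cache_target_dirty_ratio 0.4     # start flushing past 40% of target
  ceph osd pool set <cache-pool> cache_target_full_ratio 0.8      # start evicting past 80% of target

With none of these set, nothing is flushed or evicted, which matches the behaviour described above.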
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
Yeah, I'm trying to confirm that the issues did happen in writeback mode. -Sam On Thu, Aug 20, 2015 at 4:21 PM, Voloshanenko Igor wrote: > Right. But issues started... > > 2015-08-21 2:20 GMT+03:00 Samuel Just : >> >> But that was still in writeback mode, right? >> -Sam >> >> On Thu, Aug 20, 2015 at 4:18 PM, Voloshanenko Igor >> wrote: >> > WE haven't set values for max_bytes / max_objects.. and all data >> > initially >> > writes only to cache layer and not flushed at all to cold layer. >> > >> > Then we received notification from monitoring that we collect about >> > 750GB in >> > hot pool ) So i changed values for max_object_bytes to be 0,9 of disk >> > size... And then evicting/flushing started... >> > >> > And issue with snapshots arrived >> > >> > 2015-08-21 2:15 GMT+03:00 Samuel Just : >> >> >> >> Not sure what you mean by: >> >> >> >> but it's stop to work in same moment, when cache layer fulfilled with >> >> data and evict/flush started... >> >> -Sam >> >> >> >> On Thu, Aug 20, 2015 at 4:11 PM, Voloshanenko Igor >> >> wrote: >> >> > No, when we start draining cache - bad pgs was in place... >> >> > We have big rebalance (disk by disk - to change journal side on both >> >> > hot/cold layers).. All was Ok, but after 2 days - arrived scrub >> >> > errors >> >> > and 2 >> >> > pgs inconsistent... >> >> > >> >> > In writeback - yes, looks like snapshot works good. but it's stop to >> >> > work in >> >> > same moment, when cache layer fulfilled with data and evict/flush >> >> > started... >> >> > >> >> > >> >> > >> >> > 2015-08-21 2:09 GMT+03:00 Samuel Just : >> >> >> >> >> >> So you started draining the cache pool before you saw either the >> >> >> inconsistent pgs or the anomalous snap behavior? (That is, >> >> >> writeback >> >> >> mode was working correctly?) >> >> >> -Sam >> >> >> >> >> >> On Thu, Aug 20, 2015 at 4:07 PM, Voloshanenko Igor >> >> >> wrote: >> >> >> > Good joke ) >> >> >> > >> >> >> > 2015-08-21 2:06 GMT+03:00 Samuel Just : >> >> >> >> >> >> >> >> Certainly, don't reproduce this with a cluster you care about :). >> >> >> >> -Sam >> >> >> >> >> >> >> >> On Thu, Aug 20, 2015 at 4:02 PM, Samuel Just >> >> >> >> wrote: >> >> >> >> > What's supposed to happen is that the client transparently >> >> >> >> > directs >> >> >> >> > all >> >> >> >> > requests to the cache pool rather than the cold pool when there >> >> >> >> > is >> >> >> >> > a >> >> >> >> > cache pool. If the kernel is sending requests to the cold >> >> >> >> > pool, >> >> >> >> > that's probably where the bug is. Odd. It could also be a bug >> >> >> >> > specific 'forward' mode either in the client or on the osd. >> >> >> >> > Why >> >> >> >> > did >> >> >> >> > you have it in that mode? >> >> >> >> > -Sam >> >> >> >> > >> >> >> >> > On Thu, Aug 20, 2015 at 3:58 PM, Voloshanenko Igor >> >> >> >> > wrote: >> >> >> >> >> We used 4.x branch, as we have "very good" Samsung 850 pro in >> >> >> >> >> production, >> >> >> >> >> and they don;t support ncq_trim... >> >> >> >> >> >> >> >> >> >> And 4,x first branch which include exceptions for this in >> >> >> >> >> libsata.c. >> >> >> >> >> >> >> >> >> >> sure we can backport this 1 line to 3.x branch, but we prefer >> >> >> >> >> no >> >> >> >> >> to >> >> >> >> >> go >> >> >> &
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
Specifically, the snap behavior (we already know that the pgs went inconsistent while the pool was in writeback mode, right?). -Sam On Thu, Aug 20, 2015 at 4:22 PM, Samuel Just wrote: > Yeah, I'm trying to confirm that the issues did happen in writeback mode. > -Sam > > On Thu, Aug 20, 2015 at 4:21 PM, Voloshanenko Igor > wrote: >> Right. But issues started... >> >> 2015-08-21 2:20 GMT+03:00 Samuel Just : >>> >>> But that was still in writeback mode, right? >>> -Sam >>> >>> On Thu, Aug 20, 2015 at 4:18 PM, Voloshanenko Igor >>> wrote: >>> > WE haven't set values for max_bytes / max_objects.. and all data >>> > initially >>> > writes only to cache layer and not flushed at all to cold layer. >>> > >>> > Then we received notification from monitoring that we collect about >>> > 750GB in >>> > hot pool ) So i changed values for max_object_bytes to be 0,9 of disk >>> > size... And then evicting/flushing started... >>> > >>> > And issue with snapshots arrived >>> > >>> > 2015-08-21 2:15 GMT+03:00 Samuel Just : >>> >> >>> >> Not sure what you mean by: >>> >> >>> >> but it's stop to work in same moment, when cache layer fulfilled with >>> >> data and evict/flush started... >>> >> -Sam >>> >> >>> >> On Thu, Aug 20, 2015 at 4:11 PM, Voloshanenko Igor >>> >> wrote: >>> >> > No, when we start draining cache - bad pgs was in place... >>> >> > We have big rebalance (disk by disk - to change journal side on both >>> >> > hot/cold layers).. All was Ok, but after 2 days - arrived scrub >>> >> > errors >>> >> > and 2 >>> >> > pgs inconsistent... >>> >> > >>> >> > In writeback - yes, looks like snapshot works good. but it's stop to >>> >> > work in >>> >> > same moment, when cache layer fulfilled with data and evict/flush >>> >> > started... >>> >> > >>> >> > >>> >> > >>> >> > 2015-08-21 2:09 GMT+03:00 Samuel Just : >>> >> >> >>> >> >> So you started draining the cache pool before you saw either the >>> >> >> inconsistent pgs or the anomalous snap behavior? (That is, >>> >> >> writeback >>> >> >> mode was working correctly?) >>> >> >> -Sam >>> >> >> >>> >> >> On Thu, Aug 20, 2015 at 4:07 PM, Voloshanenko Igor >>> >> >> wrote: >>> >> >> > Good joke ) >>> >> >> > >>> >> >> > 2015-08-21 2:06 GMT+03:00 Samuel Just : >>> >> >> >> >>> >> >> >> Certainly, don't reproduce this with a cluster you care about :). >>> >> >> >> -Sam >>> >> >> >> >>> >> >> >> On Thu, Aug 20, 2015 at 4:02 PM, Samuel Just >>> >> >> >> wrote: >>> >> >> >> > What's supposed to happen is that the client transparently >>> >> >> >> > directs >>> >> >> >> > all >>> >> >> >> > requests to the cache pool rather than the cold pool when there >>> >> >> >> > is >>> >> >> >> > a >>> >> >> >> > cache pool. If the kernel is sending requests to the cold >>> >> >> >> > pool, >>> >> >> >> > that's probably where the bug is. Odd. It could also be a bug >>> >> >> >> > specific 'forward' mode either in the client or on the osd. >>> >> >> >> > Why >>> >> >> >> > did >>> >> >> >> > you have it in that mode? >>> >> >> >> > -Sam >>> >> >> >> > >>> >> >> >> > On Thu, Aug 20, 2015 at 3:58 PM, Voloshanenko Igor >>> >> >> >> > wrote: >>> >> >> >> >> We used 4.x branch, as we have "very good" Samsung 850 pro in >>> >> >> >> >> production, >>> >> >> >> >> and they
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
And you adjusted the journals by removing the osd, recreating it with a larger journal, and reinserting it? -Sam On Thu, Aug 20, 2015 at 4:24 PM, Voloshanenko Igor wrote: > Right ( but also was rebalancing cycle 2 day before pgs corrupted) > > 2015-08-21 2:23 GMT+03:00 Samuel Just : >> >> Specifically, the snap behavior (we already know that the pgs went >> inconsistent while the pool was in writeback mode, right?). >> -Sam >> >> On Thu, Aug 20, 2015 at 4:22 PM, Samuel Just wrote: >> > Yeah, I'm trying to confirm that the issues did happen in writeback >> > mode. >> > -Sam >> > >> > On Thu, Aug 20, 2015 at 4:21 PM, Voloshanenko Igor >> > wrote: >> >> Right. But issues started... >> >> >> >> 2015-08-21 2:20 GMT+03:00 Samuel Just : >> >>> >> >>> But that was still in writeback mode, right? >> >>> -Sam >> >>> >> >>> On Thu, Aug 20, 2015 at 4:18 PM, Voloshanenko Igor >> >>> wrote: >> >>> > WE haven't set values for max_bytes / max_objects.. and all data >> >>> > initially >> >>> > writes only to cache layer and not flushed at all to cold layer. >> >>> > >> >>> > Then we received notification from monitoring that we collect about >> >>> > 750GB in >> >>> > hot pool ) So i changed values for max_object_bytes to be 0,9 of >> >>> > disk >> >>> > size... And then evicting/flushing started... >> >>> > >> >>> > And issue with snapshots arrived >> >>> > >> >>> > 2015-08-21 2:15 GMT+03:00 Samuel Just : >> >>> >> >> >>> >> Not sure what you mean by: >> >>> >> >> >>> >> but it's stop to work in same moment, when cache layer fulfilled >> >>> >> with >> >>> >> data and evict/flush started... >> >>> >> -Sam >> >>> >> >> >>> >> On Thu, Aug 20, 2015 at 4:11 PM, Voloshanenko Igor >> >>> >> wrote: >> >>> >> > No, when we start draining cache - bad pgs was in place... >> >>> >> > We have big rebalance (disk by disk - to change journal side on >> >>> >> > both >> >>> >> > hot/cold layers).. All was Ok, but after 2 days - arrived scrub >> >>> >> > errors >> >>> >> > and 2 >> >>> >> > pgs inconsistent... >> >>> >> > >> >>> >> > In writeback - yes, looks like snapshot works good. but it's stop >> >>> >> > to >> >>> >> > work in >> >>> >> > same moment, when cache layer fulfilled with data and evict/flush >> >>> >> > started... >> >>> >> > >> >>> >> > >> >>> >> > >> >>> >> > 2015-08-21 2:09 GMT+03:00 Samuel Just : >> >>> >> >> >> >>> >> >> So you started draining the cache pool before you saw either the >> >>> >> >> inconsistent pgs or the anomalous snap behavior? (That is, >> >>> >> >> writeback >> >>> >> >> mode was working correctly?) >> >>> >> >> -Sam >> >>> >> >> >> >>> >> >> On Thu, Aug 20, 2015 at 4:07 PM, Voloshanenko Igor >> >>> >> >> wrote: >> >>> >> >> > Good joke ) >> >>> >> >> > >> >>> >> >> > 2015-08-21 2:06 GMT+03:00 Samuel Just : >> >>> >> >> >> >> >>> >> >> >> Certainly, don't reproduce this with a cluster you care about >> >>> >> >> >> :). >> >>> >> >> >> -Sam >> >>> >> >> >> >> >>> >> >> >> On Thu, Aug 20, 2015 at 4:02 PM, Samuel Just >> >>> >> >> >> >> >>> >> >> >> wrote: >> >>> >> >> >> > What's supposed to happen is that the client transparently >> >>> >> >> >> > directs >> >>> >> >> >> > all >> >>> &
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
Ok, create a ticket with a timeline and all of this information, I'll try to look into it more tomorrow. -Sam On Thu, Aug 20, 2015 at 4:25 PM, Voloshanenko Igor wrote: > Exactly > > пятница, 21 августа 2015 г. пользователь Samuel Just написал: > >> And you adjusted the journals by removing the osd, recreating it with >> a larger journal, and reinserting it? >> -Sam >> >> On Thu, Aug 20, 2015 at 4:24 PM, Voloshanenko Igor >> wrote: >> > Right ( but also was rebalancing cycle 2 day before pgs corrupted) >> > >> > 2015-08-21 2:23 GMT+03:00 Samuel Just : >> >> >> >> Specifically, the snap behavior (we already know that the pgs went >> >> inconsistent while the pool was in writeback mode, right?). >> >> -Sam >> >> >> >> On Thu, Aug 20, 2015 at 4:22 PM, Samuel Just wrote: >> >> > Yeah, I'm trying to confirm that the issues did happen in writeback >> >> > mode. >> >> > -Sam >> >> > >> >> > On Thu, Aug 20, 2015 at 4:21 PM, Voloshanenko Igor >> >> > wrote: >> >> >> Right. But issues started... >> >> >> >> >> >> 2015-08-21 2:20 GMT+03:00 Samuel Just : >> >> >>> >> >> >>> But that was still in writeback mode, right? >> >> >>> -Sam >> >> >>> >> >> >>> On Thu, Aug 20, 2015 at 4:18 PM, Voloshanenko Igor >> >> >>> wrote: >> >> >>> > WE haven't set values for max_bytes / max_objects.. and all data >> >> >>> > initially >> >> >>> > writes only to cache layer and not flushed at all to cold layer. >> >> >>> > >> >> >>> > Then we received notification from monitoring that we collect >> >> >>> > about >> >> >>> > 750GB in >> >> >>> > hot pool ) So i changed values for max_object_bytes to be 0,9 of >> >> >>> > disk >> >> >>> > size... And then evicting/flushing started... >> >> >>> > >> >> >>> > And issue with snapshots arrived >> >> >>> > >> >> >>> > 2015-08-21 2:15 GMT+03:00 Samuel Just : >> >> >>> >> >> >> >>> >> Not sure what you mean by: >> >> >>> >> >> >> >>> >> but it's stop to work in same moment, when cache layer fulfilled >> >> >>> >> with >> >> >>> >> data and evict/flush started... >> >> >>> >> -Sam >> >> >>> >> >> >> >>> >> On Thu, Aug 20, 2015 at 4:11 PM, Voloshanenko Igor >> >> >>> >> wrote: >> >> >>> >> > No, when we start draining cache - bad pgs was in place... >> >> >>> >> > We have big rebalance (disk by disk - to change journal side >> >> >>> >> > on >> >> >>> >> > both >> >> >>> >> > hot/cold layers).. All was Ok, but after 2 days - arrived >> >> >>> >> > scrub >> >> >>> >> > errors >> >> >>> >> > and 2 >> >> >>> >> > pgs inconsistent... >> >> >>> >> > >> >> >>> >> > In writeback - yes, looks like snapshot works good. but it's >> >> >>> >> > stop >> >> >>> >> > to >> >> >>> >> > work in >> >> >>> >> > same moment, when cache layer fulfilled with data and >> >> >>> >> > evict/flush >> >> >>> >> > started... >> >> >>> >> > >> >> >>> >> > >> >> >>> >> > >> >> >>> >> > 2015-08-21 2:09 GMT+03:00 Samuel Just : >> >> >>> >> >> >> >> >>> >> >> So you started draining the cache pool before you saw either >> >> >>> >> >> the >> >> >>> >> >> inconsistent pgs or the anomalous snap behavior? (That is, >> >> >>> >> >> writeback >> >> >>> >> >
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
It would help greatly if, on a disposable cluster, you could reproduce the snapshot problem with debug osd = 20 debug filestore = 20 debug ms = 1 on all of the osds and attach the logs to the bug report. That should make it easier to work out what is going on. -Sam On Thu, Aug 20, 2015 at 4:40 PM, Voloshanenko Igor wrote: > Attachment blocked, so post as text... > > root@zzz:~# cat update_osd.sh > #!/bin/bash > > ID=$1 > echo "Process OSD# ${ID}" > > DEV=`mount | grep "ceph-${ID} " | cut -d " " -f 1` > echo "OSD# ${ID} hosted on ${DEV::-1}" > > TYPE_RAW=`smartctl -a ${DEV} | grep Rota | cut -d " " -f 6` > if [ "${TYPE_RAW}" == "Solid" ] > then > TYPE="ssd" > elif [ "${TYPE_RAW}" == "7200" ] > then > TYPE="platter" > fi > > echo "OSD Type = ${TYPE}" > > HOST=`hostname` > echo "Current node hostname: ${HOST}" > > echo "Set noout option for CEPH cluster" > ceph osd set noout > > echo "Marked OSD # ${ID} out" > [19/1857] > ceph osd out ${ID} > > echo "Remove OSD # ${ID} from CRUSHMAP" > ceph osd crush remove osd.${ID} > > echo "Delete auth for OSD# ${ID}" > ceph auth del osd.${ID} > > echo "Stop OSD# ${ID}" > stop ceph-osd id=${ID} > > echo "Remove OSD # ${ID} from cluster" > ceph osd rm ${ID} > > echo "Unmount OSD# ${ID}" > umount ${DEV} > > echo "ZAP ${DEV::-1}" > ceph-disk zap ${DEV::-1} > > echo "Create new OSD with ${DEV::-1}" > ceph-disk-prepare ${DEV::-1} > > echo "Activate new OSD" > ceph-disk-activate ${DEV} > > echo "Dump current CRUSHMAP" > ceph osd getcrushmap -o cm.old > > echo "Decompile CRUSHMAP" > crushtool -d cm.old -o cm > > echo "Place new OSD in proper place" > sed -i "s/device${ID}/osd.${ID}/" cm > LINE=`cat -n cm | sed -n "/${HOST}-${TYPE} {/,/}/p" | tail -n 1 | awk > '{print $1}'` > sed -i "${LINE}iitem osd.${ID} weight 1.000" cm > > echo "Modify ${HOST} weight into CRUSHMAP" > sed -i "s/item ${HOST}-${TYPE} weight 9.000/item ${HOST}-${TYPE} weight > 1.000/" cm > > echo "Compile new CRUSHMAP" > crushtool -c cm -o cm.new > > echo "Inject new CRUSHMAP" > ceph osd setcrushmap -i cm.new > > #echo "Clean..." > #rm -rf cm cm.new > > echo "Unset noout option for CEPH cluster" > ceph osd unset noout > > echo "OSD recreated... Waiting for rebalancing..." > > 2015-08-21 2:37 GMT+03:00 Voloshanenko Igor : >> >> As i we use journal collocation for journal now (because we want to >> utilize cache layer ((( ) i use ceph-disk to create new OSD (changed journal >> size on ceph.conf). I don;t prefer manual work)) >> >> So create very simple script to update journal size >> >> 2015-08-21 2:25 GMT+03:00 Voloshanenko Igor : >>> >>> Exactly >>> >>> пятница, 21 августа 2015 г. пользователь Samuel Just написал: >>> >>>> And you adjusted the journals by removing the osd, recreating it with >>>> a larger journal, and reinserting it? >>>> -Sam >>>> >>>> On Thu, Aug 20, 2015 at 4:24 PM, Voloshanenko Igor >>>> wrote: >>>> > Right ( but also was rebalancing cycle 2 day before pgs corrupted) >>>> > >>>> > 2015-08-21 2:23 GMT+03:00 Samuel Just : >>>> >> >>>> >> Specifically, the snap behavior (we already know that the pgs went >>>> >> inconsistent while the pool was in writeback mode, right?). >>>> >> -Sam >>>> >> >>>> >> On Thu, Aug 20, 2015 at 4:22 PM, Samuel Just >>>> >> wrote: >>>> >> > Yeah, I'm trying to confirm that the issues did happen in writeback >>>> >> > mode. >>>> >> > -Sam >>>> >> > >>>> >> > On Thu, Aug 20, 2015 at 4:21 PM, Voloshanenko Igor >>>> >> > wrote: >>>> >> >> Right. But issues started... 
>>>> >> >> >>>> >> >> 2015-08-21 2:20 GMT+03:00 Samuel Just : >>>> >> >>> >>>> >> >>> But that was still in writeback mode, right? >>>> >> >>> -Sam >>>> >> >>> >>>> >> >>> On Thu, Aug 20, 2015 at 4:18
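To capture what was requested above, the debug levels can be raised at runtime and the resulting OSD logs attached to the ticket; a sketch, assuming the default log path:

  for id in $(ceph osd ls); do
      ceph tell osd.$id injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
  done
  # reproduce the snapshot problem, then attach /var/log/ceph/ceph-osd.*.log
  # and drop the levels back down the same way afterwards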
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
Odd, did you happen to capture osd logs? -Sam On Thu, Aug 20, 2015 at 8:10 PM, Ilya Dryomov wrote: > On Fri, Aug 21, 2015 at 2:02 AM, Samuel Just wrote: >> What's supposed to happen is that the client transparently directs all >> requests to the cache pool rather than the cold pool when there is a >> cache pool. If the kernel is sending requests to the cold pool, >> that's probably where the bug is. Odd. It could also be a bug >> specific 'forward' mode either in the client or on the osd. Why did >> you have it in that mode? > > I think I reproduced this on today's master. > > Setup, cache mode is writeback: > > $ ./ceph osd pool create foo 12 12 > *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH *** > pool 'foo' created > $ ./ceph osd pool create foo-hot 12 12 > *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH *** > pool 'foo-hot' created > $ ./ceph osd tier add foo foo-hot > *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH *** > pool 'foo-hot' is now (or already was) a tier of 'foo' > $ ./ceph osd tier cache-mode foo-hot writeback > *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH *** > set cache-mode for pool 'foo-hot' to writeback > $ ./ceph osd tier set-overlay foo foo-hot > *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH *** > overlay for 'foo' is now (or already was) 'foo-hot' > > Create an image: > > $ ./rbd create --size 10M --image-format 2 foo/bar > $ sudo ./rbd-fuse -p foo -c $PWD/ceph.conf /mnt > $ sudo mkfs.ext4 /mnt/bar > $ sudo umount /mnt > > Create a snapshot, take md5sum: > > $ ./rbd snap create foo/bar@snap > $ ./rbd export foo/bar /tmp/foo-1 > Exporting image: 100% complete...done. > $ ./rbd export foo/bar@snap /tmp/snap-1 > Exporting image: 100% complete...done. > $ md5sum /tmp/foo-1 > 83f5d244bb65eb19eddce0dc94bf6dda /tmp/foo-1 > $ md5sum /tmp/snap-1 > 83f5d244bb65eb19eddce0dc94bf6dda /tmp/snap-1 > > Set the cache mode to forward and do a flush, hashes don't match - the > snap is empty - we bang on the hot tier and don't get redirected to the > cold tier, I suspect: > > $ ./ceph osd tier cache-mode foo-hot forward > *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH *** > set cache-mode for pool 'foo-hot' to forward > $ ./rados -p foo-hot cache-flush-evict-all > rbd_data.100a6b8b4567.0002 > rbd_id.bar > rbd_directory > rbd_header.100a6b8b4567 > bar.rbd > rbd_data.100a6b8b4567.0001 > rbd_data.100a6b8b4567. > $ ./rados -p foo-hot cache-flush-evict-all > $ ./rbd export foo/bar /tmp/foo-2 > Exporting image: 100% complete...done. > $ ./rbd export foo/bar@snap /tmp/snap-2 > Exporting image: 100% complete...done. > $ md5sum /tmp/foo-2 > 83f5d244bb65eb19eddce0dc94bf6dda /tmp/foo-2 > $ md5sum /tmp/snap-2 > f1c9645dbc14efddc7d8a322685f26eb /tmp/snap-2 > $ od /tmp/snap-2 > 000 00 00 00 00 00 00 00 00 > * > 5000 > > Disable the cache tier and we are back to normal: > > $ ./ceph osd tier remove-overlay foo > *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH *** > there is now (or already was) no overlay for 'foo' > $ ./rbd export foo/bar /tmp/foo-3 > Exporting image: 100% complete...done. > $ ./rbd export foo/bar@snap /tmp/snap-3 > Exporting image: 100% complete...done. > $ md5sum /tmp/foo-3 > 83f5d244bb65eb19eddce0dc94bf6dda /tmp/foo-3 > $ md5sum /tmp/snap-3 > 83f5d244bb65eb19eddce0dc94bf6dda /tmp/snap-3 > > I first reproduced it with the kernel client, rbd export was just to > take it out of the equation. 
> > > Also, Igor sort of raised a question in his second message: if, after > setting the cache mode to forward and doing a flush, I open an image > (not a snapshot, so may not be related to the above) for write (e.g. > with rbd-fuse), I get an rbd header object in the hot pool, even though > it's in forward mode: > > $ sudo ./rbd-fuse -p foo -c $PWD/ceph.conf /mnt > $ sudo mount /mnt/bar /media > $ sudo umount /media > $ sudo umount /mnt > $ ./rados -p foo-hot ls > rbd_header.100a6b8b4567 > $ ./rados -p foo ls | grep rbd_header > rbd_header.100a6b8b4567 > > It's been a while since I looked into tiering, is that how it's > supposed to work? It looks like it happens because rbd_header op > replies don't redirect? > > Thanks, > > Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
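For reference, tearing the test tier back down after an experiment like the one above looks roughly like this (same pool names; prefix the commands with ./ in a vstart environment):

  rados -p foo-hot cache-flush-evict-all
  ceph osd tier remove-overlay foo
  ceph osd tier remove foo foo-hot
  ceph osd pool delete foo-hot foo-hot --yes-i-really-really-mean-it
  ceph osd pool delete foo foo --yes-i-really-really-mean-it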
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
I think I found the bug -- need to whiteout the snapset (or decache it) upon evict. http://tracker.ceph.com/issues/12748 -Sam On Fri, Aug 21, 2015 at 8:04 AM, Ilya Dryomov wrote: > On Fri, Aug 21, 2015 at 5:59 PM, Samuel Just wrote: >> Odd, did you happen to capture osd logs? > > No, but the reproducer is trivial to cut & paste. > > Thanks, > > Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Help with inconsistent pg on EC pool, v9.0.2
David, does this look familiar? -Sam On Fri, Aug 28, 2015 at 10:43 AM, Aaron Ten Clay wrote: > Hi Cephers, > > I'm trying to resolve an inconsistent pg on an erasure-coded pool, running > Ceph 9.0.2. I can't seem to get Ceph to run a repair or even deep-scrub the > pg again. Here's the background, with my attempted resolution steps below. > Hopefully someone can steer me in the right direction. Thanks in advance! > > Current state: > # ceph health detail > HEALTH_ERR 1 pgs inconsistent; 1 scrub errors; noout flag(s) set > pg 2.36 is active+clean+inconsistent, acting > [1,21,12,9,0,10,14,7,18,20,5,4,22,16] > 1 scrub errors > noout flag(s) set > > I started by looking at the log file for osd.1, where I found the cause of > the inconsistent report: > > 2015-08-24 00:43:10.391621 7f09fcff9700 0 log_channel(cluster) log [INF] : > 2.36 deep-scrub starts > 2015-08-24 01:54:59.933532 7f09fcff9700 -1 log_channel(cluster) log [ERR] : > 2.36s0 shard 21(1): soid 576340b6/1005990.0199/head//2 candidate had > a read error > 2015-08-24 02:34:41.380740 7f09fcff9700 -1 log_channel(cluster) log [ERR] : > 2.36s0 deep-scrub 0 missing, 1 inconsistent objects > 2015-08-24 02:34:41.380757 7f09fcff9700 -1 log_channel(cluster) log [ERR] : > 2.36 deep-scrub 1 errors > > I checked osd.21, where this report appears: > > 2015-08-24 01:54:56.477020 7f707cbd4700 0 osd.21 pg_epoch: 31958 pg[2.36s1( > v 31957'43013 (7132'39997,31957'43013] local-les=31951 n=34556 ec=136 les/c > 31951/31954 31945/31945/31924) [1,21,12,9,0,10,14,7,18,20,5,4,22,16] r=1 > lpr=31945 pi=1131-31944/7827 luod=0'0 crt=31957'43011 active] _scan_list > 576340b6/1005990.0199/head//2 got incorrect hash on read > > So, based upon the ceph documentation, I thought I could repair the pg by > executing "ceph pg repair 2.36". When I run this, while watching the mon > log, I see the command dispatch: > > 2015-08-28 10:14:17.964017 mon.0 [INF] from='client.? 10.42.5.61:0/1002181' > entity='client.admin' cmd=[{"prefix": "pg repair", "pgid": "2.36"}]: > dispatch > > But I never see a "finish" in the mon log, like most ceph commands return. > (Not sure if I should expect to see a finish, just noting it doesn't occur.) > > Also, tailing the logs for any OSD in the acting set for pg 2.36, I never > see anything about a repair. The same case holds when I try "ceph pg 2.36 > deep-scrub" - command dispatched, but none of the OSDs care. In the past on > other clusters, I've seen "[INF] : pg.id repair starts" messages in the OSD > log after executing "ceph pg nn.yy repair". > > Further confusing me, I do see osd.1 start and finish other pg deep-scrubs, > before and after executing "ceph pg 2.36 deep-scrub". > > I know EC pools are special in several ways, but nothing in the Ceph manual > seems to indicate I can't deep-scrub or repair pgs in an EC pool... > > Thanks for reading and any suggestions. I'm happy to provide complete log > files or more details if I've left out any information that could be > helpful. > > ceph -s: http://hastebin.com/xetohugibi > ceph pg dump: http://hastebin.com/bijehoheve > ceph -v: ceph version 9.0.2 (be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0) > ceph osd dump: http://hastebin.com/fitajuzeca > > -Aaron > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
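A few things worth checking when a requested deep-scrub or repair never appears to start; a sketch using the pgid and primary OSD from the report above (the admin-socket command has to run on the node hosting osd.1):

  ceph osd dump | grep flags                    # noscrub / nodeep-scrub would block it
  ceph pg map 2.36                              # confirm the acting primary
  ceph pg 2.36 query | grep -A5 scrub           # last scrub stamps and state
  ceph daemon osd.1 config get osd_max_scrubs   # concurrent-scrub limit on the primary
  tail -f /var/log/ceph/ceph-osd.1.log          # watch for "2.36 deep-scrub starts" after re-issuing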
Re: [ceph-users] ceph -w warning "I don't have pgid 0.2c8"?
What version are you running? How did you move the osds from 2TB to 4TB? -Sam On Wed, Jul 17, 2013 at 12:59 AM, Ta Ba Tuan wrote: > Hi everyone, > > I converted every osds from 2TB to 4TB, and when moving complete, show log > Ceph realtime"ceph -w": > displays error: "I don't have pgid 0.2c8" > > after then, I run: "ceph pg force_create_pg 0.2c8" > Ceph warning: pgmap v55175: 22944 pgs: 1 creating, 22940 active+clean, 3 > stale+active+degraded > > then, I can't read/write data to mounted CephFS on Client-side ==> notify on > client side: "Operation not permitted" > > now, "ceph -w" still notify "22944 pgs: 1 creating, 22940 active+clean, 3 > stale+active+degraded" > and I don't understand something occuring? > Please, help me!!! > > Thanks to everyone. > > --tuantaba > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph -w warning "I don't have pgid 0.2c8"?
What is the output of ceph pg dump | grep 'stale\|creating' ? On Wed, Jul 17, 2013 at 7:56 PM, Ta Ba Tuan wrote: > zombie pgs might occured when i remove some data pools. > but, with pgs in stale state, i can't delete it? > > I found this guide, but I don't understand it. > http://ceph.com/docs/next/dev/osd_internals/pg_removal/ > > Thanks! > --tuantaba > > > On 07/18/2013 09:22 AM, Ta Ba Tuan wrote: > > I'm using Ceph-0.61.4, > I removed each osds (2TB) on data hosts and re-create with disks (4TB). > When converting finish, Ceph warns that have 4 pgs in stale state and > warning: i don't have pgid > after, I created 4 pgs by command: ceph pg force_create_pg > > Now (after the long time), Ceph still warning: "pgmap v57451: 22944 pgs: 4 > creating, 22940 active+clean;" > > I don't know how to remove those pgs?. > Please guiding this error help me! > > Thank you! > --tuantaba > TA BA TUAN > > > On 07/18/2013 01:16 AM, Samuel Just wrote: > > What version are you running? How did you move the osds from 2TB to 4TB? > -Sam > > On Wed, Jul 17, 2013 at 12:59 AM, Ta Ba Tuan wrote: > > Hi everyone, > > I converted every osds from 2TB to 4TB, and when moving complete, show log > Ceph realtime"ceph -w": > displays error: "I don't have pgid 0.2c8" > > after then, I run: "ceph pg force_create_pg 0.2c8" > Ceph warning: pgmap v55175: 22944 pgs: 1 creating, 22940 active+clean, 3 > stale+active+degraded > > then, I can't read/write data to mounted CephFS on Client-side ==> notify on > client side: "Operation not permitted" > > now, "ceph -w" still notify "22944 pgs: 1 creating, 22940 active+clean, 3 > stale+active+degraded" > and I don't understand something occuring? > Please, help me!!! > > Thanks to everyone. > > --tuantaba > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
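Alongside that output, a couple of read-only commands usually help narrow down where a stale or creating PG thinks it lives; a sketch using the pgid from the warning above:

  ceph health detail | grep -E 'stale|creating'
  ceph pg dump_stuck stale
  ceph pg 0.2c8 query     # which OSDs the PG maps to and its peering state (may hang if the PG has no home)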
Re: [ceph-users] how to repair laggy storage cluster
Can you attach the output of ceph -s? -Sam On Fri, Aug 9, 2013 at 11:10 AM, Suresh Sadhu wrote: > How can I repair a laggy storage cluster? I am able to create images on the pools even > though the HEALTH state shows WARN: > > > > sudo ceph > > HEALTH_WARN 181 pgs degraded; 676 pgs stuck unclean; recovery 2/107 degraded > (1.869%); mds ceph@ubuntu3 is laggy > > > > Regards > > Sadhu > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
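For context, the usual first-pass status commands for a report like this (the output requested above; the laggy MDS shows up in the mds line of ceph -s) would be something like:

    ceph -s               # overall cluster status, including pg and mds summaries
    ceph health detail    # expands the degraded/stuck-unclean pg counts
    ceph mds stat         # shows which MDS is active and whether it is laggy
    ceph osd tree         # confirms all OSDs are up and in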
Re: [ceph-users] OSD Keep Crashing
Can you post more of the log? There should be a line towards the bottom indicating the line with the failed assert. Can you also attach ceph pg dump, ceph osd dump, ceph osd tree? -Sam On Mon, Aug 12, 2013 at 11:54 AM, John Wilkins wrote: > Stephane, > > You should post any crash bugs with stack trace to ceph-devel > ceph-de...@vger.kernel.org. > > > On Mon, Aug 12, 2013 at 9:02 AM, Stephane Boisvert < > stephane.boisv...@gameloft.com> wrote: > >> Hi, >> It seems my OSD processes keep crashing randomly and I don't know >> why. It seems to happens when the cluster is trying to re-balance... In >> normal usange I didn't notice any crash like that. >> >> We running ceph 0.61.7 on an up to date ubuntu 12.04 (all packages >> including kernel are current). >> >> >> Anyone have an idea ? >> >> >> TRACE: >> >> >> ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff) >> 1: /usr/bin/ceph-osd() [0x79219a] >> 2: (()+0xfcb0) [0x7fd692da1cb0] >> 3: (gsignal()+0x35) [0x7fd69155a425] >> 4: (abort()+0x17b) [0x7fd69155db8b] >> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd691eac69d] >> 6: (()+0xb5846) [0x7fd691eaa846] >> 7: (()+0xb5873) [0x7fd691eaa873] >> 8: (()+0xb596e) [0x7fd691eaa96e] >> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char >> const*)+0x1df) [0x84303f] >> 10: >> (PG::RecoveryState::Recovered::Recovered(boost::statechart::state> PG::RecoveryState::Active, boost::mpl::list> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, >> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, >> mpl_::na, mpl_::na, mpl_::na>, >> (boost::statechart::history_mode)0>::my_context)+0x38f) [0x6d932f] >> 11: (boost::statechart::state> PG::RecoveryState::Active, boost::mpl::list> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, >> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, >> mpl_::na, mpl_::na, mpl_::na>, >> (boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr >> const&, >> boost::statechart::state_machine> PG::RecoveryState::Initial, std::allocator, >> boost::statechart::null_exception_translator>&)+0x5c) [0x6f270c] >> 12: (PG::RecoveryState::Recovering::react(PG::AllReplicasRecovered >> const&)+0xb4) [0x6d9454] >> 13: (boost::statechart::simple_state> PG::RecoveryState::Active, boost::mpl::list> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, >> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, >> mpl_::na, mpl_::na, mpl_::na>, >> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base >> const&, void const*)+0xda) [0x6f296a] >> 14: >> (boost::statechart::state_machine> PG::RecoveryState::Initial, std::allocator, >> boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base >> const&)+0x5b) [0x6e320b] >> 15: >> (boost::statechart::state_machine> PG::RecoveryState::Initial, std::allocator, >> boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base >> const&)+0x11) [0x6e34e1] >> 16: (PG::handle_peering_event(std::tr1::shared_ptr, >> PG::RecoveryCtx*)+0x347) [0x69aaf7] >> 17: (OSD::process_peering_events(std::list > >> const&, ThreadPool::TPHandle&)+0x2f5) [0x632fc5] >> 18: (OSD::PeeringWQ::_process(std::list > >> const&, ThreadPool::TPHandle&)+0x12) [0x66e2d2] >> 19: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x838476] >> 20: (ThreadPool::WorkThread::entry()+0x10) [0x83a2a0] >> 21: (()+0x7e9a) [0x7fd692d99e9a] >> 22: (clone()+0x6d) 
[0x7fd691617ccd] >> NOTE: a copy of the executable, or `objdump -rdS ` is needed >> to interpret this. >> >> --- begin dump of recent events --- >> -3> 2013-08-12 15:58:15.561005 7fd683d78700 1 -- >> 10.136.48.18:6814/21240 <== osd.56 10.136.48.14:0/17437 44 >> osd_ping(ping e8959 stamp 2013-08-12 15:58:15.556022) v2 47+0+0 >> (355096560 0 0) 0xc4e81c0 con 0x12fbeb00 >> -2> 2013-08-12 15:58:15.561038 7fd683d78700 1 -- >> 10.136.48.18:6814/21240 --> 10.136.48.14:0/17437 -- osd_ping(ping_reply >> e8959 stamp 2013-08-12 15:58:15.556022) v2 -- ?+0 0x1683ec40 con 0x12fbeb00 >> -1> 2013-08-12 15:58:15.568600 7fd67e56d700 1 -- >> 10.136.48.18:6813/21240 --> osd.44 10.136.48.15:6820/25671 -- >> osd_sub_op(osd.20.0:1293 25.328 >> 699ac328/rbd_data.ae2732ae8944a.00240828/head//25 [push] v 8424'11 >> snapset=0=[]:[] snapc=0=[]) v7 -- ?+0 0x2df0f400 >> 0> 2013-08-12 15:58:15.581608 7fd681d74700 -1 *** Caught signal >> (Aborted) ** >> in thread 7fd681d74700 >> >> ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff) >> 1: /usr/bin/ceph-osd() [0x79219a] >> 2: (()+0xfcb0) [0x7fd692da1cb0] >> 3: (gsignal()+0x35) [0x7fd69155a425] >> 4: (abort()+0x17b) [0x7fd69155db8b] >> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd691eac69d] >> 6: (()+0xb5846) [0x7fd691eaa846] >> 7: (()+0xb5873) [0x7fd691eaa873] >> 8: (()+0xb596e) [0x7fd691eaa96e] >> 9:
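A minimal sketch of how the requested information is usually collected (output file names are arbitrary, and <id> stands for the id of the crashing OSD; the failed-assert line normally sits near the end of that OSD's log under /var/log/ceph):

    # capture the cluster maps to attach to the report
    ceph pg dump  > /tmp/pg_dump.txt
    ceph osd dump > /tmp/osd_dump.txt
    ceph osd tree > /tmp/osd_tree.txt
    # pull the failed assert and surrounding context from the crashed OSD's log
    grep -B 5 -A 30 'FAILED assert' /var/log/ceph/ceph-osd.<id>.log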
Re: [ceph-users] ceph-deploy and journal on separate disk
Did you try using ceph-deploy disk zap ceph001:sdaa first? -Sam On Mon, Aug 12, 2013 at 6:21 AM, Pavel Timoschenkov wrote: > Hi. > > I have some problems with create journal on separate disk, using ceph-deploy > osd prepare command. > > When I try execute next command: > > ceph-deploy osd prepare ceph001:sdaa:sda1 > > where: > > sdaa – disk for ceph data > > sda1 – partition on ssd drive for journal > > I get next errors: > > > > ceph@ceph-admin:~$ ceph-deploy osd prepare ceph001:sdaa:sda1 > > ceph-disk-prepare -- /dev/sdaa /dev/sda1 returned 1 > > Information: Moved requested sector from 34 to 2048 in > > order to align on 2048-sector boundaries. > > The operation has completed successfully. > > meta-data=/dev/sdaa1 isize=2048 agcount=32, agsize=22892700 > blks > > = sectsz=512 attr=2, projid32bit=0 > > data = bsize=4096 blocks=732566385, imaxpct=5 > > = sunit=0 swidth=0 blks > > naming =version 2 bsize=4096 ascii-ci=0 > > log =internal log bsize=4096 blocks=357698, version=2 > > = sectsz=512 sunit=0 blks, lazy-count=1 > > realtime =none extsz=4096 blocks=0, rtextents=0 > > > > WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the same > device as the osd data > > mount: /dev/sdaa1: more filesystems detected. This should not happen, > >use -t to explicitly specify the filesystem type or > >use wipefs(8) to clean up the device. > > > > mount: you must specify the filesystem type > > ceph-disk: Mounting filesystem failed: Command '['mount', '-o', 'noatime', > '--', '/dev/sdaa1', '/var/lib/ceph/tmp/mnt.ek6mog']' returned non-zero exit > status 32 > > > > Someone had a similar problem? > > Thanks for the help > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
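A sketch of the suggested sequence (hostname and device names are taken from the message above; zapping destroys everything on the disk, so double-check the device first, and note that wipefs has to be run on the OSD host itself rather than through ceph-deploy):

    # from the admin node: wipe old partition tables and signatures from the data disk
    ceph-deploy disk zap ceph001:sdaa
    # on ceph001, if mount still reports "more filesystems detected" on the partition,
    # clear the stale filesystem signatures it is complaining about
    wipefs -a /dev/sdaa1
    # then retry preparing the OSD with the journal on the SSD partition
    ceph-deploy osd prepare ceph001:sdaa:sda1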
Re: [ceph-users] mounting a pool via fuse
Can you elaborate on what behavior you are looking for? -Sam On Fri, Aug 9, 2013 at 4:37 AM, Georg Höllrigl wrote: > Hi, > > I'm using ceph 0.61.7. > > When using ceph-fuse, I couldn't find a way, to only mount one pool. > > Is there a way to mount a pool - or is it simply not supported? > > > > Kind Regards, > Georg > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
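For readers with the same question: a pool cannot be mounted directly, because ceph-fuse mounts the CephFS directory tree rather than a pool, but two related things are possible. A hedged sketch (mount point, directory name and pool id are made up for illustration; option names per the ceph-fuse(8) and cephfs(8) man pages of that era, and older set_layout versions want the full layout given alongside the pool):

    # mount only a subtree of the filesystem instead of the root
    ceph-fuse -r /only/this/subdir /mnt/ceph
    # or direct new files under a directory into a specific data pool
    # (the pool may first need to be added with: ceph mds add_data_pool <pool-id>)
    cephfs /mnt/ceph/somedir set_layout -p <pool-id> -u 4194304 -c 1 -s 4194304

The layout change only affects files created after it is set; existing files stay in their original pool.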
Re: [ceph-users] pgs stuck unclean -- how to fix? (fwd)
Can you attach the output of ceph osd tree? Also, can you run ceph osd getmap -o /tmp/osdmap and attach /tmp/osdmap? -Sam On Fri, Aug 9, 2013 at 4:28 AM, Jeff Moskow wrote: > Thanks for the suggestion. I had tried stopping each OSD for 30 seconds, > then restarting it, waiting 2 minutes and then doing the next one (all OSD's > eventually restarted). I tried this twice. > > -- > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
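For reference, the requested items can be gathered as below, and the saved osdmap can also be inspected locally with osdmaptool before attaching it (output paths are arbitrary):

    ceph osd tree > /tmp/osd_tree.txt
    ceph osd getmap -o /tmp/osdmap
    # optional sanity check of the binary map
    osdmaptool --print /tmp/osdmap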
Re: [ceph-users] run ceph without auth
I have referred you to someone more conversant with the details of mkcephfs, but for dev purposes, most of us use the vstart.sh script in src/ (http://ceph.com/docs/master/dev/). -Sam On Fri, Aug 9, 2013 at 2:59 AM, Nulik Nol wrote: > Hi, > I am configuring a single node for developing purposes, but ceph asks > me for keyring. Here is what I do: > > [root@localhost ~]# mkcephfs -c /usr/local/etc/ceph/ceph.conf > --prepare-monmap -d /tmp/foo > preparing monmap in /tmp/foo/monmap > /usr/local/bin/monmaptool --create --clobber --add a 127.0.0.1:6789 > --print /tmp/foo/monmap > /usr/local/bin/monmaptool: monmap file /tmp/foo/monmap > /usr/local/bin/monmaptool: generated fsid 7bd045a6-ca45-4f12-b9f3-e0c76718859a > epoch 0 > fsid 7bd045a6-ca45-4f12-b9f3-e0c76718859a > last_changed 2013-08-09 04:51:06.921996 > created 2013-08-09 04:51:06.921996 > 0: 127.0.0.1:6789/0 mon.a > /usr/local/bin/monmaptool: writing epoch 0 to /tmp/foo/monmap (1 monitors) > \nWARNING: mkcephfs is now deprecated in favour of ceph-deploy. Please > see: \n http://github.com/ceph/ceph-deploy > [root@localhost ~]# mkcephfs --init-local-daemons osd -d /tmp/foo > \nWARNING: mkcephfs is now deprecated in favour of ceph-deploy. Please > see: \n http://github.com/ceph/ceph-deploy > [root@localhost ~]# mkcephfs --init-local-daemons mds -d /tmp/foo > \nWARNING: mkcephfs is now deprecated in favour of ceph-deploy. Please > see: \n http://github.com/ceph/ceph-deploy > [root@localhost ~]# mkcephfs --prepare-mon -d /tmp/foo > Building generic osdmap from /tmp/foo/conf > /usr/local/bin/osdmaptool: osdmap file '/tmp/foo/osdmap' > /usr/local/bin/osdmaptool: writing epoch 1 to /tmp/foo/osdmap > Generating admin key at /tmp/foo/keyring.admin > creating /tmp/foo/keyring.admin > Building initial monitor keyring > cat: /tmp/foo/key.*: No such file or directory > \nWARNING: mkcephfs is now deprecated in favour of ceph-deploy. Please > see: \n http://github.com/ceph/ceph-deploy > [root@localhost ~]# > > How can I tell ceph to do not use keyring ? > > This is my config file: > > [global] > auth cluster required = none > auth service required = none > auth client required = none > debug filestore = 20 > [mon] > mon data = /data/mon > > [mon.a] > host = s1 > mon addr = 127.0.0.1:6789 > > [osd] > osd journal size = 1000 > filestore_xattr_use_omap = true > > [osd.0] > host = s1 > osd data = /data/osd/osd1 > osd mkfs type = bttr > osd journal = /data/journal/log > devs = /dev/loop0 > > [mds.a] > host = s1 > > > TIA > Nulik > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
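For anyone else doing single-node development, a rough sketch of the vstart.sh workflow (run from a built source tree; flag behaviour has changed across versions, so check ./vstart.sh -h: on some versions cephx is off unless enabled with -x, while newer ones default it on and disable it with -X):

    cd src
    ./vstart.sh -d -n        # -d debug output, -n create a brand new dev cluster
    ./ceph -c ceph.conf -s   # talk to the dev cluster using the freshly generated conf
    ./stop.sh                # tear the dev cluster down again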
Re: [ceph-users] pgs stuck unclean -- how to fix? (fwd)
Are you using any kernel clients? Will osds 3,14,16 be coming back? -Sam On Mon, Aug 12, 2013 at 2:26 PM, Jeff Moskow wrote: > Sam, > > I've attached both files. > > Thanks! > Jeff > > On Mon, Aug 12, 2013 at 01:46:57PM -0700, Samuel Just wrote: >> Can you attach the output of ceph osd tree? >> >> Also, can you run >> >> ceph osd getmap -o /tmp/osdmap >> >> and attach /tmp/osdmap? >> -Sam >> >> On Fri, Aug 9, 2013 at 4:28 AM, Jeff Moskow wrote: >> > Thanks for the suggestion. I had tried stopping each OSD for 30 seconds, >> > then restarting it, waiting 2 minutes and then doing the next one (all >> > OSD's >> > eventually restarted). I tried this twice. >> > >> > -- >> > >> > ___ >> > ceph-users mailing list >> > ceph-users@lists.ceph.com >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to set Object Size/Stripe Width/Stripe Count?
I think the docs you are looking for are http://ceph.com/docs/master/man/8/cephfs/ (specifically the set_layout command). -Sam On Thu, Aug 8, 2013 at 7:48 AM, Da Chun wrote: > Hi list, > I saw the info about data striping in > http://ceph.com/docs/master/architecture/#data-striping . > But couldn't find the way to set these values. > > Could you please tell me how to that or give me a link? Thanks! > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
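An illustrative invocation, in case it helps (the path and numbers are made up; option names per the cephfs(8) man page, the object size must be a multiple of the stripe unit, and the new layout only applies to files created afterwards):

    # 1 MB stripe unit, 4 stripes per object, 4 MB objects
    cephfs /mnt/ceph/mydir set_layout -u 1048576 -c 4 -s 4194304
    cephfs /mnt/ceph/mydir show_layout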
Re: [ceph-users] Ceph pgs stuck unclean
Can you attach the output of: ceph -s ceph pg dump ceph osd dump and run ceph osd getmap -o /tmp/osdmap and attach /tmp/osdmap/ -Sam On Wed, Aug 7, 2013 at 1:58 AM, Howarth, Chris wrote: > Hi, > > One of our OSD disks failed on a cluster and I replaced it, but when it > failed it did not completely recover and I have a number of pgs which are > stuck unclean: > > > > # ceph health detail > > HEALTH_WARN 7 pgs stuck unclean > > pg 3.5a is stuck unclean for 335339.172516, current state active, last > acting [5,4] > > pg 3.54 is stuck unclean for 335339.157608, current state active, last > acting [15,7] > > pg 3.55 is stuck unclean for 335339.167154, current state active, last > acting [16,9] > > pg 3.1c is stuck unclean for 335339.174150, current state active, last > acting [8,16] > > pg 3.a is stuck unclean for 335339.177001, current state active, last acting > [0,8] > > pg 3.4 is stuck unclean for 335339.165377, current state active, last acting > [17,4] > > pg 3.5 is stuck unclean for 335339.149507, current state active, last acting > [2,6] > > > > Does anyone know how to fix these ? I tried the following, but this does not > seem to work: > > > > # ceph pg 3.5 mark_unfound_lost revert > > pg has no unfound objects > > > > thanks > > > > Chris > > __ > > Chris Howarth > > OS Platforms Engineering > > Citi Architecture & Technology Engineering > > (e) chris.howa...@citi.com > > (t) +44 (0) 20 7508 3848 > > (f) +44 (0) 20 7508 0964 > > (mail-drop) CGC-06-3A > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
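Besides the attachments requested above, the per-pg view is often the quickest way to see why these seven pgs are stuck (pg ids taken from the health detail output in the quoted message):

    ceph pg dump_stuck unclean
    # compare the "up" and "acting" sets and look at recovery_state
    ceph pg 3.5a query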
Re: [ceph-users] could not generate the bootstrap key
Can you give a step by step account of what you did prior to the error? -Sam On Tue, Aug 6, 2013 at 10:52 PM, 於秀珠 wrote: > I am using ceph-deploy to manage an existing cluster and I followed the steps in the > documentation, but I get some errors and cannot gather the keys. > When I run the command "ceph-deploy gatherkeys PS-16", the logs show the following: > > 2013-08-07 10:14:08,579 ceph_deploy.gatherkeys DEBUG Have > ceph.client.admin.keyring > 2013-08-07 10:14:08,579 ceph_deploy.gatherkeys DEBUG Checking PS-16 for > /var/lib/ceph/mon/ceph-{hostname}/keyring > 2013-08-07 10:14:08,674 ceph_deploy.gatherkeys DEBUG Got ceph.mon.keyring > key from PS-16. > 2013-08-07 10:14:08,674 ceph_deploy.gatherkeys DEBUG Checking PS-16 for > /var/lib/ceph/bootstrap-osd/ceph.keyring > 2013-08-07 10:14:08,774 ceph_deploy.gatherkeys WARNING Unable to find > /var/lib/ceph/bootstrap-osd/ceph.keyring on ['PS-16'] > 2013-08-07 10:14:08,774 ceph_deploy.gatherkeys DEBUG Checking PS-16 for > /var/lib/ceph/bootstrap-mds/ceph.keyring > 2013-08-07 10:14:08,874 ceph_deploy.gatherkeys WARNING Unable to find > /var/lib/ceph/bootstrap-mds/ceph.keyring on ['PS-16'] > > > I also tried to deploy a new ceph cluster and hit the same problem: when I create > the mon and then gather the keys, I still cannot gather the bootstrap > keys. > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
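If it turns out the bootstrap keyrings were simply never created on the monitor host, one hedged workaround (the capability profiles below are the ones ceph-create-keys normally sets up; paths come from the log above) is to generate them by hand on PS-16 and then re-run gatherkeys:

    # on the monitor host PS-16
    # (ceph-create-keys --id <mon-id> is the tool that normally creates these)
    mkdir -p /var/lib/ceph/bootstrap-osd /var/lib/ceph/bootstrap-mds
    ceph auth get-or-create client.bootstrap-osd mon 'allow profile bootstrap-osd' \
        -o /var/lib/ceph/bootstrap-osd/ceph.keyring
    ceph auth get-or-create client.bootstrap-mds mon 'allow profile bootstrap-mds' \
        -o /var/lib/ceph/bootstrap-mds/ceph.keyring
    # then, from the admin node
    ceph-deploy gatherkeys PS-16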
Re: [ceph-users] pgs stuck unclean -- how to fix? (fwd)
Ok, your best bet is to remove osds 3,14,16: ceph auth del osd.3 ceph osd crush rm osd.3 ceph osd rm osd.3 for each of them. Each osd you remove may cause some data re balancing, so you should be ready for that. -Sam On Mon, Aug 12, 2013 at 3:01 PM, Jeff Moskow wrote: > Sam, > > 3, 14 and 16 have been down for a while and I'll eventually replace > those drives (I could do it now) > but didn't want to introduce more variables. > > We are using RBD with Proxmox, so I think the answer about kernel > clients is yes > > Jeff > > On Mon, Aug 12, 2013 at 02:41:11PM -0700, Samuel Just wrote: >> Are you using any kernel clients? Will osds 3,14,16 be coming back? >> -Sam >> >> On Mon, Aug 12, 2013 at 2:26 PM, Jeff Moskow wrote: >> > Sam, >> > >> > I've attached both files. >> > >> > Thanks! >> > Jeff >> > >> > On Mon, Aug 12, 2013 at 01:46:57PM -0700, Samuel Just wrote: >> >> Can you attach the output of ceph osd tree? >> >> >> >> Also, can you run >> >> >> >> ceph osd getmap -o /tmp/osdmap >> >> >> >> and attach /tmp/osdmap? >> >> -Sam >> >> >> >> On Fri, Aug 9, 2013 at 4:28 AM, Jeff Moskow wrote: >> >> > Thanks for the suggestion. I had tried stopping each OSD for 30 >> >> > seconds, >> >> > then restarting it, waiting 2 minutes and then doing the next one (all >> >> > OSD's >> >> > eventually restarted). I tried this twice. >> >> > >> >> > -- >> >> > >> >> > ___ >> >> > ceph-users mailing list >> >> > ceph-users@lists.ceph.com >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > >> > -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
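Since the same three commands apply to each OSD, a small shell loop covers all of them; expect rebalancing to begin as soon as each osd is removed from the crush map, as noted above:

    for id in 3 14 16; do
        ceph auth del osd.$id
        ceph osd crush rm osd.$id
        ceph osd rm osd.$id
    done
    # watch the resulting rebalancing
    ceph -w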
Re: [ceph-users] one pg stuck with 2 unfound pieces
You can run 'ceph pg 0.cfa mark_unfound_lost revert'. (Revert Lost section of http://ceph.com/docs/master/rados/operations/placement-groups/). -Sam On Tue, Aug 13, 2013 at 6:50 AM, Jens-Christian Fischer wrote: > We have a cluster with 10 servers, 64 OSDs and 5 Mons on them. The OSDs are > 3TB disk, formatted with btrfs and the servers are either on Ubuntu 12.10 or > 13.04. > > Recently one of the servers (13.04) stood still (due to problems with btrfs > - something we have seen a few times). I decided to not try to recover the > disks, but reformat them with XFS. I removed the OSDs, reformatted, and > re-created them (they got the same OSD numbers) > > I redid this twice (because I wrongly partioned the disks in the first > place) and I ended up with 2 unfound "pieces" in one pg: > > root@s2:~# ceph health details > HEALTH_WARN 1 pgs degraded; 1 pgs recovering; 1 pgs stuck unclean; recovery > 4448/28915270 degraded (0.015%); 2/9854766 unfound (0.000%) > pg 0.cfa is stuck unclean for 1004252.309704, current state > active+recovering+degraded+remapped, last acting [23,50] > pg 0.cfa is active+recovering+degraded+remapped, acting [23,50], 2 unfound > recovery 4448/28915270 degraded (0.015%); 2/9854766 unfound (0.000%) > > > root@s2:~# ceph pg 0.cfa query > > { "state": "active+recovering+degraded+remapped", > "epoch": 28197, > "up": [ > 23, > 50, > 18], > "acting": [ > 23, > 50], > "info": { "pgid": "0.cfa", > "last_update": "28082'7774", > "last_complete": "23686'7083", > "log_tail": "14360'4061", > "last_backfill": "MAX", > "purged_snaps": "[]", > "history": { "epoch_created": 1, > "last_epoch_started": 28197, > "last_epoch_clean": 24810, > "last_epoch_split": 0, > "same_up_since": 28195, > "same_interval_since": 28196, > "same_primary_since": 26036, > "last_scrub": "20585'6801", > "last_scrub_stamp": "2013-07-28 15:40:53.298786", > "last_deep_scrub": "20585'6801", > "last_deep_scrub_stamp": "2013-07-28 15:40:53.298786", > "last_clean_scrub_stamp": "2013-07-28 15:40:53.298786"}, > "stats": { "version": "28082'7774", > "reported": "28197'41950", > "state": "active+recovering+degraded+remapped", > "last_fresh": "2013-08-13 14:34:33.057271", > "last_change": "2013-08-13 14:34:33.057271", > "last_active": "2013-08-13 14:34:33.057271", > "last_clean": "2013-08-01 23:50:18.414082", > "last_became_active": "2013-05-29 13:10:51.366237", > "last_unstale": "2013-08-13 14:34:33.057271", > "mapping_epoch": 28195, > "log_start": "14360'4061", > "ondisk_log_start": "14360'4061", > "created": 1, > "last_epoch_clean": 24810, > "parent": "0.0", > "parent_split_bits": 0, > "last_scrub": "20585'6801", > "last_scrub_stamp": "2013-07-28 15:40:53.298786", > "last_deep_scrub": "20585'6801", > "last_deep_scrub_stamp": "2013-07-28 15:40:53.298786", > "last_clean_scrub_stamp": "2013-07-28 15:40:53.298786", > "log_size": 0, > "ondisk_log_size": 0, > "stats_invalid": "0", > "stat_sum": { "num_bytes": 145307402, > "num_objects": 2234, > "num_object_clones": 0, > "num_object_copies": 0, > "num_objects_missing_on_primary": 0, > "num_objects_degraded": 0, > "num_objects_unfound": 0, > "num_read": 744, > "num_read_kb": 410184, > "num_write": 7774, > "num_write_kb": 1155438, > "num_scrub_errors": 0, > "num_shallow_scrub_errors": 0, > "num_deep_scrub_errors": 0, > "num_objects_recovered": 3998, > "num_bytes_recovered": 278803622, > "num_keys_recovered": 0}, > "stat_cat_sum": {}, > "up": [ > 23, > 50, > 18], > "acting": [ > 23, > 50]}, > "empty": 0, > "dne": 0, > "incomplete": 0, > "last_epoch_started": 28197}, > 
"recovery_state": [ > { "name": "Started\/Primary\/Active", > "enter_time": "2013-08-13 14:34:33.026698", > "might_have_unfound": [ > { "osd": 9, > "status": "querying"}, > { "osd": 18, > "status": "querying"}, > { "osd": 50, > "status": "already probed"}], > "recovery_progress": { "backfill_target": 50, > "waiting_on_backfill": 0, > "backfill_pos": "96220cfa\/1799e82.\/head\/\/0", > "backfill_info": { "begin": "0\/\/0\/\/-1", > "end": "0
Re: [ceph-users] Ceph pgs stuck unclean
"num_deep_scrub_errors": 0, > "num_objects_recovered": 45, > "num_bytes_recovered": 188743680, > "num_keys_recovered": 0}, > "stat_cat_sum": {}, > "up": [ > 5, > 4], > "acting": [ > 5, > 4]}, > "empty": 0, > "dne": 0, > "incomplete": 0, > "last_epoch_started": 644}, > "recovery_state": [ > { "name": "Started\/Primary\/Active", > "enter_time": "2013-08-02 09:49:56.504882", > "might_have_unfound": [], > "recovery_progress": { "backfill_target": -1, > "waiting_on_backfill": 0, > "backfill_pos": "0\/\/0\/\/-1", > "backfill_info": { "begin": "0\/\/0\/\/-1", > "end": "0\/\/0\/\/-1", > "objects": []}, > "peer_backfill_info": { "begin": "0\/\/0\/\/-1", > "end": "0\/\/0\/\/-1", > "objects": []}, > "backfills_in_flight": [], > "pull_from_peer": [], > "pushing": []}, > "scrub": { "scrubber.epoch_start": "0", > "scrubber.active": 0, > "scrubber.block_writes": 0, > "scrubber.finalizing": 0, > "scrubber.waiting_on": 0, > "scrubber.waiting_on_whom": []}}, > { "name": "Started", > "enter_time": "2013-08-02 09:49:55.501261"}]} > > -Original Message- > From: Samuel Just [mailto:sam.j...@inktank.com] > Sent: 12 August 2013 22:52 > To: Howarth, Chris [CCC-OT_IT] > Cc: ceph-us...@ceph.com > Subject: Re: [ceph-users] Ceph pgs stuck unclean > > Can you attach the output of: > > ceph -s > ceph pg dump > ceph osd dump > > and run > > ceph osd getmap -o /tmp/osdmap > > and attach /tmp/osdmap/ > -Sam > > On Wed, Aug 7, 2013 at 1:58 AM, Howarth, Chris wrote: >> Hi, >> >> One of our OSD disks failed on a cluster and I replaced it, but >> when it failed it did not completely recover and I have a number of >> pgs which are stuck unclean: >> >> >> >> # ceph health detail >> >> HEALTH_WARN 7 pgs stuck unclean >> >> pg 3.5a is stuck unclean for 335339.172516, current state active, last >> acting [5,4] >> >> pg 3.54 is stuck unclean for 335339.157608, current state active, last >> acting [15,7] >> >> pg 3.55 is stuck unclean for 335339.167154, current state active, last >> acting [16,9] >> >> pg 3.1c is stuck unclean for 335339.174150, current state active, last >> acting [8,16] >> >> pg 3.a is stuck unclean for 335339.177001, current state active, last >> acting [0,8] >> >> pg 3.4 is stuck unclean for 335339.165377, current state active, last >> acting [17,4] >> >> pg 3.5 is stuck unclean for 335339.149507, current state active, last >> acting [2,6] >> >> >> >> Does anyone know how to fix these ? I tried the following, but this >> does not seem to work: >> >> >> >> # ceph pg 3.5 mark_unfound_lost revert >> >> pg has no unfound objects >> >> >> >> thanks >> >> >> >> Chris >> >> __ >> >> Chris Howarth >> >> OS Platforms Engineering >> >> Citi Architecture & Technology Engineering >> >> (e) chris.howa...@citi.com >> >> (t) +44 (0) 20 7508 3848 >> >> (f) +44 (0) 20 7508 0964 >> >> (mail-drop) CGC-06-3A >> >> >> >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] pgs stuck unclean -- how to fix? (fwd)
Cool! -Sam On Tue, Aug 13, 2013 at 4:49 AM, Jeff Moskow wrote: > Sam, > > Thanks that did it :-) > >health HEALTH_OK >monmap e17: 5 mons at > {a=172.16.170.1:6789/0,b=172.16.170.2:6789/0,c=172.16.170.3:6789/0,d=172.16.170.4:6789/0,e=172.16.170.5:6789/0}, > election epoch 9794, quorum 0,1,2,3,4 a,b,c,d,e >osdmap e23445: 14 osds: 13 up, 13 in > pgmap v13552855: 2102 pgs: 2102 active+clean; 531 GB data, 1564 GB used, > 9350 GB / 10914 GB avail; 13104KB/s rd, 4007KB/s wr, 560op/s >mdsmap e3: 0/0/1 up > > > -- > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com