Re: [ceph-users] Cephfs upon Tiering

2014-09-15 Thread Berant Lemmenes
Greg,

So is the consensus that the appropriate way to implement this scenario is
to have the fs created on the EC backing pool rather than the cache pool, but
that the UI check needs to be tweaked to distinguish between this scenario and
just trying to use an EC pool alone?

I'm also interested in the scenario of having an EC-backed pool fronted by a
replicated cache for use with cephfs.

Thanks,
Berant

On Fri, Sep 12, 2014 at 12:37 PM, Gregory Farnum  wrote:

> On Fri, Sep 12, 2014 at 1:53 AM, Kenneth Waegeman <
> kenneth.waege...@ugent.be> wrote:
> >
> > - Message from Sage Weil  -
> >Date: Thu, 11 Sep 2014 14:10:46 -0700 (PDT)
> >From: Sage Weil 
> > Subject: Re: [ceph-users] Cephfs upon Tiering
> >  To: Gregory Farnum 
> >  Cc: Kenneth Waegeman , ceph-users
> > 
> >
> >
> >
> >> On Thu, 11 Sep 2014, Gregory Farnum wrote:
> >>>
> >>> On Thu, Sep 11, 2014 at 11:39 AM, Sage Weil  wrote:
> >>> > On Thu, 11 Sep 2014, Gregory Farnum wrote:
> >>> >> On Thu, Sep 11, 2014 at 4:13 AM, Kenneth Waegeman
> >>> >>  wrote:
> >>> >> > Hi all,
> >>> >> >
> >>> >> > I am testing the tiering functionality with cephfs. I used a
> >>> >> > replicated
> >>> >> > cache with an EC data pool, and a replicated metadata pool like
> >>> >> > this:
> >>> >> >
> >>> >> >
> >>> >> > ceph osd pool create cache 1024 1024
> >>> >> > ceph osd pool set cache size 2
> >>> >> > ceph osd pool set cache min_size 1
> >>> >> > ceph osd erasure-code-profile set profile11 k=8 m=3
> >>> >> > ruleset-failure-domain=osd
> >>> >> > ceph osd pool create ecdata 128 128 erasure profile11
> >>> >> > ceph osd tier add ecdata cache
> >>> >> > ceph osd tier cache-mode cache writeback
> >>> >> > ceph osd tier set-overlay ecdata cache
> >>> >> > ceph osd pool set cache hit_set_type bloom
> >>> >> > ceph osd pool set cache hit_set_count 1
> >>> >> > ceph osd pool set cache hit_set_period 3600
> >>> >> > ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))
> >>> >> > ceph osd pool create metadata 128 128
> >>> >> > ceph osd pool set metadata crush_ruleset 1 # SSD root in crushmap
> >>> >> > ceph fs new ceph_fs metadata cache  <-- wrong ?
> >>> >> >
> >>> >> > I started testing with this, and this worked, I could write to it
> >>> >> > with
> >>> >> > cephfs and the cache was flushing to the ecdata pool as expected.
> >>> >> > But now I notice I made the fs right upon the cache, instead of
> the
> >>> >> > underlying data pool. I suppose I should have done this:
> >>> >> >
> >>> >> > ceph fs new ceph_fs metadata ecdata
> >>> >> >
> >>> >> > So my question is: Was this wrong and not doing the things I
> thought
> >>> >> > it did,
> >>> >> > or was this somehow handled by ceph and didn't it matter I
> specified
> >>> >> > the
> >>> >> > cache instead of the data pool?
> >>> >>
> >>> >> Well, it's sort of doing what you want it to. You've told the
> >>> >> filesystem to use the "cache" pool as the location for all of its
> >>> >> data. But RADOS is pushing everything in the "cache" pool down to
> the
> >>> >> "ecdata" pool.
> >>> >> So it'll work for now as you want. But if in future you wanted to
> stop
> >>> >> using the caching pool, or switch it out for a different pool
> >>> >> entirely, that wouldn't work (whereas it would if the fs was using
> >>> >> "ecdata").
> >
> >
> > After this I tried with the 'ecdata' pool, which does not work because it
> > is itself an EC pool.
> > So I guess specifying the cache pool is indeed the only way, but that's
> > ok then if that works.
> > It is just a bit confusing to specify the cache pool rather than the
> > data pool. :)
>
> *blinks*
> Uh, yeah. I forgot about that check, which was added because somebody
> tried to use CephFS on an EC pool without a cache on top. We've obviously
> got some UI work to do. Thanks for the reminder!
> -Greg
>
>
> --
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
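
For reference, a consolidated sketch of the workflow discussed in this thread,
reusing the pool names from Kenneth's example: a replicated cache tier in front
of an EC base pool, with the filesystem pointed at the cache tier (since, on the
releases discussed here, `ceph fs new` rejects an EC data pool outright, which
is the UI check Greg mentions). This is illustrative only, not a sizing
recommendation:

ceph osd erasure-code-profile set profile11 k=8 m=3 ruleset-failure-domain=osd
ceph osd pool create ecdata 128 128 erasure profile11
ceph osd pool create cache 1024 1024          # replicated cache tier
ceph osd tier add ecdata cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay ecdata cache
ceph osd pool create metadata 128 128         # replicated metadata pool
ceph fs new ceph_fs metadata cache            # point the fs at the cache pool, per the thread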


[ceph-users] OSD unable to start (giant -> hammer)

2015-05-18 Thread Berant Lemmenes
Hello all,

I've encountered a problem when upgrading my single node home cluster from
giant to hammer, and I would greatly appreciate any insight.

I upgraded the packages like normal, then proceeded to restart the mon and,
once that came back, restarted the first OSD (osd.3). However, it
subsequently won't start and crashes with the following failed assertion:

osd/OSD.h: 716: FAILED assert(ret)

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)

 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x7f) [0xb1784f]

 2: (OSD::load_pgs()+0x277b) [0x6850fb]

 3: (OSD::init()+0x1448) [0x6930b8]

 4: (main()+0x26b9) [0x62fd89]

 5: (__libc_start_main()+0xed) [0x7f2345bc976d]

 6: ceph-osd() [0x635679]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.


--- logging levels ---

   0/ 5 none

   0/ 1 lockdep

   0/ 1 context

   1/ 1 crush

   1/ 5 mds

   1/ 5 mds_balancer

   1/ 5 mds_locker

   1/ 5 mds_log

   1/ 5 mds_log_expire

   1/ 5 mds_migrator

   0/ 1 buffer

   0/ 1 timer

   0/ 1 filer

   0/ 1 striper

   0/ 1 objecter

   0/ 5 rados

   0/ 5 rbd

   0/ 5 rbd_replay

   0/ 5 journaler

   0/ 5 objectcacher

   0/ 5 client

   0/ 5 osd

   0/ 5 optracker

   0/ 5 objclass

   1/ 3 filestore

   1/ 3 keyvaluestore

   1/ 3 journal

   0/ 5 ms

   1/ 5 mon

   0/10 monc

   1/ 5 paxos

   0/ 5 tp

   1/ 5 auth

   1/ 5 crypto

   1/ 1 finisher

   1/ 5 heartbeatmap

   1/ 5 perfcounter

   1/ 5 rgw

   1/10 civetweb

   1/ 5 javaclient

   1/ 5 asok

   1/ 1 throttle

   0/ 0 refs

   1/ 5 xio

  -2/-2 (syslog threshold)

  99/99 (stderr threshold)

  max_recent 1

  max_new 1000

  log_file

--- end dump of recent events ---

terminate called after throwing an instance of 'ceph::FailedAssertion'

*** Caught signal (Aborted) **

 in thread 7f2347f71780

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)

 1: ceph-osd() [0xa1fe55]

 2: (()+0xfcb0) [0x7f2346fb1cb0]

 3: (gsignal()+0x35) [0x7f2345bde0d5]

 4: (abort()+0x17b) [0x7f2345be183b]

 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d]

 6: (()+0xb5846) [0x7f234652d846]

 7: (()+0xb5873) [0x7f234652d873]

 8: (()+0xb596e) [0x7f234652d96e]

 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x259) [0xb17a29]

 10: (OSD::load_pgs()+0x277b) [0x6850fb]

 11: (OSD::init()+0x1448) [0x6930b8]

 12: (main()+0x26b9) [0x62fd89]

 13: (__libc_start_main()+0xed) [0x7f2345bc976d]

 14: ceph-osd() [0x635679]

2015-05-18 13:02:33.643064 7f2347f71780 -1 *** Caught signal (Aborted) **

 in thread 7f2347f71780


 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)

 1: ceph-osd() [0xa1fe55]

 2: (()+0xfcb0) [0x7f2346fb1cb0]

 3: (gsignal()+0x35) [0x7f2345bde0d5]

 4: (abort()+0x17b) [0x7f2345be183b]

 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d]

 6: (()+0xb5846) [0x7f234652d846]

 7: (()+0xb5873) [0x7f234652d873]

 8: (()+0xb596e) [0x7f234652d96e]

 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x259) [0xb17a29]

 10: (OSD::load_pgs()+0x277b) [0x6850fb]

 11: (OSD::init()+0x1448) [0x6930b8]

 12: (main()+0x26b9) [0x62fd89]

 13: (__libc_start_main()+0xed) [0x7f2345bc976d]

 14: ceph-osd() [0x635679]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.


--- begin dump of recent events ---

 0> 2015-05-18 13:02:33.643064 7f2347f71780 -1 *** Caught signal
(Aborted) **

 in thread 7f2347f71780


 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)

 1: ceph-osd() [0xa1fe55]

 2: (()+0xfcb0) [0x7f2346fb1cb0]

 3: (gsignal()+0x35) [0x7f2345bde0d5]

 4: (abort()+0x17b) [0x7f2345be183b]

 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d]

 6: (()+0xb5846) [0x7f234652d846]

 7: (()+0xb5873) [0x7f234652d873]

 8: (()+0xb596e) [0x7f234652d96e]

 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x259) [0xb17a29]

 10: (OSD::load_pgs()+0x277b) [0x6850fb]

 11: (OSD::init()+0x1448) [0x6930b8]

 12: (main()+0x26b9) [0x62fd89]

 13: (__libc_start_main()+0xed) [0x7f2345bc976d]

 14: ceph-osd() [0x635679]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.


--- logging levels ---

   0/ 5 none

   0/ 1 lockdep

   0/ 1 context

   1/ 1 crush

   1/ 5 mds

   1/ 5 mds_balancer

   1/ 5 mds_locker

   1/ 5 mds_log

   1/ 5 mds_log_expire

   1/ 5 mds_migrator

   0/ 1 buffer

   0/ 1 timer

   0/ 1 filer

   0/ 1 striper

   0/ 1 objecter

   0/ 5 rados

   0/ 5 rbd

   0/ 5 rbd_replay

   0/ 5 journaler

   0/ 5 objectcacher

   0/ 5 client

   0/ 5 osd

   0/ 5 optracker

   0/ 5 objclass

   1/ 3 filestore

   1/ 3 keyvaluestore

   1/ 3 journal

   0/ 5 ms

   1/ 5 mon

   0/10 monc

   1/ 5 paxos

   0/ 5 tp

   1/ 5 auth

   1/ 5 crypto

   1/ 1 finisher

   1/ 5 heartbeatmap

   1/ 5 perfcounter

   1/ 5 rgw

   1/10 civetweb

   1/ 5 javaclient

   1/ 5 asok

   

Re: [ceph-users] OSD unable to start (giant -> hammer)

2015-05-18 Thread Berant Lemmenes
Sam,

Thanks for taking a look. It does seem to fit my issue. Would just removing
the 5.0_head directory be appropriate or would using ceph-objectstore-tool
be better?

Thanks,
Berant

On Mon, May 18, 2015 at 1:47 PM, Samuel Just  wrote:

> You have most likely hit http://tracker.ceph.com/issues/11429.  There are
> some workarounds in the bugs marked as duplicates of that bug, or you can
> wait for the next hammer point release.
> -Sam
>
> ----- Original Message -----
> From: "Berant Lemmenes" 
> To: ceph-users@lists.ceph.com
> Sent: Monday, May 18, 2015 10:24:38 AM
> Subject: [ceph-users] OSD unable to start (giant -> hammer)
>
> Hello all,
>
> I've encountered a problem when upgrading my single node home cluster from
> giant to hammer, and I would greatly appreciate any insight.
>
> I upgraded the packages like normal, then proceeded to restart the mon and
> once that came back restarted the first OSD (osd.3). However it
> subsequently won't start and crashes with the following failed assertion:
>
>
>
> osd/OSD.h: 716: FAILED assert(ret)
>
> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x7f) [0xb1784f]
>
> 2: (OSD::load_pgs()+0x277b) [0x6850fb]
>
> 3: (OSD::init()+0x1448) [0x6930b8]
>
> 4: (main()+0x26b9) [0x62fd89]
>
> 5: (__libc_start_main()+0xed) [0x7f2345bc976d]
>
> 6: ceph-osd() [0x635679]
>
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
>
>
>
> --- logging levels ---
>
> 0/ 5 none
>
> 0/ 1 lockdep
>
> 0/ 1 context
>
> 1/ 1 crush
>
> 1/ 5 mds
>
> 1/ 5 mds_balancer
>
> 1/ 5 mds_locker
>
> 1/ 5 mds_log
>
> 1/ 5 mds_log_expire
>
> 1/ 5 mds_migrator
>
> 0/ 1 buffer
>
> 0/ 1 timer
>
> 0/ 1 filer
>
> 0/ 1 striper
>
> 0/ 1 objecter
>
> 0/ 5 rados
>
> 0/ 5 rbd
>
> 0/ 5 rbd_replay
>
> 0/ 5 journaler
>
> 0/ 5 objectcacher
>
> 0/ 5 client
>
> 0/ 5 osd
>
> 0/ 5 optracker
>
> 0/ 5 objclass
>
> 1/ 3 filestore
>
> 1/ 3 keyvaluestore
>
> 1/ 3 journal
>
> 0/ 5 ms
>
> 1/ 5 mon
>
> 0/10 monc
>
> 1/ 5 paxos
>
> 0/ 5 tp
>
> 1/ 5 auth
>
> 1/ 5 crypto
>
> 1/ 1 finisher
>
> 1/ 5 heartbeatmap
>
> 1/ 5 perfcounter
>
> 1/ 5 rgw
>
> 1/10 civetweb
>
> 1/ 5 javaclient
>
> 1/ 5 asok
>
> 1/ 1 throttle
>
> 0/ 0 refs
>
> 1/ 5 xio
>
> -2/-2 (syslog threshold)
>
> 99/99 (stderr threshold)
>
> max_recent 1
>
> max_new 1000
>
> log_file
>
> --- end dump of recent events ---
>
> terminate called after throwing an instance of 'ceph::FailedAssertion'
>
> *** Caught signal (Aborted) **
>
> in thread 7f2347f71780
>
> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>
> 1: ceph-osd() [0xa1fe55]
>
> 2: (()+0xfcb0) [0x7f2346fb1cb0]
>
> 3: (gsignal()+0x35) [0x7f2345bde0d5]
>
> 4: (abort()+0x17b) [0x7f2345be183b]
>
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d]
>
> 6: (()+0xb5846) [0x7f234652d846]
>
> 7: (()+0xb5873) [0x7f234652d873]
>
> 8: (()+0xb596e) [0x7f234652d96e]
>
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x259) [0xb17a29]
>
> 10: (OSD::load_pgs()+0x277b) [0x6850fb]
>
> 11: (OSD::init()+0x1448) [0x6930b8]
>
> 12: (main()+0x26b9) [0x62fd89]
>
> 13: (__libc_start_main()+0xed) [0x7f2345bc976d]
>
> 14: ceph-osd() [0x635679]
>
> 2015-05-18 13:02:33.643064 7f2347f71780 -1 *** Caught signal (Aborted) **
>
> in thread 7f2347f71780
>
>
>
>
> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>
> 1: ceph-osd() [0xa1fe55]
>
> 2: (()+0xfcb0) [0x7f2346fb1cb0]
>
> 3: (gsignal()+0x35) [0x7f2345bde0d5]
>
> 4: (abort()+0x17b) [0x7f2345be183b]
>
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d]
>
> 6: (()+0xb5846) [0x7f234652d846]
>
> 7: (()+0xb5873) [0x7f234652d873]
>
> 8: (()+0xb596e) [0x7f234652d96e]
>
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x259) [0xb17a29]
>
> 10: (OSD::load_pgs()+0x277b) [0x6850fb]
>
> 11: (OSD::init()+0x1448) [0x6930b8]
>
> 12: (main()+0x26b9) [0x62fd89]
>
> 13: (__libc_start_main()+0xed) [0x7f2345bc976d]
>
> 14: ceph-osd() [0x635679]
>
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
>
>
>
> --- begin dump of recent events ---
>
> 0> 2015-05-18 13:02:33.643064 7f2347f71780 -1 *** Caught signal 
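
The workaround that ends up being used later in this thread amounts to
inspecting and removing the stale PG directories with ceph-objectstore-tool
while the OSD is stopped. A rough sketch, assuming the default data and journal
paths for osd.3 and using PG 5.0 as the example; verify the flags against your
installed version and only remove PGs you have confirmed are stale:

stop ceph-osd id=3
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
    --journal-path /var/lib/ceph/osd/ceph-3/journal --op list-pgs
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
    --journal-path /var/lib/ceph/osd/ceph-3/journal --op info --pgid 5.0
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
    --journal-path /var/lib/ceph/osd/ceph-3/journal --op remove --pgid 5.0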

Re: [ceph-users] OSD unable to start (giant -> hammer)

2015-05-19 Thread Berant Lemmenes
"last_deep_scrub_stamp": "2015-05-10 10:30:24.933431",

  "last_clean_scrub_stamp": "2015-05-12 22:50:16.011867",

  "log_size": 3001,

  "ondisk_log_size": 3001,

  "stats_invalid": "0",

  "stat_sum": { "num_bytes": 441982976,

  "num_objects": 106,

  "num_object_clones": 0,

  "num_object_copies": 315,

  "num_objects_missing_on_primary": 0,

  "num_objects_degraded": 0,

  "num_objects_misplaced": 0,

  "num_objects_unfound": 0,

      "num_objects_dirty": 11,

  "num_whiteouts": 0,

  "num_read": 61157,

  "num_read_kb": 1281187,

  "num_write": 135192,

  "num_write_kb": 2422029,

  "num_scrub_errors": 0,

  "num_shallow_scrub_errors": 0,

  "num_deep_scrub_errors": 0,

  "num_objects_recovered": 79,

  "num_bytes_recovered": 329883648,

  "num_keys_recovered": 0,

  "num_objects_omap": 0,

  "num_objects_hit_set_archive": 0,

  "num_bytes_hit_set_archive": 0},

  "stat_cat_sum": {},

  "up": [

8,

7],

  "acting": [

8,

7],

  "blocked_by": [],

  "up_primary": 8,

  "acting_primary": 8},

  "empty": 0,

  "dne": 0,

  "incomplete": 0,

  "last_epoch_started": 39536,

  "hit_set_history": { "current_last_update": "0'0",

  "current_last_stamp": "0.00",

  "current_info": { "begin": "0.00",

  "end": "0.00",

  "version": "0'0"},

  "history": []}}],

  "recovery_state": [

{ "name": "Started\/Primary\/Active",

  "enter_time": "2015-05-18 10:18:37.449561",

  "might_have_unfound": [],

  "recovery_progress": { "backfill_targets": [],

  "waiting_on_backfill": [],

  "last_backfill_started": "0\/\/0\/\/-1",

  "backfill_info": { "begin": "0\/\/0\/\/-1",

  "end": "0\/\/0\/\/-1",

  "objects": []},

  "peer_backfill_info": [],

  "backfills_in_flight": [],

  "recovering": [],

  "pg_backend": { "pull_from_peer": [],

  "pushing": []}},

  "scrub": { "scrubber.epoch_start": "39527",

  "scrubber.active": 0,

  "scrubber.block_writes": 0,

  "scrubber.waiting_on": 0,

  "scrubber.waiting_on_whom": []}},

{ "name": "Started",

  "enter_time": "2015-05-18 10:18:05.335040"}],

  "agent_state": {}}

On Mon, May 18, 2015 at 2:34 PM, Berant Lemmenes 
wrote:

> Sam,
>
> Thanks for taking a look. It does seem to fit my issue. Would just
> removing the 5.0_head directory be appropriate or would using
> ceph-objectstore-tool be better?
>
> Thanks,
> Berant
>
> On Mon, May 18, 2015 at 1:47 PM, Samuel Just  wrote:
>
>> You have most likely hit http://tracker.ceph.com/issues/11429.  There
>> are some workarounds in the bugs marked as duplicates of that bug, or you
>> can wait for the next hammer point release.
>> -Sam
>>
>> ----- Original Message -----
>> From: "Berant Lemmenes" 
>> To: ceph-users@lists.ceph.com
>> Sent: Monday, May 18, 2015 10:24:38 AM
>> Subject: [ceph-users] OSD unable to start (giant -> hammer)
>>
>> Hello all,
>>
>> I've encountered a problem when upgrading my single node home cluster
>> from giant to hammer, and I would greatly appreciate any insight.
>>
>> I upgraded the packages like normal, then proceeded to restart the mon
>> and once that came b

Re: [ceph-users] OSD unable to start (giant -> hammer)

2015-05-19 Thread Berant Lemmenes
Sam,

It is for a valid pool; however, the up and acting sets for 2.14 both show
OSDs 8 & 7. I'll take a look at 7 & 8 and see if they are good.

If so, it seems like its presence on osd.3 could be an artifact from
previous topologies and I could mv it off osd.3.

Thanks very much for the assistance!

Berant

On Tuesday, May 19, 2015, Samuel Just  wrote:

> If 2.14 is part of a non-existent pool, you should be able to rename it
> out of current/ in the osd directory to prevent the osd from seeing it on
> startup.
> -Sam
>
> ----- Original Message -----
> From: "Berant Lemmenes" >
> To: "Samuel Just" >
> Cc: ceph-users@lists.ceph.com 
> Sent: Tuesday, May 19, 2015 12:58:30 PM
> Subject: Re: [ceph-users] OSD unable to start (giant -> hammer)
>
> Hello,
>
> So here are the steps I performed and where I sit now.
>
> Step 1) Using 'ceph-objectstore-tool list' to create a list of all PGs not
> associated with the 3 pools (rbd, data, metadata) that are actually in use
> on this cluster.
>
> Step 2) I then did a 'ceph-objectstore-tool remove' of those PGs
>
> Then when starting the OSD it would complain about PGs that were NOT in the
> list of 'ceph-objectstore-tool list' but WERE present on the filesystem of
> the OSD in question.
>
> Step 3) Iterating over all of the PGs that were on disk and using
> 'ceph-objectstore-tool info' I made a list of all PGs that returned ENOENT,
>
> Step 4) 'ceph-objectstore-tool remove' to remove all those as well.
>
> Now when starting osd.3 I get an 'unable to load metadata' error for a PG
> that according to 'ceph pg 2.14 query' is not present (and shouldn't be) on
> osd.3. Shown below with OSD debugging at 20:
>
> 
>
>-23> 2015-05-19 15:15:12.712036 7fb079a20780 20 read_log 39533'174051
> (39533'174050) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2 by
> client.18119.0:2811937 2015-05-18 07:18:42.859501
>
>-22> 2015-05-19 15:15:12.712066 7fb079a20780 20 read_log 39533'174052
> (39533'174051) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2 by
> client.18119.0:2812374 2015-05-18 07:33:21.973157
>
>-21> 2015-05-19 15:15:12.712096 7fb079a20780 20 read_log 39533'174053
> (39533'174052) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2 by
> client.18119.0:2812861 2015-05-18 07:48:23.098343
>
>-20> 2015-05-19 15:15:12.712127 7fb079a20780 20 read_log 39533'174054
> (39533'174053) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2 by
> client.18119.0:2813371 2015-05-18 08:03:54.226512
>
>-19> 2015-05-19 15:15:12.712157 7fb079a20780 20 read_log 39533'174055
> (39533'174054) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2 by
> client.18119.0:2813922 2015-05-18 08:18:20.351421
>
>-18> 2015-05-19 15:15:12.712187 7fb079a20780 20 read_log 39533'174056
> (39533'174055) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2 by
> client.18119.0:2814396 2015-05-18 08:33:56.476035
>
>-17> 2015-05-19 15:15:12.712221 7fb079a20780 20 read_log 39533'174057
> (39533'174056) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2 by
> client.18119.0:2814971 2015-05-18 08:48:22.605674
>
>-16> 2015-05-19 15:15:12.712252 7fb079a20780 20 read_log 39533'174058
> (39533'174057) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2 by
> client.18119.0:2815407 2015-05-18 09:02:48.720181
>
>-15> 2015-05-19 15:15:12.712282 7fb079a20780 20 read_log 39533'174059
> (39533'174058) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2 by
> client.18119.0:2815434 2015-05-18 09:03:43.727839
>
>-14> 2015-05-19 15:15:12.712312 7fb079a20780 20 read_log 39533'174060
> (39533'174059) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2 by
> client.18119.0:2815889 2015-05-18 09:17:49.846406
>
>-13> 2015-05-19 15:15:12.712342 7fb079a20780 20 read_log 39533'174061
> (39533'174060) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2 by
> client.18119.0:2816358 2015-05-18 09:32:50.969457
>
>-12> 2015-05-19 15:15:12.712372 7fb079a20780 20 read_log 39533'174062
> (39533'174061) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2 by
> client.18119.0:2816840 2015-05-18 09:47:52.091524
>
>-11> 2015-05-19 15:15:12.712403 7fb079a20780 20 read_log 39533'174063
> (39533'174062) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2 by
> client.18119.0:2816861 2015-05-18 09:48:22.096309
>
>-10> 2015-05-19 15:15:12.712433 7fb079a20780 20 read_log 39533'
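
The check described above can be done with a couple of standard commands; a
minimal sketch, assuming the stray copy sits under the usual current/ directory
of osd.3 and using an arbitrary destination for the rename:

ceph pg map 2.14              # shows the up and acting OSD sets for the PG
ceph pg 2.14 query            # full peering detail from the current primary
# with osd.3 stopped, rename the stray copy out of current/ as Sam suggests:
mv /var/lib/ceph/osd/ceph-3/current/2.14_head /var/lib/ceph/osd/ceph-3/2.14_head.stray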

Re: [ceph-users] OSD unable to start (giant -> hammer)

2015-05-20 Thread Berant Lemmenes
OK, just to update everyone: after moving out all the PG directories on the
OSD that were no longer valid PGs, I was able to start it and the cluster is
back to healthy.

I'm going to trigger a deep scrub of osd.3 to be safe prior to deleting any
of those PGs though.

If I understand the gist of how 11429 is going to be addressed in 0.94.2, it
is going to disregard such "dead" PGs and complain in the logs. As far as
cleaning those up goes, would a procedure similar to mine be appropriate (either
before or after 0.94.2)?

Thank you Sam for your help, I greatly appreciate it!
-Berant

On Tue, May 19, 2015 at 7:13 PM, Berant Lemmenes 
wrote:

> Sam,
>
> It is for a valid pool; however, the up and acting sets for 2.14 both show
> OSDs 8 & 7. I'll take a look at 7 & 8 and see if they are good.
>
> If so, it seems like its presence on osd.3 could be an artifact from
> previous topologies and I could mv it off osd.3.
>
> Thanks very much for the assistance!
>
> Berant
>
>
> On Tuesday, May 19, 2015, Samuel Just  wrote:
>
>> If 2.14 is part of a non-existent pool, you should be able to rename it
>> out of current/ in the osd directory to prevent the osd from seeing it on
>> startup.
>> -Sam
>>
>> ----- Original Message -----
>> From: "Berant Lemmenes" 
>> To: "Samuel Just" 
>> Cc: ceph-users@lists.ceph.com
>> Sent: Tuesday, May 19, 2015 12:58:30 PM
>> Subject: Re: [ceph-users] OSD unable to start (giant -> hammer)
>>
>> Hello,
>>
>> So here are the steps I performed and where I sit now.
>>
>> Step 1) Using 'ceph-objectstore-tool list' to create a list of all PGs not
>> associated with the 3 pools (rbd, data, metadata) that are actually in use
>> on this cluster.
>>
>> Step 2) I then did a 'ceph-objectstore-tool remove' of those PGs
>>
>> Then when starting the OSD it would complain about PGs that were NOT in
>> the
>> list of 'ceph-objectstore-tool list' but WERE present on the filesystem of
>> the OSD in question.
>>
>> Step 3) Iterating over all of the PGs that were on disk and using
>> 'ceph-objectstore-tool info' I made a list of all PGs that returned
>> ENOENT,
>>
>> Step 4) 'ceph-objectstore-tool remove' to remove all those as well.
>>
>> Now when starting osd.3 I get an 'unable to load metadata' error for a PG
>> that according to 'ceph pg 2.14 query' is not present (and shouldn't be)
>> on
>> osd.3. Shown below with OSD debugging at 20:
>>
>> 
>>
>>-23> 2015-05-19 15:15:12.712036 7fb079a20780 20 read_log 39533'174051
>> (39533'174050) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2
>> by
>> client.18119.0:2811937 2015-05-18 07:18:42.859501
>>
>>-22> 2015-05-19 15:15:12.712066 7fb079a20780 20 read_log 39533'174052
>> (39533'174051) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2
>> by
>> client.18119.0:2812374 2015-05-18 07:33:21.973157
>>
>>-21> 2015-05-19 15:15:12.712096 7fb079a20780 20 read_log 39533'174053
>> (39533'174052) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2
>> by
>> client.18119.0:2812861 2015-05-18 07:48:23.098343
>>
>>-20> 2015-05-19 15:15:12.712127 7fb079a20780 20 read_log 39533'174054
>> (39533'174053) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2
>> by
>> client.18119.0:2813371 2015-05-18 08:03:54.226512
>>
>>-19> 2015-05-19 15:15:12.712157 7fb079a20780 20 read_log 39533'174055
>> (39533'174054) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2
>> by
>> client.18119.0:2813922 2015-05-18 08:18:20.351421
>>
>>-18> 2015-05-19 15:15:12.712187 7fb079a20780 20 read_log 39533'174056
>> (39533'174055) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2
>> by
>> client.18119.0:2814396 2015-05-18 08:33:56.476035
>>
>>-17> 2015-05-19 15:15:12.712221 7fb079a20780 20 read_log 39533'174057
>> (39533'174056) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2
>> by
>> client.18119.0:2814971 2015-05-18 08:48:22.605674
>>
>>-16> 2015-05-19 15:15:12.712252 7fb079a20780 20 read_log 39533'174058
>> (39533'174057) modify   49277412/rb.0.100f.2ae8944a.00029945/head//2
>> by
>> client.18119.0:2815407 2015-05-18 09:02:48.720181
>>
>>-15> 2015-05-19 15:15:12.712282 7fb079a20780 20 read_log 39533'174059
>> (39533'174058) modify   49277412/r
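
For the verification step mentioned above, a short sketch of kicking off the
scrub and watching it finish (osd.3 and the grep pattern are just examples):

ceph osd deep-scrub 3         # ask osd.3 to deep-scrub all of the PGs it holds
ceph -w | grep -i scrub       # follow scrub start/finish messages in the cluster log
ceph health detail            # should remain HEALTH_OK with no inconsistent PGs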

Re: [ceph-users] No monitor sockets after upgrading to Emperor

2013-11-11 Thread Berant Lemmenes
I noticed the same behavior on my dumpling cluster. They wouldn't show up
after boot, but after a service restart they were there.

I haven't tested a node reboot since I upgraded to emperor today. I'll give
it a shot tomorrow.

Thanks,
Berant
On Nov 11, 2013 9:29 PM, "Peter Matulis" 
wrote:

> After upgrading from Dumpling to Emperor on Ubuntu 12.04 I noticed the
> admin sockets for each of my monitors were missing although the cluster
> seemed to continue running fine.  There wasn't anything under
> /var/run/ceph.  After restarting the service on each monitor node they
> reappeared.  Anyone?
>
> ~pmatulis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
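
A quick way to confirm whether the admin sockets actually came back after a
restart or reboot; the socket names below assume the default /var/run/ceph
layout, a monitor called mon.a and an OSD numbered 19 (adjust to taste):

ls -l /var/run/ceph/
ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok mon_status
ceph --admin-daemon /var/run/ceph/ceph-osd.19.asok perf dump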


Re: [ceph-users] No monitor sockets after upgrading to Emperor

2013-11-12 Thread Berant Lemmenes
50 7f3793c21780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP
ioctl is supported and appears to work
2013-11-12 09:56:37.561360 7f3793c21780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2013-11-12 09:56:37.562357 7f3793c21780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2013-11-12 09:56:37.571030 7f3793c21780  0
filestore(/var/lib/ceph/osd/ceph-19) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
2013-11-12 09:56:37.574273 7f3793c21780  1 journal _open
/var/lib/ceph/osd/ceph-19/journal fd 23: 10239344640 bytes, block size 4096
bytes, directio = 1, aio = 1
2013-11-12 09:56:37.578189 7f3793c21780  1 journal _open
/var/lib/ceph/osd/ceph-19/journal fd 23: 10239344640 bytes, block size 4096
bytes, directio = 1, aio = 1
2013-11-12 09:56:37.578854 7f3793c21780  1 journal close
/var/lib/ceph/osd/ceph-19/journal
2013-11-12 09:56:37.579638 7f3793c21780  1
filestore(/var/lib/ceph/osd/ceph-19) mount detected xfs
2013-11-12 09:56:37.581110 7f3793c21780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP
ioctl is supported and appears to work
2013-11-12 09:56:37.581118 7f3793c21780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2013-11-12 09:56:37.582014 7f3793c21780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2013-11-12 09:56:37.583365 7f3793c21780  0
filestore(/var/lib/ceph/osd/ceph-19) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
2013-11-12 09:56:37.585765 7f3793c21780  1 journal _open
/var/lib/ceph/osd/ceph-19/journal fd 24: 10239344640 bytes, block size 4096
bytes, directio = 1, aio = 1
2013-11-12 09:56:37.588281 7f3793c21780  1 journal _open
/var/lib/ceph/osd/ceph-19/journal fd 24: 10239344640 bytes, block size 4096
bytes, directio = 1, aio = 1
2013-11-12 09:56:37.589782 7f3793c21780  0 
cls/hello/cls_hello.cc:271: loading cls_hello
2013-11-12 09:56:39.723134 7f377488b700  0 -- 10.200.1.54:6807/13723 >>
10.200.1.56:6806/563 pipe(0xc87ca00 sd=155 :38290 s=1 pgs=17864 cs=2 l=0
c=0xc893160).fault
2013-11-12 09:56:39.728798 7f3775194700  0 -- 10.200.1.54:6807/13723 >>
10.200.1.52:6808/14464 pipe(0xc811000 sd=52 :51030 s=1 pgs=7473 cs=6 l=0
c=0xc7fbb00).fault
2013-11-12 09:56:39.807114 7f37787ca700  0 -- 10.200.1.54:6807/13723 >>
10.200.1.52:6805/14449 pipe(0xc756280 sd=72 :46552 s=1 pgs=10912 cs=96 l=0
c=0xc740420).fault
2013-11-12 09:56:39.852465 7f3778ccf700  0 -- 10.200.1.54:6807/13723 >>
10.200.1.57:6804/8226 pipe(0x2427780 sd=83 :48234 s=1 pgs=17251 cs=128 l=0
c=0x2406dc0).fault
2013-11-12 09:56:39.898327 7f377488b700  0 -- 10.200.1.54:6807/13723 >>
10.200.1.56:6806/563 pipe(0xc87ca00 sd=42 :40942 s=1 pgs=17945 cs=164 l=0
c=0xc893160).fault
2013-11-12 09:56:40.738437 7f3775ea1700  0 -- 10.200.1.54:6807/13723 >>
10.200.1.60:6810/32089 pipe(0xc7c2500 sd=72 :40289 s=2 pgs=33225 cs=109 l=0
c=0xc7fb840).fault with nothing to send, going to standby
2013-11-12 09:56:40.740185 7f376b2fd700  0 -- 10.200.1.54:6807/13723 >>
10.200.1.60:6810/32089 pipe(0xcd66a00 sd=279 :6807 s=0 pgs=0 cs=0 l=0
c=0xc79d000).accept connect_seq 0 vs existing 109 state standby
2013-11-12 09:56:40.740201 7f376b2fd700  0 -- 10.200.1.54:6807/13723 >>
10.200.1.60:6810/32089 pipe(0xcd66a00 sd=279 :6807 s=0 pgs=0 cs=0 l=0
c=0xc79d000).accept peer reset, then tried to connect to us, replacing
2013-11-12 09:56:41.639911 7f376fd47700  0 -- 192.168.200.54:6806/13723 >>
192.168.48.127:0/234188561 pipe(0xcf87a00 sd=127 :6806 s=0 pgs=0 cs=0 l=0
c=0xcb80580).accept peer addr is really 192.168.48.127:0/234188561 (socket
is 192.168.48.127:60893/0)
2013-11-12 09:56:44.394952 7f37657a3700  0 -- 10.200.1.54:6807/13723 >>
10.200.1.54:6810/13792 pipe(0xcee7c80 sd=160 :6807 s=0 pgs=0 cs=0 l=0
c=0xd0d7160).accept connect_seq 0 vs existing 0 state connecting
2013-11-12 09:56:59.334100 7f3764396700  0 -- 192.168.200.54:6806/13723 >>
192.168.48.102:0/663636012 pipe(0xdbb9280 sd=197 :6806 s=0 pgs=0 cs=0 l=0
c=0xdbbc000).accept peer addr is really 192.168.48.102:0/663636012 (socket
is 192.168.48.102:35496/0)
2013-11-12 09:57:45.805456 7f3764194700  0 -- 192.168.200.54:6806/13723 >>
192.168.48.103:0/1090276439 pipe(0xdbb9000 sd=180 :6806 s=0 pgs=0 cs=0 l=0
c=0xce83dc0).accept peer addr is really 192.168.48.103:0/1090276439 (socket
is 192.168.48.103:41220/0)

After the 'restart ceph-osd-all' the admin sockets for all 4 OSDs on this
host are present.

Let me know if there is additional logging or assistance I can provide to
narrow it down.

Thanks,
Berant



On Tue, Nov 12, 2013 at 4:03 AM, Joao Luis  wrote:

>
> On Nov 12, 2013 2:38 AM, "Berant Lemm

Re: [ceph-users] No monitor sockets after upgrading to Emperor

2013-11-12 Thread Berant Lemmenes
On Tue, Nov 12, 2013 at 7:28 PM, Joao Eduardo Luis wrote:

>
> This looks an awful lot like you started another instance of an OSD with
> the same ID while another was running.  I'll walk you through the log lines
> that point me towards this conclusion.  Would still be weird if the admin
> sockets vanished because of that, so maybe that's a different issue.  Are
> you able to reproduce the admin socket issue often?
>
> Walking through:
>

Thanks for taking the time to walk through these logs, I appreciate the
explanation.

2013-11-12 09:47:09.670813 7f8151b5f780  0 ceph version 0.72
>> (5832e2603c7db5d40b433d0953408993a9b7c217), process ceph-osd, pid 2769
>> 2013-11-12 09:47:09.673789 7f8151b5f780  0
>> filestore(/var/lib/ceph/osd/ceph-19) lock_fsid failed to lock
>> /var/lib/ceph/osd/ceph-19/fsid, is another ceph-osd still running? (11)
>> Resource temporarily unavailable
>>
>
> This last line tells us that ceph-osd believes another instance is
> running, so you should first find out whether there's actually another
> instance being run somewhere, somehow.  How did you start these daemons?
>

That proved to be the crux of it: both upstart and the Sys V init scripts
were trying to start the ceph daemons. Looking in /etc/rc2.d, there are
symlinks from S20ceph to ../init.d/ceph.

Upstart thought it was controlling things: doing an 'initctl list | grep
ceph' would show the correct PIDs, while 'service ceph status' thought they
were not running.

So that would seem to indicate that Sys V was trying to start it first, and
upstart was the one that had started the instance that generated those logs.

The part that doesn't make sense is that if the Sys V init script was
starting before upstart, why wouldn't it be the one that was writing to
/var/log/ceph/?

After running 'update-rc.d ceph disable', the admin sockets were present
after a system reboot.

I wonder: was the Sys V init script being enabled a ceph-deploy artifact
or an issue with the packages?

Thanks for pointing me in the right direction!

Thanks,
Berant
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
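
For anyone else hitting the same double-init situation on Ubuntu 12.04, the
checks and the fix described above look roughly like this (paths assume a stock
package install):

initctl list | grep ceph          # what upstart thinks it is managing (with PIDs)
service ceph status               # what the Sys V init script thinks is running
ls -l /etc/rc2.d/ | grep ceph     # the S20ceph symlink that also starts the daemons at boot
update-rc.d ceph disable          # leave the daemons to upstart, then reboot to confirm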


Re: [ceph-users] No monitor sockets after upgrading to Emperor

2013-11-14 Thread Berant Lemmenes
On Thu, Nov 14, 2013 at 12:11 PM, Alfredo Deza wrote:
>
>
> From your logs, it looks like you have a lot of errors from attempting
> to get monitors running and I can't see how/where ceph-deploy can
> be causing them. Are those errors known issues for you? Where they
> fixed or those are part of the overall issue?
>

Those were just false starts learning ceph-deploy as the previous cluster I
had deployed was done using mkcephfs on bobtail. I had zero issues
associated with using ceph-deploy and getting the cluster going.

The socket issue only became apparent when I started scraping metrics out
of them to push to an OpenTSDB install, and I noticed that on reboot I
stopped getting metrics even though the poller and the ceph daemons were
running.

I just commented on Peter's thread (didn't mean to hijack it) as I was
experiencing what seemed to be a similar issue.

>
> This is just one of the items that caught my attention:
>
> 2013-10-29 15:56:03,982 [ceph11][ERROR ] 2013-10-29 15:56:06.303009
> 7feaafe8e780 -1 unable to find any IP address in networks:
> 192.168.200.0/24
>
>
> >
> > Thanks,
> > Berant
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] No monitor sockets after upgrading to Emperor

2013-11-14 Thread Berant Lemmenes
Argh, premature email sending!

On Thu, Nov 14, 2013 at 12:28 PM, Berant Lemmenes wrote:

>
> On Thu, Nov 14, 2013 at 12:11 PM, Alfredo Deza 
> wrote:
>
>> This is just one of the items that caught my attention:
>>
>> 2013-10-29 15:56:03,982 [ceph11][ERROR ] 2013-10-29 15:56:06.303009
>> 7feaafe8e780 -1 unable to find any IP address in networks:
>> 192.168.200.0/24
>
>
That was from me trying (and failing) to set up a monitor in a subnet
outside of the defined 'public network' (to provide a 3rd mon outside of
the failure domain of the two switches serving the cluster), without
setting the 'public addr' in ceph.conf under the [mon.ceph11] heading.

Thanks for your work on ceph-deploy!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
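
For reference, the kind of ceph.conf stanza being described here for a monitor
that sits outside the defined 'public network'; the host name matches the
thread, but the address is purely an example:

[mon.ceph11]
host = ceph11
public addr = 172.16.10.21:6789    # example: the monitor's real IP in the outside subnet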


[ceph-users] Cluster unable to finish balancing

2013-05-06 Thread Berant Lemmenes
TL;DR

bobtail Ceph cluster unable to finish rebalance after drive failure, usage
increasing even with no clients connected.


I've been running a test bobtail cluster for a couple of months and it's
been working great. Last week I had a drive die and rebalance; during that
time another OSD crashed. All was still well. However, as the second osd had
just crashed, I restarted it, made sure that it re-entered properly and
that rebalancing continued, and then went to bed.

Waking up in the morning I found 2 OSDs were 100% full and two more were
almost full. To get out of the situation I decreased the replication size
from 3 to 2, and then also carefully (I believe carefully enough) removed
some PGs in order to start things up again.

I got things going again and things appeared to be rebalancing correctly;
however, it got to the point where it stopped at 1420 PGs active+clean and
the rest were stuck backfilling.

Looking at the PG dump, all of the PGs that were having issues were on
osd.1. So I stopped it, verified things were continuing to rebalance after
it was down/out, and then formatted osd.1's disk and put it back in.

Since then I've not been able to get the cluster back to HEALTHY, due to a
combination of OSDs dying while recovering (not due to disk failure, just
crashes) as well as the used space in the cluster increasing abnormally.

Right now I have all the clients disconnected and just the cluster
rebalancing, and the usage is increasing to the point where I have 12 TB used
when I have only < 3 TB in cephfs and 2 TB in a single RBD image
(replication size 2). I've since shut down the cluster so I don't fill it up.

My crushmap is the default; here are the usual suspects. I'm happy to
provide additional information.

pg dump: http://pastebin.com/LUyu6Z09

ceph osd tree:
osd.8 is the failed drive (I will be replacing tonight), weight on osd.1
and osd.6 was done via reweight-by-utilization

# id weight type name up/down reweight
-1 19.5 root default
-3 19.5 rack unknownrack
-2 19.5 host ceph-test
0 1.5 osd.0 up 1
1 1.5 osd.1 up 0.6027
2 1.5 osd.2 up 1
3 1.5 osd.3 up 1
4 1.5 osd.4 up 1
5 2 osd.5 up 1
6 2 osd.6 up 0.6676
7 2 osd.7 up 1
8 2 osd.8 down 0
9 2 osd.9 up 1
10 2 osd.10 up 1


ceph -s:

   health HEALTH_WARN 24 pgs backfill; 85 pgs backfill_toofull; 29 pgs
backfilling; 40 pgs degraded; 1 pgs recovery_wait; 121 pgs stuck unclean;
recovery 109306/2091318 degraded (5.227%);  recovering 3 o/s, 43344KB/s; 2
near full osd(s); noout flag(s) set
   monmap e2: 1 mons at {a=10.200.200.21:6789/0}, election epoch 1, quorum
0 a
   osdmap e16251: 11 osds: 10 up, 10 in
pgmap v3145187: 1536 pgs: 1414 active+clean, 6
active+remapped+wait_backfill, 10
active+remapped+wait_backfill+backfill_toofull, 4
active+degraded+wait_backfill+backfill_toofull, 22
active+remapped+backfilling, 42 active+remapped+backfill_toofull, 7
active+degraded+backfilling, 17 active+degraded+backfill_toofull, 1
active+recovery_wait+remapped, 4
active+degraded+remapped+wait_backfill+backfill_toofull, 8
active+degraded+remapped+backfill_toofull, 1 active+clean+scrubbing+deep;
31607 GB data, 12251 GB used, 4042 GB / 16293 GB avail; 109306/2091318
degraded (5.227%);  recovering 3 o/s, 43344KB/s
   mdsmap e3363: 1/1/1 up {0=a=up:active}

rep size:
pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 384
pgp_num 384 last_change 897 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num
384 pgp_num 384 last_change 13364 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 384
pgp_num 384 last_change 13208 owner 0
pool 4 'media_video' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num
384 pgp_num 384 last_change 890 owner 0

ceph.conf:
[global]
auth cluster required = cephx
auth service required = cephx
auth client required = cephx

osd pool default size = 3
osd pool default min size = 1
 osd pool default pg num = 366
osd pool default pgp num = 366

[osd]
osd journal size = 1000
journal_aio = true
#osd recovery max active = 10

osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = inode64,noatime

[mon.a]

host = ceph01
mon addr = 10.200.200.21:6789

[osd.0]
# 1.5 TB SATA
host = ceph01
devs = /dev/sdc
weight = 1.5

[osd.1]
# 1.5 TB SATA
host = ceph01
devs = /dev/sdd
weight = 1.5

[osd.2]
# 1.5 TB SATA
host = ceph01
devs = /dev/sdg
weight = 1.5

[osd.3]
# 1.5 TB SATA
host = ceph01
devs = /dev/sdj
weight = 1.5

[osd.4]
# 1.5 TB SATA
host = ceph01
devs = /dev/sdk
weight = 1.5

[osd.5]
# 2 TB SAS
host = ceph01
devs = /dev/sdf
weight = 2

[osd.6]
# 2 TB SAS
host = ceph01
devs = /dev/sdh
weight = 2

[osd.7]
# 2 TB SAS
host = ceph01
devs = /dev/sda
weight = 2

[osd.8]
# 2 TB SAS
host = ceph01
devs = /dev/sdb
weight = 2

[osd.9]
# 2 TB SAS
host = ceph01
devs = /dev/sdi
weight = 2

[osd.10]
# 2 TB SAS
host = ceph01
devs = /dev/sde
weight = 2

[mds.a]
host = ceph01
___
ceph-users mailing list
ceph-users@
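
When following a recovery like this one, these are the commands that usually
show where things are stuck; a sketch only, so double-check the subcommands
against your release:

ceph -s                          # overall state, degraded %, recovery rate
ceph health detail               # which PGs are backfill_toofull / stuck unclean
ceph pg dump_stuck unclean       # list the stuck PGs with their acting OSDs
ceph osd tree                    # confirm which OSDs are up/in and their weights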

Re: [ceph-users] Cluster unable to finish balancing

2013-05-07 Thread Berant Lemmenes
So just a little update... after replacing the original failed drive, things
seem to be progressing a little better; however, I noticed something else
odd. Looking at 'rados df', it looks like the system thinks that the data
pool has 32 TB of data, yet this is only an 18 TB raw system.

pool name   category KB  objects   clones
degraded  unfound   rdrd KB   wrwr KB
data-32811540110   8949270
  240445   010  2720415   4223435021
media_video -  110
   0   021  2611361   1177389479
metadata- 210246184820
4592   1 6970   561296  1253955 19500149
rbd -  330731965820180
   19584   026295  1612689 54606042   2127030019
  total used 10915771968   995428
  total avail 6657285104
  total space   17573057072


Any recommendations on how I can sort out why it thinks it has way more
data in that pool than it actually does?

Thanks in advance.
Berant


On Mon, May 6, 2013 at 4:43 PM, Berant Lemmenes  wrote:

> TL;DR
>
> bobtail Ceph cluster unable to finish rebalance after drive failure, usage
> increasing even with no clients connected.
>
>
> I've been running a test bobtail cluster for a couple of months and it's
> been working great. Last week I had a drive die and rebalance; during that
> time another OSD crashed. All was still well. However, as the second osd had
> just crashed, I restarted it, made sure that it re-entered properly and
> that rebalancing continued, and then went to bed.
>
> Waking up in the morning I found 2 OSDs were 100% full and two more were
> almost full. To get out of the situation I decreased the replication size
> from 3 to 2, and then also carefully (I believe carefully enough) removed
> some PGs in order to start things up again.
>
> I got things going again and things appeared to be rebalancing correctly;
> however, it got to the point where it stopped at 1420 PGs active+clean and
> the rest were stuck backfilling.
>
> Looking at the PG dump, all of the PGs that were having issues were on
> osd.1. So I stopped it, verified things were continuing to rebalance after
> it was down/out, and then formatted osd.1's disk and put it back in.
>
> Since then I've not been able to get the cluster back to HEALTHY, due to a
> combination of OSDs dying while recovering (not due to disk failure, just
> crashes) as well as the used space in the cluster increasing abnormally.
>
> Right now I have all the clients disconnected and just the cluster
> rebalancing, and the usage is increasing to the point where I have 12 TB used
> when I have only < 3 TB in cephfs and 2 TB in a single RBD image
> (replication size 2). I've since shut down the cluster so I don't fill it up.
>
> My crushmap is the default; here are the usual suspects. I'm happy to
> provide additional information.
>
> pg dump: http://pastebin.com/LUyu6Z09
>
> ceph osd tree:
> osd.8 is the failed drive (I will be replacing tonight), weight on osd.1
> and osd.6 was done via reweight-by-utilization
>
> # id weight type name up/down reweight
> -1 19.5 root default
> -3 19.5 rack unknownrack
> -2 19.5 host ceph-test
> 0 1.5 osd.0 up 1
> 1 1.5 osd.1 up 0.6027
> 2 1.5 osd.2 up 1
> 3 1.5 osd.3 up 1
> 4 1.5 osd.4 up 1
> 5 2 osd.5 up 1
> 6 2 osd.6 up 0.6676
> 7 2 osd.7 up 1
> 8 2 osd.8 down 0
> 9 2 osd.9 up 1
> 10 2 osd.10 up 1
>
>
> ceph -s:
>
>health HEALTH_WARN 24 pgs backfill; 85 pgs backfill_toofull; 29 pgs
> backfilling; 40 pgs degraded; 1 pgs recovery_wait; 121 pgs stuck unclean;
> recovery 109306/2091318 degraded (5.227%);  recovering 3 o/s, 43344KB/s; 2
> near full osd(s); noout flag(s) set
>monmap e2: 1 mons at {a=10.200.200.21:6789/0}, election epoch 1,
> quorum 0 a
>osdmap e16251: 11 osds: 10 up, 10 in
> pgmap v3145187: 1536 pgs: 1414 active+clean, 6
> active+remapped+wait_backfill, 10
> active+remapped+wait_backfill+backfill_toofull, 4
> active+degraded+wait_backfill+backfill_toofull, 22
> active+remapped+backfilling, 42 active+remapped+backfill_toofull, 7
> active+degraded+backfilling, 17 active+degraded+backfill_toofull, 1
> active+recovery_wait+remapped, 4
> active+degraded+remapped+wait_backfill+backfill_toofull, 8
> active+degraded+remapped+backfill_toofull, 1 active+clean+scrubbing+deep;
> 31607 GB data, 12251 GB used, 4042 GB / 16293 GB avail; 109306/2091318
> degraded (5.227%);  recovering 3 o/s, 43344KB/s
>mdsmap e3363: 1/1/1 up {0=a=up:active}
>

Re: [ceph-users] Prometheus RADOSGW usage exporter

2018-03-21 Thread Berant Lemmenes
My apologies, I don't seem to be getting notifications on PRs. I'll review
this week.

Thanks,
Berant

On Mon, Mar 19, 2018 at 5:55 AM, Konstantin Shalygin  wrote:

> Hi Berant
>
>
>> I've created a Prometheus exporter that scrapes the RADOSGW Admin Ops API and
>> exports the usage information for all users and buckets. This is my first
>> Prometheus exporter, so if anyone has feedback I'd greatly appreciate it.
>> I've tested it against Hammer, and will shortly test against Jewel; though
>> looking at the docs it should work fine for Jewel as well.
>>
>> https://github.com/blemmenes/radosgw_usage_exporter
>>
>
>
> It would be nice if you could take a look at the PRs.
>
>
>
>
> k
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Prometheus RADOSGW usage exporter

2017-05-25 Thread Berant Lemmenes
Hello all,

I've created a Prometheus exporter that scrapes the RADOSGW Admin Ops API and
exports the usage information for all users and buckets. This is my first
Prometheus exporter, so if anyone has feedback I'd greatly appreciate it.
I've tested it against Hammer, and will shortly test against Jewel; though
looking at the docs it should work fine for Jewel as well.

https://github.com/blemmenes/radosgw_usage_exporter


Sample output:
radosgw_usage_successful_ops_total{bucket="shard0",category="create_bucket",owner="testuser"}
1.0
radosgw_usage_successful_ops_total{bucket="shard0",category="delete_obj",owner="testuser"}
1094978.0
radosgw_usage_successful_ops_total{bucket="shard0",category="list_bucket",owner="testuser"}
2276.0
radosgw_usage_successful_ops_total{bucket="shard0",category="put_obj",owner="testuser"}
1094978.0
radosgw_usage_successful_ops_total{bucket="shard0",category="stat_bucket",owner="testuser"}
20.0
radosgw_usage_received_bytes_total{bucket="shard0",category="create_bucket",owner="testuser"}
0.0
radosgw_usage_received_bytes_total{bucket="shard0",category="delete_obj",owner="testuser"}
0.0
radosgw_usage_received_bytes_total{bucket="shard0",category="list_bucket",owner="testuser"}
0.0
radosgw_usage_received_bytes_total{bucket="shard0",category="put_obj",owner="testuser"}
6352678.0
radosgw_usage_received_bytes_total{bucket="shard0",category="stat_bucket",owner="testuser"}
0.0
radosgw_usage_sent_bytes_total{bucket="shard0",category="create_bucket",owner="testuser"}
19.0
radosgw_usage_sent_bytes_total{bucket="shard0",category="delete_obj",owner="testuser"}
0.0
radosgw_usage_sent_bytes_total{bucket="shard0",category="list_bucket",owner="testuser"}
638339458.0
radosgw_usage_sent_bytes_total{bucket="shard0",category="put_obj",owner="testuser"}
79.0
radosgw_usage_sent_bytes_total{bucket="shard0",category="stat_bucket",owner="testuser"}
380.0
radosgw_usage_ops_total{bucket="shard0",category="create_bucket",owner="testuser"}
1.0
radosgw_usage_ops_total{bucket="shard0",category="delete_obj",owner="testuser"}
1094978.0
radosgw_usage_ops_total{bucket="shard0",category="list_bucket",owner="testuser"}
2276.0
radosgw_usage_ops_total{bucket="shard0",category="put_obj",owner="testuser"}
1094979.0
radosgw_usage_ops_total{bucket="shard0",category="stat_bucket",owner="testuser"}
20.0


Thanks,
Berant
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
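
For the exporter to have anything to report, the RGW usage log has to be
enabled and populated on the Ceph side; a minimal sketch of checking that,
independent of the exporter itself (put the option in your radosgw client
section and restart radosgw):

rgw enable usage log = true

radosgw-admin usage show --show-log-entries=false    # summary only; confirms usage data is accumulating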


Re: [ceph-users] Prometheus RADOSGW usage exporter

2017-05-30 Thread Berant Lemmenes
Ben,

Thanks for taking a look at it and trying it out! Hmm, it looks like at some
point where the bucket owner sits in the JSON changed. Later in the week I'll
take a look at adding something to take either location into account.

Thanks,
Berant

On Tue, May 30, 2017 at 3:54 AM, Ben Morrice  wrote:

> Hello Berant,
>
> This is very nice! I've had a play with this against our installation of
> Ceph, which is Kraken. We had to change the bucket_owner variable to be
> inside the for loop [1], and we are currently not getting any bytes
> sent/received statistics; though this is not an issue with your code, as
> these values are not updated via radosgw-admin either. I think I'm hitting
> this bug: http://tracker.ceph.com/issues/19194
>
> [1] for bucket in entry['buckets']:
> print bucket
> bucket_owner = bucket['owner']
>
> Kind regards,
>
> Ben Morrice
>
> __
> Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
> EPFL / BBP
> Biotech Campus
> Chemin des Mines 9
> 1202 Geneva
> Switzerland
>
> On 25/05/17 16:25, Berant Lemmenes wrote:
>
> Hello all,
>
> I've created a Prometheus exporter that scrapes the RADOSGW Admin Ops API and
> exports the usage information for all users and buckets. This is my first
> Prometheus exporter, so if anyone has feedback I'd greatly appreciate it.
> I've tested it against Hammer, and will shortly test against Jewel; though
> looking at the docs it should work fine for Jewel as well.
> https://github.com/blemmenes/radosgw_usage_exporter
>
>
> Sample output:
> radosgw_usage_successful_ops_total{bucket="shard0",category="create_bucket",owner="testuser"}
> 1.0
> radosgw_usage_successful_ops_total{bucket="shard0",category="delete_obj",owner="testuser"}
> 1094978.0
> radosgw_usage_successful_ops_total{bucket="shard0",category="list_bucket",owner="testuser"}
> 2276.0
> radosgw_usage_successful_ops_total{bucket="shard0",category="put_obj",owner="testuser"}
> 1094978.0
> radosgw_usage_successful_ops_total{bucket="shard0",category="stat_bucket",owner="testuser"}
> 20.0
> radosgw_usage_received_bytes_total{bucket="shard0",category="create_bucket",owner="testuser"}
> 0.0
> radosgw_usage_received_bytes_total{bucket="shard0",category="delete_obj",owner="testuser"}
> 0.0
> radosgw_usage_received_bytes_total{bucket="shard0",category="list_bucket",owner="testuser"}
> 0.0
> radosgw_usage_received_bytes_total{bucket="shard0",category="put_obj",owner="testuser"}
> 6352678.0
> radosgw_usage_received_bytes_total{bucket="shard0",category="stat_bucket",owner="testuser"}
> 0.0
> radosgw_usage_sent_bytes_total{bucket="shard0",category="create_bucket",owner="testuser"}
> 19.0
> radosgw_usage_sent_bytes_total{bucket="shard0",category="delete_obj",owner="testuser"}
> 0.0
> radosgw_usage_sent_bytes_total{bucket="shard0",category="list_bucket",owner="testuser"}
> 638339458.0
> radosgw_usage_sent_bytes_total{bucket="shard0",category="put_obj",owner="testuser"}
> 79.0
> radosgw_usage_sent_bytes_total{bucket="shard0",category="stat_bucket",owner="testuser"}
> 380.0
> radosgw_usage_ops_total{bucket="shard0",category="create_bucket",owner="testuser"}
> 1.0
> radosgw_usage_ops_total{bucket="shard0",category="delete_obj",owner="testuser"}
> 1094978.0
> radosgw_usage_ops_total{bucket="shard0",category="list_bucket",owner="testuser"}
> 2276.0
> radosgw_usage_ops_total{bucket="shard0",category="put_obj",owner="testuser"}
> 1094979.0
> radosgw_usage_ops_total{bucket="shard0",category="stat_bucket",owner="testuser"}
> 20.0
>
>
> Thanks,
> Berant
>
>
>
>
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com